The ELF format - how programs look from the inside
Introduction
ELF is the file format used for object files (.o's), binaries, shared libraries and core dumps in Linux.
It's actually pretty simple and well thought-out.
ELF has the same layout for all architectures, although endianness and word size can differ. Relocation types, symbol types and the like may have platform-specific values, and of course the contained code is architecture specific.
An ELF file provides two views of the data it contains: a linking view and an execution view. Each view is accessed through its own table: the section header table and the program header table.
Linking view: Section Header Table (SHT)
The SHT gives an overview of the sections contained in the ELF file. Of particular interest are REL sections (relocations), SYMTAB/DYNSYM sections (symbol tables), and VERSYM/VERDEF/VERNEED sections (symbol versioning information).
greek0@iphigenie:~$ readelf -S /bin/bash
There are 26 section headers, starting at offset 0xa4e10:
Section Headers:
[Nr] Name Type Addr Off Size ES Flg Lk Inf Al
[ 0] NULL 00000000 00000 000000 00 0 0 0
[ 1] .interp PROGBITS 08048134 00134 000013 00 A 0 0 1
[ 2] .note.ABI-tag NOTE 08048148 00148 000020 00 A 0 0 4
[ 3] .hash HASH 08048168 00168 002e48 04 A 4 0 4
[ 4] .dynsym DYNSYM 0804afb0 02fb0 007890 10 A 5 1 4
[ 5] .dynstr STRTAB 08052840 0a840 0074e2 00 A 0 0 1
[ 6] .gnu.version VERSYM 08059d22 11d22 000f12 02 A 4 0 2
[ 7] .gnu.version_r VERNEED 0805ac34 12c34 000090 00 A 5 2 4
[ 8] .rel.dyn REL 0805acc4 12cc4 000040 08 A 4 0 4
[ 9] .rel.plt REL 0805ad04 12d04 0005a8 08 A 4 11 4
[10] .init PROGBITS 0805b2ac 132ac 000017 00 AX 0 0 4
[11] .plt PROGBITS 0805b2c4 132c4 000b60 04 AX 0 0 4
[12] .text PROGBITS 0805be30 13e30 077154 00 AX 0 0 16
[13] .fini PROGBITS 080d2f84 8af84 00001a 00 AX 0 0 4
[14] .rodata PROGBITS 080d2fa0 8afa0 015198 00 A 0 0 32
[15] .eh_frame_hdr PROGBITS 080e8138 a0138 00002c 00 A 0 0 4
[16] .eh_frame PROGBITS 080e8164 a0164 00009c 00 A 0 0 4
[17] .ctors PROGBITS 080e9200 a0200 000008 00 WA 0 0 4
[18] .dtors PROGBITS 080e9208 a0208 000008 00 WA 0 0 4
[19] .jcr PROGBITS 080e9210 a0210 000004 00 WA 0 0 4
[20] .dynamic DYNAMIC 080e9214 a0214 0000d8 08 WA 5 0 4
[21] .got PROGBITS 080e92ec a02ec 000004 04 WA 0 0 4
[22] .got.plt PROGBITS 080e92f0 a02f0 0002e0 04 WA 0 0 4
[23] .data PROGBITS 080e95e0 a05e0 004764 00 WA 0 0 32
[24] .bss NOBITS 080edd60 a4d44 004bc8 00 WA 0 0 32
[25] .shstrtab STRTAB 00000000 a4d44 0000cc 00 0 0 1
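The Flg column in the listing above is a rendering of the sh_flags bit field of each section header. As a minimal sketch (the helper name flags_to_letters is made up for illustration; the three bit values are the ones defined by the ELF specification), the letters could be derived like this:

```python
# Bit values of the sh_flags field, as defined by the ELF specification.
SHF_WRITE = 0x1      # section is writable at run time     -> "W"
SHF_ALLOC = 0x2      # section occupies memory at run time -> "A"
SHF_EXECINSTR = 0x4  # section contains machine code       -> "X"

def flags_to_letters(sh_flags):
    """Hypothetical helper: render sh_flags like readelf's Flg column,
    restricted to the three flags that appear in the listing above."""
    letters = ""
    for bit, letter in ((SHF_WRITE, "W"), (SHF_ALLOC, "A"), (SHF_EXECINSTR, "X")):
        if sh_flags & bit:
            letters += letter
    return letters

print(flags_to_letters(SHF_ALLOC | SHF_EXECINSTR))  # flags of .text
print(flags_to_letters(SHF_WRITE | SHF_ALLOC))      # flags of .data
```

Sections without the A flag, such as .shstrtab, are not mapped into memory at all, which is why their Addr column is zero.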
Execution view: Program Header Table (PHT)
The PHT contains information for the kernel on how to start the program. The LOAD directives determine which parts of the ELF file get mapped into memory. The INTERP directive specifies an ELF interpreter, which is normally /lib/ld-linux.so.2 on Linux systems. The DYNAMIC entry points to the .dynamic section, which contains information used by the ELF interpreter to set up the binary.
greek0@iphigenie:~$ readelf -l /bin/bash
Elf file type is EXEC (Executable file)
Entry point 0x805be30
There are 8 program headers, starting at offset 52
Program Headers:
Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align
PHDR 0x000034 0x08048034 0x08048034 0x00100 0x00100 R E 0x4
INTERP 0x000134 0x08048134 0x08048134 0x00013 0x00013 R 0x1
[Requesting program interpreter: /lib/ld-linux.so.2]
LOAD 0x000000 0x08048000 0x08048000 0xa0200 0xa0200 R E 0x1000
LOAD 0x0a0200 0x080e9200 0x080e9200 0x04b44 0x09728 RW 0x1000
DYNAMIC 0x0a0214 0x080e9214 0x080e9214 0x000d8 0x000d8 RW 0x4
NOTE 0x000148 0x08048148 0x08048148 0x00020 0x00020 R 0x4
GNU_EH_FRAME 0x0a0138 0x080e8138 0x080e8138 0x0002c 0x0002c R 0x4
GNU_STACK 0x000000 0x00000000 0x00000000 0x00000 0x00000 RW 0x4
Section to Segment mapping:
Segment Sections...
00
01 .interp
02 .interp .note.ABI-tag .dynsym .dynstr .gnu.version .gnu.version_r .rel.dyn .rel.plt ...
03 .ctors .dtors .jcr .dynamic .got .got.plt .data .bss
04 .dynamic
05 .note.ABI-tag
06 .eh_frame_hdr
07
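To make the LOAD lines above more concrete, here is a small sketch of the arithmetic they imply. The two segments are copied from the listing; the names Load, vaddr_to_file_offset and zero_filled_bytes are invented for this example and are not part of any real tool.

```python
from collections import namedtuple

# One LOAD entry, using the column names of the readelf output above.
Load = namedtuple("Load", "offset vaddr filesz memsz")

# The two LOAD lines of /bin/bash as shown in the listing.
LOADS = [
    Load(0x000000, 0x08048000, 0xa0200, 0xa0200),
    Load(0x0a0200, 0x080e9200, 0x04b44, 0x09728),
]

def vaddr_to_file_offset(vaddr, loads=LOADS):
    """Return the file offset backing vaddr, or None if the address is not
    backed by file contents (unmapped, or in the zero-filled tail)."""
    for seg in loads:
        if seg.vaddr <= vaddr < seg.vaddr + seg.filesz:
            return seg.offset + (vaddr - seg.vaddr)
    return None

def zero_filled_bytes(seg):
    """Bytes of memory the segment needs beyond what the file provides."""
    return seg.memsz - seg.filesz

print(hex(vaddr_to_file_offset(0x0805be30)))  # entry point -> 0x13e30
print(hex(zero_filled_bytes(LOADS[1])))       # -> 0x4be4
```

The first result matches the Off column of .text in the section listing. The second is the size of .bss (0x4bc8) plus the alignment padding in front of it: MemSiz is larger than FileSiz because .bss takes up no space in the file.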
Putting it all together: the ELF header
Neither the SHT nor the PHT has a fixed position; they can be located anywhere in an ELF file. To find them the ELF header is used, which is located at the very start of the file.
The first bytes contain the ELF magic "\x7fELF", followed by the class ID (32 or 64 bit ELF file), the data format ID (little endian/big endian), the machine type, etc.
At the end of the ELF header are then pointers to the SHT and PHT.
greek0@iphigenie:~$ readelf -h /bin/bash
ELF Header:
Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
Class: ELF32
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: EXEC (Executable file)
Machine: Intel 80386
Version: 0x1
Entry point address: 0x805be30
Start of program headers: 52 (bytes into file)
Start of section headers: 675344 (bytes into file)
Flags: 0x0
Size of this header: 52
Size of program headers: 32
Number of program headers: 8
Size of section headers: 40
Number of section headers: 26
Section header string table index: 25
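The header fields printed above can be decoded with a few lines of code. The following is a sketch for 32-bit ELF files only (ELF64 uses wider fields); the function name and the dictionary keys are my own choice, and the sample bytes are rebuilt from the readelf output rather than read from a real file.

```python
import struct

def parse_elf32_header(data):
    """Decode the 52-byte ELF32 header. ELF64 is left out of this sketch."""
    if data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    ei_class, ei_data = data[4], data[5]
    if ei_class != 1:
        raise ValueError("only ELF32 (class 1) is handled here")
    endian = "<" if ei_data == 1 else ">"  # 1 = little endian, 2 = big endian
    fields = struct.unpack(endian + "16sHHIIIIIHHHHHH", data[:52])
    names = ("ident", "type", "machine", "version", "entry", "phoff", "shoff",
             "flags", "ehsize", "phentsize", "phnum", "shentsize", "shnum",
             "shstrndx")
    return dict(zip(names, fields))

# Synthetic header assembled from the values shown by readelf -h above.
ident = bytes([0x7f, 0x45, 0x4c, 0x46, 1, 1, 1]) + bytes(9)
sample = ident + struct.pack("<HHIIIIIHHHHHH",
                             2, 3, 1,        # type EXEC, machine 80386, version
                             0x805be30,      # entry point
                             52, 675344,     # offsets of PHT and SHT
                             0, 52, 32, 8, 40, 26, 25)

hdr = parse_elf32_header(sample)
print(hdr["phoff"], hdr["shoff"], hex(hdr["entry"]))
```

With phoff, phentsize and phnum the PHT can be located and walked entry by entry; shoff, shentsize and shnum do the same for the SHT.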
The Relocation Table
The relocation table specifies where relocations are needed in order for the program to run. In programs these are normally symbol relocations, i.e. the dynamic linker has to resolve the needed symbol by its name, and then write the symbol address to the place specified in the relocation entry.
Relocation types are architecture specific and there are usually quite a lot of them. On i386 the most important ones are R_386_GLOB_DAT, meaning "just write the address of the symbol to that address", R_386_COPY, which copies the symbol's data from the shared library into the binary's own data area, and R_386_JUMP_SLOT, which is used for the normal PLT/GOT function call relocation mechanism.
The resolution of the symbol value itself is done by the dynamic linker (contained within /lib/ld-linux.so.2, the ELF interpreter commonly used), and is a pretty complex process. Basically the linker searches all loaded ELF objects (the binary itself and the loaded libraries) and uses the first definition of the symbol it finds.
greek0@iphigenie:~$ readelf -r /bin/bash
Relocation section '.rel.dyn' at offset 0x12cc4 contains 8 entries:
Offset Info Type Sym.Value Sym. Name
080e92ec 00078006 R_386_GLOB_DAT 00000000 __gmon_start__
080edd68 00035205 R_386_COPY 080edd68 stdout
080edd6c 00035d05 R_386_COPY 080edd6c stderr
080edd70 00046405 R_386_COPY 080edd70 PC
080edd74 00067405 R_386_COPY 080edd74 stdin
080edd78 0006e305 R_386_COPY 080edd78 UP
Relocation section '.rel.plt' at offset 0x12d04 contains 181 entries:
Offset Info Type Sym.Value Sym. Name
080e9368 00012c07 R_386_JUMP_SLOT 00000000 fileno
080e936c 00013807 R_386_JUMP_SLOT 00000000 strcmp
080e9370 00014107 R_386_JUMP_SLOT 0805b4a4 close
080e9374 00015307 R_386_JUMP_SLOT 00000000 dlsym
080e937c 00016a07 R_386_JUMP_SLOT 00000000 fprintf
080e9388 00018307 R_386_JUMP_SLOT 00000000 fflush
080e9390 00019c07 R_386_JUMP_SLOT 0805b524 unlink
080e930c 00003307 R_386_JUMP_SLOT 00000000 regexec
080e9328 00007a07 R_386_JUMP_SLOT 00000000 ferror
080e9330 00008307 R_386_JUMP_SLOT 00000000 readdir64
080e9334 00008f07 R_386_JUMP_SLOT 00000000 strchr
080e9338 0000a507 R_386_JUMP_SLOT 00000000 fdopen
080e9344 0000da07 R_386_JUMP_SLOT 00000000 getpid
080e9360 00012207 R_386_JUMP_SLOT 00000000 write
080e95cc 00078707 R_386_JUMP_SLOT 00000000 strcpy
...
...
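The Info column in the output above packs two values into one 32-bit word. A short sketch of how it splits on ELF32 (the function name is made up; the type numbers are those of the i386 ABI, spelled the way readelf prints them):

```python
# Relocation type numbers for i386, using readelf's spelling.
R_386_TYPES = {5: "R_386_COPY", 6: "R_386_GLOB_DAT", 7: "R_386_JUMP_SLOT"}

def split_r_info(r_info):
    """ELF32: the upper 24 bits are the index into the dynamic symbol table,
    the low 8 bits are the relocation type."""
    return r_info >> 8, r_info & 0xff

# Three Info values taken from the listing above.
for info in (0x00078006, 0x00035205, 0x00012c07):
    sym_index, rel_type = split_r_info(info)
    print(sym_index, R_386_TYPES.get(rel_type, rel_type))
```

The symbol index is what lets readelf print the Sym. Name column: it looks up that entry in .dynsym and reads the name from .dynstr.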
Exported symbols
When searching for a symbol the dynamic linker looks through the dynamic symbol table .dynsym, so all symbols present there are usable by other programs (in other words: exported and, in case of a library, part of the ABI).
Actually the process is more complicated (involving the hashes in the .hash section), but the end result is the same as just described.
greek0@iphigenie:~$ readelf -D -s /lib/libc.so.6
Symbol table for image:
Num Buc: Value Size Type Bind Vis Ndx Name
260 0: 0011a580 4 OBJECT GLOBAL DEFAULT 29 _nl_domain_bindings
1693 1: 000b0f60 1303 FUNC GLOBAL DEFAULT 11 fts_read
601 2: 00027df0 13 FUNC WEAK DEFAULT 11 scalbln
208 3: 000698f0 141 FUNC GLOBAL DEFAULT 11 memmove
1798 4: 000b8ae0 117 FUNC GLOBAL DEFAULT 11 lsearch
348 4: 000dfd20 189 FUNC GLOBAL DEFAULT 11 xdr_u_hyper
1675 9: 0005ad10 231 FUNC GLOBAL DEFAULT 11 fputc
381 9: 000b92f0 389 FUNC WEAK DEFAULT 11 error_at_line
166 9: 000864d0 36 FUNC GLOBAL DEFAULT 11 versionsort64
119 9: 000f2950 36 FUNC GLOBAL DEFAULT 11 versionsort64
2113 16: 000ac770 58 FUNC WEAK DEFAULT 11 mkdir
516 16: 000de9c0 677 FUNC GLOBAL DEFAULT 11 svctcp_create
979 17: 000b7040 60 FUNC GLOBAL DEFAULT 11 madvise
1815 18: 000c61f0 42 FUNC GLOBAL DEFAULT 11 pthread_mutex_lock
2018 25: 00054ac0 326 FUNC WEAK DEFAULT 11 fputs
432 30: 000ebc40 33 FUNC GLOBAL DEFAULT 11 getutxid
1879 31: 000292b0 64 FUNC GLOBAL DEFAULT 11 sigdelset
1902 33: 000ba530 107 FUNC GLOBAL DEFAULT 11 gnu_dev_makedev
1385 34: 000f3bc0 153 FUNC GLOBAL DEFAULT 11 getrlimit64
895 34: 000b2ad0 153 FUNC GLOBAL DEFAULT 11 getrlimit64
1290 37: 0009f400 319 FUNC WEAK DEFAULT 11 re_comp
82 40: 000dbd70 1653 FUNC GLOBAL DEFAULT 11 clnt_broadcast
1892 41: 0008a6c0 53 FUNC WEAK DEFAULT 11 getresgid
...
...
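The Buc column above is the hash bucket a symbol name falls into. As a sketch, this is the hash function the System V ABI defines for the .hash section, ported to Python; bucket_for is a hypothetical helper, and the bucket count has to come from the .hash section of the library in question.

```python
def elf_hash(name):
    """Hash function defined by the System V ABI for the .hash section."""
    h = 0
    for c in name.encode("ascii"):
        h = ((h << 4) + c) & 0xffffffff  # emulate 32-bit unsigned arithmetic
        g = h & 0xf0000000
        if g:
            h ^= g >> 24
        h &= ~g & 0xffffffff             # clear the top four bits again
    return h

def bucket_for(name, nbucket):
    """Hypothetical helper: bucket index for a name, given the number of
    buckets stored at the start of the .hash section."""
    return elf_hash(name) % nbucket

print(hex(elf_hash("printf")))
```

Symbols that land in the same bucket are chained together, so a lookup only has to compare the names in one chain instead of scanning the whole of .dynsym.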
A more detailed look at versionsort64
The observant reader may have noticed that e.g. versionsort64 is present twice in the dynamic symbol table shown above, and the two symbols have different values.
The reason is pretty simple: libc.so.6 uses symbol versioning, and there are two versions of versionsort64 available. The binutils readelf unfortunately doesn't show the symbol versions; eu-readelf from the elfutils package however does.
greek0@iphigenie:~$ readelf -D -s /lib/libc.so.6 | grep versionsort64
166 9: 000864d0 36 FUNC GLOBAL DEFAULT 11 versionsort64
119 9: 000f2950 36 FUNC GLOBAL DEFAULT 11 versionsort64
greek0@iphigenie:~$ eu-readelf -s /lib/libc.so.6 | grep versionsort64
119: 000f2950 36 FUNC GLOBAL DEFAULT 11 versionsort64@GLIBC_2.1
166: 000864d0 36 FUNC GLOBAL DEFAULT 11 versionsort64@@GLIBC_2.2
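The name@VERSION notation in the eu-readelf output is easy to take apart. In this sketch (the function name is invented) a double @@ marks the default version, i.e. the one newly linked programs bind to, while a single @ marks an older version kept for compatibility:

```python
def split_versioned_name(text):
    """Split 'name@VERSION' or 'name@@VERSION' into
    (name, version, is_default). Unversioned names yield version None."""
    name, sep, rest = text.partition("@")
    if not sep:
        return name, None, False
    is_default = rest.startswith("@")
    return name, rest.lstrip("@"), is_default

print(split_versioned_name("versionsort64@GLIBC_2.1"))
print(split_versioned_name("versionsort64@@GLIBC_2.2"))
```

Programs linked against the old interface keep getting the GLIBC_2.1 implementation, which is how both definitions can coexist in one library.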
Program loading in the kernel
ELF files themselves aren't terribly interesting. How ELF files are loaded into memory, and what has to happen before the program can execute its own code, can be.
The execution of a program starts inside the kernel, in the exec system call. There the file type is looked up and the appropriate handler is called. The binfmt-elf handler then loads the ELF header and the program header table (PHT), followed by lots of sanity checks.
The kernel then loads the parts specified in the LOAD directives in the PHT into memory. If an INTERP entry is present, the interpreter is loaded too.
Statically linked binaries can do without an interpreter; dynamically linked programs always need /lib/ld-linux.so as their interpreter because it includes some startup code, loads shared libraries needed by the binary, and performs relocations.
Finally control can be transferred to the program: to the interpreter if one is present, otherwise to the binary itself.
greek0@iphigenie:~$ readelf -l /bin/bash
Program Headers:
Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align
PHDR 0x000034 0x08048034 0x08048034 0x00100 0x00100 R E 0x4
INTERP 0x000134 0x08048134 0x08048134 0x00013 0x00013 R 0x1
[Requesting program interpreter: /lib/ld-linux.so.2]
LOAD 0x000000 0x08048000 0x08048000 0xa0200 0xa0200 R E 0x1000
LOAD 0x0a0200 0x080e9200 0x080e9200 0x04b44 0x09728 RW 0x1000
DYNAMIC 0x0a0214 0x080e9214 0x080e9214 0x000d8 0x000d8 RW 0x4
GNU_STACK 0x000000 0x00000000 0x00000000 0x00000 0x00000 RW 0x4
...
Dynamic linking and the ELF interpreter
In case of a statically linked binary that's pretty much it; with dynamically linked binaries, however, a lot more magic has to go on.
First the dynamic linker (contained within the interpreter) looks at the .dynamic section, whose address is stored in the PHT.
There it finds the NEEDED entries determining which libraries have to be loaded before the program can be run, the *REL* entries giving the address of the relocation tables, the VER* entries which contain symbol versioning information, etc.
So the dynamic linker loads the needed libraries and performs relocations (either directly at program startup or later, as soon as the relocated symbol is needed, depending on the relocation type).
Finally control is transferred to the address given by the symbol _start in the binary. Normally some gcc/glibc startup code lives there, which in the end calls main().
greek0@iphigenie:~$ readelf -d /bin/bash
Dynamic section at offset 0xa0214 contains 22 entries:
Tag Type Name/Value
0x00000001 (NEEDED) Shared library: [libncurses.so.5]
0x00000001 (NEEDED) Shared library: [libdl.so.2]
0x00000001 (NEEDED) Shared library: [libc.so.6]
0x0000000a (STRSZ) 29922 (bytes)
0x0000000b (SYMENT) 16 (bytes)
0x00000003 (PLTGOT) 0x80e92f0
0x00000002 (PLTRELSZ) 1448 (bytes)
0x00000014 (PLTREL) REL
0x00000017 (JMPREL) 0x805ad04
0x00000011 (REL) 0x805acc4
0x00000012 (RELSZ) 64 (bytes)
0x6ffffffe (VERNEED) 0x805ac34
0x6fffffff (VERNEEDNUM) 2
0x6ffffff0 (VERSYM) 0x8059d22
0x00000000 (NULL) 0x0
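The .dynamic section itself is just an array of (tag, value) pairs terminated by a NULL entry. Below is a sketch of a parser for the 32-bit little-endian layout. The function name is my own, the tag numbers are the ones visible in the first column of the output above, and the input bytes are synthetic (a tiny made-up string table instead of bash's real .dynstr).

```python
import struct

DT_NULL, DT_NEEDED = 0, 1
TAG_NAMES = {1: "NEEDED", 2: "PLTRELSZ", 3: "PLTGOT", 10: "STRSZ",
             11: "SYMENT", 17: "REL", 18: "RELSZ", 20: "PLTREL", 23: "JMPREL"}

def parse_dynamic(section, dynstr):
    """Walk 8-byte (tag, value) entries until the DT_NULL terminator.
    For NEEDED entries the value is an offset into the string table."""
    entries = []
    for tag, value in struct.iter_unpack("<II", section):
        if tag == DT_NULL:
            break
        if tag == DT_NEEDED:
            end = dynstr.index(b"\0", value)
            value = dynstr[value:end].decode("ascii")
        entries.append((TAG_NAMES.get(tag, hex(tag)), value))
    return entries

# Synthetic input: two NEEDED entries, one REL entry, then the terminator.
dynstr = b"\0libdl.so.2\0libc.so.6\0"
section = struct.pack("<IIIIIIII", 1, 1, 1, 12, 17, 0x805acc4, 0, 0)

for name, value in parse_dynamic(section, dynstr):
    print(name, value)
```

The dynamic linker does essentially the same walk at startup, then acts on what it finds: it loads every NEEDED library and processes the relocation tables that REL and JMPREL point to.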
Symbol lookup by the dynamic linker
As mentioned before, symbol lookup is a complicated process, so I'll only give a simplified description.
For every loaded object RTLD (the runtime dynamic linker) keeps a list of loaded objects called the "lookup scope". Every scope contains pointers to all the loaded objects (the binary and all loaded libraries), but the order of objects can differ between different scopes. What is constant is that the binary is the first object in every scope.
When the RTLD has to resolve a symbol, it first checks for which object it needs to perform the relocation: was the lookup caused in the binary itself or in one of the loaded libraries? Then it gets the lookup scope for that object, and iterates through every object in it.
For each object it looks for the needed symbol in the dynamic symbol table. In case of a match it just uses that symbol value for the relocation, otherwise it continues its search with the next object in the scope.
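The simplified lookup described above fits in a few lines. This is a toy model, not what the real dynamic linker does internally: object names, symbols and addresses are invented, each dynamic symbol table is a plain dictionary, and hashing, versioning and weak symbols are ignored.

```python
def resolve(symbol, scope):
    """Return (object name, value) of the first definition found when
    walking the lookup scope in order."""
    for obj_name, dynsym in scope:
        if symbol in dynsym:
            return obj_name, dynsym[symbol]
    raise LookupError("undefined symbol: " + symbol)

# Hypothetical loaded objects: (name, dynamic symbol table).
binary = ("app", {"main": 0x1000, "print_error": 0x1100})
plugin = ("plugin.so", {"print_error": 0x2000, "plugin_init": 0x2100})
libc = ("libc.so.6", {"strcmp": 0x3000})

# The binary always comes first in the scope.
scope = [binary, plugin, libc]

print(resolve("strcmp", scope))
print(resolve("print_error", scope))
```

Note that print_error resolves to the definition in the binary even though plugin.so defines it too, simply because the binary is searched first.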
Consequences of the symbol lookup rules
Libs can't just directly jump to functions they export (they of course know where their own functions are), but have to go through the described symbol lookup mechanism too.
This along with the fact that the binary is always first in every lookup scope means that symbols defined in the binary override symbols defined in libraries.
It's this way on purpose, to allow the binary to override library functions it doesn't like. If this happens nothing will use the library's function, not even calls by the library itself.
Normally that's a good thing, but it can lead to problems if the binary unintentionally defines a symbol that's also used by some loaded library (think program uses GTK, which pulls in different theme and input plugins depending on the user's system config).
E.g. if the program defines a function void print_error(int error_code, char* str); and some plugin defines a function with the same name, but another signature, like int print_error(char* str), that may be problematic. If the plugin doesn't export the print_error symbol, there's no problem at all, because the code in the plugin can just call the proper function directly, without the need for a symbol lookup.
However if the plugin does export the symbol it has to look up the symbol itself (because that's required by the SystemV ABI spec and, consequently, by the LSB). Then print_error will be interposed by the symbol in the binary, which has an incompatible signature, which will probably lead to a crash.
Solutions for this
The right solution is to just not export symbols that you don't explicitly want others to use. That's The Right Way™ for many reasons: it speeds up the library/plugin (no symbol lookup has to be performed for internal uses of that function), you don't pollute the namespace, you avoid the possible problems outlined above, and finally, the symbol is not part of your ABI, which means you can change it any way you like without breaking any dependent applications.
There's a quick hack that can be used to avoid the above problem. When -Bsymbolic is specified on the linker command line when linking a library, the lookup scope of that library is changed so the library itself is in the first spot, followed by the binary.
This doesn't give you any of the other advantages though, and the program can't interpose your symbol, even if it intentionally wants to do so. Consequently -Bsymbolic should be avoided, unless you're really, really lazy ;-).
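To illustrate the effect of -Bsymbolic as described here, the following toy model builds the lookup scope for a given requesting object. Everything in it is an assumption made for the sake of the example: the object names and addresses are invented, and the real linker and dynamic linker implement this quite differently from a list reshuffle.

```python
def lookup_scope(requester, binary, libraries, symbolic=False):
    """Toy model of the order in which objects are searched for a lookup
    made by requester. symbolic=True mimics a library that was linked
    with -Bsymbolic: it is moved to the front of its own scope."""
    scope = [binary] + list(libraries)
    if symbolic and requester is not binary:
        scope.remove(requester)
        scope.insert(0, requester)
    return scope

def resolve(symbol, scope):
    """First definition in scope order wins."""
    for obj_name, dynsym in scope:
        if symbol in dynsym:
            return obj_name
    raise LookupError("undefined symbol: " + symbol)

# Hypothetical objects: (name, dynamic symbol table).
binary = ("app", {"print_error": 0x1100})
plugin = ("plugin.so", {"print_error": 0x2000})

normal = lookup_scope(plugin, binary, [plugin])
symbolic = lookup_scope(plugin, binary, [plugin], symbolic=True)

print(resolve("print_error", normal))    # the binary interposes the plugin
print(resolve("print_error", symbolic))  # the plugin finds its own function
```

In the second case the plugin is safe from the clash, but the binary has also lost the ability to override print_error on purpose.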
Further reading
- The ELF Specification
- Definitive resource on libraries: Ulrich Drepper's DSO Howto
- Josselin Mouette's "Packaging shared libraries" talk at Debconf6
- The Debian Library Packaging Guide by Junichi Uekawa