gccnotes

/*personal notes of renzo diomedi*/ 0 0000 1 0001 2 0010 3 0011 4 0100 5 0101 6 0110 7 0111 8 1000 9 1001 a 1010 b 1011 c 1100 d 1101 e 1110 f 1111 A Transistor is a switch that can be ON or OFF. an open transistor, therefore without contact between the conductors, is not crossed by electricity, provides the binary number = 0 while a closed transistor, then with contact between conductors, is traversed by current, provides the binary number = 1 The Intel pentium4 microchip has over 43,000,000 transistors, AMD athlon has at least 37,000,000. The Oscillator, ie the Clock, adjusts the working speed of the computer, more beats = greater speed, measured in megahertz, i.e millions of beats per second. the current passing through a transistor can be used to control another transistor. It turns the switch on ON or OFF to change the status of the second transistor. This configuration is called PORT. the logic port NOT is composed of a single transistor that takes an Input from the Clock and an Input from another transistor. this Port produces only one output, which is always the opposite of the input coming from the transistor different combinations of NOT ports create other logical ports OR AND XOR using different combinations of logical ports , the microchip executes the Addition operation from which all other mathematical operations descend. the addition is executed through structures called Half-Adder and Full-Adder a half-adder is made by a port XOR and a port AND which receives both the same Bit in input eg: 2d + 3d = 10b + 11b half-adder processes the digits at right using the portd XOR and AND the resutl of XOR is the digit at right of the final result the result of AND is the input of ports XOR and AND of the full-adder also, the full-adder processes the digits at left of thr bits 10 and 10 the results are the inputs of other ports AND and XOR the results are processed with the results of the half-adder one of these results is the input of OR all the results gives the binary number 101 that is 5 on decimal numbers 8f = 8*16^1 + f*16^0 = 143 143 = 143/16 = 8,9375 ; 0,9375*16 = 15=f ; 143d = 8f 2569 = 2569/16 = 160,56 = 160+(9) ; 160/16 = 10+(0) ; = A09h binary number , TWO's Complement most dignificant digit = 0 = positive most significant digit = 1 = negative negative bunary number = inverting bits of positive number + add 1 384 0000000110000000 -384 1111111001111111 +1 = 1111111010000000 absolute value = inverted bits of negative number + add 1 Endian refers to the position where the data begins to be processed (written, read on, transmitted/received etc) hence, the beginning of the memory address. the definition: Endian, creates confusion! Big Endian - Little Endian The difference between the two systems is given by the order in which the data bytes (note: NOT BITS!!!) are stored or transmitted in memory address: big-endian: storage / transmission starting from the most significant byte (largest end) to ending with the least significant; little endian: storage / transmission starting from the least significant byte (smallest end) to ending with the most significant, is used on CISC machine such as Intel and AMD processors; The big-endian order has been chosen as the standard order in many protocols used on the Internet, it is therefore also called the network byte order. It is used by ARM (advances RISC machine) processors and others Embedded (Special Purpose) microprocessors IA-32 platform The processor contains the hardware and instruction codes that control the operation of the computer. It is connected to the other elements of the computer (the memory storage unit, input devices, and output devices) using three separate buses: a control bus, an address bus, a data bus. The control bus is used to synchronize the functions between the processor and the individual system elements. The data bus is used to move data between the processor and the external system elements. An example of this would be reading data from a memory location. The processor places the memory address to read on the address bus, and the memory storage unit responds by placing the value stored in that memory location on the data bus for the processor to access. The processor itself consists of many components. Each component has a separate function in the processor’s ability to process data. Assembly language programs have the ability to access and control each of these elements, CU At the heart of the processor is the control unit. The main purpose of the control unit is to control what is happening at any time within the processor. While the processor is running, instructions must be retrieved from memory and loaded for the processor to handle. The job of the control unit is to perform four basic functions: 1. Retrieve instructions from memory. 2. Decode instructions for operation. 3. Retrieve data from memory as needed. 4. Store the results as necessary. The instruction counter retrieves the next instruction code from memory and prepares it to be processed. The instruction decoder is used to decode the retrieved instruction code into a micro-operation. The MICRO-operation is the code that controls the specific signals within the processor chip to perform the function of the instruction code. When the prepared micro-operation is ready, the control unit passes it along to the execution unit for processing, and retrieves any results to store in an appropriate location. The NetBurst technology incorporates four separate techniques to help speed up processing in the control unit. Knowing how these techniques operate can help you optimize your assembly language programs. The NetBurst features are as follows: ❑ Instruction prefetch and decoding ❑ Branch prediction ❑ Out-of-order execution ❑ Retirement Instruction prefetch and decoding pipeline Older processors fetched instructions and data directly from system memory as they were needed by the execution unit. Because it takes considerably longer to retrieve data from memory than to process it, a backlog occurs, whereby the processor is continually waiting for instructions and data to be retrieved from memory. To solve this problem, the concept of prefetching was created. Although the name sounds odd, prefetching involves attempting to retrieve (fetch) instructions and/or data before they are actually needed by the execution unit. To incorporate prefetching, a special storage area is needed on the processor chip itself—one that can be easily accessed by the processor, quicker than normal memory access. This was solved using pipelining. double pipeline Pipelining involves creating a memory cache on the processor chip from which both instructions and data elements can be retrieved and stored ahead of the time that they are required for processing. When the execution unit is ready for the next instruction, that instruction is already available in the cache and can be quickly processed. The IA-32 platform implements pipelining by utilizing two (or more) layers of cache. The first cache layer (called L1) attempts to prefetch both instruction code and data from memory as it thinks it will be needed by the processor. As the instruction pointer moves along in memory, the prefetch algorithm determines which instruction codes should be read and placed in the cache. In a similar manner, if data is being processed from memory, the prefetch algorithm attempts to determine what data elements may be accessed next and also reads them from memory and places them in cache. there is no guarantee that the program will execute instructions in a sequential order. If the program takes a logic branch that moves the instruction pointer to a completely different location in memory, the entire cache is useless and must be cleared and repopulated with instructions from the new location. To help alleviate this problem, a second cache layer was created. The second cache layer (called L2) can also hold instruction code and data elements, separate from the first cache layer. When the program logic jumps to a completely different area in memory to execute instructions, the second layer cache can still hold instructions from the previous instruction location. If the program logic jumps back to the area, those instructions are still being cached and can be processed almost as quickly as instructions stored in the first layer cache. Assembly language programs cannot access the instruction and data caches. By minimizing branches in programs, you can help speed up the execution of the instruction codes in your program. Branch prediction unit While implementing multiple layers of cache is one way to help speed up processing of program logic, it still does not solve the problem of “jumpy” programs. If a program takes many different logic branches, it may well be impossible for the different layers of cache to keep up, resulting in more last-minute memory access for both instruction code and data elements. To help solve this problem, the IA-32 platform processors also incorporate branch prediction. Branch prediction uses specialized algorithms to attempt to predict which instruction codes will be needed next within a program branch. Special statistical algorithms and analysis are incorporated to determine the most likely path traveled through the instruction code. Instruction codes along that path are prefetched and loaded into the cache layers. The Pentium 4 processor utilizes three techniques to implement branch prediction: ❑ Deep branch prediction ❑ Dynamic data flow analysis ❑ Speculative execution Deep branch prediction enables the processor to attempt to decode instructions beyond multiple branches in the program. Again, statistical algorithms are implemented to predict the most likely path the program will take throughout the branches. While this technique is helpful, it is not totally foolproof. Dynamic data flow analysis performs statistical real-time analysis of the data flow throughout the processor. Instructions that are predicted to be necessary for the flow of the program but not reached yet by the instruction pointer are passed to the out-of-order execution core . In addition, any instructions that can be executed while the processor is waiting for data related to another instruction are processed. Speculative execution enables the processor to determine what distant instruction codes not immediately in the instruction code branch are likely to be required, and attempt to process those instructions, again using the out-of-order execution engine. Out-of-order execution engine The out-of-order execution engine is one of the greatest improvements to the Pentium 4 processor in terms of speed. This is where instructions are prepared for processing by the execution unit. It contains several buffers to change the order of instructions within the pipeline to increase the performance of the control unit. Instructions retrieved from the prefetch and decoding pipeline are analyzed and reordered, enabling them to be executed as quickly as possible. By analyzing a large number of instructions, the out-of-order execution engine can find independent instructions that can be executed (and their results saved) until required by the rest of the program. The Pentium 4 processor can have up to 126 instructions in the outof- order execution engine at any one time. There are three sections within the out-of-order execution engine: ❑ The allocator ❑ Register renaming ❑ The micro-operation scheduler The Allocator is the traffic cop for the out-of-order execution engine. Its job is to ensure that buffer space is allocated properly for each instruction that the out-of-order execution engine is processing. If a needed resource is not available, the allocator will stall the processing of the instruction and allocate resources for another instruction that can complete its processing. The register renaming section allocates logical registers to process instructions that require register access. Instead of the eight general-purpose registers available on the IA-32 processor, the register renaming section contains 128 logical registers. It maps register requests made by instructions into one of the logical registers, to allow simultaneous access to the same register by multiple instructions. The register mapping is done using the Register Allocation Table (RAT). This helps speed up processing instructions that require access to the same register sets. The micro-operation scheduler determines when a micro-operation is ready for processing by examining the input elements that it requires. Its job is to send micro-operations that are ready to be processed to the retirement unit, while still maintaining program dependencies. The micro-operation scheduler uses two queues to place micro-operations, in one for micro-operations that require memory access and one for micro-operations that do not. The queues are tied to dispatch ports. Different types of Pentium processors may contain a different number of dispatch ports. The dispatch ports send the micro-operations to the retirement unit. Retirement unit The retirement unit receives all of the micro-operations from the pipeline decoders and the out-of-order execution engine and attempts to reassemble the micro-operations into the proper order for the program to properly execute. The retirement unit passes micro-operations to the execution unit for processing in the order that the out-of-order execution engine sends them, but then monitors the results, reassembling the results into the proper order for the program to execute. This is accomplished using a large buffer area to hold micro-operation results and place them in the proper order as they are required. When a micro-operation is completed and the results placed in the proper order, the micro-operation is considered retired and is removed from the retirement unit. The retirement unit also updates information in the branch prediction unit to ensure that it knows which branches have been taken, and which instruction codes have been processed. Execution unit ###################################################################### Register (fast memories that provide quick access to the values used by executing programs) General purpose .............Eight 32-bit registers used for storing working data Segment .....................Six 16-bit registers used for handling memory access Instruction pointer .........A single 32-bit register pointing to the next instruction code to execute .............................every microdevice at least must have a Program Counter (physical address) .............................Intel and AMD use CS:IP = code segment*16, in hex add 0 at right , in binary add 0000 at right .............................+ Instruction Pointer (EA effective address) Floating-point data .........Eight 80-bit registers used for floating-point arithmetic data Control .....................Five 32-bit registers used to determine the operating mode of the processor Debug .......................Eight 32-bit registers used to contain information when debugging the processor e r AX Accumulator for operands and results data. it has a typical accumulator function. It can be used in all I / O instructions, in string instructions and arithmetic operations. A small number of instructions requires AX. e r BX Pointer to data in the data memory segment, base register for address calculation, also used as internal counter, auxiliary of e r CX e r CX Counter for string and loop operations. it is also used as a counter in some instructions; for this reason it is indicated as count register. e r DX I/O pointer. it is designated as a data register. It is required by some input\output operations, as required by multiplication and division operations that, involving great values, presuppose the pair DX, AX. The index registers and pointers, usually contain the displacement inside the segments. e-r-SI Source Index. Data pointer for source of string operations. generic use index register. some string instructions require that the source string must be found through SI e-r-DI Destination Index. Data pointer for destination of string operations, some instructions that handle strings, the destination must be necessarily identified by DI. e-r-SP Stack pointer. it is the pointer to the top of the stack. e-r-BP Base Pointer, Stack data pointer, it is used as a pointer inside the stack, but it can also be used as a generic index register 31...............15.......7......0 ....................High....Low... Segment registers The most characteristic aspect of the CPU 8086 is the segmentation of memory, Segment registers are used precisely for keep track of the memory location of the segments in use. The usual use is: CS identifies the segment of code (code segment) , DS the data segment, SS the stack segment and ES the extra segment. e r IP register (register pointing to the next instruction code to execute) Instruction codes are always taken from the CS. About this is required a register that contains the offset of the next instruction to be performed, referenced to the current code segment. IP contains the position of the instruction referred to the base of the code segment. The status register (FLAGS) The 8086 status register contains 9 1-bit indicators, also called flags. Of these, 6 they record information on the processor status (status flags) and 3 are used to check the processor operations (control flag). SPECIAL PURPOSE REGISTERS: MM0 MM1 MM2 MM3 MM4 MM5 MM6 MM7 XMM0 XMM1 XMM2 XMM3 XMM4 XMM5 XMM6 XMM7 Memory Addressing The AMD64 architecture supports address relocation. To do this, several types of addresses are needed to completely describe memory organization. Specifically, four types of addresses are defined by the AMD64 architecture: • Logical addresses • Effective addresses, or segment offsets, which are a portion of the logical address. • Linear (virtual) addresses • Physical addresses Logical Addresses. A logical address is a reference into a segmented-address space. It is comprised of the segment selector and the effective address. Notationally, a logical address is represented as Logical Address = Segment Selector : Offset The segment selector specifies an entry in either the global or local descriptor table. The specified descriptor-table entry describes the segment location in virtual-address space, its size, and other characteristics. The effective address is used as an offset into the segment specified by the selector. Logical addresses are often referred to as far pointers. Far pointers are used in software addressing when the segment reference must be explicit (i.e., a reference to a segment outside the current segment). Effective Addresses. The offset into a memory segment is referred to as an effective address (see “Segmentation” on page 5 for a description of segmented memory). Effective addresses are formed by adding together elements comprising a base value, a scaled-index value, and a displacement value. The effective-address computation is represented by the equation Effective Address = Base + (Scale x Index) + Displacement The elements of an effective-address computation are defined as follows: • Base—A value stored in any general-purpose register. • Scale—A positive value of 1, 2, 4, or 8. • Index—A two’s-complement value stored in any general-purpose register. • Displacement—An 8-bit, 16-bit, or 32-bit two’s-complement value encoded as part of the instruction. Effective addresses are often referred to as near pointers. A near pointer is used when the segment selector is known implicitly or when the flat-memory model is used. Long mode defines a 64-bit effective-address length. If a processor implementation does not support the full 64-bit virtual-address space, the effective address must be in canonical form Linear (Virtual) Addresses. The segment-selector portion of a logical address specifies a segmentdescriptor entry in either the global or local descriptor table. The specified segment-descriptor entry contains the segment-base address, which is the starting location of the segment in linear-address space. A linear address is formed by adding the segment-base address to the effective address (segment offset), which creates a reference to any byte location within the supported linear-address space. Linear addresses are often referred to as virtual addresses, and both terms are used interchangeably throughout this document. Linear Address = Segment Base Address + Effective Address When the flat-memory model is used—as in 64-bit mode—a segment-base address is treated as 0. In this case, the linear address is identical to the effective address. In long mode, linear addresses must be in canonical address form Physical Addresses. A physical address is a reference into the physical-address space, typically main memory. Physical addresses are translated from virtual addresses using page-translation mechanisms. the paging mechanism is used for virtual-address to physical-address translation. When the paging mechanism is not enabled, the virtual (linear) address is used as the physical address. They are read from memory one byte at a time, starting with the least-significant byte (lowest address). For example, the following instruction specifies the 64-bit instruction MOV RAX, 1122334455667788 instruction that consists of the following ten bytes: 48 B8 8877665544332211 48 is a REX instruction prefix that specifies a 64-bit operand size, B8 is the opcode that—together with the REX prefix—specifies the 64-bit RAX destination register, and 8877665544332211 is the 8- byte immediate value to be moved, where 88 represents the eighth (least-significant) byte and 11 represents the first (most-significant) byte. In memory, the REX prefix byte (48) would be stored at the lowest address, and the first immediate byte (11) would be stored at the highest instruction address. REX An instruction encoding prefix that specifies a 64-bit operand size and provides access to additional registers. Zero-Extension of 32-Bit Results when performing 32-bit operations with a GPR (general purpose registers) destination in 64-bit mode, the processor zero-extends the 32-bit result into the full 64-bit destination. 8-bit and 16-bit operations on GPRs preserve all unwritten upper bits of the destination GPR. This is consistent with legacy 16-bit and 32-bit semantics for partial-width results. Software should explicitly sign-extend the results of 8-bit, 16-bit, and 32-bit operations to the full 64- bit width before using the results in 64-bit address calculations. The following four code examples show how 64-bit, 32-bit, 16-bit, and 8-bit ADDs work. In these examples, “48” is a REX prefix specifying 64-bit operand size, and “01C3” and “00C3” are the opcode and ModRM bytes of each instruction # in hex 1 byte = xx Example 1: 64-bit Add: Before:RAX =0002_0001_8000_2201 RBX =0002_0002_0123_3301 48 01C3 ADD RBX,RAX ;48 is a REX prefix for size. Result:RBX = 0004_0003_8123_5502 Example 2: 32-bit Add: Before:RAX = 0002_0001_8000_2201 RBX = 0002_0002_0123_3301 01C3 ADD EBX,EAX ;32-bit add Result:RBX = 0000_0000_8123_5502 (32-bit result is zero extended) Example 3: 16-bit Add: Before:RAX = 0002_0001_8000_2201 RBX = 0002_0002_0123_3301 66 01C3 ADD BX,AX ;66 is 16-bit size override Result:RBX = 0002_0002_0123_5502 (bits 63:16 are preserved) Example 4: 8-bit Add: Before:RAX = 0002_0001_8000_2201 RBX = 0002_0002_0123_3301 00C3 ADD BL,AL ;8-bit add Result:RBX = 0002_0002_0123_3302 (bits 63:08 are preserved) Segment Register .............Description CS ...........................Code segment DS ...........................Data segment SS ...........................Stack segment ES ...........................Extra segment pointer FS ...........................Extra segment pointer GS ...........................Extra segment pointer CACHE CPU cache is a separate small block of memory used to compensate for the slower access time of main memory(RAM). A cache described as a Level 1 (L1) cache uses memory that is as fast as the CPU, so as long as the CPU is accessing the cache, it will never have to wait for an instruction or data. Level 2 and Level 3 caches are used in conjunction with a Level 1 cache and have memory whose access times are greater than the CPU, but are less than main memory. CPU cache is a hardware cache used by the central processing unit (CPU) of a computer to reduce the average cost (time or energy) to access data from the main memory (RAM). A cache is a smaller, faster memory, located closer to a processor core. Most CPUs have different independent caches, including instruction and data caches, where the data cache is usually organized as a hierarchy of more cache levels (L1, L2, L3, L4, etc.). Driver: code to control a specific device. Bios extension. It prevents the Bios ,that resides in a permanent memory, from having to include all the commands for each hardware component, to avoid to assume enormous dimensions and to become quickly obsolete. ARRAY val: .int 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60 This creates a sequential series of data values placed in memory. Each data value occupies one unit of memory (which in this case is a long integer, or 4 bytes). When referencing data in the array, you must use an index system to determine which value you are accessing. The way this is done is called indexed memory mode. base_address(offset_address, index, size) #index must be a register val (, %edi, 4) # note the use of the Destination Index For example, to reference the value 20 from the values array shown, you would use the following instructions: movl $2, %edi movl val(, %edi, 4), %eax //third value is loaded in EAX //Note that the ARRAY starts with index 0 //If any of the values are zero, they can be omitted (but the commas are still required as placeholders). an example of Indirect addressing: movl %edx, 4(%edi) This instruction places the value contained in the EDX register in the memory location 4 bytes after the location pointed to by the EDI register. You can also go in the opposite direction: movl %edx, -4(&edi) This instruction places the value in the memory location 4 bytes before the location pointed to by the EDI register. info registers : Display the values of all registers print ·········: Display the value of a specific register or variable from the program print/d#dex /t #binary /x #hex x ·············: Display the contents of a specific memory location q: exit from gdb h: help l: list source lines l line number: lines before and after that one chosen info address var: var address info variables: name and address of all variables breakpoint line number : break after the line r : exe until first break c : restart s : exe next instruction until: run the program until it reaches the specified source code line /*comments also stand on several lines*/ //comments only beginning of the line # comments at every point to use C library functions in assembly language program we must link the C library files with the program object code. On Linux systems, there are two ways to link C functions to assembly language program. The first method is called static linking. Static linking links function object code directly into your application executable program file. This creates huge executable programs, and wastes memory if multiple instances of the program are run at the same time (each instance has its own copy of the same functions). The second method is called dynamic linking. Dynamic linking uses libraries that enable programmers to reference the functions in their applications, but not link the function codes in the executable program file. dynamic libraries are called at the program’s runtime by the operating system, and can be shared by multiple programs. On Linux systems, the standard C dynamic library is located in the file libc.so.x where x is a value representing the version of the library. the library file contains the standard C functions, including printf and exit. we must also specify the program that will load the dynamic library at runtime. For Linux systems, this program is ld-linux.so.2 normally found in the /lib directory. To specify this program, you must use the -dynamic-linker parameter of the GNU linker: $ ld -dynamic-linker /lib/ld-linux.so.2 -o prog -lc prog.o .section .data //another type is .rodata , any data elements defined in this section can only be accessed in read-only mode .section .bss .section .text .rodata. Any data elements defined in this section can only be accessed in read-only mode during compilation -gstabs is required gdb -q prog break *_start //start is a label defined in .section .text (gdb) run label nop break *LABEL + Offset break *_start+1 ######################### in MINGW is simply : break 1 , id est break+offset run next or step cont info registers print/d /t #binary /x #hex info variables x/c/d/x &va //va is an example of a label defined in .section.data c=character d=dec x=hex , size of the field, it can be b=byte h=2b w=4b as -gstabs -o path/prog.o path/prog.s ld -dynamic-linker /lib/ld-linux.so.2 -o path/prog -lc path/prog.o .section .text .globl main main: gcc -o path/prog path/prog.s .section .data label: .directive value example: .section .data msg: .ascii “This is a test message” factors: .double 37.45, 45.33, 12.30 height: .int 54 length: .int 62, 35, 47 #The lowest memory value contains the first data element // data elements are placed in the data section in a sequential manner, starting at the lowest memory location in the data section, and working toward higher memory locations. The stack behaves just the opposite. The stack is reserved at the end of the memory area, and as data is placed on the stack, it grows downward. .data section defines memory locations. one label: + one or more .directive Directive ----- Data Type .ascii ---------Text string .asciz ---------Null-terminated text string .byte --------- Byte value .double ------- 64 bit Double-precision floating-point number .float -------- 32 bit Single-precision floating-point number .int ---------- 32-bit integer number .long --------- 32-bit integer number (same as .int) .octa ----------16-byte integer number .quad ----------8-byte integer number .short ---------16-bit integer number .single --------Single-precision floating-point number (same as .float) .fill ----------fill the location with zeros #pp .section .data va: .long 45 .float 3.4 .byte 4, 9, 21 .ascii "shits" .section .text .globl _start _start: nop movl va, %ecx //MOVE VA VALUE IN REGISTER ECX movl $1, %eax //standard movl $0, %ebx //standard int $0x80 //standard linux //in dos = $0x20 to verify the value stored in the memory location was moved to the ECX register: GDB -q prog break *_start+1 run print\x $register next print\x $register The .equ directive is used to set a constant value to a symbol that can be used in the text section, as shown in the following examples: .equ factor, 3 .equ LINUX_SYS_CALL, 0x80 To reference the static data element, you must use a dollar sign before the constant declared The bss section Defining data elements in the bss section is somewhat different from defining them in the data section. Instead of declaring specific data types, you just declare raw segments of memory that are reserved for whatever purpose you need them for. The GNU assembler uses two directives to declare buffers, as shown following Directive .comm //Declares a common memory area for data that is not initialized .lcomm //Declares a local common memory area for data that is not initialized While the two sections work similarly, the local common memory area is reserved for data that will not be accessed outside of the local assembly code. The format for both of these directives is .comm symbol, length where symbol is a label assigned to the memory area, and length is the number of bytes contained in the memory area, as shown in the following example: .section .bss .lcomm buffer, 10000 These statements assign a 10,000-byte memory area to the buffer label. Local common memory areas cannot be accessed by functions outside of where they were declared (they can’t be used in .globl directives). One benefit to declaring data in the bss section is that the data is not included in the executable program. When data is defined in the data section, it must be included in the executable program, since it must be initialized with a specific value. Because the data areas declared in the bss section are not initialized with program data, the memory areas are reserved at runtime, and do not have to be included in the final program To wiew the size of the program: as -o prog.o prog.s ld -o prog prog.o ls -al prog //generates the output that shows the number of bytes .section .bss .lcomm buffer, 10000 //directive of .bss .section .text .globl _start _start: movl $1, %eax movl $0, %ebx int $0x80 10000 bytes reserved are not increased in the size of the executable program file instead .section .data buffer: .fill 10000 //directive of .data .section .text .globl _start _start: movl $1, %eax movl $0, %ebx int $0x80 now the size of the executable program file is increased of 10000 bytes. The default is to create one byte per field, and fill it with zeros. instead the directive .bytes declares a value movl %eax, %ebx movw %ax, %bx movb %al, %bl An immediate data element to a general-purpose register An immediate data element to a memory location A general-purpose register to another general-purpose register A general-purpose register to a segment register A segment register to a general-purpose register A general-purpose register to a control register A control register to a general-purpose register A general-purpose register to a debug register A debug register to a general-purpose register A memory location to a general-purpose register A memory location to a segment register A general-purpose register to a memory location A segment register to a memory location //rtm .section .data ml: .long 545 .section .text .globl _start _start: nop movl $444, %eax movl %eax, ml #the previous of ml value is deleted and replaced movl $1, %eax movl $0, %ebx int $0x80 $ gdb -q rtm (gdb) break *_start+1 Breakpoint 1 at 0x8048075: file rtm.s line 9. (gdb) run Starting program: /........... Breakpoint 1, _start () at rtm... 9 movl $444, %eax (gdb) x/d &ml 0x804908b: 545 (gdb) s # net instruction 12 movl %eax, ml (gdb) s 13 movl $1, %eax (gdb) x/d &ml 0x804908b: 444 (gdb) x/t &ml 0x804908b: 000000000000000000000110111100 (gdb) x/x &ml 0x804908b: 0x000001bc (gdb) (gdb) x/4d &values 0x402000 : 10 100 20 25 (gdb) x/4x &values 0x402000 : 0x0000000a 0x00000064 0x00000014 0x00000019 (gdb) x/4t &values 0x402000 : 00000000000000000000000000001010 00000000000000000000000001100100 00000000000000000000000000010100 00000000000000000000000000011001 $ ./i.a $ echo $? 100 CF Carry flag A mathematical expression has created a carry or borrow OF Overflow flag An integer value is either too large or too small PF Parity flag The register contains corrupt data from a mathematical operation SF Sign flag Indicates whether the result is negative or positive ZF Zero flag The result of the mathematical operation is zero Instruction Pair .......Description ..............EFLAGS Condition CMOVA/CMOVNBE ......Above/not below or equal .....(CF or ZF) = 0 CMOVAE/CMOVNB ......Above or equal/not below .....CF=0 CMOVNC .............Not carry ....................CF=0 CMOVB/CMOVNAE ......Below/not above or equal .....CF=1 CMOVC ..............Carry ........................CF=1 CMOVBE/CMOVNA ......Below or equal/not above .....(CF or ZF) = 1 CMOVE/CMOVZ ........Equal/zero ...................ZF=1 CMOVNE/CMOVNZ ......Not equal/not zero ...........ZF=0 CMOVP/CMOVPE .......Parity/parity even ...........PF=1 CMOVNP/CMOVPO ......Not parity/parity odd ........PF=0 CMOVGE/CMOVNL ......Greater or equal/not less ....(SF xor OF)=0 CMOVL/CMOVNGE ......Less/not greater or equal ....(SF xor OF)=1 CMOVLE/CMOVNG ......Less or equal/not greater ....((SF xor OF) or ZF)=1 CMOVO ..............Overflow .....................OF=1 CMOVNO .............Not overflow .................OF=0 CMOVS ..............Sign (negative) ..............SF=1 CMOVNS .............Not sign (non-negative) ......SF=0 # cmv .section .data output: .asciz “The largest value is %d\n” va: .int 105, 235, 61, 315, 134, 221, 53, 145, 117, 5 .section .text .globl _start _start: nop movl va, %ebx movl $1, %edi loop: movl va(, %edi, 4), %eax cmp %ebx, %eax cmova %eax, %ebx inc %edi cmp $10, %edi jne loop pushl %ebx pushl $output call printf addl $8, %esp pushl $0 call exit (gdb) s 14 movl va(, %edi, 4), %eax (gdb) s 15 cmp %ebx, %eax (gdb) print $eax $1 = 235 (gdb) print $ebx $2 = 105 (gdb) s 16 cmova %eax, %ebx (gdb) s 17 inc %edi (gdb) print $ebx $3 = 235 $ ./cmv The largest value is 315 exchange of values: MOVW %AX , %CX # TMP , cx spare register MOVW %BX , %AX MOVW %CX , %AX The BSWAP instruction reverses the order of the bytes in a register. It is important to remember that the bits are not reversed; but rather, the individual bytes contained within the register are reversed. This produces a big-endian value from a little-endian value, and visa versa. movl $0x12345678, %ebx ####### Current language: auto; currently asm (gdb) step _start () at swaptest.s: bswap %ebx ######## (gdb) print/x $ebx $1 = 0x12345678 (gdb) step _start () at swaptest.s:7 7 movl $1, %eax ######## after this instruction the bytes result reversed (gdb) print/x $ebx $2 = 0x78563412 ############ 4 bytes reversed XADD The XADD instruction is used to exchange the values between two registers, or a memory location and a register, add the values, and then store them in the destination location (either a register or a memory location). The format of the XADD instruction is xadd source, destination where source must be a register, and destination can be either a register or a memory location, and contains the results of the addition. The registers can be 8-, 16-, or 32-bit register values. CMPXCHG The CMPXCHG instruction compares the destination operand with the value in the EAX, AX, or AL registers. If the values are equal, the value of the source operand value is loaded into the destination operand. If the values are not equal, the destination operand value is loaded into the EAX, AX, or AL registers. The CMPXCHG instruction is not available on processors earlier than the 80486. In the GNU assembler, the format of the CMPXCHG instruction is cmpxchg source, destination which is the reverse of the Intel documents. The destination operand can be an 8-, 16-, or 32-bit register, or a memory location. The source operand must be a register whose size matches the destination operand. # prog .section .data data: .int 10 .section .text .globl _start _start: nop movl $10, %eax movl $5, %ebx cmpxchg %ebx, data movl $1, %eax int $0x80 The memory location referenced by the data label is compared with the value in the EAX register using the CMPXCHG instruction. Because they are equal, the value in the source operand (EBX) is loaded in the data memory location, and the value in the EBX register remains the same. You can check this behavior using the debugger: (gdb) run 9 movl $10, %eax (gdb) step 10 movl $5, %ebx (gdb) step 11 cmpxchg %ebx, data (gdb) x/d &data 0x8049090 : 10 (gdb) s 12 movl $1, %eax (gdb) print $eax $3 = 10 (gdb) print $ebx $4 = 5 (gdb) x/d &data 0x8049090 : 5 (gdb) Before the CMPXCHG instruction, the value of the data memory location is 10, which matches the value set in the EAX register. After the CMPXCHG instruction, the value in EBX (which is 5) is moved to the data memory location. Because that value does not match the value in EAX, you will notice that the data value is not changed, but the EAX value now contains the value you set in the data label. //8 .section .data dat: .byte 0x88, 0x77 , 0x66, 0x55, 0x44, 0x33, 0x22, 0x11 .section .text .globl _start _start: nop movl $0x88776655 , %edx #32 bit movl $0x44332211 , %eax movl $0x22222222 , %ecx movl $0x11111111 , %ebx cmpxchg8b dat movl $0 , %ebx movl $1 , %eax int $0x80 (gdb) x/2x &dat 0x402000 : 0x55667788 0x11223344 //88 .section .data dat: .byte 0x11, 0x22 , 0x33, 0x44, 0x55, 0x66, 0x77, 0x88 .section .text .globl _start _start: nop movl $0x88776655 , %edx movl $0x44332211 , %eax movl $0x22222222 , %ecx movl $0x11111111 , %ebx cmpxchg8b dat movl $0 , %ebx movl $1 , %eax int $0x80 (gdb) x/2x &dat 0x402000 : 0x11111111 0x22222222 movl $va, %edi This instruction moves the memory address of the label VA to the EDI register. The dollar sign ($) before the label name instructs the assembler to use the memory address, and not the data value located at the address. next instruction: movl %ebx, (%edi) is the other half of the indirect addressing mode. Without the parentheses around the EDI register, the instruction would just load the value in the EBX register to the EDI register. With the parentheses around the EDI register, the instruction instead moves the value in the EBX register to the memory location, the label VA, contained in the EDI register. This is a very powerful tool. Similar to pointers in C and C++, it enables you to control memory address locations with a register. The real power is realized by incrementing the indirect addressing value contained in the register. movl %edx, 4(%edi) This instruction places the value contained in the EDX register in the memory location 4 bytes after the location pointed to by the EDI register. You can also go in the opposite direction: movl %edx, -4(&edi) This instruction places the value in the memory location 4 bytes before the location pointed to by the EDI register. HOW THE MACHINE ARRANGES THE BYTES: dat: .byte 0x88, 0x77 , 0x66, 0x55, 0x44, 0x33, 0x22, 0x11 generates &dat = 0x55667788 0x11223344 #2*32 bit dat: .byte 0x11, 0x22 , 0x33, 0x44, 0x55, 0x66, 0x77, 0x88 generates &dat = 0x88776655 0x44332211 if ECX:EBX registers were placed in the data memory location : &dat = ebx:ecx Breakpoint 1, _start () at 8.s:9 9 movl $0x88776655 , %edx (gdb) s 10 movl $0x44332211 , %eax (gdb) s 11 movl $0x22222222 , %ecx (gdb) s 12 movl $0x11111111 , %ebx (gdb) s 13 cmpxchg8b dat (gdb) x/2x &dat 0x804909c: 0x55667788 0x11223344 (gdb) x/2x &eax No symbol "eax" in current context. (gdb) x/2x $eax 0x44332211: Cannot access memory at address 0x44332211 (gdb) print $eax $1 = 1144201745 (gdb) print/x $eax $2 = 0x44332211 (gdb) print/x $edx $3 = 0x88776655 (gdb) info registers eax 0x44332211 1144201745 ecx 0x22222222 572662306 edx 0x88776655 -2005440939 ebx 0x11111111 286331153 esp 0xbffff2b0 0xbffff2b0 ebp 0x0 0x0 esi 0x0 0 edi 0x0 0 eip 0x8048089 0x8048089 <_start+21> eflags 0x202 [ IF ] cs 0x73 115 ss 0x7b 123 ds 0x7b 123 es 0x7b 123 fs 0x0 0 gs 0x0 0 (gdb) Quit (gdb) Quit (gdb) then checked and verified that the Label dat is not equal to edx:eax , cmpxchg8b moves the values stored in DAT in edx:eax ordening as 0x88776655:0x44332211 Verifying this by DAT: .byte 0x77, 0x55, 0x22, 0x23, 0x32, 0x98, 0x11, 0x45 #8b8 .section .data DAT: .byte 0x77, 0x55, 0x22, 0x23, 0x32, 0x98, 0x11, 0x45 .section .text .globl _start _start: nop movl $0x88776655, %edx movl $0x44332211, % eax movl $0x22222222, %ecx movl $0x11111111, %ebx cmpxchg8b DAT movl $0x0 , %ebx movl $0x1, %eax int $0x080 (gdb) x/2x &DAT 0x402000 <_data_start__>: 0x23225577 0x45119832 //why not 0x45119832 0x23225577 ? why are the 2 groups of 32 bits sorted like this? which the key? (gdb) info registers eax 0x23225577 589452663 ecx 0x22222222 572662306 edx 0x45119832 1158780978 ebx 0x11111111 286331153 The basic algorithm for a sort in the high-level language "c" is: for(out = array_size-1; out>0, out--) { for(in = 0; in < out; in++) { if (array[in] > array[in+1]) swap(array[in], array[in+1]); } } There are two loops. The inner loop runs through the array, checking the adjacent array value to see which is larger. If a larger value is found in front of a smaller value, the two values are swapped in the array. This continues through to the end of the array. When the first pass has completed, the largest value in the array should be at the end of the array, but the remaining values are not in any particular order. You must take N-1 passes through an array of N elements before all of the elements are in sorted order. The outer loop controls how many total passes of the inner loop are performed. For each new pass of the inner loop, there is one less element to check, as the last element of the previous pass should be in the proper order. This algorithm is implemented in the assembly language program using a data array and two counters, EBX and ECX. The EBX counter is used for the inner loop, decreasing each time an array element is tested. When it reaches zero, the ECX counter is decreased, and the EBX counter is reset. This process continues until the ECX counter reaches zero. This indicates that all of the required passes have been completed. algorithm to sort an array of integers, not the most efficient sort method, but it is the easiest to understand and demonstrate. # sort .section .data va: .int 105, 235, 61, 315, 134, 221, 53, 145, 117, 5 .section .text .globl _start _start: movl $va, %esi movl $9, %ecx # 9 comparisons movl $9, %ebx loop: movl (%esi), %eax cmp %eax, 4(%esi) jge skip xchg %eax, 4(%esi) movl %eax, (%esi) skip: add $4, %esi dec %ebx jnz loop // after ebx reaches 0, then ecx decrease to 8, then after ebx = 0 again, ecx decrease to 7 and so on to zero dec %ecx /*decreasee only ebx loop ended*/ jz end movl $va, %esi movl %ecx, %ebx /* now ebx is resetted to 8, then after a new loop is resetted to 7,..,6.....,1 */ jmp loop end: movl $1, %eax movl $0, %ebx int $0x80 The actual comparing and swapping of array values is done using indirect addressing. The ESI register is loaded with the memory address of the start of the data array. The ESI register is then used as a pointer to each array element during the comparison section: movl (%esi), %eax cmp %eax, 4(%esi) jge skip xchg %eax, 4(%esi) movl %eax, (%esi) skip: First, the value in the first array element is loaded into the EAX register, and compared with the second array element (located 4 bytes from the first). If the second element is already larger than or equal to the first element, nothing happens and the program moves on to the next pair. If the second element is less than the first element, the XCHG instruction is used to swap the first element (loaded into the EAX register) with the second element in memory. Next, the second element (now loaded into the EAX register) is then placed in the first element location in memory. After this, the ESI register is incremented by 4 bytes, now pointing to the second element in the array. The process is then repeated, now using the second and third array elements. This continues until the end of the array is reached. This simple sample program does not produce any output. Instead, to see if it really works, you can use the debugger and view the values array before and after the program is run. Here’s a sample output of the program in action: C:\>gdb -q users\\rnz\desktop\sort.exe Reading symbols from users\\rnz\desktop\sort.exe...done. (gdb) break *end Breakpoint 1 at 0x40102d: file users\rnz\desktop\sort.s, line 27. (gdb) x/10d &values 0x402000 : 105 235 61 315 0x402010 : 134 221 53 145 0x402020 : 117 5 (gdb) run Starting program: C:\users\rnz\desktop\sort.exe [New Thread 14156.0x2b00] Breakpoint 1, end () at users\rnz\desktop\sort.s:27 27 movl $1, %eax (gdb) x/10d &values 0x402000 : 5 53 61 105 0x402010 : 117 134 145 221 0x402020 : 235 315 (gdb) # sort.exp .section .data va: .int 105, 235, 61, 315, 134, 221, 53, 145, 117, 5 .section .text .globl _start _start: movl $va, %esi movl $9, %ecx movl $9, %ebx loop: movl (%esi), %eax cmp %eax, 4(%esi) jge skip xchg %eax, 4(%esi) movl %eax, (%esi) skip: add $4, %esi dec %ebx jnz loop dec %ecx jz end movl $va, %esi //movl %ecx, %ebx ############### NOTICE!!! jmp loop end: movl $1, %eax movl $0, %ebx int $0x80 C:\>as -gstabs -o users\rnz\desktop\sort.exp.o users\rnz\desktop\sort.exp.s C:\>ld -o users\rnz\desktop\sort.exp.exe users\rnz\desktop\sort.exp.o C:\>gdb -q users\rnz\desktop\sort.exp.exe Reading symbols from users\rnz\desktop\sort.exp.exe...done. (gdb) break *end Breakpoint 1 at 0x40102b: file users\rnz\desktop\sort.exp.s, line 27. (gdb) x/10d &va 0x402000 : 105 235 61 315 0x402010 : 134 221 53 145 0x402020 : 117 5 (gdb) run Starting program: C:\users\rnz\desktop\sort.exp.exe [New Thread 10920.0x2b28] Program received signal SIGSEGV, Segmentation fault. loop () at users\rnz\desktop\sort.exp.s:15 15 xchg %eax, 4(%esi) (gdb) x/10d &va 0x402000 : 61 105 134 221 0x402010 : 53 145 117 5 0x402020 : 235 0 (gdb) ##############FAILED SORT!!!!!!!!!!!!!!!!!!!!!!!! the Stack reverse the order of data insertion/retrieving, id est begin fron higher address to lower address. Stack it's a LIFO system with stack pointer to its top pushx source pushl %ecx # puts the 32-bit value of the ECX register on the stack pushw %cx # puts the 16-bit value of the CX register on the stack pushl $100 # puts the value of 100 on the stack as a 32-bit integer value pushl data # puts the 32-bit data value referenced by the data label pushl $data # puts the 32-bit memory address referenced by the data label Note the difference between using the label data versus the memory location $data. The first format (without the dollar sign) places the data value contained in the memory location in the stack, whereas the second format places the memory address referenced by the label in the stack. popx destination popl %ecx # place the next 32-bits in the stack in the ECX register popw %cx # place the next 16-bits in the stack in the CX register popl value # place the next 32-bits in the stack in the value memory location Instruction Description PUSHA/POPA Push or pop all of the 16-bit general-purpose registers PUSHAD/POPAD Push or pop all of the 32-bit general-purpose registers PUSHF/POPF Push or pop the lower 16 bits of the EFLAGS register PUSHFD/POPFD Push or pop the entire 32 bits of the EFLAGS register The PUSHA and POPA instructions are great for quickly setting aside and retrieving the current state of all the general-purpose registers at once. The PUSHA instruction pushes the 16-bit registers so they appear on the stack in the following order: DI, SI, BP, BX, DX, CX, and finally, AX. The PUSHAD instruction pushes the 32-bit counterparts of these registers in the same order. The POPA and POPAD instructions retrieve the registers in the reverse order they were pushed. The behavior of the POPF and POPFD instructions varies depending on the processor mode of operation. When the processor is running in protected mode in ring 0 (the privileged mode), all of the nonreserved flags in the EFLAGS register can be modified, with the exception of the VIP, VIF, and VM flags. The VIP and VIF flags are cleared, and the VM flag is not modified. When the processor is running in protected mode in a higher level ring (an unprivileged mode), the same results as the ring 0 mode are obtained, and the IOFL field is not allowed to be modified. Optimizing Memory Access Memory access is one of the slowest functions the processor performs. When writing assembly language programs that require high performance, it is best to avoid memory access as much as possible. Whenever possible, it is best to keep variables in registers on the processor. Register access is highly optimized for the processor, and is the quickest way to handle data. When it is not possible to keep all of the application data in registers, you should try to optimize the memory access for the application. For processors that use data caching, accessing memory in a sequential order in memory helps increase cache hits, as blocks of memory will be read into cache at one time. One other item to think about when using memory is how the processor handles memory reads and writes. Most processors (including those in the IA-32 family) are optimized to read and write memory locations in specific cache blocks, beginning at the start of the data section. On a Pentium 4 processor, the size of the cache block is 64 bits. If you define a data element that crosses a 64-bit block boundary, it will require two cache operations to retrieve or store the data element in memory. To solve this problem, Intel suggests following these rules when defining data: ❑ Align 16-bit data on a 16-byte boundary. ❑ Align 32-bit data so that its base address is a multiple of four. ❑ Align 64-bit data so that its base address is a multiple of eight. ❑ Avoid many small data transfers. Instead, use a single large data transfer. ❑ Avoid using larger data sizes (such as 80- and 128-bit floating-point values) in the stack. Aligning data within the data section can be tricky. The order in which data elements are defined can be crucial to the performance of your application. If you have a lot of similarly sized data elements, such as integer and floating-point values, place them together at the beginning of the data section. This ensures that they will maintain the proper alignment. If you have a lot of odd-sized data elements, such as strings and buffers, place those at the end of the data section so they won’t throw off the alignment of the other data elements. The gas assembler supports the .align directive, which is used to align defined data elements on specific memory boundaries. The .align directive is placed immediately before the data definition in the data section, instructing the assembler to position the data element on a memory boundary UNCONDITIONAL BRANCHES When an unconditional branch is encountered in the program, the instruction pointer is automatically routed to a different locations: ❑ Jumps ❑ Calls ❑ Interrupt .....❑ Software interrupts .....❑ Hardware interrupts ...../*interrupt controller chip receives the signal */ ...../*and gives the interrupt number to microprocessor*/ ...../*also used to return to os , 0x21 windows , 0x80 linux*/ ...../*in most cases the microprocessor push the address */ ...../*of the current interrupted execution in the stack */ ...../*then the bios exe the instructions then generate */ ...../*an IRet interrupt return that allows to microprocessor */ ...../*to retrieve the address in the stack and resume */ ...../*the interrupted execution */ CONDITIONAL BRANCHES Unlike unconditional branches, conditional branches are not always taken. The result of the conditional branch depends on the state of the EFLAGS register at the time the branch is executed. There are many bits in the EFLAGS register, but the conditional branches are only concerned with five of them: ❑ Carry flag (CF) - bit 0 (lease significant bit) ❑ Overflow flag (OF) - bit 11 ❑ Parity flag (PF) - bit 2 ❑ Sign flag (SF) - bit 7 ❑ Zero flag (ZF) - bit 6 Each conditional jump instruction examines specific flag bits to determine whether the condition is proper for the jump to occur. With five different flag bits, several jump combinations can be performed. The following sections describe the individual jump instructions. Conditional jump instructions The conditional jumps determine whether or not to jump based on the current value of the EFLAGS register. Several different conditional jump instructions use different bits of the EFLAGS register. The format of the conditional jump instruction is jxx address where xx is a one- to three-character code for the condition, and address is the location within the program to jump to (usually denoted by a label). JA Jump if above CF=0 and ZF=0 JAE Jump if above or equal CF=0 JB Jump if below CF=1 JBE Jump if below or equal CF=1 or ZF=1 JC Jump if carry CF=1 JCXZ Jump if CX register is 0 JECXZ Jump if ECX register is 0 JE Jump if equal ZF=1 JG Jump if greater ZF=0 and SF=OF JGE Jump if greater or equal SF=OF JL Jump if less SF<>OF JLE Jump if less or equal ZF=1 or SF<>OF JNA Jump if not above CF=1 or ZF=1 JNAE Jump if not above or equal CF=1 JNB Jump if not below CF=0 JNBE Jump if not below or equal CF=0 and ZF=0 JNC Jump if not carry CF=0 JNE Jump if not equal ZF=0 JNG Jump if not greater ZF=1 or SF<>OF JNGE Jump if not greater or equal SF<>OF JNL Jump if not less SF=OF JNLE Jump if not less or equal ZF=0 and SF=OF JNO Jump if not overflow OF=0 JNP Jump if not parity PF=0 JNS Jump if not sign SF=0 JNZ Jump if not zero ZF=0 JO Jump if overflow OF=1 JP Jump if parity PF=1 JPE Jump if parity even PF=1 JPO Jump if parity odd PF=0 JS Jump if sign SF=1 JZ Jump if zero ZF=1 // jt .section .text .globl _start _start: nop movl $1, %eax jmp overhier movl $10, %ebx int $0x80 overhier: movl $20, %ebx int $0x80 C:\>objdump -d users\rnz\desktop\jt.exe users\rnz\desktop\jt.exe: file format pei-i386 Disassembly of section .text: 00401000 <_start>: 401000: 90 nop 401001: b8 01 00 00 00 mov $0x1,%eax 401006: eb 07 jmp 40100f 401008: bb 0a 00 00 00 mov $0xa,%ebx 40100d: cd 80 int $0x80 0040100f : 40100f: bb 14 00 00 00 mov $0x14,%ebx 401014: cd 80 int $0x80 401016: 90 nop 401017: 90 nop 00401018 <__CTOR_LIST__>: 401018: ff (bad) 401019: ff (bad) 40101a: ff (bad) 40101b: ff 00 incl (%eax) 40101d: 00 00 add %al,(%eax) 00401020 <__DTOR_LIST__>: 401020: ff (bad) 401021: ff (bad) 401022: ff (bad) 401023: ff 00 incl (%eax) 401025: 00 00 add %al,(%eax) renzo@renzo-AO531h:~/Scrivania$ as -gstabs -o jt.o jt.s renzo@renzo-AO531h:~/Scrivania$ ld -o jt jt.o renzo@renzo-AO531h:~/Scrivania$ objdump -d jt jt: formato del file elf32-i386 Disassemblamento della sezione .text: 08048054 <_start>: 8048054: 90 nop 8048055: b8 01 00 00 00 mov $0x1,%eax 804805a: eb 07 jmp 8048063 804805c: bb 0a 00 00 00 mov $0xa,%ebx 8048061: cd 80 int $0x80 08048063 : 8048063: bb 14 00 00 00 mov $0x14,%ebx 8048068: cd 80 int $0x80 renzo@renzo-AO531h:~/Scrivania$ gdb -q jt Reading symbols from jt...done. (gdb) break *_start+1 Breakpoint 1 at 0x8048055: file jt.s, line 6. (gdb) run Starting program: /home/renzo/Scrivania/jt Breakpoint 1, _start () at jt.s:6 6 movl $1, %eax (gdb) print/x $eip $1 = 0x8048055 (gdb) step 7 jmp overhier (gdb) step 11 movl $20, %ebx (gdb) print/x $eip $2 = 0x8048063 (gdb) LABEL CALLED : pushl %ebp movl %esp, %ebp //normal function here movl %ebp, %esp popl %ebp ret # call_x .section .data output: .asciz "This is section n. %d\n" .section .text .globl _start _start: pushl $1 pushl $output call printf add $8, %esp # should clear up stack call overhere pushl $3 pushl $output call printf add $8, %esp # should clear up stack pushl $0 call exit overhere: pushl %ebp movl %esp, %ebp pushl $2 pushl $output call printf add $8, %esp # should clear up stack movl %ebp, %esp popl %ebp ret $ ./call_x This is section n. 1 This is section n. 2 This is section n. 3 $ cmp operand1, operand2 #(operand2 – operand1) ❑ Overflow flag (OF) - It is set when a signed value is too large for the data element containing it. This usually happens during arithmetic operations that overflow the size of the register holding the data. ❑ Parity flag (PF) - If the number of bits set to one in the resultant is even, the parity bit is set (one). If the number of bits set to one in the resultant is odd, the parity bit is not set (zero). ❑ Sign flag (SF) - The sign flag is used in signed numbers to indicate a sign change in the value contained in the register. In a signed number, the last bit is used as the sign bit. It indicates whether the numeric representation is negative (set to 1) or positive (set to 0). ❑ Zero flag (ZF) - if it's set (the two operands are equal) , JE , JZ branch ❑ Carry flag (CF) - The carry flag is used in mathematical expressions to indicate when an overflow has occurred in an unsigned number (remember that signed numbers use the overflow flag). The carry flag is set when an instruction causes a register to go beyond its data size limit. Unlike the overflow flag, the DEC and INC instructions do not affect the carry flag. The carry flag will also be set when an unsigned value is less than zero. For example, this code snippet will also set the carry flag: movl $2, %eax subl $4, %eax jc overflow The resulting value in the EAX register is 254, which represents –2 as a signed number, the correct answer. This means that the overflow flag would not be set. However, because the answer is below zero for an unsigned number, the carry flag is set. Unlike the other flags, there are instructions that can specifically modify the carry flag. These are described in the following table. ....................CLC Clear the carry flag (set it to zero) ....................CMC Complement the carry flag (change it to the opposite of what is set) ....................STC Set the carry flag (set it to one)valuebit 0 (lease significant bit) jxx address (label) The conditional jump instructions take a single operand in the instruction code—the address to jump to. While usually a label in an assembly language program, the operand is converted into an offset address in the instruction code. Two types of jumps are allowed for conditional jumps: ❑ Short jumps ❑ Near jumps A short jump uses an 8-bit signed address offset, whereas a near jump uses either a 16-bit or 32-bit signed address offset. The offset value is added to the instruction pointer. Conditional jump instructions do not support far jumps in the segmented memory model. If you are programming in the segmented memory model, you must use programming logic to determine whether the condition The loop instructions use the ECX register as a counter and automatically decrease its value as the loop instruction is executed. Instruction .............. Description LOOP ..................... Loop until the ECX register is zero LOOPE/LOOPZ .............. Loop until either the ECX register is zero, or the ZF flag is not set LOOPNE/LOOPNZ ............ Loop until either the ECX register is zero, or the ZF flag is set The LOOPE/LOOPZ and LOOPNE/LOOPNZ instructions provide the additional benefit of monitoring the Zero flag. The format for each of these instructions is loop address where address is a label name for a location in the program code to jump to. Unfortunately, the loop instructions support only an 8-bit offset, so only short jumps can be performed. Before the loop starts, you must set the value for the number of iterations to perform in the ECX register. This usually looks something like the following: < code before the loop > movl $100, %ecx loop1: < code to loop through > loop loop1 < code after the loop > Be careful with the code inside the loop. If the ECX register is modified, it will affect the operation of the loop. Use extra caution when implementing function calls within the loop, as functions can easily trash the value of the ECX register without you knowing it. An added benefit of the loop instructions is that they decrease the value of the ECX register without affecting the EFLAGS register flag bits. When the ECX register reaches zero, the Zero flag is not set. # luup .section .data output: .asciz " The value is: %d\n " .section .text .globl _start _start: movl $100, %ecx movl $0, %eax luup: addl %ecx, %eax loop luup pushl %eax pushl $output call printf add $8, %esp movl $1, %eax movl $0, %ebx int $0x80 if the value of ECX is already zero before the LOOP instruction, it will be decreased by one, making it -1. Because this value is not zero, the LOOP instruction continues on its way, looping back to the defined label. The loop will eventually exit when the register overflows, and the incorrect value is displayed. JCXZ instruction is used to perform a conditional branch if the ECX register is zero. This is exactly what we need to solve this problem. DISASSEMBLING a C program /* ifthen.c */ #include int main() { int a = 100; int b = 25; if (a > b) { printf("The higher value is %d\n", a); } else printf("The higher value is %d\n", b); return 0; } renzo@renzo-AO531h:~/Scrivania$ gcc -S ifthen.c renzo@renzo-AO531h:~/Scrivania$ cat ifthen.s .file "ifthen.c" .text .section .rodata .LC0: .string "The higher value is %d\n" .text .globl main .type main, @function main: .LFB0: .cfi_startproc leal 4(%esp), %ecx .cfi_def_cfa 1, 0 andl $-16, %esp pushl -4(%ecx) pushl %ebp .cfi_escape 0x10,0x5,0x2,0x75,0 movl %esp, %ebp pushl %ebx pushl %ecx .cfi_escape 0xf,0x3,0x75,0x78,0x6 .cfi_escape 0x10,0x3,0x2,0x75,0x7c subl $16, %esp call __x86.get_pc_thunk.ax addl $_GLOBAL_OFFSET_TABLE_, %eax movl $100, -16(%ebp) movl $25, -12(%ebp) movl -16(%ebp), %edx cmpl -12(%ebp), %edx jle .L2 subl $8, %esp pushl -16(%ebp) leal .LC0@GOTOFF(%eax), %edx pushl %edx movl %eax, %ebx call printf@PLT addl $16, %esp jmp .L3 .L2: subl $8, %esp pushl -12(%ebp) leal .LC0@GOTOFF(%eax), %edx pushl %edx movl %eax, %ebx call printf@PLT addl $16, %esp .L3: movl $0, %eax leal -8(%ebp), %esp popl %ecx .cfi_restore 1 .cfi_def_cfa 1, 0 popl %ebx .cfi_restore 3 popl %ebp .cfi_restore 5 leal -4(%ecx), %esp .cfi_def_cfa 4, 4 ret .cfi_endproc .LFE0: .size main, .-main .section .text.__x86.get_pc_thunk.ax,"axG",@progbits,__x86.get_pc_thunk.ax,comdat .globl __x86.get_pc_thunk.ax .hidden __x86.get_pc_thunk.ax .type __x86.get_pc_thunk.ax, @function __x86.get_pc_thunk.ax: .LFB1: .cfi_startproc movl (%esp), %eax ret .cfi_endproc .LFE1: .ident "GCC: (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0" .section .note.GNU-stack,"",@progbits CODE ANALYSIS : pushl %ebp movl %esp, %ebp subl $24, %esp andl $-16, %esp movl $0, %eax subl %eax, %esp //stores the EBP register so it can be used as a pointer to the local stack area in the program. The stack pointer, ESP, is then manually manipulated to make room for putting local variables on the stack. movl $100, -4(%ebp) movl $25, -8(%ebp) // creates the two variables used in the If statement. the first instruction manually moves the value for the a variable into a location on the stack (4 bytes in front of the location pointed to by the EBP register). The second instruction manually moves the value for the b variable into the next location on the stack (8 bytes in front of the location pointed to by the EBP register). This technique, is commonly used in functions. Now that both variables are stored on the stack, it’s time to execute the if statement: movl -4(%ebp), %eax cmpl -8(%ebp), %eax jle .L2 // First, the value for the a variable is moved to the EAX register, and then that value is compared to the value for the b variable, still in the local stack. Instead of looking for the if condition a > b, the assembly language code is looking for the opposite, a <= b. If the statement evaluates to “true,” the jump to the .L2 label is made, which is the “else” part of the If statement: .L2: movl -8(%ebp), %eax movl %eax, 4(%esp) movl $.LC0, (%esp) call printf // This is the code to print the answer for the b variable, which was contained in the else part of the If statement. First the b variable value is retrieved and manually placed on the stack, and then the location of the output text (located at the .LC0 label) is placed on the stack. With both elements on the stack, the printf C function is called to display the answer. The code then proceeds to the ending instructions. // If the JLE instruction was false, then a is not less than or equal to b, and the jump is not performed. Instead, the “then” part of the If statement is performed: movl -4(%ebp), %eax movl %eax, 4(%esp) movl $.LC0, (%esp) call printf jmp .L3 // Here, the a variable is loaded onto the stack, along with the output text. Then the printf C function is called to display the answer, and execution jumps to the .L3 label. Finally, all roads load to the exit C function .L3: movl $0, (%esp) call exit .size main, .-main .section .note.GNU-stack,””,@progbits .ident "GCC: linux distro BRANCH PREDICTION When a branch instruction is encountered, the processor out-of-order engine must determine the next instruction to be processed. The out-of-order unit utilizes a separate unit called the branch prediction front end to determine whether or not a branch should be followed. The branch prediction front end employs different techniques in its attempt to predict branch activity. When creating assembly language code that includes conditional branches, you should be aware of this processor feature. .... Unconditional branches With unconditional branches, the next instruction is not difficult to determine, but depending on how long of a jump there was, the next instruction may not be available in the instruction prefetch cache. When the new instruction location is determined in memory, the out-of-order engine must first determine if the instruction is available in the prefetch cache. If not, the entire prefetch cache must be cleared, and reloaded with instructions from the new location. This can be costly to the performance of the application. .... Conditional branches Conditional branches present an even greater challenge to the processor. For each unconditional branch, the branch prediction unit must determine if the branch should be taken or not. Usually, when the outof- order engine is ready to execute the conditional branch, not enough information is available to know for certain which direction the branch will take. Instead, the branch prediction algorithms attempt to guess which path a particular conditional branch will take. This is done using rules and learned history. Three main rules are implemented by the branch prediction algorithms: ❑ Backward branches are assumed to be taken. ❑ Forward branches are assumed to be not taken ❑ Branches that have been previously taken are taken again. Using normal programming logic, the most often seen use of backward branches (branches that jump to previous instruction codes) is in loops. For example, the code snippet movl $100, $ecx loop1: addl %cx, %eax decl %ecx jns loop1 will jump 100 times back to the loop1 label, but fall through to the next instruction only once. The first branching rule will always assume that the backwards branch will be taken. Out of the 101 times the branch is executed, it will only be wrong once. Forward branches are a little trickier. The branch prediction algorithm assumes that most of the time conditional branches that go forward are not taken. In programming logic, this assumes that the code immediately following the jump instruction is most likely to be taken, rather than the jump that moves over the code. This situation is seen in the following code snippet: movl -4(%ebp), %eax cmpl -8(%ebp), %eax jle .L2 movl -4(%ebp), %eax movl %eax, 4(%esp) movl $.LC0, (%esp) call printf jmp .L3 .L2: movl -8(%ebp), %eax movl %eax, 4(%esp) movl $.LC0, (%esp) call printf .L3: It is the code snippet from the analysis of the C program ifthen. The code following the JLE instruction handles the “then” part of the If statement. From a branch prediction point of view, we can now see the reason why the JLE instruction was used instead of a JG instruction. When the compiler created the assembly language code, it attempted to maximize the code’s performance by guessing that the “then” part of the If statement would be more likely to be taken than the “else” part. Because the processor branch prediction unit assumes forward jumps to not be taken, the “then” code would already be in the instruction prefetch cache, ready to be executed. The final rule implies that branches that are performed multiple times are likely to follow the same path the majority of the time. The Branch Target Buffer (BTB) keeps track of each branch instruction performed by the processor, and the outcome of the branch is stored in the buffer area. The BTB information overrides either of the two previous rules for branches. For example, if a backward branch is not taken the first time it is encountered, the branch prediction unit will assume it will not be taken any subsequent times, rather than assume that the backwards branch rule would apply. The problem with the BTB is that it can become full. As the BTB becomes full, looking up branch results takes longer, and performance for executing the branch decreases. /* for.c */ #include int main() { int i = 0; int j; for (i = 0; i < 1000; i++) { j = i * 5; printf(“The answer is %d\n”, j); } return 0; } $ gcc -S for.c $ cat for.s .file “for.c” .section .rodata .LC0: .string “The answer is %d\n” .text .globl main .type main, @function main: pushl %ebp movl %esp, %ebp subl $24, %esp andl $-16, %esp movl $0, %eax subl %eax, %esp movl $0, -4(%ebp) movl $0, -4(%ebp) .L2: cmpl $999, -4(%ebp) jle .L5 jmp .L3 .L5: movl -4(%ebp), %edx movl %edx, %eax sall $2, %eax addl %edx, %eax movl %eax, -8(%ebp) movl -8(%ebp), %eax movl %eax, 4 movl $.LC0, (%esp) call printf leal -4(%ebp), %eax incl jmp .L2 .L3: movl $0, (%esp) call exit .size main, .-main .section .note.GNU-stack,””,@progbits .ident “GCC: (GNU) linux distro” Similar to the if statement code, the for statement code first does some housekeeping with the ESP and EBP registers, manually setting the EBP register to the start of the stack, and making room for the variables used in the function. The for statement starts with the .L2 label: .L2: cmpl $999, -4(%ebp) jle .L5 jmp .L3 The condition set in the for statement is set at the beginning of the loop. In this case, the condition is to determine whether the variable is less than 1,000. If the condition is true, execution jumps to the .L5 label, where the for loop code is. When the condition is false, execution jumps to the .L3 label, which is the ending code. The For loop code is as follows: .L5: movl -4(%ebp), %edx movl %edx, %eax sall $2, %eax addl %edx, %eax movl %eax, -8(%ebp) movl -8(%ebp), %eax movl %eax, 4(%esp) movl $.LC0, (%esp) call printf The first variable location (the i variable in the C code) is moved to the EDX register, and then moved to the EAX register. The next two instructions are mathematical operations. The CALL instruction performs a left shift of the EAX register two times. This is equivalent to multiplying the number in the EAX register by 4. The next instruction adds the EDX register value to the EAX register value. Now the EAX register contains the original value multiplied by 5 (tricky). After the value has been multiplied by 5, it is stored in the location reserved for the second variable (the j variable in the C code). Finally, the value is placed on the stack, along with the location of the output text, and the printf C function is called. The next part of the code gets back to the for statement function: leal -4(%ebp), %eax incl (%eax) jmp .L2 The LEA instruction loads the effective memory address of the declared variable into the register specified. Thus, the memory location of the first variable (i) is loaded into the EAX register. The next instruction uses the indirect addressing mode to increment the value pointed to by the EAX register by one. This in effect adds one to the i variable. After that, execution jumps back to the start of the for loop, where the I value will be tested to determine whether it is less than 1,000, and the whole process is performed again. From this example you can see the framework for implementing for loops in assembly language. The pseudocode looks something like this: for: jxx forcode ; jump to the code of the condition is true jmp end ; jump to the end if the condition is false forcode: < for loop code to execute> jmp for ; go back to the start of the For statement end: The while loop code uses a format similar to the For loop code. TO OPTIMIZE conditional branch instructions can have a detrimental effect on the prefetch cache. As branches are detected in the cache, the out-of-order engine attempts to predict what path the branch is most likely to take. If it is wrong, the instruction prefetch cache is loaded with instructions that are not used, and processor time is wasted. To help solve this problem, you should be aware of how the processor predicts branches, and attempt to code your branches in the same manner. Also, eliminating branches whenever possible will greatly speed things up. Examining loops and converting them into a sequential series of operations enables the processor to load all of the instructions into the prefetch cache, and not have to worry about branching for the loop. The most obvious way to solve branch performance problems is to eliminate the use of branches whenever possible. Intel has helped in this by providing some specific instructions. CMOV instructions were specifically designed to help the assembly language programmer avoid using branches to set data values. An example of using a CMOV instruction is as follows: movl value, %ecx cmpl %ebx, %ecx cmova %ecx, %ebx The CMOVA instruction checks the results from the CMP instruction. If the unsigned integer value in the ECX register is above the unsigned integer value in the EBX register, the value in the ECX register is placed in the EBX register. This functionality enabled us to create the cmovtest.s program, which determined the largest number in a series without a bunch of jump instructions. Sometimes duplicating a few extra instructions can eliminate a jump. This small instruction overhead will easily fit into the instruction prefetch cache, and make up for the performance hit of the jump itself. An example: loop: cmp data(, %edi, 4), %eax je part2 call function1 jmp looptest part2: call function2 looptest: inc %edi cmpl $10, %edi # 10 - edi = jnz loop The loop calls one of two functions, depending on the value read from the data array. After the function is called, a jump is made to the end of the loop to increase the index value of the array and loop back to the start of the loop. Each time the first function is called, the JMP instruction must be evaluated to jump forward to the looptest label. Because this is a forward branch, it will not be predicted to be taken, and a performance penalty will result. To change this, you can modify the code snippet to look like the following: loop: cmp data(, %edi, 4), %eax je part2 call function1 inc %edi cmp $10, %edi jnz loop jmp end part2: call function2 inc %edi cmp $10, %edi jnz loop end: Instead of using the forward branch within the loop, the looptest code was duplicated within the first function code section, eliminating one forward jump from the code. UNROLL LOOPS movl values, %ebx movl $1, %edi loop: movl values(, %edi, 4), %eax #second element of array label: values loaded in eax // if edi=0 it would have been the first element cmp %ebx, %eax cmova %eax, %ebx inc %edi cmp $4, %edi jne loop Instead of looping through the instructions to look for the largest value four times, you can unroll the loop into a series of four moves: movl values, %ebx movl $values, %ecx movl (%ecx), %eax cmp %ebx, %eax cmova %eax, %ebx movl 4(%ecx), %eax cmp %ebx, %eax cmova %eax, %ebx movl 8(%ecx), %eax cmp %ebx, %eax cmova %eax, %ebx movl 12(%ecx), %eax cmp %ebx, %eax cmova %eax, %ebx While the number of instructions has greatly increased, the processor will be able to feed all of them directly into the instruction prefetch cache, and zip through them in no time. Be careful when unrolling loops though, as it is possible to unroll too many instructions, and fill the prefetch cache. This will force the processor to constantly fill and empty the prefetch cache. NUMBERS (gdb) x/x &data 0x80490bc : 0x00000225 (gdb) x/4b &data 0x80490bc : 0x25 0x02 0x00 0x00 (gdb) print/x $eax # reminding that ' print ' with no /x shows decimal value $1 = 0x225 (gdb) The decimal value 549 is stored in memory location data, and moved to the EAX register. The first gdb command uses the x command to display the value in memory located at the data label in hexadecimal format. The hexadecimal display shows what we would expect for the hex version of 549. The next command displays the 4 bytes that make up the integer value. Notice that the binary format version shows the 0x25 and 0x02 hex values reversed, which is what we would expect for little-endian format. The last command uses the print command to display the same value after it is loaded into the EAX register, again in hexadecimal format. = 0x688 = 6*256+8*16+8*1 = 1672 each byte is represented by a hexadecimal pair (4 bits per hexadecimal value), which are combined to form the eight-character hexadecimal value. Again, this example uses big-endian format, as seen in the register. ❑ Unsigned integers ❑ Signed integers ❑ Binary-coded decimal ❑ Packed binary-coded decimal ❑ Single-precision floating-point ❑ Double-precision floating-point ❑ Double-extended floating-point the SIMD extensions on Pentium processors add other advanced numeric data types: ❑ 64-bit packed integers ❑ 128-bit packed integers ❑ 128-bit packed single-precision floating-point ❑ 128-bit packed double-precision floating-point One problem with the signed magnitude method is that there are two different ways to express a zero value: 00000000 (decimal +0) and 10000000 (decimal -0). This can complicate some mathematical processes. Also, arithmetic using signed magnitude numbers is complicated, as adding and subtracting simple signed integers cannot be done in the same way as unsigned numbers. For example, doing a simple binary addition of the values 00000001 (decimal 1) and 10000001 (decimal –1) produces 10000010 (decimal –2), which is not the correct answer. Different arithmetic instructions for signed integers and unsigned numbers would be required on the processor. ONE’s complement The one’s complement method takes the inverse of the unsigned integer value to produce the similar negative value. The inverse changes any zero bits to ones, and any ones bits to zeroes. Thus, the one’s complement of 00000001 would be 11111110. Again, as with signed magnitude numbers, one’s complement numbers have some problems when performing mathematical operations. There are two ways of representing a zero value (00000000 and 11111111), and arithmetic with one’s complement numbers is also complicated (it does not allow you to do standard binary math). TWO’s complement The two’s complement method solves the arithmetic problem of the signed magnitude and one’s complement methods using a simple mathematical trick. For negative integer values, a one is added to the one’s complement of the value. For example, to find the two’s complement value for decimal -1 you would do the following: 1. Take the one’s complement of 00000001, which is 11111110. 2. Add one to the one’s complement, which is 11111111. Doing the same for the value -2 you would get 11111110, and for -3 it would be 11111101. You may notice a trend here. The two’s complement value counts down from 11111111 (decimal –1) until it gets to 10000000, which represents -128. Of course, for multibyte integer sizes the same principle applies across the bytes. While this seems like an odd thing to do, it solves all of the problems in adding and subtracting signed integers. For example, adding the values 00000001 (+1) and 11111111 (-1) produces the value 00000000, along with a carry value. The carry value is ignored in signed integer arithmetic, so the final value is indeed 0. The same hardware can be used to add and subtract both unsigned and signed values. When converting unsigned integer values to a larger bit size (such as converting a word to a doubleword), you must ensure that all of the leading bits are set to zero. You should not simply copy one value to the other, as shown here: movw %ax, %bx There is no guarantee that the upper part of the EBX register contains zeroes. To accomplish this, you must use two instructions: movl $0, %ebx movw %ax, %ebx The MOVL instruction is used to load a zero value in the EBX register. This guarantees that the EBX register is completely zero. Then you can safely move the unsigned integer value in the AX register to the EBX register. To help you in these situations, Intel provides the MOVZX instruction. This instruction moves a smallersized unsigned integer value (in either a register or a memory location) to a larger-sized unsigned integer value (only in a register). The format of the MOVZX instruction is movzx source, destination where source can be an 8-bit or 16-bit register or memory location, and destination can be a 16-bit or 32-bit register. The movzxtest.s program demonstrates this instruction: # movzxtest.s .section .text .globl _start _start: nop movl $279, %ecx movzx %cl, %ebx ########## ecl , ech = WRONG // CX = 100010111 = 279 // CH = 00000001 = 1 // CL = 00010111 = 23 movl $1, %eax int $0x80 The movzxtest.s program simply puts a large value in the ECX register, and then uses the MOVZX instruction to copy the lower 8 bits to the EBX register. Because the value placed in the ECX register used a word unsigned integer to represent it (it is larger than 255), the value in CL represents only part of the complete value. You can watch the program in the debugger and see what is happening to the registers: $ gdb -q movzxtest (gdb) break *_start+1 Breakpoint 1 at 0x8048075: file movzxtest.s, line 5. (gdb) run Starting program: /home/rich/palp/chap07/movzxtest Breakpoint 1, _start () at movzxtest.s:5 5 movl $279, %ecx Current language: auto; currently asm (gdb) s 6 movzx %cl, %ebx (gdb) s 7 movl $1, %eax (gdb) print $ecx $1 = 279 (gdb) print $ebx $2 = 23 (gdb) print/x $ecx $3 = 0x117 (gdb) print/x $ebx $4 = 0x17 (gdb) By printing out the decimal values of the EBX and ECX registers, you can tell right away that the unsigned integer value was not copied correctly—the original value was 279 but the new value is only 23. By displaying the values in hexadecimal, you can see what happened. The original value in hex is 0x0117, which takes a doubleword to hold. The MOVZX instruction moved just the lower byte of the ECX register, but zeroed out the remaining bytes in the EBX register, producing the 0x17 value in the EBX register. # ch .section .text .globl _start _start: nop movl $279, %ecx movzx %ch, %ebx ############## 00000001 movl $1, %eax int $0x80 C:\>gdb -q users\rnz\desktop\ch.exe Reading symbols from users\rnz\desktop\ch.exe...done. (gdb) break *_start+1 (gdb) run (gdb) print $ebx $1 = 1 (gdb) print $ch $2 = 1 (gdb) print $ech $3 = void (gdb) (gdb) print $cx $4 = 279 (gdb) print/x $ecx $5 = 0x117 (gdb) print/x $ebx $6 = 0x1 (gdb) print $cl $1 = 23 (gdb) print $ecx $4 = 279 (gdb) print $ecl $5 = void Extending signed integer values is different than extending unsigned integers. Padding the high bits with zeroes will change the value of the data for negative numbers. For example, the value -1 (11111111) moved to a doubleword would yield the value 0000000011111111, which in signed integer notation would be +127, not -1. For a signed extension to work, the new bits added must be set to a one value. Thus, the new doubleword would yield the value 1111111111111111, which is the signed integer notation for the value -1, which is what it should be. Intel has provided the MOVSX instruction to allow extending signed integers and preserving the sign. It is similar to the MOVSZ instruction, but it assumes that the bytes to be moved are in signed integer format and attempts to preserve the signed integer value for the move. # _movsx .section .text .globl _start _start: nop movw $-79, %cx movl $0, %ebx movw %cx, %bx movsx %cx, %eax movl $1, %eax movl $0, %ebx int $0x80 # _movsx in WINDOWS using MINGW-W64 .section .text //.globl _start /*ENTRY POINT */ //_start: /*ENTRY POINT*/ nop movw $-79, %cx movl $0, %ebx movw %cx, %bx movsx %cx, %eax movl $1, %eax movl $0, %ebx int $0x21 //gdb break 1 //*_start+1 not necessary 5 nop (gdb) s 6 movw $-79, %cx (gdb) s 7 movl $0, %ebx (gdb) s 8 movw %cx, %bx (gdb) s 9 movsx %cx, %eax (gdb) info reg eax 0x60ffcc 6356940 ecx 0x40ffb1 4259761 edx 0x401000 4198400 ebx 0xffb1 65457 esp 0x60ff74 0x60ff74 ebp 0x60ff80 0x60ff80 esi 0x401000 4198400 edi 0x401000 4198400 eip 0x40100d 0x40100d eflags 0x246 [ PF ZF IF ] cs 0x23 35 ss 0x2b 43 ds 0x2b 43 es 0x2b 43 fs 0x53 83 gs 0x2b 43 (gdb) print $bx $1 = -79 (gdb) print $cx $2 = -79 (gdb) # mmovsx .section .text .globl _start _start: nop movw $79, %cx xor %ebx, %ebx ############# set to zero in every case movw %cx, %bx movsx %cx, %eax movl $1, %eax movl $0, %ebx int $0x80 # qquad .section .data data1: .int 1, -1, 463345, -333252322, 0 # every element of array has 4 bytes data2: .quad 1, -1, 463345, -333252322, 0 # every element has 8 bytes .section .text .globl _start _start: nop movl $1, %eax movl $0, %ebx int $0x21 gdb -q users\rnz\desktop\qquad.exe (gdb) break 1 (gdb) run Starting program: C:\users\rnz\desktop\qquad.exe [New Thread 440.0x260] Breakpoint 1, start () at users\rnz\desktop\qquad.s:10 10 nop (gdb) x/20b &data1 0x402000 : 0x01 0x00 0x00 0x00 0xff 0xff 0xff 0xff 0x402008 : 0xf1 0x11 0x07 0x00 0x1e 0xf9 0x22 0xec 0x402010 : 0x00 0x00 0x00 0x00 //32 byte each element (gdb) (gdb) x/40b &data2 0x402014 : 0x01 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x40201c : 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0x402024 : 0xf1 0x11 0x07 0x00 0x00 0x00 0x00 0x00 0x40202c : 0x1e 0xf9 0x22 0xec 0xff 0xff 0xff 0xff 0x402034 : 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 //64 bit each element (gdb) (gdb) x/5d &data1 0x402000 : 1 -1 463345 -333252322 0x402010 : 0 (gdb) 0x402014 : 1 0 -1 -1 0x402024 : 463345 //WRONG , x/d NOT SUFFICIENT. the debugger didn’t know how to handle displaying the values using just the x/d command. If you want to display quadword signed integer values in the debugger, you must use the gd option (gdb) (gdb) x/5gd &data2 0x402014 : 1 -1 0x402024 : 463345 -333252322 0x402034 : 0 (gdb) SIMD Integers The Intel Single Instruction Multiple Data (SIMD) technology provides additional ways to define integers These new integer types enable the processor to perform arithmetic operations on a group of multiple integers simultaneously. The SIMD architecture uses the packed integer data type. A packed integer is a series of bytes that can represent more than one integer value. Mathematical operations can be performed on the series of bytes as a whole, working on the individual integer values within the series in parallel. MMX integers The Multimedia Extension (MMX) technology introduced in the Pentium MMX and Pentium II processors provided three new integer types: ❑ 64-bit packed byte integers ❑ 64-bit packed word integers ❑ 64-bit packed doubleword integers Each of these data types provides for multiple integer data elements to be contained (or packed) in a single 64-bit MMX register. # mmxt .section .data values1: .int 1, -1 values2: .byte 0x10, 0x05, 0xff, 0x32, 0x47, 0xe4, 0x00, 0x01 .section .text .globl _start _start: nop movq values1, %mm0 movq values2, %mm1 movl $1, %eax movl $0, %ebx //int $0x80 in Linux int $0x21 (gdb) print $mm0 $1 = {uint64 = -4294967295, v2_int32 = {1, -1}, v4_int16 = {1, 0, -1, -1}, v8_int8 = {1, 0, 0, 0, -1, -1, -1, -1}} (gdb) print/t $mm0 $3 = {uint64 = 1111111111111111111111111111111100000000000000000000000000000001, v2_int32 = {1, 11111111111111111111111111111111}, v4_int16 = {1, 0, 1111111111111111, 1111111111111111}, v8_int8 = {1, 0, 0, 0, 11111111, 11111111, 11111111, 11111111}} (gdb) (gdb) print/x $mm0 $2 = {uint64 = 0xffffffff00000001, v2_int32 = {0x1, 0xffffffff}, v4_int16 = {0x1, 0x0, 0xffff, 0xffff}, v8_int8 = {0x1, 0x0, 0x0, 0x0, 0xff, 0xff, 0xff, 0xff}} (gdb) print/x $mm1 $3 = {uint64 = 0x100e44732ff0510, v2_int32 = {0x32ff0510, 0x100e447}, v4_int16 = {0x510, 0x32ff, 0xe447, 0x100}, v8_int8 = {0x10, 0x5, 0xff, 0x32, 0x47, 0xe4, 0x0, 0x1}} .byte 10, 05, 250, 32, 47, 4, 0, 1 (gdb) print $mm1 $2 = {uint64 = 72062194501158154, v2_int32 = {553256202, 16778287}, v4_int16 = {1290, 8442, 1071, 256}, v8_int8 = {10, 5, -6, 32, 47, 4, 0, 1}} (gdb) to be continued

HOME PAGE