Computer Organization CH2 Instructions: Language of the Computer
Computer Organization CH2 Instructions: Language of the Computer
2.1 Introduction
Instruction Set
Instructions
- command a computer’s hardware
- the words of a computer’s language
An Instruction set
- vocabulary of the language
- considered as the repertoire(劇目) of instructions of a computer
Different computers have diff. instruction set
- with many aspect in common
- a few basic operations that all computer must provide(有些基礎的運算是通用的)
Early computers had very simple instruction set
- many modern computers also
The RISC-V instruction set
the example throughout this course
developed at UC Berkeley as open ISA in 2010
- Typical of many modern ISAs
- Similar ISAs have a large share of embedded core market
- In consumer electronics, network/storage equipment, cameras, printers
RV32 Instructions
32 registers
- x0~x31, x0=0
- 32-bit words registers
Addressing
- byte addressable(位元組定位)
- Sequential word accesses differ by 4 (4 byte = 32 bit)
- E.g., Mem[0], Mem[4], Mem[8], …
Instruction can be divide into 6 group
- arithmetic
- data transfer
- logical
- shift
- conditional branch
- unconditional branch
2.2 Operation of the Computer Hardware
Arithmetic Operation
this instruction instructs a computer to add the two var. b and c to put their sum to a
- add a, b, c
Instructions, add and substract, each has three operands
- two source
- one destination
- 兩個來源,一個目的地
Design Principle 1:Simplicity favors reqularity(簡單有利於規律性)
- Regularity makes imple. simpler(規律簡化實作)
- simplicity enables higher performance at lower cost(簡單能夠以低消耗,實現更高的表現)
2.3 Operand of the Computer Hardware
Compiling a C statement Example
C code with the five var. f, g, h, i and j
f = (g+h)-(i+j)
add t0, g, h //temp t0 = g+h
add t1, i, j
sub f, t0, t1 //f = t0-t1
The single statement is braked into several instructions
- 1 operation is performed per RISC-V instruction
- Temporary var. (t0 and t1) are used to keep the intermediate results
Register Opearnds
The operands of arithmetic instrction are kept in a limited number of special locations
- called register
Registers are primitive(原語) used in hardware design
- Register are also visible to the programmer when the computer is completed
- think of register as the bricks of computer construction
A 32-bit RISC-V computer has 32 register
- use for frequency accessed data
- 32-bit data is called a “word”
- 64-bit data is called a “double word”
- for a 64-bit machine, there could be 32 * 64-bit general purpose register x0 to x31
- 仍舊是32個暫存器,但每個暫存器的大小變成64 bit
Design Principle 2:Smaller is faster(小的速度更快)
- main memory: million of location, but slower
Why 32 register?
- 31 register may not be faster than 32
- but designer have to balanc the clock cycle and more registers, and number of bits in the instruction format
- A very large number of register may increase the clock cycle time
- takes electronic signals longer when they must travel farther
- 過ㄉㄚ會導致clock cycle time增加,因為需要電訊號需要跑更遠
RISC-V Register
The common arrangement for x0~x31 registers
- x0: the constant value 0
- x1: return address
- x2: stack pointer
- x3: global pointer
- x4: thread pointer
- x5-x7, x28-x31:temporaties (not reserved)
- x8: frame pointer
- x9, x18-x27: saved registers
- x10-x11: function arguments/results (not reserved)
- x12-x17: function arguments (not reserved)
Compiling a C statement Ex.
like previous part
f=(g+h)-(i+j)
In fact, a compiler has to associate program var. with registers
//f, g, h, i, j are in x19, x20, .., x23
add x5, x20, x21
add x6, x22, x23
sub x19, x5, x6 //x5 and x6 are for temporary data
Memory Operands
Main memory used for composite data(主記憶體被用於複雜的資料)
- simple data ele.: int., float
- complex data structure: array, dynamic data
Arithmetic operations occur only on register(運算操作只使用於暫存器上)
- data transfer instruction are required to
- load values from memory into register
- store result from register to memory
memory is byte addressed
- memory is considered as a large single dimentional array
- address is the index to this array
- each memory address identifies an 8-bit byte (8個bit,其實就是一個byte)
RISC-V does not require words to be aligned in memory
- unlike some ISAs
- 不需要對齊,也就是不用每個字都必須要從0, 4, 8, 12, …的位置開始
- 只需要每個字之間的間隔為4
RISV-C is Little Endian
- least-significant byte at least address of a word
- Big Endian
- most-significant byte at least address
Memory Operand Ex.
A is an array of 100 words
- compiler associates the var. g and h with the registers x20 and x21
- base address of A is in x22
g = h+A[8]
its RISC-V code
lw x9 8(x22) //transfer A[8] to register x9
add x20, x21, x9
Memory address
Compiler helps
- associate var. with registers
- allocate data structure
- arrays and structures, to memory location
RISC-V is byte addressable
- the address of a word matches the address of one of the 4 bytes within the word
- address of sequential words differ by four(連續字之間的位置差了4 byte)
- each array ele. is 4-byte data(其實就是一個word)
Alignment restriction
- In many architectures, words musts start at addresses that are multiple of 4
- for better accessing efficiency
- e.g. MIPS
with the byte addressing concept
previous code become
lw x9, 32(x22) //32 = 8*4
add x20, x21, x9
A sw(store word) instruction is needed if the result should be stored into a memory location
sw x20, 48(x22) //48 = 12*4, A[12] = f
what is th offset when we handling the array in 32-bit machine?
- each ele. is word
Registers vs. memory
Registers are faster tn access than memory
- assuming 32-bit data
- register are roughly 200 times faster (0.25 vs. 50 nano ec.)
- 10,000 times more energy effcient (0.1 vs. 1000 picoJoules0)
- than DRAM in 2020
Operating on memory data requires loads and stores
- more instruction to be executed
Many programs have more var. than computers registers
- compiler must use register for var. as much as possible
- “Spilling register” is putting less frequency used var. into memory
- register optimization is important for execution performanc
Immediate Operands
Constant are often used in prog.
- increment an index(constant) to point to the next ele. of an array
- in fact, more than half of the RISC-V arithmetic instruction have a constant as an operand when running the SPEC CPU2006 benchmarks(一種CPU效能檢測的標準工具)
add immediate(addi) instruction can be used to avoid the load instruction
addi x22, x22, 4 //x22 = x22+4
The constant zero has a special role
- which is to simplify instruction set by offering useful variations
- can negate value, by using the sub instruction with zero for the first operand
- RISC-V dedicates register x0 to be hard-wired(硬連線) to the value zero
2.4 Signed and Unsigned Numbers
Unsigned numbers(bits)
Two’s complement
Two’s complement signed numbers
Negation Shortcut
to negate a two’s complement number
, - simply invert every 0 to 1 and every 1 to 0
- add one to the result
sign extension shortcut
simply copied most significant bit to fill the new bits
- copy sign bit
Examples: extends an 8-bit number to 16-bit number
- +2: 0000 0010 => 0000 0000 0000 0010
- –2: 1111 1110 => 1111 1111 1111 1110
In RISC-V
- lb: sign-extend loaded byte
- lbu: zero-extend loaded byte
- 補0
2.5 Representing Instructions in the Computer
Representing instruction
The way computers see instruction
Machine language
- is the numeric version of instruction
- binary representation used for communication within a computer system
Machine code
- a sequenc of instruction encoded in binary numbers
- translated from assembly code
- to avoid reading/writing long, tiresome binary string
- hexadecimal(base 16) numbers are often adopted
- 4 bit per hexadecimal(hex) digit
- Ex.
- Ex.
RISC-V instructions
- encoded as 32-bit instruction words
- Regularity
- An instruction encodes
- operation code
- register numbers, etc.
RISC-V R-format instructions
Instruction fields
- opcode: operaton code, denotating the operation and format of an instruction
- rd: destination register number
- funct3: 3-bit fun. code(additional opcode)
- rs1: the first source register number
- rs2: the second source register number
- funct7: 7-bit function code(addition opcode)
I-format instructions
An instruction will need longer fields than those shown above
- E.g., constant may be used to select ele. from arrays or data structure
- often needs to be much larger than 31
- 5-bit field is too small to useful
- for addi or lw inst.
Immediate arithmetic(addi) and load instructions(lw)
- rs1: source register or base address number
- immediate: constant operand, or offset added to base address(for load inst.)
- two’s-complement, sign extended numbers
- region of
or 2048 bytes( or 512 words)
Design Priciple 3:Good design demands good compromises(好的設計需要好的妥協)
- different formats compilcate decoding(不同的格式,會造成編碼上的困難)
- but allow 32-bit instruction uniformly
- keep format as simplier as possible
S-format instructions
We also need a format for the store word inst.
- sw, which needs two source registers
- the base addr.
- store data
- an immediate for the addr. offset
- a little different from R-type inst.
- rd vs. imm
- same layout
format
- rs1: base addr. register number
- rs2: sourc operand register number
- immediate: 12-bit immediate offset to base addr
- split so that rs1 and rs2 fields always in the same pos.
Summary of R-, I-, S-format inst.
The format are distinguished by the val. in the opcode field
- each format is assigned a distinct set of opcode val. in the first field
- hardware knows how to treat the rest of the inst.
The “Stored Program” concept
- Instructions represented in binary, just like data
- Instructions and data stored in memory
programs can operate on programs
- E.g., compilers, linkers, …
- Binary compatibility(二進位相容性) allows compiled programs to work on different computers
- Standardized ISAs
2.6 Logical Operations
Logical operations
It is useful to operate on fields of bits within a word or even on individual bits
- examining characters within a word, each of which is stored as 8 bits(每個字符儲存為8位元,就是一個位元組)
These operations were added to programming languages
and instruction set architectures
- simplify the packing & unpacking of bits into words
- Useful for extracting and inserting groups of bits in a word
Shift Operations
Shifts move all the bits in a word to the left or right, filling the emptied bits with 0s(用0填滿空位)
- immed: how many positions to shift
Shift left logical immediate
- shift left
- fill with 0
- slli by i bits, multiplies by
slli x11, x19, 4 //reg x11 = reg 19 << 4 bits
Shift right logical immediate
- shift right
- fill with 0
- srli by i bits divides by
(unsigned only)
AND op.
To isolate fields via bit-by-bit operation
- Useful to mask bits in a word
- Select some bits, clear others to 0
OR and XOR op.
Useful to include bits in a word
Differencing operation
- It can be used to emulate NOT operation
- That is, it calculates a 1 only if the values are different in the two operands
2.7 Instructions for making Decisions
Conditional op.
Decision making is commonly represented in programming languages
- using the if statement, sometimes combined with go to statements and labels
Two decision-making instructions (conditional
branches*) in RISC-V assembly language
beq (branch if equal)
- beq rs1, rs2, L1(最大的跳躍範圍就是32bit)
- This instruction means go to the statement labeled L1 if the value in register rs1 = the value in register rs2
bne (branch if not equal)
- bne rs1, rs2, L1
- It means go to the statement labeled L1 if the value in register rs1 does != the value in register rs2
*An instruction that tests a value and that allows for a subsequent transfer of control to a new address in the program based on the outcome of the test
A higher-level view of branches
Branch to a labeled instruction if a condition is true
- Otherwise, continue sequentially
Decisions are important for if statements and for
loops (iterating a computation)
Compiling if state.
C code
if(i==j)
f=g+h;
else
f=g-h;
// note:f, g, ... in x19, x20, ...
Compiled RISC-V code
- Exit is assembler calculates address
bne x22, x23, Else add x19, x20, x21 beq x0, 0, Exit Else: sub x19, x20, x21 Exit: ...
While loop
C code
while(save[i] == k)
i+=1;
// note: i in x22, k in x24, address of save in x25
Compiled RISC-V code
Loop:
slli x10, x22, 2 //i代表index, 每個index是4個byte,所以i需要乘上4
add x10, x10, x25
lw x9, 0(x10)
bne x9, x24, Exit
addi x22, x22, 1
beq x0, x0, Loop
Exit:
...
Basic Blocks
A basic block is a sequence of instructions with
- No embedded branches (except at end)
- No branch targets (except at beginning)
基本區塊(basic block)是編譯程式執行最佳化的最基本單位,其定義為沒有分支與沒有分支目的的指令,若有分支也只能在區塊的最後一個指令,若有分支目的也只能在區塊的第一個指令。
https://hackmd.io/@HsuChiChen/computer_organization
A compiler identifies basic blocks for optimization
An advanced processor can accelerate execution of basic blocks
2.8 Supporting Procedures in Computer Hardware
Procedures
A procedure (or function)
- Is a stored subroutine that performs a specific task based on the parameters with which it is provided
- Is one tool programmers use to structure programs, both to make them easier to understand and to allow code to be reused(使程式易懂,以及可重複利用)
- Is one way to implement abstraction in software
Six steps required for invoking a procedural call
- Place parameters in registers x10 to x17
- 把非保留的暫存器中的資料,儲存進記憶體中
- Transfer control to procedure (with branch inst.)
- 把控制交給procedure
- Acquire storage for procedure (spill registers)
- 獲取procedure的儲存空間
- Perform procedure’s operations
- Place result in register for caller (restore registers as well)
- 計算結果放進register
- Return to place of call (return address in x1 and jump back)
Procedure Calls
RISC-V software follows the following procedure calling convention in allocating its 32 registers
- x10–x17: eight parameter registers in which to pass parameters or return values
- x1: one return address register to return to the point of origin
RISC assembly language has the two instructions for procedures
- jal (jump-and-link): It branches to an address and simultaneously saves the address of the following instruction to the destination register rd (the “link” portion)
- 把接下來指令的位址存到rd
- jalr (jump-and-link register): it helps return from a procedure; branches to the address stored in the return register x1, which is just what we want
- 跳回r1中的位址
Procedure Calls (Cont’d)
Procedure call w/ jump and link
jal x1, ProcedureLabel
- Address of following instruction put in x1
- Jumps to target address
- Caller is the program that instigates a procedure and provides the necessary parameter values
- Program counter (PC) is a register to hold the address of the current instruction being executed
- 掌握現在使用執行中的指令的register
- PC can be seen as instruction address register
- The jal instruction actually saves PC + 4 in its designation register (usually x1) to link to the byte address of the following instruction to set up the procedure return
- 因為現在的指令在PC,下一個指令就要+4
Procedure return w/ jump and link register
jalr x0, 0(x1)
- Like jal, but jumps to 0 plus the address in x1
- Use x0 as rd (x0 cannot be changed)
- Can also be used for computed jumps
- e.g., for case/switch statements
- Callee is a procedure (ProcedureLabel) that executes a series of stored instructions based on parameters provided by the caller and then returns control to the caller
Compiling a C procdure
C code
int leaf_example(int g, int h, int i, int j) {
int f;
f = (g+h)-(i+j);
}
//argument g, ..., j in x10, ..., x13
//f in x20
//temp in x5, x6
//needs to save x5, x6, x20 on stack
RISC-V code
leaf_example:
addi sp, sp, -12
sw x5, 8(sp)
sw x6, 4(sp)
sw x20, 0(sp)
add x5, x10, x11
add x6, x12, x13
sub x20, x5, x6
addi x10, x20, 0
lw x20, 0(sp)
lw x6, 4(sp)
lw x5, 8(sp)
addi sp, sp, 12
jalr x0, 0(x1)
Stack
It is a data structure organized as a last-in-first-out queue for spilling registers (to preserve the data contents)
- Placing data onto the stack is called a push, and removing data from the stack is called a pop
Temporary and Preserved Register
RISC-V sw separates following 19 registers into two groups
- to avoid saving/restoring a register whose value isn’t used
- which might happen with a temporary register
- x5−x7 and x28−x31: temporary registers that are not preserved by the callee (called procedure) on a procedure call
- 這些register不需要被callee保留起來
- The caller does not expect registers x5 and x6 to be preserved across a procedure call (these registers are maintained by caller)
- The example in p. 49 does not follow this arrangement
- x8−x9 and x18−x27: saved registers that must be preserved on a procedure call
- 在callee中需要保護這些register
- if used, the callee saves and restores them
- These registers are maintained by callee
Nested Procdures: Leaf and Non-leaf Procedures
Leaf procedures are procedures that do not call other
- Only be callee (e.g., leaf_example)
Non-leaf procedures are procedures that do call others
- Be both caller and callee (e.g., non_leaf)
Without proper protection of registers, there will be conflicts on their contents
- E.g., since non_leaf hasn’t finished its task yet, there is a conflict over the use of registers (e.g., x0, x1, x10)
- 上面的暫存器會被覆蓋
For a nested call, caller needs to save on the stack
- Its return address (x1)
- Any arguments and temporaries needed after the call
- Restore from the stack after the call
void root() {
...
non_leaf(10); //jal x1, non_leaf (x10=10)
// jalr x0, 0(x1)
...
}
void non_leaf(int x){
int a=0;
a = leaf_example(4, x, 2, 1); //jal x1, leaf_example(x10=4, x11=x, x12=2, x13=1)
//jalr x0, 0(x1)
}
int leaf_example(int g, int h, int i, int j) {
int f;
f=(g+h)-(i+j);
return f;
}
A Non-leaf Example
C code
int fact(int n) {
if(n<1)
return f;
else
return n*fact(n-1);
}
//argument n in x10
//result in x10
RISC-V code
fact:
addi sp, sp, -8
sw x1, 4(sp)
sw x10, 0(sp)
//if statement
addi x5, x10, -1 //x5 = n-1
bge x5, x0, L1 //if n>=1, go to L1
addi x10, x0, 1 //else set return value to 1
addi sp, sp, 8 //pop stack
jalr x0, 0(x1) //return to caller
L1:
addi x10, x10, -1 //n = n-1
jal x1, fact //call fact(n-1)
addi x6, x10, 0 //move result of fact(n-1) to x6
lw x10, 0(sp) //restore caller's n
lw x1, 4(sp) //restore caller's address
addi sp, sp, 8 //pop stack
mul x10, x10, x6 //return n*fact(n-1)
jalr x0, 0(x1) //return to caller
Local data on the Stack
The stack is also used to store variables
- that are local to the procedure (and maintained by callee)
- but do not fit in registers
- E.g., local arrays, structures, C automatic variables
The stack grows downwards at runtime
- SP value might change during the procedure
- so references to a local variable in memory might have different offsets depending on where they are in the procedure
- making the procedure harder to understand
- 變動的sp會導致procedure變得很難懂
Procedure frame (activation record) (程式框架)
- A stack segment contains a procedure’s saved registers and local variables
A frame pointer fp (x8) points to the first word of the frame of a procedure
Solution: fp offers a stable base register for procedure frame for local variable accesses
if there are no local var. on the stack within a procedure
- the compiler will save time by not setting and restoring the fp
if using fp, it is init. using the address in sp on a call
- and sp is restored by fp
Memory Layout
In addition to local variables, global data in a program is also kept in the memory
- E.g., static variables and dynamic data structures in C programs
The memory layout shown in figure shows
- the RISC-V convention for allocation of memory when running the Linux operating system
Text seg.
- Starts at 0000 0000 0040 0000hex
- The RISC-V program (machine) code
Static data seg.
- Starts at 0000 0000 1000 0000hex
- Constants and other static variables
- Pointed by gp (global pointer)
Dynamic data (heap; grows upwards)
- Structures grow/shrink during their lifetimes, e.g., linked lists
- The C functions, malloc and free, manipulates the heap data
Stack (grows downwards)
- Automatic storage
2.9 Communicating with People
Representing ASCII Char.
Byte-encoded character sets
- American Standard Code for Information Interchange (ASCII)
- ASCII: 128 characters (95 graphic, 33 control)
- Latin-1: 256 characters (ASCII, +96 more graphic characters)
Unicode: 32-bit character set
- Used in Java, C++ wide characters, …
- Most of the world’s alphabets, plus symbols
- UTF-8, UTF-16: variable-length encodings
Load/Store Data in Different widths
LOAD/STORE byte/halfword/word
- Load byte/halfword/word: Sign extend to bits in rd
- lb rd, offset(rs1)
- lh rd, offset(rs1)
- lw rd, offset(rs1)
- Load byte/halfword/word unsigned: Zero extend to bits in rd
- lbu rd, offset(rs1)
- lhu rd, offset(rs1)
- lwu rd, offset(rs1)
- Store byte/halfword/word: Store rightmost 8/16/32 bits
- sb rs2, offset(rs1)
- sh rs2, offset(rs1)
- sw rs2, offset(rs1)
lbu x12, 0(x10) //read byte from source, placing it in the rightmost 8 bits in x12
sb x12, 0(x11) //takes a byte from the rightmost 8 bits of x12 and writes the byte to x11[0]
A string copy ex.
C code
void strcpy(char x[], char y[]) {
size_t i;
i = 0;
while((x[i]=y[i])!='\0')
i+1;
}
\\Null-terminated string
\\assume the base addr. for arrays x and y found in x10 and x11
RISC-V code
strcpy:
addi sp, sp, -4
sw x10, 0(sp)
add x19, x0, x0
L1:
add x5, x19, x10
lbu x6, 0(x5)
add x7, x9, x10
sb x6, 0(x7)
beq x6, x0, L2
addi x19, x19, 1
jal x0, L1
L2:
lw x19, 0(sp)
addi sp, sp, 4
jalr x0, 0(x1)
2.10 RISC-V Addressing for Wide Immediates and Addresses
Most constants are small
- 12-bit immediate is sufficient
- But, sometimes there are bigger immediates (constants)
- In this case, lui instruction can be used to set up a 32-bit immediate into a register for further use
lui (load upper immediate)
- lui rd, constant
- To copy a 20-bit constant to bits [31:12] of rd, and clear bits [11:0] of rd to 0
- The example below loads a 32-bit immediate 0x003D 0500hex into x19 by using a lui and an addi instruction
- 如何製造表示32bit位置的register
- 沒辦法直接add出來,slli也只能做出2的倍數
- 透過lui設置[31:12]的bit,再使用addi即可
- For 64-bit, it can extend bit 31 to bits [63:32]
- 如何製造表示32bit位置的register
Branch Addressing
Branch instructions specify the contents
- Opcode, two registers, offset (to target address)
Most branch targets are near branch
- Forward or backward relative to PC
- → PC-relative addressing (in byte)
- Target address = PC + offset
- Target address = PC + immediate * 2(immediate代表half word)
- branch instruction的offset的數字介於branch以及branch target是halfword,因為設計者希望支援2 bytes long
SB format
- A 7-bit opcode, a 3-bit function code, two 5-bit register
- operands (rs1 and rs2), and a 12-bit address immediate
- bne x10, x11,
- If x10 != x11, go to location
= 0111 1101 0000
- If x10 != x11, go to location
Jump Addressing
jal(unconditional jump and link)
- jal rd, immed
- uses 20-bit immediated for larger range
UJ format
- A 7-bit opcode, a 5-bit destination register operand (rd), and a 20-bit address immediate
- The link address, which is the address of the instruction following the jal, is written to rd
- 跳躍回來的下一個指令放在rd
- jal x0,
- *Go to location
= 0111 1101 0000
Addressing Mode Summary
Addressing mode
- One of several addressing regimes delimited by their varied use of operands and/or addresses(多個尋址之一 由其劃定的不同使用運算元和/或地址)
- Immediate addressing, where the operand is a constant within the instruction itself
- Register addressing, where the operand is a register
- Base or displacement addressing, where the operand is at the memory location whose address is the sum of a register and a constant in the instruction PC = Reg + offset (in mem.)
- PC-relative addressing, where the branch address is the sum of the PC and a constant in the instruction
Inst. Encoding
find out the exact op. of a given inst.
- check the fields: opcode, func3, and (optional) funct7
2.11 Parallelism and Instructions: Synchronization
Parallel Execution and Synchronization
Parallel execution is hard when tasks need to cooperate
- E.g., some tasks are writing new values that others must read
Synchronization is used to know when a task is finished writing so that it is safe for another to read
- A data race would occur if they don’t synchronize
- where the results of the program can change depending on how events happen to occur
- 同步,用於確認任務是否完成,接下來的讀取是否安全?
Example
- Two processors sharing an area of memory
- P1 writes, then P2 reads
- A data race occurs if P1 and P2 don’t synchronize
- Result depends on order of accesses
Synchronization
Synchronization mechanisms are typically built with userlevel routines (e.g., lock and unlock)
- that rely on hardware-supplied synchronization instructions
- Lock/unlock can be used straightforwardly to create regions where only a single processor can operate, called a mutual exclusion(互斥),
- as well as to implement more complex synchronization mechanisms
Hardware-supplied primitives for read/write memory atomically(原子記憶體的讀取)
- Nothing else can interpose itself between the read or the write of the memory location(沒有其他東西可以在記憶體讀取之間插入)
Two possible implementation of the HW primitives
- A single instruction → e.g., atomic swap of register and memory
- An atomic pair of instructions → the second instruction returns a value showing whether the pair of instructions was executed as if the pair was atomic
Synchronization in RISC-V
In RISC-V, this special pair of instructions include a special load (lr.w) and a special store (sc.w)
- if the contents of the memory location specified by the load-reserved are changed before the store-conditional to the same address occurs, then
- the store-conditional fails and does not write the value to memory
- Check Sec. 2.11, p. 128~131, in the textbook
lr.w (load-reserved word)
lr.w rd,(rs1)
- Load from address in rs1 to rd
- Place reservation on memory address
sc.w (store-conditional word)
sc.w rd,(rs1),rs2
- Store from rs2 to address in rs1
- Succeeds if location not changed since the lr.w
- Write 0 in rd
- Fails if location is changed
- Write non-zero value in rd
- Succeeds if location not changed since the lr.w
Ex.
Example 1: Impl. atomic swap (to test/set lock variable)
Swap the contents between [x20] and x23
again:
lr.w x10,(x20)
sc.w x11,(x20),x23 // x11 = status (1 means failure)
bne x11,x0,again // branch if store failed
addi x23,x10,0 // X23 = loaded value
Example 2: Impl. lock acquire
addi x12,x0,1 // copy locked value
again:
lr.w x10,(x20) // read lock
bne x10,x0,again // check if it is 0 yet
sc.w x11,(x20),x12 // attempt to store
bne x11,x0,again // branch if fails
Unlock:
sw x0,0(x20) // free lock
2.12 Translating and Starting a Program
C Program Execution Flow
Compiler and Assembler
Compiler translates a high-level language program (e.g., C) into machine instructions
Assembler further converts the instructions into machine code (binary)
- Handle pseudoinstructions that are not implemented by the target hardware but found in machine instructions
- li x9, 123 // load immediate value 123 into register x9
- RISC-V assembler converts it into the following code (li is acceptable by assembler, but is not implemented in RISC-V machine language)
- addi x9, x0, 123 // register x9 gets register x0 + 123
Assembler
In fact, the assembler turns the assembly language program into an object file
- 把組合語言的程式轉換成.o檔
- which is a combination of code, data, and information needed to place instructions properly in memory
- It must determine the addresses corresponding to all labels
- A symbol table helps assemblers keep track of labels used in branches and data transfer instructions
An object file contains the six parts:
- Header: described contents of object module
- Text segment: translated instructions
- Static data segment: data allocated for the life of the program
- Relocation info: for contents that depend on absolute location of loaded program
- Symbol table: global definitions and external references
- Debug info: for associating with source code
Linker
It takes all the independently assembled machine
language programs and “stitches” them together
It produces an executable file running on a
computer
- Merges segments
- Resolve labels (determine their addresses)
- Patch location-dependent and external references
- NOTE: It could leave location dependencies to be fixed by a relocating loader later(留下定位資訊 以方便後續追蹤位置)
Loader
Loader is a system software
- that is called by the operating system
- to place an object program in main memory so that it is ready to execute
In UNIX systems, the loader follows these steps:
- Reads the executable file header to determine size of the text and data segments
- Creates an address space large enough for the text and data
- Copies the instructions and data from the executable file into memory (also initialize the data)
- Or set page table entries so they can be faulted in
- Copies the parameters (if any) to the main program onto the stack
- Initializes the processor registers (including sp, fp, gp) and sets the stack pointer to the first free location
- Branches to a start-up routine that copies the parameters into the argument registers and calls the main routine of the program
- Copies arguments to x10, … and calls main
- When main returns, the start-up routine terminates the program with an exit syscall
Static vs. Dynamic Linking
Two ways to link to external object files:
Static link: Library routines that are linked to a program during linking
- External references are resolved during the linking stage
- Bad for upgrade and code size, since the library routines become part of
the executable code
Dynamic link: Library routines that are linked to a program
during execution
- Only load and link library the procedure when it is called
- Require procedure code to be relocatable
- Avoid image bloat caused by static linking of all (transitively) referenced libraries (i.e., smaller size of the built executable)
- Automatically pick up new library versions
- Check Figure 2.21 for more information
2.13 A C Sort Example to Put It All Together
Instruction count and CPI are not good performance
indicators in isolation
Java/JIT compiled code is significantly faster than
JVM interpreted
2.22 fallacies and pitfall
Powerful instruction => higher performance
- Fewer instructions required
- But complex instructions are hard to implement
- May slow down all instructions, including simple ones
- Compilers are good at making fast code fromsimple instructions
Use assembly code for high performance
- But modern compilers are better at dealing with
modern processors - More lines of code more errors and less
productivity
Backward compatibility(向下相容性) => instruction set doesn’t
change
- But they do accrete more instructions
Sequential words are not at sequential addresses
- Increment by 4 (for 32-bit architecture), not by 1 !!!
Keeping a pointer to an automatic variable after procedure returns
- E.g., passing pointer back via an argument
- Pointer becomes invalid when stack popped
2.23 concluding remark
Design principles
- Simplicity favors regularity
- Smaller is faster
- Good design demands good compromises
Make the common case fast
Layers of software/hardware Compiler, assembler, hardware
RISC-V: typical of RISC ISAs ➢c.f. x86