
Emu68 on Bare-Metal AArch64

Introduction

On this wiki page I will collect the information necessary to port and run Emu68 on the AArch64 architecture. It should be useful not only for me (obviously); anyone trying to step into bare-metal coding on the 64-bit ARM architecture may eventually find it useful too.

Memory management unit (MMU)

Before we continue with the boot procedure, a few words about the MMU setup. The Emu68 kernel is not very secure. It is intended to run old m68k code, written for a CPU architecture designed without concepts such as hypervisors in mind. The AArch64 side therefore uses a very simple memory map and equally simple MMU translation tables occupying just a few 4K memory pages. A lot is allowed, among other things access to MMIO registers from m68k code. If you require more sophisticated memory management with enhanced security, you have to design it a bit differently.

Emu68 memory layout on AArch64

The address space of AArch64 is split into two halves: user space resides in the lower half and kernel space usually in the upper half. The sizes of both can, but do not have to, be equal. For Emu68 I have decided on the following memory layout:

0x0000000000000000 - 0x00000000xxxxxxxx : User RAM, cacheable
0x00000000xxxxxxxx - 0x00000000yyyyyyyy : -gap-
0x00000000zzzzzzzz - 0x00000000pppppppp : MMAPed registers, peripherals, device memory type
0x00000000pppppppp - 0xffffff7fffffffff : -gap-
0xffffff8000000000 - 0xffffff8000ffffff : Emu68 code, JIT cache
0xffffff8001000000 - 0xfffffffeffffffff : -gap-
0xffffffff00000000 - 0xffffffffffffffff : Mirror of the 4GB, non-cacheable, supervisor only

A short explanation: the RAM visible to M68k code will always start at virtual address 0. If this does not correspond to the actual memory map of the AArch64 machine, all pages needed to fulfill this requirement will be taken from the top of available RAM. Further, the MMIO space of the CPU remains in the lower half of the address space and is accessible to the M68k. This is necessary in case the M68k code wants to talk to the peripherals, e.g. in hardware drivers.

Header for the bootloader

In order to satisfy the U-Boot bootloader, the kernel binary has to start with an appropriate header. Let's look at how it is defined (quoted from https://www.kernel.org/doc/Documentation/arm64/booting.txt):

u32 code0;              /* Executable code */
u32 code1;              /* Executable code */
u64 text_offset;        /* Image load offset, little endian */
u64 image_size;         /* Effective Image size, little endian */
u64 flags;              /* kernel flags, little endian */
u64 res2 = 0;           /* reserved */
u64 res3 = 0;           /* reserved */
u64 res4 = 0;           /* reserved */
u32 magic = 0x644d5241; /* Magic number, little endian, "ARM\x64" */
u32 res5;               /* reserved (used for PE COFF offset) */

All fields in the header are little endian. The first two entries are two AArch64 instructions used to jump into the actual kernel. Usually it is just a branch, like b start_of_my_kernel or something similar. Why is there a place for two of them? Very simple. One could be, for example, something like 1: adr xn, 1b to pass the physical address where the kernel is located. Another interesting application is using the first opcode to generate the 2-byte header of an EXE file: MZ. In that case the first opcode does effectively nothing; it can be any opcode whose two lowest bytes look like MZ, while the second one is used to perform the branch. One possible example is the opcode add x13, x18, #0x16, which does no harm and generates the four bytes 0x4d 0x5a 0x00 0x91. That way it is possible to embed the typical PE header of an executable and to generate an image which is valid for both U-Boot and EFI.
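If you want to convince yourself of that encoding, a tiny stand-alone program (not part of Emu68) does the trick; on a little-endian host the memcpy reproduces the byte order the opcode has in the image file:

#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main(void)
{
    /* Encoding of "add x13, x18, #0x16" */
    uint32_t opcode = 0x91005a4d;
    uint8_t bytes[4];

    /* On a little-endian host this matches the in-image byte order */
    memcpy(bytes, &opcode, sizeof(opcode));

    /* Prints: 4d 5a 00 91 - i.e. "MZ" followed by two harmless bytes */
    printf("%02x %02x %02x %02x\n", bytes[0], bytes[1], bytes[2], bytes[3]);
    return 0;
}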

The text_offset field is the load offset of the image, typically with a value of 0x80000. The bytes below that offset, down to the beginning of the 2MB page, are either unused by the kernel or can be taken for a temporary stack.

The image_size field specifies the amount of RAM which has to be free for the kernel's use once it is started. It does not have to be the physical size of the file. On Emu68 this field is used to tell the bootloader how much RAM the kernel will consume; usually it is set to 16 megabytes or more. I'm using this space for the .bss section and as a memory pool for the kernel's use.

The flags field tells the bootloader whether the kernel runs in big- or little-endian mode and what the base size of an MMU page is (on AArch64 it can be 4K, 16K or 64K). The flags also tell the bootloader whether the kernel has to be loaded at the lowest part of DRAM, in case it cannot use memory below its own load address. The bootloader is allowed to refuse to load the kernel if the requirements specified in the flags are not met on the hardware.
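The bit layout of the flags field, as described in the booting.txt document linked above, can be written down as a set of C constants. The macro names below are mine, used only for illustration; they reproduce the values 0xb and 0xa which appear in the Emu68 header further down:

/* Flag bits as described in Documentation/arm64/booting.txt */
#define ARM64_FLAG_BE        (1UL << 0)  /* kernel runs big-endian              */
#define ARM64_FLAG_PAGE_4K   (1UL << 1)  /* page-size field (bits 1-2) set to 1 */
#define ARM64_FLAG_PAGE_16K  (2UL << 1)  /* page-size field set to 2            */
#define ARM64_FLAG_PAGE_64K  (3UL << 1)  /* page-size field set to 3            */
#define ARM64_FLAG_ANY_BASE  (1UL << 3)  /* image may be placed anywhere in RAM */

/* 0xb: big-endian kernel, 4K pages, any 2MB-aligned base */
#define EMU68_FLAGS_BE (ARM64_FLAG_BE | ARM64_FLAG_PAGE_4K | ARM64_FLAG_ANY_BASE)
/* 0xa: little-endian kernel, 4K pages, any 2MB-aligned base */
#define EMU68_FLAGS_LE (ARM64_FLAG_PAGE_4K | ARM64_FLAG_ANY_BASE)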

Finally, the magic field, consisting of the characters ARMd (or ARM\x64 if you prefer hexadecimal notation for the last byte), is used to indicate that the image is a valid AArch64 boot image.

The header on Emu68 is embedded in the asm startup code and looks like this:

asm("   .section .startup           \n"
"       .globl _start               \n"
"       .globl _boot                \n"
"       .type _start,%function      \n"
"_boot: b       _start              \n"
"       .long   0                   \n"
"       .quad " xstr(L64(0x00080000)) " \n"
"       .quad " xstr(L64(KERNEL_RSRVD_PAGES << 21)) "\n"
#if EMU68_HOST_BIG_ENDIAN
"       .quad " xstr(L64(0xb)) "    \n"
#else
"       .quad " xstr(L64(0xa)) "    \n"
#endif
"       .quad 0                     \n"
"       .quad 0                     \n"
"       .quad 0                     \n"
"       .long " xstr(L32(0x644d5241)) "\n"
"       .long 0                     \n"
"       .byte 0                     \n"
"       .align 4                    \n"
"       .string \"$VER: Emu68.img " VERSION_STRING_DATE "\"\n"
"       .byte 0                     \n"
"       .align 5                    \n");

The startup code is inserted into a C source file, which is why I have the asm section. I am also using macros (L32(), L64()) to convert constants into little-endian form, regardless of the endianness the gcc compiler is targeting. The header is followed by a version string, which is a very handy way of embedding the kernel version into the file. It has been borrowed from the Amiga world, where every single executable had one. After expanding VERSION_STRING_DATE my header looks e.g. like this: $VER: Emu68.img 0.1-alpha-1 (10.01.2020) git: 1f0294b,dirty. The version number is taken from the git tag; it is followed by the build date and finally the git hash. Since the code has been modified after the last git commit, it has a "dirty" mark at the end of the version string.
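The byte-swapping macros themselves are not shown above, so here is a sketch of how they can be defined; the real ones in Emu68 may differ in detail. The expressions use only masks and shifts, so after stringification they remain valid assembler arithmetic inside the .long/.quad directives:

/* When building for big-endian AArch64 the assembler stores .long/.quad data
 * big-endian, so the constants are pre-swapped here to end up little-endian
 * in the image, as the boot header requires. */
#if EMU68_HOST_BIG_ENDIAN
#define L32(x) ((((x) & 0xff000000) >> 24) | (((x) & 0x00ff0000) >> 8) | \
                (((x) & 0x0000ff00) << 8)  | (((x) & 0x000000ff) << 24))
#define L64(x) ((L32((x) >> 32)) | (L32((x) & 0xffffffff) << 32))
#else
#define L32(x) (x)
#define L64(x) (x)
#endif

/* Standard stringification pair used to paste the expanded constants into asm() */
#define str(x)  #x
#define xstr(x) str(x)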

Booting

Leaving EL3/EL2 modes

Once the bootloader has initialized the memory and performed basic setup of the machine, it loads the kernel into RAM. It parses the header to make sure the kernel's requirements are fulfilled by the hardware and then passes control to the kernel by jumping to its beginning.

The kernel can be executed in EL1 or EL2 mode, rarely in EL3. EL3 is reserved for trusted firmware, but it might happen that the bootloader gives us control at that level (one can force the bootloader of the RasPi to do so, for example). If we are not going to use virtualisation features, we can simply drop out of EL2 and work only with EL1 (supervisor, for the kernel) and EL0 (user space). This is how I do it:

asm("_start:                        \n"
"       mrs     x9, CurrentEL       \n"
"       and     x9, x9, #0xc        \n"
"       cmp     x9, #8              \n"
"       b.eq    leave_EL2           \n"
"       b.gt    leave_EL3           \n"
"continue_boot:                     \n"

[... - lots of code here, will be explained later]

"leave_EL3:                         \n"
"       mrs     x10, SCTLR_EL3      \n"
"       orr     x10, x10, #(1 << 25)\n"
"       msr     SCTLR_EL3, x10      \n"
"       adr     x10, leave_EL2      \n"
"       msr     ELR_EL3, x10        \n"
"       ldr     w10, =0x000003c9    \n"
"       msr     SPSR_EL3, x10       \n"
"       eret                        \n"

"leave_EL2:                         \n"
"       mrs     x10, SCTLR_EL2      \n"
"       orr     x10, x10, #(1 << 25)\n"
"       msr     SCTLR_EL2, x10      \n"
"       mov     x10, #3             \n"
"       msr     CNTHCTL_EL2, x10    \n"
"       mov     x10, #0x80000000    \n"
"       msr     HCR_EL2, x10        \n"
"       adr     x10, continue_boot  \n"
"       msr     ELR_EL2, x10        \n"
"       ldr     w10, =0x000003c5    \n"
"       msr     SPSR_EL2, x10       \n"
"       eret                        \n"
"       .section .text              \n");

At the very beginning I read the current EL level our code runs at. If the kernel runs at EL1 already (because e.g. the hardware does not support EL2), nothing happens and we go straight to the continue_boot label. Otherwise either leave_EL2 or leave_EL3 is called. Please note that I'm running the ARM CPU in big-endian mode, so I avoid reading/writing memory at this stage (the correct data endianness for each exception level is only configured later). Additionally, I do not perform jumps to absolute addresses, because the kernel is entered at its physical load address, not the virtual address I want it at.

Both the EL2 and EL3 leave functions start by setting the correct endianness for that mode. This is necessary for later, in case I ever want to use either of these modes. leave_EL3 continues by storing the physical address (obtained with the adr opcode) of the leave_EL2 function into ELR_EL3. Subsequently it sets the SPSR_EL3 register accordingly, so that the following eret opcode switches from EL3 to EL2 and the CPU starts to execute the leave_EL2 routine. Of course I could just as well perform all the setup of EL2 mode in EL3 and jump directly to my supervisor, but I wanted to save a little bit of code here.

The escape from EL2 to EL1 is a little bit longer. After the endianness of this mode is set, the code sets some bits in the CNTHCTL_EL2 control register, allowing me to use the CPU counters from EL1 and EL0. Further, it sets the RW bit of the HCR_EL2 register, which instructs the CPU that EL1 (and eventually EL0) runs in AArch64 mode. The rest is obvious: setting the return address to the continue_boot label, selecting EL1 mode in SPSR_EL2 and issuing eret. That's all. We are in EL1.
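The two SPSR values used above deserve a short breakdown. The bit names follow the ARMv8 architecture manual; the macro names are mine:

/* SPSR layout (the parts used here):
 *   bits [3:0] M    - target exception level and stack pointer selection
 *   bit  [4]   M[4] - 0 means return to AArch64 state
 *   bits [9:6] DAIF - debug, SError, IRQ and FIQ masks
 *
 * 0x3c9 = DAIF masked | M = 0b1001 (EL2h) -> eret from EL3 lands in EL2, on SP_EL2
 * 0x3c5 = DAIF masked | M = 0b0101 (EL1h) -> eret from EL2 lands in EL1, on SP_EL1
 */
#define SPSR_EL3_TO_EL2  0x3c9
#define SPSR_EL2_TO_EL1  0x3c5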

EL1 setup

The very first thing we do in EL1 is switch to big-endian mode for both EL1 and EL0. Subsequently we need to prepare an initial MMU mapping, so that we can jump to the C code at a virtual address of our choice. This is the code:

asm("continue_boot:                 \n"
"       mrs     x10, SCTLR_EL1      \n"
"       orr     x10, x10, #(1 << 25) | (1 << 24)\n"
"       msr     SCTLR_EL1, x10      \n"

"       adr     x9, __mmu_start     \n"
"       ldr     w10, =__mmu_size    \n"
"1:     str     xzr, [x9], #8       \n"
"       sub     w10, w10, 8         \n"
"       cbnz    w10, 1b             \n"
"2:                                 \n"

"       adrp    x16, mmu_user_L1    \n"
"       mov     x9, 0x70d           \n"
"       mov     x10, #0x40000000    \n"
"       str     x9, [x16, #0]       \n"
"       add     x9, x9, x10         \n"
"       str     x9, [x16, #8]       \n"
"       add     x9, x9, x10         \n"
"       str     x9, [x16, #16]      \n"
"       add     x9, x9, x10         \n"
"       str     x9, [x16, #24]      \n");

The very first step in preparing the MMU map is to clear the memory used for the page tables. Assuming it was cleared by the bootloader would be very risky. The clear loop loads the size of the MMU tables using a PC-relative literal load (ldr w10, =__mmu_size). The address of the page tables is obtained with the adr instruction.

Once the area is cleared, the initial setup for the user mapping is done. As you can see it is as simple as possible. I set up only four entries, each 1 GiB in size, in the level 1 translation table. They are initially uncached and readable/writable from supervisor mode only. They create a 1:1 map of the first four gigabytes of physical address space and are subject to change later, in the C code. Access for user mode will be granted later, since user-writable memory is automatically treated as non-executable at EL1 and higher.
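Expressed in C, the four entries boil down to the following; this is only an illustration of what the assembly above does, with a hypothetical function name:

#include <stdint.h>

/* Fill the level 1 user table with four 1 GiB block descriptors mapping
 * physical 0-4 GiB 1:1. 0x70d is the same attribute/flag pattern as in
 * the assembly code above. */
static void setup_user_L1(uint64_t *mmu_user_L1)
{
    for (int i = 0; i < 4; i++)
        mmu_user_L1[i] = ((uint64_t)i << 30) | 0x70d;
}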

Setting up the upper half of the address space, for the kernel, is a little bit more complex. Since our kernel can be loaded anywhere in RAM, and the only thing we know for sure is that it will be aligned on a 2MB page boundary, an L1 mapping alone is not enough. We need to prepare both L1 and L2 tables here. Here's the code for the L1 MMU table:

asm("   adrp    x16, mmu_kernel_L1  \n"
"       adrp    x17, mmu_kernel_L2  \n"

"       orr     x9, x17, #3         \n"
"       str     x9, [x16]           \n"

"       mov     x9, 0x70d           \n"
"       str     x9, [x16, #4064]    \n"
"       add     x9, x9, x10         \n"
"       str     x9, [x16, #4072]    \n"
"       add     x9, x9, x10         \n"
"       str     x9, [x16, #4080]    \n"
"       add     x9, x9, x10         \n"
"       str     x9, [x16, #4088]    \n");

You can see two different things here. The first entry in the L1 table is a pointer to the L2 table, where the kernel memory starting at address 0xffffff8000000000 will be prepared. At the end of the address space we set up an uncached 1:1 map of the first 4 GB of physical memory, accessible to the kernel only. This is very handy when one e.g. manipulates the MMU pages further or moves the kernel to a new location.
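Where do the offsets 4064-4088 come from? With the 39-bit upper VA range, the L1 index is simply bits [38:30] of the virtual address, and every table entry is 8 bytes wide. A tiny stand-alone check (plain arithmetic, not Emu68 code):

#include <stdio.h>
#include <stdint.h>

/* L1 table offset = (VA bits [38:30]) * 8 within the 39-bit upper region */
static unsigned l1_offset(uint64_t va)
{
    return ((va & 0x7fffffffffULL) >> 30) * 8;
}

int main(void)
{
    printf("%u\n", l1_offset(0xffffff8000000000ULL)); /* 0    - entry holding the L2 pointer  */
    printf("%u\n", l1_offset(0xffffffff00000000ULL)); /* 4064 - first entry of the 4GB mirror */
    printf("%u\n", l1_offset(0xffffffffc0000000ULL)); /* 4088 - last entry of the 4GB mirror  */
    return 0;
}

Now let's look at the MMU level 2 setup: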

asm("   adrp    x16, _boot          \n"
"       sub     x16, x16, #0x80000  \n"
"       orr     x16, x16, #0x700    \n"
"       orr     x16, x16, #0xd      \n"
"       mov     x9, #" xstr(KERNEL_RSRVD_PAGES) "\n"
"1:     str     x16, [x17], #8      \n"
"       add     x16, x16, #0x200000 \n"
"       sub     x9, x9, #1          \n"
"       cbnz    x9, 1b              \n")

Here, we load the physical page address of the _boot label (adrp) and subtract the aforementioned text_offset from it. This is our start address, which we subsequently poke into the L2 table, entry after entry, until the requested number of 2MB pages is set up. Yet again, caching is disabled for now and the proper setup will be done later - I try to go only as far as necessary in assembler and want to write as much of the code as I can in C.
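If you prefer to see that loop in C, a sketch could look like this; the function name is hypothetical and KERNEL_RSRVD_PAGES is given an example value of 8 (i.e. 16 MB), the real value comes from the build:

#include <stdint.h>

#define KERNEL_RSRVD_PAGES 8   /* example value only: 8 * 2MB = 16 MB */

/* Map KERNEL_RSRVD_PAGES 2MB blocks, starting at the 2MB-aligned physical
 * base of the image (physical address of _boot minus text_offset). */
static void setup_kernel_L2(uint64_t *mmu_kernel_L2, uint64_t boot_phys)
{
    uint64_t entry = (boot_phys - 0x80000) | 0x70d;

    for (int i = 0; i < KERNEL_RSRVD_PAGES; i++) {
        mmu_kernel_L2[i] = entry;
        entry += 0x200000;
    }
}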

So, the map is ready. Shall we load it? No! Not yet. Before the MMU is turned on, it needs a bit of adjustment. The ARM CPU allows one to select the smallest MMU page size: 4K, 16K or 64K. Besides that, it allows one to select the size of the address space, separately for the lower and upper halves of the VA range. Depending on that size, the CPU selects the depth of translation. I have selected 39 bits of address space, which is the widest possible if I want to start translation at level 1. A wider space would force the CPU to start translating at level 0; on the other hand, starting translation at level 2 would reduce the address space to a width of 30 bits (1GB). Finally, I need to set up the memory attributes (register MAIR_EL1) which I have used in the MMU tables. Here's the code:

asm("   ldr     x10, =0x4404ff      \n"
"       msr     MAIR_EL1, x10       \n"

"       ldr     x10, =0xb5193519    \n"
"       msr     TCR_EL1, x10        \n"

"       adrp    x10, mmu_user_L1    \n"
"       msr     TTBR0_EL1, x10      \n"
"       adrp    x10, mmu_kernel_L1  \n"
"       msr     TTBR1_EL1, x10      \n");

Memory attribute 0 is used for write-back cacheable RAM, attribute 1 is dedicated to devices (the MMIO range) and attribute 2 is for uncached RAM. The size of the virtual address space is set in the TCR_EL1 register. The addresses of both translation tables are loaded into the TTBR0_EL1 (lower half, user) and TTBR1_EL1 (upper half, supervisor) registers.
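For reference, this is how I read the two magic constants; the bit-field names are taken from the ARMv8 reference manual and the annotation is mine, so double-check against the manual if in doubt:

/* MAIR_EL1 = 0x4404ff - one byte per memory attribute index:
 *   Attr0 = 0xff : normal memory, write-back cacheable (RAM)
 *   Attr1 = 0x04 : device memory (MMIO registers)
 *   Attr2 = 0x44 : normal memory, non-cacheable (uncached RAM)
 *
 * TCR_EL1 = 0xb5193519 - the same 39-bit setup for both halves:
 *   T0SZ = T1SZ = 25        -> 64 - 25 = 39-bit VA, translation starts at level 1
 *   TG0 = TG1 = 4K          -> translation granule for both halves
 *   IRGN/ORGN = write-back  -> cacheability of the table walks
 *   SH0 = SH1 = inner shareable
 */

With all of that in place, the MMU can be turned on: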

asm("   isb     sy                  \n"
"       mrs     x10, SCTLR_EL1      \n"
"       orr     x10, x10, #1        \n"
"       msr     SCTLR_EL1, x10      \n"
"       isb     sy                  \n");

Calling C code

Before the boot procedure in C is called, there are two last things to do. First, we set up the stack register and enable use of the SIMD extensions from EL1 and EL0. Doing this in assembly is a good idea, since the C compiler for AArch64 will of course use the stack, and it will assume the SIMD unit is always available, so it may generate code using SIMD quite early. The setup is fairly simple; it is just a matter of setting two bits in the CPACR_EL1 register and loading some address into the sp register:

asm("   ldr     x9, =_boot          \n"
"       mov     sp, x9              \n"
"       mov     x10, #0x00300000    \n"
"       msr     CPACR_EL1, x10      \n"
"       isb     sy                  \n"
"       isb     sy                  \n"
"       ic      IALLU               \n"
"       isb     sy                  \n");

You may ask why I flush the instruction cache and add synchronization barriers. This was a lesson learned from the 32-bit ARM code. I had enabled the VFP coprocessor there and wondered why the code eventually crashed on some systems, especially on the RaspberryPi 4. The reason was simple: I had indeed enabled VFP, but there were already instructions in the CPU's pipeline fetched while VFP was not yet enabled. Adding barriers there solved all the issues, so I have added them in this AArch64 code too. Feel free to experiment with whether they are necessary or not.

The final step: the .bss section has to be cleared. I do not know what it will be used for on the C side, so I clear it in assembler:

asm("   ldr     x9, =__bss_start    \n"
"       ldr     w10, =__bss_size    \n"
"1:     cbz     w10, 2f             \n"
"       str     xzr, [x9], #8       \n"
"       sub     w10, w10, 8         \n"
"       cbnz    w10, 1b             \n"
"2:     ldr     x30, =boot          \n"
"       br      x30                 \n");

The x30 register is loaded from a PC-relative literal this time, because it has to contain the full virtual address. That's all; boot is a routine written in C taking one parameter in the x0 register - a pointer to the flattened device tree:

void boot(void *dtree)
{
    ...
}
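The first thing such a routine typically does is sanity-check the device tree blob passed in x0. A minimal sketch of what its beginning might look like, with the fdt_header layout as defined by the devicetree specification; note that the FDT fields are stored big-endian, which happens to match the big-endian mode Emu68 runs the CPU in, so no byte swapping is needed here:

#include <stdint.h>

/* Beginning of the flattened device tree blob, as defined by the devicetree
 * specification. All fields are big-endian; since this kernel runs the CPU
 * in big-endian mode, they can be read directly. */
struct fdt_header {
    uint32_t magic;             /* 0xd00dfeed */
    uint32_t totalsize;         /* total size of the blob in bytes */
    uint32_t off_dt_struct;
    uint32_t off_dt_strings;
    uint32_t off_mem_rsvmap;
    uint32_t version;
    uint32_t last_comp_version;
    uint32_t boot_cpuid_phys;
    uint32_t size_dt_strings;
    uint32_t size_dt_struct;
};

void boot(void *dtree)
{
    struct fdt_header *fdt = dtree;

    /* Refuse to continue with a bogus device tree pointer */
    if (fdt->magic != 0xd00dfeed)
        for (;;) ;   /* nothing sensible can be done without a device tree */

    /* ... parse memory size, peripheral addresses, and so on ... */
}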