The previous post was a last part of the Linux kernel booting process chapter and now we are starting to dive into initialization process of the Linux kernel. After the image of the Linux kernel is decompressed and placed in a correct place in memory, it starts to work. All previous parts describe the work of the Linux kernel setup code which does preparation before the first bytes of the Linux kernel code will be executed. From now we are in the kernel and all parts of this chapter will be devoted to the initialization process of the kernel before it will launch process with pid 1
. There are many things to do before the kernel will start first init
process. Hope we will see all of the preparations before kernel will start in this big chapter. We will start from the kernel entry point, which is located in the arch/x86/kernel/head_64.S and will move further and further. We will see first preparations like early page tables initialization, switch to a new descriptor in kernel space and many many more, before we will see the start_kernel
function from the init/main.c will be called.
In the last part of the previous chapter we stopped at the jmp instruction from the arch/x86/boot/compressed/head_64.S assembly source code file:
jmp *%rax
At this moment the rax
register contains address of the Linux kernel entry point which that was obtained as a result of the call of the decompress_kernel
function from the arch/x86/boot/compressed/misc.c source code file. So, our last instruction in the kernel setup code is a jump on the kernel entry point. We already know where is defined the entry point of the linux kernel, so we are able to start to learn what does the Linux kernel does after the start.
Okay, we got the address of the decompressed kernel image from the decompress_kernel
function into rax
register and just jumped there. As we already know the entry point of the decompressed kernel image starts in the arch/x86/kernel/head_64.S assembly source code file and at the beginning of it, we can see following definitions:
.text
__HEAD
.code64
.globl startup_64
startup_64:
...
...
...
We can see definition of the startup_64
routine that is defined in the __HEAD
section, which is just a macro which expands to the definition of executable .head.text
section:
#define __HEAD .section ".head.text","ax"
We can see definition of this section in the arch/x86/kernel/vmlinux.lds.S linker script:
.text : AT(ADDR(.text) - LOAD_OFFSET) {
_text = .;
...
...
...
} :text = 0x9090
Besides the definition of the .text
section, we can understand default virtual and physical addresses from the linker script. Note that address of the _text
is location counter which is defined as:
. = __START_KERNEL;
for the x86_64. The definition of the __START_KERNEL
macro is located in the arch/x86/include/asm/page_types.h header file and represented by the sum of the base virtual address of the kernel mapping and physical start:
#define __START_KERNEL (__START_KERNEL_map + __PHYSICAL_START)
#define __PHYSICAL_START ALIGN(CONFIG_PHYSICAL_START, CONFIG_PHYSICAL_ALIGN)
Or in other words:
- Base physical address of the Linux kernel -
0x1000000
; - Base virtual address of the Linux kernel -
0xffffffff81000000
.
Now we know default physical and virtual addresses of the startup_64
routine, but to know actual addresses we must to calculate it with the following code:
leaq _text(%rip), %rbp
subq $_text - __START_KERNEL_map, %rbp
Yes, it defined as 0x1000000
, but it may be different, for example if kASLR is enabled. So our current goal is to calculate delta between 0x1000000
and where we actually loaded. Here we just put the rip-relative
address to the rbp
register and then subtract $_text - __START_KERNEL_map
from it. We know that compiled virtual address of the _text
is 0xffffffff81000000
and the physical address of it is 0x1000000
. The __START_KERNEL_map
macro expands to the 0xffffffff80000000
address, so at the second line of the assembly code, we will get following expression:
rbp = 0x1000000 - (0xffffffff81000000 - 0xffffffff80000000)
So, after the calculation, the rbp
will contain 0
which represents difference between addresses where we actually loaded and where the code was compiled. In our case zero
means that the Linux kernel was loaded by default address and the kASLR was disabled.
After we got the address of the startup_64
, we need to do a check that this address is correctly aligned. We will do it with the following code:
testl $~PMD_PAGE_MASK, %ebp
jnz bad_address
Here we just compare low part of the rbp
register with the complemented value of the PMD_PAGE_MASK
. The PMD_PAGE_MASK
indicates the mask for Page middle directory
(read Paging about it) and defined as:
#define PMD_PAGE_MASK (~(PMD_PAGE_SIZE-1))
where PMD_PAGE_SIZE
macro defined as:
#define PMD_PAGE_SIZE (_AC(1, UL) << PMD_SHIFT)
#define PMD_SHIFT 21
As we can easily calculate, PMD_PAGE_SIZE
is 2
megabytes. Here we use standard formula for checking alignment and if text
address is not aligned for 2
megabytes, we jump to bad_address
label.
After this we check address that it is not too large by the checking of highest 18
bits:
leaq _text(%rip), %rax
shrq $MAX_PHYSMEM_BITS, %rax
jnz bad_address
The address must not be greater than 46
-bits:
#define MAX_PHYSMEM_BITS 46
Okay, we did some early checks and now we can move on.
The first step before we start to setup identity paging is to fixup following addresses:
addq %rbp, early_level4_pgt + (L4_START_KERNEL*8)(%rip)
addq %rbp, level3_kernel_pgt + (510*8)(%rip)
addq %rbp, level3_kernel_pgt + (511*8)(%rip)
addq %rbp, level2_fixmap_pgt + (506*8)(%rip)
All of early_level4_pgt
, level3_kernel_pgt
and other address may be wrong if the startup_64
is not equal to default 0x1000000
address. The rbp
register contains the delta address so we add to the certain entries of the early_level4_pgt
, the level3_kernel_pgt
and the level2_fixmap_pgt
. Let's try to understand what these labels mean. First of all let's look at their definition:
NEXT_PAGE(early_level4_pgt)
.fill 511,8,0
.quad level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE
NEXT_PAGE(level3_kernel_pgt)
.fill L3_START_KERNEL,8,0
.quad level2_kernel_pgt - __START_KERNEL_map + _KERNPG_TABLE
.quad level2_fixmap_pgt - __START_KERNEL_map + _PAGE_TABLE
NEXT_PAGE(level2_kernel_pgt)
PMDS(0, __PAGE_KERNEL_LARGE_EXEC,
KERNEL_IMAGE_SIZE/PMD_SIZE)
NEXT_PAGE(level2_fixmap_pgt)
.fill 506,8,0
.quad level1_fixmap_pgt - __START_KERNEL_map + _PAGE_TABLE
.fill 5,8,0
NEXT_PAGE(level1_fixmap_pgt)
.fill 512,8,0
Looks hard, but it isn't. First of all let's look at the early_level4_pgt
. It starts with the (4096 - 8) bytes of zeros, it means that we don't use the first 511
entries. And after this we can see one level3_kernel_pgt
entry. Note that we subtract __START_KERNEL_map + _PAGE_TABLE
from it. As we know __START_KERNEL_map
is a base virtual address of the kernel text, so if we subtract __START_KERNEL_map
, we will get physical address of the level3_kernel_pgt
. Now let's look at _PAGE_TABLE
, it is just page entry access rights:
#define _PAGE_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_USER | \
_PAGE_ACCESSED | _PAGE_DIRTY)
You can read more about it in the Paging part.
The level3_kernel_pgt
- stores two entries which map kernel space. At the start of it's definition, we can see that it is filled with zeros L3_START_KERNEL
or 510
times. Here the L3_START_KERNEL
is the index in the page upper directory which contains __START_KERNEL_map
address and it equals 510
. After this, we can see the definition of the two level3_kernel_pgt
entries: level2_kernel_pgt
and level2_fixmap_pgt
. First is simple, it is page table entry which contains pointer to the page middle directory which maps kernel space and it has:
#define _KERNPG_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | \
_PAGE_DIRTY)
access rights. The second - level2_fixmap_pgt
is a virtual addresses which can refer to any physical addresses even under kernel space. They represented by the one level2_fixmap_pgt
entry and 10
megabytes hole for the vsyscalls mapping. The next level2_kernel_pgt
calls the PDMS
macro which creates 512
megabytes from the __START_KERNEL_map
for kernel .text
(after these 512
megabytes will be modules memory space).
Now, after we saw definitions of these symbols, let's get back to the code which is described at the beginning of the section. Remember that the rbp
register contains delta between the address of the startup_64
symbol which was got during kernel linking and the actual address. So, for this moment, we just need to add this delta to the base address of some page table entries, that they'll have correct addresses. In our case these entries are:
addq %rbp, early_level4_pgt + (L4_START_KERNEL*8)(%rip)
addq %rbp, level3_kernel_pgt + (510*8)(%rip)
addq %rbp, level3_kernel_pgt + (511*8)(%rip)
addq %rbp, level2_fixmap_pgt + (506*8)(%rip)
or the last entry of the early_level4_pgt
which is the level3_kernel_pgt
, last two entries of the level3_kernel_pgt
which are the level2_kernel_pgt
and the level2_fixmap_pgt
and five hundreds seventh entry of the level2_fixmap_pgt
which is level1_fixmap_pgt
page directory.
After all of this we will have:
early_level4_pgt[511] -> level3_kernel_pgt[0]
level3_kernel_pgt[510] -> level2_kernel_pgt[0]
level3_kernel_pgt[511] -> level2_fixmap_pgt[0]
level2_kernel_pgt[0] -> 512 MB kernel mapping
level2_fixmap_pgt[507] -> level1_fixmap_pgt
Note that we didn't fixup base address of the early_level4_pgt
and some of other page table directories, because we will see this during of building/filling of structures for these page tables. As we corrected base addresses of the page tables, we can start to build it.
Now we can see the set up of identity mapping of early page tables. In Identity Mapped Paging, virtual addresses are mapped to physical addresses that have the same value, 1 : 1
. Let's look at it in detail. First of all we get the rip-relative
address of the _text
and _early_level4_pgt
and put they into rdi
and rbx
registers:
leaq _text(%rip), %rdi
leaq early_level4_pgt(%rip), %rbx
After this we store address of the _text
in the rax
and get the index of the page global directory entry which stores _text
address, by shifting _text
address on the PGDIR_SHIFT
:
movq %rdi, %rax
shrq $PGDIR_SHIFT, %rax
where PGDIR_SHIFT
is 39
. PGDIR_SHFT
indicates the mask for page global directory bits in a virtual address. There are macro for all types of page directories:
#define PGDIR_SHIFT 39
#define PUD_SHIFT 30
#define PMD_SHIFT 21
After this we put the address of the first entry of the early_dynamic_pgts
page table to the rdx
register with the _KERNPG_TABLE
access rights (see above) and fill the early_level4_pgt
with the 2 early_dynamic_pgts
entries:
leaq (4096 + _KERNPG_TABLE)(%rbx), %rdx
movq %rdx, 0(%rbx,%rax,8)
movq %rdx, 8(%rbx,%rax,8)
The rbx
register contains address of the early_level4_pgt
and %rax * 8
here is index of a page global directory occupied by the _text
address. So here we fill two entries of the early_level4_pgt
with address of two entries of the early_dynamic_pgts
which is related to _text
. The early_dynamic_pgts
is array of arrays:
extern pmd_t early_dynamic_pgts[EARLY_DYNAMIC_PAGE_TABLES][PTRS_PER_PMD];
which will store temporary page tables for early kernel which we will not move to the init_level4_pgt
.
After this we add 4096
(size of the early_level4_pgt
) to the rdx
(it now contains the address of the first entry of the early_dynamic_pgts
) and put rdi
(it now contains physical address of the _text
) to the rax
. Now we shift address of the _text
ot PUD_SHIFT
to get index of an entry from page upper directory which contains this address and clears high bits to get only pud
related part:
addq $4096, %rdx
movq %rdi, %rax
shrq $PUD_SHIFT, %rax
andl $(PTRS_PER_PUD-1), %eax
As we have index of a page upper directory we write two addresses of the second entry of the early_dynamic_pgts
array to the first entry of this temporary page directory:
movq %rdx, 4096(%rbx,%rax,8)
incl %eax
andl $(PTRS_PER_PUD-1), %eax
movq %rdx, 4096(%rbx,%rax,8)
In the next step we do the same operation for last page table directory, but filling not two entries, but all entries to cover full size of the kernel.
After our early page table directories filled, we put physical address of the early_level4_pgt
to the rax
register and jump to label 1
:
movq $(early_level4_pgt - __START_KERNEL_map), %rax
jmp 1f
That's all for now. Our early paging is prepared and we just need to finish last preparation before we will jump into C code and kernel entry point later.
After that we jump to the label 1
we enable PAE
, PGE
(Paging Global Extension) and put the content of the phys_base
(see above) to the rax
register and fill cr3
register with it:
1:
movl $(X86_CR4_PAE | X86_CR4_PGE), %ecx
movq %rcx, %cr4
addq phys_base(%rip), %rax
movq %rax, %cr3
In the next step we check that CPU supports NX bit with:
movl $0x80000001, %eax
cpuid
movl %edx,%edi
We put 0x80000001
value to the eax
and execute cpuid
instruction for getting the extended processor info and feature bits. The result will be in the edx
register which we put to the edi
.
Now we put 0xc0000080
or MSR_EFER
to the ecx
and call rdmsr
instruction for the reading model specific register.
movl $MSR_EFER, %ecx
rdmsr
The result will be in the edx:eax
. General view of the EFER
is following:
63 32
--------------------------------------------------------------------------------
| |
| Reserved MBZ |
| |
--------------------------------------------------------------------------------
31 16 15 14 13 12 11 10 9 8 7 1 0
--------------------------------------------------------------------------------
| | T | | | | | | | | | |
| Reserved MBZ | C | FFXSR | LMSLE |SVME|NXE|LMA|MBZ|LME|RAZ|SCE|
| | E | | | | | | | | | |
--------------------------------------------------------------------------------
We will not see all fields in details here, but we will learn about this and other MSRs
in a special part about it. As we read EFER
to the edx:eax
, we check _EFER_SCE
or zero bit which is System Call Extensions
with btsl
instruction and set it to one. By the setting SCE
bit we enable SYSCALL
and SYSRET
instructions. In the next step we check 20th bit in the edi
, remember that this register stores result of the cpuid
(see above). If 20
bit is set (NX
bit) we just write EFER_SCE
to the model specific register.
btsl $_EFER_SCE, %eax
btl $20,%edi
jnc 1f
btsl $_EFER_NX, %eax
btsq $_PAGE_BIT_NX,early_pmd_flags(%rip)
1: wrmsr
If the NX bit is supported we enable _EFER_NX
and write it too, with the wrmsr
instruction. After the NX bit is set, we set some bits in the cr0
control register, namely:
X86_CR0_PE
- system is in protected mode;X86_CR0_MP
- controls interaction of WAIT/FWAIT instructions with TS flag in CR0;X86_CR0_ET
- on the 386, it allowed to specify whether the external math coprocessor was an 80287 or 80387;X86_CR0_NE
- enable internal x87 floating point error reporting when set, else enables PC style x87 error detection;X86_CR0_WP
- when set, the CPU can't write to read-only pages when privilege level is 0;X86_CR0_AM
- alignment check enabled if AM set, AC flag (in EFLAGS register) set, and privilege level is 3;X86_CR0_PG
- enable paging.
by the execution following assembly code:
#define CR0_STATE (X86_CR0_PE | X86_CR0_MP | X86_CR0_ET | \
X86_CR0_NE | X86_CR0_WP | X86_CR0_AM | \
X86_CR0_PG)
movl $CR0_STATE, %eax
movq %rax, %cr0
We already know that to run any code, and even more C code from assembly, we need to setup a stack. As always, we are doing it by the setting of stack pointer to a correct place in memory and resetting flags register after this:
movq initial_stack(%rip), %rsp
pushq $0
popfq
The most interesting thing here is the initial_stack
. This symbol is defined in the source code file and looks like:
GLOBAL(initial_stack)
.quad init_thread_union+THREAD_SIZE-8
The GLOBAL
is already familiar to us from. It defined in the arch/x86/include/asm/linkage.h header file expands to the global
symbol definition:
#define GLOBAL(name) \
.globl name; \
name:
The THREAD_SIZE
macro is defined in the arch/x86/include/asm/page_64_types.h header file and depends on value of the KASAN_STACK_ORDER
macro:
#define THREAD_SIZE_ORDER (2 + KASAN_STACK_ORDER)
#define THREAD_SIZE (PAGE_SIZE << THREAD_SIZE_ORDER)
We consider when the kasan is disabled and the PAGE_SIZE
is 4096
bytes. So the THREAD_SIZE
will expands to 16
kilobytes and represents size of the stack of a thread. Why is thread
? You may already know that each process may have parent processes and child processes. Actually, a parent process and child process differ in stack. A new kernel stack is allocated for a new process. In the Linux kernel this stack is represented by the union with the thread_info
structure.
And as we can see the init_thread_union
is represented by the thread_union
union. Earlier this union looked like:
union thread_union {
struct thread_info thread_info;
unsigned long stack[THREAD_SIZE/sizeof(long)];
};
but from the Linux kernel 4.9-rc1
release, thread_info
was moved to the task_struct
structure which represents a thread. So, for now thread_union
looks like:
union thread_union {
#ifndef CONFIG_THREAD_INFO_IN_TASK
struct thread_info thread_info;
#endif
unsigned long stack[THREAD_SIZE/sizeof(long)];
};
where the CONFIG_THREAD_INFO_IN_TASK
kernel configuration option is enabled for x86_64
architecture. So, as we consider only x86_64
architecture in this book, an instance of thread_union
will contain only stack and thread_info
structure will be placed in the task_struct
.
The init_thread_union
looks like:
union thread_union init_thread_union __init_task_data = {
#ifndef CONFIG_THREAD_INFO_IN_TASK
INIT_THREAD_INFO(init_task)
#endif
};
which represents just thread stack. Now we may understand this expression:
GLOBAL(initial_stack)
.quad init_thread_union+THREAD_SIZE-8
that initial_stack
symbol points to the start of the thread_union.stack
array + THREAD_SIZE
which is 16 killobytes and - 8 bytes. Here we need to subtract 8
bytes at the top of stack. This is necessary to guarantee illegal access of the next page memory.
After the early boot stack is set, to update the Global Descriptor Table with the lgdt
instruction:
lgdt early_gdt_descr(%rip)
where the early_gdt_descr
is defined as:
early_gdt_descr:
.word GDT_ENTRIES*8-1
early_gdt_descr_base:
.quad INIT_PER_CPU_VAR(gdt_page)
We need to reload Global Descriptor Table
because now kernel works in the low userspace addresses, but soon kernel will work in its own space. Now let's look at the definition of early_gdt_descr
. Global Descriptor Table contains 32
entries:
#define GDT_ENTRIES 32
for kernel code, data, thread local storage segments and etc... it's simple. Now let's look at the definition of the early_gdt_descr_base
.
First of gdt_page
defined as:
struct gdt_page {
struct desc_struct gdt[GDT_ENTRIES];
} __attribute__((aligned(PAGE_SIZE)));
in the arch/x86/include/asm/desc.h. It contains one field gdt
which is array of the desc_struct
structure which is defined as:
struct desc_struct {
union {
struct {
unsigned int a;
unsigned int b;
};
struct {
u16 limit0;
u16 base0;
unsigned base1: 8, type: 4, s: 1, dpl: 2, p: 1;
unsigned limit: 4, avl: 1, l: 1, d: 1, g: 1, base2: 8;
};
};
} __attribute__((packed));
and presents familiar to us GDT
descriptor. Also we can note that gdt_page
structure aligned to PAGE_SIZE
which is 4096
bytes. It means that gdt
will occupy one page. Now let's try to understand what is INIT_PER_CPU_VAR
. INIT_PER_CPU_VAR
is a macro which defined in the arch/x86/include/asm/percpu.h and just concats init_per_cpu__
with the given parameter:
#define INIT_PER_CPU_VAR(var) init_per_cpu__##var
After the INIT_PER_CPU_VAR
macro will be expanded, we will have init_per_cpu__gdt_page
. We can see in the linker script:
#define INIT_PER_CPU(x) init_per_cpu__##x = x + __per_cpu_load
INIT_PER_CPU(gdt_page);
As we got init_per_cpu__gdt_page
in INIT_PER_CPU_VAR
and INIT_PER_CPU
macro from linker script will be expanded we will get offset from the __per_cpu_load
. After this calculations, we will have correct base address of the new GDT.
Generally per-CPU variables is a 2.6 kernel feature. You can understand what it is from its name. When we create per-CPU
variable, each CPU will have its own copy of this variable. Here we creating gdt_page
per-CPU variable. There are many advantages for variables of this type, like there are no locks, because each CPU works with its own copy of variable and etc... So every core on multiprocessor will have its own GDT
table and every entry in the table will represent a memory segment which can be accessed from the thread which ran on the core. You can read in details about per-CPU
variables in the Theory/per-cpu post.
As we loaded new Global Descriptor Table, we reload segments as we did it every time:
xorl %eax,%eax
movl %eax,%ds
movl %eax,%ss
movl %eax,%es
movl %eax,%fs
movl %eax,%gs
After all of these steps we set up gs
register that it post to the irqstack
which represents special stack where interrupts will be handled on:
movl $MSR_GS_BASE,%ecx
movl initial_gs(%rip),%eax
movl initial_gs+4(%rip),%edx
wrmsr
where MSR_GS_BASE
is:
#define MSR_GS_BASE 0xc0000101
We need to put MSR_GS_BASE
to the ecx
register and load data from the eax
and edx
(which are point to the initial_gs
) with wrmsr
instruction. We don't use cs
, fs
, ds
and ss
segment registers for addressing in the 64-bit mode, but fs
and gs
registers can be used. fs
and gs
have a hidden part (as we saw it in the real mode for cs
) and this part contains descriptor which mapped to Model Specific Registers. So we can see above 0xc0000101
is a gs.base
MSR address. When a system call or interrupt occurred, there is no kernel stack at the entry point, so the value of the MSR_GS_BASE
will store address of the interrupt stack.
In the next step we put the address of the real mode bootparam structure to the rdi
(remember rsi
holds pointer to this structure from the start) and jump to the C code with:
movq initial_code(%rip), %rax
pushq $__KERNEL_CS # set correct cs
pushq %rax # target address in negative space
lretq
Here we put the address of the initial_code
to the rax
and push fake address, __KERNEL_CS
and the address of the initial_code
to the stack. After this we can see lretq
instruction which means that after it return address will be extracted from stack (now there is address of the initial_code
) and jump there. initial_code
is defined in the same source code file and looks:
.balign 8
GLOBAL(initial_code)
.quad x86_64_start_kernel
...
...
...
As we can see initial_code
contains address of the x86_64_start_kernel
, which is defined in the arch/x86/kerne/head64.c and looks like this:
asmlinkage __visible void __init x86_64_start_kernel(char * real_mode_data) {
...
...
...
}
It has one argument is a real_mode_data
(remember that we passed address of the real mode data to the rdi
register previously).
This is first C code in the kernel!
We need to see last preparations before we can see "kernel entry point" - start_kernel function from the init/main.c.
First of all we can see some checks in the x86_64_start_kernel
function:
BUILD_BUG_ON(MODULES_VADDR < __START_KERNEL_map);
BUILD_BUG_ON(MODULES_VADDR - __START_KERNEL_map < KERNEL_IMAGE_SIZE);
BUILD_BUG_ON(MODULES_LEN + KERNEL_IMAGE_SIZE > 2*PUD_SIZE);
BUILD_BUG_ON((__START_KERNEL_map & ~PMD_MASK) != 0);
BUILD_BUG_ON((MODULES_VADDR & ~PMD_MASK) != 0);
BUILD_BUG_ON(!(MODULES_VADDR > __START_KERNEL));
BUILD_BUG_ON(!(((MODULES_END - 1) & PGDIR_MASK) == (__START_KERNEL & PGDIR_MASK)));
BUILD_BUG_ON(__fix_to_virt(__end_of_fixed_addresses) <= MODULES_END);
There are checks for different things like virtual addresses of modules space is not fewer than base address of the kernel text - __STAT_KERNEL_map
, that kernel text with modules is not less than image of the kernel and etc... BUILD_BUG_ON
is a macro which looks as:
#define BUILD_BUG_ON(condition) ((void)sizeof(char[1 - 2*!!(condition)]))
Let's try to understand how this trick works. Let's take for example first condition: MODULES_VADDR < __START_KERNEL_map
. !!conditions
is the same that condition != 0
. So it means if MODULES_VADDR < __START_KERNEL_map
is true, we will get 1
in the !!(condition)
or zero if not. After 2*!!(condition)
we will get or 2
or 0
. In the end of calculations we can get two different behaviors:
- We will have compilation error, because try to get size of the char array with negative index (as can be in our case, because
MODULES_VADDR
can't be less than__START_KERNEL_map
will be in our case); - No compilation errors.
That's all. So interesting C trick for getting compile error which depends on some constants.
In the next step we can see call of the cr4_init_shadow
function which stores shadow copy of the cr4
per cpu. Context switches can change bits in the cr4
so we need to store cr4
for each CPU. And after this we can see call of the reset_early_page_tables
function where we resets all page global directory entries and write new pointer to the PGT in cr3
:
for (i = 0; i < PTRS_PER_PGD-1; i++)
early_level4_pgt[i].pgd = 0;
next_early_pgt = 0;
write_cr3(__pa_nodebug(early_level4_pgt));
Soon we will build new page tables. Here we can see that we go through all Page Global Directory Entries (PTRS_PER_PGD
is 512
) in the loop and make it zero. After this we set next_early_pgt
to zero (we will see details about it in the next post) and write physical address of the early_level4_pgt
to the cr3
. __pa_nodebug
is a macro which will be expanded to:
((unsigned long)(x) - __START_KERNEL_map + phys_base)
After this we clear _bss
from the __bss_stop
to __bss_start
and the next step will be setup of the early IDT
handlers, but it's big concept so we will see it in the next part.
This is the end of the first part about linux kernel initialization.
If you have questions or suggestions, feel free to ping me in twitter 0xAX, drop me email or just create issue.
In the next part we will see initialization of the early interruption handlers, kernel space memory mapping and a lot more.
Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me PR to linux-insides.