This is the sixth part of the Kernel booting process
series. In the previous part we saw the end of the kernel boot process, but we skipped some important advanced parts.
As you may remember, the entry point of the Linux kernel is the start_kernel
function from the main.c source code file, and the kernel is loaded at the LOAD_PHYSICAL_ADDR
address. This address depends on the CONFIG_PHYSICAL_START
kernel configuration option, which is 0x1000000
by default:
config PHYSICAL_START
hex "Physical address where the kernel is loaded" if (EXPERT || CRASH_DUMP)
default "0x1000000"
---help---
This gives the physical address where the kernel is loaded.
...
...
...
This value may be changed during kernel configuration, but the load address can also be selected randomly. For this purpose the CONFIG_RANDOMIZE_BASE
kernel configuration option should be enabled during kernel configuration.
In this case the physical address at which the Linux kernel image is decompressed and loaded will be randomized. This part considers the case when this option is enabled and the load address of the kernel image is randomized for security reasons.
Before the kernel decompressor starts looking for a random memory range where the kernel will be decompressed and loaded, the identity mapped page tables should be initialized. If a bootloader used the 16-bit or 32-bit boot protocol, we already have page tables. But in any case, we may need new pages on demand if the kernel decompressor selects a memory range outside of them. That's why we need to build new identity mapped page tables.
Yes, building identity mapped page tables is one of the first steps during randomization of the load address. But before we consider it, let's try to remember how we got to this point.
In the previous part, we saw the transition to long mode and the jump to the kernel decompressor entry point, the extract_kernel
function. The randomization starts here with the call of the:
void choose_random_location(unsigned long input,
unsigned long input_size,
unsigned long *output,
unsigned long output_size,
unsigned long *virt_addr)
{}
function. As you can see, this function takes the following five parameters:
input;
input_size;
output;
output_size;
virt_addr.
Let's try to understand what these parameters are. The first parameter, input,
comes from the parameters of the extract_kernel
function from the arch/x86/boot/compressed/misc.c source code file:
asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,
unsigned char *input_data,
unsigned long input_len,
unsigned char *output,
unsigned long output_len)
{
...
...
...
choose_random_location((unsigned long)input_data, input_len,
(unsigned long *)&output,
max(output_len, kernel_total_size),
&virt_addr);
...
...
...
}
This parameter is passed from assembly code:
leaq input_data(%rip), %rdx
from the arch/x86/boot/compressed/head_64.S. The input_data
is generated by the little mkpiggy program. If you have compiled the Linux kernel source code yourself, you can find the file generated by this program at linux/arch/x86/boot/compressed/piggy.S
. In my case this file looks like this:
.section ".rodata..compressed","a",@progbits
.globl z_input_len
z_input_len = 6988196
.globl z_output_len
z_output_len = 29207032
.globl input_data, input_data_end
input_data:
.incbin "arch/x86/boot/compressed/vmlinux.bin.gz"
input_data_end:
As you can see, it contains four global symbols. The first two, z_input_len
and z_output_len
, are the sizes of the compressed vmlinux.bin.gz and of the uncompressed kernel image
. The third is our input_data
, and as you can see it points to the compressed Linux kernel image, which was produced from the kernel in raw binary format (all debugging symbols, comments and relocation information are stripped). And the last, input_data_end,
points to the end of the compressed Linux image.
So, the first parameter of the choose_random_location
function is the pointer to the compressed kernel image that is embedded into the piggy.o
object file.
The second parameter of the choose_random_location
function is the z_input_len
that we have just seen.
The third and fourth parameters of the choose_random_location
function are the address where the decompressed kernel image should be placed and the length of the decompressed kernel image, respectively. The address where to put the decompressed kernel comes from arch/x86/boot/compressed/head_64.S, and it is the address of startup_32
aligned to a 2 megabyte boundary. The size of the decompressed kernel comes from the same piggy.S
, and it is z_output_len
.
The last parameter of the choose_random_location
function is the virtual address of the kernel load address. As we can see, by default it coincides with the default physical load address:
unsigned long virt_addr = LOAD_PHYSICAL_ADDR;
which depends on kernel configuration:
#define LOAD_PHYSICAL_ADDR ((CONFIG_PHYSICAL_START \
+ (CONFIG_PHYSICAL_ALIGN - 1)) \
& ~(CONFIG_PHYSICAL_ALIGN - 1))
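To make the arithmetic of this macro more tangible, here is a minimal sketch of the same round-up computation as a plain C function. The function name load_physical_addr and the 0x200000 alignment used in the usage note are assumptions made for illustration; the real values come from the kernel configuration.

```c
/*
 * Round `start` up to the next multiple of `align`.
 * `align` must be a power of two, as CONFIG_PHYSICAL_ALIGN is.
 * Hypothetical helper, mirrors the LOAD_PHYSICAL_ADDR expression.
 */
unsigned long long load_physical_addr(unsigned long long start,
                                      unsigned long long align)
{
    return (start + (align - 1)) & ~(align - 1);
}
```

With an assumed alignment of 0x200000, the default start 0x1000000 is already aligned and stays unchanged, while a start of 0x1100000 would be rounded up to 0x1200000.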
Now that we have considered the parameters of the choose_random_location
function, let's look at its implementation. This function starts by checking for the nokaslr
option in the kernel command line:
if (cmdline_find_option_bool("nokaslr")) {
warn("KASLR disabled: 'nokaslr' on cmdline.");
return;
}
and if the option was given, we exit from the choose_random_location
function and the kernel load address will not be randomized. Related command line options can be found in the kernel documentation:
kaslr/nokaslr [X86]
Enable/disable kernel and module base offset ASLR
(Address Space Layout Randomization) if built into
the kernel. When CONFIG_HIBERNATION is selected,
kASLR is disabled by default. When kASLR is enabled,
hibernation will be disabled.
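The idea of checking the command line for a boolean option such as nokaslr can be illustrated with a small user-space sketch. This is not the kernel's cmdline_find_option_bool implementation; the function cmdline_has_option is a hypothetical helper that only looks for the option as a whole whitespace-separated word.

```c
#include <stdbool.h>
#include <string.h>

/* Treat space, tab and newline as word separators (an assumption). */
static bool is_separator(char c)
{
    return c == ' ' || c == '\t' || c == '\n';
}

/*
 * Return true if `option` appears in `cmdline` as a complete word.
 * Hypothetical helper, much simpler than the kernel's own parser.
 */
bool cmdline_has_option(const char *cmdline, const char *option)
{
    size_t len = strlen(option);
    const char *p = cmdline;

    if (len == 0)
        return false;

    while ((p = strstr(p, option)) != NULL) {
        bool starts_word = (p == cmdline) || is_separator(p[-1]);
        bool ends_word = (p[len] == '\0') || is_separator(p[len]);

        if (starts_word && ends_word)
            return true;
        p++; /* keep searching after this partial match */
    }
    return false;
}
```

For example, the sketch reports a match for "root=/dev/sda1 nokaslr quiet" but not for "foo=nokaslrx", because in the second case the text is only part of a longer word.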
Let's assume that we didn't pass nokaslr
to the kernel command line and the CONFIG_RANDOMIZE_BASE
kernel configuration option is enabled. In this case we add the kASLR
flag to the kernel load flags:
boot_params->hdr.loadflags |= KASLR_FLAG;
and the next step is the call of the:
initialize_identity_maps();
function, which is defined in the arch/x86/boot/compressed/kaslr_64.c source code file. This function starts by initializing mapping_info,
an instance of the x86_mapping_info
structure:
mapping_info.alloc_pgt_page = alloc_pgt_page;
mapping_info.context = &pgt_data;
mapping_info.page_flag = __PAGE_KERNEL_LARGE_EXEC | sev_me_mask;
mapping_info.kernpg_flag = _KERNPG_TABLE;
The x86_mapping_info
structure is defined in the arch/x86/include/asm/init.h header file and looks like this:
struct x86_mapping_info {
void *(*alloc_pgt_page)(void *);
void *context;
unsigned long page_flag;
unsigned long offset;
bool direct_gbpages;
unsigned long kernpg_flag;
};
This structure provides information about memory mappings. As you may remember from the previous part, we have already set up initial page tables covering the range from 0 up to 4G
. Now we may need to access memory above 4G
to load the kernel at a random position. So, the initialize_identity_maps
function initializes a memory region for new page tables that may be needed later. First of all, let's look at the definition of the x86_mapping_info
structure.
The alloc_pgt_page
is a callback function that will be called to allocate a page for a new page table. The context
field is, in our case, an instance of the alloc_pgt_data
structure, which is used to track allocated page tables. The page_flag
and kernpg_flag
fields are page flags. The first represents flags for PMD
or PUD
entries. The second, kernpg_flag,
represents flags for kernel pages, which can be overridden later. The direct_gbpages
field represents support for huge pages, and the last field, offset,
represents the offset between kernel virtual addresses and physical addresses up to the PMD
level.
The alloc_pgt_page
callback just checks that there is space for a new page and allocates a new page:
entry = pages->pgt_buf + pages->pgt_buf_offset;
pages->pgt_buf_offset += PAGE_SIZE;
in the buffer from the:
struct alloc_pgt_data {
unsigned char *pgt_buf;
unsigned long pgt_buf_size;
unsigned long pgt_buf_offset;
};
structure and returns the address of the new page. The last goal of the initialize_identity_maps
function is to initialize pgt_buf_size
and pgt_buf_offset
. As we are only in the initialization phase, the initialize_identity_maps
function sets pgt_buf_offset
to zero:
pgt_data.pgt_buf_offset = 0;
and the pgt_data.pgt_buf_size
will be set to 77824
or 69632
, depending on which boot protocol was used by the bootloader (64-bit or 32-bit). The same is true for pgt_data.pgt_buf
. If a bootloader loaded the kernel at startup_32
, the pgt_data.pgt_buf
will point to the end of the page table which was already initialized in arch/x86/boot/compressed/head_64.S:
pgt_data.pgt_buf = _pgtable + BOOT_INIT_PGT_SIZE;
where _pgtable
points to the beginning of this page table. Otherwise, if a bootloader used the 64-bit boot protocol and loaded the kernel at startup_64
, the early page tables should have been built by the bootloader itself, and _pgtable
will simply be overwritten:
pgt_data.pgt_buf = _pgtable;
Now that the buffer for new page tables is initialized, we can return to the choose_random_location
function.
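The page table buffer handling described above is essentially a bump allocator: pages are handed out one after another from a fixed buffer, and an offset tracks how much has been used. Below is a minimal user-space sketch of that idea. The names pgt_pool, pool_alloc_page and SKETCH_PAGE_SIZE are hypothetical and only loosely follow the alloc_pgt_data fields; this is not the kernel's code.

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096UL /* assumed page size for this sketch */

/* Hypothetical counterpart of struct alloc_pgt_data. */
struct pgt_pool {
    unsigned char *buf;   /* start of the backing buffer */
    unsigned long size;   /* total size of the buffer in bytes */
    unsigned long offset; /* bytes already handed out */
};

/*
 * Hand out the next page from the pool, or NULL when the pool
 * has no room for another full page.
 */
void *pool_alloc_page(struct pgt_pool *pool)
{
    unsigned char *page;

    if (pool->offset + SKETCH_PAGE_SIZE > pool->size)
        return NULL;

    page = pool->buf + pool->offset;
    pool->offset += SKETCH_PAGE_SIZE;
    return page;
}
```

With a pool backed by a buffer of two pages, the first two calls return consecutive pages and the third call returns NULL. Nothing is ever freed, which is acceptable here because the early page tables live for the whole decompression stage.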
After everything related to the identity page tables is initialized, we can start choosing a random location for the decompressed kernel image. But as you may guess, we can't choose just any address. There are some reserved memory ranges, occupied by important things like the initrd, the kernel command line and so on. The
mem_avoid_init(input, input_size, *output);
function helps us to do this. All unsafe memory regions are collected in the:
struct mem_vector {
unsigned long long start;
unsigned long long size;
};
static struct mem_vector mem_avoid[MEM_AVOID_MAX];
array, where MEM_AVOID_MAX
comes from the mem_avoid_index
enum, which represents the different types of reserved memory regions:
enum mem_avoid_index {
MEM_AVOID_ZO_RANGE = 0,
MEM_AVOID_INITRD,
MEM_AVOID_CMDLINE,
MEM_AVOID_BOOTPARAMS,
MEM_AVOID_MEMMAP_BEGIN,
MEM_AVOID_MEMMAP_END = MEM_AVOID_MEMMAP_BEGIN + MAX_MEMMAP_REGIONS - 1,
MEM_AVOID_MAX,
};
Both are defined in the arch/x86/boot/compressed/kaslr.c source code file.
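To make the purpose of the mem_avoid array more concrete, here is a minimal sketch of how a candidate region can be checked against a list of regions to avoid. The names mem_range, ranges_overlap and hits_avoided are hypothetical and chosen for this example; the sketch only assumes that each region is described by a start and a size, as in mem_vector.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical counterpart of struct mem_vector. */
struct mem_range {
    unsigned long long start;
    unsigned long long size;
};

/* Two half-open ranges [start, start + size) overlap unless one ends
 * before the other begins. */
bool ranges_overlap(const struct mem_range *a, const struct mem_range *b)
{
    if (a->start + a->size <= b->start)
        return false;
    if (a->start >= b->start + b->size)
        return false;
    return true;
}

/* Return true if `candidate` touches any of the `count` avoided ranges. */
bool hits_avoided(const struct mem_range *candidate,
                  const struct mem_range *avoid, size_t count)
{
    size_t i;

    for (i = 0; i < count; i++) {
        if (ranges_overlap(candidate, &avoid[i]))
            return true;
    }
    return false;
}
```

A candidate that begins exactly where an avoided region ends is not treated as overlapping, because the ranges are half-open.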
Let's look at the implementation of the mem_avoid_init
function. The main goal of this function is to store information about the reserved memory regions described by the mem_avoid_index
enum in the mem_avoid
array and to create new pages for such regions in our new identity mapped buffer. Numerous parts of the mem_avoid_init
function are similar, so let's take a look at one of them:
mem_avoid[MEM_AVOID_ZO_RANGE].start = input;
mem_avoid[MEM_AVOID_ZO_RANGE].size = (output + init_size) - input;
add_identity_map(mem_avoid[MEM_AVOID_ZO_RANGE].start,
mem_avoid[MEM_AVOID_ZO_RANGE].size);
At the beginning, the mem_avoid_init
function tries to avoid the memory region that is used for the current kernel decompression. We fill an entry of the mem_avoid
array with the start and size of this region and call the add_identity_map
function, which builds identity mapped pages for this region. The add_identity_map
function is defined in the arch/x86/boot/compressed/kaslr_64.c source code file and looks like this:
void add_identity_map(unsigned long start, unsigned long size)
{
unsigned long end = start + size;
start = round_down(start, PMD_SIZE);
end = round_up(end, PMD_SIZE);
if (start >= end)
return;
kernel_ident_mapping_init(&mapping_info, (pgd_t *)top_level_pgt,
start, end);
}
As you can see, it aligns the memory region to a 2 megabyte boundary and checks the given start and end addresses.
In the end it just calls the kernel_ident_mapping_init
function from the arch/x86/mm/ident_map.c source code file and passes the mapping_info
instance that was initialized above, the address of the top level page table, and the addresses of the memory region for which a new identity mapping should be built.
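The rounding done by add_identity_map can be checked with a small worked example. The sketch below assumes a PMD_SIZE of 2 megabytes; the names SKETCH_PMD_SIZE, aligned_range and pmd_align are hypothetical and exist only for this illustration.

```c
#define SKETCH_PMD_SIZE (1ULL << 21) /* 2 megabytes, assumed PMD_SIZE */

/* Result of widening a region to 2 megabyte boundaries. */
struct aligned_range {
    unsigned long long start;
    unsigned long long end;
};

/*
 * Round the start of the region down and its end up, so that the
 * resulting range consists of whole 2 megabyte pages.
 */
struct aligned_range pmd_align(unsigned long long start,
                               unsigned long long size)
{
    struct aligned_range r;

    r.start = start & ~(SKETCH_PMD_SIZE - 1);
    r.end = (start + size + SKETCH_PMD_SIZE - 1) & ~(SKETCH_PMD_SIZE - 1);
    return r;
}
```

For a region that starts at 0x1234567 and is 0x1000 bytes long, the sketch yields the range from 0x1200000 to 0x1400000, that is exactly one 2 megabyte page.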
The kernel_ident_mapping_init
function sets default flags for new pages if they were not given:
if (!info->kernpg_flag)
info->kernpg_flag = _KERNPG_TABLE;
and starts to build new 2 megabyte (because of the PSE
bit in mapping_info.page_flag
) page entries (PGD -> P4D -> PUD -> PMD
in the case of five-level page tables or PGD -> PUD -> PMD
in the case of four-level page tables) for the given addresses.
for (; addr < end; addr = next) {
p4d_t *p4d;
next = (addr & PGDIR_MASK) + PGDIR_SIZE;
if (next > end)
next = end;
p4d = (p4d_t *)info->alloc_pgt_page(info->context);
result = ident_p4d_init(info, p4d, addr, next);
/* stop only on error, otherwise continue with the next PGD entry */
if (result)
return result;
}
First of all, we find the next entry of the Page Global Directory
for the given address, and if it is greater than the end
of the given memory region, we set it to end
. After this we allocate a new page with the x86_mapping_info
callback that we already considered above and call the ident_p4d_init
function. The ident_p4d_init
function does the same, but for the lower-level page directories (p4d
-> pud
-> pmd
).
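The way this loop walks a memory range one Page Global Directory entry at a time can be shown with a small sketch that only counts the chunks. It assumes four-level paging, where one PGD entry covers 2^39 bytes; the names SKETCH_PGDIR_SHIFT, SKETCH_PGDIR_SIZE, SKETCH_PGDIR_MASK and count_pgd_chunks are hypothetical.

```c
#define SKETCH_PGDIR_SHIFT 39 /* assumed: four-level paging on x86_64 */
#define SKETCH_PGDIR_SIZE  (1ULL << SKETCH_PGDIR_SHIFT)
#define SKETCH_PGDIR_MASK  (~(SKETCH_PGDIR_SIZE - 1))

/*
 * Count how many PGD-sized chunks the range [addr, end) is split into,
 * using the same "next boundary, clamped to end" stepping as the loop
 * shown above.
 */
unsigned int count_pgd_chunks(unsigned long long addr, unsigned long long end)
{
    unsigned int chunks = 0;
    unsigned long long next;

    for (; addr < end; addr = next) {
        next = (addr & SKETCH_PGDIR_MASK) + SKETCH_PGDIR_SIZE;
        if (next > end)
            next = end;
        chunks++;
    }
    return chunks;
}
```

A range that lies completely inside one PGD entry gives a single chunk, while a range that crosses the 0x8000000000 boundary is processed in two steps.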
That's all.
New page entries related to the reserved addresses are now in our page tables. This is not the end of the mem_avoid_init
function, but the other parts are similar. They just build pages for the initrd, the kernel command line and so on.
Now we can return to the choose_random_location
function.
After the reserved memory regions have been stored in the mem_avoid
array and identity mapping pages have been built for them, we select the minimal available address for the random memory region where the kernel will be decompressed:
min_addr = min(*output, 512UL << 20);
As you can see, it is at most 512
megabytes. This 512
megabyte value was selected just to avoid unknown things in lower memory.
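As a quick worked example of this expression, the sketch below reproduces the min computation as a function. The names SKETCH_512MB and pick_min_addr are hypothetical; 512 shifted left by 20 bits is 0x20000000.

```c
#define SKETCH_512MB (512ULL << 20) /* 0x20000000 */

/* Return the smaller of the current output address and 512 megabytes. */
unsigned long long pick_min_addr(unsigned long long output)
{
    return output < SKETCH_512MB ? output : SKETCH_512MB;
}
```

An output address of 0x1000000 is kept as the minimum, while an output address of 0x40000000 (1 gigabyte) is capped to 0x20000000.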
The next step is to select random physical and virtual addresses to load the kernel. The first is the physical address:
random_addr = find_random_phys_addr(min_addr, output_size);
The find_random_phys_addr
function is defined in the same source code file:
static unsigned long find_random_phys_addr(unsigned long minimum,
unsigned long image_size)
{
minimum = ALIGN(minimum, CONFIG_PHYSICAL_ALIGN);
if (process_efi_entries(minimum, image_size))
return slots_fetch_random();
process_e820_entries(minimum, image_size);
return slots_fetch_random();
}
The main goal of the process_efi_entries
function is to find all suitable memory ranges in fully accessible memory where the kernel can be loaded. If the kernel is compiled for and run on a system without EFI support, we continue to search for such memory regions in the e820 regions. All memory regions found are stored in the
struct slot_area {
unsigned long addr;
int num;
};
#define MAX_SLOT_AREA 100
static struct slot_area slot_areas[MAX_SLOT_AREA];
array. The kernel selects a random index in this array as the place where the kernel will be decompressed. This selection is done by the slots_fetch_random
function. The main goal of the slots_fetch_random
function is to select a random memory range from the slot_areas
array via the kaslr_get_random_long
function:
slot = kaslr_get_random_long("Physical") % slot_max;
The kaslr_get_random_long
function is defined in the arch/x86/lib/kaslr.c source code file and it just returns a random number. Note that the random number is obtained in different ways depending on the kernel configuration and the capabilities of the system (for example, based on the time stamp counter, rdrand and so on).
That's all; from this point a random memory range is selected.
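The relationship between slot areas, slot numbers and the resulting address can be sketched in plain C. This is a simplified illustration under stated assumptions, not the kernel's slots_fetch_random: the names slot_area_sketch, slots_in_region and slot_to_addr are hypothetical, the alignment is passed as a parameter instead of coming from CONFIG_PHYSICAL_ALIGN, and the random slot number is supplied by the caller.

```c
#include <stddef.h>

/* Hypothetical counterpart of struct slot_area. */
struct slot_area_sketch {
    unsigned long long addr; /* address of the first slot in the area */
    unsigned long num;       /* number of slots in the area */
};

/* How many aligned positions of an image fit into a region of this size. */
unsigned long slots_in_region(unsigned long long region_size,
                              unsigned long long image_size,
                              unsigned long long align)
{
    if (region_size < image_size)
        return 0;
    return (unsigned long)((region_size - image_size) / align) + 1;
}

/*
 * Translate a global slot number into an address by walking the areas.
 * Returns 0 when the slot number is out of range.
 */
unsigned long long slot_to_addr(const struct slot_area_sketch *areas,
                                size_t count, unsigned long slot,
                                unsigned long long align)
{
    size_t i;

    for (i = 0; i < count; i++) {
        if (slot >= areas[i].num) {
            slot -= areas[i].num; /* skip this whole area */
            continue;
        }
        return areas[i].addr + slot * align;
    }
    return 0;
}
```

With two areas holding 3 and 2 slots and an alignment of 0x200000, slot numbers 0 to 2 fall into the first area and slot numbers 3 and 4 into the second one. Taking the random number modulo the total number of slots, as in the line above, keeps the slot number in range.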
After a random memory region has been selected by the kernel decompressor, new identity mapped pages are built for this region on demand:
random_addr = find_random_phys_addr(min_addr, output_size);
if (*output != random_addr) {
add_identity_map(random_addr, output_size);
*output = random_addr;
}
From this point output
stores the base address of the memory region where the kernel will be decompressed. But so far, as you may remember, we have randomized only the physical address. The virtual address should be randomized too on the x86_64 architecture:
if (IS_ENABLED(CONFIG_X86_64))
random_addr = find_random_virt_addr(LOAD_PHYSICAL_ADDR, output_size);
*virt_addr = random_addr;
As you can see, in the case of a non-x86_64
architecture, the randomized virtual address coincides with the randomized physical address. The find_random_virt_addr
function calculates the number of virtual memory ranges that may hold the kernel image and calls the kaslr_get_random_long
function that we already saw when we looked for a random physical
address.
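The calculation described above, counting the aligned positions that can hold the image and picking one of them, can be sketched as follows. This is an illustration under assumptions rather than the kernel's find_random_virt_addr: the function pick_virt_addr is hypothetical, `limit` stands for the size of the virtual window available to the kernel image, `align` must be a power of two, and the random value is passed in by the caller.

```c
/*
 * Pick an aligned virtual address in [minimum, limit - image_size].
 * Returns 0 if the image does not fit below `limit` at all.
 */
unsigned long long pick_virt_addr(unsigned long long minimum,
                                  unsigned long long image_size,
                                  unsigned long long align,
                                  unsigned long long limit,
                                  unsigned long long random)
{
    unsigned long long slots;

    /* Work only with aligned values. */
    minimum = (minimum + align - 1) & ~(align - 1);
    image_size = (image_size + align - 1) & ~(align - 1);

    if (minimum + image_size > limit)
        return 0;

    /* Number of aligned start positions where the image still fits. */
    slots = (limit - minimum - image_size) / align + 1;

    return (random % slots) * align + minimum;
}
```

With a minimum of 0x1000000, an image size of 0x2000000, an alignment of 0x200000 and a limit of 0x40000000, there are 489 possible positions; the last of them starts at 0x3E000000, so the image ends exactly at the limit.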
At this point we have both randomized base physical (*output
) and virtual (*virt_addr
) addresses for the decompressed kernel.
That's all.
This is the end of the sixth and last part about the Linux kernel booting process. We will not see more posts about kernel booting (except maybe updates to this and previous posts), but there will be many posts about other kernel internals.
The next chapter will be about kernel initialization, and we will see the first steps in the Linux kernel initialization code.
If you have any questions or suggestions, write me a comment or ping me on twitter.
Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to linux-insides.