How to make fastmap support more cpu cores than 32 or 80 #1

Open
jiguanglizipao opened this issue Sep 27, 2021 · 4 comments

@jiguanglizipao

I tried to run FastMap on a server with 256 cores, and FastMap crashed at alloc_page_lock. I read the code and found that FastMap only supports 32 or 80 freelists (i.e. cores), according to mmap_buffer_hash_table.c, shown below. How can I make this configurable? I am also confused about how to set the constants in shared_defines.h. Could you show a correct setting for 256 cores? Thank you.

#error "NUM_FREELISTS number is not supported!"

const unsigned int cpu = __rand[cpuid][i];

@tpapagian
Collaborator

Regarding shared_defines.h, you should only change NUM_FREELISTS and NUM_QUEUES to be equal to the number of cores (256 in your case).

#error "NUM_FREELISTS number is not supported!"
will return an error. To be honest we haven’t done this generic enough as we only support 32 and 80 cores. In order to generate an array for your case, it should be a [NUM_FREELISTS][NUM_FREELISTS]. Each line refers to a specific core. For example line 0 refers to core 0. In the case of 32 * 32, core 0 will check the freelists 28, 1, 18, 6, 29, etc (
{ 28, 1, 18, 6, 29, 0, 16, 5, 8, 23, 20, 2, 3, 21, 4, 13, 10, 9, 27, 15, 14, 30, 7, 11, 22, 24, 31, 25, 19, 26, 17, 12 },
). Core 1 will use the second line etc. So the only requirement is that each line should contain a random permutation of numbers 0 to num_cores - 1 without any repetitions.

Feel free to provide a pull request with a more generic way to produce these arrays.
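
As a reference, here is a minimal userspace sketch (not part of the FastMap sources, just an illustration) that prints such a table with a Fisher-Yates shuffle; each row is an independent random permutation of 0 to NUM_FREELISTS - 1:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NUM_FREELISTS 256

int main(void)
{
    unsigned int row[NUM_FREELISTS];

    srand((unsigned int)time(NULL));
    for (int r = 0; r < NUM_FREELISTS; r++) {
        /* start from the identity permutation */
        for (int i = 0; i < NUM_FREELISTS; i++)
            row[i] = i;
        /* Fisher-Yates shuffle: every element ends up at a random position */
        for (int i = NUM_FREELISTS - 1; i > 0; i--) {
            int j = rand() % (i + 1);
            unsigned int tmp = row[i];
            row[i] = row[j];
            row[j] = tmp;
        }
        printf("{ ");
        for (int i = 0; i < NUM_FREELISTS; i++)
            printf("%u%s", row[i], i == NUM_FREELISTS - 1 ? " },\n" : ", ");
    }
    return 0;
}

The output can then be pasted in place of the existing __rand table.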

Depending on the throughput of your device, you may also increase the EVICTOR_THREADS define in shared_defines.h a bit (e.g. 16 or 32), but this will require some testing to find the most appropriate value. I believe it is fine to start with 8.
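
For example, a starting point for shared_defines.h on a 256-core machine could look like the following (illustrative values only, not a tested configuration):

/* shared_defines.h -- illustrative settings for a 256-core machine */
#define NUM_FREELISTS   256  /* one freelist per core */
#define NUM_QUEUES      256  /* one queue per core */
#define EVICTOR_THREADS 8    /* start at 8; try 16 or 32 for faster devices */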

Please let me know if anything is not clear or you encounter any more issues.

@JohnMalliotakis
Collaborator

Hi, I actually have a patch which modifies the page allocation mechanism to be NUMA-aware and to work with any number of cores. I will try to commit it by tonight.

However, there are a couple more issues which need to be addressed in order to support 256 cores. The main issue lies with our reverse mapping structure, which stores all the mappings established for a page, and is defined here

struct pr_vma_rmap

The __vaddr field is used to store the virtual address associated with the reverse mapping. As these addresses are page-aligned, we can use the 12 least significant bits of __vaddr to store metadata, namely the core number with which the mapping is associated (see here

/* this is bits 6:0 in vaddr -> 7 bits -> 2^7 = 128 max CPUs */
) and the mapping index in the page reverse mappings array (rmap field here
struct pr_vma_rmap rmap[MAX_RMAPS_PER_PAGE]; /* reverse mappings */ /* XXX this should be first */
). The MAX_RMAPS_PER_PAGE macro essentially defines how many mmap calls one can issue over the device/file at the same time. To support 256 cores, you could reduce the number of bits used for the index metadata (which must be able to cover the range [0, MAX_RMAPS_PER_PAGE - 1]) and use the additional bits for the cpu metadata. That would involve modifying the pvr_set/get_idx/cpu functions in the file tagged_page.c according to your needs.
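
As a rough sketch (assuming __vaddr is an unsigned long; the exact signatures of the helpers in tagged_page.c may differ), a 256-core layout could use bits 7:0 for the cpu and bits 11:8 for the index:

/* Sketch only: low 12 bits of the page-aligned __vaddr hold the metadata.
 *   bits 7:0  -> cpu (2^8 = 256 CPUs)
 *   bits 11:8 -> idx (2^4 = 16 >= MAX_RMAPS_PER_PAGE)
 */
static inline unsigned int pvr_get_cpu(struct pr_vma_rmap *p)
{
        return p->__vaddr & 0xFFUL;
}

static inline void pvr_set_cpu(struct pr_vma_rmap *p, unsigned int cpu)
{
        p->__vaddr = (p->__vaddr & ~0xFFUL) | (cpu & 0xFFUL);
}

static inline unsigned int pvr_get_idx(struct pr_vma_rmap *p)
{
        return (p->__vaddr & 0xF00UL) >> 8;
}

static inline void pvr_set_idx(struct pr_vma_rmap *p, unsigned int idx)
{
        p->__vaddr = (p->__vaddr & ~0xF00UL) | (((unsigned long)idx & 0xFUL) << 8);
}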

Finally, you would have to modify the structures in dmap.h that are hardcoded to support at most 128 cores, namely struct pr_vma_entry and struct pr_vma_data, and replace 128 with 256. Hope this helps; feel free to ask if you encounter any other problems.

@jiguanglizipao
Author

I followed @tpapagian's comment and successfully ran on 128 cores after disabling half of the cores (hyper-threading, actually) in the BIOS. The random read workload from append-only-log works and shows good performance. However, the random write workload runs very slowly (< 10K IOPS) and leads to memory leaks. I also tried increasing EVICTOR_THREADS to 32, but it did not help.

For 256 cores, I changed pvr_get_cpu from return p->__vaddr & 0x7F; to return p->__vaddr & 0xFF; and changed pvr_get_idx from return (p->__vaddr & 0x0F80) >> 7; to return (p->__vaddr & 0x0F00) >> 8;. I think 4 bits for pvr_get_idx are enough to cover MAX_RMAPS_PER_PAGE (the default setting is 3). The log is attached below.

[ 4605.543124] task: ffff9d25e9ff4500 task.stack: ffffa912d4a48000
[ 4605.543125] RIP: 0010:radix_tree_next_chunk+0x76/0x320
[ 4605.543125] RSP: 0018:ffffa912d4a4bdb8 EFLAGS: 00010246
[ 4605.543126] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000004
[ 4605.543126] RDX: 0000000000000000 RSI: ffffa912d4a4be08 RDI: 0000000000000000
[ 4605.543127] RBP: 0000000000000008 R08: 00000000000000ff R09: 0000000000000001
[ 4605.543127] R10: ffffffff860672a0 R11: 0000000000000000 R12: ffff9d35ebfead00
[ 4605.543128] R13: 0000000000000040 R14: 0000000000000228 R15: 0000000000000080
[ 4605.543129] FS:  0000000000000000(0000) GS:ffff9d360ea00000(0000) knlGS:0000000000000000
[ 4605.543129] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4605.543130] CR2: 0000000000000000 CR3: 00000059e340a000 CR4: 00000000003406e0
[ 4605.543130] Call Trace:                                     
[ 4605.543135]  raw_release+0xb7/0x2b2 [dmap]
[ 4605.543138]  ? __fput+0xd8/0x220
[ 4605.543143]  ? task_work_run+0x8a/0xb0
[ 4605.543146]  ? do_exit+0x395/0xb80
[ 4605.543147]  ? SyS_ioctl+0x74/0x80
[ 4605.543148]  ? rewind_stack_do_exit+0x17/0x20
[ 4605.543149] Code: 44 24 08 41 bd 40 00 00 00 83 e2 20 48 8b 04 24 4c 8b 48 08 4c 89 c8 83 e0 03 48 83 f8 01 0f 85 69 02 00 00 4c 89 c8 48 83 e0 fe <0f> b6 08 4c 89 e8 48 d3 e0 48 83 e8 01 48 39 c7 0f 86 a8 00 00 
[ 4605.543163] RIP: radix_tree_next_chunk+0x76/0x320 RSP: ffffa912d4a4bdb8
[ 4605.543163] CR2: 0000000000000000
[ 4605.543163] ---[ end trace a51886265ae8a084 ]---

@JohnMalliotakis
Collaborator

I have also added the patches for NUMA-aware allocations. Just set NUM_FREELISTS to anything other than 1 in shared_defines.h (see the sketch below); the value itself does not matter because we use the return value of num_online_cpus(). I'll have a look at the log you posted.
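
For instance (illustrative only; any value other than 1 should behave the same after the patch):

/* shared_defines.h -- after the NUMA-aware allocation patch */
#define NUM_FREELISTS 256  /* any value != 1; the runtime count comes from num_online_cpus() */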
