How to make fastmap support more cpu cores than 32 or 80 #1

Open
jiguanglizipao opened this issue Sep 27, 2021 · 4 comments

@jiguanglizipao

I tried to run FastMap on a server with 256 cores, and FastMap crashed at alloc_page_lock. I read the code and found that FastMap only supports 32 or 80 freelists (i.e. cores), according to mmap_buffer_hash_table.c, shown below. How can I make this configurable? I am also confused about how to set the constants in shared_defines.h. Could you show a correct setting for 256 cores? Thank you.

#error "NUM_FREELISTS number is not supported!"

const unsigned int cpu = __rand[cpuid][i];

@tpapagian
Collaborator

Regarding shared_defines.h, you should only change NUM_FREELISTS and NUM_QUEUES to be equal to the number of cores (256 in your case).

#error "NUM_FREELISTS number is not supported!"
will return an error. To be honest we haven’t done this generic enough as we only support 32 and 80 cores. In order to generate an array for your case, it should be a [NUM_FREELISTS][NUM_FREELISTS]. Each line refers to a specific core. For example line 0 refers to core 0. In the case of 32 * 32, core 0 will check the freelists 28, 1, 18, 6, 29, etc (
{ 28, 1, 18, 6, 29, 0, 16, 5, 8, 23, 20, 2, 3, 21, 4, 13, 10, 9, 27, 15, 14, 30, 7, 11, 22, 24, 31, 25, 19, 26, 17, 12 },
). Core 1 will use the second line etc. So the only requirement is that each line should contain a random permutation of numbers 0 to num_cores - 1 without any repetitions.

Feel free to provide a pull request with a more generic way to produce these arrays.
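
As a reference, here is a minimal userspace sketch (not part of the FastMap sources, just an illustration) that prints such a table with a Fisher-Yates shuffle; each row is an independent random permutation of 0 to NUM_FREELISTS - 1:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NUM_FREELISTS 256

int main(void)
{
    unsigned int row[NUM_FREELISTS];

    srand((unsigned int)time(NULL));
    for (int r = 0; r < NUM_FREELISTS; r++) {
        /* start from the identity permutation */
        for (int i = 0; i < NUM_FREELISTS; i++)
            row[i] = i;
        /* Fisher-Yates shuffle: every element ends up at a random position */
        for (int i = NUM_FREELISTS - 1; i > 0; i--) {
            int j = rand() % (i + 1);
            unsigned int tmp = row[i];
            row[i] = row[j];
            row[j] = tmp;
        }
        printf("{ ");
        for (int i = 0; i < NUM_FREELISTS; i++)
            printf("%u%s", row[i], i == NUM_FREELISTS - 1 ? " },\n" : ", ");
    }
    return 0;
}

The output can then be pasted in place of the existing __rand table.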

Depending on the throughput of your device, you may also increase the EVICTOR_THREADS define in shared_defines.h a bit (e.g. 16 or 32), but this will require some testing to find the most appropriate value. I believe it is fine to start with 8.
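
For example, a starting point for shared_defines.h on a 256-core machine could look like the following (illustrative values only, not a tested configuration):

/* shared_defines.h -- illustrative settings for a 256-core machine */
#define NUM_FREELISTS   256  /* one freelist per core */
#define NUM_QUEUES      256  /* one queue per core */
#define EVICTOR_THREADS 8    /* start at 8; try 16 or 32 for faster devices */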

Please let me know if anything is not clear or you encounter any more issues.

@JohnMalliotakis
Collaborator

Hi, I actually have a patch which modifies the page allocation mechanism to be NUMA-aware and to work with any number of cores. I will try to commit it by tonight.

However, there are a couple more issues which need to be addressed in order to support 256 cores. The main issue lies with our reverse mapping structure, which stores all the mappings established for a page, and is defined here

struct pr_vma_rmap

The __vaddr field is used to store the virtual address associated with the reverse mapping. As these addresses are page-aligned, we can use the 12 least significant bits of __vaddr to store metadata, namely the core number with which the mapping is associated (see here

/* this is bits 6:0 in vaddr -> 7 bits -> 2^7 = 128 max CPUs */
) and the mapping index in the page reverse mappings array (rmap field here
struct pr_vma_rmap rmap[MAX_RMAPS_PER_PAGE]; /* reverse mappings */ /* XXX this should be first */
). The MAX_RMAPS_PER_PAGE macro essentially defines how many mmap calls one can issue over the device/file at the same time. To support 256 cores, you could reduce the number of bits used for the index metadata (which must be able to cover the range [0, MAX_RMAPS_PER_PAGE - 1]) and use the additional bits for the cpu metadata. That would involve modifying the pvr_set/get_idx/cpu functions in the file tagged_page.c according to your needs.
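
As a rough sketch (assuming __vaddr is an unsigned long; the exact signatures of the helpers in tagged_page.c may differ), a 256-core layout could use bits 7:0 for the cpu and bits 11:8 for the index:

/* Sketch only: low 12 bits of the page-aligned __vaddr hold the metadata.
 *   bits 7:0  -> cpu (2^8 = 256 CPUs)
 *   bits 11:8 -> idx (2^4 = 16 >= MAX_RMAPS_PER_PAGE)
 */
static inline unsigned int pvr_get_cpu(struct pr_vma_rmap *p)
{
        return p->__vaddr & 0xFFUL;
}

static inline void pvr_set_cpu(struct pr_vma_rmap *p, unsigned int cpu)
{
        p->__vaddr = (p->__vaddr & ~0xFFUL) | (cpu & 0xFFUL);
}

static inline unsigned int pvr_get_idx(struct pr_vma_rmap *p)
{
        return (p->__vaddr & 0xF00UL) >> 8;
}

static inline void pvr_set_idx(struct pr_vma_rmap *p, unsigned int idx)
{
        p->__vaddr = (p->__vaddr & ~0xF00UL) | (((unsigned long)idx & 0xFUL) << 8);
}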

Finally, you would have to modify the structures in dmap.h that are hardcoded to support at most 128 cores, namely struct pr_vma_entry and struct pr_vma_data, and replace 128 with 256. Hope this helps; feel free to ask if you encounter any other problems.

@jiguanglizipao
Author

I followed @tpapagian's comment and successfully ran on 128 cores after disabling half of the cores (hyper-threading, actually) in the BIOS. The random read workload from append-only-log works and shows good performance. However, the random write workload runs very slowly (< 10K IOPS) and leads to memory leaks. I also tried increasing EVICTOR_THREADS to 32, but it did not help.

For 256 cores, I changed pvr_get_cpu from return p->__vaddr & 0x7F; to return p->__vaddr & 0xFF; and changed pvr_get_idx from return (p->__vaddr & 0x0F80) >> 7; to return (p->__vaddr & 0x0F00) >> 8;. I think 4 bits for pvr_get_idx are enough to cover MAX_RMAPS_PER_PAGE (the default setting is 3). The log is attached below.

[ 4605.543124] task: ffff9d25e9ff4500 task.stack: ffffa912d4a48000
[ 4605.543125] RIP: 0010:radix_tree_next_chunk+0x76/0x320
[ 4605.543125] RSP: 0018:ffffa912d4a4bdb8 EFLAGS: 00010246
[ 4605.543126] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000004
[ 4605.543126] RDX: 0000000000000000 RSI: ffffa912d4a4be08 RDI: 0000000000000000
[ 4605.543127] RBP: 0000000000000008 R08: 00000000000000ff R09: 0000000000000001
[ 4605.543127] R10: ffffffff860672a0 R11: 0000000000000000 R12: ffff9d35ebfead00
[ 4605.543128] R13: 0000000000000040 R14: 0000000000000228 R15: 0000000000000080
[ 4605.543129] FS:  0000000000000000(0000) GS:ffff9d360ea00000(0000) knlGS:0000000000000000
[ 4605.543129] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4605.543130] CR2: 0000000000000000 CR3: 00000059e340a000 CR4: 00000000003406e0
[ 4605.543130] Call Trace:                                     
[ 4605.543135]  raw_release+0xb7/0x2b2 [dmap]
[ 4605.543138]  ? __fput+0xd8/0x220
[ 4605.543143]  ? task_work_run+0x8a/0xb0
[ 4605.543146]  ? do_exit+0x395/0xb80
[ 4605.543147]  ? SyS_ioctl+0x74/0x80
[ 4605.543148]  ? rewind_stack_do_exit+0x17/0x20
[ 4605.543149] Code: 44 24 08 41 bd 40 00 00 00 83 e2 20 48 8b 04 24 4c 8b 48 08 4c 89 c8 83 e0 03 48 83 f8 01 0f 85 69 02 00 00 4c 89 c8 48 83 e0 fe <0f> b6 08 4c 89 e8 48 d3 e0 48 83 e8 01 48 39 c7 0f 86 a8 00 00 
[ 4605.543163] RIP: radix_tree_next_chunk+0x76/0x320 RSP: ffffa912d4a4bdb8
[ 4605.543163] CR2: 0000000000000000
[ 4605.543163] ---[ end trace a51886265ae8a084 ]---

@JohnMalliotakis
Collaborator

I have also added the patches for NUMA-aware allocations. Just set NUM_FREELISTS to anything other than 1 in shared_defines.h (see the sketch below); the value itself does not matter because we use the return value of num_online_cpus(). I'll have a look at the log you posted.
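
For instance (illustrative only; any value other than 1 should behave the same after the patch):

/* shared_defines.h -- after the NUMA-aware allocation patch */
#define NUM_FREELISTS 256  /* any value != 1; the runtime count comes from num_online_cpus() */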
