
[HMSDK-2.0] Memory tiering issue with Hynix CXL devices #2

Open · j0807s opened this issue Feb 19, 2024 · 12 comments
@j0807s commented Feb 19, 2024

Hello,

We are currently testing HMSDK-2.0 with Hynix CXL devices. However, we have encountered an issue where all the memory devices, including the expanders, have the same memory tier (i.e., memory_tier), which may hinder automatic promotion and demotion.

How can we create a new memory tier for the CXL devices and utilize them as second-tiered memory?

Our environment is set up as follows:

OS (kernel): Ubuntu 22.04.3 LTS (Linux 6.6.0-hmsdk2.0+)
CPU: Intel Xeon 4410Y (Sapphire Rapids) @ 2.0 GHz, 12 cores
Memory (Sockets 0, 1): 32 GB DDR5-4000 MT/s, 128 GB total
CXL Expander: PCIe 5.0, each with 96 GB
Motherboard: Super X13DAI-T (supports CXL 1.1, CXL Type 3 Legacy enabled)

Thanks.

@JongminKim-KU

We found that the memory in all NUMA nodes is in the same tier:
$ ls /sys/devices/virtual/memory_tiering
memory_tier4 power uevent
$ cat /sys/devices/virtual/memory_tiering/memory_tier4/nodelist
0-3
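
The CXL expanders are expected to show up as CPU-less nodes, which can be cross-checked with generic sysfs queries (not HMSDK-specific):

$ cat /sys/devices/system/node/has_cpu
$ cat /sys/devices/system/node/has_memory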

@hyeongtakji (Collaborator)

Hello Junsu and Jongmin,

Thank you for reporting this issue.

> However, we have encountered an issue where all the memory devices, including the expanders, have the same memory tier (i.e., memory_tier), which may hinder automatic promotion and demotion.

If all NUMA nodes are on the same memory tier, promotion and demotion won't happen.

> How can we create a new memory tier for the CXL devices and utilize them as second-tiered memory?

As far as I know, there is no way to change the tier of NUMA nodes other than applying custom patches when building your Linux kernel. Maybe we can share the simple patch that we've used for tests. @honggyukim, would that be okay?

Also, we are currently working on RFC v2 for LKML and it will include patches that enable users to set destination nodes for migrations regardless of the memory tier of the system. However, I'm not sure when we will post it. Still, I'll update you when it's available.

@j0807s (Author) commented Feb 19, 2024

Thank you for your explanation and support!

@honggyukim (Member)

Hi Junsu and Jongmin,

Thanks for the report. As mentioned by @hyeongtakji, the current HMSDK 2.0 won't work unless your system has a tiered memory setup.

> We found that the memory in all NUMA nodes is in the same tier:
>
> $ ls /sys/devices/virtual/memory_tiering
> memory_tier4 power uevent
> $ cat /sys/devices/virtual/memory_tiering/memory_tier4/nodelist
> 0-3

If you want to make NUMA nodes 0 and 1 the first tier and nodes 2 and 3 the second tier, you can use the following workaround change.

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index 437441cdf78f..13f82b5d67e8 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -18,6 +18,7 @@
  * the same memory tier.
  */
 #define MEMTIER_ADISTANCE_DRAM ((4 * MEMTIER_CHUNK_SIZE) + (MEMTIER_CHUNK_SIZE >> 1))
+#define MEMTIER_ADISTANCE_CXL  (MEMTIER_ADISTANCE_DRAM * 5)

 struct memory_tier;
 struct memory_dev_type {
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 37a4f59d9585..3fdbc3c9bfa9 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -37,6 +37,7 @@ static DEFINE_MUTEX(memory_tier_lock);
 static LIST_HEAD(memory_tiers);
 static struct node_memory_type_map node_memory_types[MAX_NUMNODES];
 static struct memory_dev_type *default_dram_type;
+static struct memory_dev_type *default_cxl_type;

 static struct bus_type memory_tier_subsys = {
        .name = "memory_tiering",
@@ -484,7 +485,10 @@ static struct memory_tier *set_node_memory_tier(int node)
        if (!node_state(node, N_MEMORY))
                return ERR_PTR(-EINVAL);

-       __init_node_memory_type(node, default_dram_type);
+       if (node < 2)
+               __init_node_memory_type(node, default_dram_type);
+       else
+               __init_node_memory_type(node, default_cxl_type);

        memtype = node_memory_types[node].memtype;
        node_set(node, memtype->nodes);
@@ -646,6 +650,9 @@ static int __init memory_tier_init(void)
        default_dram_type = alloc_memory_type(MEMTIER_ADISTANCE_DRAM);
        if (IS_ERR(default_dram_type))
                panic("%s() failed to allocate default DRAM tier\n", __func__);
+       default_cxl_type = alloc_memory_type(MEMTIER_ADISTANCE_CXL);
+       if (IS_ERR(default_cxl_type))
+               panic("%s() failed to allocate default CXL tier\n", __func__);

        /*
         * Look at all the existing N_MEMORY nodes and add them to

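After rebuilding with this change, the split can be verified from the same sysfs path. Assuming nodes 2 and 3 are the CXL nodes, the output should look roughly like below; the exact name of the second tier directory depends on the abstract distance value (with MEMTIER_ADISTANCE_DRAM * 5 it comes out as memory_tier22):

$ ls /sys/devices/virtual/memory_tiering
memory_tier4  memory_tier22  power  uevent
$ cat /sys/devices/virtual/memory_tiering/memory_tier4/nodelist
0-1
$ cat /sys/devices/virtual/memory_tiering/memory_tier22/nodelist
2-3
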
> Also, we are currently working on RFC v2 for LKML and it will include patches that enable users to set destination nodes for migrations regardless of the memory tier of the system. However, I'm not sure when we will post it. Still, I'll update you when it's available.

I'm preparing this now. Hopefully I can post it by next week. I will share the patch here when it's updated.

Thanks.

@honggyukim (Member)

> CXL Expander: PCIe 5.0, each with 96 GB

I'm a bit worried about the case where you use two CXL expander cards. With the current kernel change, the kernel might not be able to find a proper promotion target for the second CXL node. This is due to the inaccuracy of node distances in the upstream kernel, so we need to find a better way to handle this problem. Once DAMON has an explicit destination setting, this can be handled properly.
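
For reference, the node distances the kernel uses can be inspected with generic tools; node numbers 2 and 3 below are only assumed from the nodelist output above. If both CXL nodes report similar distances to the DRAM nodes, picking a promotion target becomes ambiguous:

$ numactl --hardware    # the node distance matrix is printed at the bottom
$ cat /sys/devices/system/node/node2/distance
$ cat /sys/devices/system/node/node3/distance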

@honggyukim (Member)

For now, I would recommend testing your workload with a single CXL expander. More importantly, please make sure your evaluation environment has enough cold memory that can be demoted to CXL memory. Demoting that cold memory frees up enough DRAM space for CXL-to-DRAM promotion.

In other words, if your working set is larger than your DRAM capacity, you won't be able to see a benefit. For our evaluation, we created a large amount of cold memory with an mmap program; you can think of that mmapped cold memory as the idle VMs in a data center.

Please see our evaluation environment for more explanation.
https://github.com/skhynix/hmsdk/wiki/HMSDK-v2.0-Performance-Results
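
If it helps, one way to confirm that demotion and promotion are actually happening during a run is to watch the migration counters and the DAMOS scheme statistics. The counter names below are from the upstream kernel and the sysfs paths assume kdamond/context/scheme index 0, so the exact set that moves may differ depending on how the migration path is wired up:

$ grep -E 'pgdemote|pgpromote' /proc/vmstat
$ cat /sys/kernel/mm/damon/admin/kdamonds/0/contexts/0/schemes/0/stats/nr_applied
$ cat /sys/kernel/mm/damon/admin/kdamonds/0/contexts/0/schemes/0/stats/sz_applied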

@JongminKim-KU

Thank you for sharing the modification and the experiment setup details.

We will modify the source code right away, ahead of the updated patches, and rebuild the kernel with a single CXL expander.

@honggyukim (Member)

Please let us know if you run into issues again. Thanks!

@j0807s (Author) commented Feb 20, 2024

We have patched the kernel and observed that promotion and demotion work during our experiments!

We sincerely appreciate your help!

@honggyukim (Member)

I'm glad to hear that it's working in your environment. Please don't hesitate to reach out if you have more issues later. Thanks.

@honggyukim (Member) commented Mar 2, 2024

> Also, we are currently working on RFC v2 for LKML and it will include patches that enable users to set destination nodes for migrations regardless of the memory tier of the system. However, I'm not sure when we will post it. Still, I'll update you when it's available.
>
> I'm preparing this now. Hopefully I can post it by next week. I will share the patch here when it's updated.

The RFC v2 patches are posted at https://lore.kernel.org/linux-mm/[email protected]. In this patch series, /sys/kernel/mm/damon/admin/kdamonds/<N>/contexts/<N>/schemes/<N>/target_nid is created to set the demotion/promotion target node ID explicitly. If it isn't set, memory tiering is used as a fallback.
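
As a rough usage sketch (not copied from the patch series; the scheme index 0 and the node number are just for illustration), setting an explicit migration target for the first scheme would look something like:

$ cd /sys/kernel/mm/damon/admin/kdamonds/0/contexts/0/schemes/0
$ cat target_nid      # shows the current target node (unset by default)
$ echo 2 > target_nid # e.g. demote pages matched by this scheme to node 2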

If you're okay with the workaround patch above, then you don't need the v2 patches; I'm just sharing the recent update.

@j0807s (Author) commented Mar 6, 2024

It seems the RFC v2 patches would provide much more flexibility for constructing a tiered memory system with multiple CXL devices, especially when considering NUMA topology (e.g., the 1st tier for nodes 0, 1, and 2 and the 2nd tier for node 3).

Thank you for sharing the helpful information!
