
Fix record batch memory size double counting #13377

Merged: 5 commits into apache:main on Nov 15, 2024

Conversation

@2010YOUY01 (Contributor) commented Nov 12, 2024

Which issue does this PR close?

First step to fix #13089

Rationale for this change

Currently, record_batch.get_array_memory_size() overestimates memory usage: if multiple arrays point to the same underlying buffer, the buffer is counted repeatedly.
A more detailed explanation can be found in this PR's doc comment:

/// Calculate total used memory of this batch.
///
/// This function is used to estimate the physical memory usage of the
/// `RecordBatch`. The implementation adds up the memory size of each unique
/// `Buffer`, because:
/// - The data pointers inside `Buffer`s are memory regions returned by the
///   global memory allocator; those regions cannot overlap.
/// - The actually used ranges of the `ArrayRef`s inside a `RecordBatch` can
///   overlap or reuse the same `Buffer`, e.g. when taking a slice of an `Array`.
///
/// Example:
/// For a `RecordBatch` with two columns `col1` and `col2`, both pointing
/// to sub-regions of the same buffer:
///
/// [xxxxxxxxxxxxxxxxxxx] <--- buffer
///       ^    ^  ^    ^
///       |    |  |    |
/// col1->[    ]  |    |
/// col2--------->[    ]
///
/// In this case, `get_record_batch_memory_size` will return the size of
/// the buffer, instead of the sum of `col1`'s and `col2`'s actual memory sizes.
///
/// Note: the current `RecordBatch::get_array_memory_size()` will double count
/// the buffer memory size if multiple arrays within the batch share the same
/// `Buffer`. This method provides a temporary fix until the issue is resolved:
/// <https://github.com/apache/arrow-rs/issues/6439>
pub fn get_record_batch_memory_size(batch: &RecordBatch) -> usize {
...
}

This function is used during spilling execution to estimate physical memory usage, and the overestimation has caused many bugs in memory-limited sort/aggregation/join. For example, if a RecordBatch has 10 columns that all share the same Buffer, record_batch.get_array_memory_size() returns a 10X estimate, making memory-limited queries fail quite easily.

I believe #13089 is caused by this issue, and likely #9417, #10511, #12136, and #11390 as well.

What changes are included in this PR?

Introduced a new get_record_batch_memory_size() that avoids double counting by using an internal HashSet to recognize reused buffers (see the sketch after the list below).
While @waynexia is working on a comprehensive solution in arrow-rs (apache/arrow-rs#6439), I think it's useful to introduce this temporary fix in DataFusion because:

  • After fixing record_batch.get_array_memory_size()'s overcounting, it's non-trivial to fix all tests at once (manual memory tracking is tricky; when I was trying to make one external aggregate query run, it took me a while to figure out why one test case failed after a change).
  • If we adopt this temporary fix, we can gradually swap out record_batch.get_array_memory_size() and add regression tests for memory-limited query bugs. Once arrow has a fix, the temporary function can be deprecated and replaced with the original one more easily.
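
A minimal sketch of the approach (simplified, assembled from the snippets quoted in the review below; the merged implementation may differ in details): walk each column's ArrayData tree, and add a buffer's capacity to the total only the first time its data pointer is seen.

use std::collections::HashSet;
use std::ptr::NonNull;

use arrow::array::{Array, ArrayData};
use arrow::record_batch::RecordBatch;

/// Sum each distinct `Buffer`'s capacity exactly once, keyed by its data pointer.
pub fn get_record_batch_memory_size(batch: &RecordBatch) -> usize {
    let mut counted_buffers: HashSet<NonNull<u8>> = HashSet::new();
    let mut total_size = 0;

    for array in batch.columns() {
        let array_data = array.to_data();
        count_array_data_memory_size(&array_data, &mut counted_buffers, &mut total_size);
    }

    total_size
}

fn count_array_data_memory_size(
    array_data: &ArrayData,
    counted_buffers: &mut HashSet<NonNull<u8>>,
    total_size: &mut usize,
) {
    // Count each of `array_data`'s own buffers on first sight only
    for buffer in array_data.buffers() {
        if counted_buffers.insert(buffer.data_ptr()) {
            *total_size += buffer.capacity();
        } // Otherwise this buffer's memory was already counted
    }

    // The validity (null) bitmap is backed by a `Buffer` too
    if let Some(nulls) = array_data.nulls() {
        let buffer = nulls.inner().inner();
        if counted_buffers.insert(buffer.data_ptr()) {
            *total_size += buffer.capacity();
        }
    }

    // Count all children `ArrayData` recursively (nested types)
    for child in array_data.child_data() {
        count_array_data_memory_size(child, counted_buffers, total_size);
    }
}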

Are these changes tested?

Yes

Are there any user-facing changes?

No

@github-actions bot added the physical-expr (Physical Expressions) and core (Core DataFusion crate) labels on Nov 12, 2024
@blaginin (Contributor) commented Nov 12, 2024

This PR does indeed fix #10511 😀. I just tested the branch, and the code that crashes in main works perfectly here

Comment on lines +173 to +176
// Count all children `ArrayData` recursively
for child in array_data.child_data() {
count_array_data_memory_size(child, counted_buffers, total_size);
}
Contributor

Does it make sense to use #[recursive] to protect from cases with large nested data types?

Contributor Author

I've learned something new today.
Maybe apache/datafusion-sqlparser-rs#984 can be fixed with this attribute.
But this attribute comes with a performance overhead 🤔 (https://docs.rs/recursive/latest/recursive/). I think stack overflow will only happen after tens of layers of recursion, which is likely for expressions, but I'm not sure arrays can have such deep nesting.
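
For illustration, a small self-contained toy example of what the attribute does (a hypothetical sketch, not code from this PR): on entry it checks the remaining stack and, when it runs low, continues on a freshly allocated stack segment.

use recursive::recursive;

// A toy nested type standing in for deeply nested `ArrayData`
enum Tree {
    Leaf(u64),
    Node(Box<Tree>, Box<Tree>),
}

// `#[recursive]` checks the remaining stack space on entry and, if it is
// running low, continues on a freshly allocated stack segment, trading a
// small per-call cost for protection against stack overflow.
#[recursive]
fn sum(tree: &Tree) -> u64 {
    match tree {
        Tree::Leaf(v) => *v,
        Tree::Node(l, r) => sum(l) + sum(r),
    }
}

fn main() {
    // A chain one million nodes deep would overflow the default stack
    // without the attribute
    let mut tree = Tree::Leaf(0);
    for i in 1..=1_000_000u64 {
        tree = Tree::Node(Box::new(Tree::Leaf(i)), Box::new(tree));
    }
    assert_eq!(sum(&tree), (0..=1_000_000u64).sum::<u64>());
    // Dropping such a deep chain recurses too, so leak it in this demo
    std::mem::forget(tree);
}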

Contributor

Yes, I agree we don't need to annotate all recursive function calls -- only the ones that can become very large/deep.

counted_buffers: &mut HashSet<NonNull<u8>>,
total_size: &mut usize,
) {
// Count memory usage for `array_data`
Contributor

Nit, but you could probably also add the size of array_data.data_type itself.

Contributor Author

I think this approach also misses some other metadata's memory size (like the data type and buffer pointers); those will be included in the more comprehensive fix on the arrow side.
For memory counting in large memory consumers, some inaccuracy is acceptable as long as the major consumption is counted. However, I agree this should be better documented.

/// buffer memory size if multiple arrays within the batch are sharing the same
/// `Buffer`. This method provides temporary fix until the issue is resolved:
/// <https://github.com/apache/arrow-rs/issues/6439>
pub fn get_record_batch_memory_size(batch: &RecordBatch) -> usize {
Contributor

In TopK, RecordBatchStore still uses get_array_memory_size; do you think we should switch to get_record_batch_memory_size there as well?

Contributor Author

I think they should all be changed; however, after changing them in TopK, some existing test cases might be tricky to fix, and more end-to-end tests should be added, so I plan to do it incrementally.

Contributor

Cool -- can you possibly file a ticket to track any work that you know about? I can help file it / with the explanation as well


let size = get_record_batch_memory_size(&batch);
assert_eq!(size, 8320);
}
}
Contributor

[screenshot: code coverage view]

I think this line isn't covered, because I commented it out and all tests in this file passed. Let's add one more test?

Contributor Author

Agreed.
Also, I believe there are tools that can automatically do similar checks (mutate the code and make sure some test case fails; if none does, there is a gap in test coverage), like https://mutants.rs/
We can investigate how to integrate them into the project 😄

.unwrap();

let size = get_record_batch_memory_size(&batch);
assert_eq!(size, 60);
@blaginin (Contributor) Nov 12, 2024

My only concern with this PR is that the result of get_record_batch_memory_size differs from get_array_memory_size. For example, here batch.get_array_memory_size() would return 252 instead of 60.

This could be dangerous because the project would end up with two different methods of calculating memory sizes. I can imagine a scenario in the future, where we reserve memory based on one calculation method and shrink it using the result from the other. While the difference may not be large each time, over many repetitions or a large dataset, it could behave almost like a memory leak (but without actual memory), making debugging very challenging...
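
To make that concrete, a hypothetical sketch against DataFusion's memory pool API (the function, loop, and pool size are illustrative, reusing the 252/60 figures above):

use std::sync::Arc;

use datafusion::error::Result;
use datafusion::execution::memory_pool::{GreedyMemoryPool, MemoryConsumer, MemoryPool};

// Reserving with one size estimate but shrinking with the other leaves
// reservation behind on every iteration: a "phantom leak" that holds no
// real memory, yet eventually makes the pool reject new requests.
fn demo_phantom_leak() -> Result<()> {
    let pool: Arc<dyn MemoryPool> = Arc::new(GreedyMemoryPool::new(1024 * 1024));
    let mut reservation = MemoryConsumer::new("demo").register(&pool);

    for _ in 0..1000 {
        reservation.try_grow(252)?; // reserve via get_array_memory_size()
        // ... process and release the batch ...
        reservation.shrink(60); // release via get_record_batch_memory_size()
    }
    // 1000 * (252 - 60) = 192_000 bytes now look "in use" to the pool,
    // even though no batch is actually held anymore.
    Ok(())
}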

Contributor

Should we completely switch to the new method, blocking the usage of the old one? Should we try to make two numbers match closely?

Contributor Author

This is a great point. I also feel that this manual memory accounting is complex and error-prone. We’d better change all of it. (Maybe also use some RAII in the implementation, instead of manually growing and shrinking memory usage as we’re doing right now.)

Contributor

Finding a way to automatically update the memory accounting is certainly a good idea in my mind. As we have mentioned, I think the most important thing will be to find a way to account for arrow buffers completely. Then we can work it into DataFusion.

@alamb (Contributor) commented Nov 13, 2024

Thanks @2010YOUY01 -- will look at this later today or tomorrow


// Merge operation needs extra memory to do row conversion, so make the
// memory limit larger.
let mem_limit = partition_size * 2;
Contributor

This could be introduced as a DataFusion parameter so the user can configure this memory allocation margin. I have a feeling the memory needed is data-dependent, depending on the data types/data being processed.

@comphead (Contributor) left a comment

Really love the documentation, so no need to go through the code.

One thing to mention: how fast is this method? I believe it will be called frequently.

@2010YOUY01 (Contributor Author)

> One thing to mention: how fast is this method? I believe it will be called frequently.

This is a very good point. I think when doing the same fix on the arrow side, we should cache the result inside the RecordBatch (if they're immutable).
I will run some benchmarks.

@alamb (Contributor) left a comment

Thanks @2010YOUY01 and @blaginin -- this PR makes a lot of sense to me

Filing follow-on tickets would be a good idea in my mind.

@@ -109,10 +111,80 @@ pub fn spill_record_batch_by_size(
Ok(())
}

/// Calculate total used memory of this batch.
Contributor

💯 for this comment


let slice1 = original.slice(0, 3);
let slice2 = original.slice(2, 3);

let batch =
Contributor

💯

.unwrap();

let size = get_record_batch_memory_size(&batch);
// The size should only count the shared buffer once
Contributor

It would be good in my mind to change this test so that rather than testing a hard-coded size, it would compute the size of a single slice and verify that it is the same.

That way the test would verify the actual invariant (that the sizes are the same) rather than relying on keeping the two values in sync.
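
For example, a hypothetical version of the test that checks the invariant directly (names are illustrative, and get_record_batch_memory_size is assumed to be in scope):

use std::sync::Arc;

use arrow::array::{Array, ArrayRef, Int32Array};
use arrow::record_batch::RecordBatch;

#[test]
fn test_memory_size_counts_shared_buffer_once() {
    let original: ArrayRef = Arc::new(Int32Array::from(vec![1, 2, 3, 4, 5]));
    let slice1 = original.slice(0, 3);
    let slice2 = original.slice(2, 3);

    // One column vs. two columns backed by the same underlying buffer
    let single = RecordBatch::try_from_iter(vec![("a", slice1.clone())]).unwrap();
    let both =
        RecordBatch::try_from_iter(vec![("a", slice1), ("b", slice2)]).unwrap();

    // Adding a second slice of the same buffer must not change the size
    assert_eq!(
        get_record_batch_memory_size(&both),
        get_record_batch_memory_size(&single)
    );
}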

@2010YOUY01 (Contributor Author)

Thank you all for the feedback! I've updated the following:

@comphead (Contributor) left a comment

lgtm thanks @2010YOUY01

@comphead merged commit 172cf8d into apache:main on Nov 15, 2024
25 checks passed
@2010YOUY01 deleted the fix-batch-size branch on November 16, 2024 02:51