Preliminary checks

I have checked that there are no other Issues or Pull Requests covering the same topic.
Describe the bug
So I have already read about the OOM killer terminating mdedup because of high memory usage, but I'm still quite puzzled here.
I ran mdedup on a set of 590k emails.
Once about 23k emails were processed, mdedup had a memory usage of 42 GB, which is roughly 1.8 MB per email! At that pace, I'd need 1 TB of RAM to dedup all my emails.
This number climbed almost proportionally, which makes me think that, since I used the `-b raw` parameter, mdedup keeps not only paths and the corresponding sets of hashes in memory, but full emails.
I've done another test without the `-b` switch, and this time I got about 15 GB for 23k emails, which is still a very high number. At that pace I'd need about 400 GB to get my emails deduped.
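In numbers, here is the extrapolation I'm making (a simple sketch; the figures are the ones observed above, and I'm assuming usage keeps growing linearly):

```python
# Extrapolate observed memory usage to the full 590k-email set,
# assuming growth stays roughly linear (as observed).
EMAILS_TOTAL = 590_000

for label, rss_gib, emails_done in [("-b raw", 42, 23_000), ("no -b", 15, 23_000)]:
    per_email = rss_gib * 2**30 / emails_done  # bytes per email so far
    projected = per_email * EMAILS_TOTAL       # bytes for the whole set
    print(f"{label}: {per_email / 2**20:.1f} MiB/email "
          f"-> ~{projected / 2**40:.2f} TiB projected")
# -b raw: ~1.9 MiB/email -> ~1.05 TiB
# no -b:  ~0.7 MiB/email -> ~0.38 TiB
```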
I've made a quick calculation: keeping 10 SHA-256 hashes (32 bytes each) in memory per email would require about 180 MB of data (regardless of the data representation), plus let's say 1 KB per filename, which would require another whopping 576 MB, for a total of less than a GB of data.
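Here is that back-of-the-envelope estimate spelled out (the per-email counts are my assumptions above, not measured values):

```python
# Ideal-case estimate: only digests and paths in memory, no parsed messages.
N_EMAILS = 590_000
HASHES_PER_EMAIL = 10   # assumed number of hashes kept per email
DIGEST_SIZE = 32        # bytes in a raw SHA-256 digest
FILENAME_SIZE = 1024    # a generous 1 KiB per file path

hash_bytes = N_EMAILS * HASHES_PER_EMAIL * DIGEST_SIZE
path_bytes = N_EMAILS * FILENAME_SIZE
print(f"hashes:    {hash_bytes / 2**20:.0f} MiB")                 # ~180 MiB
print(f"filenames: {path_bytes / 2**20:.0f} MiB")                 # ~576 MiB
print(f"total:     {(hash_bytes + path_bytes) / 2**30:.2f} GiB")  # ~0.74 GiB
```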
Now I can imagine that the dedup table takes some memory, but 1 TB of memory? Even ZFS isn't that hungry.
This program is highly memory-inefficient; I think it keeps way more than just sets of hashes. Does it keep the whole in-memory representation of an email object for every email? Are my findings normal or off the charts?
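To make that suspicion concrete, here is a small standard-library sketch (a synthetic message, nothing mdedup-specific) comparing what a fully parsed message object keeps alive versus a bare digest:

```python
import email
import hashlib
import tracemalloc

# Synthetic ~100 KiB message; a real maildir message would behave similarly.
raw = b"Subject: test\r\nFrom: a@example.com\r\n\r\n" + b"x" * 100_000

tracemalloc.start()
msg = email.message_from_bytes(raw)    # full parsed Message object
current, _peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

digest = hashlib.sha256(raw).digest()  # 32-byte fingerprint of the same message
print(f"parsed Message keeps ~{current / 1024:.0f} KiB alive")  # >= body size
print(f"digest keeps {len(digest)} bytes alive")                # 32 bytes
```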
I've had issues before with high memory consumption when dealing with lots of JSON data. I'd advise having a look at the `msgspec` library, which saved me about 40% memory usage on a JSON dict with millions of entries.
Anyway, I'd advise keeping only the hashes in memory instead of whatever class-object representation is currently being kept.
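For example (a hypothetical record layout, not mdedup's actual data model), a compact `msgspec.Struct` holding just the path and the digests stays tiny per instance:

```python
import msgspec

# Hypothetical compact per-email record: just what dedup needs, nothing more.
class EmailRecord(msgspec.Struct, frozen=True):
    path: str                   # location of the message on disk
    digests: tuple[bytes, ...]  # raw SHA-256 digests used for dedup

rec = EmailRecord(
    path="/home/user/Maildir/cur/1234567890.example:2,S",  # placeholder path
    digests=(b"\x00" * 32,),                               # placeholder digest
)
# msgspec structs carry no per-instance __dict__, so each record costs tens
# of bytes plus its payload -- nowhere near 1.8 MB per email.
```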
Additional context
mdedup output just before I had to stop it:

[Image: Memory usage of mdedup]
[Image: Memory map of mdedup]
Environment
Running on RHEL 9.4 x64.
Additional context
Sorry if my wording was a bit harsh; I'm just totally frightened by those high memory usage numbers.