Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

use regular jdk String.intern() #350

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

mpollmeier
Copy link
Contributor

I'm not sure why we didn't use String.intern() from the start.
This PR doesn't only simplify our codebase, but saves us the
internedStrings ConcurrentHashMap. The heap size for a large CPG
(linux 4 kernel) goes down from 35830m to 35260m, i.e. 570m, or 1.6%.

I'm not sure why we didn't use `String.intern()` from the start.
This PR doesn't only simplify our codebase, but saves us the
`internedStrings` ConcurrentHashMap. The heap size for a large CPG
(linux 4 kernel) goes down from 35830m to 35260m, i.e. 570m, or 1.6%.
@bbrehm
Copy link
Contributor

bbrehm commented Jan 30, 2023

cf https://shipilev.net/jvm/anatomy-quarks/10-string-intern/

The tl;dr is that String.intern() is unfortunately sharing tables with JVM-internals (e.g. strings from class-file constant pools) and is probably inappropriate; and unfortunately there is no ready-made alternative in the standard library.

However, I'm open to being convinced by benchmarks. A critical question (in context of mesh) would be impact on GC times and memory consumption if we do multiple scans without restarting the JVM in the middle.

@mpollmeier
Copy link
Contributor Author

Do you really think that String.intern() is on the hot path for mesh? I very much doubt it...
Apart from mesh being closed source, the entire stack is dockerised. So in order to create those benchmarks I'd need to jump through a few hoops, and it's only reproducable for SL interns.

How about some scans on joern instead? I don't want to make a scientific paper out of it, just a few observations re GC and memory usage for multiple scans on the same JVM instance - sounds good?

@ml86
Copy link
Contributor

ml86 commented Jan 30, 2023

Sounds like we want to keep things the way they are. Sadly there is no date on the linked article but i would also want to see some benchmarks before continuing with this.

@mpollmeier
Copy link
Contributor Author

i would also want to see some benchmarks before continuing with this.

would you be happy with the one described in my comment above?

@bbrehm
Copy link
Contributor

bbrehm commented Jan 30, 2023

How about some scans on joern instead?

Sounds good to me. Just run a couple different scans after another. My main point about mesh was that we must include multiple scans after each other without jvm restart, because string.intern() uses special stuff and we pay not just for the call to string.intern but also distributed over future GC pauses.

However, it sounds like the general consensus is "never ever use String.intern(), always use custom interner".

I would not be opposed to writing a more appropriate hash-table for string interning, though. How did you identify the interning as a problem? ("more appropriate": We can relax some requirements inherent to ConcurrentHashMap, so I would not be surptrised if we can do better for our specific use case)

@mpollmeier
Copy link
Contributor Author

How did you identify the interning as a problem?

When I opened up a large database heapdump (CPG with linux 4 kernel), the StringInterner stood out as one of the big blobs.

Copy link
Contributor

@bbrehm bbrehm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's think this over for some time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants