Concurrency issue inside com.fasterxml.jackson.databind.util.LRUMap.get(Object) #503
Comments
Just to clarify, the line LinkedHashMap.java:333 inside J9 is related to the update of the internal structure when the entry is found and accessOrder is true.
Ouch. These are nasty problems. Thank you for reporting this and analysing the problem. You are right in that there are performance implications, so I wonder if the use of a read/write lock is pointless given this issue; basically all access becomes write access, so simpler locking makes sense.
By the way, I just tested a few solutions and the best one is to make all methods synchronized.
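The "make all methods synchronized" option discussed above can be sketched as follows. This is an illustrative reconstruction, not the actual jackson-databind `LRUMap` code: the class name and constructor parameters are assumptions, but the shape (a `LinkedHashMap` with `accessOrder = true` whose accessors are all synchronized, since even `get()` mutates the access-order links) matches what the thread describes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: every method that touches the map runs under the
// monitor, because with accessOrder = true even get() is a structural
// modification (it relinks the accessed entry to the tail).
public class SynchronizedLRUMap<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    public SynchronizedLRUMap(int initialEntries, int maxEntries) {
        super(initialEntries, 0.8f, true); // accessOrder = true: true LRU ordering
        this.maxEntries = maxEntries;
    }

    @Override
    public synchronized V get(Object key) {
        return super.get(key);
    }

    @Override
    public synchronized V put(K key, V value) {
        return super.put(key, value);
    }

    // Called by put(); evicts the least-recently-used entry once the
    // size bound is exceeded.
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }
}
```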
@fjtc Another possibility is to just make it plain
I already tested a solution that leaves the r/w lock as it is now but changes accessOrder to false. It works better than any other version with accessOrder set to true, but I discarded it as a valid solution because it drops the LRU behavior. If the LRU behavior is not mandatory, I think the use of a HashMap with a r/w lock is indeed a better solution from the point of view of performance. On the other hand, the lack of the LRU behavior lets the cache grow without control.
My intent was to drop the LRU aspect, but NOT the size limit: that is, when doing

Thank you for doing measurements; it sounds like this approach is the way to go then. The only (?) other option would be to consider use of
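A minimal sketch of that direction, keeping the size bound while dropping LRU ordering, is below. The class and names here are illustrative assumptions (not the actual jackson-databind fix): a `ConcurrentHashMap` that simply flushes everything when it would exceed its cap, rather than evicting a chosen victim.

```java
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch: no LRU ordering, just a hard size bound.
// When the bound is hit, the whole cache is cleared -- crude, but
// every individual operation stays thread-safe.
public class BoundedCache<K, V> {
    private final ConcurrentHashMap<K, V> map;
    private final int maxEntries;

    public BoundedCache(int initialEntries, int maxEntries) {
        this.map = new ConcurrentHashMap<>(initialEntries, 0.8f, 4);
        this.maxEntries = maxEntries;
    }

    public V get(Object key) {
        return map.get(key);
    }

    public V put(K key, V value) {
        if (map.size() >= maxEntries) {
            // Double-checked under a lock so concurrent writers do not
            // all clear the map at once.
            synchronized (map) {
                if (map.size() >= maxEntries) {
                    map.clear();
                }
            }
        }
        return map.put(key, value);
    }
}
```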
Depends on the level of concurrency you expect.
Ok: I fixed by using
Thanks for the prompt response @cowtowncoder. Given the nature of the issue, I think it would be better to release this fix as soon as possible in order to replace version 2.4.1.1, because the problem leads to a NullPointerException on J9 and affects OpenJDK as well (I looked at the code inside OpenJDK and found out that this error may also lead to a broken linked list inside the LinkedHashMap, but I think there is a smaller chance of exceptions). Since I'll not be able to wait for the next release of the library, I'm already using a private build that I forked from version 2.4.1.1 as an emergency solution for my project. Given that, it would be better for the project to include this fix in the next planned version. Anyway, I'll help you evaluate the performance of your fix against the synchronization options I already tested, and as soon as I get the results, I'll send them to you. I may also contribute the program I'm using to reproduce the issue and evaluate the fixes.
I just finished my first batch of tests. The preliminary results are:

Old implementation using synchronized: Total time: 32069

The most important thing about the new implementation is that it works as expected. My test is not complete yet because it executes a lot of serializations of only 3 distinct classes using 50 threads (it was enough to trigger the bug on J9). Given that, the removal of classes from the cache was never tested. I'll improve the test program in order to activate the cache pruning as well.
@fjtc Thank you for evaluating these options. I agree in that correctness is the key requirement (obviously), and at least ConcurrentHashMap has no worse performance. Interesting to see that the old version with basic synchronized still performs about as well. As to the release, since pushing a micro-release 2.4.1.2 is very easy (being just one jar, not all 20+ Jackson components), I will just go ahead and push it. Full 2.4.2 is still a couple of weeks off (should be done by end of July, but there are a few other issues that should be resolved in other components), and anyone hitting this issue will want a fix ASAP.
And now 2.4.1.2 is released to Sonatype OSS repo, from which it should sync up to Maven Central Repo shortly.
Thanks @cowtowncoder. I'll start to use version 2.4.1.2 in the next build of our project. As I told you before, I performed a more aggressive test designed to force the cache to purge instances. I modified my test program to run a lot of serializations of 300 distinct classes using 50 threads; on each cycle, the program picks a random class to serialize (they all have the same structure but they are not related). The new results are:

New implementation using ConcurrentHashMap: Total time: 124876

These results are quite interesting because this version does force the cache pruning a few times, and the old synchronized version uses a true LRU approach while the new version purges the whole cache. It is curious to see that the full purge implemented by the new version is more effective under high load than the selective purge/LRU cache management used by the older version. The only problem with my test is that each of those 300 classes contains only 2 string fields, so I don't know if this situation will hold for a set of more complex objects. I also performed a few tests with distinct values for concurrencyLevel. Since my computer has 4 cores, the best performance was achieved with this parameter set to 4. Smaller values reduce the performance a little bit, and increasing the value also produces small performance penalties, but it appears that increasing the value is worse than reducing it. Given that, I think that for current processor standards, 4 is indeed the best value. Thanks again for the prompt response.
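For context, the concurrencyLevel being tuned above is the third argument of the (pre-Java-8) `ConcurrentHashMap` constructor. A small illustrative sketch, with an assumed factory name and capacity values chosen only for the example:

```java
import java.util.concurrent.ConcurrentHashMap;

public class CacheFactory {
    // concurrencyLevel is the estimated number of concurrently updating
    // threads; the pre-Java-8 ConcurrentHashMap used it to size its
    // internal segment array (rounded up to a power of two). The
    // commenter's measurements favored 4 on a 4-core machine.
    static <K, V> ConcurrentHashMap<K, V> newCache() {
        return new ConcurrentHashMap<>(100,  // initial capacity
                                       0.8f, // load factor
                                       4);   // concurrencyLevel
    }
}
```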
@fjtc Very interesting! Thank you for running these tests and sharing the results. It sounds like the tests verify the worst-case scenario, since handling of values of small classes has relatively higher overhead from the type cache. It is indeed interesting that the full purge is faster; my guess was that it should not be slower, but I wouldn't have predicted it to be faster. It is also good to have some indication that the choice of 4 is reasonable, although obviously tests with more cores could have slightly different results. But I suspect that the benefit from a higher branching factor would be nominal, even with significantly more cores, at least for typical use cases.
This seems to have broken something in Jersey 2.10.1 + jackson-jaxrs-json-provider 2.4.1:

javax.ws.rs.ProcessingException: com.fasterxml.jackson.databind.util.LRUMap.put(Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;
I wish we had a cascading build to catch these... thank you for reporting this; otherwise I'd have found it when trying to build 2.4.2.
Interesting. So this is partly due to the half-baked nature of
Ok. While I could remove the dependency from JAX-RS itself, there is another one from the XML module, so in the end I ended up changing

I also pushed jackson-databind-2.4.1.3 with the fix; otherwise I'd have to push revisions of the JAX-RS, XML and JAXB annotations modules. 2.4.2 will work properly as well. Apologies for the problem here.
No worries, the tests pass after updating to 2.4.1.3. Another possible solution is https://code.google.com/p/concurrentlinkedhashmap/ which is used by Cassandra.
@azell Thanks for the link. Looks very useful, and will keep that in mind for future reference; the only concern being the additional dependency (but could shade that in). I'll probably use that for some of my own caching outside of Jackson now that I am aware of it; esp. the ability to bound by value size (not just entry count) is something I have been looking for.
By the way, the PR to fix this bug seems to have dropped the LRU semantics of LRUMap; now it clears the entire thing every time it hits max capacity. Is this the intended behavior? That seems to be very weird semantics for a cache.
@aarondav Yes, this is intentional. The JDK's LRU behavior is not thread-safe, so the choice is to either use synchronization, which appears to have a major negative impact, or to just flush the whole cache (or try incrementally clearing arbitrary entries -- I guess that'd be a third way). But as you can see from the earlier results, the full flush turns out to be not only simple but nicely performant; sometimes the simplest approach works best.
We just found an issue inside com.fasterxml.jackson.databind.util.LRUMap that results in cache corruption when running on IBM J9 under high load.
The error manifests with the following stack (jackson-databind-2.4.1.1):
After analyzing the issue, we found out that the implementation of com.fasterxml.jackson.databind.util.LRUMap.get(Object) incorrectly assumes that the superclass implementation of this method is read-only. This is not true, because the constructor of LRUMap sets accessOrder to true, thus allowing get() to make changes to the internal structure of the map in order to optimize access (see http://docs.oracle.com/javase/6/docs/api/java/util/LinkedHashMap.html for further details).
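The structural side effect of get() is easy to demonstrate outside Jackson. The snippet below (an illustrative demo class, not part of the library) shows that with accessOrder = true a successful lookup counts as a structural modification: it relinks the accessed entry to the tail, which invalidates any live iterator.

```java
import java.util.ConcurrentModificationException;
import java.util.LinkedHashMap;
import java.util.Map;

public class AccessOrderDemo {
    // Returns true if get() structurally modified the map during
    // iteration. With accessOrder = true, get("a") moves "a" to the
    // tail of the access-order list, bumping modCount, so the next
    // iterator step throws ConcurrentModificationException.
    static boolean getMutatesDuringIteration(boolean accessOrder) {
        Map<String, Integer> m = new LinkedHashMap<>(16, 0.75f, accessOrder);
        m.put("a", 1);
        m.put("b", 2);
        m.put("c", 3);
        try {
            for (String k : m.keySet()) {
                m.get("a"); // a plain lookup -- yet structural if accessOrder is true
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }
}
```

This is exactly why two threads calling get() concurrently can corrupt the linked list even though neither of them "writes" to the map.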
Given that, the implementation of com.fasterxml.jackson.databind.util.LRUMap.get(Object) must protect the call with a write lock instead of a read lock:
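The original code snippet is missing here; the sketch below reconstructs the proposed shape of the fix under the assumption (consistent with the report) that LRUMap extends LinkedHashMap with accessOrder = true and guards access with a ReentrantReadWriteLock. The class name and parameters are illustrative, not the actual jackson-databind source.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative reconstruction of the proposed fix: since get() mutates
// the access-order links, it must take the WRITE lock; the original bug
// was taking only the read lock here.
public class RWLockLRUMap<K, V> extends LinkedHashMap<K, V> {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final int maxEntries;

    public RWLockLRUMap(int initialEntries, int maxEntries) {
        super(initialEntries, 0.8f, true); // accessOrder = true
        this.maxEntries = maxEntries;
    }

    @Override
    public V get(Object key) {
        lock.writeLock().lock(); // write lock: get() relinks the entry
        try {
            return super.get(key);
        } finally {
            lock.writeLock().unlock();
        }
    }

    @Override
    public V put(K key, V value) {
        lock.writeLock().lock();
        try {
            return super.put(key, value);
        } finally {
            lock.writeLock().unlock();
        }
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }
}
```

Note that with this change the read lock is never used for map access, which is what prompted the discussion above about whether the read/write lock is worth keeping at all.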
It is also possible to solve this issue by changing the constructor parameter accessOrder to false and leaving the current implementation of get() as is. This, however, will disable the LRU behavior of this implementation.
I know that this change will result in performance issues due to the need for synchronization. I'll test the performance of other synchronization options as well in order to determine the best solution for this problem.
As far as I could see, this issue is closely related to the issue #435.