
Concurrency issue inside com.fasterxml.jackson.databind.util.LRUMap.get(Object) #503

Closed
fjtc opened this issue Jul 10, 2014 · 22 comments

@fjtc

fjtc commented Jul 10, 2014

We just found an issue inside com.fasterxml.jackson.databind.util.LRUMap that results in cache corruption when running on IBM J9 under high load.

The error manifests with the following stack trace (jackson-databind-2.4.1.1):

at java.util.LinkedHashMap.get(LinkedHashMap.java:333)
at com.fasterxml.jackson.databind.util.LRUMap.get(LRUMap.java:58)
at com.fasterxml.jackson.databind.type.TypeFactory._fromClass(TypeFactory.java:707)
at com.fasterxml.jackson.databind.type.TypeFactory._constructType(TypeFactory.java:387)
at com.fasterxml.jackson.databind.type.TypeFactory._fromParamType(TypeFactory.java:801)
at com.fasterxml.jackson.databind.type.TypeFactory._constructType(TypeFactory.java:391)
at com.fasterxml.jackson.databind.type.TypeBindings.resolveType(TypeBindings.java:102)
at com.fasterxml.jackson.databind.introspect.Annotated.getType(Annotated.java:60)
at com.fasterxml.jackson.databind.ser.BeanSerializerFactory._constructWriter(BeanSerializerFactory.java:703)
at com.fasterxml.jackson.databind.ser.BeanSerializerFactory.findBeanProperties(BeanSerializerFactory.java:559)
at com.fasterxml.jackson.databind.ser.BeanSerializerFactory.constructBeanSerializer(BeanSerializerFactory.java:344)
at com.fasterxml.jackson.databind.ser.BeanSerializerFactory.findBeanSerializer(BeanSerializerFactory.java:263)
at com.fasterxml.jackson.databind.ser.BeanSerializerFactory._createSerializer2(BeanSerializerFactory.java:222)
at com.fasterxml.jackson.databind.ser.BeanSerializerFactory.createSerializer(BeanSerializerFactory.java:152)
at com.fasterxml.jackson.databind.SerializerProvider._createUntypedSerializer(SerializerProvider.java:1077)
at com.fasterxml.jackson.databind.SerializerProvider._createAndCacheUntypedSerializer(SerializerProvider.java:1037)
at com.fasterxml.jackson.databind.SerializerProvider.findValueSerializer(SerializerProvider.java:445)
at com.fasterxml.jackson.databind.SerializerProvider.findTypedValueSerializer(SerializerProvider.java:599)
at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:92)
at com.fasterxml.jackson.databind.ObjectMapper._configAndWriteValue(ObjectMapper.java:2866)
at com.fasterxml.jackson.databind.ObjectMapper.writeValue(ObjectMapper.java:2289)

After analyzing the issue, we found that the implementation of com.fasterxml.jackson.databind.util.LRUMap.get(Object) incorrectly assumes that the superclass implementation of this method is read-only. This is not true, because the constructor of LRUMap sets accessOrder to true, which allows get() to modify the internal structure of the map in order to optimize access (see http://docs.oracle.com/javase/6/docs/api/java/util/LinkedHashMap.html for further details).

Given that, the implementation of com.fasterxml.jackson.databind.util.LRUMap.get(Object) must protect the call with a write lock instead of a read lock:

@Override
public V get(Object key) {
    _writeLock.lock();
    try {
        return super.get(key);
    } finally {
        _writeLock.unlock();
    }
}

It is also possible to solve this issue by changing the constructor parameter accessOrder to false and leaving the current implementation of get() as is. This, however, would disable the LRU behavior of the map.
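
For reference, this is a minimal sketch of how a LinkedHashMap-backed LRU map of this kind is typically wired up (the class and field names are illustrative, not the actual LRUMap source); passing accessOrder=true to the constructor is exactly what turns get() into a structurally modifying operation:

import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: a LinkedHashMap subclass configured as an access-ordered,
// size-bounded LRU map. With accessOrder=true, get() moves the accessed entry
// to the end of the internal linked list, i.e. it mutates the map.
public class SimpleLRUMap<K,V> extends LinkedHashMap<K,V>
{
    private final int _maxEntries;

    public SimpleLRUMap(int initialEntries, int maxEntries) {
        // third constructor argument enables access-ordered iteration
        super(initialEntries, 0.8f, true);
        _maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K,V> eldest) {
        // evict the least-recently-used entry once the size bound is exceeded
        return size() > _maxEntries;
    }
}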

I know that this change will result in performance issues due to the need for synchronization. I'll also test the performance of other synchronization options in order to determine the best solution to this problem.

As far as I can see, this issue is closely related to issue #435.

@fjtc
Author

fjtc commented Jul 10, 2014

Just to clarify, the line LinkedHashMap.java:333 inside J9 corresponds to the update of the internal structure that happens when the entry is found and accessOrder is true.

@cowtowncoder
Member

Ouch. These are nasty problems. Thank you for reporting this and analysing the problem.
This is one of those problems that is quite obvious in retrospect (need to modify ordering due to LRU nature), but not so at the time ("well, that IS a read operation for me..." -> except that from impl perspective it is a modifying operation).

You are right in that there are performance implications, so I wonder if use of read/write lock is pointless given this issue; basically all access becomes write-access, and so simpler locking makes sense.

@fjtc
Author

fjtc commented Jul 11, 2014

By the way, I just tested a few solutions and the best one is to make all methods synchronized.

@cowtowncoder
Member

@fjtc Another possibility is to just make it a plain HashMap and drop the LRU aspect, but retain the bounded size (clear the map when max size is reached and a new entry needs to be added). I suspect it would perform fine for the two use cases that exist (root name lookup, type resolver), and have lower sync overhead with an r/w lock.

@fjtc
Author

fjtc commented Jul 11, 2014

I already tested a solution that leaves the r/w lock as it is now but changes accessOrder to false. It works better than any version with accessOrder set to true, but I discarded it as a valid solution because it drops the LRU behavior.

If the LRU behavior is not mandatory, I think that using a HashMap with an r/w lock is indeed a better solution from a performance point of view. On the other hand, the lack of LRU behavior will let the cache grow without control.

@cowtowncoder
Member

My intent was to drop the LRU aspect, but NOT the size limit: that is, when doing put(), the size would be checked, and if it is already at max, clear() would be called (within the write lock). The size limit is the important part, not so much the least-recently-used ordering, although ideally LRU would be retained too. But it is more icing on the cake; I suspect that in most cases all entries fit within the max size, and even if not, I don't see it being a catastrophe to simply flush the contents.

Thank you for doing measurements; it sounds like this approach is the way to go then.

The only (?) other option would be to consider use of ConcurrentHashMap; not sure if that'd have measurable perf benefit over r/w lock synced HashMap.
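
For reference, a minimal sketch of the bounded, non-LRU cache described above, assuming an r/w lock and a full clear() once the size bound is hit (the class and field names are illustrative, not an actual patch):

import java.util.HashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch of a size-bounded cache without LRU ordering: get() only needs the
// read lock because HashMap.get() never mutates the map, and put() flushes
// the whole map under the write lock once the bound is reached.
public class BoundedCache<K,V>
{
    private final HashMap<K,V> _map;
    private final int _maxEntries;
    private final ReentrantReadWriteLock _lock = new ReentrantReadWriteLock();

    public BoundedCache(int initialEntries, int maxEntries) {
        _map = new HashMap<K,V>(initialEntries);
        _maxEntries = maxEntries;
    }

    public V get(Object key) {
        _lock.readLock().lock();
        try {
            return _map.get(key);
        } finally {
            _lock.readLock().unlock();
        }
    }

    public V put(K key, V value) {
        _lock.writeLock().lock();
        try {
            if (_map.size() >= _maxEntries) {
                _map.clear(); // full flush instead of per-entry LRU eviction
            }
            return _map.put(key, value);
        } finally {
            _lock.writeLock().unlock();
        }
    }
}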

@pgelinas
Member

It depends on the level of concurrency you expect. ConcurrentHashMap spreads its buckets over multiple lock segments, and often you are able to write to multiple buckets without any contention. Also, a read operation can usually overlap a write operation, since they may not touch the same bucket. I think ConcurrentHashMap is preferable to a synced HashMap.

cowtowncoder added this to the 2.4.2 milestone Jul 11, 2014
@cowtowncoder
Member

Ok: I fixed this by using ConcurrentHashMap, since that is simplest to use and should perform fine (using a concurrency level of 4 as an educated guess). Please let me know how this performs; it will be in 2.4.2 at least, or, if there is need, in a separate micro-patch (2.4.1.2).
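
For reference, a rough sketch of the shape of that fix as described in this thread, a ConcurrentHashMap with concurrency level 4 that is flushed wholesale when the entry bound is reached (the actual 2.4.1.2 LRUMap may differ in its details):

import java.util.concurrent.ConcurrentHashMap;

// Sketch only: lock-free reads via ConcurrentHashMap, with a full clear()
// instead of LRU eviction once the size bound is reached.
public class ConcurrentBoundedCache<K,V>
{
    private final ConcurrentHashMap<K,V> _map;
    private final int _maxEntries;

    public ConcurrentBoundedCache(int initialEntries, int maxEntries) {
        // (initial capacity, load factor, concurrency level)
        _map = new ConcurrentHashMap<K,V>(initialEntries, 0.8f, 4);
        _maxEntries = maxEntries;
    }

    public V get(Object key) {
        return _map.get(key);
    }

    public V put(K key, V value) {
        if (_map.size() >= _maxEntries) {
            _map.clear(); // full flush; the bound is approximate under concurrency
        }
        return _map.put(key, value);
    }
}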

@fjtc
Author

fjtc commented Jul 12, 2014

Thanks for the prompt response @cowtowncoder. Given the nature of the issue, I think it would be better to release this fix as soon as possible to replace version 2.4.1.1, because the problem leads to a NullPointerException on J9 and affects OpenJDK as well (I looked at the OpenJDK code and found that this error may also lead to a broken linked list inside the LinkedHashMap, although I think there is a smaller chance of exceptions there).

Since I can't wait for the next release of the library, I'm already using a private build forked from version 2.4.1.1 as an emergency solution for my project. Given that, it is fine for the project to include this fix in the next planned version.

Anyway, I'll help you evaluate the performance of your fix against the synchronization options I already tested, and as soon as I get the results I'll send them to you. I may also contribute the program I'm using to reproduce the issue and evaluate the fixes.

@fjtc
Author

fjtc commented Jul 12, 2014

I just finished my first batch of tests. The preliminary results are:

Old implementation using synchronized: Total time:32069
Old implementation using concurrent lock: Total time:45528
New implementation using ConcurrentHashMap: Total time:32674

The most important thing about the new implementation is that it works as expected.

My test is not complete yet: it executes a lot of serializations of 3 distinct classes using 50 threads (which was enough to trigger the bug on J9), so the removal of classes from the cache was never exercised. I'll improve the test program in order to activate the cache pruning as well.
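
The test program itself is not posted in this thread; the following is only a minimal sketch, under assumed names and sizes, of that kind of stress test: many threads hammering a shared ObjectMapper with a handful of bean classes so that the type cache is hit concurrently.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import com.fasterxml.jackson.databind.ObjectMapper;

// Hypothetical stress test: 50 threads repeatedly serializing 3 small beans.
public class SerializationStressTest
{
    static class BeanA { public String name = "a"; public int value = 1; }
    static class BeanB { public String name = "b"; public int value = 2; }
    static class BeanC { public String name = "c"; public int value = 3; }

    public static void main(String[] args) throws Exception {
        final ObjectMapper mapper = new ObjectMapper();
        final Object[] beans = { new BeanA(), new BeanB(), new BeanC() };
        ExecutorService pool = Executors.newFixedThreadPool(50);
        long start = System.currentTimeMillis();
        for (int i = 0; i < 50; ++i) {
            pool.submit(new Runnable() {
                @Override
                public void run() {
                    try {
                        for (int j = 0; j < 100000; ++j) {
                            // each serialization goes through the shared type cache
                            mapper.writeValueAsString(beans[j % beans.length]);
                        }
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        System.out.println("Total time:" + (System.currentTimeMillis() - start));
    }
}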

@cowtowncoder
Member

@fjtc Thank you for evaluating these options. I agree that correctness is the key requirement (obviously), and at least ConcurrentHashMap performs no worse. Interesting to see that the old implementation with basic synchronized still performs about as well.

As to the release, since pushing a micro-release 2.4.1.2 is very easy (being just one jar, not all 20+ Jackson components), I will just go ahead and push it. Full 2.4.2 is still a couple of weeks off (it should be done by the end of July, but there are a few other issues that should be resolved in other components), and anyone hitting this issue will want a fix ASAP.

cowtowncoder modified the milestones: 2.4.1.2, 2.4.2 Jul 12, 2014
@cowtowncoder
Member

And now 2.4.1.2 is released to Sonatype OSS repo, from which it should sync up to Maven Central Repo shortly.

@fjtc
Author

fjtc commented Jul 13, 2014

Thanks @cowtowncoder. I'll start using version 2.4.1.2 in the next build of our project.

As I told you before, I performed a more aggressive test designed to force the cache to purge entries. I modified my test program to run a lot of serializations of 300 distinct classes using 50 threads; on each cycle, the program picks a random class to serialize (they all have the same structure but are not related). The new results are:

New implementation using ConcurrentHashMap: Total time:124876
Old implementation using synchronized: Total time:132381

These results are quite interesting, because this version does force cache pruning a few times, and the old synchronized version uses a true LRU approach while the new version purges the whole cache.

It is curious to see that the full purge implemented by the new version is more effective under high load than the selective purge/LRU cache management used by the older version. The only problem with my test is that each of those 300 classes contains only 2 string fields, so I don't know whether this result will hold for a set of more complex objects.

I also performed a few tests with distinct values for concurrencyLevel. Since my computer has 4 cores, the best performance was achieved with this parameter set to 4. Smaller values reduce performance a little bit, and increasing the value also produces small penalties, but it appears that increasing the value hurts more than reducing it. Given that, I think that for current processors, 4 is indeed the best value.

Thanks again for the prompt response.

@cowtowncoder
Member

@fjtc Very interesting! Thank you for running these tests and sharing the results. It sounds like the tests exercise a worst-case scenario, since handling values of small classes has relatively higher overhead from the type cache. It is indeed interesting that the full purge is faster; my guess was that it should not be slower, but I wouldn't have predicted it to be faster.

It is also good to have some indication that the choice of 4 is reasonable, although obviously tests with more cores could have slightly different results. But I suspect that the benefit from a higher branching factor would be nominal, even with significantly more cores, at least for typical use cases.

@azell

azell commented Jul 14, 2014

This seems to have broken something in Jersey 2.10.1 + jackson-jaxrs-json-provider 2.4.1:

javax.ws.rs.ProcessingException: com.fasterxml.jackson.databind.util.LRUMap.put(Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;
at com.fasterxml.jackson.jaxrs.base.ProviderBase.writeTo(ProviderBase.java:592)
at org.glassfish.jersey.message.internal.WriterInterceptorExecutor$TerminalWriterInterceptor.invokeWriteTo(WriterInterceptorExecutor.java:263)
at org.glassfish.jersey.message.internal.WriterInterceptorExecutor$TerminalWriterInterceptor.aroundWriteTo(WriterInterceptorExecutor.java:250)
at org.glassfish.jersey.message.internal.WriterInterceptorExecutor.proceed(WriterInterceptorExecutor.java:162)
at org.glassfish.jersey.message.internal.MessageBodyFactory.writeTo(MessageBodyFactory.java:1154)
at org.glassfish.jersey.client.ClientRequest.writeEntity(ClientRequest.java:500)
at org.glassfish.jersey.client.HttpUrlConnector._apply(HttpUrlConnector.java:315)
at org.glassfish.jersey.client.HttpUrlConnector.apply(HttpUrlConnector.java:227)
at org.glassfish.jersey.client.ClientRuntime.invoke(ClientRuntime.java:224)
at org.glassfish.jersey.client.JerseyInvocation$2.call(JerseyInvocation.java:672)
at org.glassfish.jersey.internal.Errors.process(Errors.java:315)
at org.glassfish.jersey.internal.Errors.process(Errors.java:297)
at org.glassfish.jersey.internal.Errors.process(Errors.java:228)
at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:424)
at org.glassfish.jersey.client.JerseyInvocation.invoke(JerseyInvocation.java:668)

@cowtowncoder
Member

I wish we had a cascading build to catch these... thank you for reporting this; otherwise I'd only have found it when trying to build 2.4.2.

@cowtowncoder
Member

Interesting. So this is partly due to the half-baked nature of the java.util.Map interface: the erased signature of put() uses plain Object types. The change unfortunately happens to be source-compatible but binary-incompatible, the nastiest kind of incompatibility.
It is compounded by the fact that the JAX-RS module is accidentally including LRUMap from databind, instead of using the implementation that the base module includes.
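
To illustrate the kind of mismatch involved, here is a hypothetical example (not the actual LRUMap source) of a change that stays source-compatible but breaks binary compatibility:

// Version A: put() returns the previous value, as java.util.Map does. After
// erasure its descriptor is put(Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;
class CacheV1<K,V> {
    public V put(K key, V value) { return null; }
}

// Version B: put() is changed to return void. Callers that ignore the return
// value still compile against this class, but code compiled against version A
// looks up the old descriptor at runtime and fails with NoSuchMethodError.
class CacheV2<K,V> {
    public void put(K key, V value) { }
}

A caller compiled against version A and run against version B fails at runtime even though recompiling it against version B would succeed, which is what makes this kind of incompatibility so easy to miss.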

@cowtowncoder
Member

Ok. While I could remove the dependency from JAX-RS itself, there is another one from the XML module, so in the end I changed LRUMap to use the same signature as java.util.Map, and now the tests pass.

I also pushed jackson-databind-2.4.1.3 with the fix; otherwise I'd have had to push revisions of the JAX-RS, XML and JAXB annotations modules. 2.4.2 will work properly as well.

Apologies for the problem here.

@azell

azell commented Jul 15, 2014

No worries, the tests pass after updating to 2.4.1.3. Another possible solution is https://code.google.com/p/concurrentlinkedhashmap/, which is used by Cassandra.

@cowtowncoder
Member

@azell Thanks for the link. It looks very useful, and I will keep it in mind for future reference; the only concern is the additional dependency (but we could shade that in). I'll probably use it for some of my own caching outside of Jackson now that I am aware of it; especially the ability to bound by value size (not just entry count) is something I have been looking for.

@aarondav
Contributor

aarondav commented Oct 2, 2014

By the way, the PR to fix this bug seems to have dropped the LRU semantics of LRUMap; now it clears the entire thing every time it hits max capacity. Is this the intended behavior? Those seem like very weird semantics for a cache.

@cowtowncoder
Member

@aarondav Yes, this is intentional. The JDK's LRU behavior is not thread-safe, so the choice is either to use synchronization, which appears to have a major negative impact, or to just flush the whole cache (or to try incrementally clearing arbitrary entries -- I guess that would be a third way). But as you can see from the earlier results, the full flush turns out to be not only simple but nicely performant; sometimes the simplest approach works best.
