Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ArrayIndexOutOfBoundsException in ArrayDeque.add in GrpcExporter #7019

Open
Mahoney opened this issue Jan 15, 2025 · 7 comments · May be fixed by #7027
Open

ArrayIndexOutOfBoundsException in ArrayDeque.add in GrpcExporter #7019

Mahoney opened this issue Jan 15, 2025 · 7 comments · May be fixed by #7027
Labels
Bug Something isn't working

Comments

@Mahoney
Copy link

Mahoney commented Jan 15, 2025

Describe the bug
We are seeing this stack trace in our live system:

java.lang.ArrayIndexOutOfBoundsException: Index 4240 out of bounds for length 4240
	at java.base/java.util.ArrayDeque.grow(Unknown Source)
	at java.base/java.util.ArrayDeque.addLast(Unknown Source)
	at java.base/java.util.ArrayDeque.add(Unknown Source)
	at io.opentelemetry.exporter.internal.otlp.traces.SpanReusableDataMarshaler.lambda$export$0(SpanReusableDataMarshaler.java:51)
	at io.opentelemetry.sdk.common.CompletableResultCode.succeed(CompletableResultCode.java:107)
	at io.opentelemetry.exporter.internal.grpc.GrpcExporter.onResponse(GrpcExporter.java:75)
	at io.opentelemetry.exporter.internal.grpc.GrpcExporter.lambda$export$0(GrpcExporter.java:64)
	at io.opentelemetry.exporter.sender.okhttp.internal.OkHttpGrpcSender$1.onResponse(OkHttpGrpcSender.java:158)
	at okhttp3.internal.connection.RealCall$AsyncCall.run(RealCall.kt:519)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)

It seems to leave the entire exporter in a state from which it cannot recover; we then see this stack repeatedly:

java.lang.ArrayIndexOutOfBoundsException: Index 5156 out of bounds for length 4240
	at java.base/java.util.ArrayDeque.elementAt(Unknown Source)
	at java.base/java.util.ArrayDeque.pollFirst(Unknown Source)
	at java.base/java.util.ArrayDeque.poll(Unknown Source)
	at io.opentelemetry.exporter.internal.otlp.traces.SpanReusableDataMarshaler.export(SpanReusableDataMarshaler.java:40)
	at io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter.export(OtlpGrpcSpanExporter.java:77)
	at io.opentelemetry.sdk.trace.export.SimpleSpanProcessor.onEnd(SimpleSpanProcessor.java:94)
	at io.opentelemetry.sdk.trace.SdkSpan.endInternal(SdkSpan.java:569)
	at io.opentelemetry.sdk.trace.SdkSpan.end(SdkSpan.java:538)
	at io.opentelemetry.instrumentation.api.instrumenter.Instrumenter.doEnd(Instrumenter.java:263)
	at io.opentelemetry.instrumentation.api.instrumenter.Instrumenter.end(Instrumenter.java:150)
	at io.opentelemetry.instrumentation.jdbc.datasource.OpenTelemetryDataSource.wrapCall(OpenTelemetryDataSource.java:169)
	at io.opentelemetry.instrumentation.jdbc.datasource.OpenTelemetryDataSource.getConnection(OpenTelemetryDataSource.java:94)
	at org.jooq.impl.DataSourceConnectionProvider.acquire(DataSourceConnectionProvider.java:87)
	at org.jooq.impl.DefaultExecuteContext.connection(DefaultExecuteContext.java:651)
	at org.jooq.impl.AbstractQuery.connection(AbstractQuery.java:388)
	at org.jooq.impl.AbstractQuery.execute(AbstractQuery.java:308)
	at org.jooq.impl.AbstractResultQuery.fetchLazy(AbstractResultQuery.java:301)
	at org.jooq.impl.AbstractResultQuery.fetchLazyNonAutoClosing(AbstractResultQuery.java:322)
	at org.jooq.impl.SelectImpl.fetchLazyNonAutoClosing(SelectImpl.java:3256)
	at org.jooq.impl.ResultQueryTrait.fetchOne(ResultQueryTrait.java:509)
	at mocklab.users.UserRepository.findByIdpUserId(UserRepository.kt:108)
	at mocklab.users.Users.findByIdpUserId(Users.kt:248)
	at mocklab.security.JwtOauthAuthenticator.authenticate(JwtOauthAuthenticator.kt:45)
	at mocklab.security.JwtDiscriminatingAuthenticator.authenticate(JwtDiscriminatingAuthenticator.kt:14)
	at mocklab.security.JwtDiscriminatingAuthenticator.authenticate(JwtDiscriminatingAuthenticator.kt:8)
	at io.dropwizard.auth.AuthFilter.authenticate(AuthFilter.java:144)
	at io.dropwizard.auth.oauth.OAuthCredentialAuthFilter.filter(OAuthCredentialAuthFilter.java:37)
	at io.dropwizard.auth.chained.ChainedAuthFilter.filter(ChainedAuthFilter.java:45)
	at org.glassfish.jersey.server.ContainerFilteringStage.apply(ContainerFilteringStage.java:108)
	at org.glassfish.jersey.server.ContainerFilteringStage.apply(ContainerFilteringStage.java:44)
	at org.glassfish.jersey.process.internal.Stages.process(Stages.java:173)
	at org.glassfish.jersey.server.ServerRuntime$1.run(ServerRuntime.java:266)
	at org.glassfish.jersey.internal.Errors$1.call(Errors.java:248)
	at org.glassfish.jersey.internal.Errors$1.call(Errors.java:244)
	at org.glassfish.jersey.internal.Errors.process(Errors.java:292)
	at org.glassfish.jersey.internal.Errors.process(Errors.java:274)
	at org.glassfish.jersey.internal.Errors.process(Errors.java:244)
	at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:266)
	at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:253)
	at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:696)
	at org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:397)
	at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:349)
	at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:358)
	at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:312)
	at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:205)
	at org.eclipse.jetty.ee10.servlet.ServletHolder.handle(ServletHolder.java:736)
	at org.eclipse.jetty.ee10.servlet.ServletHandler$ChainEnd.doFilter(ServletHandler.java:1619)
	at io.dropwizard.servlets.ThreadNameFilter.doFilter(ThreadNameFilter.java:36)
	at org.eclipse.jetty.ee10.servlet.FilterHolder.doFilter(FilterHolder.java:205)
	at org.eclipse.jetty.ee10.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1591)
	at io.dropwizard.jetty.ZipExceptionHandlingServletFilter.doFilter(ZipExceptionHandlingServletFilter.java:30)
	at org.eclipse.jetty.ee10.servlet.FilterHolder.doFilter(FilterHolder.java:205)
	at org.eclipse.jetty.ee10.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1591)
	at io.dropwizard.jersey.filter.AllowedMethodsFilter.handle(AllowedMethodsFilter.java:46)
	at io.dropwizard.jersey.filter.AllowedMethodsFilter.doFilter(AllowedMethodsFilter.java:40)
	at org.eclipse.jetty.ee10.servlet.FilterHolder.doFilter(FilterHolder.java:205)
	at org.eclipse.jetty.ee10.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1591)
	at mocklab.analytics.ClientInfoCapturingFilter.doFilter$lambda$0(ClientInfo.kt:46)
	at mocklab.analytics.ClientInfoThreadLocal.with(ClientInfo.kt:108)
	at mocklab.analytics.ClientInfoCapturingFilter.doFilter(ClientInfo.kt:45)
	at org.eclipse.jetty.ee10.servlet.FilterHolder.doFilter(FilterHolder.java:205)
	at org.eclipse.jetty.ee10.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1591)
	at mocklab.system.SecurityHeadersFilter.doFilter(SecurityHeadersFilter.kt:43)
	at jakarta.servlet.http.HttpFilter.doFilter(HttpFilter.java:97)
	at org.eclipse.jetty.ee10.servlet.FilterHolder.doFilter(FilterHolder.java:205)
	at org.eclipse.jetty.ee10.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1591)
	at org.eclipse.jetty.ee10.servlets.CrossOriginFilter.handle(CrossOriginFilter.java:317)
	at org.eclipse.jetty.ee10.servlets.CrossOriginFilter.doFilter(CrossOriginFilter.java:270)
	at org.eclipse.jetty.ee10.servlet.FilterHolder.doFilter(FilterHolder.java:205)
	at org.eclipse.jetty.ee10.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1591)
	at mocklab.analytics.OpenTelemetryFilter.doFilter(OpenTelemetryFilter.kt:95)
	at jakarta.servlet.http.HttpFilter.doFilter(HttpFilter.java:97)
	at org.eclipse.jetty.ee10.servlet.FilterHolder.doFilter(FilterHolder.java:205)
	at org.eclipse.jetty.ee10.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1591)
	at mocklab.system.LogExceptionsFilter.doFilter(LogExceptionsFilter.kt:18)
	at org.eclipse.jetty.ee10.servlet.FilterHolder.doFilter(FilterHolder.java:205)
	at org.eclipse.jetty.ee10.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1591)
	at org.eclipse.jetty.ee10.servlet.ServletHandler$MappedServlet.handle(ServletHandler.java:1552)
	at org.eclipse.jetty.ee10.servlet.ServletChannel.dispatch(ServletChannel.java:819)
	at org.eclipse.jetty.ee10.servlet.ServletChannel.handle(ServletChannel.java:436)
	at org.eclipse.jetty.ee10.servlet.ServletHandler.handle(ServletHandler.java:469)
	at org.eclipse.jetty.ee10.servlet.SessionHandler.handle(SessionHandler.java:717)
	at org.eclipse.jetty.server.Handler$Wrapper.handle(Handler.java:740)
	at io.dropwizard.request.logging.LogbackAccessRequestLogAwareHandler.handle(LogbackAccessRequestLogAwareHandler.java:14)
	at org.eclipse.jetty.server.handler.ContextHandler.handle(ContextHandler.java:1060)
	at org.eclipse.jetty.server.Handler$Wrapper.handle(Handler.java:740)
	at io.dropwizard.metrics.jetty12.AbstractInstrumentedHandler.handle(AbstractInstrumentedHandler.java:299)
	at io.dropwizard.jetty.RoutingHandler.handle(RoutingHandler.java:41)
	at org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:611)
	at io.dropwizard.jetty.ZipExceptionHandlingGzipHandler.handle(ZipExceptionHandlingGzipHandler.java:21)
	at org.eclipse.jetty.server.Handler$Wrapper.handle(Handler.java:740)
	at org.eclipse.jetty.server.handler.GracefulHandler.handle(GracefulHandler.java:101)
	at org.eclipse.jetty.server.Server.handle(Server.java:182)
	at org.eclipse.jetty.server.internal.HttpChannelState$HandlerInvoker.run(HttpChannelState.java:662)
	at org.eclipse.jetty.server.internal.HttpConnection.onFillable(HttpConnection.java:418)
	at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:322)
	at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:99)
	at org.eclipse.jetty.io.SelectableChannelEndPoint$1.run(SelectableChannelEndPoint.java:53)
	at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.runTask(AdaptiveExecutionStrategy.java:480)
	at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.consumeTask(AdaptiveExecutionStrategy.java:443)
	at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.tryProduce(AdaptiveExecutionStrategy.java:293)
	at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.run(AdaptiveExecutionStrategy.java:201)
	at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:311)
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:979)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.doRunJob(QueuedThreadPool.java:1209)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1164)
	at java.base/java.lang.Thread.run(Unknown Source)

and after c. 12 of those, just:

java.lang.ArrayIndexOutOfBoundsException: null

over and over again.

We are using the OtlpGrpcSpanExporter with the default memory mode of REUSABLE_DATA. It looks like somehow concurrent access is happening to SpanReusableDataMarshaler's marshalerPool.

Steps to reproduce
Unknown.

What version and what artifacts are you using?
Artifacts:

io.opentelemetry:opentelemetry-api:1.45.0
io.opentelemetry:opentelemetry-api-incubator:1.45.0-alpha
io.opentelemetry:opentelemetry-context:1.45.0
io.opentelemetry:opentelemetry-exporter-common:1.45.0
io.opentelemetry:opentelemetry-exporter-logging:1.45.0
io.opentelemetry:opentelemetry-exporter-otlp:1.45.0
io.opentelemetry:opentelemetry-exporter-otlp-common:1.45.0
io.opentelemetry:opentelemetry-exporter-sender-okhttp:1.45.0
io.opentelemetry:opentelemetry-sdk:1.45.0
io.opentelemetry:opentelemetry-sdk-common:1.45.0
io.opentelemetry:opentelemetry-sdk-extension-autoconfigure-spi:1.45.0
io.opentelemetry:opentelemetry-sdk-logs:1.45.0
io.opentelemetry:opentelemetry-sdk-metrics:1.45.0
io.opentelemetry:opentelemetry-sdk-trace:1.45.0

io.opentelemetry:opentelemetry-semconv:1.30.1-alpha - brought in transitively, no class name clashes

io.opentelemetry.semconv:opentelemetry-semconv:1.29.0-alpha
io.opentelemetry.semconv:opentelemetry-semconv-incubating:1.29.0-alpha

io.opentelemetry.instrumentation:opentelemetry-instrumentation-api:2.11.0
io.opentelemetry.instrumentation:opentelemetry-instrumentation-api-incubator:2.11.0-alpha
io.opentelemetry.instrumentation:opentelemetry-resources:2.11.0-alpha
io.opentelemetry.instrumentation:opentelemetry-apache-httpclient:5.2-2.11.0-alpha
io.opentelemetry.instrumentation:opentelemetry-jdbc:2.11.0-alpha

How did you reference these artifacts? (excerpt from your build.gradle, pom.xml, etc)
build.gradle:

    implementation(
        platform('io.opentelemetry.instrumentation:opentelemetry-instrumentation-bom-alpha:2.11.0-alpha'),
        platform('io.opentelemetry.instrumentation:opentelemetry-instrumentation-bom'),
        platform('io.opentelemetry:opentelemetry-bom'),
        'io.opentelemetry:opentelemetry-api',
        'io.opentelemetry:opentelemetry-context',
        'io.opentelemetry:opentelemetry-sdk',
        'io.opentelemetry:opentelemetry-sdk-common',
        'io.opentelemetry:opentelemetry-sdk-trace',
        'io.opentelemetry:opentelemetry-exporter-otlp',
        'io.opentelemetry:opentelemetry-exporter-logging',
        'io.opentelemetry.instrumentation:opentelemetry-jdbc',
        'io.opentelemetry.instrumentation:opentelemetry-apache-httpclient-5.2',
        'io.opentelemetry.instrumentation:opentelemetry-resources',
        'io.opentelemetry.semconv:opentelemetry-semconv:1.29.0-alpha',
        'io.opentelemetry.semconv:opentelemetry-semconv-incubating:1.29.0-alpha',
    )

Environment
Compiler & runtime: Temurin 23
OS: Alpine 3.20.5

Additional context
We're on 1.45.0 not 1.46.0 because the latest version (2.11.0) of opentelemetry-instrumentation-bom and opentelemetry-instrumentation-bom-alpha bring in 1.45.0.

@Mahoney Mahoney added the Bug Something isn't working label Jan 15, 2025
@jack-berg
Copy link
Member

Thanks for the report. I will investigate.

@Mahoney
Copy link
Author

Mahoney commented Jan 16, 2025

I've been investigating too, just dropping breakpoints that evaluate at the marshalerPool.poll and marshalerPool.add calls.

I see that marshalerPool.poll is called concurrently by the threads serving the request, because SdkSpan.end ultimately calls it, synchronized on SimpleSpanProcessor.exportLock.

However, the only synchronization I can see upstream of marshalerPool.add is in CompletableResultCode.succeed. Quite apart from being a different lock to SimpleSpanProcessor.exportLock this is a different lock every time, as we get a new CompletableResultCode instance on every export response, so no use for protecting marshalerPool. And while OkHttp unhelpfully gives all the threads the same name, examining the thread ids shows that OkHttp is using a pool of threads to make the export requests.

It would be much simpler to reason about if SpanReusableDataMarshaler managed concurrent access to marshalerPool itself with its own lock. ArrayDeque is quite explicit that:

Array deques ... are not thread-safe; in the absence of external synchronization, they do not support concurrent access by multiple threads.

Mahoney added a commit to Mahoney-forks/opentelemetry-java that referenced this issue Jan 16, 2025
ArrayDeque specifies that:

> Array deques ... are not thread-safe; in the absence of external
> synchronization, they do not support concurrent access by multiple
> threads.

`marshalerPool` is concurrently added to by the OkHttp threadpool
without synchronization, along with all threads that end spans (with
synchronisation in `SimpleSpanProcessor.exportLock`, which is not used
to synchronize with the OkHttp threadpool).

Just making the ArrayQueue synchronous internally removes all need to
worry about upstream locks.

Fixes open-telemetry#7019
@Mahoney Mahoney linked a pull request Jan 16, 2025 that will close this issue
Mahoney added a commit to Mahoney-forks/opentelemetry-java that referenced this issue Jan 16, 2025
ArrayDeque specifies that:

> Array deques ... are not thread-safe; in the absence of external
> synchronization, they do not support concurrent access by multiple
> threads.

`marshalerPool` is concurrently added to by the OkHttp threadpool
without synchronization, along with all threads that end spans (with
synchronisation in `SimpleSpanProcessor.exportLock`, which is not used
to synchronize with the OkHttp threadpool).

Just making the ArrayQueue synchronous internally removes all need to
worry about upstream locks.

Fixes open-telemetry#7019
@jack-berg
Copy link
Member

I think a fix is necessary, but I wasn't sure why since BatchSpanProcessor won't call SpanExporter#export concurrently. Then I noticed SimpleSpanProcessor#onEnd in the stack trace:

at io.opentelemetry.exporter.internal.otlp.traces.SpanReusableDataMarshaler.export(SpanReusableDataMarshaler.java:40)
at io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter.export(OtlpGrpcSpanExporter.java:77)
at io.opentelemetry.sdk.trace.export.SimpleSpanProcessor.onEnd(SimpleSpanProcessor.java:94)

Looking at the source code for SimpleSpanProcessor#onEnd, there's an attempt to avoid concurrent calls export, but it arguably doesn't meet spirit of the spec. Its not actually clear to me whether SimpleSpanProcessor is supposed to not only call export sequentially, but also wait for the returned CompletableResultCode to complete before calling again.

But I think you're fix is necessary because even if SimpleSpanProcessor is supposed to wait for the CompletableResultCode to complete, our OTLP exporters should be able to export multiple requests concurrently, so long as callers synchronize calls to OtlpHttpSpanExporter#export. I.e. its not necessary for callers to wait for the returned CompletableResultCode to complete before calling export again.

@Mahoney out of curiosity, why are you using SImpleSpanProcessor instead of BatchSpanProcessor? SimpleSpanProcessor is really only for trivial situations. Exporting one record per request is crazy inefficient for real world situations.

cc @jkwatson WDYT - should SimpleSpanProcessor wait for CompletableResultCode to complete before calling the next export?

@trask
Copy link
Member

trask commented Jan 16, 2025

Based on the spec language:

The processor MUST synchronize calls to Span Exporter's Export to make sure that they are not invoked concurrently.

I personally think SimpleSpanProcessor.java is following the spirit, based on the intent here:

This means that while an instance of an exporter should never have its Export() called concurrently it does not mean that the task of exporting can not be done concurrently.

@jack-berg
Copy link
Member

Yeah I saw that language too. The counter is that SimpleSpanProcessor is inconsistent with BatchSpanProcessor, which waits for the CompletableResultCode to complete. I can't find any language suggesting that simple and batch processors are intended to be different in this respect.

@trask
Copy link
Member

trask commented Jan 16, 2025

I think BatchSpanProcessor is the one that's wrong 😁: #4264

@Mahoney
Copy link
Author

Mahoney commented Jan 18, 2025

@Mahoney out of curiosity, why are you using SImpleSpanProcessor instead of BatchSpanProcessor? SimpleSpanProcessor is really only for trivial situations. Exporting one record per request is crazy inefficient for real world situations.

I don't know, lack of knowledge of the library I guess.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants