apm-server drops a lot of metrics. Unclear where this happens #13541
Comments
I tried to reproduce it with your setup of telemetrygen + collector + apm-server (on main branch) but failed. ES is returning all 500k documents.
{"apm-server":{"acm":{"request":{"count":0},"response":{"count":0,"errors":{"closed":0,"count":0,"decode":0,"forbidden":0,"internal":0,"invalidquery":0,"method":0,"notfound":0,"queue":0,"ratelimit":0,"timeout":0,"toolarge":0,"unauthorized":0,"unavailable":0,"validate":0},"valid":{"accepted":0,"count":0,"notmodified":0,"ok":0}},"unset":0},"agentcfg":{"elasticsearch":{"cache.entries.count":0,"cache.refresh.failures":0,"cache.refresh.successes":5,"fetch.es":0,"fetch.fallback":0,"fetch.invalid":0,"fetch.unavailable":0}},"jaeger":{"grpc":{"collect":{"request":{"count":0},"response":{"count":0,"errors":{"count":0,"ratelimit":0,"timeout":0,"unauthorized":0},"valid":{"count":0}}},"sampling":{"event":{"received":{"count":0}},"request":{"count":0},"response":{"count":0,"errors":{"count":0},"valid":{"count":0}}}}},"otlp":{"grpc":{"logs":{"request":{"count":0},"response":{"count":0,"errors":{"count":0,"ratelimit":0,"timeout":0,"unauthorized":0},"valid":{"count":0}}},"metrics":{"consumer":{"unsupported_dropped":0},"request":{"count":639},"response":{"count":639,"errors":{"count":0,"ratelimit":0,"timeout":0,"unauthorized":0},"valid":{"count":639}}},"traces":{"request":{"count":0},"response":{"count":0,"errors":{"count":0,"ratelimit":0,"timeout":0,"unauthorized":0},"valid":{"count":0}}}},"http":{"logs":{"request":{"count":0},"response":{"count":0,"errors":{"count":0,"ratelimit":0,"timeout":0,"unauthorized":0},"valid":{"count":0}}},"metrics":{"consumer":{"unsupported_dropped":0},"request":{"count":0},"response":{"count":0,"errors":{"count":0,"ratelimit":0,"timeout":0,"unauthorized":0},"valid":{"count":0}}},"traces":{"request":{"count":0},"response":{"count":0,"errors":{"count":0,"ratelimit":0,"timeout":0,"unauthorized":0},"valid":{"count":0}}}}},"processor":{"metric":{"transformations":500003},"stream":{"accepted":0,"errors":{"invalid":0,"toolarge":0}}},"root":{"request":{"count":1},"response":{"count":1,"errors":{"closed":0,"count":0,"decode":0,"forbidden":0,"internal":0,"invalidquery":0,"method":0,"notfound":0,"queue":0,"ratelimit":0,"timeout":0,"toolarge":0,"unauthorized":0,"unavailable":0,"validate":0},"valid":{"accepted":0,"count":1,"notmodified":0,"ok":1}},"unset":0},"sampling":{"transactions_dropped":0},"server":{"request":{"count":0},"response":{"count":0,"errors":{"closed":0,"count":0,"decode":0,"forbidden":0,"internal":0,"invalidquery":0,"method":0,"notfound":0,"queue":0,"ratelimit":0,"timeout":0,"toolarge":0,"unauthorized":0,"unavailable":0,"validate":0},"valid":{"accepted":0,"count":0,"notmodified":0,"ok":0}},"unset":0}},"beat":{"cgroup":{"cpu":{"id":"session-c2.scope","stats":{"periods":0,"throttled":{"ns":0,"periods":0}}},"memory":{"id":"session-c2.scope","mem":{"usage":{"bytes":41311178752}}}},"cpu":{"system":{"ticks":950,"time":{"ms":950}},"total":{"ticks":10970,"time":{"ms":10970},"value":10970},"user":{"ticks":10020,"time":{"ms":10020}}},"handles":{"limit":{"hard":1048576,"soft":1048576},"open":15},"info":{"ephemeral_id":"79a06409-db09-47ae-8e6d-98c28f66a49b","name":"apm-server","uptime":{"ms":142860},"version":"8.15.0"},"memstats":{"gc_next":199843960,"memory_alloc":173394320,"memory_sys":259242296,"memory_total":2350429840,"rss":264138752},"runtime":{"goroutines":48}},"libbeat":{"output":{"events":{"acked":500003,"active":0,"batches":126,"failed":0,"toomany":0,"total":500003},"type":"elasticsearch","write":{"bytes":2553816}},"pipeline":{"events":{"total":500003}}},"output":{"elasticsearch":{"bulk_requests":{"available":60,"completed":126},"indexers":{"active":1,"created":0,"destroyed":0
}}},"system":{"cpu":{"cores":16},"load":{"1":5.28,"15":4.58,"5":4.51,"norm":{"1":0.33,"15":0.2863,"5":0.2819}}}}
The 2 "missing" documents could be some other internal metrics, and I wouldn't worry too much about it. But I agree that only 66.1% of datapoints are ingested may reveal a problem somewhere.
Hey @carsonip, thanks for reporting back. I will try to reproduce this without the collector in between. May I ask how long it took?
Ok. I retested it directly from telemetry-gen to APM (8.13.3). The number of "batches" matches.
I know I'm not the Go expert, and I adjusted … Maybe something got fixed in the …
@carsonip what is the meaning of these metrics in the es output?
I am specifically thinking of the "available" number, which is suspiciously 1/3rd of all the completed requests, just like 1/3rd of the metrics are missing.
@rpasche are you saying that in your latest test the number of metrics ingested by an OTel collector and APM is the same (telemetry-gen -> OTel collector -> APM server)? And is this also valid for traces and logs? If yes, is it possible that the OTel collector is aggregating metrics somehow before sending them into APM?
@inge4pres no. What I wanted to say is that I only have that issue when sending metrics. And I had the issue again when using telemetry-gen -> APM.
If you're asking how long it took, for me it took 2m7s.
AFAIK that is not at all what it means. "available" should be the number of bulk requests waiting for work. I believe it is merely a coincidence. See the apm-server code and docappender code.
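To illustrate the point above, here is a minimal sketch of the general idea (a toy model, not the actual docappender implementation): a pool of bulk-request slots where "available" is simply how many slots are currently idle, so it is unrelated to how many bulk requests have completed.

```go
package main

import "fmt"

// bulkPool is a toy model of a pool of bulk-request slots.
// "available" counts slots waiting for work; "completed" counts
// finished bulk requests. The two numbers are unrelated, so any
// ratio between them is coincidental.
type bulkPool struct {
	slots     chan struct{}
	completed int
}

func newBulkPool(size int) *bulkPool {
	p := &bulkPool{slots: make(chan struct{}, size)}
	for i := 0; i < size; i++ {
		p.slots <- struct{}{} // all slots start idle
	}
	return p
}

func (p *bulkPool) available() int { return len(p.slots) }

func (p *bulkPool) doBulkRequest() {
	<-p.slots                                // take an idle slot
	defer func() { p.slots <- struct{}{} }() // return it when done
	p.completed++                            // ... send the bulk request here ...
}

func main() {
	p := newBulkPool(60)
	for i := 0; i < 126; i++ {
		p.doBulkRequest()
	}
	fmt.Println("available:", p.available(), "completed:", p.completed)
}
```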
@rpasche where is this "1000" config?
@carsonip I mentioned at the beginning the issue open-telemetry/opentelemetry-collector-contrib#33427, where I noticed that for …
This is also the reason I wondered how you were able to send those … I was not yet able to check the latest …
Just an update. I was able to build a new Docker image based on the latest …
If you are only experiencing this issue with metrics, but not logs or traces, here's a possible explanation. In apm-server, within the same OTLP batch, under the same resource and scope metrics, we group metric data points by timestamp and data point attributes, and store them as 1 Elasticsearch document. In my experiment to reproduce the issue, the timestamps are never the same, but it is possible that some data points share the same timestamp, due to reasons like a system clock with lower precision, which triggers the grouping behavior and results in fewer documents than data points. This is expected. You may find the code here. Feel free to craft an input OTLP dataset that has a different timestamp for every data point and see if documents are still "missing". Closing this issue for now, as apm-server is behaving as expected.
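As an illustration of the grouping described above, a minimal sketch (not the actual apm-server code): data points within one resource/scope are keyed by (timestamp, attributes), so several data points of the same metric that share a key collapse into a single document.

```go
package main

import "fmt"

// dataPoint is a simplified stand-in for an OTLP metric data point.
type dataPoint struct {
	metric    string
	timestamp int64  // unix nanos
	attrs     string // attributes serialized to a canonical string
	value     float64
}

// groupIntoDocs mimics the idea of grouping data points of one
// resource/scope by (timestamp, attributes): each group becomes a
// single document whose fields are the metric names.
func groupIntoDocs(points []dataPoint) map[string]map[string]float64 {
	docs := make(map[string]map[string]float64)
	for _, p := range points {
		key := fmt.Sprintf("%d|%s", p.timestamp, p.attrs)
		if docs[key] == nil {
			docs[key] = make(map[string]float64)
		}
		// A later data point for the same metric under the same key
		// overwrites the earlier one, i.e. it is silently "dropped".
		docs[key][p.metric] = p.value
	}
	return docs
}

func main() {
	points := []dataPoint{
		{"gen", 1, `{"worker":"0"}`, 1.0},
		{"gen", 1, `{"worker":"0"}`, 2.0}, // same key: collapses with the previous one
		{"gen", 2, `{"worker":"0"}`, 3.0}, // different timestamp: its own document
	}
	docs := groupIntoDocs(points)
	fmt.Println("data points:", len(points), "documents:", len(docs)) // 3 vs 2
}
```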
Hi @carsonip, ok. I know the basic idea of grouping multiple metrics or datapoints with the same attributes or labels from the … And it is - of course - good to group them, if possible, into a single document. The problem that "we" (maybe others) have is that it is now unclear whether truly all datapoints have been processed. What would be good to see (from the internal metrics) is that, in total, those … To my understanding, … Will try to come up with another test.
I understand your concern. While it appears as if only 309287 data points have reached APM, the truth is that only 309287 data points are processed meaningfully, because in a synthetic test like telemetrygen there are data points from the same metric sharing the same timestamp and attributes, and they are effectively silently dropped in the process of grouping. The later data point takes precedence in the current implementation. However, I don't think data points with the same metric name, same timestamp, and same attributes are common in real environments. That said, I'll check with the team for any area to improve, especially in making it easier to observe if that really happens.
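To test the suggestion from the earlier comment (an input dataset where every data point has a distinct timestamp, so each should become its own document), a sketch using the collector's pdata API could look like the following; the metric name, attribute, and export step are placeholders, and the exact module version may differ.

```go
package main

import (
	"time"

	"go.opentelemetry.io/collector/pdata/pcommon"
	"go.opentelemetry.io/collector/pdata/pmetric"
)

// buildMetrics creates n gauge data points whose timestamps are all
// distinct, so the (timestamp, attributes) grouping in apm-server
// should produce one document per data point.
func buildMetrics(n int) pmetric.Metrics {
	md := pmetric.NewMetrics()
	sm := md.ResourceMetrics().AppendEmpty().ScopeMetrics().AppendEmpty()
	m := sm.Metrics().AppendEmpty()
	m.SetName("gen") // placeholder metric name
	gauge := m.SetEmptyGauge()

	base := time.Now()
	for i := 0; i < n; i++ {
		dp := gauge.DataPoints().AppendEmpty()
		// Offset each point by one nanosecond to guarantee unique timestamps.
		dp.SetTimestamp(pcommon.NewTimestampFromTime(base.Add(time.Duration(i) * time.Nanosecond)))
		dp.SetDoubleValue(float64(i))
		dp.Attributes().PutStr("worker", "0") // same attributes on purpose
	}
	return md
}

func main() {
	md := buildMetrics(500000)
	_ = md // export via an OTLP/gRPC exporter of your choice
}
```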
APM Server version (`apm-server version`): 8.13.3
Description of the problem including expected versus actual behavior:
APM-server seems to drop some metric points, but we cannot determine which and why. We noticed this while testing with the `otel-collector-contrib` running and sending all 3 signals to an APM-server via `otlp/gRPC`. The number of documents created in Elasticsearch did not match the number of `metrics` generated and passed through the collector. When sending `traces` or `logs`, everything seems to work fine.

Steps to reproduce:
Please include a minimal but complete recreation of the problem,
including server configuration, agent(s) used, etc. The easier you make it
for us to reproduce it, the more likely that somebody will take the time to
look at it.
- APM server with the `http` server enabled
- `otel-collector` running, forwarding to APM (currently, I do not have a minimal setup, so I try to write an example config here)
- Send `metrics` (or `traces` or `logs` for cross-checks).

Please note my issue at open-telemetry/opentelemetry-collector-contrib#33427 where - at least for me - sending a high volume of `logs` or `metrics` was not really possible.

Provide logs (if relevant):
Make sure that the collector and APM server have been restarted (to clear the counters) and that nothing else is sending to them.
Generated `500000` metrics with telemetrygen. Note the log output `metrics generated {"worker": 0, "metrics": 500000}`.
In Grafana / Prometheus, check the metrics
Connect to the APM server `monitoring` port (default: 5066) and get the `stats` (`GET apm:5066/stats`).
Compare the numbers from Prometheus with the counters provided by the APM server. Also check the number of documents ingested into the final index within `elasticsearch`.
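For the comparison step above, a small helper could fetch the relevant counters directly. This is only a sketch: the host/port and the JSON paths are taken from the stats output shown in this thread, so adjust them as needed.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// fetchStats pulls the monitoring stats from the apm-server monitoring
// port and returns the decoded JSON document.
func fetchStats(url string) (map[string]any, error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	var stats map[string]any
	if err := json.NewDecoder(resp.Body).Decode(&stats); err != nil {
		return nil, err
	}
	return stats, nil
}

// dig walks a nested JSON object along the given keys.
func dig(m map[string]any, keys ...string) any {
	var cur any = m
	for _, k := range keys {
		obj, ok := cur.(map[string]any)
		if !ok {
			return nil
		}
		cur = obj[k]
	}
	return cur
}

func main() {
	stats, err := fetchStats("http://apm:5066/stats") // host name as used in this issue
	if err != nil {
		panic(err)
	}
	fmt.Println("otlp.grpc.metrics.request.count:",
		dig(stats, "apm-server", "otlp", "grpc", "metrics", "request", "count"))
	fmt.Println("processor.metric.transformations:",
		dig(stats, "apm-server", "processor", "metric", "transformations"))
	fmt.Println("libbeat.pipeline.events.total:",
		dig(stats, "libbeat", "pipeline", "events", "total"))
}
```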
See my logs and pictures from Grafana and APM stats
`otelcol_exporter_sent_metric_points` and `otelcol_processor_batch_batch_send_size_sum` show that in total `500000` metrics have been sent, and all of them have been sent in batches. The number of batches is: `132`.
See the output of the `stats`: `otlp.grpc.metrics.request.count` should match `132` (the number of batches), but `libbeat.pipeline.events.total` is way lower. It's only `332092` (same as `processor.metric.transformations: 332092`). See also the screenshot from Elasticsearch (`2` documents also seem to be missing).
documents).Please also note: The APM index created in elasticsearch is NOT a
TSDS
(index.mode: time_series
)If I redo those tests with
logs
ortraces
, the number ofbatches
and total number oftraces
orlogs
reported by the OTEL collector and APM do match.Just to summarize: In this test, only
66.1%
of total datapoints sent have been "seen" (processed?) by APM. We also do not have any kind offilter
configured within APM.The text was updated successfully, but these errors were encountered: