[APM] Avoid using _source for OTel compatibility #189947

Closed
7 tasks
gregkalapos opened this issue Aug 6, 2024 · 10 comments
Assignees
Labels
apm:opentelemetry APM UI - OTEL Work apm OpenTelemetry Team:obs-ux-infra_services Observability Infrastructure & Services User Experience Team technical debt Improvement of the software architecture and operational architecture v8.16.0

Comments

@gregkalapos
Contributor

gregkalapos commented Aug 6, 2024

As we work towards OTel-native support, we expect data to be stored in a more OTel-native format in Elasticsearch. E.g. see: elastic/elasticsearch#111091

The result of this is that the shape of the data will be different compared to what we currently have in the APM data streams.

At the same time, we also add a compatibility layer to make sure the current UI works with the new data. This layer is mainly based on aliases and passthrough fields.

Where this currently breaks is that the UI in some cases uses _source to access data. That is currently a blocker for the compatibility layer, as some of these fields are not directly available under _source.

Specific example:

On the service summary page in this part the UI accesses fields from _source to populate the icons:

_source: [KUBERNETES, CLOUD_PROVIDER, CONTAINER_ID, AGENT_NAME, CLOUD_SERVICE_NAME],

In this example we have a field that stores agent name.

Here is what OTel-native data in Elasticsearch will look like:

{
  "@timestamp": "2024-08-05T18:31:19.828218000Z",
  "attributes": {
    "metricset.interval": "1m",
    "metricset.name": "service_transaction",
    "processor.event": "metric",
    "transaction.root": true,
    "transaction.type": "unknown"
  },
  "data_stream": {
    "dataset": "generic.otel",
    "namespace": "default",
    "type": "metrics"
  },
  "metrics": {
    "transaction.duration.histogram": {
      "counts": [
        1
      ],
      "values": [
        12500
      ]
    }
  },
  "resource": {
    "attributes": {
      "metricset.interval": "1m",
      "service.name": "sendotlp",
      "some.resource.attribute": "resource.attr",
      "telemetry.sdk.language": "go",
      "telemetry.sdk.name": "opentelemetry",
      "telemetry.sdk.version": "1.28.0",
      "agent.name": "opentelemetry/go",
      "agent.name.text": "opentelemetry/go"
    },
    "dropped_attributes_count": 0,
    "schema_url": ""
  },
  "scope": {
    "name": "otelcol/spanmetricsconnectorv2"
  }
}

See field resource.attributes.agent.name - that is how we store attributes in OTel native data. Everything under resource.attributes can be queried as a top level field, but those fields under _source are of course still under resource.attributes.*. So in practice there is an alias from agent.name to resource.attributes.agent.name.
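
To make the mismatch concrete, here is a minimal TypeScript sketch, assuming a trimmed-down copy of the document above. The helper `readFromSource` is hypothetical and not part of Kibana; it only mimics code that looks up a dotted field name inside `_source`.

```typescript
// Trimmed-down copy of the OTel-native _source shown above (illustrative only).
const otelSource = {
  resource: {
    attributes: {
      'service.name': 'sendotlp',
      'agent.name': 'opentelemetry/go',
    },
  },
};

// Hypothetical helper: resolve a field name against a raw _source object.
// It first tries the literal key, then walks the dotted path segment by segment.
function readFromSource(source: Record<string, unknown>, path: string): unknown {
  if (path in source) {
    return source[path];
  }
  let current: unknown = source;
  for (const segment of path.split('.')) {
    if (typeof current !== 'object' || current === null) {
      return undefined;
    }
    current = (current as Record<string, unknown>)[segment];
  }
  return current;
}

// The alias only exists at query time, so the top-level lookup finds nothing.
const viaTopLevel = readFromSource(otelSource, 'agent.name');

// The value is only reachable under resource.attributes in _source.
const viaFullPath = readFromSource(otelSource.resource.attributes, 'agent.name');
```

Any UI code that reads `agent.name` from `_source` in this way would come back empty for OTel-native documents, even though a query on `agent.name` matches them.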

Currently the query above does something like this:

{
  "track_total_hits": 1,
  "size": 1,
  "_source": [
    "kubernetes",
    "cloud.provider",
    "container.id",
    "agent.name",
    "cloud.service.name"
  ],
  "query": {
    "bool": {
      "filter": [
        {
          "terms": {
            "processor.event": [
              "metric",
              "error",
              "metric"
            ]
          }
        }
      ],
      "must": [
        //... rest of the query
}

Here agent.name will not be returned, because it is read from _source, where the value only exists under resource.attributes.

The question is: is using _source needed? If, for example, this were rewritten to use the fields API, it would work:

  "size": 1,
  "fields": [ //<--- here use `fields` instead of `_source`
    "kubernetes",
    "cloud.provider",
    "container.id",
    "agent.name",
    "cloud.service.name"
  ],
  "query": {
    //... rest of the query

Of course there may be other ways to do it and there may be some downside of using fields - which I don't know of.
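
As a rough illustration of what the consuming side could look like, here is a minimal TypeScript sketch. The names `buildIconRequest`, `firstValue` and `FieldsHit` are hypothetical, not existing Kibana code. It assumes the documented behaviour that the `fields` option returns every value as an array keyed by the flat field name, and that `_source: false` can be set alongside it. Whether an object name such as `kubernetes` needs a wildcard like `kubernetes.*` under `fields` is left open here and should be verified.

```typescript
// Shape of a search hit when only `fields` is requested (values are always arrays).
type FieldsHit = { fields?: Record<string, unknown[]> };

// Hypothetical request builder: asks for `fields` and switches `_source` off.
function buildIconRequest(fieldNames: string[]) {
  return {
    track_total_hits: 1,
    size: 1,
    _source: false,
    fields: fieldNames,
  };
}

// Hypothetical accessor: return the first value of a field, or undefined if absent.
function firstValue<T = unknown>(hit: FieldsHit, name: string): T | undefined {
  return hit.fields?.[name]?.[0] as T | undefined;
}

const iconRequest = buildIconRequest(['cloud.provider', 'container.id', 'agent.name']);

// Illustrative hit, shaped like a `fields` response for the OTel document above.
const exampleHit: FieldsHit = { fields: { 'agent.name': ['opentelemetry/go'] } };
const agentName = firstValue<string>(exampleHit, 'agent.name');
const cloudProvider = firstValue<string>(exampleHit, 'cloud.provider');
```

Because aliases are resolved by the `fields` API, the same accessor should work for both the current APM data streams and OTel-native data.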

So the first proposal is to check whether using the fields API is acceptable. If the answer is yes, the APM UI should move to using that instead of _source. If that's not possible, we should discuss other options.

Non-exhaustive list of _source usages

Sub tasks

@botelastic botelastic bot added the needs-team Issues missing a team label label Aug 6, 2024
@AlexanderWert AlexanderWert added the Team:obs-ux-infra_services Observability Infrastructure & Services User Experience Team label Aug 6, 2024
@elasticmachine
Contributor

Pinging @elastic/obs-ux-infra_services-team (Team:obs-ux-infra_services)

@botelastic botelastic bot removed the needs-team Issues missing a team label label Aug 6, 2024
@cauemarcondes
Contributor

From what I could see we can change to fields.

@gregkalapos There are other places on APM where we use the _source response. Do we need to change all those places too?

@gregkalapos
Contributor Author

> From what I could see we can change to fields.

Nice 🎉 Great to hear that.

> @gregkalapos There are other places on APM where we use the _source response. Do we need to change all those places too?

Yes, this issue is about considering completely moving away from _source in general as it can break our OTel effort. The one above was just one specific example to help understanding, but it's a general issue.

@felixbarny
Member

From an efficiency perspective, particularly in combination with synthetic _source, using fields is preferable to _source filtering. That's because the full _source first needs to be synthesized using all fields, and only then is a subset of _source returned. With synthetic _source, there's an overhead proportional to the number of fields that are fetched, and reconstructing _source needs to fetch all fields.

@smith smith added technical debt Improvement of the software architecture and operational architecture apm OpenTelemetry apm:opentelemetry APM UI - OTEL Work needs-refinement A reason and acceptance criteria need to be defined for this issue v8.16.0 labels Aug 8, 2024
@bryce-b bryce-b self-assigned this Aug 13, 2024
@carsonip
Member

@carsonip
Member

Same for get_error_group_main_statistics:

{
  "track_total_hits": false,
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "terms": {
            "processor.event": [
              "error"
            ]
          }
        }
      ],
      "must": [
        {
          "bool": {
            "filter": [
              {
                "term": {
                  "service.name": "sendotlp"
                }
              },
              {
                "range": {
                  "@timestamp": {
                    "gte": 1724164013052,
                    "lte": 1724164913052,
                    "format": "epoch_millis"
                  }
                }
              }
            ]
          }
        }
      ]
    }
  },
  "aggs": {
    "error_groups": {
      "terms": {
        "field": "error.grouping_key",
        "size": 500,
        "order": {
          "_count": "desc"
        }
      },
      "aggs": {
        "sample": {
          "top_hits": {
            "size": 1,
            "_source": [
              "trace.id",
              "error.log.message",
              "error.exception.message",
              "error.exception.handled",
              "error.exception.type",
              "error.culprit",
              "error.grouping_key",
              "@timestamp"
            ],
            "sort": {
              "@timestamp": "desc"
            }
          }
        }
      }
    }
  }
}

I'm going to start a tasklist in this issue to capture _source usages we've encountered
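
For the top_hits sample above, a conversion could look roughly like the following TypeScript sketch. `sourceToFields` and `TopHitsBody` are hypothetical names used only for illustration, and the sketch assumes that `top_hits` accepts the same `fields` option as a regular search, which should be confirmed against the Elasticsearch version in use.

```typescript
// Minimal, illustrative shape of a top_hits body (not the full Elasticsearch type).
type TopHitsBody = {
  size?: number;
  _source?: string[] | boolean;
  fields?: string[];
  sort?: unknown;
};

// Hypothetical helper: move a _source include list over to `fields`
// and switch _source off. Bodies without a _source list are returned unchanged.
function sourceToFields(topHits: TopHitsBody): TopHitsBody {
  const source = topHits._source;
  if (!Array.isArray(source)) {
    return topHits;
  }
  return { ...topHits, _source: false, fields: source };
}

// A shortened version of the `sample` sub-aggregation from the query above.
const converted = sourceToFields({
  size: 1,
  _source: ['trace.id', 'error.grouping_key', '@timestamp'],
  sort: { '@timestamp': 'desc' },
});
```

The consuming code would then read `hit.fields['trace.id'][0]` instead of `hit._source.trace.id`.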

@felixbarny
Member

One usage of _source that we probably can't remove, but need to adjust for OTel, is span links. They're stored differently, but only the _source will have the right ordering of the object array.

There are also other aspects of span links (in particular incoming links from other traces) that need adjustment for OTel.

@bryce-b
Contributor

bryce-b commented Aug 22, 2024

I've run into a couple of queries that are a bit tricky, and I'm not sure of the best way to go about resolving them.
For example:

This returns an entire transaction with a nested format e.g.:

{
   ....
   transaction: {
     id: "1234",
     type: "page-load",
     duration: {
       us: 123456
     },
     ....
   },
   ....
}

whereas fields will return:

{
  "transaction.id": ["1234"],
  "transaction.type": ["page-load"],
  "transaction.duration.us": [123456]
}

Should the new field responses be marshaled into the nested format, or should downstream dependencies be rebuilt to use the new format?
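
For reference, the first option (marshaling the flat `fields` response back into the nested shape) could be sketched as below. This is a minimal illustration, not existing Kibana code: `unflattenFields` is a hypothetical helper, single-element arrays are unwrapped to scalars as an assumption, and a key that collides with an already-set scalar (for example `agent.name` next to `agent.name.text`) is simply skipped.

```typescript
type FlatFields = Record<string, unknown[]>;
type Nested = { [key: string]: unknown };

function isPlainObject(value: unknown): value is Nested {
  return typeof value === 'object' && value !== null && !Array.isArray(value);
}

// Hypothetical helper: rebuild a nested object from a flat `fields` response.
function unflattenFields(fields: FlatFields): Nested {
  const result: Nested = {};
  for (const [path, values] of Object.entries(fields)) {
    const segments = path.split('.');
    const leaf = segments.pop() as string;
    let node: Nested = result;
    let blocked = false;
    for (const segment of segments) {
      const existing = node[segment];
      if (existing === undefined) {
        const child: Nested = {};
        node[segment] = child;
        node = child;
      } else if (isPlainObject(existing)) {
        node = existing;
      } else {
        // A scalar already occupies this path; skip the conflicting key.
        blocked = true;
        break;
      }
    }
    if (blocked || isPlainObject(node[leaf])) {
      continue;
    }
    // Assumption: a single-element array is unwrapped to its only value.
    node[leaf] = values.length === 1 ? values[0] : values;
  }
  return result;
}

const nested = unflattenFields({
  'transaction.id': ['1234'],
  'transaction.type': ['page-load'],
  'transaction.duration.us': [123456],
});
```

The trade-off is that this adds a processing step on every response and has to pick a policy for multi-valued and colliding fields, which is part of why rebuilding the consumers around the flat format is the other option.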

@bryce-b
Contributor

bryce-b commented Aug 29, 2024

I've got an initial PR covering a few APIs so far: #191647
I went with updating the downstream dependencies to avoid data processing in the browser, which is the preference of the UI team.

@AlexanderWert
Member

closing in favour of #192606

@AlexanderWert AlexanderWert closed this as not planned Sep 25, 2024
8 participants