Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add geo fields into attribute registry #1116

Merged
merged 15 commits into from
Nov 19, 2024

Conversation

gregkalapos
Copy link
Member

Fixes #1033

Changes

Adds geo fields into the attribute registry into its own file.

This is another take on #351, which basically copied the fields into both client and server. Here I aim for adding the fields only into the registry and not using it yet. At the same time, we have open-telemetry/build-tools#240 with a few suggestions from @trisch-me.

So this PR will add the fields into the registry - then once open-telemetry/build-tools#240 is implemented, we could use these geo fields by updating client.yaml and server.yaml and embed these fields into those.

Merge requirement checklist

@gregkalapos gregkalapos marked this pull request as ready for review June 3, 2024 16:37
@gregkalapos gregkalapos requested review from a team June 3, 2024 16:37
.chloggen/geo_fields.yaml Outdated Show resolved Hide resolved
model/registry/geo.yaml Outdated Show resolved Hide resolved
model/registry/geo.yaml Outdated Show resolved Hide resolved
@trisch-me
Copy link
Contributor

Are there any comments from other maintainers regarding this new namespace?

Copy link
Member

@joaopgrassi joaopgrassi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to look into these further and think about changing the attributes to namespaces.

model/registry/geo.yaml Outdated Show resolved Hide resolved
model/registry/geo.yaml Outdated Show resolved Hide resolved
model/registry/geo.yaml Outdated Show resolved Hide resolved
model/registry/geo.yaml Outdated Show resolved Hide resolved
model/registry/geo.yaml Outdated Show resolved Hide resolved
model/registry/geo.yaml Outdated Show resolved Hide resolved
andrzej-stencel pushed a commit to open-telemetry/opentelemetry-collector-contrib that referenced this pull request Jun 27, 2024
**Description:** <Describe what has changed.> This PR adds an initial
implementation for the MaxMind GeoIP provider for later usage in the
`geoipprocessor`. The processor is still a nop, as no provider can be
configured yet. Providers configuration will be added in
#33268.

- Internal package for temporal attributes conventions reference (should
be removed in favor of
open-telemetry/semantic-conventions#1116):
`processor/geoipprocessor/internal/convention/attributes.go`
- [geoip2-golang](https://github.com/oschwald/geoip2-golang) package
used as a client for data retrieval. See recommended MaxMind clients:
https://dev.maxmind.com/geoip/docs/databases#api-clients

**Link to tracking Issue:** <Issue number if applicable>
#32663

**Testing:** <Describe what testing was performed and which tests were
added.> Database files are generated before running the unit tests
[using MaxMind
writer.](https://github.com/maxmind/MaxMind-DB/tree/0a9c1aa26cd7b91bee2efe27e7d174797d8211a6/test-data#how-to-generate-test-data).
The generation time of those files seems not to be an issue (0.189s):

```
$ go test -v
=== RUN   TestInvalidNewProvider
--- PASS: TestInvalidNewProvider (0.00s)
=== RUN   TestProviderLocation
=== PAUSE TestProviderLocation
=== CONT  TestProviderLocation
=== RUN   TestProviderLocation/nil_IP_address
=== RUN   TestProviderLocation/unsupported_database_type
=== RUN   TestProviderLocation/no_IP_metadata_in_database
=== RUN   TestProviderLocation/all_attributes_should_be_present_for_IPv4_using_GeoLite2-City_database
=== RUN   TestProviderLocation/subset_attributes_for_IPv6_IP_using_GeoIP2-City_database
--- PASS: TestProviderLocation (0.00s)
    --- PASS: TestProviderLocation/nil_IP_address (0.00s)
    --- PASS: TestProviderLocation/unsupported_database_type (0.00s)
    --- PASS: TestProviderLocation/no_IP_metadata_in_database (0.00s)
    --- PASS: TestProviderLocation/all_attributes_should_be_present_for_IPv4_using_GeoLite2-City_database (0.00s)
    --- PASS: TestProviderLocation/subset_attributes_for_IPv6_IP_using_GeoIP2-City_database (0.00s)
PASS
ok  	github.com/open-telemetry/opentelemetry-collector-contrib/processor/geoipprocessor/internal/provider/maxmindprovider	0.189s
```

Testing database files are removed on sucessful test execution.

**Documentation:** <Describe the documentation added.> README.md file
@lmolkova
Copy link
Contributor

Would it be possible to add a use case for those?

Do we expect them to be on span/metrics attributes? If the goal is to record events, they may go into the payload and should not be defined in the registry then.

@rogercoll
Copy link
Contributor

Would it be possible to add a use case for those?

In the OpenTelemetry collector we are adding a processor that would add those attributes to any trace, metric or log. The processor looks for an IP within the resource attributes and appends the related geolocation attributes. The component is still a nop (we are just adding support for multiple geolocation providers): https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/geoipprocessor

Additional context: open-telemetry/opentelemetry-collector-contrib#32663

@lmolkova
Copy link
Contributor

thanks for the clarification @rogercoll

Looking into the collector:
source.address is not a resource attribute, it's dynamic and describes the network packet source.

### Source and destination attributes
These attributes may be used to describe the sender and receiver of a network exchange/packet. These should be used
when there is no client/server relationship between the two sides, or when that relationship is unknown.
This covers low-level network interactions (e.g. packet tracing) where you don't know if
there was a connection or which side initiated it.
This also covers unidirectional UDP flows and peer-to-peer communication where the
"user-facing" surface of the protocol / API does not expose a clear notion of client and server.

Is you intention to enrich individual telemetry signals or resources?

@AlexanderWert
Copy link
Member

AlexanderWert commented Jun 30, 2024

Is you intention to enrich individual telemetry signals or resources?

@lmolkova The intention of the GeoIP processor is to have a flexible processor that can derive geo information from an IP.

What the source of the IP is (i.e. which attribue it's coming from) and what the destination (i.e. namespace) depends on the use case.

For example, on web server access logs the geoIP processor can be used to derive the geo information for each individual request's source.ip, allowing endusers to visualize where (geographically) requests come from.

But there could also be other use cases, that would require to read the IP from other attributes (either signal level or resource), etc.

@lmolkova
Copy link
Contributor

lmolkova commented Jul 1, 2024

@AlexanderWert thanks for the context!

So if I understand correctly the intention is to

  • have ability to add geo attributes to any signal (traces, metrics, logs, resources)
  • based on the pre-existing arbitrary IP attribute on that signal

I.e. we do need to define them as attributes and not event payload fields.

Correct ?

Looking only from the semconv perspective, it could be somewhat misleading. I can have more than one IP on the telemetry item (e.g. host|device.ip, client.address, network.peer.address, ...) - if I add geo.*, it's not clear which one they apply to.
What if some instrumentation (or PII redactor) reuses geo attributes to report end-user geography?

I assume embedding could be useful here and we could embed the geo namespace under related IPs?

E.g. if I configure geo-processor for client.address, it adds client.geo.*, or if I do it for source.address, processor adds source.geo.*

Effectively, geo namespace becomes something that's usually embedded and not used as stand-alone.
This way we can enrich signals with geo information unambiguously and on any level (application, instr library, processor, backend).

Thoughts?

@trisch-me
Copy link
Contributor

trisch-me commented Jul 1, 2024

@lmolkova you are correct, geo namespace is usually embedded into parent namespace and enriches it with geo information

In ECS we don't use geo namespace as standalone one in the root of events

@AlexanderWert
Copy link
Member

Effectively, geo namespace becomes something that's usually embedded and not used as stand-alone.
This way we can enrich signals with geo information unambiguously and on any level (application, instr library, processor, backend).

Yes, that's absolutely right!

I think we should consider it as a two steps process to add geo attributes to semconv:

  1. This PR defines the core / plain geo attributes in the registry.
  2. Once we have the embedding functionality, we will use that to extend those to be used under source.*, client.*, etc.
groups:
  - id: client
    prefix: client
    type: attribute_group
    brief: ...
    embed_namespaces:
      - id: geo

BTW, @trisch-me what's the status with embed? Do we have any blockers with that?

@trisch-me
Copy link
Contributor

what's the status with embed?

An option to embed not the full namespace but just particular fields (for example create client.geo.lat) is merged and we are preparing weaver release for it.
Embedding the whole namespace at once is the next step

@lmolkova
Copy link
Contributor

lmolkova commented Jul 2, 2024

Thanks again for the context @AlexanderWert and @trisch-me !

So maybe we can move forward by adding some disclaimer into the group brief like

Note:
Geo attributes are typically used under another namespace, such as client.* and describe the location of the corresponding entity (device, end-user, etc). Semantic conventions that reference geo attributes (as a root namespace) or embed them (under their own namespace) SHOULD document what geo attributes describe in the scope of that convention.

?

Also, it would be great if we could start embedding some common attributes under client, source or other namespaces.

docs/attributes-registry/geo.md Show resolved Hide resolved
model/registry/geo.yaml Outdated Show resolved Hide resolved
model/registry/geo.yaml Outdated Show resolved Hide resolved
model/registry/geo.yaml Outdated Show resolved Hide resolved
model/registry/geo.yaml Outdated Show resolved Hide resolved
model/registry/geo.yaml Outdated Show resolved Hide resolved
model/registry/geo.yaml Outdated Show resolved Hide resolved
@lmolkova
Copy link
Contributor

lmolkova commented Jul 9, 2024

related: #1228

@github-actions github-actions bot added the Stale label Aug 15, 2024
@rogercoll
Copy link
Contributor

I reckon this PR is still on review (unstale)

@trisch-me trisch-me added never stale PRs marked with this label will be never staled and automatically closed and removed Stale labels Aug 19, 2024
@lmolkova lmolkova mentioned this pull request Oct 18, 2024
3 tasks
Copy link
Contributor

@lmolkova lmolkova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with some suggestions. The key part is geo.city.name (which I'm suggesting to remove or rename to geo.locality.name

model/registry/geo.yaml Outdated Show resolved Hide resolved
model/registry/geo.yaml Outdated Show resolved Hide resolved
model/registry/geo.yaml Outdated Show resolved Hide resolved
@gregkalapos
Copy link
Member Author

LGTM with some suggestions. The key part is geo.city.name (which I'm suggesting to remove or rename to geo.locality.name

Agreed. ECS calls it city_name, so it's a bit unfortunate to see that this will diverge from ECS; on the other hand, I also wasn't satisfied with city - calling this locality is clearly an improvement. I'd be ok to proceed with that, so I kept the field and changed it to locality.

@gregkalapos
Copy link
Member Author

:(

  ERROR: 1 dead links found in ./docs/system/system-metrics.md !
  [✖] https://blogs.oracle.com/linux/post/understanding-linux-kernel-memory-statistics → Status: 0

@gregkalapos
Copy link
Member Author

I think this is ready to be merged - I'm not authorized, so I can't do so.

Could someone please hit merge?

@AlexanderWert AlexanderWert merged commit c5093dd into open-telemetry:main Nov 19, 2024
14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
never stale PRs marked with this label will be never staled and automatically closed
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Add Geo fields from Elastic Common Schema
7 participants