Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix es integration test 6094 #6157

Closed

Conversation

madmecodes
Copy link

This change fixes two issues in ES integration tests:

  1. Prevents index not found exceptions by proper index readiness checks
  2. Prevents duplicate span detection issues

The solution uses sync.Map for thread-safe:

  • Index existence caching
  • Span tracking

Fixes #6094

Which problem is this PR solving?

  • Resolves ES 8.x / v2 integration test often fails #6094
  • Fixes race conditions in ES integration tests where:
    • Index not found exceptions occur ~170ms after index creation
    • Duplicate spans are incorrectly detected (10 spans reduced to 5)

Description of the changes

  • Added thread-safe index readiness verification
    • Using sync.Map for index caching
    • Added proper index existence checks
    • Added wait/retry logic for index readiness
  • Fixed duplicate span detection
    • Added thread-safe span tracking using sync.Map
    • Improved logging around span operations

How was this change tested?

  • Ran ES integration tests multiple times with results:

This change fixes two issues in ES integration tests:
1. Prevents index not found exceptions by proper index readiness checks
2. Prevents duplicate span detection issues

The solution uses sync.Map for thread-safe:
- Index existence caching
- Span tracking

Fixes jaegertracing#6094

Signed-off-by: ayush-gupta-dev <[email protected]>
This change fixes two issues in ES integration tests:
1. Prevents index not found exceptions by proper index readiness checks
2. Prevents duplicate span detection issues

The solution uses sync.Map for thread-safe:
- Index existence caching
- Span tracking

Fixes jaegertracing#6094

Signed-off-by: ayush-gupta-dev <[email protected]>
@madmecodes
Copy link
Author

@yurishkuro, I hope you are well, will you review this when you have a moment?

Copy link
Member

@yurishkuro yurishkuro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is the right solution or the right diagnosis of the problem. My hypothesis was that it is a race condition between creating templates and writing spans, not between creating indices and writing spans - our code never creates indices, it only creates index templates when factory is initialized.

@@ -42,6 +44,63 @@ type SpanWriter struct {
serviceWriter serviceWriter
spanConverter dbmodel.FromDomain
spanServiceIndex spanAndServiceIndexFn
indexCache sync.Map
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please delete from 43

Comment on lines +51 to +54
if _, exists := s.indexCache.Load(indexName); exists {
return nil
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the following statement looks like a CAS operation, so why do you need this one above?

return fmt.Errorf("failed to check index existence: %w", err)
}

if !exists {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shortcircuit

if exists { return }

if !exists {
s.logger.Info("Creating index", zap.String("index", indexName))

// Set specific settings for the test environment
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

plugin/storage/es/spanstore/writer.go is production code, there's no place for test settings here

}
}`

_, err = s.client().CreateIndex(indexName).Body(body).Do(ctx)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the writer should not be responsible for creating indices. The indices are created by ES automatically from templates on writes.

The templates are created by the factory:

if err := writer.CreateTemplates(spanMapping, serviceMapping, cfg.Indices.IndexPrefix); err != nil {

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yurishkuro

I see, now i understand the core issue better:

  1. it's about template creation timing vs span writing.
  2. Indices are automatically created by ES based on templates.
  3. The race condition occurs in factory.go where templates are created and then spans are written before templates are fully ready?

Would this be a better approach?

  1. Remove all index management code from SpanWriter
  2. Add template creation synchronization in factory.go's createSpanWriter,

something like, passing context in createSpanWriter and in Create Templates function we can handle the context.. I am taling about these 2 functions, if thinking in correct direction shall i proceed and test?

Copy link
Author

@madmecodes madmecodes Nov 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Visualizing like this (What do you think, shall i proceed with this?) I really think by passing context we can mititgate the issue.

Before:
Start -> Create Writer -> Start Template Creation -> Return Writer -> Write Spans (fails since template still creating)
[Templates Still Creating...]

After:
Start-> Create Writer -> Create Templates -> Retry if needed -> Templates Ready -> Return Writer -> Write Spans

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we want the sequence in the After part, we just need to enforce that the templates finished creating before returning the writer.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

made a new PR, #6229

@yurishkuro
Copy link
Member

different approach was attempted in another PR

@yurishkuro yurishkuro closed this Jan 2, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ES 8.x / v2 integration test often fails
2 participants