
Preallocate storage for log attributes on stack #1965

Merged
75 commits merged, Aug 23, 2024

Conversation

@lalitb (Member) commented Jul 25, 2024

towards #1920

Changes

  • GrowableArray is introduced to replace the preallocated vector for storing log attributes, aiming to use stack allocation first and only allocate heap memory when necessary (a sketch of the idea follows below). TBD: make it optional under a feature flag.
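
A minimal sketch of the idea, using hypothetical names and an assumed inline capacity (the actual type lives in the SDK internals and tracks occupancy differently): the first few elements live in a stack-allocated array, and only overflow spills into a heap-allocated Vec.

const DEFAULT_MAX_INLINE: usize = 8; // assumed inline capacity, for illustration

#[derive(Debug, Clone)]
struct GrowableArray<T: Default, const N: usize = DEFAULT_MAX_INLINE> {
    inline: [T; N],    // stack-allocated storage, used first
    inline_len: usize, // number of occupied inline slots
    overflow: Vec<T>,  // heap storage; Vec::new() does not allocate
}

impl<T: Default, const N: usize> GrowableArray<T, N> {
    fn new() -> Self {
        Self {
            inline: std::array::from_fn(|_| T::default()),
            inline_len: 0,
            overflow: Vec::new(),
        }
    }

    // Pushes stay on the stack until the inline capacity is exhausted;
    // only then does the first heap allocation happen.
    fn push(&mut self, value: T) {
        if self.inline_len < N {
            self.inline[self.inline_len] = value;
            self.inline_len += 1;
        } else {
            self.overflow.push(value);
        }
    }

    fn len(&self) -> usize {
        self.inline_len + self.overflow.len()
    }
}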

The benchmark and stress-test results are below: a negligible change in stress-test throughput, while the benchmarks improved further by roughly 5-10%.

| Metric | Main Branch | PR Branch | Improvement (%) |
|---|---|---|---|
| Throughput (iterations/sec) | 42,239,400 to 44,855,600 | 43,380,200 to 44,410,600 | -0.13% |
| ot_layer_enabled (ns) | 218.92 to 219.50 | 206.11 to 207.56 | 5.57% |
| Logging_Comparable_To_Appender (ns) | 96.679 to 96.822 | 85.466 to 85.646 | 11.51% |
| log_noop_processor (ns) | 101.78 to 102.16 | 86.313 to 86.503 | 15.23% |
| log_cloning_processor (ns) | 184.26 to 184.57 | 177.25 to 177.56 | 3.80% |
| log_clone_and_send_to_channel_processor (ns) | 648.33 to 657.47 | 668.22 to 670.25 | -2.36% |
| LogExporterWithFuture (ns) | 130.25 to 130.58 | 133.11 to 133.64 | -2.24% |
| LogExporterWithoutFuture (ns) | 98.344 to 99.553 | 90.973 to 91.901 | 7.45% |
  • CONTRIBUTING guidelines followed
  • Unit tests added/updated (if applicable)
  • Appropriate CHANGELOG.md files updated for non-trivial, user-facing changes
  • Changes in public API reviewed (if applicable)

@lalitb lalitb requested a review from a team July 25, 2024 05:20
@lalitb lalitb marked this pull request as draft July 25, 2024 05:21
@lalitb lalitb changed the title Preallocate storage for log attributes on stack [WIP] Preallocate storage for log attributes on stack Jul 25, 2024
@lalitb (Member, Author) commented Jul 27, 2024

Have also included boxing of the AnyValue enum as part of this PR, since it affects the overall size of the preallocated buffer and is a small change (a sketch of why boxing shrinks the buffer follows the numbers below). There is an improvement in the benchmark, but stress throughput has dropped drastically. Need to look into that, along with memory stats.

full-log-with-4-attributes/no-context:
main: ~170 ns
PR: 132 ns

stress test:
main: 35,526,000 iterations/sec
PR: 21,113,200 iterations/sec
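
A sketch of why boxing helps, with hypothetical variants (the real AnyValue has more): a Rust enum is at least as large as its largest variant, so boxing a large payload shrinks every element of the preallocated attribute array; the trade-off is an extra heap allocation when that variant is actually used.

use std::mem::size_of;

enum InlineValue {
    Int(i64),
    Bytes(Vec<u8>), // 24-byte payload keeps every InlineValue large
}

enum BoxedValue {
    Int(i64),
    Bytes(Box<Vec<u8>>), // 8-byte pointer shrinks the whole enum
}

fn main() {
    // On a typical 64-bit target this prints 32 and 16: boxing roughly
    // halves the footprint of each slot in the stack-allocated buffer.
    println!("InlineValue: {} bytes", size_of::<InlineValue>());
    println!("BoxedValue:  {} bytes", size_of::<BoxedValue>());
}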

@lalitb (Member, Author) commented Aug 8, 2024

> The change itself is very good. Awesome work, and thanks for taking us one more step closer to achieving zero-heap logging!
>
> I left a suggestion about not modifying test code to rely on the new internal implementation details. I want to fix #1995 and re-do all benchmarks, to gauge the actual gains from this change (and less variation, hopefully), before merging.

Thanks @cijothomas for the review. Have updated the tests so they no longer rely on the internal implementation details of the attribute storage. Let me know if you still see any issues.

@cijothomas (Member) commented

@lalitb and I observed that on some machines the perf is degraded (benchmarks and stress tests) with this change. We are trying to understand what might be causing that, even though we greatly reduce heap allocation. Holding off the merge until we understand the source.

@lalitb (Member, Author) commented Aug 12, 2024

Adding the relevant part of the flamegraph before (main branch) and after (PR branch):

Main branch: (flamegraph image: main)

PR branch: (flamegraph image: staticallocationPR)

The cost of executing the __memmove_avx_unaligned_erms function (shown with the arrow above) has increased from 1.5% to 4.0%. This is a low-level C function, which typically indicates that data is being physically copied or moved in memory. In our case, the data movement happens here:

let mut data = LogData {
    record: log_record,
    instrumentation: self.instrumentation_library().clone(),
};

i.e., the LogRecord object is moved into the LogData instance. It seems that switching the LogRecord::attributes type from a pre-allocated vector to a fixed-size array changes the memory-operation behavior of our code:

  • In the case of an array, the entire contiguous array is copied from one location to another.
  • In the case of a vector, only its metadata fields (pointer, capacity, and length) are moved.

The cost of moving the data (i.e., the array) outweighs the benefit of its cheap initial creation and population, which leads to increased latency. One option would be to use references or smart pointers (Rc/Arc) over the attribute array to mitigate this, but these come with their own trade-offs: increased complexity, lifetime-management challenges, and the potential overhead of indirection. The sketch below illustrates the size difference.
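
A minimal illustration of that difference, using hypothetical stand-in types: moving a struct that embeds a fixed-size array memcpys the whole array, while moving a Vec-based struct copies only the three-word (pointer, capacity, length) header, no matter how many attributes it holds.

use std::mem::size_of;

#[derive(Clone, Copy, Default)]
struct Attr {
    key: u64,
    value: u64,
}

struct InlineRecord {
    attributes: [Attr; 8], // 8 * 16 = 128 bytes live in the struct itself
}

struct VecRecord {
    attributes: Vec<Attr>, // 24-byte (ptr, cap, len) header; data on heap
}

fn main() {
    // Every move of InlineRecord (e.g. into LogData) copies all 128 bytes;
    // a move of VecRecord copies only the 24-byte header.
    println!("InlineRecord: {} bytes", size_of::<InlineRecord>()); // 128
    println!("VecRecord:    {} bytes", size_of::<VecRecord>());    // 24
}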

Also attaching the full flamegraph svg files if someone is interested in doing more analysis :)

Main branch: (attachment: flamegraph_ex_main)

PR branch: (attachment: flamegraph_ex_pr)

@cijothomas (Member) commented

Thanks @lalitb for getting to the root cause of the observed behavior! This is similar to what we faced in OTel .NET, where using stack allocation to avoid the heap (and GC pressure) actually increased hot-path cost, due to the increase in stack-to-stack copying! https://github.com/open-telemetry/opentelemetry-dotnet/blob/main/test/Benchmarks/Metrics/MetricsBenchmarks.cs#L42-L52

@utpilla (Contributor) commented Aug 13, 2024

> The cost of executing the __memmove_avx_unaligned_erms function (shown with the arrow above) has increased from 1.5% to 4.0%. This is a low-level C function, which typically indicates that data is being physically copied or moved in memory. In our case, the data movement happens here:

Great analysis @lalitb!

@lalitb (Member, Author) commented Aug 15, 2024

I have now modified the LogData export structure to hold a Cow of the LogRecord and InstrumentationLibrary:

Before:

#[derive(Clone, Debug)]
pub struct LogData {
    /// Log record
    pub record: LogRecord,
    /// Instrumentation details for the emitter who produced this `LogEvent`.
    pub instrumentation: InstrumentationLibrary,
}

After:

#[derive(Clone, Debug)]
pub struct LogData<'a> {
    /// Log record, which can be borrowed or owned.
    pub record: Cow<'a, LogRecord>,
    /// Instrumentation details for the emitter who produced this `LogEvent`.
    pub instrumentation: Cow<'a, InstrumentationLibrary>,
}

This allows us to avoid copying the LogRecord structure when it is used with a synchronous processor (see the sketch after this list):

  • The simple processor (and other concurrent processors) gets a reference to the LogData structure, borrows it, and sends it to the exporter. This involves no copy/move; the exporter can decide whether to take ownership based on its logic.
  • The batch processor gets a reference to the LogData structure, converts it into the owned version, and pushes it onto the queue. This involves a clone, which effectively copies the attributes too; that can't be avoided.
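
A minimal sketch of the two paths, using simplified stand-in types rather than the real SDK traits:

use std::borrow::Cow;

#[derive(Clone, Debug)]
struct LogRecord {
    body: String,
}

#[derive(Clone, Debug)]
struct LogData<'a> {
    record: Cow<'a, LogRecord>,
}

// Simple/concurrent path: borrow the record; no attribute copy happens.
fn export_sync(record: &LogRecord) {
    let data = LogData {
        record: Cow::Borrowed(record),
    };
    println!("exporting borrowed: {:?}", data.record.body);
}

// Batch path: the record must outlive the emitting call, so convert to an
// owned Cow before queueing; this clone copies the attributes as well.
fn queue_for_batch(record: &LogRecord) -> LogData<'static> {
    LogData {
        record: Cow::Owned(record.clone()),
    }
}

fn main() {
    let record = LogRecord { body: "hello".into() };
    export_sync(&record);
    let owned = queue_for_batch(&record);
    println!("queued owned: {:?}", owned.record.body);
}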

I can see ~20% improvement with this:

Appender:
PR: ot_layer_enabled        time:   [210.69 ns 211.52 ns 212.46 ns]
Main: ot_layer_enabled        time:   [253.75 ns 254.01 ns 254.33 ns]

SDK:
PR:  [104.99 ns 105.09 ns 105.21 ns]
Main: [136.19 ns 136.30 ns 136.41 ns]

Stress:
PR: 
Throughput: 43,832,200 iterations/sec
Throughput: 44,060,200 iterations/sec
Throughput: 44,089,600 iterations/sec
Throughput: 43,246,400 iterations/sec
Throughput: 42,510,000 iterations/sec

Main:
Throughput: 39,540,600 iterations/sec
Throughput: 39,462,800 iterations/sec
Throughput: 39,574,800 iterations/sec
Throughput: 39,052,600 iterations/sec

@cijothomas (Member) commented

https://github.com/open-telemetry/opentelemetry-rust/blob/main/opentelemetry-sdk/benches/log_processor.rs#L9-L10 — can you check the effect on this benchmark with/without the latest commit? (Trying to see if our benchmarks are actually helping us make more informed decisions!)

@lalitb (Member, Author) commented Aug 15, 2024

Before:
log_noop_processor time: [133.99 ns 134.10 ns 134.23 ns]
log_cloning_processor time: [229.78 ns 229.93 ns 230.09 ns]
log_clone_and_send_to_channel_processor time: [617.49 ns 618.84 ns 620.54 ns]

After:
log_noop_processor time: [100.27 ns 100.62 ns 101.26 ns]
log_cloning_processor time: [301.43 ns 301.76 ns 302.09 ns]
log_clone_and_send_to_channel_processor time: [697.19 ns 697.86 ns 698.57 ns]

We no longer clone in the simple/concurrent processors, so benchmark 2 (log_cloning_processor) no longer exercises a real code path. Benchmark 3 is the batch scenario, which now performs the stack memory copy, hence the higher cost.

@cijothomas (Member) commented

I am also seeing the same perf results (42 M/sec in the stress test)! Thanks for continuously striving to get us better and better!

I just compared with OTel .NET's similar setup on the same machine, and OTel .NET gets around 46 M/sec, so we are getting pretty close to matching .NET (which is heavily optimized already and has been GA for years). With the next set of planned improvements, we should easily exceed that as well.

I requested extracting the LogData/lifetime changes into their own PR, so this one contains only the change to stack-allocate attributes and nothing more.

@lalitb (Member, Author) commented Aug 23, 2024

The PR now contains only the growable-array-related changes. It should be ready to merge if there are no further comments. The feature flag for heap-allocated attributes will be added subsequently.

@cijothomas (Member) left a review comment


LGTM. Left a nit about the benchmark results not being in sync with the PR description.
A combined changelog highlighting all perf improvements could be added in a follow-up?

@lalitb (Member, Author) commented Aug 23, 2024

> LGTM. Left a nit about the benchmark results not being in sync with the PR description. A combined changelog highlighting all perf improvements could be added in a follow-up?

Have fixed the benchmark results to match the description, along with the comment. Yes, planning to add the combined changelog as a follow-up. With this, we've hit the 100th comment on this PR :)

@cijothomas (Member) commented

> > LGTM. Left a nit about the benchmark results not being in sync with the PR description. A combined changelog highlighting all perf improvements could be added in a follow-up?
>
> Have fixed the benchmark results to match the description, along with the comment. Yes, planning to add the combined changelog as a follow-up. With this, we've hit the 100th comment on this PR :)

Merged the other PR, so this one needs to handle the conflict. Sorry, I wasn't paying attention to the merge order. Hopefully this is an easy conflict to resolve and we are good to merge.

This is comment 102 🤣! (We still haven't broken the records of some PRs in the OTel specs.)

@lalitb (Member, Author) commented Aug 23, 2024

> Merged the other PR, so this one needs to handle the conflict. Sorry, I wasn't paying attention to the merge order. Hopefully this is an easy conflict to resolve and we are good to merge.

Yes, fixed the merge conflict now :)

@cijothomas (Member) commented

The idea of using the stack for attributes is not super common and may be undesired in some scenarios, but we do have precedent in Go, .NET, and even in Rust for similar changes. Since this change is a purely internal implementation detail, it is possible to change it in the future without any breaking change.
With that, I'll hit the merge button now!

Thanks Lalit for continuously pursuing better performance and thanks everyone who helped review.

@cijothomas cijothomas merged commit b24bb9c into open-telemetry:main Aug 23, 2024
25 checks passed