
Preallocate storage for log attributes on stack #1965

Merged
75 commits merged, Aug 23, 2024

Conversation

@lalitb (Member) commented Jul 25, 2024

towards #1920

Changes

  • GrowableArray is introduced to replace the preallocated vector for storing log attributes, aiming to use stack allocation first and only allocate heap memory when necessary (a sketch of the idea follows below). TBD: make it optional under a feature flag.
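
A minimal sketch of the idea, using hypothetical names and an assumed inline capacity (the actual type lives in the SDK internals and tracks occupancy differently): the first few elements live in a stack-allocated array, and only overflow spills into a heap-allocated Vec.

const DEFAULT_MAX_INLINE: usize = 8; // assumed inline capacity, for illustration

#[derive(Debug, Clone)]
struct GrowableArray<T: Default, const N: usize = DEFAULT_MAX_INLINE> {
    inline: [T; N],    // stack-allocated storage, used first
    inline_len: usize, // number of occupied inline slots
    overflow: Vec<T>,  // heap storage; Vec::new() does not allocate
}

impl<T: Default, const N: usize> GrowableArray<T, N> {
    fn new() -> Self {
        Self {
            inline: std::array::from_fn(|_| T::default()),
            inline_len: 0,
            overflow: Vec::new(),
        }
    }

    // Pushes stay on the stack until the inline capacity is exhausted;
    // only then does the first heap allocation happen.
    fn push(&mut self, value: T) {
        if self.inline_len < N {
            self.inline[self.inline_len] = value;
            self.inline_len += 1;
        } else {
            self.overflow.push(value);
        }
    }

    fn len(&self) -> usize {
        self.inline_len + self.overflow.len()
    }
}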

The benchmark and stress-test results are below: a negligible change in stress-test throughput, while the benchmarks improved further by roughly 5-10%.

| Metric | Main Branch | PR Branch | Improvement (%) |
|---|---|---|---|
| Throughput (iterations/sec) | 42,239,400 to 44,855,600 | 43,380,200 to 44,410,600 | -0.13% |
| ot_layer_enabled (ns) | 218.92 to 219.50 | 206.11 to 207.56 | 5.57% |
| Logging_Comparable_To_Appender (ns) | 96.679 to 96.822 | 85.466 to 85.646 | 11.51% |
| log_noop_processor (ns) | 101.78 to 102.16 | 86.313 to 86.503 | 15.23% |
| log_cloning_processor (ns) | 184.26 to 184.57 | 177.25 to 177.56 | 3.80% |
| log_clone_and_send_to_channel_processor (ns) | 648.33 to 657.47 | 668.22 to 670.25 | -2.36% |
| LogExporterWithFuture (ns) | 130.25 to 130.58 | 133.11 to 133.64 | -2.24% |
| LogExporterWithoutFuture (ns) | 98.344 to 99.553 | 90.973 to 91.901 | 7.45% |
  • CONTRIBUTING guidelines followed
  • Unit tests added/updated (if applicable)
  • Appropriate CHANGELOG.md files updated for non-trivial, user-facing changes
  • Changes in public API reviewed (if applicable)

@lalitb lalitb requested a review from a team July 25, 2024 05:20
@lalitb lalitb marked this pull request as draft July 25, 2024 05:21
@lalitb lalitb changed the title Preallocate storage for log attributes on stack [WIP] Preallocate storage for log attributes on stack Jul 25, 2024
@lalitb (Member, Author) commented Jul 27, 2024

Have also included boxing of the AnyValue enum as part of this PR, since it affects the overall size of the preallocated buffer and is a small change (a sketch of why boxing shrinks the buffer follows the numbers below). There is an improvement in the benchmark, but stress throughput has dropped drastically. Need to look into that, along with memory stats.

full-log-with-4-attributes/no-context:
main: ~170 ns
PR: 132 ns

stress test:
main: 35,526,000 iterations/sec
PR: 21,113,200 iterations/sec
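
A sketch of why boxing helps, with hypothetical variants (the real AnyValue has more): a Rust enum is at least as large as its largest variant, so boxing a large payload shrinks every element of the preallocated attribute array; the trade-off is an extra heap allocation when that variant is actually used.

use std::mem::size_of;

enum InlineValue {
    Int(i64),
    Bytes(Vec<u8>), // 24-byte payload keeps every InlineValue large
}

enum BoxedValue {
    Int(i64),
    Bytes(Box<Vec<u8>>), // 8-byte pointer shrinks the whole enum
}

fn main() {
    // On a typical 64-bit target this prints 32 and 16: boxing roughly
    // halves the footprint of each slot in the stack-allocated buffer.
    println!("InlineValue: {} bytes", size_of::<InlineValue>());
    println!("BoxedValue:  {} bytes", size_of::<BoxedValue>());
}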

@lalitb (Member, Author) commented Aug 8, 2024

> The change itself is very good. Awesome work, and thanks for taking us one more step closer to achieving zero-heap logging!
>
> I left a suggestion about not modifying test code to rely on the new internal implementation details. I want to fix #1995 and re-do all benchmarks, to gauge the actual gains from this change (and less variation, hopefully), before merging.

Thanks @cijothomas for the review. Have updated the tests so they no longer rely on the internal implementation details of the attribute storage. Let me know if you still see any issues.

@cijothomas (Member) commented

@lalitb and I observed that on some machines the perf is degraded (benchmarks and stress tests) with this change. We are trying to understand what might be causing that, even though we greatly reduce heap allocation. Holding off the merge until we understand the source.

@lalitb (Member, Author) commented Aug 12, 2024

Adding the relevant part of the flamegraph before (main branch) and after (PR branch):

Main branch: (flamegraph image: main)

PR branch: (flamegraph image: staticallocationPR)

The cost of executing the __memmove_avx_unaligned_erms function (shown with the arrow above) has increased from 1.5% to 4.0%. This is a low-level C function, which typically indicates that data is being physically copied or moved in memory. In our case, the data movement happens here:

let mut data = LogData {
    record: log_record,
    instrumentation: self.instrumentation_library().clone(),
};

i.e., the LogRecord object is moved into the LogData instance. It seems that switching the LogRecord::attributes type from a pre-allocated vector to a fixed-size array changes the memory-operation behavior of our code:

  • In the case of an array, the entire contiguous array is copied from one location to another.
  • In the case of a vector, only its metadata fields (pointer, capacity, and length) are moved.

The cost of moving the data (i.e., the array) outweighs the benefit of its cheap initial creation and population, which leads to increased latency. One option would be to use references or smart pointers (Rc/Arc) over the attribute array to mitigate this, but these come with their own trade-offs: increased complexity, lifetime-management challenges, and the potential overhead of indirection. The sketch below illustrates the size difference.
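
A minimal illustration of that difference, using hypothetical stand-in types: moving a struct that embeds a fixed-size array memcpys the whole array, while moving a Vec-based struct copies only the three-word (pointer, capacity, length) header, no matter how many attributes it holds.

use std::mem::size_of;

#[derive(Clone, Copy, Default)]
struct Attr {
    key: u64,
    value: u64,
}

struct InlineRecord {
    attributes: [Attr; 8], // 8 * 16 = 128 bytes live in the struct itself
}

struct VecRecord {
    attributes: Vec<Attr>, // 24-byte (ptr, cap, len) header; data on heap
}

fn main() {
    // Every move of InlineRecord (e.g. into LogData) copies all 128 bytes;
    // a move of VecRecord copies only the 24-byte header.
    println!("InlineRecord: {} bytes", size_of::<InlineRecord>()); // 128
    println!("VecRecord:    {} bytes", size_of::<VecRecord>());    // 24
}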

Also attaching the full flamegraph svg files if someone is interested in doing more analysis :)

Main branch: (attachment: flamegraph_ex_main)

PR branch: (attachment: flamegraph_ex_pr)

@cijothomas (Member) commented

Thanks @lalitb for getting to the root cause of the observed behavior! This is similar to what we faced in OTel .NET, where using stack allocation to avoid the heap (and GC pressure) actually increased hot-path cost, due to the increase in stack-to-stack copying! https://github.com/open-telemetry/opentelemetry-dotnet/blob/main/test/Benchmarks/Metrics/MetricsBenchmarks.cs#L42-L52

@utpilla (Contributor) commented Aug 13, 2024

> The cost of executing the __memmove_avx_unaligned_erms function (shown with the arrow above) has increased from 1.5% to 4.0%. This is a low-level C function, which typically indicates that data is being physically copied or moved in memory. In our case, the data movement happens here:

Great analysis @lalitb!

@lalitb (Member, Author) commented Aug 15, 2024

I have now modified the LogData export structure to hold a Cow of the LogRecord and InstrumentationLibrary:

Before:

#[derive(Clone, Debug)]
pub struct LogData {
    /// Log record
    pub record: LogRecord,
    /// Instrumentation details for the emitter who produced this `LogEvent`.
    pub instrumentation: InstrumentationLibrary,
}

After:

#[derive(Clone, Debug)]
pub struct LogData<'a> {
    /// Log record, which can be borrowed or owned.
    pub record: Cow<'a, LogRecord>,
    /// Instrumentation details for the emitter who produced this `LogEvent`.
    pub instrumentation: Cow<'a, InstrumentationLibrary>,
}

This allows us to avoid copying the LogRecord structure when it is used with a synchronous processor (see the sketch after this list):

  • The simple processor (and other concurrent processors) gets a reference to the LogData structure, borrows it, and sends it to the exporter. This involves no copy/move; the exporter can decide whether to take ownership based on its logic.
  • The batch processor gets a reference to the LogData structure, converts it into the owned version, and pushes it onto the queue. This involves a clone, which effectively copies the attributes too; that can't be avoided.
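
A minimal sketch of the two paths, using simplified stand-in types rather than the real SDK traits:

use std::borrow::Cow;

#[derive(Clone, Debug)]
struct LogRecord {
    body: String,
}

#[derive(Clone, Debug)]
struct LogData<'a> {
    record: Cow<'a, LogRecord>,
}

// Simple/concurrent path: borrow the record; no attribute copy happens.
fn export_sync(record: &LogRecord) {
    let data = LogData {
        record: Cow::Borrowed(record),
    };
    println!("exporting borrowed: {:?}", data.record.body);
}

// Batch path: the record must outlive the emitting call, so convert to an
// owned Cow before queueing; this clone copies the attributes as well.
fn queue_for_batch(record: &LogRecord) -> LogData<'static> {
    LogData {
        record: Cow::Owned(record.clone()),
    }
}

fn main() {
    let record = LogRecord { body: "hello".into() };
    export_sync(&record);
    let owned = queue_for_batch(&record);
    println!("queued owned: {:?}", owned.record.body);
}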

I can see ~20% improvement with this:

Appender:
PR: ot_layer_enabled        time:   [210.69 ns 211.52 ns 212.46 ns]
Main: ot_layer_enabled        time:   [253.75 ns 254.01 ns 254.33 ns]

SDK:
PR:  [104.99 ns 105.09 ns 105.21 ns]
Main: [136.19 ns 136.30 ns 136.41 ns]

Stress:
PR: 
Throughput: 43,832,200 iterations/sec
Throughput: 44,060,200 iterations/sec
Throughput: 44,089,600 iterations/sec
Throughput: 43,246,400 iterations/sec
Throughput: 42,510,000 iterations/sec

Main:
Throughput: 39,540,600 iterations/sec
Throughput: 39,462,800 iterations/sec
Throughput: 39,574,800 iterations/sec
Throughput: 39,052,600 iterations/sec

@cijothomas (Member) commented

https://github.com/open-telemetry/opentelemetry-rust/blob/main/opentelemetry-sdk/benches/log_processor.rs#L9-L10 — can you check the effect on this benchmark with/without the latest commit? (Trying to see if our benchmarks are actually helping us make more informed decisions!)

@lalitb (Member, Author) commented Aug 15, 2024

Before:
log_noop_processor time: [133.99 ns 134.10 ns 134.23 ns]
log_cloning_processor time: [229.78 ns 229.93 ns 230.09 ns]
log_clone_and_send_to_channel_processor time: [617.49 ns 618.84 ns 620.54 ns]

After:
log_noop_processor time: [100.27 ns 100.62 ns 101.26 ns]
log_cloning_processor time: [301.43 ns 301.76 ns 302.09 ns]
log_clone_and_send_to_channel_processor time: [697.19 ns 697.86 ns 698.57 ns]

We no longer clone in the simple/concurrent processors, so benchmark 2 (log_cloning_processor) no longer exercises a real code path. Benchmark 3 is the batch scenario, which now performs the stack memory copy, hence the higher cost.

@cijothomas (Member) commented

I am also seeing the same perf results (42 M/sec in the stress test)! Thanks for continuously striving to get us better and better!

I just compared with OTel .NET's similar setup on the same machine, and OTel .NET gets around 46 M/sec, so we are getting pretty close to matching .NET (which is heavily optimized already and has been GA for years). With the next set of planned improvements, we should easily exceed that as well.

I requested extracting the LogData/lifetime changes into their own PR, so this one contains only the change to stack-allocate attributes and nothing more.

@lalitb (Member, Author) commented Aug 23, 2024

The PR now contains only the growable-array-related changes. It should be ready to merge if there are no further comments. The feature flag for heap-allocated attributes will be added subsequently.

@cijothomas (Member) left a review comment


LGTM. Left a nit about the benchmark results not being in sync with the PR description.
A combined changelog highlighting all perf improvements could be added in a follow-up?

@lalitb (Member, Author) commented Aug 23, 2024

> LGTM. Left a nit about the benchmark results not being in sync with the PR description. A combined changelog highlighting all perf improvements could be added in a follow-up?

Have fixed the benchmark results to match the description, along with the comment. Yes, planning to add the combined changelog as a follow-up. With this, we've hit the 100th comment on this PR :)

@cijothomas (Member) commented

> > LGTM. Left a nit about the benchmark results not being in sync with the PR description. A combined changelog highlighting all perf improvements could be added in a follow-up?
>
> Have fixed the benchmark results to match the description, along with the comment. Yes, planning to add the combined changelog as a follow-up. With this, we've hit the 100th comment on this PR :)

Merged the other PR, so this one needs to handle the conflict. Sorry, I wasn't paying attention to the merge order. Hopefully this is an easy conflict to resolve and we are good to merge.

This is comment 102 🤣! (We still haven't broken the records of some PRs in the OTel specs.)

@lalitb (Member, Author) commented Aug 23, 2024

> Merged the other PR, so this one needs to handle the conflict. Sorry, I wasn't paying attention to the merge order. Hopefully this is an easy conflict to resolve and we are good to merge.

Yes, fixed the merge conflict now :)

@cijothomas (Member) commented

The idea of using the stack for attributes is not super common and may be undesired in some scenarios, but we do have precedent in Go, .NET, and even in Rust for similar changes. Since this change is a purely internal implementation detail, it is possible to change it in the future without any breaking change.
With that, I'll hit the merge button now!

Thanks Lalit for continuously pursuing better performance and thanks everyone who helped review.

@cijothomas cijothomas merged commit b24bb9c into open-telemetry:main Aug 23, 2024
25 checks passed