
possible to resolve the issue for golang? #23

Closed
kolinfluence opened this issue Dec 1, 2022 · 8 comments

Comments

@kolinfluence

I can sponsor a few coffees for this.

Also, the server is running in "pre-fork mode", meaning one process per thread, so there is thread isolation (the goroutines use gnet's ants, https://github.com/panjf2000/ants, or bytedance's gopool).

Please do a golang binding for this to reduce the issues for golang in the use-case scenario I've mentioned above.

with ref to this:
#22

Thanks in advance! I really appreciate this and will repay your kindness in other ways in future.

@simonhf
Owner

simonhf commented Dec 10, 2022

Thanks for the comment, @ultperf.

Not quite sure what is meant by "pre-fork mode" because Golang also doesn't have fork()ing, right?

"means one process per thread" Do you mean "one thread per process"? One thread per Golang process would make it easier to create a binding for SharedHashFile for a Golang process running like this. But who runs their Golang processes like this? What are you hoping to achieve with SharedHashFile that cannot be achieved without it e.g. using a different architecture and/or components?

@kolinfluence
Author

Yes, "one thread per process".

  1. Lots of high-performance golang users run it this way.
  2. Because it's running in prefork mode, shared memory will be faster.

P.S.: Possible to make this happen? I can sponsor a few coffees for this.

@simonhf
Owner

simonhf commented Dec 15, 2022

By prefork do you mean like here [1] where it says "Preforks master process between several child processes increases performance, because Go doesn't have to share and manage memory between cores." Are there any experiments to back up this statement about performance?

Ideally ultra high performing C programs can still use threads and be in one process, but you want to avoid context switching overhead, and you also want to avoid memory allocation overhead and cleaning up the memory overhead (AKA GC for Golang).

In theory a Golang program could use as many Goroutines as CPU cores, so there'd be no context switching overhead. And the same program could also pre-allocate most or all of the memory it uses, thus nullifying the "have to share and manage memory between cores" statement above. In this scenario you wouldn't need SharedHashFile because all the Goroutines can see the same memory in the process anyway.

But you also wouldn't want to use Golang's built in hash table either. Why not? It relies on allocating memory which later has to be GC'd and thus does not nullify the "have to share and manage memory between cores" statement. And general purpose LRU caches for Golang (e.g. [2]) suffer from the same issue because they are effectively built on top of Golang's built in hash table / associative array which will churn memory and GC at run-time.

But if you implemented your own Golang hash table using pre-allocated memory to avoid GC then it might end up looking something like herocache / heropool [3]. Note: Here hero means (he)ap ze(ro) meaning that when using herocache then it's neutral to the heap and thus GC neutral too :-) This means you could put 100M items in herocache and the heap wouldn't grow (because all memory herocache uses is pre-allocated) and there'd also be no resulting GC.

So I'm not saying that I won't port SharedHashFile to Golang, but I'm questioning / debating the proposed architecture it would go into, and whether prefork is "good" / necessary / best for performance. Motto: "Measure twice, cut once" :-)

[1] https://pkg.go.dev/github.com/valyala/fasthttp/prefork#section-readme
[2] https://github.com/hashicorp/golang-lru
[3] https://github.com/onflow/flow-go/blob/aa0be2d8cf77e116f4e41258ae80fde99db9b1a1/module/mempool/herocache/backdata/heropool/pool.go

@kolinfluence
Author

https://www.techempower.com/benchmarks/#section=data-r21&test=cached-query

Check all the prefork-mode golang entries vs the non-prefork ones.

Possible to get the binding done for golang?

It's not critical, but I really hope to see it happen.

@simonhf
Owner

simonhf commented Dec 30, 2022

@ultperf, thanks for sending the link to the techempower round 21 cached query results.

Looking at the 'Cached queries' results for '20 queries (bar)', 'fasthttp-prefork' managed 353,362 responses per second, and 'fasthttp' only 68,312 responses per second. So the prefork version is about 5.2 times faster?

And if I understand you correctly, the prefork version runs n x 1 thread processes instead of the non-prefork version running a 1 x n thread process?

The difference seems so big that I'm wondering why one version is so much faster than the other. You hinted before that it's about Golang's GC. Is that one of the reasons, and has anybody published an analysis online somewhere?

Having a look at the prefork code here [1], I noticed that its README publishes wrk stats with prefork (104,861 requests per second) and without (97,553 requests per second). This is only 1.07 times faster for prefork.

So why is the techempower benchmark 5.2 times faster for prefork but prefork's own benchmark is only 1.07 times faster?

[1] https://github.com/valyala/fasthttp/tree/master/prefork

@kolinfluence
Author

kolinfluence commented Dec 30, 2022

@simonhf

  1. prefork mode is one thread per process (with SO_REUSEPORT); non-prefork is one process running multiple threads inside (without SO_REUSEPORT)

> Having a look at the prefork code here [1], I noticed that its README publishes wrk stats with prefork (104,861 requests per second) and without (97,553 requests per second). This is only 1.07 times faster for prefork

In areas of memory use / caching / GC:
2. This is why I've mentioned that a "sharedhashfile" golang binding can significantly help in prefork mode. The 1.07-times-faster result is an "ideal" condition with negligible (or insignificant) memory use.

Once memory is actually used, the issue with golang as a GC'd, single-process, multithreaded goroutine runtime is that memory handling between the threads will do a lot of context switches, with consideration to:
a) multiple goroutines (check bytedance/gopkg#144, which aims to solve the goroutine memory issue)
b) real-world programs (large ones too), where performance and memory use will create lots of problems without prefork mode. Also, the 1.07 benchmark uses a 4-core CPU and is purely bare-bones fasthttp usage; it's totally impractical as a useful benchmark.

For server production use, the "cached query" techempower test is more "real life" because it benchmarks on a 20+ core CPU, and caching in a golang program is the reason I am asking you for the "sharedhashfile" golang binding.

Please help? I can sponsor a few coffees for this? I truly need this... not sure if you can write it in pure go or cgo; preferably a pure-go binding, or c-to-go asm / goasm.

P.S.: It will be great if you can optionally make it a shared-memory IPC for golang. That would really be something, and this module would be used everywhere.

@simonhf
Owner

simonhf commented Jan 2, 2023

@ultperf, thanks for the info and clarifications.

More questions for clarification :-)

Let's say for example sake that the test is running on the 20+ core CPU and the number of processes or threads is 20.
So running in prefork mode there would be 20 processes, each with one thread and presumably one goroutine?

And running in non-prefork mode there would be 1 process with 20 threads and presumably 20 goroutines or 1 goroutine per thread?

Are the 'cached queries' just cached in memory upon startup in whatever internal data / memory format is convenient to the implementation?

Presumably the prefork mode code starts, loads the cached query data into memory, and then fork()s 20 times? In this way it uses a similar amount of memory to the non-prefork mode because the cached query data is in shared memory due to the fork()ing?

> mem handling between multi threads will be doing a lot of context switches between those threads

Confused by this statement: In non-prefork mode, if the number of goroutines is kept less than the number of CPU cores, surely there wouldn't be any context switches, or?

Presumably the prefork and non-prefork modes will generate the same amount of garbage for GC -- assuming the same amount of incoming queries -- at run-time?

So, all other things being similar for processing a query, we might assume that the difference in performance, and the reason the prefork mode code is ~ 5 times faster, is largely due to the higher efficiency of the prefork mode GC at run-time?
Why might the prefork GC be more efficient?

When Golang performs GC it needs to loop through every heap allocation, whether it's going to be GC'd or not. This means the bigger the heap, the slower the GC. So this is already bad for the non-prefork mode code: its heap will be ~ 20 times bigger than the heap of a single prefork process? Which means a 20 times longer concurrent GC?

Also, although the Golang GC is largely concurrent to minimize 'stop the world' pauses, in past tests I have found that new heap allocations become much slower while a concurrent GC is running, so the concurrent GC slows down regular code even though it doesn't stop it.

The prefork mode code does exactly the same thing as the non-prefork code but: 1. Presumably the heap for each process is going to be 20 times smaller, meaning the concurrent GC is going to be 20 times faster? And 2. Assuming not all 20 processes GC at the same time, there will be a good chance of a new query being handled by a process not currently slowed down by concurrent GC, and presumably the query will be handled faster?

It would be interesting to experiment further with the prefork mode code to dynamically disable Golang GC altogether while handling queries, and periodically enable GC but disable handling queries. This way, queries will only ever be handled by processes guaranteed not to be subject to concurrent GC, and in theory giving overall better performance? There could also be some kind of IPC mechanism to ensure that GC always happens evenly spread out between the 20 processes?

That's enough questions and speculation for now :-)

@kolinfluence
Author

kolinfluence commented Jan 3, 2023


> Let's say for example sake that the test is running on the 20+ core CPU and the number of processes or threads is 20. So running in prefork mode there would be 20 processes each with one thread and presumably one goroutine?

No, goroutines are thread-specific, so each thread will spawn its own goroutines.
Thinking of a goroutine in C terms rather than golang terms: running a goroutine just "pauses" it while other things run, and everything uses one thread. So you can think of goroutines in single-threaded mode as function-level context switching.

> And running in non-prefork mode there would be 1 process with 20 threads and presumably 20 goroutines or 1 goroutine per thread?

For fasthttp I think there's only 1 goroutine, but in terms of using goroutines in 1 process with 20 threads, a goroutine can choose which CPU to run on (assuming we disregard NUMA). Normally, spawning lots of goroutines creates a lot of memory issues, which is why panjf2000/ants was created to address this problem.

How golang behaves depends on the developer's execution (how he writes the code). With regards to sharedhashfile, this is kind of irrelevant to a non-expert golang coder, whether in goroutine usage or feature/function placement, etc.

> Are the 'cached queries' just cached in memory upon startup in whatever internal data / memory format is convenient to the implementation?

I have no idea. But in prefork mode, the reason it's faster is mostly that there's no context switching and memory management between 20+ CPU threads. I don't want to confuse you further, but it's mostly because prefork golang doesn't have the context-switching memory management between multiple threads in a single process that non-prefork has. The overhead of managing non-prefork mode is much higher than pure single-thread-per-CPU.

> Presumably the prefork mode code starts, loads the cached query data into memory, and then fork()s 20 times? In this way it uses a similar amount of memory to the non-prefork mode because the cached query data is in shared memory due to the fork()ing?

In prefork mode, each piece of cached query data is used only by the thread that created it. It is not shared between the forked processes; there is no memory sharing between processes in prefork mode. Each prefork process has the same fixed memory size, and they can't see each other's memory segments.

That's why I took notice of your sharedhashfile.

> > mem handling between multi threads will be doing a lot of context switches between those threads
>
> Confused by this statement: In non-prefork mode, if the number of goroutines is kept less than the number of CPU cores, surely there wouldn't be any context switches, or?

There still will be. Each goroutine is like spawning a separate thread, and each thread means context switching plus the overhead of maintaining those threads, irrespective of the number of CPU cores.

> Presumably the prefork and non-prefork modes will generate the same amount of garbage for GC -- assuming the same amount of incoming queries -- at run-time?

Keeping the overhead at 1 thread managing a single garbage-collection workload is cheaper than a single process running multithreaded garbage-collection cycles.

Talking about real-world golang programs, garbage will be there, unavoidably. So prefork mode with sharedhashfile can be significantly better.

However, the only overhead will be the CGO call, at around 100ns per call.

I've tested an LRU cache written in C interfaced via cgo: it has 10-20x lower throughput but is more consistent and more predictable in usage than the latencies created by GC.

> So, all other things being similar for processing a query, we might assume that the difference in performance, and the reason the prefork mode code is ~ 5 times faster, is largely due to the higher efficiency of the prefork mode GC at run-time? Why might the prefork GC be more efficient?

Techempower benchmarks show one side, BUT it depends on the skill of the programmer for a real-life program. I'm talking about big programs here, not ones written for benchmarking purposes.

A single-threaded prefork process only needs to manage its own GC: a single GC, because the CPU is pinned to the thread/process. In non-prefork mode, when the CPU is not pinned to any particular thread, one CPU could be GCing another CPU's thread's memory -- cross-NUMA stuff. It's more evident with 20+ core CPUs.

> When Golang performs GC then it needs to loop through every heap allocation whether it's going to be GC'd or not. This means the bigger the heap, the slower the GC. So this is already bad for the non-prefork mode code. Its heap will be ~ 20 times bigger than the heap of a single prefork process? Which means a 20 times longer concurrent GC?

Sort of, I think, but GC is supposed to work in the background, so if you want to think of it that way I guess it's "right".

I think you also have to consider that when experts use prefork mode, they already have the know-how to work around other areas of efficiency... like looking at your sharedhashfile, etc.

> Also, although the Golang GC is largely concurrent to minimize 'stop the world' GC processing tasks, in tests I have found in the past that during concurrent GC then new Golang heap allocations become much slower, and thus, the Golang concurrent GC processing causes regular heap allocations to occur much slower and thus slows down regular code running.

An expert coder will have ways to work around such issues.

> The prefork mode code does exactly the same thing as the non-prefork code but: 1. Presumably the heap for each process is going to be 20 times smaller, meaning the concurrent GC is going to be 20 times faster? And 2. Assuming not all 20 processes GC at the same time, there will be a good chance of a new query being handled by a process not currently slowed down by concurrent GC, and presumably the query will be handled faster?

There are plenty of ways to manage GC and tame it. So depending on what kind of app is being written and how it's run, you can write it in ways that take GC into consideration.

> It would be interesting to experiment further with the prefork mode code to dynamically disable Golang GC altogether while handling queries, and periodically enable GC but disable handling queries. This way, queries will only ever be handled by processes guaranteed not to be subject to concurrent GC, and in theory giving overall better performance? There could also be some kind of IPC mechanism to ensure that GC always happens evenly spread out between the 20 processes?

For IPC, I'm actually waiting for bytedance to open-source their ShmIPC this year.

Can you do a golang binding for your sharedhashfile to test first?

