-
Notifications
You must be signed in to change notification settings - Fork 168
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
new(proposal): disable support for syscall enter events #2068
base: master
Are you sure you want to change the base?
new(proposal): disable support for syscall enter events #2068
Conversation
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: Andreagit97 The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
80e1377
to
de342d4
Compare
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #2068 +/- ##
=======================================
Coverage 75.08% 75.08%
=======================================
Files 274 274
Lines 34290 34290
Branches 5926 5926
=======================================
Hits 25748 25748
Misses 8542 8542
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
Perf diff from master - unit tests
Heap diff from master - unit tests
Heap diff from master - scap file
Benchmarks diff from master
|
any feedback on this? I continuously see many users with a lot of drops... |
@falcosecurity/libs-maintainers I will try to create a branch ASAP to show you some concrete code on this |
I would love to join you for this enormous task 🚀 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not 100% sold on the idea. As I mention on another comment, this proposal is basically saying that we should rewrite all drivers and parsing logic. Even if that is not the case, it will take tons of changes in them which can cause bugs, regressions and what-have-yous. The proposal states that we will be sending half the data to userspace in some parts, but if all data is moved to exit events, that is not actually true, we will send half the events that will be almost twice as big, meaning we will be saving the length of the enter event header, maybe a bit more if we find redundant data on some events.
I also don't see the need to move ALL syscalls to use only exit events, I would argue most of the performance impact surely comes from syscalls like read/open/send/recv and their relatives, since they are the most frequently triggered (the syscall fingerprint even shows these are triggered at least twice as much as other syscalls), why not try changing only some of those and see if that is good enough?
IDK, it all seems like a huge amount of efforts, a big risk and I don't see how the enormous gains the proposal is being sold on will manifest themselves. I'd be happy to be proven wrong though.
* Update our three drivers to collect all parameters in the exit fillers and remove the enter ones. | ||
* Update scap-file management to convert all enter events generated over the years into new exit events. | ||
* Update our userspace engines to support only exit events. | ||
* Update the sinsp parsing and extraction logic to use only exit events. | ||
* Update all tests (drivers, libscap, unit, e2e) to use only exit events. | ||
* Update all documentation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a lot of work, and I mean A LOT of work. From where I stand, this is basically a rewrite of the parsing logic of sinsp and all 3 drivers. This is a ton of risk and I'm not convinced moving everything to exit events will have the huge impact you are expecting it to have. In my mind, for me to sign off on this, you'd need to have CPU and data savings of way above 50%, otherwise it's too much risk of having regressions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, I can fully understand you. This is a lot of work.
The point of this proposal is that today we don't scale on machines with more than 96 CPUs (an example here https://gist.github.com/stevenbrz/69391aa71b22d205b6add88ae10fb905#file-gistfile1-txt-L126-L166) and we need to find a solution. What is proposed here is not the final solution but at least I see it as a first step toward it. IMO Going from 10 Mevt/s to 5 Mevt/s is already something
you'd need to have CPU and data savings of way above 50%
Yes, we probably won't reach something like this but we will process half of the event in userspace and this sounds like already something good. Unfortunately having precise data before doing the change would be very complicated...
@Molter73
I know andre has more data to back his proposal for sure :) |
Fair, but the amount of data and fields being parsed would still be almost the same, right? What I mean is, you would only skip the enter header, then go on to parse the same number of fields effectively, you'd just be doing that in one go, so the total processing done in userspace would be the same.
You are very optimistic about the amount of gains we'll get, I don't think it's gonna get even close to that, but again, I'd be happy to be proven wrong.
I don't see what this has to do with moving all fields to the exit side. Why couldn't we do this with the existing model?
These 3 are fair points, though they still require a massive rewrite that I'm not fully OK with. If you guys can show me how we would get 30% performance improvement from just shuffling things around, I'll gladly eat my words, for now, all I see is a massive rewrite that will be a pain to go through and put us in a position for a bunch of new bugs to pop up. |
Uhm more or less. Today for the most frequent events we store the whole event in memory and we reuse it in the exit. libs/userspace/libsinsp/parsers.cpp Line 133 in e1999d0
This is not only a cost (in scenarios where we have 4 Million open/s we save 2 Millions of OPEN_E per second, same for close, socket, etc...) but it's also a useless complication and a risk... sometimes if we drop the enter event we are not able to correctly process the exit one and so we end up doing a sort of approximation without any reason. Why should the information be shared between enter and exit events?
Moreover, we need to consider that there is not only the parsing phase, after the parsing the event is sent to all the filter checks and we need to evaluate all the filter checks on events that don't carry a real value...
Well, that's true we have no data to discuss about this and unfortunately, it's very difficult to obtain them without implementing the solution...
This would require generating new event pairs for almost all the events that have at least one between
Well, I see your point, really. Doing this work just for one event I've already spotted some inconsistencies/bugs, so it's true that we will introduce for sure some new bugs but today we already have many of them hidden somewhere. So at the end of the work, I'm not sure what would be the balance of introduced/fixed bugs... BTW in any case I would proceed with a separate branch, doing PR after PR would be very difficult and dangerous as you correctly highlighted. Apart from the huge amount of changes I don't see many other downsides
But yes I see how these changes can be scary... |
Just because we talked about byte saving i would add a bit of information. This is not strictly related to this initiative but if we decide to go in this way we could think about it... Today event schema /*
* SCAP EVENT FORMAT:
*
* +------------------------------------------+
* | Header | u16 | u16 | param1 | param2 |
* +------------------------------------------+
* ^ ^ ^
* | | |
* ppm_hdr | |
* lengths_arr--+ |
* raw_params---------------------------------+
*/
// 26 bytes
struct ppm_evt_hdr {
uint64_t ts; /* timestamp, in nanoseconds from epoch */
uint64_t tid; /* the tid of the thread that generated this event */
uint32_t len; /* the event len, including the header */
uint16_t type; /* the event type */
uint32_t nparams; /* the number of parameters of the event */
}; Why don't we do something like this? /* +----------+
* | Header |
* +----------+
*/
typedef uint16_t evt_hdr;
/* Concrete example */
[PPME_SYSCALL_CLOSE_X] = {"close",
EC_IO_OTHER | EC_SYSCALL,
EF_DESTROYS_FD | EF_USES_FD | EF_MODIFIES_STATE,
2,
{{"res", PT_ERRNO, PF_DEC},
{{"fd", PT_FD, PF_DEC}}},
/* +--------------------------------------------+
* | TYPE_CLOSE_X | len | tid | ts | res | fd |
* +--------------------------------------------+
*
* - Total len: 16 + 16 + 32 + 64 + 32 + 32 -> (we save 4 extra bytes because we don't need the len of `res` and `fd`). Note that this is called `2/3 Million/s`)
*/
/* +----------------------------------------------------------------------------------------------+
* | TYPE_OPEN_X | len | tid | ts | res | fd | flags | mode | dev | ino | len_name (u16) | name |
* +----------------------------------------------------------------------------------------------+
*
* - Total len: 16 + 16 + 32 + 64 + 32 + 32 + 32 + 32 + 32 + 64 + 16 + strlen(name) -> we save 12 extra bytes
*
*/
/* +----------------------------------------------------------------------------------------------+
* | TYPE_MMAP_X | len | tid | ts | res | ... |
* +----------------------------------------------------------------------------------------------+
*
* - This design would allow us to handle special cases, for example mmap has a return value on 64 bit and we can easily handle it because sinsp will know it when it will parse an event `TYPE_MMAP_X`
*
*/
Just to give a complete full example of a possible saving. Consider the PPME_SYSCALL_OPENAT_2_X event: [PPME_SYSCALL_OPENAT_2_X] = {"openat",
EC_FILE | EC_SYSCALL,
EF_CREATES_FD | EF_MODIFIES_STATE,
7,
{{"fd", PT_FD, PF_DEC},
{"dirfd", PT_FD, PF_DEC},
{"name", PT_FSRELPATH, PF_NA, DIRFD_PARAM(1)},
{"flags", PT_FLAGS32, PF_HEX, file_flags},
{"mode", PT_UINT32, PF_OCT},
{"dev", PT_UINT32, PF_HEX},
{"ino", PT_UINT64, PF_DEC}}}, Today: 26 (header) + 14 (len array) + 8 (fd on 8 bit) + 8 + X (for the name) + 4 + 4 + 4 + 8 = 76 Bytes + X With the proposed approach: 16 (type+len+tid+ts) + 4 (fd on 4 bytes) + 4 (dirfd on 4 bytes) + 4 + 4 + 4 + 8 + 2 (only the name field needs the len) + X = 46 bytes + X We save 30 bytes! |
Taking a step back: okay, we can't handle $MILLIONS evt/sec right now, but why exactly? Is the bottleneck in scap_next (and dropping the enter events will help a lot) or is it in the sinsp parsers (and dropping half the events will help only a little bit, as we save on dispatch but the data fields we process are basically the same)? If you happen to have a 96 core machine to play with, it might be interesting to see how fast we can get with just scap (e.g. scap-open with any sort of printing disabled) vs sinsp (sinsp-example, again, with printing disabled to keep the playing field level). My expectation is that raw scap will scale to an absurd extent (no reason it wouldn't be faster than memcpy, it's basically an atomic fetch and add per event batch) and it's sinsp that's slow, but I'm ready to be surprised. (if you do run this test, please gather perf profiles and make some pretty flame graphs :)) Note that we have a mostly random sleep inside (kmod's) scap_next, so on less than full load the perf won't be ideal. @Andreagit97 I like the proposed new layout a lot, though I'm mildly worried about having a variable-length res field. I'd sleep minimally better if it was just native register size (8 bytes these days) unconditionally. |
Yes, I know about the storing logic, that's not ideal but it's basically a memcpy, I'm sure playing around with allocators could give better performance (arena allocators, use something like tcmalloc, etc..), again, this doesn't need a large refactoring. I do, on the other hand, strongly agree that we shouldn't drop entire events when we drop the enter side, this is a way stronger selling point for the proposal than performance IMO.
I'm not super familiar with filterchecks, but I would argue this is a mistake on the filtercheck logic which should ignore events that are not useful, not on the driver and parsing logic.
I recently read an interesting article on how new code is way more likely to have bugs and vulnerabilities than old code, so if you've already spotted bugs on one syscall, I think we can assume we are gonna come out the other end with a negative balance. I'm not trying to completely push back here, but we do need to be aware of the risk of doing such a large refactoring.
Just a suggestion on the dev workflow here, I wouldn't use a single branch with all the changes, that will be a nightmare to review. I suggest you have a feature branch and open PRs against that branch with small changes (maybe one or two syscalls per PR?), that way the changes can be thoroughly reviewed and the final merge to master should be less painful for everyone.
How do you handle driver <-> scap compatibility like this? If the schema changes in any way, you will no longer be able to parse an event from an older driver, right? If a field changes from 32 to 64 bits, how would scap know to only read 32 bits when the older driver is being used? I agree with everything in @gnosek's comment BTW, running the tests he proposes would be very interesting. |
I took some data on 80 CPUs in the past for a completely different reason (ringbuffer scraping) but I think they still report something relevant. https://docs.google.com/document/d/1CLIP4cdAGa6MQnMffQKeVT7Z18rzAyX6XkLZ-HFns1E/edit
TL;DR; the solution proposed here is not a definitive solution, but it should allow us to gain some time while we come up with a definitive solution (multi-threading, kernel filtering, whatever...). Moreover, apart from halving the number of events we send to userspace, this solution brings some important optimization
To be honest, aside from the major refactor, I see only possible improvements... |
that makes sense but instead of using a common [PPME_SYSCALL_OPEN_X] = {"open", ..., {{"res", PT_FD, PF_DEC},
[PPME_SYSCALL_MMAP_X] = {"mmap", ..., {{"res", PT_POINTER, PF_HEX}, And so on, instead of a generic |
What I'm concerned about is having to look up the event schema before we even know if the event is successful or not. Maybe it's not a real issue, idk 🤷 |
That's a good point. |
If we see this is an issue I agree with you that we should fall back to something standard like always 64 bits |
Alternatively, we can have a global (per sinsp) flag to drop the enter events and introduce support for it gradually (merging each step to master, with passing tests for both scenarios). This won't require a long-lived side branch that will be a pain to maintain before it's merged (ask me how I know). In the end, we can remove the flag (permanently dropping enter events) or leave it as is if we want. |
Since the discussion is becoming interesting, I'm joining in with my two cents 😸 I've been talking with Andrea about this proposal for a while, and even though I tend to be conservative, I think there's an opportunity for a substantial improvement here. I guess the core point here is the issue of drops. We have been evaluating many solutions for years, and honestly, anything we have approached has just mitigated the situation. Still, we have never seen a real improvement. The root cause is likely the legacy design, which we have never tried to change radically. This proposal favors structural changes that may result in a substantial improvement. So, I'm not against it unless the numbers show something different. Regarding bugs, I'm not so worried since we are already in a situation where it's tough to discover and fix them (primarily due to the legacy design). We have to deal with bugs anyway, so this should not be a blocker. Somewhat the new solution should reduce the moving parts a bit, which might even result in easier maintenance and fewer bugs in the long term. That said, I would like to give more reasons once we have a PoC. Now that we don't plan to introduce significant new features, I would definitely avoid stagnation. |
👍 |
That could work too, I just don't want a 50k+ lines changed PR to suddenly show up in the repo 🙂 |
In the last weeks, I did some research for this workstream. Here is the branch (https://github.com/Andreagit97/libs/pull/new/remove_sys_enter) I used to have an idea of the possible changes that will be required, kernel side + userspace side. This branch will never see the light but it was useful to me to evaluate possible approaches. The best idea I can come up with is to proceed with incremental PRs against libs master:
static std::unordered_map<ppm_event_code, conversion_info> g_conversion_table = {
{PPME_SYSCALL_OPEN_E, {.instr = {{C_ACTION_SKIP}}}},
{PPME_SYSCALL_OPEN_X,
{.desired_type = PPME_SYSCALL_OPEN,
.valid_param_nums = {4, 6},
.instr = {{C_FROM_OLD_EVENT | C_MOD_TO_32, 0},
{C_FROM_OLD_EVENT, 1},
{C_FROM_OLD_EVENT, 2},
{C_FROM_OLD_EVENT, 3},
{C_FROM_OLD_EVENT, 4},
{C_FROM_OLD_EVENT, 5}}}},
...
} Of course, each converter should be paired with its own unit tests. You can see some examples in the above branch. Of course, if we realize that this approach is too complex or too unstable we can switch to a feature branch and work on that with PRs. We leave the master in a stable state and we continue the work in the branch. Please let me know what you think about this approach and if you have alternative ideas/suggestions! Side note n1: I'm happy we already spotted some bugs/inconsistency looking a little bit deeper inside filterchecks/parsers + there was also room for some small cleanups! See:
Side note n2: The schema rework we talked about here #2068 (comment) is not part of this initiative. It could be a follow-up but it's not strictly related. |
Agree with the proposed method of incrementally adding new events to the master branch, but I'm not 100% sure on the "all or nothing" approach to switching handling. Think it'd be worth it to make it possible to change the implementation being used by using some sort of configuration flag (for those syscalls that have it, obviously), since that would allow progressive testing as new implementations are added. Maybe the implementation can happen on a separate event table and we can choose between the 2? Or, if kept ordered at the end of the existing table, we could just use an offset? I don't want to block on this, but I would like to see some thought put into it. |
Do you mean when we will need to switch drivers to the new approach?
It would make sense, but if I understood correctly, having both implementations in parallel on the kernel side wouldn't be manageable (not sure, I'm just guessing). In the end, a feature branch is a good fallback solution if we reach a point where things become too cumbersome to manage. That said, I'm okay with both approaches. I guess we are still in a too-early stage, and we will likely find which fits better later. |
Once we have sinsp working with both flavors maybe we could rework the drivers in a feature branch so that every consumer can test the new logic from the feature branch without touching the drivers in libs. In this way, we can do the final switch only when we are really sure that everything is working. WDYT? We could also do everything in the feature branch, so also the sinsp changes, but in my opinion, if possible, it would be better to use master to:
If we see that this becomes a nightmare of instability we should probably go with a feature branch only. What i don't love of partial changes in the drivers is:
|
After another iteration, I came up with something that probably is even more conservative. The idea is to work only in libs/master without forks, using incremental steps. Each PR will contain the following:
/////////////////////////// FROM
[PPME_SYSCALL_READ_E] = {"read",
EC_IO_READ | EC_SYSCALL,
EF_USES_FD | EF_READS_FROM_FD,
2,
{{"fd", PT_FD, PF_DEC}, {"size", PT_UINT32, PF_DEC}}},
[PPME_SYSCALL_READ_X] = {"read",
EC_IO_READ | EC_SYSCALL,
EF_USES_FD | EF_READS_FROM_FD,
2,
{{"res", PT_ERRNO, PF_DEC},
{"data", PT_BYTEBUF, PF_NA}}},
/////////////////////////// TO
[PPME_SYSCALL_READ_E] = {"read",
EC_IO_READ | EC_SYSCALL,
EF_USES_FD | EF_READS_FROM_FD,
2,
{{"fd", PT_FD, PF_DEC}, {"size", PT_UINT32, PF_DEC}}},
[PPME_SYSCALL_READ_X] = {"read",
EC_IO_READ | EC_SYSCALL,
EF_USES_FD | EF_READS_FROM_FD,
4,
{{"res", PT_ERRNO, PF_DEC},
{"data", PT_BYTEBUF, PF_NA},
{"fd", PT_FD, PF_DEC},
{"size", PT_UINT32, PF_DEC}}},
At the end of the work, we should have libs master that works in 2 modes:
Hope this could be a good trade-off between our previous ideas. Side note: this workstream won't include any event-schema refactor so no new event types and no new event header. I'm not a fan of this but I can see how this could slow down the whole initiative. Note: Using exit + enter will have a slightly greater cost than today because the exit event will send more parameters kernel side. |
This work could cause some redefinitions of fillers (you will see in the next PR) that's the reason why I raised this #1283 (comment).
the bump of |
As I wrote in the comment on the other issue, I'm against this (unless there's some aspect I'm not fully understanding). |
0a14985
to
9786842
Compare
Signed-off-by: Andrea Terzolo <[email protected]>
9786842
to
f9c58c4
Compare
Just updated the proposal with the latest decisions/updates in this thread |
What type of PR is this?
/kind design
Any specific area of the project related to this PR?
/area proposals
Does this PR require a change in the driver versions?
No
What this PR does / why we need it:
This document proposes disabling support for syscall enter events in our codebase. The primary reason behind this proposal is to reduce the throughput of events that our userspace needs to process. As the number of syscalls increases, we are no longer able to support them, and we start dropping events. See the full proposal for more details!
I'm curious to hear some feedback on this, we talked many times about it and maybe now is the right time to tackle it!
Some related discussions:
base_syscalls.exclude_enter_exit_set
config falco#2960Which issue(s) this PR fixes:
Special notes for your reviewer:
Does this PR introduce a user-facing change?: