[nftables] remediation component shuts down after a failed response #369
Comments
@LaurenceJJones: Thanks for opening an issue; it is currently awaiting triage.
@LaurenceJJones: There is no 'kind' label on this issue. A 'kind' label is needed to start the triage process.
Hello! We've run into this problem too. Is there any update on it?
UPDATE: this only happens when the bouncer is restarted. If the API stops responding while the bouncer is already running, the bouncer keeps retrying for new decisions and continues to work. One more question: why does the bouncer reset the nftables set on restart?
Yes, this is the current design: if the remediation component can't make its initial connection, that may indicate a bad configuration.
We remove the set because the initial load takes roughly ten times longer if we have to check whether each element already exists. So, to be more efficient, we remove the set and then reinstate it on restart.
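For illustration, a minimal sketch of that flush-and-reload approach using the `nft` CLI (the bouncer drives nftables differently in practice, and the table/set names here are invented): the whole reload is submitted as one `nft -f -` batch, so no per-element existence checks are needed.

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// reloadSet replaces the contents of a hypothetical "blocklist" set in
// one batch: flush, then re-add every element in a single nft call.
func reloadSet(ips []string) error {
	if len(ips) == 0 {
		return nil // an empty "{ }" element list would be a syntax error
	}
	var b strings.Builder
	b.WriteString("flush set inet crowdsec blocklist\n")
	fmt.Fprintf(&b, "add element inet crowdsec blocklist { %s }\n",
		strings.Join(ips, ", "))

	cmd := exec.Command("nft", "-f", "-")
	cmd.Stdin = strings.NewReader(b.String())
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("nft batch failed: %w: %s", err, out)
	}
	return nil
}

func main() {
	if err := reloadSet([]string{"192.0.2.1", "198.51.100.7"}); err != nil {
		fmt.Println(err)
	}
}
```

A side benefit of batching is that `nft -f` applies the file as one transaction, so the set is never observed half-populated during the reload itself.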
But if the host is under attack, clearing the nftables set can negatively affect the server. It is also not entirely clear: if the bouncer clears the nftables set, why does it still pull all decisions, including outdated ones?
Yes, but this should only happen if you restart the service while under attack, and the service should normally be running long-term unless there is a reason not to. As for the decisions: that is most likely just the way crowdsec sends them; bouncers have no direct influence on what they get sent unless the query is filtered. There is no impact on performance, you just see an unnecessary log line, that's all.
If the host is under attack, it is possible that free memory runs out and the OOM killer terminates the bouncer; on restart, the bouncer then clears the table, provoking even more load on the server. I think it would be reasonable to add an option that compares the data received from the API with the existing set instead of clearing the table on restart.
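(For context, the compare-instead-of-clear option proposed here could be sketched as a simple set diff; all names below are illustrative, not bouncer code.)

```go
package main

import "fmt"

// diffSets computes which elements must be added to or removed from the
// live nftables set so that it ends up matching the LAPI decisions,
// without ever flushing the whole set.
func diffSets(current, desired map[string]bool) (toAdd, toRemove []string) {
	for ip := range desired {
		if !current[ip] {
			toAdd = append(toAdd, ip)
		}
	}
	for ip := range current {
		if !desired[ip] {
			toRemove = append(toRemove, ip)
		}
	}
	return toAdd, toRemove
}

func main() {
	current := map[string]bool{"192.0.2.1": true, "192.0.2.2": true}
	desired := map[string]bool{"192.0.2.2": true, "192.0.2.3": true}
	add, del := diffSets(current, desired)
	fmt.Println("add:", add, "remove:", del) // add: [192.0.2.3] remove: [192.0.2.1]
}
```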
A memory spike does not equal a memory leak; it just means the API is handling the requests, and because it holds decisions in memory while it queries, usage will spike. We have a feature flag for streamed decisions that may help: https://docs.crowdsec.net/docs/next/configuration/feature_flags#list-of-available-feature-flags. If you can capture the memory behaviour via pprof, we will look into it: https://docs.crowdsec.net/docs/next/observability/pprof. I understand the OOM part, and we can improve this in the future, but we currently have no resources to look at it, so contributions are welcome.
/kind enhancement
Should I enable this flag on the API server? And correct me if I'm wrong: does this feature send decisions in batches?
Exactly: instead of loading all decisions into memory, it fetches X decisions, writes them to the stream, then fetches the next batch and writes that, and so on. It may become the standard in upcoming releases; it is currently behind a feature flag because we wanted to ensure stability, but a large enterprise has been using it in production for over two minor releases with no issues reported on their side.
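Sketched in Go, that pattern looks roughly like the loop below; `fetchPage` and `writeToStream` are hypothetical stand-ins, not crowdsec's actual API.

```go
package main

// Decision is a stand-in for a crowdsec decision record.
type Decision struct{ IP string }

// streamDecisions pages through decisions in fixed-size batches, so the
// full result set is never held in memory at once.
func streamDecisions(
	fetchPage func(offset, limit int) ([]Decision, error),
	writeToStream func([]Decision) error,
) error {
	const batchSize = 1000
	for offset := 0; ; offset += batchSize {
		page, err := fetchPage(offset, batchSize)
		if err != nil {
			return err
		}
		if len(page) == 0 {
			return nil // drained: every batch has been streamed
		}
		if err := writeToStream(page); err != nil {
			return err
		}
	}
}

func main() {}
```

Peak memory is then bounded by the batch size rather than by the total number of decisions.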
And if I use MySQL as the database server, will it work there too?
Yes, it works for all databases.
What happened?
When the remediation component fails to connect to the LAPI (currently observed with nftables), the whole service shuts down and flushes the nftables set.
This is not what we want, as the IPs currently within the set are still useful to the service.
What did you expect to happen?
The remediation component should tolerate failures to connect to the LAPI after the service has started. E.g. connect first; if that fails at startup, then yes, restart, but after that it should be resilient.
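A rough sketch of that expectation, with a hypothetical `pullDecisions` standing in for the real LAPI client: failures after startup are retried with backoff instead of terminating the process.

```go
package main

import (
	"errors"
	"log"
	"time"
)

// run keeps pulling decisions forever; LAPI errors are logged and
// retried with exponential backoff rather than shutting the service
// down (and flushing the set with it).
func run(pullDecisions func() error) {
	backoff := time.Second
	for {
		if err := pullDecisions(); err != nil {
			log.Printf("LAPI error, retrying in %s: %v", backoff, err)
			time.Sleep(backoff)
			if backoff < time.Minute {
				backoff *= 2
			}
			continue
		}
		backoff = time.Second        // healthy again: reset the backoff
		time.Sleep(10 * time.Second) // normal polling interval
	}
}

func main() {
	// Stub that always fails, to show the loop surviving bad responses.
	run(func() error { return errors.New("lapi returned 502") })
}
```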
How can we reproduce it (as minimally and precisely as possible)?
Bring up a LAPI and the firewall remediation component; a user has reported that if the response code is > 500, the service shuts down.
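One hypothetical way to simulate that failing LAPI is a stub server that always answers with a 5xx status, with the bouncer's API URL pointed at it:

```go
package main

import (
	"log"
	"net/http"
)

func main() {
	// Every request gets a 502, simulating a broken LAPI.
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		http.Error(w, "simulated LAPI failure", http.StatusBadGateway)
	})
	log.Fatal(http.ListenAndServe("127.0.0.1:8080", nil))
}
```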
Anything else we need to know?
No response
version
remediation component version:
crowdsec version
crowdsec version:
OS version