Changes in configuration have non-deterministic effect #610

Closed
1 of 4 tasks
leopoul opened this issue Sep 20, 2021 · 7 comments · Fixed by #662
Labels
bug (Something isn't working), detection

Comments

@leopoul
Contributor

leopoul commented Sep 20, 2021

Describe the bug
Config changes made to provoke pseudo-hijacks cause BGP updates to keep appearing as hijacks even after the origin AS is restored to the legitimate one. The provided config is an example with prefixes from Google, Cloudflare, Neustar, etc. The bug has been triggered with different combinations of prefixes/ASes, not only the ones used in this report.

Affected Component(s)

  • Back-End (Database, Microservices, Containers, etc)
  • Front-End (Flask, API, etc)
  • Docs
  • Build System

To Reproduce
Steps to reproduce the behavior:

  1. On a clean instance of Artemis, apply the initial configuration shown below. It is expected to cause hijack detection because the ASNs are deliberately mistyped.
    Start Artemis as:
    docker-compose up -d --scale prefixtree=4 --scale database=4 --scale detection=4

Configuration:

# Start of Prefix Definition
prefixes:
  target_prefixes: &target_prefixes
  - 2001:4860:4805::/48
  - 2606:4700:4700::/48
  - 2001:4860::/32
  - 1.1.1.0/24
  - 8.8.8.0/24
  - 64.6.64.0/24

asns:
  origin_asns: &origin_asns
  - 10512
  - 13336
  - 15167
  - 43516

# End of ASN Definitions
monitors:
  riperis: ['']
  bgpstreamkafka:
    host: stream.routeviews.org
    port: 9092
    topic: '^routeviews.*\.bmp_raw'
# End of Monitor Definition
# Start of Rule Definition
rules:

- prefixes:
  - *target_prefixes
  origin_asns:
  - *origin_asns

  2. Let BGP updates flow for some time. Make sure there are hijacks detected for v4 and v6 prefixes for the same AS (see Screenshot 1).
  3. Start fixing the mistyped ASes in the config. Start with AS13336: switch it to 13335 and monitor. It is expected that existing hijacks will be marked as outdated. The config should now be:
# Start of Prefix Definition
prefixes:
  target_prefixes: &target_prefixes
  - 2001:4860:4805::/48
  - 2606:4700:4700::/48
  - 2001:4860::/32
  - 1.1.1.0/24
  - 8.8.8.0/24
  - 64.6.64.0/24

asns:
  origin_asns: &origin_asns
  - 10512
  - 13335
  - 15167
  - 43516

# End of ASN Definitions
monitors:
  riperis: ['']
  bgpstreamkafka:
    host: stream.routeviews.org
    port: 9092
    topic: '^routeviews.*\.bmp_raw'
# End of Monitor Definition
# Start of Rule Definition
rules:

- prefixes:
  - *target_prefixes
  origin_asns:
  - *origin_asns
  4. Check the Hijacks tab. While the initial hijacks for 13336 have been marked as outdated, there are plenty of new updates marked as hijacks for 13335. See Screenshots 2 and 3.
  5. There are also some weird hijack IDs, as if multiple hijack IDs are assigned to a single update. See Screenshot 4.
  6. Wait for some minutes (in my case 10 minutes or so) and watch for BGP updates. Some BGP updates for 1.1.1.0/24 appear as non-hijacks, others appear as hijacks. See Screenshot 5.
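To make the expectation in the steps above concrete, here is a minimal Python sketch of an origin-AS check. It is not Artemis code: the function names and the example AS path (which uses documentation ASNs 64500 and 64501 as transit hops) are assumptions for illustration only. The two ASN sets are the ones from the configs in this report.

```python
# Illustrative only: not Artemis code. All names below are hypothetical.

def origin_asn(as_path):
    """Return the origin AS, i.e. the last AS on the path."""
    return as_path[-1]


def is_origin_hijack(as_path, configured_origins):
    """Flag the update when its origin AS is not in the configured set."""
    return origin_asn(as_path) not in configured_origins


mistyped = {10512, 13336, 15167, 43516}   # initial config from step 1
corrected = {10512, 13335, 15167, 43516}  # config after fixing AS13336

# Assumed AS path for an update covering 1.1.1.0/24, originated by AS13335.
update_path = [64500, 64501, 13335]

flag_before = is_origin_hijack(update_path, mistyped)
flag_after = is_origin_hijack(update_path, corrected)
print(flag_before, flag_after)
```

With a deterministic check like this, every update originated by AS13335 should stop being flagged once the config lists 13335, which is why the mixed results in Screenshot 5 look like stale state rather than a detection rule.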

Attempts to fix

  1. Because the prefixtree microservice seems to be responsible for translating the config into an object that is then attached to each BGP update, try restarting prefixtree to force a state refresh:
docker-compose restart prefixtree
Restarting artemis_prefixtree_1 ... done
Restarting artemis_prefixtree_3 ... done
Restarting artemis_prefixtree_2 ... done
Restarting artemis_prefixtree_4 ... done

No effect.

  2. Bring all microservices down and restart them:
docker-compose down && docker-compose up -d --scale prefixtree=4 --scale database=4 --scale detection=4

Wait for some time and check the Hijacks page. There are still hijacks that are considered ongoing. See Screenshot 6.

Expected behavior
Changes in configuration should be reflected in BGP updates with respect to hijack detection.
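As a rough illustration of that expectation, the sketch below re-evaluates existing hijack records against a changed config and marks as outdated those whose origin AS is now legitimate. The record layout, the `reevaluate` helper and the status strings are assumptions made for this example, not the Artemis data model.

```python
# Illustrative only: record fields and status values are hypothetical.

def reevaluate(hijacks, configured_origins):
    """Return copies of the records, marking now-legitimate origins as outdated."""
    updated = []
    for hijack in hijacks:
        legit = hijack["origin_asn"] in configured_origins
        status = "outdated" if legit else hijack["status"]
        updated.append({**hijack, "status": status})
    return updated


ongoing = [
    {"prefix": "1.1.1.0/24", "origin_asn": 13335, "status": "ongoing"},
    {"prefix": "2606:4700:4700::/48", "origin_asn": 13335, "status": "ongoing"},
    {"prefix": "8.8.8.0/24", "origin_asn": 15169, "status": "ongoing"},
]

# Config after switching 13336 to 13335; 15167 is still mistyped.
after = reevaluate(ongoing, {10512, 13335, 15167, 43516})
statuses = [h["status"] for h in after]
print(statuses)
```

Both the v4 and the v6 record for AS13335 change state together here, which is the behavior the report expects after the config edit.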

Screenshots
image
Screenshot 1.

image
Screenshot 2.

image
Screenshot 3.

image
Screenshot 4.

image
Screenshot 5.

image
Screenshot 6.

System (please complete the following information):

  • OS: [CentOS Linux release 8.3.2011]
  • Browser [N/A happens with all browsers]
  • Version [latest at the moment of filing the issue]


@vkotronis
Member

vkotronis commented Sep 23, 2021

@leopoul thanks for reporting this! Very detailed information! Could you also try the same without any scaling parallelism,
i.e., docker-compose down && docker-compose up -d?
I suspect that we are missing something when using parallel instances, but first let's verify this. You are correct that we tie the prefix tree rule info to the message being passed down to the detectors. I will also replicate locally without any parallelism to determine the cause of the non-determinism you observe. Will update the thread here.
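For readers following along, tying rule info to a message can be pictured as a longest-prefix match followed by annotating the update. The sketch below is an assumption-laden illustration using Python's standard `ipaddress` module; `build_rules`, `annotate` and the message fields are hypothetical names, not the actual prefixtree code.

```python
import ipaddress

# Illustrative only: not the Artemis prefixtree implementation.

def build_rules(prefixes, origins):
    """Map each configured prefix to the set of legitimate origin ASNs."""
    return {ipaddress.ip_network(p): frozenset(origins) for p in prefixes}


def annotate(update, rules):
    """Attach the most specific covering rule (or None) to the update."""
    net = ipaddress.ip_network(update["prefix"])
    # Compare only within the same IP version; subnet_of raises otherwise.
    covering = [r for r in rules if r.version == net.version and net.subnet_of(r)]
    if not covering:
        return {**update, "rule": None}
    best = max(covering, key=lambda r: r.prefixlen)
    return {**update, "rule": {"prefix": str(best), "origin_asns": rules[best]}}


rules = build_rules(
    ["2001:4860:4805::/48", "2001:4860::/32", "1.1.1.0/24"],
    [10512, 13335, 15167, 43516],
)

specific = annotate({"prefix": "2001:4860:4805::/48"}, rules)
covered = annotate({"prefix": "2001:4860:1::/48"}, rules)
unmatched = annotate({"prefix": "192.0.2.0/24"}, rules)
print(specific["rule"]["prefix"], covered["rule"]["prefix"], unmatched["rule"])
```

Because the rule travels with the message in a design like this, a detector working from a rule built before the config change would keep flagging the update, whatever the current config says.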

@leopoul
Contributor Author

leopoul commented Sep 25, 2021

@vkotronis my pleasure!

Issue happened again without scaling. Steps:

  • Removed postgres and mongo folders to have a clean env.
  • I made sure my initial config that causes pseudo-hijacks is in local_configs/backend/config.yml.
  • Started without scaling, same steps as above.
  • I let it run for 10-15 minutes making sure there are hijacks detected for v4 and v6 for the same AS, in this case 13335.

I observe the same issue. After editing the config to switch from 13336 to 13335, I see that the v6 prefix hijack is considered outdated, yet the v4 one still persists. After the change I let it run for 10 minutes or so, making sure I received BGP updates for both v4 and v6 for the same origin AS. As I ran the above steps multiple times, I observed that in some cases the v4 prefix was considered outdated and the v6 one was not; it was random (see last screenshot).
I don't know if this is related to the fact that there are 2 (or potentially more) prefixes from the same origin AS which are affected and the prefix tree is refreshed only for one.

Before change:
image

After change:
Hijacks:
image

Hijacks during one of my tests showing v4 considered outdated and v6 considered as hijacked:
image

I will let it run for some hours and will report back if I see any changes.

@leopoul
Contributor Author

leopoul commented Sep 26, 2021

No changes 36 hours after modifying the config.
image

@vkotronis
Member

@leopoul I have not forgotten about this issue, just checking ways to replicate it since it seems to be non-deterministic.

@vkotronis added the bug and detection labels on Oct 21, 2021
@vkotronis
Member

Might be correlated with #611.

@leopoul
Contributor Author

leopoul commented Oct 21, 2021

Could be related, yes. I can try only v4 and then only v6.

@leopoul
Contributor Author

leopoul commented Oct 12, 2022

I did some more digging into this and submitted a fix: #662. The issue hits whenever there is a config change while Artemis is running. It does not occur if one shuts down all containers and starts again. It looks as if prefixtree remains stale for either v4 or v6 after changes. With this in mind here is what I found:

I have done extensive testing with a large number of prefixes and various prefixtree scaling settings (1,2,4). In all cases the issue I had initially reported does not occur anymore.
An alternative solution to what I have submitted would be to run a loop over v4 and v6 and force the upgrade of both here: https://github.com/FORTH-ICS-INSPIRE/artemis/blob/master/backend-services/prefixtree/core/prefixtree.py#L669. I can try it as well and submit a PR for this approach; let me know if you want to go that route.
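The loop-over-both-families idea can be sketched roughly as follows. This is a hypothetical illustration only: it is neither the code at the linked line nor the fix in #662, and `FAMILIES`, `family_of`, `build_family` and `refresh` are names invented for this example. It assumes state is kept per IP version, and shows how refreshing just one family leaves the other stale.

```python
import ipaddress

# Hypothetical sketch, not the Artemis prefixtree implementation.
FAMILIES = ("v4", "v6")


def family_of(prefix):
    """Classify a prefix string as 'v4' or 'v6'."""
    return "v4" if ipaddress.ip_network(prefix).version == 4 else "v6"


def build_family(family, prefixes, origins):
    """Build the prefix -> origin ASNs mapping for one address family."""
    return {p: set(origins) for p in prefixes if family_of(p) == family}


def refresh(state, prefixes, origins, families=FAMILIES):
    """Rebuild the given families; by default both v4 and v6 are rebuilt."""
    for family in families:
        state[family] = build_family(family, prefixes, origins)
    return state


prefixes = ["2606:4700:4700::/48", "1.1.1.0/24"]
state = refresh({}, prefixes, [13336])  # initial, mistyped config

# Partial refresh: only v6 picks up the corrected ASN, v4 stays stale.
refresh(state, prefixes, [13335], families=("v6",))
stale_v4 = state["v4"]["1.1.1.0/24"]
fresh_v6 = state["v6"]["2606:4700:4700::/48"]

# Looping over both families brings v4 and v6 back in sync.
refresh(state, prefixes, [13335])
consistent = all(13335 in o for fam in FAMILIES for o in state[fam].values())
print(stale_v4, fresh_v6, consistent)
```

The partial refresh reproduces the symptom from the earlier comments, where one of v4/v6 is marked outdated while the other keeps being reported as a hijack.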
