Changes in configuration have non-deterministic effect #610

leopoul · 2021-09-20T09:04:36Z

Describe the bug
Changing the configuration to deliberately cause pseudo-hijacks and then restoring the legitimate origin AS still leaves BGP updates flagged as hijacks. The provided config is an example with prefixes from Google, Cloudflare, Neustar, etc. The bug has been triggered with different combinations of prefixes/ASes, not only the ones used in this report.

Affected Component(s)

Back-End (Database, Microservices, Containers, etc)
Front-End (Flask, API, etc)
Docs
Build System

To Reproduce
Steps to reproduce the behavior:

On a clean instance of Artemis, apply the initial configuration shown below. It is expected to cause hijack detection because the ASNs are deliberately mistyped.
Start Artemis as:
docker-compose up -d --scale prefixtree=4 --scale database=4 --scale detection=4

Configuration:

# Start of Prefix Definition
prefixes:
  target_prefixes: &target_prefixes
  - 2001:4860:4805::/48
  - 2606:4700:4700::/48
  - 2001:4860::/32
  - 1.1.1.0/24
  - 8.8.8.0/24
  - 64.6.64.0/24

asns:
  origin_asns: &origin_asns
  - 10512
  - 13336
  - 15167
  - 43516

# End of ASN Definitions
monitors:
  riperis: ['']
  bgpstreamkafka:
    host: stream.routeviews.org
    port: 9092
    topic: '^routeviews.*\.bmp_raw'
# End of Monitor Definition
# Start of Rule Definition
rules:

- prefixes:
  - *target_prefixes
  origin_asns:
  - *origin_asns

Let BGP updates flow for some time. Make sure there are hijacks detected for v4 and v6 prefixes for the same AS - see Screenshot 1.
Start fixing the mistyped ASNs in the config. Start with AS13336: switch it to 13335 and monitor. It is expected that existing hijacks will be marked as outdated. The config should now be:

# Start of Prefix Definition
prefixes:
  target_prefixes: &target_prefixes
  - 2001:4860:4805::/48
  - 2606:4700:4700::/48
  - 2001:4860::/32
  - 1.1.1.0/24
  - 8.8.8.0/24
  - 64.6.64.0/24

asns:
  origin_asns: &origin_asns
  - 10512
  - 13335
  - 15167
  - 43516

# End of ASN Definitions
monitors:
  riperis: ['']
  bgpstreamkafka:
    host: stream.routeviews.org
    port: 9092
    topic: '^routeviews.*\.bmp_raw'
# End of Monitor Definition
# Start of Rule Definition
rules:

- prefixes:
  - *target_prefixes
  origin_asns:
  - *origin_asns

Check the Hijacks tab. While the initial hijacks for 13336 have been marked as outdated, there are plenty of new updates marked as hijacks for 13335. See Screenshots 2 and 3.
There are also some weird hijack IDs, as if multiple hijack IDs are assigned to a single update. See Screenshot 4.
Wait for some minutes (in my case 10 minutes or so) and watch for BGP updates. Some BGP updates for 1.1.1.0/24 appear as non-hijacks, others appear as hijacks. See Screenshot 5.

Attempts to fix

Because prefixtree microservice seems to be responsible for translating the config into an object that is then attached to each BGP update, try to restart prefixtree so that a refresh of state is forced:

docker-compose restart prefixtree
Restarting artemis_prefixtree_1 ... done
Restarting artemis_prefixtree_3 ... done
Restarting artemis_prefixtree_2 ... done
Restarting artemis_prefixtree_4 ... done

No effect.

Bring all microservices down and restart them:

docker-compose down && docker-compose up -d --scale prefixtree=4 --scale database=4 --scale detection=4

Wait for some time and check hijacks page. There are still hijacks which are considered as ongoing. See Screenshot 6.

Expected behavior
Changes in configuration should be reflected in BGP updates wrt hijack detection.
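
To make the expectation concrete, here is a minimal sketch of the origin check this report assumes. It models only an origin-AS comparison against the configured prefixes, it is not Artemis's detection logic, and the function name `is_origin_mismatch` is hypothetical.

```python
import ipaddress


def is_origin_mismatch(prefix, origin_asn, configured_prefixes, configured_origins):
    """Return True when the announced prefix is covered by a configured
    prefix but its origin ASN is not in the configured origin list.

    Hypothetical helper for illustration only (not part of Artemis).
    """
    net = ipaddress.ip_network(prefix)
    covered = any(
        # Compare versions first: subnet_of() raises TypeError on a v4/v6 mix.
        net.version == cfg.version and net.subnet_of(cfg)
        for cfg in map(ipaddress.ip_network, configured_prefixes)
    )
    return covered and origin_asn not in configured_origins


prefixes = ["2606:4700:4700::/48", "1.1.1.0/24"]
origins_before_fix = {10512, 13336, 15167, 43516}  # 13336 is the deliberate typo
origins_after_fix = {10512, 13335, 15167, 43516}

# With the mistyped config, updates originated by 13335 look like hijacks.
flagged_before = is_origin_mismatch("1.1.1.0/24", 13335, prefixes, origins_before_fix)
# After the fix, the same updates should no longer be flagged, for v4 and v6 alike.
flagged_after_v4 = is_origin_mismatch("1.1.1.0/24", 13335, prefixes, origins_after_fix)
flagged_after_v6 = is_origin_mismatch("2606:4700:4700::/48", 13335, prefixes, origins_after_fix)
```

Under this model, once 13336 is corrected to 13335, no new update from 13335 for either address family should be flagged.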

Screenshots

Screenshot 1.

Screenshot 2.

Screenshot 3.

Screenshot 4.

Screenshot 5.

Screenshot 6.

System (please complete the following information):

OS: [CentOS Linux release 8.3.2011]
Browser [N/A happens with all browsers]
Version [latest at the moment of filling the issue]



vkotronis · 2021-09-23T17:38:52Z

@leopoul thanks for reporting this! Very detailed information! Could you also try the same without using any scaling parallelism?
i.e., docker-compose down && docker-compose up -d?
I suspect that we are missing something when using parallel instances, but first let's verify this. You are correct in that we tie the prefix tree rule info to the message being passed down to the detectors. I will also replicate locally without any parallelism to determine the cause of the non-determinism you observe. Will update the thread here.

leopoul · 2021-09-25T08:01:16Z

@vkotronis my pleasure!

Issue happened again without scaling. Steps:

Removed postgres and mongo folders to have a clean env.
I made sure my initial config that causes pseudo hijacks is in local_configs/backend/config.yml.
Started without scaling, same steps as above.
I let it run for 10-15 minutes making sure there are hijacks detected for v4 and v6 for the same AS, in this case 13335.

I observe the same issue. Upon editing the config and switching from 13336 to 13335, I see that the v6 prefix hijack is considered outdated, yet the v4 one still persists. After the change I let it run for 10 minutes or so, making sure I received BGP updates for both v4 and v6 for the same origin AS. As I ran the above steps multiple times, I observed that in some cases the v4 prefix was considered outdated and the v6 one was not; it was random (see last screenshot).
Don't know if this is related to the fact that there are 2 (or potentially more) prefixes from the same origin AS which are affected and the prefix tree is refreshed only for one.

Before change:

After change:
Hijacks:

Hijacks during one of my tests showing v4 considered outdated and v6 considered as hijacked:

I will let it run for some hours and will report back if I see any changes.

leopoul · 2021-09-26T19:01:50Z

No changes 36 hours after modifying the config.

vkotronis · 2021-10-21T09:04:37Z

@leopoul I have not forgotten about this issue, just checking ways to replicate it since it seems to be non-deterministic.

vkotronis · 2021-10-21T09:08:18Z

might be correlated with #611

leopoul · 2021-10-21T09:12:49Z

Could be related, yes. I can try only v4 and then only v6.

leopoul · 2022-10-12T23:39:07Z

I did some more digging into this and submitted a fix: #662. The issue hits whenever there is a config change while Artemis is running. It does not occur if one shuts down all containers and starts again. It looks as if prefixtree remains stale for either v4 or v6 after changes. With this in mind here is what I found:

Upon initialization prefix_tree_recalculate is set to True https://github.com/FORTH-ICS-INSPIRE/artemis/blob/master/backend-services/prefixtree/core/prefixtree.py#L437
Upon config changes, prefix_tree_recalculate is set to True again, and the first prefix that triggers this method: https://github.com/FORTH-ICS-INSPIRE/artemis/blob/master/backend-services/prefixtree/core/prefixtree.py#L659 causes a new recalculation: https://github.com/FORTH-ICS-INSPIRE/artemis/blob/master/backend-services/prefixtree/core/prefixtree.py#L668. Since the recalculation sets prefix_tree_recalculate to False, subsequent calls to the function will not trigger recalculation. So far so good. The problem is that the first prefix that "locks" the recalculation defines which pytricia tree will be recalculated, e.g. if it is a v4 prefix then only the v4 tree is recalculated, and if it is a v6 prefix then only the v6 tree. In practice, based on the initial example in this issue, once I modify the configuration, only the v4 or only the v6 tree is recalculated, which leaves the other tree stale and leads to errors.
To solve the issue I split prefix_tree_recalculate into two flags, prefix_tree_recalculate_v4 and prefix_tree_recalculate_v6, and replaced every occurrence of prefix_tree_recalculate with them. After deploying this in my dev env and testing, any change in the config had an immediate effect on both v4 and v6, since prefixes reaching https://github.com/FORTH-ICS-INSPIRE/artemis/blob/master/backend-services/prefixtree/core/prefixtree.py#L659 will trigger recalculation for both trees and essentially the "lock" https://github.com/FORTH-ICS-INSPIRE/artemis/blob/master/backend-services/prefixtree/core/prefixtree.py#L673 will take place per ip_version.
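
For readers who have not looked at prefixtree.py, the flag behaviour described above can be illustrated with a simplified model. This is a sketch, not the Artemis code: the class names, the `lookup` method and the integer "versions" standing in for the pytricia trees are all hypothetical, and it assumes both trees match the config at startup.

```python
class SingleFlagTrees:
    """Hypothetical model of the old behaviour: one shared recalculation flag."""

    def __init__(self):
        self.config_version = 0
        # Integers stand in for the v4/v6 trees: which config each was built from.
        self.tree_version = {"v4": 0, "v6": 0}
        self.recalculate = True

    def on_config_change(self):
        self.config_version += 1
        self.recalculate = True

    def lookup(self, ip_version):
        if self.recalculate:
            # Only the tree of the triggering address family is rebuilt...
            self.tree_version[ip_version] = self.config_version
            # ...but the shared flag is cleared for both families.
            self.recalculate = False
        return self.tree_version[ip_version]


class PerVersionFlagTrees:
    """Hypothetical model of the fix: one recalculation flag per IP version."""

    def __init__(self):
        self.config_version = 0
        self.tree_version = {"v4": 0, "v6": 0}
        self.recalculate = {"v4": True, "v6": True}

    def on_config_change(self):
        self.config_version += 1
        self.recalculate = {"v4": True, "v6": True}

    def lookup(self, ip_version):
        if self.recalculate[ip_version]:
            # Each family rebuilds its own tree and clears only its own flag.
            self.tree_version[ip_version] = self.config_version
            self.recalculate[ip_version] = False
        return self.tree_version[ip_version]


buggy = SingleFlagTrees()
buggy.on_config_change()
buggy_v4 = buggy.lookup("v4")  # first prefix after the change is v4: v4 tree rebuilt
buggy_v6 = buggy.lookup("v6")  # flag already cleared: v6 tree stays on the old config

fixed = PerVersionFlagTrees()
fixed.on_config_change()
fixed_v4 = fixed.lookup("v4")
fixed_v6 = fixed.lookup("v6")
```

In the single-flag model, whichever address family sends the first update after a config change "wins", which matches the random v4-or-v6 staleness reported earlier in this thread.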

I have done extensive testing with a large number of prefixes and various prefixtree scaling settings (1,2,4). In all cases the issue I had initially reported does not occur anymore.
An alternative solution to what I have submitted would be to run a loop over v4 and v6 and force the upgrade of both here: https://github.com/FORTH-ICS-INSPIRE/artemis/blob/master/backend-services/prefixtree/core/prefixtree.py#L669. I can try it as well and submit a PR for this approach; let me know if you want to go that route.
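
A rough sketch of that alternative, using the same kind of simplified model (hypothetical names, integers standing in for the pytricia trees, not the actual Artemis code): the single flag is kept, but whichever prefix triggers the recalculation rebuilds both trees before the flag is cleared.

```python
class LoopRebuildTrees:
    """Hypothetical model: single flag, but both trees rebuilt in one pass."""

    IP_VERSIONS = ("v4", "v6")

    def __init__(self):
        self.config_version = 0
        # Integers stand in for the trees: which config each was built from.
        self.tree_version = {version: 0 for version in self.IP_VERSIONS}
        self.recalculate = True

    def on_config_change(self):
        self.config_version += 1
        self.recalculate = True

    def lookup(self, ip_version):
        if self.recalculate:
            # Rebuild every address family, not just the triggering one.
            for version in self.IP_VERSIONS:
                self.tree_version[version] = self.config_version
            self.recalculate = False
        return self.tree_version[ip_version]


trees = LoopRebuildTrees()
trees.on_config_change()
first_lookup = trees.lookup("v6")         # a v6 prefix triggers the rebuild
other_family = trees.tree_version["v4"]   # v4 tree was rebuilt in the same pass
```

The trade-off in this model is that the first update after a config change pays for rebuilding both trees, whereas the per-version flags defer each rebuild until a prefix of that family arrives.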

vkotronis added the bug and detection labels Oct 21, 2021

leopoul mentioned this issue Oct 12, 2022

Split prefixtree recalculation indicator (fixes #610) #662

Merged

18 tasks

vkotronis closed this as completed in #662 Oct 13, 2022

vkotronis pushed a commit that referenced this issue Oct 13, 2022

Split prefixtree recalculation indicator (fixes #610) (#662)

39531bc
