This repository has been archived by the owner on Feb 17, 2022. It is now read-only.

Review ETL processes for sanitization, robustness and correctness #45

Open
seanshahkarami opened this issue Sep 7, 2017 · 1 comment

seanshahkarami commented Sep 7, 2017

We should do a simple review of the main processes that load data into the databases, process it, and so on. Some of the things we're looking for:

  1. Do they apply sanitization? For example, do they ensure consistent node_ids, encoding, naming, etc.?
  2. Do they handle invalid data correctly? At least one process just drops bad blobs on failure. We would probably prefer to flag that data and route it to an error queue (or similar) for later inspection.
  3. Are they tolerant of database and broker delays, timeouts, etc.? This means not crashing immediately if the database is busy, making sure messages are acknowledged properly, and so on.
  4. Are they relatively efficient in their implementation?
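
As a rough illustration of points 1 and 2, here is a minimal sketch of a worker step that normalizes incoming records and routes anything it cannot parse to an error queue instead of dropping it. Everything in it is an assumption made for illustration, not the actual beehive code: the comma-separated `node_id,sensor,value` message layout, the 16-character lowercase hex node_id convention, and the names `Record`, `InvalidRecord`, `normalize_node_id`, `parse` and `handle`. An in-memory `queue.Queue` stands in for a real error queue on the broker.

```python
import queue
from dataclasses import dataclass


class InvalidRecord(ValueError):
    """Raised when a raw blob cannot be turned into a clean record."""


@dataclass(frozen=True)
class Record:
    node_id: str
    sensor: str
    value: float


def normalize_node_id(raw: str) -> str:
    # Assumed convention: lowercase hex, no separators, zero-padded to 16 chars.
    node_id = raw.strip().lower().replace(":", "").replace("-", "")
    is_hex = all(c in "0123456789abcdef" for c in node_id)
    if not node_id or len(node_id) > 16 or not is_hex:
        raise InvalidRecord(f"bad node_id: {raw!r}")
    return node_id.rjust(16, "0")


def parse(blob: bytes) -> Record:
    # Hypothetical wire format: b"node_id,sensor,value", UTF-8 encoded.
    try:
        node_id, sensor, value = blob.decode("utf-8").strip().split(",")
        return Record(normalize_node_id(node_id), sensor.strip().lower(), float(value))
    except ValueError as exc:
        # Covers decode errors, wrong field counts, bad floats and bad node_ids.
        raise InvalidRecord(str(exc)) from exc


def handle(blob: bytes, store: list, errors: queue.Queue) -> bool:
    """Store a clean record, or keep the bad blob and the reason for inspection."""
    try:
        record = parse(blob)
    except InvalidRecord as exc:
        errors.put((blob, str(exc)))
        return False
    store.append(record)
    return True
```

With a real broker, the natural place to acknowledge a message would be after `handle` returns, since by then the blob has either been stored or parked in the error queue.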

This is worth looking at and getting correct now, as these will be part of our architecture regardless of how we redesign beehive.

@seanshahkarami seanshahkarami changed the title Review ETL processes for data sanitization, robustness and correctness Review ETL processes for sanitization, robustness and correctness Sep 7, 2017
@gemblerz gemblerz self-assigned this Sep 8, 2017

seanshahkarami commented

All the workers now retry their connections properly on startup, which should cut down on workers crashing immediately and restarting when the message broker is down.
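
For reference, one common way to do this kind of startup retry is exponential backoff with a cap. The sketch below is not the workers' actual code; `connect_with_retry`, its parameters, and the choice of exceptions to retry on are all hypothetical, and `connect` is any zero-argument callable that opens the connection.

```python
import time


def connect_with_retry(
    connect,
    attempts=5,
    base_delay=1.0,
    max_delay=30.0,
    retry_on=(ConnectionError, TimeoutError),
    sleep=time.sleep,
):
    """Call connect() until it succeeds, doubling the wait after each failure."""
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except retry_on:
            if attempt == attempts:
                # Out of attempts: let the caller (or supervisor) see the error.
                raise
            sleep(delay)
            delay = min(delay * 2, max_delay)
```

A worker would wrap its broker or database connection call in this, for example `connect_with_retry(open_connection)`, where `open_connection` is whatever the client library provides. The `sleep` parameter exists only so the delay can be replaced in tests.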
