Zeno v2 #166

Draft
wants to merge 249 commits into
base: main

Conversation

equals215
Member

No description provided.

equals215 and others added 30 commits November 18, 2024 15:10
CorentinB and others added 30 commits January 12, 2025 19:58
* postprocess: corrected some smells

* postprocess: renamed some variables and corrected for-loop variables

* postprocess: postprocessItem args

* postprocess: never set the state of the parent before adding a child, this is done via AddChild() method

* item: reinforced CheckConsistency method

* global: enforcing stricter state and consistency check for items throughout stages in the pipeline

* item: corrected CheckConsistency() and made more unit tests

* item&finisher: make use of CompleteAndCheck() method on an item to parse the tree before handling further

* item: CompleteAndCheck() overlooked return conditions

* pre/postprocess: trying to fix the flow of children

* dumper: add a Dump() function to properly dump an Item for further debugging

* preprocessor: correct exclusion logic

* item.Dedupe: corrected an edge case where a completed child has the same URL as the seed and dedupe was trying to remove the seed

* postprocess: correct failed outlink extraction behaviour

* Add more detailed pyroscope information

* postprocess: add more debug logging to troubleshoot an unknown bug

* preprocess: add itemId in panic

* postprocess: always postprocess an item EVEN IF ASSETS CAPTURE IS DISABLED

* archiver: close spooledBuffer if error happened during body processing

* postprocess: close all bodies of an item tree before continuing in the pipeline

* archiver: try to write bodies only on disk

* add: small memory optimization for URLToString & encodeQuery

* chore: upgrade Go version & dependencies

* chore: bump warc lib to v.0.8.62

* fix: usage of spooledtempfile lib

* chore: bump warc lib to v.0.8.63

* postprocess: defer a closeBodies call on every item that goes through

* log: disable log queue full error message when TUI is used

* cmd: add no-stderr-log flag

* hq.consumer: replace previousBatch check with a reactor duplicate check

* pyroscope: bump upload rate from 15s to 5s

* fix: add panic for errors in startPipeline, retry indefinitely on HQ start error

* fix: not returning when hq.Start fails to init HQ client

* fix: typo

* fix: HQ Start failure marking init as already done

* fix: panic when HQ init fails

* add: truthsocial.com preprocessing & post-processing

* chore: bump warc lib to v.0.8.64

* add: more truthsocial.com special handling

* add: more truthsocial.com special handling

* add: more truthsocial.com special handling

* fix: variable scope for truthsocial special handling

* fix: domains crawl

* fix: set assets hops to their seed hop

* fix: extraction of outlinks on assets

---------

Co-authored-by: Jake L
Co-authored-by: Corentin Barreau
3 participants