June 9, 2022
alejandropaz edited this page Jun 9, 2022 · 2 revisions
- Shensong to write documentation on the following: crawler numbers for error registering and pausing the crawler, the break when the queue goes to 0, and Apify crawling in rounds.
- Shensong to send NYT archive politics crawl to Alejandro after postprocessing.
- Shensong to comment back to Apify developers so they are aware of limitations of error reporting.
- Shensong to continue working on the post-processor refactoring.
- Shensong to send Kirsta and Alejandro info re: data structure.
- old postprocessor: over a thousand lines
- now split into 4 parts
- 1: input: load the scope CSV into a dictionary; saved in /saved/ as JSON (easier to debug); another scope for Twitter: for postprocessor
- results: saved in Parquet format
- 2: postprocessor: first postprocess Twitter and domain data separately to find citation aliases, propagate tags, names, etc.; then cross-reference domain and Twitter data
- 3: post-postprocessor:
- 4: post-utils: helper files - write to files, given dictionary and row parser
- everything is now written in Dask, using DataFrame partitions; everything is imported beforehand
- Dask allows for visualizations and graphs
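The input stage described above (scope CSV loaded into a dictionary, then saved as JSON for easier debugging) might look roughly like the following. This is a minimal sketch using only the Python standard library, not the actual postprocessor code: the function names `load_scope` and `save_scope`, the `domain` key column, and the `scope.json` file name are all assumptions made for illustration.

```python
import csv
import json
from pathlib import Path


def load_scope(csv_path, key_column="domain"):
    """Read a scope CSV into a dictionary keyed by one column.

    `key_column` is a hypothetical column name; the real scope file
    may use a different header.
    """
    scope = {}
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            # Normalise the key so later lookups are case-insensitive.
            key = row[key_column].strip().lower()
            scope[key] = row
    return scope


def save_scope(scope, out_dir="saved", name="scope.json"):
    """Dump the scope dictionary as indented JSON so it is easy to inspect."""
    out_path = Path(out_dir) / name
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with open(out_path, "w", encoding="utf-8") as handle:
        json.dump(scope, handle, indent=2, ensure_ascii=False, sort_keys=True)
    return out_path
```

Keeping the intermediate dictionary as plain JSON means it can be opened in any editor when a scope entry looks wrong, which matches the debugging motivation in the notes.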
- metascraper now saves to CSV - working fine
- benchmark (40,000 KPP data): 1 min to load, 1 min 3 sec to postprocess, and a few seconds to create the output
- same structure but adding twitter counts (retweets/likes etc)
- small domain still crawling - 308,000 crawled thus far
- NYT politics archive - done, will postprocess with new postprocessor - benchmark each part
- theguardian - still going - about 800,000 URLs crawled over 2 weeks, with a few breaks; need to slow down
- done in stealth mode
- documentation and new repo for new postprocessor
- Twitter: embedded tweet issue
- testing the new postprocessor on KPP, old NYT, and new NYT data to check for discrepancies
- Apify pre-navigation: probably need a blacklist for each domain, but could look into it in the future
- using crawler proxies
- adding to regular postprocessor output:
- any non-scope domain hyperlink that ends in .co.il
- any link to a tweet or twitter handle
- This is a bit outside our normal functionality, so I will put it on the backburner for now.
- what to do with htz.li
- finding language function
- image_reference function
- dealing with embedded versus cited tweets
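The two proposed additions to the regular postprocessor output noted above (non-scope hyperlinks ending in .co.il, and links to a tweet or Twitter handle) could be detected with a small URL classifier. The sketch below is illustrative only: `classify_link`, the label strings, the list of Twitter hosts, and the reserved-path set are all assumptions, and it does not try to tell embedded tweets from cited ones.

```python
import re
from urllib.parse import urlparse

# Hosts treated as Twitter (assumed list, extend as needed).
TWITTER_HOSTS = {"twitter.com", "mobile.twitter.com"}
# Path segments on twitter.com that are not user handles (illustrative, incomplete).
RESERVED_PATHS = {"home", "search", "hashtag", "intent", "share", "i", "explore"}
HANDLE_RE = re.compile(r"^[A-Za-z0-9_]{1,15}$")


def hostname(url):
    """Return the lower-cased host without a leading 'www.'.

    URLs without a scheme yield an empty string, so they are never matched.
    """
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def in_scope(host, scope_domains):
    """True if the host is a scope domain or a subdomain of one."""
    return any(host == d or host.endswith("." + d) for d in scope_domains)


def classify_link(url, scope_domains):
    """Return 'tweet', 'twitter_handle', 'non_scope_co_il', or None."""
    host = hostname(url)
    if host in TWITTER_HOSTS:
        parts = [p for p in urlparse(url).path.split("/") if p]
        # e.g. /<handle>/status/<numeric id>
        if len(parts) >= 3 and parts[1] == "status" and parts[2].isdigit():
            return "tweet"
        # e.g. /<handle>, excluding known non-profile paths
        if (len(parts) == 1 and HANDLE_RE.match(parts[0])
                and parts[0].lower() not in RESERVED_PATHS):
            return "twitter_handle"
        return None
    if host.endswith(".co.il") and not in_scope(host, scope_domains):
        return "non_scope_co_il"
    return None
```

A call such as `classify_link(url, scope_domains)` would be run on each extracted hyperlink, with `scope_domains` taken from the keys of the scope dictionary. Shortened links such as htz.li would need to be resolved first, which this sketch does not attempt.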