April 28, 2022
alejandropaz edited this page Apr 28, 2022
- postprocessor issue with text alias
- finalize clean up, updating, and documentation of methods of NYT crawl
- assessment of any updates needed for libraries
- kpp/mediacat postprocessed results
- postprocess NYT site crawl - think about NYT -- why cut off?
- memory issues with larger dataset
- old instance (16 CPU) couldn't process 900,000 articles; needed to use a larger instance (40 CPU)
- we expect there is a limit to the size of the dataset that the postprocessor can process, but we can't know in advance what it is.
- this is another reason to do smaller crawls
- text alias issue: simple error, punctuation was being treated as part of the word
- Shengsong will send re-processed NYT Archive crawl results
- kpp/mediacat twitter data was processed without a hitch on larger instance
- documentation? - github page and domain crawler
- why stop at 900,000?
- once postprocessing is done we can see if older articles have invalid links
- updates are done: removed unused dependencies, like metascraper
- apify v 2.30 update done
- basic language
- master crawler in python (timing, stopping script) - latest version
- JS: node.js already updated
- all major updates are done; there could be a few smaller ones left in the postprocessor
- Shengsong will look at these next week
- needs to be checked
- Shengsong will put them in groups of 500,000
- 900,000 articles and finished postprocessor: 6,000 rows
- seems low, Shengsong will check
- had an issue: unhandled error causing it to stop
- will re-start and if error returns, Shengsong will look
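
The text alias issue noted above (punctuation being treated as part of the word) can be illustrated with a minimal sketch. This is not the postprocessor's actual code: the function name `find_text_aliases` and the sample aliases are hypothetical, and it assumes aliases are matched case-insensitively as whole words.

```python
import re


def find_text_aliases(text, aliases):
    """Return the aliases that appear in `text` as whole words (hypothetical helper).

    Lookarounds on word characters are used instead of splitting on whitespace,
    so trailing punctuation ("Haaretz." or "Times,") does not block a match.
    """
    found = []
    for alias in aliases:
        # (?<!\w) and (?!\w) require that the alias is not glued to other
        # letters or digits, while allowing punctuation on either side.
        pattern = r"(?<!\w)" + re.escape(alias) + r"(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(alias)
    return found
```

A naive `text.split()` would produce the token `Times,` and miss the alias; the pattern above matches it while still rejecting `NYT` inside `NYTimes`.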
- send re-processed NYT Archive crawl results
- document the following: there is a limit to the size of the dataset that the postprocessor can process, but we can't know in advance what it is
- finalize post-processing of NYT regular crawl, and consider the earliest articles, looking for invalid URLs
- look at smaller libraries in postprocessor to see if they need updating
- group KPP/MediaCAT results in groups of 500,000
- re-start small domain crawl: if error returns, Shengsong will troubleshoot
- check why postprocessing of the small domain crawl produced only 6,000 relevant hits
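
Splitting results into groups of 500,000, as planned above to stay under the postprocessor's memory limit, could look roughly like the sketch below. The helper name `in_groups` is hypothetical, and how the records are loaded and written back is left out because the notes do not specify it.

```python
from itertools import islice


def in_groups(records, group_size=500_000):
    """Yield successive lists of at most `group_size` records (hypothetical helper).

    Works on any iterable, so records can be streamed from disk rather than
    loaded all at once.
    """
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    iterator = iter(records)
    while True:
        group = list(islice(iterator, group_size))
        if not group:
            return
        yield group
```

Each group can then be passed to the postprocessor as a separate run; only one group is held in memory at a time.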
- adding to regular postprocessor output:
- any non-scope domain hyperlink that ends in .co.il
- any link to a tweet or twitter handle
- This is a bit outside our normal functionality, so I will put it on the backburner for now.
- how to get multithreading with postprocessor
- what to do with htz.li
- small domain crawl
- Benchmarking
- finish documenting where different data are on our server
- finding language function
- image_reference function
- dealing with embedded versus cited tweets
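
For the backburner item on extra postprocessor output (non-scope `.co.il` hyperlinks, and links to a tweet or Twitter handle), a minimal sketch of the link classification is given below. Everything here is an assumption for illustration: the function name `classify_link`, the label strings, the list of Twitter hosts, and the idea that scope domains are stored without a `www.` prefix.

```python
from urllib.parse import urlparse

# Assumed set of hosts that count as Twitter links.
TWITTER_HOSTS = {"twitter.com", "www.twitter.com", "mobile.twitter.com"}


def classify_link(url, scope_domains):
    """Return a label for links of interest, or None (hypothetical helper)."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()

    if host in TWITTER_HOSTS:
        parts = [p for p in parsed.path.split("/") if p]
        # A tweet URL looks like /<handle>/status/<id>.
        if len(parts) >= 3 and parts[1] == "status":
            return "tweet"
        # A single path segment is treated as a handle; reserved pages such
        # as /search would need to be filtered out separately.
        if len(parts) == 1:
            return "twitter_handle"
        return None

    # Drop a leading "www." before comparing against the scope list.
    bare_host = host[4:] if host.startswith("www.") else host
    if bare_host.endswith(".co.il") and bare_host not in scope_domains:
        return "co_il_out_of_scope"
    return None
```

For example, `classify_link("https://twitter.com/nytimes/status/123", set())` is labelled as a tweet, while a `.co.il` link whose domain is already in scope returns `None`.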