May 12, 2022
alejandropaz edited this page May 19, 2022 · 5 revisions
- look at the 403 errors - verify that the problem is not the way we are crawling
- re-send KPP data with tags
- postprocess the small domain crawl - without the domains that didn't work
- look at using dask for postprocessing
- start crawl of cnn.com
- 403 issue
- our IP address was blocked -- could move to another instance, but crawl speed is probably the issue
- Apify stealth mode: changes the fingerprint (combo of data points, like browser/ip/etc)
- maybe try this with mondoweiss and middleeasteye with random wait time of 1-2 sec
- is it possible to get a block of IPs, or proxies using IPs?
- additional problem: getting blocked on the postprocessor call. Idea: stringify the HTML in the crawler so that metascraper doesn't make an additional request
- try either JSON.stringify (or Base64 to encode and decode), or move metascraper/readability back into the domain crawler
- some links for research: https://stackoverflow.com/questions/22551586/write-html-string-in-json
- https://www.blackdown.org/best-datacenter-proxies/
- https://oxylabs.io/blog/rotate-ip-address
- another possibility: do a lot of domains and sequence the calls, but this requires customizing apify
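
A minimal sketch of the stringify/Base64 idea above: the crawler packs the HTML it already fetched into a JSON payload, and the postprocessor unpacks it instead of requesting the page again. This is written in Python for illustration only; in the Node pipeline the corresponding calls would be `JSON.stringify` and `Buffer.from(html).toString('base64')`. The function names `pack_page` and `unpack_page` and the field name `html_b64` are hypothetical, not part of the existing crawler.

```python
import base64
import json


def pack_page(url, html):
    """Serialize a crawled page so it can be passed along without re-fetching.

    Hypothetical helper: Base64 keeps quotes, newlines, and non-ASCII text
    in the HTML from interfering with the JSON envelope.
    """
    encoded = base64.b64encode(html.encode("utf-8")).decode("ascii")
    return json.dumps({"url": url, "html_b64": encoded})


def unpack_page(payload):
    """Reverse pack_page: return (url, html) from the JSON payload."""
    record = json.loads(payload)
    html = base64.b64decode(record["html_b64"]).decode("utf-8")
    return record["url"], html
```

Plain JSON serialization already escapes quotes and newlines, so Base64 is optional; it mainly helps if the payload passes through something that mangles raw HTML, at the cost of roughly a third more bytes.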
- question: adding text aliases and re-running scope?
- CNN.com crawl?
- like NYT, stuck crawling a lot of less useful stuff, less than 3000
- could it be that we are blocked without getting a 403?
- one example from KPP data about embedded tweets -- not urgent
- dask?
- small domain crawl postprocessor?
- postprocessor is very messy, including many different data structures and old stuff that isn't useful
- try slower crawl with single call procedure (as discussed above)
- Alejandro: look at proxies for crawling: https://www.blackdown.org/best-datacenter-proxies/
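
The slower crawl with a random 1-2 second wait between requests could look roughly like the sketch below. It is in Python for illustration; `crawl_slowly` and its `fetch` argument are hypothetical stand-ins for whatever the crawler actually calls, and the 1-2 second range is the one suggested in the notes above.

```python
import random
import time


def crawl_slowly(urls, fetch, low=1.0, high=2.0, sleep=time.sleep):
    """Fetch URLs one at a time, pausing a random interval between requests.

    `fetch` is a caller-supplied function (hypothetical here) that takes a
    URL and returns the page. `sleep` is injectable so the pacing can be
    tested without actually waiting.
    """
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            # Random wait so requests do not arrive at a fixed rhythm.
            sleep(random.uniform(low, high))
        results[url] = fetch(url)
    return results
```

Randomizing the wait only changes request timing; it does not change the IP address or browser fingerprint, so it would complement proxies or stealth mode rather than replace them.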
- Monday meeting:
- finish documenting where different data are on our server
- question: adding text aliases and re-running scope?
- one example from KPP data about embedded tweets -- not urgent
- postprocessor refactoring -- to check back next week
- adding to regular postprocessor output:
- any non-scope domain hyperlink that ends in .co.il
- any link to a tweet or twitter handle
- This is a bit outside our normal functionality, so I will put it on the backburner for now.
- what to do with htz.li
- finding language function
- image_reference function
- dealing with embedded versus cited tweets
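
For the proposed additions to the regular postprocessor output (non-scope `.co.il` hyperlinks, and links to tweets or Twitter handles), a rough classification sketch is shown below. It is in Python for illustration; `classify_link`, the label strings, and the list of Twitter hostnames are assumptions, not existing postprocessor code.

```python
import re
from urllib.parse import urlparse

# Assumed set of hostnames to treat as Twitter.
TWITTER_HOSTS = {"twitter.com", "www.twitter.com", "mobile.twitter.com"}
TWEET_PATH = re.compile(r"^/[A-Za-z0-9_]{1,15}/status/\d+")
HANDLE_PATH = re.compile(r"^/[A-Za-z0-9_]{1,15}/?$")


def classify_link(url, scope_domains):
    """Return 'tweet', 'twitter_handle', 'co_il', or None for a hyperlink.

    `scope_domains` is a set of bare domains (no 'www.') already in scope.
    """
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host in TWITTER_HOSTS:
        if TWEET_PATH.match(parsed.path):
            return "tweet"
        if HANDLE_PATH.match(parsed.path):
            return "twitter_handle"
        return None
    bare = host[4:] if host.startswith("www.") else host
    if bare.endswith(".co.il") and bare not in scope_domains:
        return "co_il"
    return None
```

The handle pattern is deliberately loose: single-segment Twitter paths such as `/home` or `/search` would also be labeled as handles, and shortened links would need to be expanded before classification.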