February 04, 2021
inikolaidis edited this page Feb 4, 2021
- Post-processor framework
- The connection running the post-processor dropped with a "socket broke connection failed" error
- How can the post-processor resume instead of restarting from the beginning when such an error arises?
- 1,000,000/5,000,000 entries completed
- Could approach this by dividing the content and multithreading
- Need the full dataset to establish relationships between items, but can the dataset be split up for actual post-processing?
- Create a dictionary of user handles and track each handle's completion in it
- This dictionary would have to be written to a file so that it is still available if the post-processor fails
- Would have to find a way to share the dictionary between all processes
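
The checkpointing idea above could look roughly like the sketch below. It is a minimal, single-process illustration and not the project's actual post-processor: the names `loadCheckpoint`, `saveCheckpoint`, `runPostProcessor`, and `processHandle` are hypothetical, the JSON file layout is an assumption, and sharing the dictionary between processes is left open.

```typescript
import * as fs from "fs";

// Hypothetical checkpoint: user handle -> whether its entries are finished.
type Checkpoint = Map<string, boolean>;

// Load a previous checkpoint, or start empty if no file exists yet.
function loadCheckpoint(file: string): Checkpoint {
  if (!fs.existsSync(file)) return new Map();
  const pairs = JSON.parse(fs.readFileSync(file, "utf8")) as [string, boolean][];
  return new Map(pairs);
}

// Write to a temporary file first, then rename it over the old checkpoint,
// so a crash in the middle of a write does not leave a truncated file.
function saveCheckpoint(file: string, done: Checkpoint): void {
  const tmp = `${file}.tmp`;
  fs.writeFileSync(tmp, JSON.stringify(Array.from(done.entries())));
  fs.renameSync(tmp, file);
}

// Process every handle not yet marked complete.
// Returns the handles that were processed during this run.
function runPostProcessor(
  handles: string[],
  file: string,
  processHandle: (handle: string) => void, // placeholder for the real work
): string[] {
  const done = loadCheckpoint(file);
  const processed: string[] = [];
  for (const handle of handles) {
    if (done.get(handle)) continue; // finished in an earlier run
    processHandle(handle);          // may throw, e.g. on a dropped connection
    done.set(handle, true);
    saveCheckpoint(file, done);
    processed.push(handle);
  }
  return processed;
}
```

Calling `runPostProcessor` again with the same file after a failure skips every handle already recorded as complete. With millions of entries, saving after every handle is likely too slow, so saving every N handles would be a reasonable variation.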
- Upgrade sudo for all instances on Compute Canada
- On hold for the instance currently in use until its job is done; the upgrade should be scheduled in
- Crawler performance
- Sites asking for cookie permissions need to be handled
- The "I don't care about cookies" browser extension
- Raiyan configured selective downloading so that text is available without having to wait for images and other media
- Need to find a way to handle popups
- Using Apify Puppeteer rather than Puppeteer alone makes it difficult to add extensions
- The crawl is asynchronous in form, but because we wait for each request to resolve, it is effectively synchronous
- Writing asynchronously requires processing after the crawl, not during it
- Puppeteer (non-Apify) needs to be revisited, and the pop-up problem still needs solving
- A "stealth" mode is available for Puppeteer (for example the puppeteer-extra stealth plugin); it needs to be looked into, as it may help with cookie handling
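
For the selective-downloading point above, one common way to skip images and other media in Puppeteer is request interception. The sketch below is an assumption about how this could be done, not a record of the configuration Raiyan wrote: `InterceptedRequest`, `BLOCKED_TYPES`, `shouldBlock`, and `handleRequest` are hypothetical names, and the list of blocked resource types is illustrative.

```typescript
// Minimal structural type covering only the request methods used below.
// Puppeteer's request objects provide methods with these names.
interface InterceptedRequest {
  resourceType(): string;
  abort(): Promise<void>;
  continue(): Promise<void>;
}

// Assumed set of resource types to skip; adjust to what the crawl needs.
const BLOCKED_TYPES = new Set<string>(["image", "media", "font"]);

// Decide whether a request of the given resource type should be dropped.
function shouldBlock(resourceType: string): boolean {
  return BLOCKED_TYPES.has(resourceType);
}

// Abort requests for blocked resource types and let everything else through.
async function handleRequest(request: InterceptedRequest): Promise<void> {
  if (shouldBlock(request.resourceType())) {
    await request.abort();
  } else {
    await request.continue();
  }
}
```

With plain Puppeteer, a handler like this would typically be attached by calling `await page.setRequestInterception(true)` and then `page.on('request', handleRequest)` before navigating. Whether the Apify wrapper exposes the same hook is not covered here.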