January 14, 2021
inikolaidis edited this page Jan 14, 2021 · 2 revisions
- Crawler performance
- Puppeteer was restarted with max concurrency, but this doesn't seem to address the performance issue
- Jacqueline is writing tests to identify why we are seeing this slow performance from Puppeteer
- Jacqueline and Raiyan got Cheerio working, and it retrieves the relevant data despite the lack of JavaScript rendering
- Cheerio is being run on a test instance; in one day it looked at 13,000 links and then stopped on its own
- Cheerio does not use a headless browser, and this could be an issue if sites block the crawler
- Raiyan and Jacqueline will continue to investigate, using a sample of the scope that includes sites returning a count of 1
- Currently we are using Puppeteer through the Apify SDK; choosing not to render media would require making changes through Puppeteer directly, without the Apify SDK
- Metascraper
- Alex added metascraper data columns, and created corresponding tests
- Mediacat Domain Crawler PR merged
- Post-processor
- Amy added logic to interest output for sorting
- Amy to run the post-processor on the whole Twitter output
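
For reference on the crawler performance item about not rendering media: a minimal sketch of how this might look if the crawler used Puppeteer directly, based on Puppeteer's request interception. The names `BLOCKED_TYPES`, `RequestLike`, `shouldBlock` and `handleRequest` are hypothetical, and the set of blocked resource types is an assumption for illustration, not a decision from the meeting.

```typescript
// Resource types to skip (an illustrative choice). The strings follow the
// values returned by Puppeteer's HTTPRequest.resourceType().
const BLOCKED_TYPES: ReadonlySet<string> = new Set(["image", "media", "font"]);

// Minimal shape of the request object this sketch relies on
// (a subset of Puppeteer's HTTPRequest), so it compiles without Puppeteer.
interface RequestLike {
  resourceType(): string;
  abort(): Promise<void>;
  continue(): Promise<void>;
}

// Decide whether a request should be dropped based on its resource type.
function shouldBlock(resourceType: string): boolean {
  return BLOCKED_TYPES.has(resourceType);
}

// Abort blocked resource types and let everything else through.
async function handleRequest(request: RequestLike): Promise<void> {
  if (shouldBlock(request.resourceType())) {
    await request.abort();
  } else {
    await request.continue();
  }
}

// Wiring with a real Puppeteer page (not executed in this sketch):
//   await page.setRequestInterception(true);
//   page.on("request", handleRequest);
```

Once interception is enabled, every request has to be either aborted or continued, which is why the handler always takes one of the two branches. Whether skipping media actually helps with the slow performance is still something to measure.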