July 14, 2021
alejandropaz edited this page Jul 14, 2021 · 3 revisions
- Twitter crawler diff -- what we found
- Domain crawler documentation (in-code & higher level to MVP), branch cleanup, and pushing to master -- done?
- Any chance to try running more than one instance on one server -- how far along?
- timelining:
- meeting of Alejandro & Raiyan to go over folders and instances in Compute Canada
- assess whether Raiyan will have time to upgrade to Apify 1.0 in next 2-3 weeks
- meeting of Alejandro & Kirsta to come up with work-study and co-op plans for Fall/Winter
- Hire a minimum of 1 work-study student for Fall/Winter
- Hire a minimum of 1 co-op placement for Winter
- we didn't do the NYTimes Twitter list; Alejandro will get the list from June
- Danhua may have limited the number of Twitter handles processed at a time, which could account for the missed handles
- Raiyan is going through the code; the handles that were not crawled are now being crawled
- creating a ticket to identify the issue: https://github.com/UTMediaCAT/mediacat-twitter-crawler/issues/10
- Raiyan wrote the documentation and updated the master branch; also updated to MVP
- Kirsta put together a diagram showing the flow of information for the domain crawler; several items were clarified
- Crawler is officially working!
- 160 hours of continuous crawling with no errors, at a steady 13 links per minute (100,000+ JSON files created)
- update Apify and assess the crawler; Raiyan to look into it: https://github.com/UTMediaCAT/mediacat-domain-crawler/issues/34
- not sure how long it will take, because things may break
- Raiyan will communicate with Compute Canada about the reason for the vulnerability and the update to Apify
- Raiyan is currently working on running more than one instance on one server, to get the instances reading from the same queue
- the Raiyan and Alejandro meeting and the Kirsta and Alejandro meeting are both set
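
As a quick sanity check on the domain crawler figures in these notes, the sketch below multiplies the reported run time (160 hours) by the reported rate (13 links per minute) and compares the result with the reported 100,000+ JSON files. It assumes roughly one JSON file per crawled link, which the notes do not state; the function name `estimated_links` is made up for illustration.

```python
# Figures taken from the meeting notes.
HOURS_CRAWLED = 160
LINKS_PER_MINUTE = 13
REPORTED_MIN_JSON_FILES = 100_000


def estimated_links(hours: int, links_per_minute: int) -> int:
    """Total links visited at a steady rate over the given number of hours."""
    return hours * 60 * links_per_minute


total = estimated_links(HOURS_CRAWLED, LINKS_PER_MINUTE)

# Assumption (not from the notes): about one JSON file is written per link.
print(f"Estimated links crawled: {total}")
print(f"Consistent with {REPORTED_MIN_JSON_FILES}+ files: {total >= REPORTED_MIN_JSON_FILES}")
```

Under that assumption the estimate comes to 124,800 links, which is in line with the 100,000+ files reported.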