January 6, 2022
alejandropaz edited this page Jan 6, 2022
- update on small site crawl (https://electronicintifada.net/, https://mondoweiss.net/, https://jewishjournal.com/, https://www.tabletmag.com/, http://jewishcurrents.org)
- does it need to be put through postprocessor? (John should have some notes)
- investigate whether Twint issue resolved: https://github.com/twintproject/twint/pull/1307
- several comments on that thread are no longer there -- so maybe the new commit worked?
- optimizing crawl speed with the JavaScript crawler (instead of redeploying the Python crawler) (see https://github.com/UTMediaCAT/mediacat-docs/wiki/December-16%2C-2021)
- would be great to crawl NYTimes.com
- John confirmed that several crawls can run on the same instance; we now just need to see whether it is possible to crawl one domain with several processes
- optimizing CSV preparation (see https://github.com/UTMediaCAT/mediacat-docs/wiki/December-16%2C-2021)
- CSV format? Would it be better to have separate, smaller files, e.g., one for each source site (like Haaretz.com)?
- would be really great to have the relevant hyperlink listed in each row somehow
- security issues from Compute Canada:
- If you have not applied OS updates recently, make time to schedule an outage of your instance to apply operating system and application updates.
- Review your security group rules and lock down access to services to as few remote IP prefixes as possible.
- Delete cloud instances you are no longer using/maintaining as those pose a security risk and are consuming valuable resources.
- issues with al-monitor crawl result, see: https://docs.google.com/document/d/1_306LFgJb0SheUyjN0vHwecaSHy93ntVfT68hABMKYM/edit#heading=h.rjswpmbmsyoa
- Shengsong will get into the Compute Canada security issues listed above (OS updates, security group rules, unused instances) as a way to get to know the resources:
- use this as an opportunity to map out all the resources on Compute Canada
- how many servers we have, what the storage is, etc.
- e.g., terraform, vcp graph, see https://www.reddit.com/r/devops/comments/f6t6wt/best_open_source_way_to_visualize_infrastructure/
- here is the doc with information about our resources: https://docs.google.com/document/d/1X47ZSj8U6fVSQYuwgMe3kL9czuSuxA4n69txKXT0URQ/edit
- create a new instance, install the software, and crawl nytimes.com
- running the latest code on nytimes.com to see the benchmark speed of the current crawler
- best solution is to set up JupyterHub on Compute Canada
- Colin: port scripts from existing notebook into hub
- thus wouldn't need to optimize CSV preparation
- need to do forensics to figure out errors
- we will ask Shengsong to look at this after he has fully gotten to work on Compute Canada
- Shengsong: Compute Canada updates and mapping
- Shengsong: setting up an instance of the domain crawler for nytimes.com
- Colin: setting up JupyterHub on our resources
- next steps for Shengsong:
- looking to see if Twint issues are resolved and a crawl can be done
- forensics on issues with al-monitor crawl (see link above)
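
On the CSV format question raised above (separate, smaller files per source site, with the relevant hyperlink in each row), here is a minimal sketch of what the split could look like. It is not the existing postprocessor code: the column names `url` and `found_link` and the helper names `source_site` and `split_by_source` are hypothetical, and it assumes every row has the same columns.

```python
import csv
from collections import defaultdict
from pathlib import Path
from urllib.parse import urlparse


def source_site(url: str) -> str:
    """Return the lowercased host of a URL without a leading 'www.'."""
    host = urlparse(url).hostname or "unknown"
    return host[4:] if host.startswith("www.") else host


def split_by_source(rows, out_dir, url_field="url"):
    """Write one CSV per source site and return {site: path}.

    `rows` is an iterable of dicts; `url_field` (a hypothetical column
    name) holds the URL of the article the row came from.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Group rows by the site their article URL belongs to.
    grouped = defaultdict(list)
    for row in rows:
        grouped[source_site(row[url_field])].append(row)

    paths = {}
    for site, site_rows in grouped.items():
        path = out_dir / f"{site}.csv"
        with path.open("w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=list(site_rows[0].keys()))
            writer.writeheader()
            writer.writerows(site_rows)
        paths[site] = path
    return paths
```

Keeping the hyperlink as its own column (here `found_link`) means each per-site file still shows which link produced the row, so the smaller files can be read on their own.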