August 9, 2022
alejandropaz edited this page Aug 9, 2022 · 2 revisions
- talk about conference presentation at last meeting; release: hope to meet with Kirsta and Nat before Labor Day
- finish testing of new URL expander (headless browser)
- finish code cleanup for Visualization environment, documentation is done: last meeting, record a session
- after NYT politics archive postprocessing is done, next is KPP postprocessing
- brainstorm: https://docs.google.com/document/d/15tuHL_MwW93lbZgH3BNod9OT8gXC7X-vF-sxU58BSQM/edit#heading=h.8zcw8dxxfbr
- make meeting week of Aug 29th?
- basically done, based on domain crawler, should be good - the only remaining test is to see whether it will be flagged
- however, GET request isn't that unusual: too many
- is there any way to get the unshortened URL from the Twitter API? - not a lot of documentation on this
- create an issue with Twitter API to see what they say
- possible to rsync to the smaller instance to free up storage
- mount the larger storage to the larger instance
- storage volumes can be attached or detached, so it's possible to connect both server instances to the larger storage.
- if this works, re-start Guardian crawl and WaPo/Foxnews Twitter crawl
- small domain crawl: 1.8 million
- attach large instance to large storage, and if it works, re-start Guardian crawl and WaPo/Foxnews Twitter crawl
- once a week in August: check that crawls are functioning
- IMPORTANT FOR Sept: change storage distribution
- Twitter embed issue
- Re-start Guardian, WaPo/Foxnews
- Begin new crawls, Israeli Palestinian
- look at scope together
- what about news site URLs that no longer exist?
- what about preceding website or backslash website URLs?
- Apify pre-navigation: probably need a blacklist for each domain, but could look into it in the future
- using crawler proxies
- adding to regular postprocessor output:
- any non-scope domain hyperlink that ends in .co.il
- any link to a tweet or Twitter handle
- This is a bit outside our normal functionality, so I will put it on the backburner for now.
- what to do with htz.li
- finding language function
- image_reference function
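
The two additions to the regular postprocessor output noted above (non-scope hyperlinks ending in .co.il, and links to tweets or Twitter handles) could be sketched roughly as follows. This is only an illustration, not the project's postprocessor: the function name `classify_link`, the label strings, the `IN_SCOPE_DOMAINS` set and its single entry, and the list of reserved Twitter paths are all assumptions made for the example.

```python
import re
from urllib.parse import urlparse

# Hypothetical scope list; the real in-scope domains are not given in these notes.
IN_SCOPE_DOMAINS = {"haaretz.co.il"}

TWITTER_HOSTS = {"twitter.com", "www.twitter.com", "mobile.twitter.com"}
# Assumed (incomplete) set of twitter.com path prefixes that are not user handles.
RESERVED_PATHS = {"i", "home", "search", "hashtag", "intent", "share", "explore", "settings"}

# /<handle>/status/<id> or /i/web/status/<id>
TWEET_RE = re.compile(r"^/(?:\w{1,15}|i/web)/status(?:es)?/\d+")
# /<handle> with an optional trailing slash
HANDLE_RE = re.compile(r"^/(\w{1,15})/?$")


def classify_link(url):
    """Return a label for links worth adding to the output, or None otherwise."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()

    if host in TWITTER_HOSTS:
        if TWEET_RE.match(parsed.path):
            return "tweet"
        match = HANDLE_RE.match(parsed.path)
        if match and match.group(1).lower() not in RESERVED_PATHS:
            return "twitter_handle"
        return None

    # Compare without a leading "www." so subdomains of in-scope sites are skipped too.
    bare = host[4:] if host.startswith("www.") else host
    in_scope = any(bare == d or bare.endswith("." + d) for d in IN_SCOPE_DOMAINS)
    if bare.endswith(".co.il") and not in_scope:
        return "out_of_scope_co_il"
    return None


if __name__ == "__main__":
    for link in (
        "https://twitter.com/haaretzcom/status/1556789012345678901",
        "https://twitter.com/haaretzcom",
        "https://www.ynet.co.il/news/article/abc",
        "https://www.haaretz.co.il/news",
    ):
        print(link, classify_link(link))
```

Shortened links such as htz.li would fall through as `None` here; they would need to be expanded before classification.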