August 2, 2022
alejandropaz edited this page Aug 9, 2022 · 4 revisions
- work on headless browser URL expander
- Twitter embed issue
- code cleanup on D3 vector diagram
- Israeli news site crawl
- look at scope together
- what about news site URLs that no longer exist?
- what about preceding website or backslash website urls?
- save these questions for September
- testing new headless browser -- should be done this week
- it will slow down the process a lot, but shouldn't be an issue if it's the same speed as the domain crawler
- not yet
- still working on the code cleanup
- re-do the KPP postprocessing -- not yet
- need to change the storage distribution: only 1.1 TB on large instance
- restart WaPo/Fox News Twitter crawl -- paused
- restart the postprocessing of NYT politics archive -- running, should be done by this week
- Guardian: paused, at 1.9 million
- small domain crawl -- still running on the small instance, at 1.7 million -- this is probably the best speed
- doesn't need to be re-started, should be fine running by itself
- no point in starting new crawl right now
- talk about conference presentation at the last meeting; release: hope to meet with Kirsta and Nat before Labor Day
- finish testing of new URL expander (headless browser)
- finish code cleanup for the visualization environment (documentation is done); at the last meeting, record a session
- after NYT politics archive postprocessing is done, next is KPP postprocessing
- IMPORTANT FOR Sept: change storage distribution
- Twitter embed issue
- Re-start Guardian, WaPo/Fox News
- Begin new crawls: Israeli, Palestinian
- Apify pre-navigation: probably need a blacklist for each domain, but could look into it in the future
- using crawler proxies
- adding to regular postprocessor output:
- any non-scope domain hyperlink that ends in .co.il
- any link to a tweet or Twitter handle
- This is a bit outside our normal functionality, so I will put it on the back burner for now.
- what to do with htz.li
- finding language function
- image_reference function
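
The two proposed additions to the regular postprocessor output noted above (non-scope hyperlinks ending in .co.il, and links to tweets or Twitter handles) could be sketched roughly as follows. This is only an illustration, not the project's postprocessor: the function names, the `SCOPE` set, the `RESERVED` path list, and the output keys are all hypothetical, and only the Python standard library is used.

```python
from urllib.parse import urlparse

# Hypothetical in-scope domains, for illustration only.
SCOPE = {"example-news.co.il"}

# twitter.com paths assumed to be neither a tweet nor a handle (not exhaustive).
RESERVED = {"home", "search", "hashtag", "intent", "share", "i", "explore"}


def host_of(url):
    """Return the lowercased hostname without a leading 'www.', or ''."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def is_out_of_scope_co_il(url, scope=SCOPE):
    """True if the link's host ends in .co.il and is not an in-scope domain."""
    host = host_of(url)
    if not host.endswith(".co.il"):
        return False
    # Subdomains of an in-scope domain count as in scope.
    in_scope = any(host == d or host.endswith("." + d) for d in scope)
    return not in_scope


def classify_twitter_link(url):
    """Return 'tweet', 'handle', or None for anything else."""
    if host_of(url) not in {"twitter.com", "mobile.twitter.com"}:
        return None
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) >= 3 and parts[1] == "status":
        return "tweet"
    if len(parts) == 1 and parts[0].lower() not in RESERVED:
        return "handle"
    return None


def extra_output_links(hyperlinks, scope=SCOPE):
    """Collect the two extra link groups from a list of hyperlink URLs."""
    co_il = [u for u in hyperlinks if is_out_of_scope_co_il(u, scope)]
    twitter = [u for u in hyperlinks if classify_twitter_link(u)]
    return {"co_il": co_il, "twitter": twitter}
```

Matching on the parsed hostname rather than on the raw string keeps a URL whose path merely contains ".co.il" from being picked up. Shortened links such as htz.li would still need to be expanded before a check like this could see their real destination.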