June 23, 2022

Agenda:

the Guardian crawl: filter out comment URLs
NYT Mid E archive: test on new postprocessor
postprocessor: adding twitter counts to data structure
update to metascraper to include db to deal with errors and with re-starting after being stopped
visualizations: figure out jupyter
Alejandro: need more examples of embedded tweet issue, and send list of visualizations

Twitter: embedded tweet issue:
- still working on it
NYT Mid E archive: test on new postprocessor
KPP postprocessor: missing a few thousand, maybe small code error, troubleshooting
postprocessor: adding twitter counts to data structure
- done
update to metascraper to include db to deal with errors and with re-starting after being stopped
- Shengsong will work on this in Arbutus while we await Graham to return
tried adding multithreading to the postprocessor (Dask is supported), but it ended up slower -- why?
- Nat will look at the video of the postprocessor changes
the Dask method of looping through the data was not very fast; a plain for-loop is faster
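
Possible explanation for the slowdown, as a sketch only: if the per-record work is cheap, pure-Python code, then scheduling each record as its own task can cost more than the work itself, and threads cannot run Python bytecode in parallel because of the GIL. The example uses the standard library's ThreadPoolExecutor rather than Dask so it runs without extra installs. `process_record` and the record layout are hypothetical stand-ins, not the actual postprocessor code.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def process_record(record):
    # Hypothetical stand-in for the per-record postprocessing step:
    # cheap, CPU-bound, pure-Python work.
    return {"url": record["url"], "n_links": len(record["links"])}


def run_serial(records):
    # Plain for-loop (as a list comprehension): no scheduling overhead.
    return [process_record(r) for r in records]


def run_threaded(records, workers=4):
    # One task per record: every record pays submit/collect overhead,
    # and the GIL keeps the CPU-bound work from running in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_record, records))


def timed(fn, records):
    # Return the function's output together with its wall-clock time.
    start = time.perf_counter()
    out = fn(records)
    return out, time.perf_counter() - start


if __name__ == "__main__":
    records = [
        {"url": f"https://example.com/{i}", "links": ["a", "b"]}
        for i in range(20_000)
    ]
    serial_out, serial_s = timed(run_serial, records)
    threaded_out, threaded_s = timed(run_threaded, records)
    assert serial_out == threaded_out
    print(f"for-loop: {serial_s:.3f}s  threads: {threaded_s:.3f}s")
```

If the real postprocessor shows the same pattern, things to try would be batching many records per task or using processes instead of threads; whether either helps depends on what the per-record step actually does.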

making jupyter notebook work
some of the earlier work from Alice could be helpful here, by domain and url for vector diagrams
Shengsong will send the stacked area charts and one way vector diagrams
can use D3 for visualizations -- can run it on a local server

send the stacked area charts and one way vector diagrams from KPP data
troubleshooting errors on postprocessor discovered with KPP testing
update to metascraper to include db to deal with errors and with re-starting after being stopped - use Arbutus
Dask multithreading for postprocessor - troubleshooting why slower
consider D3 or similar for visualizing vector diagrams
Twitter: embedded tweet issue:
when Graham is back: the Guardian crawl: filter out comment URLs
for next week: consider Borealis to store datasets
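
For the "db to deal with errors and re-starting" item, a minimal sketch of the idea using SQLite from the Python standard library. This is not metascraper's API; the table layout, the function names, and the `scrape` callback are all assumptions for illustration. Each URL has a status row, so a stopped run can pick up only what is still pending, and failures are recorded instead of halting the run.

```python
import sqlite3


def open_queue(path):
    # One row per URL; status is 'pending', 'done', or 'failed'.
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS urls ("
        " url TEXT PRIMARY KEY,"
        " status TEXT NOT NULL DEFAULT 'pending',"
        " error TEXT)"
    )
    conn.commit()
    return conn


def add_urls(conn, urls):
    # INSERT OR IGNORE keeps existing rows, so re-adding the same list
    # after a restart does not reset finished work.
    conn.executemany(
        "INSERT OR IGNORE INTO urls (url) VALUES (?)", [(u,) for u in urls]
    )
    conn.commit()


def pending(conn):
    rows = conn.execute(
        "SELECT url FROM urls WHERE status = 'pending' ORDER BY url"
    )
    return [r[0] for r in rows]


def run(conn, scrape):
    # `scrape` is a hypothetical callable that raises on failure.
    for url in pending(conn):
        try:
            scrape(url)
            conn.execute(
                "UPDATE urls SET status = 'done', error = NULL WHERE url = ?",
                (url,),
            )
        except Exception as exc:
            conn.execute(
                "UPDATE urls SET status = 'failed', error = ? WHERE url = ?",
                (str(exc), url),
            )
        # Commit after every URL so progress survives a crash or stop.
        conn.commit()
```

Committing per URL is slower than batching but means at most one URL is repeated after an interruption. Failed rows keep their error message and could be reset to 'pending' for a retry pass.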

Apify pre-navigation: probably need a blacklist for each domain, but could look into it in the future
using crawler proxies
adding to regular postprocessor output:
1. any non-scope domain hyperlink that ends in .co.il
2. any link to a tweet or twitter handle
- This is a bit outside our normal functionality, so I will put it on the backburner for now.
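
A rough sketch of how the two extra link categories above could be detected, should this come off the backburner. The function name, the returned labels, and the example domains are hypothetical; the Twitter check assumes the usual `twitter.com/<handle>` and `twitter.com/<handle>/status/<id>` URL shapes and does not exclude reserved paths such as `/search`.

```python
from urllib.parse import urlparse

# Hosts treated as Twitter (after any leading "www." is stripped).
TWITTER_HOSTS = {"twitter.com", "mobile.twitter.com"}


def _host(url):
    # Lower-cased hostname without a leading "www.".
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def classify_link(url, scope_domains):
    """Return 'tweet', 'twitter_handle', 'co_il', or None."""
    host = _host(url)
    if host in TWITTER_HOSTS:
        parts = [p for p in urlparse(url).path.split("/") if p]
        if len(parts) >= 3 and parts[1] == "status":
            return "tweet"
        if len(parts) == 1:
            return "twitter_handle"
        return None
    # A link is in scope if its host is a scope domain or a subdomain of one.
    in_scope = any(host == d or host.endswith("." + d) for d in scope_domains)
    if not in_scope and host.endswith(".co.il"):
        return "co_il"
    return None
```

Example, with made-up domains: `classify_link("https://www.other.co.il/a", {"example.co.il"})` gives `"co_il"`, while the same call with a URL on `example.co.il` gives `None` because that domain is in scope.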
what to do with htz.li
finding language function
image_reference function
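
On the per-domain blacklist idea from the Apify pre-navigation note, which would also cover the Guardian comment-URL filter: a minimal Python sketch of the matching logic, to be ported to whatever the crawler uses. The patterns are assumptions (Guardian comment pages under `/discussion/` and `#comment` fragments) and need checking against real crawl output; `BLACKLIST` and `is_blacklisted` are hypothetical names.

```python
import re
from urllib.parse import urlparse

# Hypothetical per-domain patterns, matched against path + "#fragment".
# To be confirmed against actual crawl output before use.
BLACKLIST = {
    "theguardian.com": [
        re.compile(r"^/discussion/"),
        re.compile(r"#comment"),
    ],
}


def is_blacklisted(url, blacklist=BLACKLIST):
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    target = parsed.path + ("#" + parsed.fragment if parsed.fragment else "")
    for domain, patterns in blacklist.items():
        # Match the domain itself and any subdomain (e.g. "www.").
        if host == domain or host.endswith("." + domain):
            if any(p.search(target) for p in patterns):
                return True
    return False
```

Keeping the patterns in one dictionary keyed by domain means new domains can be added without touching the matching code.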