July 5, 2022
alejandropaz edited this page Jul 6, 2022 · 3 revisions
- visualizations: try D3 or other for better visualization library
- postprocessor:
- test changes to metascraper
- test changes with dask multithreading
- finalize troubleshooting of postprocessor differences on KPP data (capital letters, scope issue, etc.)
- make a private repo on GitHub and use it to store our datasets
- Alejandro will make a spreadsheet listing the crawls and information about them
- Twitter: embedded tweet issue
- still looking through documentation of D3
- fix postprocessor bugs connected to capital letters and to inconsistent scope definitions
- for Twitter handles, never use capital letters when defining scope
- dask multithreading: doesn't seem to work properly; not worth fiddling with
- metascraper: it's all working
- after the KPP data question is resolved, we'll test the NYT Middle East Archive search; if the postprocessor results aren't substantially different, the new postprocessor will be merged to master
- theguardian & small domain crawls now working again
- theguardian: at about 1000,000
- small domain: 1,000,000
- Alejandro will send Twitter handles
- started to move
- also possible to run metascraper on old datasets
- finalize testing of the new postprocessor and merge to master if it works
- start NYT Politics Archive postprocessing once the postprocessor is done
- continue learning D3 for edge-node
- start new crawl with the Twitter accounts Alejandro will send
- meet with Alejandro to finalize looking at datasets
- Twitter: embedded tweet issue
- to discuss next meeting:
- how to cut a release
- writing a paper about MediaCAT and architecture
- Apify pre-navigation: probably need a blacklist for each domain, but could look into it in the future
- using crawler proxies
- adding to regular postprocessor output:
- any non-scope domain hyperlink that ends in .co.il
- any link to a tweet or twitter handle
- This is a bit outside our normal functionality, so I will put it on the backburner for now.
- what to do with htz.li
- finding language function
- image_reference function
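
The proposed additions to the regular postprocessor output (non-scope hyperlinks ending in .co.il, and links to tweets or Twitter handles), together with the rule about never using capital letters for Twitter handles in scope, could be sketched roughly as follows. This is only an illustration and not the actual MediaCAT postprocessor code: the function names `normalize_handle` and `is_extra_output_link`, the `TWITTER_HOSTS` set, and the assumption that scope is given as a list of domain strings are all hypothetical.

```python
from urllib.parse import urlparse

# Assumed set of hostnames that count as "Twitter" for this sketch.
TWITTER_HOSTS = {"twitter.com", "www.twitter.com", "mobile.twitter.com"}


def normalize_handle(handle):
    """Lowercase a Twitter handle and strip a leading '@' (hypothetical helper)."""
    return handle.strip().lstrip("@").lower()


def is_extra_output_link(url, scope_domains):
    """Return True if a hyperlink should be added to the extra output.

    A link qualifies if it points to a tweet or Twitter handle, or if its
    host ends in .co.il and is not one of the in-scope domains.
    """
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if not host:
        return False

    # Any twitter.com link with a non-empty path (a handle or a tweet).
    if host in TWITTER_HOSTS:
        return bool(parsed.path.strip("/"))

    # Compare domains case-insensitively to avoid capital-letter mismatches.
    scope = {d.lower() for d in scope_domains}
    in_scope = any(host == d or host.endswith("." + d) for d in scope)
    return host.endswith(".co.il") and not in_scope


scope = ["Haaretz.co.il"]
print(is_extra_output_link("https://www.ynet.co.il/news/1", scope))
print(is_extra_output_link("https://www.haaretz.co.il/a", scope))
print(normalize_handle("@SomeHandle"))
```

Lowercasing both the scope entries and the link host before comparing is one way to sidestep the capital-letter inconsistencies noted above; how short links such as htz.li should be treated is left open here, as in the notes.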