March 10, 2022
alejandropaz edited this page Mar 10, 2022 · 3 revisions
- Twitter API: finalize crawl
- need a list of the JSON output from the crawler, with keys, plus other documentation as mentioned in Padlet
- get started on Puppeteer update to 2.2 (we currently have 1.5); estimated 2-3 days
- once readability is stripped from domain crawler and domain crawler is updated, run small domain crawl
- Alejandro will provide domain URLs for 5 smaller domains
- move plain text extraction to postprocessor (as described above)
- difficulty checking results - large files
- automate splitting CSV output files at 1 million tweets each for error checking
- script that will allow either 1 single output file from the Twitter API, or else break into multiple output files of maximum 1 million tweets (in order to open in Excel)
- URL extender
- look into the extender that John developed and decide how it should be used: should it be added to the postprocessor, or kept as a separate script?
- crawl with all options (geolocation, etc.)
- we are able to crawl with all the public metrics, including geo & withheld
- test Twitter API output with a small file of around 60,000 tweets
- moved plain text extraction to postprocessor:
- looking over results, waiting on RA
- next step: postprocess Twitter API output
- need a list of the JSON output from the crawler, with keys, plus other documentation as mentioned in Padlet
- documentation completed, couldn't edit Padlet - ask Kirsta
- looking at deprecation issue: still using a method from before 1.0
- pre-hook: checks URL and filters it before crawling; has been updated quite a bit, therefore somewhat complicated
- so far no other issues
- new policy for URL extender
- finalize Puppeteer update
- finalize crawl of timelines for KPP/MediaCAT: 60,001+ tweets
- make postprocessor able to read Twitter API output
- Alejandro/RA checking output from postprocessor extraction of plain text
- small domain crawl
- Benchmarking
- finish documenting where different data are on our server
- finding language function
- image_reference function
- dealing with embedded versus cited tweets
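
The output-splitting script mentioned above (either one single output file, or multiple files of at most 1 million tweets each so they open in Excel) could look roughly like the sketch below. It is only an illustration, not the project's actual script: the function names, the `tweets_001.csv` naming scheme, and the field list are assumptions, and it assumes the Twitter API output has already been loaded as a sequence of Python dicts.

```python
import csv
import json
from pathlib import Path
from typing import Iterable, Iterator, List, Optional


def chunked(items: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Yield successive lists of at most `size` items."""
    batch: List[dict] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def _cell(value):
    """Serialize nested values (e.g. public metrics) as JSON text for a CSV cell."""
    if isinstance(value, (dict, list)):
        return json.dumps(value, ensure_ascii=False)
    return value


def write_tweets_csv(
    tweets: Iterable[dict],
    out_dir: str,
    fields: List[str],
    max_rows: Optional[int] = 1_000_000,
    prefix: str = "tweets",  # hypothetical file-name prefix
) -> List[Path]:
    """Write tweets to CSV; max_rows=None means one single output file."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    batches = [tweets] if max_rows is None else chunked(tweets, max_rows)
    written: List[Path] = []
    for index, batch in enumerate(batches, start=1):
        name = f"{prefix}.csv" if max_rows is None else f"{prefix}_{index:03d}.csv"
        target = out_path / name
        with target.open("w", newline="", encoding="utf-8") as handle:
            writer = csv.DictWriter(handle, fieldnames=fields)
            writer.writeheader()
            for tweet in batch:
                # Keep only the requested fields; missing keys become empty cells.
                writer.writerow({key: _cell(tweet.get(key, "")) for key in fields})
        written.append(target)
    return written
```

Calling `write_tweets_csv(tweets, "out", ["id", "created_at", "text"])` would split at the default of 1 million tweets per file, while passing `max_rows=None` would produce a single `tweets.csv`. The 1 million cap stays under Excel's limit of 1,048,576 rows per sheet, including the header row.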