Aug 4, 2023
alejandropaz edited this page Aug 4, 2023 · 2 revisions
- postprocessor
- upload NYT archive crawler with brake as a separate branch and document how it differs from the earlier version - Gy
- speed up small domain crawl a bit - Gy
- do a count of the Israeli domain crawl - Gy
- crawl of NYT "Israel" for the years 2006-2009, and use article filter - Gy
- continue with the postprocessor - Fr
- two problems were giving us trouble: first, an additional header line; second, a copy/paste step in which the last line wasn't fully copied before the postprocessor was run
- figured out the right URL (finally sent by Digital Alliance) and created a new instance
- if the message "all requests have been processed" comes back with few results, the crawler is likely being blocked
- provide a separate IP address if something is flagged
- might be helpful if creating new address
- small domain crawl was separated out, and at first only Jewish Journal wasn't working (the same one that is corrupted), but after five days a couple more stopped working
- the two that stopped working (Jewish Currents & Peter Beinart) were put together and are using a new IP address
- on Arbutus server, the Jewish Journal data are not marked as corrupted
- Israeli domain crawl:
- really slow: 15,000 a week, some with only a few results
- only one with lots of crawler results
- check crawl every 2 days - Gy
- update the MVP, especially with respect to the format of data going into and coming out of the postprocessor, and then as input to the visualization environment - Gy/Fr
- push corrected postprocessor code to master - Gy/Fr
- postprocessor: write instructions documenting the order of utilities and the steps for using the postprocessor - Gy
- backburner: figure out corruption in small domain crawl
- develop script and documentation to remove extra header lines from Twitter crawl output prior to postprocessing - Fr
- check whether the URL extender is the most up-to-date version - Fr
- run URL extender on test Twitter crawl output (~23,000) and run postprocessor on the resulting output - Fr
- check results of postprocessor on test data - Al
- if results work, run URL extender on the full Twitter crawl output (Fox News and Washington Post, keeping them separate) and postprocess - Fr
- check if new IP address created with new instance - Gy
- pause Israeli domain crawl while testing other crawl technique - Gy
- set up individual crawls for Israeli domains to test crawl technique, and check regularly to see if multiple errors have triggered the brake - Gy
- if new IP address is created with new instance, try NYT archive crawl - Gy
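
The two postprocessor input problems noted above (an extra header line, and a last line that was not fully copied) could be checked for before postprocessing along the following lines. This is only a sketch, not the project's script: it assumes the crawl output is CSV text whose first row is the header and whose extra header lines are exact repeats of that row. The function names `strip_repeated_headers` and `last_row_is_complete` are hypothetical.

```python
import csv
import io
from typing import Tuple


def strip_repeated_headers(text: str) -> Tuple[str, int]:
    """Return the CSV text without repeated header rows, plus how many were removed.

    Assumption: the first row is the header, and any later row that is
    identical to it is an unwanted extra header line.
    """
    rows = list(csv.reader(io.StringIO(text, newline="")))
    if not rows:
        return "", 0
    header = rows[0]
    kept = [header]
    removed = 0
    for row in rows[1:]:
        if row == header:
            removed += 1  # skip the extra header line
            continue
        kept.append(row)
    out = io.StringIO(newline="")
    csv.writer(out, lineterminator="\n").writerows(kept)
    return out.getvalue(), removed


def last_row_is_complete(text: str) -> bool:
    """Heuristic check for a truncated copy/paste.

    The text is treated as complete if it ends with a newline and its
    last row has the same number of fields as the header row.
    """
    if not text.endswith("\n"):
        return False
    rows = list(csv.reader(io.StringIO(text, newline="")))
    return bool(rows) and len(rows[-1]) == len(rows[0])


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = "url,title\nhttp://a,A\nurl,title\nhttp://b,B\n"
    cleaned, removed = strip_repeated_headers(sample)
    print(f"removed {removed} extra header line(s)")
    print("last row complete:", last_row_is_complete(cleaned))
```

Running the completeness check first and stopping if it fails would avoid feeding a partially copied file to the postprocessor.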