September 23, 2021

alejandropaz edited this page Sep 23, 2021 · 2 revisions

Agenda

  • update on Compute Canada: do both John and Colin understand it? Follow-up questions for Raiyan?
  • update on postprocessor: "John will run the postprocessor, and record the size of data from TWINNT & Domain Crawler, and how long it takes, and size of output file"
  • single-site post-processor?
  • coop hiring
  • finalizing paperwork for Colin?
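
The measurements requested in the postprocessor agenda item (input size, run time, output size) could be recorded with a small wrapper like the one below. This is only a sketch: `dir_size_bytes`, `timed_run`, and the paths passed to them are hypothetical names, not part of the actual postprocessor.

```python
import time
from pathlib import Path


def dir_size_bytes(path):
    """Total size in bytes of a file, or of all regular files under a directory."""
    p = Path(path)
    if p.is_file():
        return p.stat().st_size
    return sum(f.stat().st_size for f in p.rglob("*") if f.is_file())


def timed_run(func, input_paths, output_path):
    """Call func() and report input size, elapsed time, and output size.

    func stands in for whatever starts the postprocessor (an assumption).
    """
    input_bytes = sum(dir_size_bytes(p) for p in input_paths)
    start = time.perf_counter()
    func()
    elapsed = time.perf_counter() - start
    return {
        "input_bytes": input_bytes,
        "elapsed_seconds": elapsed,
        "elapsed_minutes": elapsed / 60,
        "output_bytes": dir_size_bytes(output_path),
    }
```

Calling `timed_run(run_postprocessor, [twinnt_dir, crawler_dir], output_file)` (all three argument names are placeholders) returns one dictionary that can be pasted into the notes.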

Postprocessor

  • output of TWINNT wasn't what the postprocessor was expecting, so John wrote a bridging function
  • memory issue: some crawled JSON files were too big to read, so the results are not complete
  • input: 10,000 small JSONs, of which 8 were skipped, plus all of the TWINNT output
  • total time: 3151 seconds (about 52.5 minutes)
  • output format does seem to match the expected spreadsheet format
  • problem: can't open the largest JSON (3 GB)
  • question: output (regular output) & interest-output (outside of scope):

Compute Canada Issues

  • John followed up with Raiyan, who said that the Chrome browser should be killed when the crawler terminates; not sure why this isn't happening; suggested a reboot (John will test this)
  • there are some new files created in the crawler, but we will focus on the output that

Tasks:

  • kill the running processes of crawler
  • re-run the postprocessor with the full output of both NYT & Twitter
  • what to do with the interest?
  • Alejandro will communicate with Amy about the "output" & "interest-output" distinction
  • Colin will attempt to stream (or whatever it's called) the interest.json
  • if time allows, Colin will attempt a visualization
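
For the streaming task, one possible approach is to parse the file incrementally instead of loading it whole. The sketch below uses only the Python standard library and assumes that interest.json is a single top-level JSON array; that layout has not been confirmed, and `iter_json_array` is a hypothetical helper name. It is lenient about stray commas, and on malformed input it keeps reading until the end of the file before raising an error.

```python
import json
import re

_WS = re.compile(r"[ \t\n\r]*")


def iter_json_array(path, chunk_size=1 << 20):
    """Yield the elements of a top-level JSON array one at a time.

    Only the current chunk plus one partially read element is held in memory.
    """
    decoder = json.JSONDecoder()
    with open(path, "r", encoding="utf-8") as fh:
        buf = fh.read(chunk_size)
        pos = _WS.match(buf).end()
        if buf[pos:pos + 1] != "[":
            raise ValueError("expected a top-level JSON array")
        pos += 1
        while True:
            # Skip whitespace and at most one separating comma.
            pos = _WS.match(buf, pos).end()
            if buf[pos:pos + 1] == ",":
                pos = _WS.match(buf, pos + 1).end()
            if buf[pos:pos + 1] == "]":
                return
            try:
                item, end = decoder.raw_decode(buf, pos)
                # Accept the element only if a delimiter follows it in the
                # buffer; otherwise it may be cut off at the chunk boundary.
                nxt = _WS.match(buf, end).end()
                complete = buf[nxt:nxt + 1] in (",", "]")
            except json.JSONDecodeError:
                complete = False
            if not complete:
                chunk = fh.read(chunk_size)
                if not chunk:
                    raise ValueError("unexpected end of file inside array")
                # Drop what was already consumed, then retry with more data.
                buf = buf[pos:] + chunk
                pos = 0
                continue
            yield item
            pos = end
```

A loop such as `for record in iter_json_array("interest.json"): ...` would then handle one record at a time. If the file turns out to be one large object rather than an array, a third-party incremental parser such as ijson may be a better fit.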