
January 20, 2022

alejandropaz edited this page Jan 20, 2022 · 7 revisions

Agenda:

  • move to issues to organize tasks
  • make Compute Canada resource map private?
  • finalize any loose ends on Compute Canada updates
  • benchmarking
  • look at scope text file
  • Colin's suggestions about making spreadsheets
  • Twitter crawler estimate
  • If time: small site crawl? needs post-processing?

Compute Canada:

  • Graham Cloud is running the latest OS, as far as Shengsong can tell.
  • prepare email to CC cloud IT asking whether any action is needed to update the OS on the Graham and Arbutus instances (seems like we're up to date)
  • figure out whether SSH keys will be affected by a change to my CC password

Benchmarking

  • problem with storage: Nat suggested a workaround using soft links (symlinks)
  • Colin will move some of the folders
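
A minimal sketch of the soft-linking workaround, assuming the idea is to move a large folder to a volume with more space and leave a symlink at the old path so existing scripts keep working. The function name and the folder names in the usage note are hypothetical, not taken from the project.

```python
import os
import shutil


def relocate_with_symlink(src, dest_root):
    """Move the folder `src` under `dest_root` and leave a symlink at the old path.

    Hypothetical helper: the paths passed in are placeholders for whatever
    folders are actually being moved.
    """
    src = os.path.abspath(src)
    if os.path.islink(src):
        raise ValueError(f"{src} is already a symlink")
    dest = os.path.join(os.path.abspath(dest_root), os.path.basename(src))
    if os.path.exists(dest):
        raise FileExistsError(dest)
    shutil.move(src, dest)  # physically move the data to the new volume
    os.symlink(dest, src, target_is_directory=True)  # old path now points to it
    return dest
```

For example, `relocate_with_symlink("crawl_output", "/some/larger/volume")` would move the folder and keep `crawl_output` usable as a path. Both names here are placeholders.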

Twitter crawler:

  • comment on TWINT #1295 that the fix from early December isn't working
  • research new Twitter crawlers:
    • question 1: how long would it take to integrate a new Twitter crawler?
      • is the current Twitter config modular enough to easily swap in a new (Python?) crawler?
    • question 2: would a JavaScript crawler that simulates a human user work better? (Is Apify a JavaScript crawler?)
    • others that Danhua considered: Twarc & GetOldTweets (see here, scroll down)
    • also look at Apify & Twitter API
      • does the Twitter API have a cost associated?
    • are there other new Twitter crawlers out there?
  • currently, it seems that none of these non-API twitter crawlers are working
  • academic research account:
    • archive search limit is 500 per request, see here
    • developer TOS here
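
For the crawler estimate, a rough sketch of how the 500-per-request archive search limit translates into a request count. The 500 figure is from the notes above; the seconds-per-request value is a hypothetical parameter that would have to come from the actual rate limits, which are not modeled here.

```python
def requests_needed(total_tweets, per_request=500):
    """Number of paginated requests needed to fetch `total_tweets` results."""
    if total_tweets < 0 or per_request <= 0:
        raise ValueError("total_tweets must be >= 0 and per_request > 0")
    return -(-total_tweets // per_request)  # ceiling division


def estimated_minutes(total_tweets, seconds_per_request, per_request=500):
    """Rough wall-clock estimate; `seconds_per_request` is an assumed pacing."""
    return requests_needed(total_tweets, per_request) * seconds_per_request / 60
```

Under these assumptions, 1,200 tweets would need 3 requests.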

Producing Spreadsheets

  • the spreadsheet file is much smaller without plain text
  • we're not sure whether the plain text error is produced by crawler or postprocessor
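
One way to narrow down where the plain text error comes from is to compare the plain text in the crawler's JSON with the plain text in the postprocessed spreadsheet. The sketch below assumes the JSON is a list of records and the spreadsheet is exported as CSV. The field names `url` and `plain_text` are hypothetical and would need to match the real output. If the text is already wrong in the JSON, the crawler is the likely source; if the JSON looks right but the spreadsheet differs, the postprocessor is.

```python
import csv
import io
import json


def load_json_records(path):
    """Load crawler output, assumed here to be a JSON list of dicts."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def load_csv_rows(path):
    """Load the postprocessed spreadsheet, assumed exported as CSV."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def find_plain_text_mismatches(json_records, csv_rows, key="url", field="plain_text"):
    """Return (key, reason) pairs where the spreadsheet disagrees with the JSON."""
    csv_by_key = {row[key]: row.get(field) or "" for row in csv_rows}
    mismatches = []
    for record in json_records:
        k = record[key]
        raw = record.get(field) or ""
        if k not in csv_by_key:
            mismatches.append((k, "missing from spreadsheet"))
        elif csv_by_key[k] != raw:
            mismatches.append((k, "text differs"))
    return mismatches
```

An empty result would suggest the postprocessor passes plain text through unchanged, which points back at the crawler.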

Action Items:

  • Shengsong: will move Compute Canada resources map to another space
  • Alejandro: send email to Compute Canada
  • Alejandro: look at scope text file
  • Alejandro: update server notes in new google doc & change password for Graham
  • Colin: delete SSH keys from old developers
  • Shengsong: set up firewall for our instance, link here
    • make sure to enable all ports that we are using, for example, for Jupyter & SSH
    • Nat recommends using UFW
  • Colin: move folders as agreed
  • Shengsong: will try to restart the benchmarking once the reorganization and soft-linking are done
  • Colin: will check whether the plain text in the postprocessed spreadsheet matches the plain text in the JSON, to determine whether the error is produced by the postprocessor