
November 25, 2021


Agenda

  • John:

    • utility to add relevant URLs from scope to "found_url"
      • question: would we want a whole list of URLs for each JSON file/node (article) or each row of CSV (tweets)? Shouldn't the post-processor decide which are relevant and which are not?
    • issue with postprocessor with distinct scope from crawler: from John's email: " especially where the cross-matching and output creation is done and I noticed that the ‘found_urls’ are never cross-referenced with the scope which is why the tweet in question is not included in referring ids in output.json. This is confirmed by the fact that all the tweets in the Haaretz CSV you sent to me earlier explicitly included “Haaretz” or “Ha’aretz” in the tweet text; since text aliases are matched, these tweets were included in output.json. Referrals coming via ‘found_urls’ in the tweets are only counted if the url was part of the crawled domain data."
      • example in email
    • document the twitter crawler feature of expanding short urls, and push to master
    • update on the al-monitor crawl on Monday
    • do some research on metascraper, looking into what Jacqueline had come up with and seeing if there is some other way
  • Colin:

    • generate csv with information from revised NYT twitter data as discussed (ie with expanded short url)
    • generate new stacked area chart (esp with rows in orange in file 2021 KPP-MediaCAT Scope Source Sites)
    • if time allows: look at python crawler
  • Alejandro: scope document: up-to-date scope without problematic aliases (NRG, Globes, The Marker, etc)

    • see MVP: "The current updated scope is entitled “MediaCAT Updated Complete Scope” (ask for permission to access). It is the scope that should be used when running projects for Alejandro Paz’s research. Before running, the scope needs to be formatted for MediaCAT; the relevant columns are A, B, D, P, Q, and R. The following columns can be converted into one column with pipes (|) between the terms: E, F, G, X, Y."
    • question about how to see the results?

Questions from looking at results:

  • The following tweet was retweeted by another NYT journalist (within our crawl scope), @halbfinger: https://twitter.com/IKershner/status/1278240798165872645
    • However, we don’t see a result for @halbfinger.
    • Question: maybe retweets aren’t counted in result, only quote tweets? This actually makes sense.

Meeting Notes

  • John:

The Crawler and the issue of the missing Found URLs

  • We want the crawlers (both the Twitter and domain crawler) to add all found URLs to the 'found_urls' key, and the post-processor will then organize them.
  • The script is complete and running; it runs over the data after the crawl. It was a bit more complicated than expected. John checked it against lost items and it seems to be working. After running for a while it slows down, so it needs to be restarted periodically (another script does this automatically). Most JSONs are done, but the last 300 are taking too long, so John needs to look into this.
  • The script will be pushed to the MediaCAT back end (post-processor) repository so it runs as a step prior to the post-processor.
  • It will live in the utilities folder of the post-processor and be run on crawler data prior to post-processing (a sketch of such a utility follows this list).
  • Possible future improvement - manage concurrent running of crawler and this utility script.
  • Remaining task for John: What's causing the slow-down in the processing.
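A minimal sketch of what this utility might look like, assuming one JSON file per crawled article with the body under a "text" key and links collected into a "found_urls" list; the key names, directory path, and file layout are illustrative assumptions, not the actual MediaCAT schema.

```python
import json
import re
from pathlib import Path

# Assumed layout: one JSON file per crawled article, with the article body
# under a "text" key. Key names and paths are illustrative only.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>)]+")

def populate_found_urls(crawl_dir: str) -> None:
    """Add every URL found in the article text to a 'found_urls' list."""
    for path in Path(crawl_dir).glob("*.json"):
        with open(path, encoding="utf-8") as f:
            record = json.load(f)
        urls = URL_PATTERN.findall(record.get("text", ""))
        # Keep all URLs; deciding which are in scope is left to the post-processor.
        record["found_urls"] = sorted(set(record.get("found_urls", [])) | set(urls))
        with open(path, "w", encoding="utf-8") as f:
            json.dump(record, f, ensure_ascii=False, indent=2)

if __name__ == "__main__":
    populate_found_urls("./crawl_output")
```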

The Twitter Output and the issue of the missing Domains

There is currently no way to associate Twitter links with domains, so we cannot identify what is in scope. We need to determine how to go from a listed URL to its domain so that these referrals aren't missed.
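One possible approach, sketched below, is to reduce each listed URL to its domain with urllib.parse and compare it against the domains in scope; the scope format shown here (a plain set of domains) is an assumption for illustration.

```python
from urllib.parse import urlparse

def url_to_domain(url: str) -> str:
    """Reduce a full URL to its domain for scope matching."""
    netloc = urlparse(url).netloc.lower()
    # Strip a leading "www." so www.haaretz.com and haaretz.com compare equal.
    return netloc[4:] if netloc.startswith("www.") else netloc

def in_scope(url: str, scope_domains: set[str]) -> bool:
    domain = url_to_domain(url)
    # Match the exact domain or any subdomain of a scope entry.
    return any(domain == d or domain.endswith("." + d) for d in scope_domains)

# Example with illustrative scope entries (not the real scope file):
scope = {"haaretz.com", "al-monitor.com", "nytimes.com"}
print(in_scope("https://www.haaretz.com/israel-news/some-article", scope))  # True
```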

Unfurling or expanding Twitter crawls

Document the Twitter crawler feature of expanding short URLs and push to master. The post-crawl script has been pushed, but not the modification to the crawler that would do this at runtime. There is an issue with the Twint module blocking progress, so we need to look at this. See issue: https://github.com/twintproject/twint/issues/1295#issuecomment-976092425
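A rough sketch of how the short-URL expansion could be done outside of Twint, by following redirects with the requests library; the example short link is hypothetical.

```python
import requests

def expand_short_url(url: str, timeout: float = 10.0) -> str:
    """Follow redirects to resolve a shortened URL (e.g. a t.co link) to its target."""
    try:
        # HEAD keeps the request light; fall back to GET if the host rejects HEAD.
        resp = requests.head(url, allow_redirects=True, timeout=timeout)
        if resp.status_code >= 400:
            resp = requests.get(url, allow_redirects=True, timeout=timeout, stream=True)
        return resp.url
    except requests.RequestException:
        # Leave the URL untouched if it cannot be resolved.
        return url

print(expand_short_url("https://t.co/example"))  # hypothetical short link
```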

Update on the al-monitor crawl on Monday

John has tried running it a couple of times. The crawler reports that it is still crawling, but there are 100,000 files and this number hasn't changed. John will kill the process and we need to review the data.
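A small sketch of the kind of duplicate check that could be run on the al-monitor output before post-processing, assuming each crawled record is a JSON file with a "url" key (an illustrative assumption).

```python
import json
from collections import Counter
from pathlib import Path

def count_duplicate_urls(crawl_dir: str) -> Counter:
    """Count how many crawled JSON records share the same article URL."""
    counts = Counter()
    for path in Path(crawl_dir).glob("*.json"):
        with open(path, encoding="utf-8") as f:
            record = json.load(f)
        url = record.get("url")
        if url:
            counts[url] += 1
    return counts

dupes = {u: n for u, n in count_duplicate_urls("./al-monitor_crawl").items() if n > 1}
print(f"{len(dupes)} URLs appear more than once")
```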

Metascraper

  • Raiyan suggested a rework of the script without the database, which John did; it recovered dates for 7 of the 10 files tried. He will try it against a slightly larger sample and report back. Colin also used a Python library that makes a similar call: https://github.com/adbar/htmldate. Colin's script is in the back-end utils folder and is not fully complete.
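For reference, a minimal sketch of how htmldate can be used to pull a publication date; the sample URL is illustrative rather than one of the actual test files.

```python
from htmldate import find_date  # pip install htmldate

# Hypothetical article URL; the real test set is the 10 NYT JSON files
# mentioned in the action items.
sample_urls = [
    "https://www.nytimes.com/2020/07/01/world/middleeast/example-article.html",
]

for url in sample_urls:
    # find_date returns an ISO date string (e.g. "2020-07-01") or None.
    print(url, "->", find_date(url))
```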

  • Colin:

Colin created the revised .csv and new stacked area chart.
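A sketch of how the stacked area chart could be produced with pandas and matplotlib, assuming a CSV with one row per date and one column of referral counts per source site; the file name and column layout are assumptions.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumed CSV layout: a "date" column plus one count column per source site.
df = pd.read_csv("nyt_twitter_referrals.csv", parse_dates=["date"], index_col="date")

plt.stackplot(df.index, df.T.values, labels=df.columns)
plt.legend(loc="upper left")
plt.ylabel("Referrals")
plt.title("Referrals by source site over time")
plt.tight_layout()
plt.savefig("stacked_area.png")
```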

Python Crawler

We could not initially find the Python crawler (it had been moved to deprecated). We found it here: https://github.com/UTMediaCAT/Voyage/blob/master-conversion/src/Crawler.py - be careful and ignore all the front-end code. It uses this library: https://newspaper.readthedocs.io/en/latest/

It may be worthwhile to utilize this crawler IF there is good coverage in the library for the domains in the scope.
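For context, a minimal example of how the underlying newspaper library fetches and parses a single article, which is the kind of per-domain coverage check the action items call for; the URL is illustrative.

```python
from newspaper import Article  # pip install newspaper3k

# Quick coverage check for one scope domain: can the library download and
# parse an article and recover title, date, and body text?
url = "https://www.haaretz.com/israel-news/some-article"

article = Article(url)
article.download()
article.parse()

print(article.title)
print(article.publish_date)
print(len(article.text), "characters of body text")
```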

Alejandro's scope document

  • Alejandro: scope document: up-to-date scope without problematic aliases (NRG, Globes, The Marker, etc)
    • see MVP: "The current updated scope is entitled “MediaCAT Updated Complete Scope” (ask for permission to access). It is the scope that should be used when running projects for Alejandro Paz’s research. Before running, the scope needs to be formatted for MediaCAT; the relevant columns are A, B, D, P, Q, and R. The following columns can be converted into one column with pipes (|) between the terms: E, F, G, X, Y." (a sketch of this formatting step follows this list)
    • question about how to see the results?
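A sketch of the formatting step described in the MVP quote above, assuming the scope spreadsheet is exported to CSV and that columns are addressed by spreadsheet position (A = 0, B = 1, ...); the file names are assumptions.

```python
import pandas as pd
from string import ascii_uppercase

# Map spreadsheet letters to positional indexes (A = 0, B = 1, ...).
col = {letter: i for i, letter in enumerate(ascii_uppercase)}

scope = pd.read_csv("MediaCAT_Updated_Complete_Scope.csv")

keep_idx = [col[c] for c in ["A", "B", "D", "P", "Q", "R"]]   # columns used as-is
merge_idx = [col[c] for c in ["E", "F", "G", "X", "Y"]]       # combined into one field

def join_aliases(row):
    """Join the alias columns into one pipe-separated string, skipping blanks."""
    terms = [str(row.iloc[i]).strip() for i in merge_idx
             if pd.notna(row.iloc[i]) and str(row.iloc[i]).strip()]
    return "|".join(terms)

formatted = scope.iloc[:, keep_idx].copy()
formatted["Aliases"] = scope.apply(join_aliases, axis=1)
formatted.to_csv("scope_formatted.csv", index=False)
```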

Questions from looking at results:

  • The following tweet was retweeted by another NYT journalist (within our crawl scope), @halbfinger: https://twitter.com/IKershner/status/1278240798165872645
    • However, we don’t see a result for @halbfinger.
    • Question: maybe retweets aren’t counted in result, only quote tweets? This actually makes sense.

Action Items

  • John to work on issue of slow processing for Found URLs (utility script populating JSON)
  • John will look into getting the domain from each link in the Twitter crawler output to resolve the issue of missing domains
  • John to run some scripts on the output from the al-monitor crawl after stopping the domain crawl to see if there are duplicates. If there are no duplicates, run the data through the URL-finding utility script and then start the post-processor on it.
  • John to run the new, database-free metascraper script against 10 JSON files from the NYT crawl that have the date in the URL, and report back on the success rate. Share these JSON files with Colin, who will run htmldate against the same sample so we can compare success rates.
  • John to delete old metascraper and commit new metascraper to utils folder for the post-processor.
  • Colin to document and commit his .csv processing script to a utils folder in the Front-End repository.
  • Colin to look again at the work involved in running this Python crawler: https://github.com/UTMediaCAT/Voyage/blob/master-conversion/src/Crawler.py, particularly the underlying library: what would it take to refactor or redevelop this crawler, and what benefits would it bring to the project? One key piece is understanding the domains supported by the library and how many of them appear in Alejandro's scope. See documentation here: https://newspaper.readthedocs.io/en/latest/