October 13, 2020
- Ticket review in project
- Post-processing
- New project in GitHub to track issues: MediaCat Refactor 2020
- Twitter crawler - Danhua
  - the owner of the getoldtwitter Python library has updated it for the new Twitter rules; we will hold a session to learn how to use it for our project
- Researching date recognition - Amy & Jacqueline
  - two pathways:
    - estimating the date from Google indexing would require the Google API, which is limited to 100 searches per day
    - the Python DateGuesser library to retrieve dates; also evaluating JavaScript libraries, which appear to be better maintained
  - Jacqueline is writing tests against existing date-retrieval libraries to measure what proportion of dates are captured. Results are best on the bigger sites that embed the date in the URL; multi-language sites are harder.
  - where does the date-extraction code belong?
    - if written in JavaScript, it can be part of the crawler
    - if written in Python, it will be a separate tool
  - Jacqueline & Raiyan will decide which date-capture method to use by next week
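The "dates in the URL" observation above can be sketched as a small extractor. This is an illustrative assumption, not DateGuesser or the project's actual code; the regex and function name are made up for the example.

```python
import re
from datetime import datetime

# Illustrative sketch (not MediaCat's actual code): many larger news sites
# embed the publication date in the article URL, e.g. /2020/10/13/, which
# is why URL-based extraction captures the most dates on those sites.
DATE_IN_URL = re.compile(r"/(\d{4})/(\d{1,2})/(\d{1,2})(?:/|$)")

def date_from_url(url):
    """Return a datetime parsed from the URL path, or None if absent."""
    match = DATE_IN_URL.search(url)
    if match is None:
        return None
    year, month, day = (int(g) for g in match.groups())
    try:
        return datetime(year, month, day)
    except ValueError:  # e.g. /2020/99/99/ is not a real date
        return None

print(date_from_url("https://www.aljazeera.com/news/2020/10/13/example-story"))
```

A library-based test harness would run this alongside DateGuesser and the JavaScript candidates over the same URL set and compare capture rates.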
- MediaCat Domain Crawler - Raiyan & Alex
  - Alex's filter function checks that each URL is in scope and is not the domain URL, and removes repeated URLs
  - a crawl was run (using the filter function) on two domains: the IDF and Al Jazeera
    - it successfully retrieves the text content, title, and HTML content of articles
    - it went 5 articles deep and ended up on an Al Jazeera homepage (homepage URLs are not always an exact match to the default domain URL)
    - ignored URLs that fell outside the domain were collected, grouped by domain name
  - the pseudo-URL definition determines which links qualify to be crawled - e.g. aljazeera.com/news as the pseudo-URL will not retrieve aljazeera.com
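The three filter checks and the pseudo-URL scoping rule can be sketched together as a prefix match. This is a minimal sketch under assumed names; it is not the crawler's actual filter function (which is part of the JavaScript crawler), just an illustration of the described behavior.

```python
from urllib.parse import urlparse

def make_filter(pseudo_url):
    """Illustrative sketch (not MediaCat's actual code) of the three
    filter checks: the URL must match the pseudo-URL prefix (in scope),
    must not be the bare domain URL, and must not be a repeat."""
    parts = urlparse(pseudo_url)
    domain = "{}://{}/".format(parts.scheme, parts.netloc)
    prefix = pseudo_url.rstrip("/") + "/"
    seen = set()

    def keep(url):
        normalized = url.rstrip("/") + "/"
        if not normalized.startswith(prefix):
            # out of scope: e.g. aljazeera.com is not retrieved when the
            # pseudo-URL is aljazeera.com/news
            return False
        if normalized == domain:
            return False  # the domain URL itself
        if normalized in seen:
            return False  # repeat URL
        seen.add(normalized)
        return True

    return keep

keep = make_filter("https://www.aljazeera.com/news")
print(keep("https://www.aljazeera.com/news/2020/10/13/story"))  # in scope
print(keep("https://www.aljazeera.com/"))                       # out of scope
print(keep("https://www.aljazeera.com/news/2020/10/13/story"))  # repeat
```

The homepage issue noted above shows why a plain prefix match is fragile: a homepage reached via a slightly different URL will not be equal to the stored domain string.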
- Postprocessor
  - new issue created to address postprocessor development
- Scope parser demo - Danhua
  - created validation functions (checks for a valid URL, a valid Twitter handle, the type of source, and a valid CSV)
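The four checks could look roughly like the sketch below. The function names, the handle rule, and the expected column count are assumptions for illustration, not the scope parser's actual API.

```python
import csv
import io
import re
from urllib.parse import urlparse

# Illustrative sketches of the four scope-parser checks; names and
# validation rules here are assumptions, not MediaCat's actual functions.
TWITTER_HANDLE = re.compile(r"^@\w{1,15}$")  # Twitter caps handles at 15 chars

def is_valid_url(value):
    parts = urlparse(value)
    return parts.scheme in ("http", "https") and bool(parts.netloc)

def is_valid_twitter_handle(value):
    return TWITTER_HANDLE.match(value) is not None

def source_type(value):
    """Classify a scope entry as a 'twitter' or 'domain' source."""
    if is_valid_twitter_handle(value):
        return "twitter"
    if is_valid_url(value):
        return "domain"
    return "unknown"

def is_valid_csv(text, expected_columns=2):
    """Every row must parse and have the expected number of columns."""
    rows = list(csv.reader(io.StringIO(text)))
    return bool(rows) and all(len(row) == expected_columns for row in rows)
```

Keeping each check as its own small function makes it easy to report exactly which entry of a scope CSV failed and why.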