October 13, 2020
- Ticket review in project
- Post-processing
- New project in GitHub to track issues: MediaCat Refactor 2020
- Twitter crawler - Danhua
  - the owner of the getoldtwitter Python library has updated it for the new Twitter rules; we will hold a session to learn how to use it for our project
- Researching date recognition - Amy & Jacqueline
  - two pathways:
    - estimating the date from Google indexing would require the Google API, which is limited to 100 searches per day
    - the Python DateGuesser library to retrieve dates; also evaluating JavaScript libraries, which appear to be better maintained
  - Jacqueline is writing tests against existing date-retrieval libraries to measure what proportion of dates are captured. Results are best on the bigger sites that embed the date in the URL; multi-language sites are harder.
  - where does the date-extraction code belong?
    - if written in JavaScript, it can be part of the crawler
    - if written in Python, it will be a separate tool
  - Jacqueline & Raiyan will decide which date-capture method to use by next week
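The "dates in the URL" observation above can be sketched as a small extractor. This is an illustrative assumption, not DateGuesser or the project's actual code; the regex and function name are made up for the example.

```python
import re
from datetime import datetime

# Illustrative sketch (not MediaCat's actual code): many larger news sites
# embed the publication date in the article URL, e.g. /2020/10/13/, which
# is why URL-based extraction captures the most dates on those sites.
DATE_IN_URL = re.compile(r"/(\d{4})/(\d{1,2})/(\d{1,2})(?:/|$)")

def date_from_url(url):
    """Return a datetime parsed from the URL path, or None if absent."""
    match = DATE_IN_URL.search(url)
    if match is None:
        return None
    year, month, day = (int(g) for g in match.groups())
    try:
        return datetime(year, month, day)
    except ValueError:  # e.g. /2020/99/99/ is not a real date
        return None

print(date_from_url("https://www.aljazeera.com/news/2020/10/13/example-story"))
```

A library-based test harness would run this alongside DateGuesser and the JavaScript candidates over the same URL set and compare capture rates.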
- MediaCat Domain Crawler - Raiyan & Alex
  - Alex's filter function checks that each URL is in scope and is not the domain URL, and removes repeated URLs
  - a crawl was run (using the filter function) on two domains: the IDF and Al Jazeera
    - it successfully retrieves the text content, title, and HTML content of articles
    - it went 5 articles deep and ended up on an Al Jazeera homepage (homepage URLs are not always an exact match to the default domain URL)
    - ignored URLs that fell outside the domain were collected, grouped by domain name
  - the pseudo-URL definition determines which links qualify to be crawled - e.g. aljazeera.com/news as the pseudo-URL will not retrieve aljazeera.com
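The three filter checks and the pseudo-URL scoping rule can be sketched together as a prefix match. This is a minimal sketch under assumed names; it is not the crawler's actual filter function (which is part of the JavaScript crawler), just an illustration of the described behavior.

```python
from urllib.parse import urlparse

def make_filter(pseudo_url):
    """Illustrative sketch (not MediaCat's actual code) of the three
    filter checks: the URL must match the pseudo-URL prefix (in scope),
    must not be the bare domain URL, and must not be a repeat."""
    parts = urlparse(pseudo_url)
    domain = "{}://{}/".format(parts.scheme, parts.netloc)
    prefix = pseudo_url.rstrip("/") + "/"
    seen = set()

    def keep(url):
        normalized = url.rstrip("/") + "/"
        if not normalized.startswith(prefix):
            # out of scope: e.g. aljazeera.com is not retrieved when the
            # pseudo-URL is aljazeera.com/news
            return False
        if normalized == domain:
            return False  # the domain URL itself
        if normalized in seen:
            return False  # repeat URL
        seen.add(normalized)
        return True

    return keep

keep = make_filter("https://www.aljazeera.com/news")
print(keep("https://www.aljazeera.com/news/2020/10/13/story"))  # in scope
print(keep("https://www.aljazeera.com/"))                       # out of scope
print(keep("https://www.aljazeera.com/news/2020/10/13/story"))  # repeat
```

The homepage issue noted above shows why a plain prefix match is fragile: a homepage reached via a slightly different URL will not be equal to the stored domain string.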
- Postprocessor
  - new issue created to address postprocessor development
- Scope parser demo - Danhua
  - created validation functions (checks for a valid URL, a valid Twitter handle, the type of source, and a valid CSV)
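The four checks could look roughly like the sketch below. The function names, the handle rule, and the expected column count are assumptions for illustration, not the scope parser's actual API.

```python
import csv
import io
import re
from urllib.parse import urlparse

# Illustrative sketches of the four scope-parser checks; names and
# validation rules here are assumptions, not MediaCat's actual functions.
TWITTER_HANDLE = re.compile(r"^@\w{1,15}$")  # Twitter caps handles at 15 chars

def is_valid_url(value):
    parts = urlparse(value)
    return parts.scheme in ("http", "https") and bool(parts.netloc)

def is_valid_twitter_handle(value):
    return TWITTER_HANDLE.match(value) is not None

def source_type(value):
    """Classify a scope entry as a 'twitter' or 'domain' source."""
    if is_valid_twitter_handle(value):
        return "twitter"
    if is_valid_url(value):
        return "domain"
    return "unknown"

def is_valid_csv(text, expected_columns=2):
    """Every row must parse and have the expected number of columns."""
    rows = list(csv.reader(io.StringIO(text)))
    return bool(rows) and all(len(row) == expected_columns for row in rows)
```

Keeping each check as its own small function makes it easy to report exactly which entry of a scope CSV failed and why.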