November 24, 2020
- Progress of the crawl
- Domain-names-to-crawler function needs to be pushed and tested
- Crawler is only bringing back homepage
- Date crawler needs to be rewritten
- Demo of post processor
- OS updates
- Progress on visualizations
- Ticket Review
- Twitter crawl was completed, full historical data collected! This took ~1 week
  - the alternate version that uses the Twitter API will be polished and provided publicly as part of MediaCAT
- Crawler only bringing back homepage and exiting
  - likely due to the link-crawling limit being set at 20; the limit is reached before any articles are actually crawled
  - defining this limit: since the JSON is currently output only after the crawl completes, how do we define limits? or do we write to individual JSONs while the crawl is being conducted?
    - do we want to use databases then (e.g. MongoDB)?
  - the asynchronous nature of the crawls, and making the crawl "infinite", means we may be writing to JSONs while reading from them; this is a reason to consider using databases
  - currently there is one JSON file output for everything, which will create tracking issues in the future; this is a reason to change to one JSON for each link (see the sketch after this list)
    - JSONs can be written to a directory as the crawl is happening
  - individual JSONs need to be created, but a database is not a priority at this point
  - this will affect the crawl output, as well as the postprocessing and the date crawler
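A minimal sketch of what one-JSON-per-link output could look like, in Python; the directory name, field names, and function name are placeholders rather than the crawler's actual schema:

```python
import json
import uuid
from pathlib import Path

OUTPUT_DIR = Path("crawl_output")  # placeholder output directory
OUTPUT_DIR.mkdir(exist_ok=True)

def write_link_json(url, html, found_links):
    """Write one JSON file per crawled link, named by a fresh UUID."""
    record_id = str(uuid.uuid4())
    record = {
        "id": record_id,
        "url": url,
        "html": html,
        "found_links": found_links,
    }
    # One file per link means the asynchronous crawl never has to rewrite
    # a single growing JSON that the postprocessor may be reading.
    with open(OUTPUT_DIR / f"{record_id}.json", "w") as f:
        json.dump(record, f)
    return record_id
```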
- Postprocessor update
  - the matching between domain and Twitter data is working
  - right now the code could read all the individual JSONs at once
  - next: incorporating updates at a defined interval (e.g. every 3 hours) on the number of URLs per domain that have been crawled in a given timeframe (a rough sketch follows this list)
  - to be re-run (manually triggered) at different intervals to create additional linkages
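A rough sketch of how the postprocessor might tally crawled URLs per domain from a directory of individual JSONs, assuming the placeholder per-link schema sketched earlier:

```python
import json
from collections import Counter
from pathlib import Path
from urllib.parse import urlparse

def count_urls_per_domain(json_dir="crawl_output"):
    """Read every per-link JSON in the directory and count crawled URLs by domain."""
    counts = Counter()
    for path in Path(json_dir).glob("*.json"):
        with open(path) as f:
            record = json.load(f)
        counts[urlparse(record["url"]).netloc] += 1
    return counts
```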
- Ticket review
  - Add a function to get the date from the command line and validate the date arguments given (a possible sketch follows this list)
    - this was implemented to verify the crawler dates
  - Scope fix merged
  - Add pre_processor for twitter output - to be reviewed
  - Upgrade OS merged
  - Scope parser validation - error checking needs to be added to the initial validation to ensure URLs begin with http:// or https:// (see below)
  - Accepting a .csv file from the parser to populate the initial queue - in progress
  - MediaCat Domain Crawler to-dos
    - make multiple individual JSON files
    - add UUIDs
    - provide a sample of individual JSONs to Amy for testing
  - Create post-processor framework - in progress and will need testing
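A possible shape for the command-line date validation mentioned above, using Python's argparse; the flag names and the YYYY-MM-DD format are assumptions, not the implemented interface:

```python
import argparse
from datetime import datetime

def valid_date(s):
    """Parse a YYYY-MM-DD argument, failing with a clear argparse error."""
    try:
        return datetime.strptime(s, "%Y-%m-%d").date()
    except ValueError:
        raise argparse.ArgumentTypeError(f"invalid date: {s!r} (expected YYYY-MM-DD)")

parser = argparse.ArgumentParser()
parser.add_argument("--start-date", type=valid_date, required=True)
parser.add_argument("--end-date", type=valid_date, required=True)
args = parser.parse_args()
if args.start_date > args.end_date:
    parser.error("start date must not be after end date")
```

The scope-parser scheme check could be as small as the following; the function name is hypothetical:

```python
def has_valid_scheme(url):
    """Reject scope entries that do not start with http:// or https://."""
    return url.startswith(("http://", "https://"))
```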
- New tickets
  - task added to MediaCat Domain Crawler: write crawler progress (domain and number of URLs crawled) to a CSV, e.g. every 3 hours (sketched below)
  - new repository with API-based code, for the Twitter-API-supported version of the twitter crawler
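One way the periodic progress write could work, sketched in Python; the CSV path, column layout, and the 3-hour default are illustrative only:

```python
import csv
import time
from datetime import datetime

def append_progress(counts, csv_path="crawl_progress.csv"):
    """Append a timestamped (domain, urls_crawled) row for each domain."""
    with open(csv_path, "a", newline="") as f:
        writer = csv.writer(f)
        for domain, n in sorted(counts.items()):
            writer.writerow([datetime.now().isoformat(), domain, n])

def report_periodically(get_counts, still_running, interval_seconds=3 * 60 * 60):
    """Write a progress snapshot on a fixed interval while the crawl runs.

    get_counts and still_running are caller-supplied callables, e.g. the
    count_urls_per_domain helper sketched earlier and a check on the crawl's state.
    """
    while still_running():
        append_progress(get_counts())
        time.sleep(interval_seconds)
```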
- Resources
  - Raiyan's helpful link on webscraping & blocks