November 03, 2020
- Ticket Review
- What should we do with all this extra Twitter Data?
- Beginning the Crawl on SciNet!
- We are delayed on finalizing the scope - deadline is the end of this week
- Twitter crawler code
- Additional columns to be incorporated in the output, e.g. date, hashtag, language, mentions of other Twitter users, retweets_count, likes_count
- the mentions and URLs have to be understood as links and references
- we need a way to handle errors and track the crawler's progress - right now, when an error happens, the thread just exits
- in old Mediacat, even if a link was faulty or raised an exception, they found a way to bypass it so the crawl could still continue
- an exit signal may also trigger the domain crawler exiting - we need a way of keeping track of which handle was crawled (see the sketch after this list)
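
A minimal sketch of this kind of error handling, assuming a Python crawler; `crawl_all`, `crawl_handle`, and the progress file name are hypothetical, not the project's actual code:

```python
import json
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("twitter_crawler")

def crawl_all(handles, crawl_handle):
    """Crawl every handle, recording successes and failures instead of exiting."""
    progress = {"completed": [], "failed": []}
    for i, handle in enumerate(handles, start=1):
        try:
            crawl_handle(handle)                      # may raise on a faulty link
            progress["completed"].append(handle)
            log.info("crawled %s (%d/%d)", handle, i, len(handles))
        except Exception as exc:                      # bypass the error, keep crawling
            progress["failed"].append({"handle": handle, "error": str(exc)})
            log.warning("skipping %s: %s", handle, exc)
    # persist progress so we know which handles were crawled if the process dies
    with open("crawl_progress.json", "w") as f:
        json.dump(progress, f, indent=2)
    return progress
```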
- Mediacat domain crawler
- implemented different regexes for different domains - detects the different link formats on each crawled site
- accepting CSVs as input
- test crawl on a sample of 20 pages - grabbed the links from each page, filtered out links outside of the domain, and wrote out-of-scope data to a separate JSON file (see the sketch after this list)
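
A rough sketch of the out-of-scope filtering step, assuming the scope is given as a set of domain names; `split_links` and the output file name are assumptions, not the project's actual code:

```python
import json
import re
from urllib.parse import urlparse

def split_links(found_links, scope_domains, out_path="out_of_scope.json"):
    """Split links found on a crawled page into in-scope and out-of-scope."""
    in_scope, out_of_scope = [], []
    for link in found_links:
        # normalize the domain (drop any leading "www.") before checking scope
        domain = re.sub(r"^www\.", "", urlparse(link).netloc.lower())
        (in_scope if domain in scope_domains else out_of_scope).append(link)
    # out-of-scope links go to their own JSON file, as in the test crawl
    with open(out_path, "w") as f:
        json.dump(out_of_scope, f, indent=2)
    return in_scope

in_scope = split_links(
    ["https://example.com/a", "https://other.org/b"],
    scope_domains={"example.com"},
)
```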
- Integrate PDF capture
- has been completed, with UUIDs generated from the URL of a page (one way to do this is sketched after this list)
- PDF representation of pages is more costly, so this may need to be implemented as a separate service that is optionally run after post-processing (when only interlinked, in-scope items remain)
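
The notes don't specify how the UUIDs are generated from a page URL; one plausible approach is Python's `uuid5` with the URL namespace, which is deterministic, so the same page always gets the same identifier:

```python
import uuid

def page_uuid(url: str) -> str:
    """Derive a stable identifier from a page's URL."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, url))

# the same URL always produces the same UUID, across runs and machines
print(page_uuid("https://example.com/article"))
```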
- Review integration of metascraper into domain crawler
- metascraper causes an issue with async, so it will need to be incorporated as a separate crawl
- Integrate date detection into crawler
- the resolve/reject issue has been resolved, but the detection hangs after receiving many requests (unhandled promise exception)
- Jacqueline will work with Alex to examine this issue
- Add twitter_crawler compiler
- Travis tests are not passing (this was before the project was made private)
- waiting for a response before merging
- Travis is only free on public repositories, so these tests will not work in the future
- Create post-processor framework
- regex completed for text aliases and Twitter handles in the domain crawler output
- waiting on Danhua's output, as well as extraction of Twitter handles using the @ pattern, in order to do the following (see the sketch after this list):
- creating linkages between references in the Twitter output and the domain output data
- creating JSON & CSV output that contains the top Twitter handles and out-of-scope domains
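
A sketch of what the @-pattern extraction and the top-handles output could look like; the regex, function names, and file names here are assumptions, not the project's actual post-processor:

```python
import csv
import json
import re
from collections import Counter

# a simple @-mention pattern; the project's actual regex may differ
HANDLE_PATTERN = re.compile(r"@(\w{1,15})")

def top_handles(texts, n=10):
    """Count Twitter handles referenced in crawled article text."""
    counts = Counter(
        handle.lower()
        for text in texts
        for handle in HANDLE_PATTERN.findall(text)
    )
    return counts.most_common(n)

def write_outputs(counts, json_path="top_handles.json", csv_path="top_handles.csv"):
    """Write the same ranking as both JSON and CSV, as the notes describe."""
    with open(json_path, "w") as f:
        json.dump([{"handle": h, "count": c} for h, c in counts], f, indent=2)
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["handle", "count"])
        writer.writerows(counts)

write_outputs(top_handles(["see @alice and @Bob", "more from @alice"]))
```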
- Modification of crawler to gather plain text of crawled articles
- for each link crawled, the URL of the page it was found on is stored alongside it, so every URL can be traced back to the page where it was found
- regex implementation to store the domain name rather than the full URL (see the sketch below)
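
A small sketch of storing the parent URL with each crawled link and reducing a full URL to a domain name with a regex; the names and output shape are hypothetical:

```python
import re

# capture the host part of a URL, dropping any leading "www."
DOMAIN_RE = re.compile(r"https?://(?:www\.)?([^/:\s]+)")

def domain_of(url: str) -> str:
    """Extract just the domain name from a full URL."""
    match = DOMAIN_RE.match(url)
    return match.group(1) if match else url

def record_link(found_url: str, parent_url: str) -> dict:
    """Record a crawled link along with the page it was found on."""
    return {
        "url": found_url,
        "found_on": parent_url,            # lets each URL be traced back to its source page
        "domain": domain_of(found_url),    # store the domain rather than the full URL
    }

print(record_link("https://example.com/article/2", "https://example.com/index"))
```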