ankitsagar/common-crawl

Billboard Top 100 Songs of the Week Parser (Common Crawl)

This is a simple example of how to fetch web archive data from Common Crawl and parse it. Here, the parsed data is Billboard's top 100 songs of the week.

billboard.py

This script searches the Common Crawl index for the domain, records the JSON results, then downloads the matching gzip files and uncompresses them. It uses year_index.py to get the list of indexes and html_parser.py to parse the data. The results are written to CSV files, one per chart week, with the week as the file name.
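The lookup and fetch steps can be sketched roughly as follows. This is a minimal sketch, not the code in billboard.py: the function names are hypothetical, and it assumes the index answers with one JSON object per line carrying `filename`, `offset`, and `length` fields, and that each archived record is stored as its own gzip member. Network calls are left out, so only the URL, header, and decompression logic is shown.

```python
import gzip
import json
from urllib.parse import urlencode


def build_index_query(cdx_api, domain):
    """Return an index query URL asking for JSON output for all pages under domain."""
    params = {"url": f"{domain}/*", "output": "json"}
    return f"{cdx_api}?{urlencode(params)}"


def parse_index_lines(text):
    """Parse a JSON-lines index response into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def byte_range_header(record):
    """Build the HTTP Range header that selects one gzipped record in the archive file."""
    start = int(record["offset"])
    end = start + int(record["length"]) - 1  # Range end is inclusive
    return {"Range": f"bytes={start}-{end}"}


def decompress_record(raw_bytes):
    """Uncompress one gzipped record and decode it as text."""
    return gzip.decompress(raw_bytes).decode("utf-8", errors="replace")
```

In a full script, the URL from `build_index_query` would be fetched with requests, and the header from `byte_range_header` would be passed along when downloading the file named in `record["filename"]`, so that only the bytes of the wanted record are transferred.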

Structure of each CSV file: Rank, Song, Artist
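Writing one week's chart in that layout might look like the sketch below. The helper names are hypothetical, and the header row is an assumption, since the README does not say whether the real files include one.

```python
import csv
import io


def csv_name(week):
    """Name the output file after the chart week (hypothetical naming helper)."""
    return f"{week}.csv"


def write_chart_csv(rows, out):
    """Write (rank, song, artist) tuples to an open text file or file-like object."""
    writer = csv.writer(out)
    writer.writerow(["Rank", "Song", "Artist"])  # header row is an assumption
    writer.writerows(rows)


# Usage with an in-memory buffer; a real run would use
# open(csv_name(week), "w", newline="") instead.
buffer = io.StringIO()
write_chart_csv([(1, "Song, A", "Artist A")], buffer)
```

Using the csv module rather than joining fields by hand means titles that contain commas or quotes are escaped correctly.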

html_parser.py

This parses the given HTML and returns the week, rank, song, and artist.
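The shape of such a parser can be sketched as below. To stay dependency-free, this sketch swaps in the standard library's html.parser instead of BeautifulSoup4, which is what the project itself requires. The class names (`chart-week`, `rank`, `song`, `artist`) are invented for illustration and are not Billboard's actual markup.

```python
from html.parser import HTMLParser

# Hypothetical CSS class -> field name mapping; the real page markup differs.
FIELDS = {"chart-week": "week", "rank": "rank", "song": "song", "artist": "artist"}


class ChartParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.week = None
        self.rows = []
        self._field = None   # field whose text is expected next
        self._current = {}   # partially collected chart row

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for cls in classes:
            if cls in FIELDS:
                self._field = FIELDS[cls]

    def handle_data(self, data):
        text = data.strip()
        if not self._field or not text:
            return
        if self._field == "week":
            self.week = text
        else:
            self._current[self._field] = text
            if len(self._current) == 3:  # rank, song, and artist all seen
                row = self._current
                self.rows.append((int(row["rank"]), row["song"], row["artist"]))
                self._current = {}
        self._field = None


def parse_chart(html):
    """Return (week, [(rank, song, artist), ...]) for the hypothetical markup."""
    parser = ChartParser()
    parser.feed(html)
    parser.close()
    return parser.week, parser.rows
```

Calling `parse_chart` on a page that uses these class names yields the week string and a list of tuples that map directly onto the CSV columns.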

year_index.py

This collects all available indexes from 2016 onward by crawling the Common Crawl website. It starts at 2016 because the site's HTML was different before then.
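The year filter could be sketched like this, assuming the crawl identifiers have already been collected from the website and follow the `CC-MAIN-YYYY-WW` pattern. The function name and the regular expression are illustrative and are not taken from year_index.py.

```python
import re

# Matches identifiers such as "CC-MAIN-2016-07"; older ids in other
# shapes (for example "CC-MAIN-2008-2009") do not match and are skipped.
CRAWL_ID = re.compile(r"^CC-MAIN-(\d{4})-(\d{2})$")


def indexes_since(crawl_ids, first_year=2016):
    """Keep only crawl ids whose year is first_year or later, preserving order."""
    selected = []
    for crawl_id in crawl_ids:
        match = CRAWL_ID.match(crawl_id)
        if match and int(match.group(1)) >= first_year:
            selected.append(crawl_id)
    return selected
```

Each identifier that survives the filter can then be used to form the index URL that billboard.py queries.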

This program requires the BeautifulSoup4 and requests modules.
