Billboard top 100 songs of the week parser (Common crawl)

This is a simple example of how to fetch web archive data from Common Crawl and parse it. Here, the top 100 songs of the week on Billboard are parsed.

billboard.py

This script searches the Common Crawl index for the domain, records the JSON data returned for each match, then downloads the corresponding gzip file and decompresses it. It uses year_index.py to get the list of indexes and html_parser.py to parse the data. The data is written to CSV files, and each file is named after its week.
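The lookup and download steps might be sketched as follows. This is an illustration under stated assumptions, not the code in billboard.py: the endpoint URLs, the function names, and the use of `urllib` from the standard library (instead of the requests module) are all choices made here to keep the example self-contained. The index server returns one JSON object per line, and each object's `filename`, `offset`, and `length` fields locate a single gzip-compressed record inside a larger archive file.

```python
import gzip
import json
import urllib.parse
import urllib.request

# Assumed endpoints; the URLs used by billboard.py are not shown in this README.
INDEX_URL = "https://index.commoncrawl.org/{index}-index?{query}"
DATA_URL = "https://data.commoncrawl.org/{filename}"


def parse_index_response(text):
    """Parse the index server's response: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def range_header(record):
    """Build the HTTP Range header that selects one record of the archive."""
    start = int(record["offset"])
    end = start + int(record["length"]) - 1
    return {"Range": "bytes={}-{}".format(start, end)}


def extract_html(compressed):
    """Decompress one record and return its HTML payload.

    A response record has three parts separated by blank lines:
    WARC headers, HTTP headers, then the page body.
    """
    raw = gzip.decompress(compressed)
    parts = raw.split(b"\r\n\r\n", 2)
    if len(parts) < 3:
        return ""
    return parts[2].decode("utf-8", errors="replace")


def fetch_records(index, pattern):
    """Query one crawl index for a URL pattern (needs network access)."""
    query = urllib.parse.urlencode({"url": pattern, "output": "json"})
    url = INDEX_URL.format(index=index, query=query)
    with urllib.request.urlopen(url) as response:
        return parse_index_response(response.read().decode("utf-8"))


def fetch_html(record):
    """Download and decompress one captured page (needs network access)."""
    request = urllib.request.Request(
        DATA_URL.format(filename=record["filename"]),
        headers=range_header(record),
    )
    with urllib.request.urlopen(request) as response:
        return extract_html(response.read())
```

Requesting only the byte range of one record avoids downloading the whole archive file, which is typically much larger than a single page.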

Structure of csv file: Rank, Song, Artist
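Writing one week's chart in that layout could look like the sketch below. The function name `write_week_csv`, the `output` directory default, the week string format, and the presence of a header row are assumptions made for illustration; the README only states the column order and that each file is named after its week.

```python
import csv
import os


def write_week_csv(week, entries, out_dir="output"):
    """Write one chart to <out_dir>/<week>.csv and return the file path.

    `entries` is an iterable of (rank, song, artist) tuples.
    """
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, week + ".csv")
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["Rank", "Song", "Artist"])  # header row is an assumption
        writer.writerows(entries)
    return path
```

For example, `write_week_csv("2016-01-02", [(1, "Hello", "Adele")])` would create `output/2016-01-02.csv`. The csv module quotes fields automatically, so song titles that contain commas remain a single column.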

html_parser.py

This parses the given HTML and returns the week, rank, song, and artist.
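A rough sketch of such a parser is shown below. It is not the code in html_parser.py: the class names in `FIELDS`, the `<time datetime="...">` element used for the week, and the names `ChartParser` and `parse_chart` are hypothetical, and the example uses the standard library's `html.parser` rather than BeautifulSoup4 so that it runs without extra packages. It also assumes each field's text is not wrapped in further nested tags.

```python
from html.parser import HTMLParser

# Hypothetical class names; the real Billboard markup is not described here.
FIELDS = {
    "chart-row__current-week": "rank",
    "chart-row__song": "song",
    "chart-row__artist": "artist",
}


class ChartParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.week = None
        self.rows = []      # completed (rank, song, artist) tuples
        self._row = {}      # fields collected for the current chart row
        self._field = None  # field whose text is currently being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Take the week from the first <time> element on the page.
        if tag == "time" and self.week is None:
            self.week = attrs.get("datetime")
        for name in (attrs.get("class") or "").split():
            if name in FIELDS:
                self._field = FIELDS[name]
                self._row[self._field] = ""

    def handle_data(self, data):
        if self._field:
            self._row[self._field] += data

    def handle_endtag(self, tag):
        if self._field is None:
            return
        self._row[self._field] = self._row[self._field].strip()
        self._field = None
        # A row is complete once rank, song, and artist have all been seen.
        if len(self._row) == len(FIELDS):
            row = self._row
            self.rows.append((int(row["rank"]), row["song"], row["artist"]))
            self._row = {}


def parse_chart(html):
    """Return (week, rows) where rows is a list of (rank, song, artist)."""
    parser = ChartParser()
    parser.feed(html)
    parser.close()
    return parser.week, parser.rows
```

Keeping the markup-specific names in one dictionary means that a change in the page layout only requires updating `FIELDS`, which matters because archived pages from different years may not share the same HTML.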

year_index.py

This builds the list of indexes for every year from 2016 onward by crawling the Common Crawl website. The list starts at 2016 because before that the site's HTML was different.
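One way to obtain the same list is sketched below. Note the swapped-in approach: instead of scraping the website as year_index.py does, this example reads the `collinfo.json` listing published by the Common Crawl index server and filters the index IDs by year. The function names and the `first_year` parameter are assumptions made for illustration.

```python
import json
import re
import urllib.request

COLLINFO_URL = "https://index.commoncrawl.org/collinfo.json"

# Index IDs of regular crawls look like "CC-MAIN-2016-07" (year, then week).
INDEX_ID = re.compile(r"^CC-MAIN-(\d{4})-\d{2}$")


def indexes_since(index_ids, first_year=2016):
    """Keep only the index IDs whose year is first_year or later, sorted."""
    selected = []
    for index_id in index_ids:
        match = INDEX_ID.match(index_id)
        if match and int(match.group(1)) >= first_year:
            selected.append(index_id)
    return sorted(selected)


def fetch_index_ids():
    """Download the list of all index IDs (needs network access)."""
    with urllib.request.urlopen(COLLINFO_URL) as response:
        return [entry["id"] for entry in json.load(response)]
```

Calling `indexes_since(fetch_index_ids())` would then return the IDs to search, oldest first. IDs that do not follow the year-week pattern are skipped rather than guessed at.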

This program requires BeautifulSoup4 and the requests module.

ankitsagar/common-crawl

Folders and files

Latest commit

History

Repository files navigation

Billboard top 100 songs of the week parser (Common crawl)

billboard.py

html_parser.py

year_index.py

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages