Downloads all GPL tarballs (and zips and rars!) from TP-Link by parsing https://www.tp-link.com/au/choose-your-location/, then extracting the country-specific support/gpl-code/ pages to get lists of tarballs.
The pages are structured in such a way that they either have direct links to tar.gz files (or similar), or JavaScript generates links to a page like https://www.tp-link.com/phppage/gpl-res-list.html?model=Deco%20M5&appPath=kz for each model and country code.
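In the JavaScript case, the per-model URL is just the gpl-res-list.html endpoint with the model name URL-encoded and the country's appPath appended. A minimal sketch of building such a URL (the helper name is illustrative, not from this repo):

```python
from urllib.parse import quote

def model_page_url(model_name: str, app_path: str) -> str:
    """Build the gpl-res-list.html URL for one model and country code."""
    return (
        "https://www.tp-link.com/phppage/gpl-res-list.html"
        f"?model={quote(model_name)}&appPath={app_path}"
    )

# model_page_url("Deco M5", "kz")
# -> "https://www.tp-link.com/phppage/gpl-res-list.html?model=Deco%20M5&appPath=kz"
```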
Gets the list of countries and creates the initial list of URLs, writing the following files (a CSV-parsing sketch follows the list):
- links/{country code}.json - Cache of productTree JSON from each GPL code page
- links/{country code}.model.csv - Links to model pages; we need to parse these further to get tarballs
CSV that looks like
link,original_url,model_name,appPath
https://www.tp-link.com/phppage/gpl-res-list.html?model=Deco%20X60&appPath=au,https://www.tp-link.com/au/support/gpl-code/,Deco X60,au
https://www.tp-link.com/phppage/gpl-res-list.html?model=Deco%20X20&appPath=au,https://www.tp-link.com/au/support/gpl-code/,Deco X20,au
Used by second_pass.
- links/{country code}.tars.csv - Direct links to tarballs
CSV that looks like
link,original_url,model_name,appPath
https://static.tp-link.com/resources/gpl/GPL_X90_1.tar.gz,https://www.tp-link.com/au/support/gpl-code/,Deco X90,au
https://static.tp-link.com/resources/gpl/GPL_X68_1.tar.gz,https://www.tp-link.com/au/support/gpl-code/,Deco X68,au
Used by second_pass.
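Both CSVs share the same header, so later passes can read them with the standard csv module. A minimal sketch, assuming a plain csv.DictReader is enough (the function name and example path are illustrative):

```python
import csv

def read_links(csv_path: str):
    """Yield (link, original_url, model_name, appPath) tuples from a links/*.csv file."""
    with open(csv_path, newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            yield row["link"], row["original_url"], row["model_name"], row["appPath"]

# Example:
# for link, _, model, app_path in read_links("links/au.tars.csv"):
#     print(model, link)
```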
Uses warcio's ability to wrap requests to set up a nice little cache layer.
Download cache generated by using warcio and requests - uncompressed WARC 1.1 format.
Future plans to dump this into SQLite and compress with https://github.com/phiresky/sqlite-zstd, but do we really need to?
- output/{sha256sum of url} - WARC file used as cache
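A minimal sketch of how such a cache layer can look, assuming warcio's capture_http wrapper, an uncompressed WARC 1.1 writer, and a sha256-of-URL filename; the function name and the simplistic cache-hit handling are illustrative, not this repo's actual code:

```python
import hashlib
import os

from warcio.capture_http import capture_http
from warcio.warcwriter import WARCWriter
import requests  # per warcio's docs, import requests after capture_http

def cached_get(url: str, cache_dir: str = "output") -> str:
    """Fetch url once, recording the traffic to {cache_dir}/{sha256 of url} as uncompressed WARC 1.1."""
    os.makedirs(cache_dir, exist_ok=True)
    cache_path = os.path.join(cache_dir, hashlib.sha256(url.encode()).hexdigest())
    if os.path.exists(cache_path):
        return cache_path  # cache hit; a real version would replay the stored response
    with open(cache_path, "wb") as fh:
        writer = WARCWriter(fh, gzip=False, warc_version="1.1")
        with capture_http(writer):
            requests.get(url)
    return cache_path
```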
Parses *.model.csv, downloads the additional model pages, and parses them for more links to archives, which are then added to the corresponding *.tars.csv file.
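A minimal sketch of that second-pass step, assuming the downloaded model pages can be scanned with a simple regex for archive URLs; the regex, function name, and CSV handling are illustrative assumptions, not the repo's actual parsing logic:

```python
import csv
import re

import requests

# Assumed pattern: any absolute URL ending in a known archive extension.
ARCHIVE_RE = re.compile(r"""https?://[^\s"'<>]+\.(?:tar\.gz|tgz|zip|rar)""", re.IGNORECASE)

def collect_archive_links(model_row: dict, tars_csv_path: str) -> None:
    """Fetch one model page and append any archive links found to the *.tars.csv file."""
    html = requests.get(model_row["link"]).text
    with open(tars_csv_path, "a", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        for archive_url in sorted(set(ARCHIVE_RE.findall(html))):
            writer.writerow([archive_url, model_row["original_url"],
                             model_row["model_name"], model_row["appPath"]])
```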
Run bash extract_exists.sh path/to/archives in the path you want to extract to.
- Reduce amount of log spam - use https://gist.github.com/bdarnell/3118509 or similar
- Rename output/ to cache/
- second_pass: grab all links to tarballs, deduplicate, write metadata to sqlite (HEAD requests?), compare with already downloaded tarballs (?)
- document third pass