Downloads all GPL tarballs (and zips and rars!) from TP-Link by parsing https://www.tp-link.com/au/choose-your-location/, then extracting the country-specific support/gpl-code/ pages to get lists of tarballs.
The pages are structured in such a way that they either have direct links to tar.gz files (or similar), or JavaScript generates links to a page like https://www.tp-link.com/phppage/gpl-res-list.html?model=Deco%20M5&appPath=kz for each model and country code.
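In the JavaScript case, the per-model URL is just the gpl-res-list.html endpoint with the model name URL-encoded and the country's appPath appended. A minimal sketch of building such a URL (the helper name is illustrative, not from this repo):

```python
from urllib.parse import quote

def model_page_url(model_name: str, app_path: str) -> str:
    """Build the gpl-res-list.html URL for one model and country code."""
    return (
        "https://www.tp-link.com/phppage/gpl-res-list.html"
        f"?model={quote(model_name)}&appPath={app_path}"
    )

# model_page_url("Deco M5", "kz")
# -> "https://www.tp-link.com/phppage/gpl-res-list.html?model=Deco%20M5&appPath=kz"
```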
Gets the list of countries and creates the initial list of URLs, writing the following files (a CSV-parsing sketch follows the list):
- links/{country code}.json - Cache of productTree JSON from each GPL code page
- links/{country code}.model.csv - Links to model pages; we need to parse these further to get tarballs
CSV that looks like
link,original_url,model_name,appPath
https://www.tp-link.com/phppage/gpl-res-list.html?model=Deco%20X60&appPath=au,https://www.tp-link.com/au/support/gpl-code/,Deco X60,au
https://www.tp-link.com/phppage/gpl-res-list.html?model=Deco%20X20&appPath=au,https://www.tp-link.com/au/support/gpl-code/,Deco X20,au
Used by second_pass.
- links/{country code}.tars.csv - Direct links to tarballs
CSV that looks like
link,original_url,model_name,appPath
https://static.tp-link.com/resources/gpl/GPL_X90_1.tar.gz,https://www.tp-link.com/au/support/gpl-code/,Deco X90,au
https://static.tp-link.com/resources/gpl/GPL_X68_1.tar.gz,https://www.tp-link.com/au/support/gpl-code/,Deco X68,au
Used by second_pass.
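Both CSVs share the same header, so later passes can read them with the standard csv module. A minimal sketch, assuming a plain csv.DictReader is enough (the function name and example path are illustrative):

```python
import csv

def read_links(csv_path: str):
    """Yield (link, original_url, model_name, appPath) tuples from a links/*.csv file."""
    with open(csv_path, newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            yield row["link"], row["original_url"], row["model_name"], row["appPath"]

# Example:
# for link, _, model, app_path in read_links("links/au.tars.csv"):
#     print(model, link)
```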
Uses warcio's ability to wrap requests to set up a nice little cache layer.
Download cache generated by using warcio and requests - uncompressed WARC 1.1 format.
Future plans to dump this into SQLite and compress with https://github.com/phiresky/sqlite-zstd, but do we really need to?
- output/{sha256sum of url} - WARC file used as cache
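A minimal sketch of how such a cache layer can look, assuming warcio's capture_http wrapper, an uncompressed WARC 1.1 writer, and a sha256-of-URL filename; the function name and the simplistic cache-hit handling are illustrative, not this repo's actual code:

```python
import hashlib
import os

from warcio.capture_http import capture_http
from warcio.warcwriter import WARCWriter
import requests  # per warcio's docs, import requests after capture_http

def cached_get(url: str, cache_dir: str = "output") -> str:
    """Fetch url once, recording the traffic to {cache_dir}/{sha256 of url} as uncompressed WARC 1.1."""
    os.makedirs(cache_dir, exist_ok=True)
    cache_path = os.path.join(cache_dir, hashlib.sha256(url.encode()).hexdigest())
    if os.path.exists(cache_path):
        return cache_path  # cache hit; a real version would replay the stored response
    with open(cache_path, "wb") as fh:
        writer = WARCWriter(fh, gzip=False, warc_version="1.1")
        with capture_http(writer):
            requests.get(url)
    return cache_path
```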
Parses *.model.csv, downloads the additional model pages, and parses them for more links to archives, which are then added to the corresponding *.tars.csv file.
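A minimal sketch of that second-pass step, assuming the downloaded model pages can be scanned with a simple regex for archive URLs; the regex, function name, and CSV handling are illustrative assumptions, not the repo's actual parsing logic:

```python
import csv
import re

import requests

# Assumed pattern: any absolute URL ending in a known archive extension.
ARCHIVE_RE = re.compile(r"""https?://[^\s"'<>]+\.(?:tar\.gz|tgz|zip|rar)""", re.IGNORECASE)

def collect_archive_links(model_row: dict, tars_csv_path: str) -> None:
    """Fetch one model page and append any archive links found to the *.tars.csv file."""
    html = requests.get(model_row["link"]).text
    with open(tars_csv_path, "a", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        for archive_url in sorted(set(ARCHIVE_RE.findall(html))):
            writer.writerow([archive_url, model_row["original_url"],
                             model_row["model_name"], model_row["appPath"]])
```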
Run bash extract_exists.sh path/to/archives in the path you want to extract to.
- Reduce amount of log spam - use https://gist.github.com/bdarnell/3118509 or similar
- Rename output/ to cache/
- second_pass: grab all links to tarballs, deduplicate, write metadata to sqlite (HEAD requests?), compare with already downloaded tarballs (?)
- document third pass