# tplink-grab

Downloads all GPL tarballs (and zips and rars!) from TP-Link by parsing https://www.tp-link.com/au/choose-your-location/, then scraping each country-specific support/gpl-code/ page to build lists of tarballs. The pages are structured so that they either contain direct links to tar.gz files (or similar archives), or JavaScript generates a per-model link to a page like https://www.tp-link.com/phppage/gpl-res-list.html?model=Deco%20M5&appPath=kz for each model and country code.
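As a rough illustration of the second shape, here is a minimal sketch of building that per-model URL; the helper name is made up, and the real scripts may construct it differently:

```python
from urllib.parse import urlencode, quote

GPL_RES_LIST = "https://www.tp-link.com/phppage/gpl-res-list.html"

def model_page_url(model_name: str, app_path: str) -> str:
    """Build the JS-generated per-model page URL (hypothetical helper).

    model_page_url("Deco M5", "kz") gives the example URL above:
    https://www.tp-link.com/phppage/gpl-res-list.html?model=Deco%20M5&appPath=kz
    """
    # quote_via=quote encodes spaces as %20, matching TP-Link's links
    return GPL_RES_LIST + "?" + urlencode(
        {"model": model_name, "appPath": app_path}, quote_via=quote
    )
```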

## first_pass.py

Gets the list of countries and creates the initial list of URLs.
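The details live in first_pass.py itself; as a hedged sketch of the idea (the regex and helper names are assumptions, not the script's actual code):

```python
import re
import requests

LOCATIONS_URL = "https://www.tp-link.com/au/choose-your-location/"

def country_codes() -> set[str]:
    """Scrape region codes from the choose-your-location page.

    Sketch only: assumes region links look like href=".../au/"; some
    codes may be longer than two letters and the page may change.
    """
    html = requests.get(LOCATIONS_URL, timeout=30).text
    return set(re.findall(r'href="(?:https://www\.tp-link\.com)?/([a-z]{2})/"', html))

def gpl_code_url(cc: str) -> str:
    """Country-specific GPL code page, matching the original_url column below."""
    return f"https://www.tp-link.com/{cc}/support/gpl-code/"
```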

### Output

- `links/{country code}.json` - cache of the productTree JSON from each GPL code page
- `links/{country code}.model.csv` - links to model pages; these need further parsing to get tarballs

A CSV that looks like:

    link,original_url,model_name,appPath
    https://www.tp-link.com/phppage/gpl-res-list.html?model=Deco%20X60&appPath=au,https://www.tp-link.com/au/support/gpl-code/,Deco X60,au
    https://www.tp-link.com/phppage/gpl-res-list.html?model=Deco%20X20&appPath=au,https://www.tp-link.com/au/support/gpl-code/,Deco X20,au

Used by second_pass.

- `links/{country code}.tars.csv` - direct links to tarballs

A CSV that looks like:

    link,original_url,model_name,appPath
    https://static.tp-link.com/resources/gpl/GPL_X90_1.tar.gz,https://www.tp-link.com/au/support/gpl-code/,Deco X90,au
    https://static.tp-link.com/resources/gpl/GPL_X68_1.tar.gz,https://www.tp-link.com/au/support/gpl-code/,Deco X68,au

Used by second_pass. A hedged sketch of consuming both CSV formats follows.
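Both CSVs share the same four columns, so a consumer can read them uniformly. A minimal sketch, assuming only the header shown above (not the repo's actual code):

```python
import csv
from pathlib import Path

def read_links(csv_path: Path) -> list[dict]:
    """Load a *.model.csv or *.tars.csv into rows keyed by
    link, original_url, model_name, appPath."""
    with csv_path.open(newline="") as f:
        return list(csv.DictReader(f))

# e.g. list the model pages still to be crawled for Australia
for row in read_links(Path("links/au.model.csv")):
    print(row["model_name"], row["link"])
```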

## cached_downloader.py

Uses warcio's ability to wrap requests to set up a small cache layer: downloads are captured with warcio and requests and stored in uncompressed WARC 1.1 format (see the sketch after the output list).

Future plans: dump this into SQLite and compress it with https://github.com/phiresky/sqlite-zstd - but do we really need to?

### Output

- `output/{sha256sum of url}` - WARC file used as cache
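A hedged sketch of how such a cache layer can be wired together: `capture_http` and `WARCWriter` are real warcio APIs (the `gzip`/`warc_version` arguments reflect my understanding of warcio), but `cache_path` and `cached_get` are invented names and the real module may differ:

```python
import hashlib
from pathlib import Path

from warcio import WARCWriter
from warcio.capture_http import capture_http

import requests  # per warcio's docs, import requests after capture_http

CACHE_DIR = Path("output")

def cache_path(url: str) -> Path:
    """Cache file name is the sha256 of the URL, per the layout above."""
    return CACHE_DIR / hashlib.sha256(url.encode()).hexdigest()

def cached_get(url: str) -> Path:
    """Fetch url into an uncompressed WARC 1.1 file unless already cached."""
    path = cache_path(url)
    if path.exists():
        return path  # cache hit: the response is already on disk
    CACHE_DIR.mkdir(exist_ok=True)
    with path.open("wb") as fh:
        writer = WARCWriter(fh, gzip=False, warc_version="1.1")
        with capture_http(writer):
            requests.get(url, timeout=60)
    return path
```

Reading a cached response back out would go through warcio's ArchiveIterator over the same file.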

## second_pass.py

Parses the *.model.csv files, downloads the additional model pages, and parses those for more links to archives, which are then appended to the corresponding *.tars.csv file.
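As a rough sketch of that extraction step (the regex and the CSV layout are assumptions based on the examples above, not the script's actual logic):

```python
import csv
import re
import requests

# Direct archive links, per the "tar.gz files or similar" note above
ARCHIVE_RE = re.compile(r'href="(https?://[^"]+\.(?:tar\.gz|tgz|zip|rar))"')

def archive_links(model_page_url: str) -> list[str]:
    """Pull direct archive links out of one gpl-res-list model page."""
    html = requests.get(model_page_url, timeout=30).text
    return ARCHIVE_RE.findall(html)

def append_tars(tars_csv: str, row: dict) -> None:
    """Append one line per archive found to the country's *.tars.csv."""
    with open(tars_csv, "a", newline="") as f:
        writer = csv.writer(f)
        for link in archive_links(row["link"]):
            writer.writerow(
                [link, row["original_url"], row["model_name"], row["appPath"]]
            )
```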

## scripts/extract_exists.sh

Run `bash extract_exists.sh path/to/archives` from the directory you want to extract into.

## TODO

- Reduce the amount of log spam - use https://gist.github.com/bdarnell/3118509 or similar
- Rename output/ to cache/
- second_pass: grab all links to tarballs, deduplicate, write metadata to sqlite (HEAD requests?), compare with already downloaded tarballs (?)
- document the third pass