Trawler is an application that facilitates bulk downloads of datasheets and resources from vendor websites.
It exists because vendors don't tend to host bulk downloads of their content, or something nice like an rsync mirror, so we need to resort to scraping.
Trawler is built around selenium, which allows us to pretend to be a user and let AJAX and other bits of JavaScript run, so that we can interact with the vendor website and collect the datasheets.
Trawler has two types of adapters: source adapters and meta adapters. The source adapters are responsible for the collection and download of datasheets, whereas the meta adapters are for interacting with the cache data Trawler has.
The following source adapters are included with Trawler:
arm
- Download documentation from https://developer.arm.com/documentation
xilinx
- Download the documentation from the Xilinx DocNav service
usb-if
- Download the documentation from https://www.usb.org/documents
renasas
- Download the documentation from https://www.renesas.com/us/en/support/document-search
The following source adapters are planned:
ti
- Download the documentation for Texas Instruments.
st
- Download the documentation from ST.
microchip
- Download the documentation from Microchip.
micron
- Download the documentation from Micron.
If the adapter you want is not in this list, feel free to open an issue or contribute it yourself!
The following meta adapters are implemented currently:
zotero
- Integration and sync with a local Zotero install
The following meta adapters are planned:
query
- Very trivial datasheet lookup by title / tag
export
- Export cache information in various formats
The simplest way to use Trawler is to invoke the adapter for the datasheet source you want, like so:
trawler arm
This will cause Trawler to initialize everything it needs and then automatically start the entire acquisition process. This will take a long time, and how long depends heavily on the adapter.
Each adapter has its own settings and configuration in addition to the global settings. To see the settings an adapter supports, pass --help to it:
trawler arm --help
To list the adapters that Trawler knows about, pass --help to Trawler by itself and it will list all the adapters it has:
trawler --help
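To illustrate why --help gives different output at the two levels, here is a minimal sketch of a command line with global settings and per-adapter subcommands, written with Python's argparse. This is not Trawler's actual parser: the function name build_parser, the help strings, and the selection of options are assumptions made for illustration, and only the flag names are taken from the settings listed in this document.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Hypothetical layout; Trawler's real parser may be organised differently.
    parser = argparse.ArgumentParser(prog='trawler')
    # Global settings live on the top-level parser.
    parser.add_argument('--output', '-o', help='output directory')
    parser.add_argument('--timeout', '-t', type=int, help='network timeout in seconds')

    # Each adapter is a subcommand with its own settings and its own --help.
    adapters = parser.add_subparsers(dest='adapter', required=True)
    arm = adapters.add_parser('arm', help='ARM documentation')
    arm.add_argument('--arm-document-type', help='types of documents to collect')
    xilinx = adapters.add_parser('xilinx', help='Xilinx DocNav documentation')
    xilinx.add_argument('--dont-group', '-G', action='store_true')
    return parser

# Global settings come before the adapter name, adapter settings after it.
args = build_parser().parse_args(['--output', 'datasheets', 'xilinx', '-G'])
print(args.adapter, args.output, args.dont_group)  # prints: xilinx datasheets True
```

In a layout like this, the top-level --help lists the subcommands, while --help after an adapter name lists only that adapter's settings.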
Trawler supports the following settings globally:
--output / -o
- Specify the output directory for Trawler to use.
--timeout / -t
- Specify the timeout duration in seconds for network operations.
--retry / -r
- Specify the number of times to retry network operations.
--delay / -d
- Specify the delay in seconds for network operations.
--cache-database / -c
- Specify the location and name of the datasheet cache database Trawler uses.
--skip-collect / -C
- Skip the datasheet collection stage for the adapter.
--skip-extract / -E
- Skip the extraction stage for the adapter.
--skip-download / -D
- Skip the download stage for the adapter.
--user-agent / -A
- Specify the user-agent to use when downloading files.
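One plausible way the --timeout, --retry, and --delay settings could combine is a simple retry loop, sketched below. This is an assumption and not Trawler's actual implementation: the names fetch_with_retry and _urlopen are hypothetical, and this document does not say whether the delay applies only between retries or between all network operations.

```python
import time
import urllib.request

def _urlopen(url, timeout):
    # Default fetcher: a plain HTTP GET using the standard library.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()

def fetch_with_retry(url, timeout=30.0, retries=3, delay=1.0, fetch=_urlopen):
    """Try once, then retry up to `retries` more times, sleeping `delay` seconds in between.

    timeout, retries and delay mirror --timeout, --retry and --delay (assumed semantics).
    """
    last_error = None
    for attempt in range(retries + 1):
        try:
            return fetch(url, timeout)
        except OSError as error:  # URLError and socket timeouts are OSError subclasses
            last_error = error
            if attempt < retries:
                time.sleep(delay)
    # Every attempt failed, so surface the most recent error to the caller.
    raise last_error
```

The fetch parameter exists only so the loop can be exercised without touching the network, for example by passing a function that fails a fixed number of times before returning data.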
The following settings are used for the WebDriver, and therefore only affect the adapters / stages that use it:
--profile-directory / -p
- Specify the WebDriver profile directory.
--webdriver / -w
- Specify the WebDriver to use.
--headless / -H
- Tell the WebDriver to run in headless mode.
--headless-width / -X
- Specify the virtual width of the WebDriver instance.
--headless-height / -Y
- Specify the virtual height of the WebDriver instance.
The following settings are only applicable to the ARM adapter:
--arm-document-type / -A
- Specify the types of documents to collect and download.
The following settings are only applicable to the Xilinx adapter:
--dont-group / -G
- Don't group Datasheets into categories and groups when downloading.
--collect-web-only / -W
- Allow Trawler to collect the web-only content.
The following settings are only applicable to the Zotero meta adapter:
--zotero-db-location
- Specify the location of the Zotero database if it's not the default.
The Zotero meta adapter can take the following actions:
sync
- Sync the Trawler cache with the Zotero database.
The Zotero sync action has the following settings:
--backup
- Back up the Zotero database before performing the sync.
--backup-dir
- Set the backup directory for the Zotero database.
With pip, all the needed dependencies for Trawler should be pulled in automatically.
To install the current development snapshot, simply run:
pip3 install --user 'git+https://github.com/bad-alloc-heavy-industries/Trawler.git#egg=Trawler'
Or to install a local development copy:
git clone https://github.com/bad-alloc-heavy-industries/Trawler.git
cd Trawler
pip3 install --user --editable '.'
NOTE: The adapters that need a WebDriver will only work if you have one installed for selenium to use!
- Some adapters won't work if the WebDriver viewport is smaller than 1920x1080. You may be able to fix this by running the WebDriver headless with the correct virtual size, if the WebDriver supports it.
Trawler is licensed under the BSD 3-Clause license, the full text of which can be found in the LICENSE file.