Hello you,

Getting started

Setting up the Python environment

pipenv is used to manage the whole Python environment.

Install pipenv (this must be done only once)
Run pipenv install --dev to create a Python virtual environment if unexisting and to install the Scraper's dependencies (this will have to be done each time dependencies are modified)
Run pipenv shell to enter the Python virtual environment (this will have to be done each time you get back to work on the Scraper)

Authentication

Disclaimer: When Algolia runs the scraper on Algolia's infrastructure, no authentication methods are possible since DocSearch focus on publicly available websites.

WARNING: Please be aware that the scraper sends authentication headers to every scraped site, so use allowed_domains to adjust the scope accordingly!

Basic HTTP:

Basic HTTP authentication is supported by setting these environment variables:

DOCSEARCH_BASICAUTH_USERNAME
DOCSEARCH_BASICAUTH_PASSWORD

Cloudflare Access: Identity and Access Management

If it happens to you to scrape sites protected by Cloudflare Access, you have to set appropriate HTTP headers. Values for these headers are taken from env variables CF_ACCESS_CLIENT_ID and CF_ACCESS_CLIENT_SECRET.

In case of Google Cloud Identity-Aware Proxy, please specify these env variables:

IAP_AUTH_CLIENT_ID - # pick client ID of the application you are connecting to
IAP_AUTH_SERVICE_ACCOUNT_JSON - # generate in Actions -> Create key -> JSON

Installing Chrome Headless

Websites that need JavaScript for rendering are passed through ChromeDriver. Download the version suited to your OS and then update the CHROMEDRIVER_PATH in your .env file.

You should be ready to go.

Running

See the dedicated page on Algolia's documentation web site.

Lint code

The code is checked against linting rules by the CI, with pylint (which is installed by pipenv as a dev package).

To run the linter, run the following command at the root of your clone:

pipenv run pylint scraper cli deployer

Run the tests

To run the full test suite, run ./docsearch test.

Special instructions for Algolia employees

Ready to deploy

If you are Algolia employee and want to manage a DocSearch account, you'll need to copy and paste the required variables in your .env file.

Ping the @DocSearch team to get access to those credentials.

Clone the configurations repository in your docsearch-scraper directory. Command to run from the scraper root:

git clone git@github.com:algolia/docsearch-configs.git configs/public

The CLI will then have more commands for you to run.

Reason of a fail from the docker image

To spot why a crawl fail without watching the logs, we have defined some custom exit status:

Exit code	Reason
3	No record extracted from the crawl
4	Too much hits returned from the crawl
5	The configuration provided is not a valid JSON
6	The endpoint to call is incorrect
7	Credentials used to request are not set

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CONTRIBUTING.md

CONTRIBUTING.md

Getting started

Setting up the Python environment

Authentication

Basic HTTP:

Cloudflare Access: Identity and Access Management

Installing Chrome Headless

Running

Lint code

Run the tests

Special instructions for Algolia employees

Ready to deploy

Reason of a fail from the docker image

Files

CONTRIBUTING.md

Latest commit

History

CONTRIBUTING.md

File metadata and controls

Getting started

Setting up the Python environment

Authentication

Basic HTTP:

Cloudflare Access: Identity and Access Management

Installing Chrome Headless

Running

Lint code

Run the tests

Special instructions for Algolia employees

Ready to deploy

Reason of a fail from the docker image