Scraper engine based on Selenium Python bindings. Uses Flask for its API and ReactJS for its client.
- Set up Python environment:
    python -m venv .env
    source .env/bin/activate
    python -m pip install -r requirements.txt
  Windows users should use activate.bat instead:
    .env\Scripts\activate.bat
- Set up Node.js environment:
    npm install
- Build client:
    npm run build
- Edit composer/standalone-chrome with your VNC password. Replace flaskscraper@123 with your own password.
- Edit docker-compose.yml with your PostgreSQL database. Afterwards start the service with:
    sudo docker compose up
  Windows users should omit the sudo:
    docker compose up
- Navigate to home page:
    http://localhost:5000
- PYTHONUNBUFFERED: Used to configure Python. Set to true
- FLASK_ENV: Used to configure the Flask server. Set to development
- NODE_ENV: Used to configure Webpack. Set to development
- DATABASE_URI: Link to the PostgreSQL database. Default postgres://postgress:postgress@localhost/postgres
- SELENIUM_URI: Link to the Selenium API server. Should not include a trailing slash. Default http://localhost:4444
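The two connection settings above could be read in Python roughly as follows. This is a minimal sketch, not the project's actual configuration code: `load_settings` and `describe_database` are hypothetical helper names, and the Selenium default is assumed to be `http://localhost:4444` (with two slashes).

```python
import os
from urllib.parse import urlsplit

# Documented defaults; the Selenium value assumes the two-slash form of the URL.
DEFAULTS = {
    "DATABASE_URI": "postgres://postgress:postgress@localhost/postgres",
    "SELENIUM_URI": "http://localhost:4444",
}


def load_settings(environ=None):
    """Return both connection settings, falling back to the defaults above."""
    env = os.environ if environ is None else environ
    settings = {name: env.get(name, default) for name, default in DEFAULTS.items()}
    # SELENIUM_URI should not include a trailing slash, so strip one if present.
    settings["SELENIUM_URI"] = settings["SELENIUM_URI"].rstrip("/")
    return settings


def describe_database(uri):
    """Split a DATABASE_URI into the parts that usually need editing."""
    parts = urlsplit(uri)
    return {
        "user": parts.username,
        "host": parts.hostname,
        "database": parts.path.lstrip("/"),
    }


if __name__ == "__main__":
    current = load_settings()
    print(current["SELENIUM_URI"])
    print(describe_database(current["DATABASE_URI"]))
```

Passing a plain dictionary instead of `os.environ` makes it easy to check how a given set of variables would be interpreted without touching the real environment.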
.
+-- src
|   +-- client
|   |   +-- static
|   |       +-- index.htm       # react single page app
|   |       +-- favicon.ico
|   |       +-- main.js         # webpack bundle file
|   +-- server
|       +-- app.py              # flask application file
|       +-- conftest.py         # pytest configuration file
|       +-- routes
|           +-- scrapper        # scripts are stored here
+-- .browserlist                # configuration used by babel-loader
+-- .babelrc                    # babel-loader configuration file
+-- docker-compose.yml          # docker service configuration
+-- Dockerfile                  # docker file for flask container
+-- package.json                # node.js configuration
+-- setup.py                    # python configuration
+-- requirements.txt            # python configuration
React.js client files are found in the src/client directory. These are compiled using Webpack into the src/client/static directory. See README.md for more information.
Flask REST server files are found in the src/server directory. You can add new scripts by creating a folder in the src/server/routes/scraper directory. See README.md for more information.
docker run \
  --rm -d -p 4444:4444/tcp -p 5900:5900/tcp \
  --name selenium \
  -e SE_NODE_SESSION_TIMEOUT=240 \
  -e SE_NODE_MAX_SESSIONS=16 \
  -v /dev/shm:/dev/shm \
  selenium/standalone-chrome:91.0
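After the container starts, the Selenium server may need a few seconds before it accepts sessions. The sketch below polls the WebDriver `/status` endpoint until it reports ready. It assumes the server is reachable at `http://localhost:4444` (the port published above) and that the response follows the usual `{"value": {"ready": ...}}` shape; `is_ready` and `wait_for_selenium` are hypothetical helper names, not part of this project.

```python
import json
import time
import urllib.error
import urllib.request


def is_ready(status_payload):
    """Return True when a /status response body reports value.ready as true."""
    if not isinstance(status_payload, dict):
        return False
    value = status_payload.get("value")
    return isinstance(value, dict) and bool(value.get("ready"))


def wait_for_selenium(base_uri="http://localhost:4444", timeout=30.0, interval=1.0):
    """Poll base_uri/status until the server is ready or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_uri}/status", timeout=5) as response:
                if is_ready(json.load(response)):
                    return True
        except (urllib.error.URLError, OSError, ValueError):
            # Server not up yet, connection refused, or a non-JSON body: retry.
            pass
        time.sleep(interval)
    return False


if __name__ == "__main__":
    print("Selenium ready:", wait_for_selenium())
```

Running the script right after `docker run` gives a simple yes/no answer before starting the Flask server against the container.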
export NODE_ENV=development
export FLASK_ENV=development
source .env/bin/activate
npm run watch &
python -m flask run
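# Windows (PowerShell)
$env:NODE_ENV = "development"
$env:FLASK_ENV = "development"
# activate.bat only affects cmd.exe; PowerShell needs the .ps1 script
.env\Scripts\Activate.ps1
Start-Process -NoNewWindow npm -ArgumentList "run", "watch"
python -m flask run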