Sample Scrapy project integrated with AWS Step Functions to trigger all Lambdas at once, then save the results to an AWS S3 bucket.
References:
- Another Way to Trigger a Lambda Function Every 5–10 Seconds
- Serverless Scraping with Scrapy, AWS Lambda and Fargate – a guide (with a couple of modifications to make Scrapy work with AWS Lambda)
Requirements:
- Docker (on non-Linux environments, for building Python packages compatible with the AWS Lambda environment)
- Python 3.9
- Pipenv
- Node.js 16
- AWS CLI + AWS profile
- Serverless CLI 3.22
Python packages are managed by Pipenv. Run pipenv install to install the required packages, and pipenv shell to start a Python development environment with those packages available.
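For reference, a minimal Pipfile for such a project might look like the sketch below. The exact package list and versions here are assumptions; the Pipfile checked into this repository is authoritative.

```toml
[[source]]
url = "https://pypi.org/simple"
verify_ssl = true
name = "pypi"

[packages]
# Assumed core dependencies: Scrapy for crawling, boto3 for S3 uploads.
scrapy = "*"
boto3 = "*"

[requires]
python_version = "3.9"
```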
This repository is already a Scrapy project, so any Scrapy command can be used. For example, to crawl a spider that is already defined:
scrapy crawl quotes -o test.json
We can test the Lambda function by invoking it locally:
serverless invoke local -f scrape_quotes
Change the stage of the deployment in the serverless.yml file.
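As a sketch, the relevant section of serverless.yml looks something like the following. The service name and region here are illustrative assumptions, not values taken from this repository.

```yaml
service: scrapy-scraper   # illustrative service name

provider:
  name: aws
  runtime: python3.9
  stage: dev          # change this to deploy a different stage
  region: us-east-1   # assumption; set your own region
```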
With a configured AWS CLI profile, the serverless deployment can be done by running:
serverless deploy
The deployed serverless stack can be removed with:
serverless remove
Or delete the corresponding stack on CloudFormation.
All buckets created by the deployment must be empty before the resources can be removed; if removal fails because of non-empty buckets, empty them and run the removal again.
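For example, a results bucket can be emptied from the command line before retrying the removal. The bucket name below is a placeholder, not the actual bucket created by this project.

```shell
# Empty the bucket (placeholder name) so CloudFormation can delete it,
# then retry the stack removal.
aws s3 rm s3://my-scrape-results-bucket --recursive
serverless remove
```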