Scrapex is a versatile scraping component designed to efficiently extract content from URLs. Leveraging the power of Playwright and Chrome, it ensures seamless support for Single Page Applications (SPAs) and content dependent on JavaScript execution. Initially developed for internal use by our AI Agents, Scrapex offers robust functionality for a wide range of scraping needs.
- Support for Multiple Output Formats: Scrapex can output data in HTML, Markdown, or PDF formats, catering to diverse requirements.
- Container image deployment: For ease of deployment and scalability, Scrapex is fully compatible with container environments such as Docker or Kubernetes.
- Customizable Settings: Through environment variables, as well as parameters in the extraction call, users can tailor the behavior of Scrapex to suit their specific scraping tasks.
Scrapex supports the following output formats:
- HTML: Direct extraction of HTML content.
- Markdown: Conversion of HTML to Markdown using `html-to-md`.
- PDF: Generation of PDF documents using Playwright's PDF functionality.
Configure Scrapex using the following environment variables:
| Variable | Description | Default |
|---|---|---|
| `PORT` | Port on which the Node.js server listens | `3000` |
| `DEFAULT_WAIT` | Default milliseconds to wait on page load | `0` |
| `DEFAULT_USER_AGENT` | Default user agent for requests | `"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"` |
| `LOG_LEVEL` | Logging level (`debug`, `info`, `warn`, `error`) | `debug` |
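As an illustration of how these variables and their documented defaults could be consumed in a Node.js service (a sketch only, not Scrapex's actual source):

```javascript
// Illustrative sketch: reading the environment variables from the table above,
// falling back to their documented defaults. Not Scrapex's actual source code.
function readConfig(env) {
  return {
    port: Number(env.PORT ?? 3000),
    defaultWait: Number(env.DEFAULT_WAIT ?? 0),
    defaultUserAgent:
      env.DEFAULT_USER_AGENT ??
      "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    logLevel: env.LOG_LEVEL ?? "debug",
  };
}

// With no variables set, the documented defaults apply:
const config = readConfig({});
```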
The simplest way to run Scrapex is using Docker. Here's an example `docker-compose.yaml`:

```yaml
version: "3"
services:
  app:
    container_name: scrapex
    image: ghcr.io/cloudx-labs/scrapex:latest # it's better to pin a specific release version such as v0.1
    environment:
      - TZ=America/Argentina/Buenos_Aires
      - PORT=3000
      - LOG_LEVEL=debug
    ports:
      - "3003:3000"
```
To test Scrapex, you can send a request using curl as shown below:

```shell
curl --location 'http://localhost:3003/extract' \
--header 'Content-Type: application/json' \
--data '{
    "url": "https://en.wikipedia.org/wiki/Six_Degrees_of_Kevin_Bacon",
    "outputType": "pdf",
    "wait": 0,
    "userAgent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "settings": {
        "pdf": {
            "options": {
                "format": "A4"
            }
        }
    }
}'
```
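The same request can also be issued from Node.js. Below is a minimal sketch using the built-in `fetch` of Node.js 18+; the endpoint, port, and payload fields are taken from the curl example above:

```javascript
// Minimal sketch of calling the /extract endpoint from Node.js 18+.
// Payload fields mirror the curl example above.
function buildExtractRequest(url, outputType) {
  return {
    url,
    outputType, // "html", "md", or "pdf"
    wait: 0,
    settings:
      outputType === "pdf" ? { pdf: { options: { format: "A4" } } } : {},
  };
}

async function extract(url, outputType) {
  const res = await fetch("http://localhost:3003/extract", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildExtractRequest(url, outputType)),
  });
  return res; // HTML/Markdown comes back as text, PDF as binary
}
```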
The following table describes the parameters included in the payload of the curl example:

| Parameter | Description | Example |
|---|---|---|
| `url` | URL of the page to scrape | `https://en.wikipedia.org/wiki/Six_Degrees_of_Kevin_Bacon` |
| `outputType` | Desired output format | `html` / `md` / `pdf` |
| `wait` | Milliseconds to wait before extraction | `2000` |
| `userAgent` | User agent to use for the request | `Mozilla/5.0 (Windows NT 10.0; Win64; x64)...` |
| `settings` | Additional settings for output formatting | `{ "pdf": { "options": { "format": "A4" } } }` |
All available values for `settings -> pdf -> options` can be found at: https://playwright.dev/docs/api/class-page#page-pdf
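For example, a `settings` payload combining a few of the documented Playwright `page.pdf` options (`format`, `landscape`, `printBackground`, `margin`) might look like this; the values chosen here are purely illustrative:

```javascript
// Example settings payload for PDF output. The option names (format,
// landscape, printBackground, margin) are documented Playwright page.pdf
// options; the values are illustrative.
const settings = {
  pdf: {
    options: {
      format: "A4",
      landscape: false,
      printBackground: true,
      margin: { top: "1cm", bottom: "1cm", left: "1cm", right: "1cm" },
    },
  },
};
```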
All available values for `settings -> md -> options` can be found at: https://github.com/stonehank/html-to-md/blob/master/README-EN.md