This is a simple implementation of a web crawler in Go. It scans all the URLs on a given domain and returns them as a JSON response.
- Scrapes all the URLs on the given domain
- Asynchronous scraping (a rough sketch follows this list)
- Uses the fast Gin framework to run an HTTP REST server
- No support for multi-domain scraping
- No support for web pages behind authentication/forms
- No support for dynamic/AJAX-based web pages
- No support for invisible scraping
- No caching support
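To give a rough idea of the core approach, the hedged sketch below shows single-domain, asynchronous URL collection with Colly (one of the project's dependencies). This is not the project's actual code, and the domain is only an example:

```go
package main

import (
	"fmt"

	"github.com/gocolly/colly"
)

func main() {
	// Limit the crawl to a single domain and run requests asynchronously.
	c := colly.NewCollector(
		colly.AllowedDomains("wiprodigital.com", "www.wiprodigital.com"),
		colly.Async(true),
	)

	// For every link found on a visited page, print it and queue it for visiting.
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		link := e.Request.AbsoluteURL(e.Attr("href"))
		fmt.Println(link)
		e.Request.Visit(link)
	})

	c.Visit("https://wiprodigital.com")
	c.Wait() // block until all asynchronous requests have finished
}
```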
- Go version 1.12 or above
To build the source, Go must be installed with GOROOT and GOPATH set correctly. Read this to set up your Go environment. Once the setup is complete, clone the repository or put the source directory into $GOPATH/src/.
Now run `go get ./...` inside the source directory (web-crawler). This will download all the dependencies into $GOPATH/pkg/. If, in rare cases, this does not work, install the dependencies individually:
go get -u github.com/gin-gonic/gin
go get -u github.com/gocolly/colly/...
go get github.com/gin-contrib/gzip
After the dependencies are installed, run `go build` from inside the source directory. This will create an executable for the host OS. To run the crawler service, use:
LINUX/MAC: $ ./web-crawler
WINDOWS: > web-crawler.exe
If you want to build for a different target platform, use the corresponding command:
WINDOWS(64-bit): env GOOS=windows GOARCH=amd64 go build
LINUX(64-bit): env GOOS=linux GOARCH=amd64 go build
MAC(64-bit): env GOOS=darwin GOARCH=amd64 go build
By default, the Gin mode is set to 'DebugMode' so that all the registered endpoints are listed when you run the executable. This can be changed via the Environment var in config.go.
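For illustration, the mode switch might look like the sketch below. The `Environment` constant and its values are assumptions about config.go; only `gin.SetMode` and the mode constants come from Gin itself:

```go
// Sketch only: the real config.go may name and read this setting differently.
package main

import "github.com/gin-gonic/gin"

// Environment is assumed to select the Gin mode; set it to "release" to
// hide the debug endpoint listing.
const Environment = "debug"

func configureGin() {
	if Environment == "release" {
		gin.SetMode(gin.ReleaseMode)
	} else {
		gin.SetMode(gin.DebugMode)
	}
}
```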
To run the tests, use `go test` from inside the source directory.
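As an example of what such a test can look like, the sketch below drives a Gin router through `net/http/httptest`. The `setupRouter` helper is hypothetical and stands in for however the project actually registers its routes:

```go
package main

import (
	"net/http"
	"net/http/httptest"
	"testing"

	"github.com/gin-gonic/gin"
)

// setupRouter is a hypothetical stand-in for the project's real route setup.
func setupRouter() *gin.Engine {
	r := gin.New()
	r.GET("/crawl", func(c *gin.Context) {
		c.JSON(http.StatusOK, gin.H{"status": "ok"})
	})
	return r
}

func TestCrawlEndpointResponds(t *testing.T) {
	r := setupRouter()
	req := httptest.NewRequest(http.MethodGet, "/crawl", nil)
	req.Header.Set("Scrape", "https://wiprodigital.com")
	w := httptest.NewRecorder()
	r.ServeHTTP(w, req)
	if w.Code != http.StatusOK {
		t.Fatalf("expected status 200, got %d", w.Code)
	}
}
```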
Once the service is running, it exposes a `GET /crawl` endpoint. Create a request as follows:
REQUEST: GET localhost:8888/crawl
HEADER: "Scrape": "https://wiprodigital.com"
Using the Postman client on an 8-core machine running Windows 10 with 8 GB of memory, it takes about 8.6 seconds to crawl the 226 URLs of https://wiprodigital.com and build the JSON response. There is no guarantee that the first run returns all 226 URLs; tests show it takes 2-4 initial runs to produce a consistent result, depending on parameters such as limits on the domain, NS, etc.
The limitations mentioned will be removed in the next release(s).