The crawler finds and retrieves all the publiccode.yml files from the organizations registered on GitHub/Bitbucket/GitLab listed in the whitelists, and then generates YAML files that are later used by the Jekyll build chain to generate the static pages of developers.italia.it.
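For context, a `publiccode.yml` file in a crawled repository looks roughly like this (an abridged, illustrative sketch with made-up values; the full set of required fields is defined by the publiccode.yml standard and validated by publiccode-parser-go, linked below):

```yaml
# Abridged publiccode.yml sketch: values are illustrative,
# and several fields required by the standard are omitted.
publiccodeYmlVersion: "0.2"
name: Example Application
url: "https://github.com/example-org/example-app"
releaseDate: "2019-01-01"
platforms:
  - web
developmentStatus: stable
softwareType: standalone/web
legal:
  license: AGPL-3.0-or-later
description:
  en:
    shortDescription: A short, one-line description of the software.
```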
The backend stack consists of:

- Elasticsearch for storing the data
- Kibana for internal visualization of data
- Prometheus for collecting metrics
Requirements:

- Docker
- Docker Compose
- Go >= 1.11
Setup:

- Rename `.env.example` to `.env` and fill in the variables with your values
- The default Elasticsearch user and password are `elastic:elastic`
- The default Kibana user and password are `kibana:kibana`
- Rename `elasticsearch/config/searchguard/sg_internal_users.yml.example` to `elasticsearch/config/searchguard/sg_internal_users.yml` and insert the correct passwords. Hashed passwords can be generated with:

  ```shell
  docker exec -t -i developers-italia-backend_elasticsearch elasticsearch/plugins/search-guard-6/tools/hash.sh -p <password>
  ```
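  The resulting hash goes into the corresponding user entry. As a sketch (the exact layout follows the `sg_internal_users.yml.example` file shipped with the repository):

  ```yaml
  # Sketch of a sg_internal_users.yml entry; replace the hash with
  # the output of hash.sh for your chosen password.
  elastic:
    hash: $2y$12$REPLACE_WITH_GENERATED_HASH
  ```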
- Insert the `kibana` password in `kibana/config/kibana.yml`
- Configure the Nginx proxy for the Elasticsearch host with the following directives:

  ```nginx
  limit_req_zone $binary_remote_addr zone=elasticsearch_limit:10m rate=10r/s;

  server {
      ...
      location / {
          limit_req zone=elasticsearch_limit burst=20 nodelay;
          proxy_set_header X-Real-IP $remote_addr;
          proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
          proxy_set_header X-Forwarded-Proto $scheme;
          proxy_pass http://localhost:9200;
          proxy_ssl_session_reuse off;
          proxy_cache_bypass $http_upgrade;
          proxy_redirect off;
      }
  }
  ```
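  Here `limit_req_zone` tracks clients by IP address in a 10 MB zone and caps them at 10 requests per second, while `burst=20 nodelay` absorbs short spikes without queuing delay; this keeps the publicly exposed Elasticsearch endpoint from being flooded.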
- You might need to run `sysctl -w vm.max_map_count=262144` (and make this setting permanent in `/etc/sysctl.conf`) in order to start Elasticsearch, as documented in the Elasticsearch Docker guide: https://www.elastic.co/guide/en/elasticsearch/reference/current/docker.html#docker-cli-run-prod-mode
- Start the Docker stack: `make up`
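Once the stack is up, a quick way to verify that Elasticsearch is answering (assuming the default `elastic:elastic` credentials above and plain HTTP on port 9200) is:

```shell
# Should print the Elasticsearch cluster banner as JSON
curl -u elastic:elastic http://localhost:9200/
```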
To set up and run the crawler, move into the `crawler` directory (`cd crawler`) and:
- Fill in your `domains.yml` file with configuration values (like host-specific basic auth tokens)
- Rename `config.toml.example` to `config.toml` and fill in the variables
- Build the crawler binary: `make`
- Start the crawler: `bin/crawler crawl whitelist/*.yml`
- Configure the following commands in crontab as desired (see the sketch after this list):
  - `bin/crawler updateipa` downloads IPA data and writes it into Elasticsearch
  - `bin/crawler download-whitelist` downloads orgs and repos from the onboarding portal and writes them to a whitelist file
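As a sketch, the crontab entries could look like this (schedules and installation paths are placeholders; adapt them to your deployment):

```shell
# Refresh IPA data every night at 02:00
0 2 * * * /opt/developers-italia-backend/crawler/bin/crawler updateipa
# Refresh the whitelist from the onboarding portal at 02:30
30 2 * * * /opt/developers-italia-backend/crawler/bin/crawler download-whitelist
# Run a full crawl at 03:00
0 3 * * * /opt/developers-italia-backend/crawler/bin/crawler crawl /opt/developers-italia-backend/crawler/whitelist/*.yml
```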
Troubleshooting:

- From the Docker logs it seems that the Elasticsearch container needs more virtual memory and keeps printing `Stalling for Elasticsearch....`. Increase the container's virtual memory as described at https://www.elastic.co/guide/en/elasticsearch/reference/current/docker.html#docker-cli-run-prod-mode
- When trying to `make build` the crawler image, a fatal memory error occurs: `fatal error: out of memory`. You probably need to increase the memory available to the Docker machine:

  ```shell
  docker-machine stop && VBoxManage modifyvm default --cpus 2 && VBoxManage modifyvm default --memory 2048 && docker-machine start
  ```
- In order to access Elasticsearch with write permissions from the outside, you can forward port 9200 via SSH using `ssh -L9200:localhost:9200` and configure `ELASTIC_URL = "http://localhost:9200/"` in your local `config.toml`.
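  With the tunnel in place, you can check what the crawler has written, for instance by listing the indices (again assuming the default credentials from above):

  ```shell
  curl -u elastic:elastic "http://localhost:9200/_cat/indices?v"
  ```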
See also:

- publiccode-parser-go: the Go package for parsing publiccode.yml files
- developers-italia-onboarding: the onboarding portal
Developers Italia is a project by AgID and the Italian Digital Team, which developed the crawler and maintains this repository.