docker-nba-etl

Running

This Docker image can be used to run an import using the NBA ETL module. By default, the import data is not in the container, so you should bind-mount it.

docker run --rm \
  -v /path/to/your/data:/payload/data \
  -v /path/for/logs:/payload/software/log \
  --link your-es-container:es docker-nba-etl:V2.3 ./import-all

Your data directory (/path/to/your/data) should contain the following folders:

brahms
col
crs
geo
medialib
ndff
nsr
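
A quick pre-flight check for the folder layout listed above could look like the sketch below. The function name missing_sources is hypothetical and is not part of the image; it only prints the expected folders that are absent from the directory you pass in.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with the image): print every expected
# source folder that is missing from the data directory given as $1.
missing_sources() {
    for name in brahms col crs geo medialib ndff nsr; do
        [ -d "$1/$name" ] || echo "$name"
    done
}

# Example use before starting the import:
#   missing_sources /path/to/your/data
```

No output means all seven folders are present.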

You can also override settings with environment variables (from version V2.3-20-g461dab56 and up). The available variables, with their defaults, are:

ES_DNS=es
DEFAULT_SHARDS=12
NUM_REPLICAS=0
NBA_INDEX_NAME=nba
COL_YEAR=2016
PURL_BASE_URL=''
TEST_GENERA=#test_genera=malus,parus,larus,bombus,rhododendron,felix,tulipa,rosa,canis,passer,trientalis
DISABLE_TRUNCATE=FALSE
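
One way to pass several of these overrides at once is a Docker env file, sketched below. The helper name write_etl_env and the values NUM_REPLICAS=1 and NBA_INDEX_NAME=nba_test are illustrative assumptions; only the variable names come from the list above. The --env-file option is a standard docker run flag.

```shell
#!/bin/sh
# Hypothetical helper: write a Docker env file to the path given as $1.
# The values are examples only; adjust them to your own setup.
write_etl_env() {
    cat > "$1" <<'EOF'
ES_DNS=es
NUM_REPLICAS=1
NBA_INDEX_NAME=nba_test
EOF
}

# Example use (assumes the image tag mentioned above):
#   write_etl_env ./etl.env
#   docker run --rm --env-file ./etl.env \
#     -v /path/to/your/data:/payload/data \
#     --link your-es-container:es docker-nba-etl:V2.3-20-g461dab56 ./import-all
```

Variables left out of the file keep their defaults.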

You also need an Elasticsearch 5.1.2 instance.

Before running Elasticsearch, make sure your vm.max_map_count setting is correct. To check it, run:

/sbin/sysctl vm.max_map_count

It should be 262144 or higher. To modify it, run:

/sbin/sysctl -w vm.max_map_count=262144
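
The check can also be scripted. The sketch below assumes a hypothetical helper name, map_count_ok, which takes the current value as an argument and succeeds only when it meets the 262144 minimum stated above.

```shell
#!/bin/sh
# Hypothetical helper: succeed when the value in $1 is at least 262144.
map_count_ok() {
    [ "$1" -ge 262144 ]
}

# Example use (sysctl -n prints only the value):
#   if ! map_count_ok "$(/sbin/sysctl -n vm.max_map_count)"; then
#       echo "vm.max_map_count is too low for Elasticsearch" >&2
#   fi
```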

Run Elasticsearch 5.1.2:

docker run --name my-es5-01 \
  -d -e ES_JAVA_OPTS="-Xms512m -Xmx512m" \
  elasticsearch:5.1.2 elasticsearch \
    -Ecluster.name="nba-cluster" \
    -Enetwork.host="_site_"

In a full NBA cluster, Elasticsearch should already be set up. If you want your own Elasticsearch server, make sure you have enough disk space and memory, and adjust the -Xms and -Xmx settings. The values of -Xms and -Xmx should be the same.
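
To keep -Xms and -Xmx identical, the ES_JAVA_OPTS value can be derived from a single heap size. This is a minimal sketch; the helper name es_java_opts and the 2g heap size in the usage comment are assumptions for illustration.

```shell
#!/bin/sh
# Hypothetical helper: build an ES_JAVA_OPTS value with equal -Xms and -Xmx
# from one heap size argument (for example 512m or 2g).
es_java_opts() {
    printf '%s\n' "-Xms$1 -Xmx$1"
}

# Example use:
#   docker run --name my-es5-01 -d \
#     -e ES_JAVA_OPTS="$(es_java_opts 2g)" \
#     elasticsearch:5.1.2 elasticsearch \
#       -Ecluster.name="nba-cluster" \
#       -Enetwork.host="_site_"
```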

Running it with auto import

You can now also run it in a fully automated way, with automatic download of the source data. To do so, set the following values (defaults are shown unless noted otherwise):

AUTO_IMPORT=TRUE (default false)
IMPORT_DATA_DIR=/payload/data (do not change this unless you are sure)
IMPORT_COMMAND=./import-all
GIT_URL_PREFIX=https://github.com/naturalis/
CONSOLE_LOG=TRUE (default is false)
REPOS="nba-brondata-nsr:master,nba-brondata-medialib:master,nba-brondata-crs:master,nba-brondata-col:master,nba-brondata-brahms:master,nba-brondata-geo:master"

So running:

docker run --name my-import-job -e AUTO_IMPORT=TRUE docker-nba-etl:V2.3-20-g461dab56

will download the data for nsr, medialib, crs, col, brahms and geo and run ./import-all.

The directory each repository is cloned into is /payload/data plus the last part of the Git repository name. So nba-brondata-nsr will be cloned to /payload/data/nsr.

You can change the repository branch or tag, for example nba-brondata-media:prisma-test. If you want to download multiple repositories, separate them with a comma.
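
The mapping from a REPOS value to clone URLs, branches and target directories can be sketched as follows. This is an illustration of the rules described above, not the image's actual startup script; the function name plan_clones is hypothetical, and the sketch assumes every entry has the form name:ref.

```shell
#!/bin/sh
# Hypothetical helper: for each comma-separated "name:ref" entry in $1,
# print "<clone url> <branch or tag> <target directory>".
plan_clones() {
    prefix=${GIT_URL_PREFIX:-https://github.com/naturalis/}
    data_dir=${IMPORT_DATA_DIR:-/payload/data}
    old_ifs=$IFS
    IFS=,
    for entry in $1; do
        name=${entry%%:*}   # part before the colon: repository name
        ref=${entry#*:}     # part after the colon: branch or tag
        short=${name##*-}   # last dash-separated part, e.g. "nsr"
        printf '%s%s %s %s/%s\n' "$prefix" "$name" "$ref" "$data_dir" "$short"
    done
    IFS=$old_ifs
}

# Example use:
#   plan_clones "nba-brondata-nsr:master,nba-brondata-geo:master"
```

Each output line corresponds to one repository that would be cloned.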

For example, to import only Geo, run:

docker run --name geo-import-job \
    -e REPOS="nba-brondata-geo:master" \
    -e IMPORT_COMMAND="./bootstrap GeoAreas && ./geo-import" \
    -e AUTO_IMPORT="TRUE" \
    --rm \
    -d \
    -v /local/path/for/logs:/payload/software/log \
    --link <your-es-container>:es \
    docker-nba-etl:V2.4

To find <your-es-container>, run docker ps and look for the name of the Elasticsearch container.

The container will then download the Geo area data and run the import. After the import finishes, the container is stopped and deleted. The logs are saved at /local/path/for/logs (change this to a path that exists on your host).
