
PDS-Pipelines

A combination of PDS software for data integrity, universal planetary coordinates, ingestion, services (POW/MAP2), etc.


Using PDS-Pipelines Locally

PDS-Pipelines is not currently available via Anaconda or other distribution platforms, so it must be run from a clone of this repository. Running it locally requires Anaconda (conda) and Docker. This use case is largely developer-centric and gets technical with the tools used to run PDS-Pipelines.

First, cd into the cloned repository and run:

conda env create -f environment.yml

This will create a fresh environment called PDS-Pipelines. Then activate that environment with:

conda activate PDS-Pipelines

Then install PDS-Pipelines itself with:

python setup.py develop

Next, modify the config.py file within pds_pipelines and change the scratch variable to a path of your choosing. Usually this is just a path within the repo, but your "scratch" area can live anywhere on your machine. Inside this path, which we'll call "/Path/to/scratch" below, you also have to create a workarea folder.

First, set the scratch variable in the config.py file within pds_pipelines to "/Path/to/scratch":

scratch = '/Path/to/scratch/'

Then create the workarea folder under the "/Path/to/scratch" directory:

mkdir /Path/to/scratch/workarea

Next, we'll create the logs and output folders, which usually live at the root level of the PDS-Pipelines repository. Along with this, you will need to modify the root variable in the config.py file within pds_pipelines, setting it to the full path of the cloned PDS-Pipelines repo. In other words, set the following within config.py:

root = '/Path/from/root/to/repo/PDS-Pipelines'

Followed by:

mkdir /Path/from/root/to/repo/PDS-Pipelines/logs
mkdir /Path/from/root/to/repo/PDS-Pipelines/output

Then, set up the docker containers. Export the path to the on-disk database storage area using the environment variable name expected by the compose file in the containers directory, then bring the containers up:

cd containers
export <database-storage-variable>=/Path/to/the/database
docker-compose up

The above path should be a fresh, empty folder on a drive with plenty of free space. This folder will contain the on-disk storage for the database and allows you to load a previous database if one already exists there.
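
Before moving on, it is worth confirming that the database container actually came up and is listening on the port used later in this walkthrough. Below is a minimal, stdlib-only sketch, assuming the compose file exposes the database on localhost:5432 (the same host and port used by the psql commands later in this README):

import socket

# Sanity check: returns True if a TCP connection to host:port succeeds.
# Assumes the compose file maps the database to localhost:5432, matching the
# psql commands used later in this README.
def port_is_open(host="localhost", port=5432, timeout=3.0):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print("database reachable" if port_is_open() else "cannot reach localhost:5432 -- is docker-compose up?")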

From here you can do one of two things: start processing data, or manually create the databases that UPC, MAP2, and POW depend on. It's recommended that you start by manually creating the databases, as there are a few other idiosyncrasies that crop up while running locally.

First, run the following in a Python interpreter with the PDS-Pipelines conda environment activated:

from pds_pipelines.models import upc_models, pds_models
upc_models.create_upc_database()   # create the UPC database
pds_models.create_pds_database()   # create the DI (PDS) database

This creates both the DI database and the UPC database. From here things get a little more complicated. Within the PDS-Pipelines repo there is a file named PDSinfo.json, which maps various missions/instruments to known archives within the DI database. Here is an example record from the JSON file:

"mro_ctx":
{
    "archiveid": "16",
    "path": "/pds_san/PDS_Archive/Mars_Reconnaissance_Orbiter/CTX/",
    "mission": "CTX",
    "bandbinQuery" : "FilterName",
    "upc_reqs": ["/data/", ".IMG"]
}

As you can see, you probably don't have the above path on your computer. To gain access to these files, you will need to mirror the PDS_Archive that is maintained by various entities (such as the USGS). Some images that the PDS_Archive maintains can be found here. You will need to pull files for their respective missions/instruments into the path defined in PDSinfo.json. You can do this in one of two ways: either mirror the paths defined in PDSinfo.json for the mission you are working with, or update the path in PDSinfo.json to point to where you have downloaded the files. The latter is much easier, but will likely not be able to take advantage of some features supported by the pipelines.

Example (Using CTX and the above PDS nodes):

Navigate to the following:

https://pdsimage2.wr.usgs.gov/Missions/Mars_Reconnaissance_Orbiter/CTX/mrox_0602/data/

Pull an image from the archive; let's pick P20_008794_2573_XN_77N268W.IMG, the first image in the volume.

Now we make the above file structure:

mkdir -p /pds_san/PDS_Archive/Mars_Reconnaissance_Orbiter/CTX/mrox_0602/data/

Note that creating directories at the root level of the filesystem should only be done on a personal system.

Now we move the file we have pulled to this location:

mv /Your/Downloads/Folder/P20_008794_2573_XN_77N268W.IMG /pds_san/PDS_Archive/Mars_Reconnaissance_Orbiter/CTX/mrox_0602/data/

or

wget -O /pds_san/PDS_Archive/Mars_Reconnaissance_Orbiter/CTX/mrox_0602/data/P20_008794_2573_XN_77N268W.IMG https://pdsimage2.wr.usgs.gov/Missions/Mars_Reconnaissance_Orbiter/CTX/mrox_0602/data/P20_008794_2573_XN_77N268W.IMG

Alternative (Using CTX and the above PDS nodes):

Either download the file to your Downloads folder and move it to a new directory, or wget it directly into your new directory like so:

wget -O /Path/to/CTX/area/mrox_0602/data/P20_008794_2573_XN_77N268W.IMG https://pdsimage2.wr.usgs.gov/Missions/Mars_Reconnaissance_Orbiter/CTX/mrox_0602/data/P20_008794_2573_XN_77N268W.IMG

Then update the PDSinfo.json record to the following:

"mro_ctx":
{
    "archiveid": "16",
    "path": "/Path/to/CTX/area/",
    "mission": "CTX",
    "bandbinQuery" : "FilterName",
    "upc_reqs": ["/data/", ".IMG"]
}

In both cases we want to keep the "volume" (mrox_0602) and data directories because of the upc_reqs in the PDSinfo.json entry. The upc_reqs define what strings need to be present in a file's path for that file to be considered for ingestion into the DI database. For mro_ctx, these requirements ensure that only files whose paths contain a "/data/" directory and contain ".IMG" are ingested into the DI database. As such, to get our file into the DI database we can either retain the file structure or update the upc_reqs by removing "/data/" from the requirements. For the sake of simplicity, we are going to retain the file structure and try to minimize how much we change the PDSinfo.json file.
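
To make the upc_reqs behavior concrete, here is a small illustrative sketch (not part of the pipeline itself) that loads PDSinfo.json and applies the mro_ctx requirements to a candidate file path; adjust the path to PDSinfo.json to match where it lives in your clone:

import json

# Illustrative only: the real filtering happens inside the ingestion pipeline.
# Adjust this path to wherever PDSinfo.json lives in your clone of the repo.
with open("PDSinfo.json") as f:
    pds_info = json.load(f)

entry = pds_info["mro_ctx"]
candidate = entry["path"] + "mrox_0602/data/P20_008794_2573_XN_77N268W.IMG"

# A file is only considered for DI ingestion if every upc_reqs string appears in its path.
eligible = all(req in candidate for req in entry["upc_reqs"])
print(candidate, "->", "eligible" if eligible else "skipped")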

If you have done either of the above two options, you should now be able to run ingest_queueing.py followed by ingest_process.py. First, cd to the root of the repository:

cd /Path/to/repo/PDS-Pipelines

Then you should be able to run the following:

python pds_pipelines/ingest_queueing.py -a mro_ctx
python pds_pipelines/ingest_process.py

Now, let's interrogate the database to see if our file is there:

psql -h localhost -p 5432 di_test postgres

This connects us to our local DI database. Then run the following to see what records are now documented within it:

select * from files;

There should be one record in the table with mrox_0602/data/P20_008794_2573_XN_77N268W.IMG in the filename field, and the upc_required field should be set to t for true. This means the file is recognized as a UPC-eligible file that can be processed into the UPC database.
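
If you prefer to verify this from Python instead of psql, the same check can be done with a Postgres driver such as psycopg2 (assumed to be installed in your environment; the connection parameters mirror the psql command above and may additionally need a password depending on your compose configuration):

import psycopg2

# Connection parameters mirror the psql command above; add a password argument
# if your compose configuration requires one.
conn = psycopg2.connect(host="localhost", port=5432, dbname="di_test", user="postgres")
with conn, conn.cursor() as cur:
    # Look up our CTX image and check whether it was flagged as UPC-eligible.
    cur.execute(
        "SELECT filename, upc_required FROM files WHERE filename LIKE %s",
        ("%P20_008794_2573_XN_77N268W.IMG",),
    )
    for filename, upc_required in cur.fetchall():
        print(filename, upc_required)  # expect upc_required to be True
conn.close()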

From here you will need to install ISIS, as UPC processing relies heavily on ISIS to generate the necessary data for the UPC database. Follow the tutorials presented by the ISIS team here. As a recommendation, install ISIS into a separate conda environment, then point PDS-Pipelines at that environment.

The rest of this walkthrough assumes you have ISIS installed correctly and have set ISISROOT to the conda environment created above. You will also need the ISIS data area, which can be obtained from here; set ISIS_DATA to the location the data was pulled into.
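
Before kicking off UPC processing, it can help to double-check that both environment variables are visible to the shell or interpreter you will run the pipelines from. A quick stdlib-only check:

import os

# ISISROOT and ISIS_DATA are the two variables named above; UPC processing
# will not be able to find ISIS or its data area without them.
for var in ("ISISROOT", "ISIS_DATA"):
    value = os.environ.get(var)
    print(f"{var} = {value}" if value else f"{var} is not set")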

Now, you should be able to run the following to generate an entry into the UPC database:

python pds_pipelines/upc_queueing.py -a mro_ctx
python pds_pipelines/upc_process.py

This may take some time as the image we are working with is quite large. Once this is finished, we can interrogate the database similarly to how we examined the di_test database:

psql -h localhost -p 5432 upc_test postgres

Again, this connects us to the associated database. Then we run:

select * from datafiles;

There should be one record whose source field contains mrox_0602/data/P20_008794_2573_XN_77N268W.IMG. If that is true, then success! You have ingested a record from start to finish into the local UPC database.
