
Refresh development docs #1441

Open · wants to merge 10 commits into `main`
Conversation

pkalita-lbl (Collaborator)

Summary

These changes are intended to fill in a few holes in the development docs and to adjust the Docker Compose configuration to better support local ingest development.

Details

I recently tried to debug an ingest issue locally and discovered that I needed a few auxiliary files in order for the KEGG step of the ingest to complete. That step looks for files stored in `/data/ingest`. To address that, these changes:

  • Update `docker-compose.yml` to mount a `data/ingest` directory relative to the project root to `/data/ingest` in the backend container
  • Update `.gitignore` to exclude the `data` directory
  • Add instructions to `development.md` for making a local copy of the necessary files (step 4 of the "Running ingest" section)

Next, I wanted to simplify the SSH tunneling required for ingest. Previously the development docs suggested running a separate Docker container to establish the tunnel inside the Docker Compose network. I've never actually done that myself. I think a simpler setup is to establish the tunnel to your local machine (then you can also use e.g. Compass or Studio 3T to query MongoDB interactively) and then use the special `host.docker.internal` DNS name to reach that tunnel from within a Docker container. This is reflected in additions to `.env.example` and step 3 of the "Running ingest" section.
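As a sketch of that local-tunnel setup: the port numbers below are the ones quoted from the docs later in this conversation, and the prod MongoDB hostname is the one from the existing tunnel command quoted further down; the dev hostname and the SSH gateway host are placeholders — check `.env.example` and the development docs for the real values.

```shell
# Forward two local ports through an SSH gateway to the remote MongoDB
# instances. 37018 (dev) and 37019 (prod) match the development docs.
# <dev-mongo-host> and <nersc-ssh-host> are placeholders.
ssh -L 37018:<dev-mongo-host>:27017 \
    -L 37019:mongo-loadbalancer.nmdc.production.svc.spin.nersc.org:27017 \
    "${NERSC_USER}@<nersc-ssh-host>"
```

With the tunnel up, your own machine reaches the databases at `localhost:37018`/`localhost:37019`, and a container can reach the same tunnel via `host.docker.internal`.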

Since a lot of things (fetching database backups for `load-db`, fetching KEGG support files, establishing MongoDB tunnels) rely on NERSC access, I added a new `NERSC_USER` environment variable to `.env.example` and utilize that variable in various commands. I also added a "NERSC Credentials" section to the development docs with information on getting that set up.
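For example, the new entries in a local `.env` might look like this (the username value is a placeholder; `NMDC_MONGO_HOST` is the variable called out in the docs below):

```
# NERSC username used for SSH tunnels, database backups, and KEGG support files
NERSC_USER=your_nersc_username

# Set this when ingesting through a tunnel established on your local machine
NMDC_MONGO_HOST=host.docker.internal
```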

> That command will set up SSH port forwarding such that your computer can access the dev MongoDB server at `localhost:37018` and the prod MongoDB server at `localhost:37019`.

> From within a Docker container `host.docker.internal` can be used to access the `localhost` of your computer. When ingesting from the dev or prod MongoDB instances, be sure to set `NMDC_MONGO_HOST=host.docker.internal` in your `.env` file.
eecavanna (Collaborator) — Nov 12, 2024


Suggested change (add a comma after "Docker container"):

> From within a Docker container, `host.docker.internal` can be used to access the `localhost` of your computer. When ingesting from the dev or prod MongoDB instances, be sure to set `NMDC_MONGO_HOST=host.docker.internal` in your `.env` file.

If I remember correctly, `host.docker.internal` is not available as a special hostname in Docker environments running on Linux hosts (although it definitely is for Docker environments on Mac hosts). I will check my notes and report back.

Collaborator

Correction: `host.docker.internal` can be used in a Docker container running on a Linux host, provided the container has been configured appropriately.

Here's an example of where someone configured a container appropriately:

https://github.com/microbiomedata/nmdc-edge/blob/1c6bc57a6e9969b18c6a4c03a65417809b13a284/docs/docker-compose.prod.yml#L58-L61

I think some of the Kitware team members will be familiar with the situation as I think they use Linux machines for development.
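The linked nmdc-edge configuration uses Docker Compose's `extra_hosts` mapping. A minimal sketch of the same idea (the `backend` service name is assumed to match this repo's `docker-compose.yml`):

```yaml
services:
  backend:
    extra_hosts:
      # On Linux, maps host.docker.internal to the host's gateway IP;
      # Docker Desktop (Mac/Windows) provides this name automatically.
      - "host.docker.internal:host-gateway"
```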

pkalita-lbl (Collaborator, Author) — Nov 12, 2024

Oh that's good to know. It's something that I've always used, but then again I've always worked on a Mac. I am just now realizing that the documentation I referenced in the original PR description is in the Docker Desktop section. So maybe it's not OS-dependent, but rather whether or not you're using Docker Desktop?

Either way, can someone who develops on Linux weigh in here? Will this work or do we need to refine it?

Collaborator

I develop on Linux (Ubuntu 22.04) and have struggled with `host.docker.internal` in the past, although I never knew how to configure it "correctly," so take that with a grain of salt (thanks @eecavanna).

Previously I used the following command to connect to Mongo:

```
docker run --rm -it -p 27017:27017 --network nmdc-server_default --name tunnel \
  kroniak/ssh-client ssh -o StrictHostKeyChecking=no \
  -L 0.0.0.0:27017:mongo-loadbalancer.nmdc.production.svc.spin.nersc.org:27017 \
  [email protected] '/bin/bash -c "while [[ 1 ]]; do echo heartbeat; sleep 300; done"'
```

I can test using `host.docker.internal` combined with `extra_hosts` to see if there's an issue with that.

pkalita-lbl (Collaborator, Author)

Thanks, yeah, please do let me know if you can get `host.docker.internal` working for you (and if so, what else you had to do to make it work).

IMO it's maybe more useful to establish the tunnel locally (as opposed to establishing it in the `nmdc-server_default` Docker network) so that you can use tools like Compass or Studio 3T or (if you're like me) PyCharm itself to interact with Mongo. A lot of NMDC team members (ones who don't work on nmdc-server) do that already. So this was a bit of a stab at team-wide consistency.

eecavanna (Collaborator) left a comment

Thanks for bringing this document up to date and for adding documentation to the example environment configuration file.

I made some suggestions related to terminology consistency. I don't see any deal-breakers, and so am comfortable with this branch being merged in as is. You can "take or leave" any of the suggestions I left.

Comment on lines +66 to +68
### MongoDB Credentials

To connect to the dev or prod MongoDB instances for ingest, you will need your own credentials. If you do not have them, ask a team member to create accounts for you. Then add the credentials to your `.env` file.
Collaborator

I've been using the `org.microbiomedata.data_reader` user for all of my Mongo needs. Do others have their own user? Do those users have additional permissions?

pkalita-lbl (Collaborator, Author)

I have a `pkalita` user on the dev and prod Mongo instances, so I've never used `org.microbiomedata.data_reader`. But now that you mention it, I do recall that account exists. @eecavanna do you have any suggestions here? Should the generic recommendation be to request and use a personal account, or to use the `org.microbiomedata.data_reader` account?

Comment on lines +156 to +160
<!-- TODO: Consider adding `--build` to this command so that Docker Compose builds
the containers, rather than pulling from GHCR (unless you
want to use the versions that happen to currently be on GHCR).
This has to do with the fact that the `docker-compose.yml` file
contains service specs having both an `image` and `build` section. -->
Collaborator

The pulled versions would be the ones that correspond to the `main` branch, no? So for new developers setting up their environment for the first time, I think it'd be okay to omit `--build` from this section, especially since you call out some places further down where rebuilding would be useful/necessary.

pkalita-lbl (Collaborator, Author)

This comment came about when Eric and I were helping Yan get started with nmdc-server; we both realized we weren't entirely sure we understood the behavior of having both `build` and `image` in the definition of a Docker Compose service. But yeah, I think what you said is what I would expect.
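For reference, the two behaviors being discussed roughly map onto these commands (a sketch; `backend` is the service name used elsewhere in this conversation):

```shell
# With both `image:` and `build:` present, plain `up` uses the pulled
# image from GHCR when one is available locally:
docker-compose pull backend
docker-compose up -d backend

# Passing --build forces a local build from the `build:` section instead:
docker-compose up -d --build backend
```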

```
docker-compose run backend nmdc-server ingest -vv --function-limit 100
```

> **Note**: The `--function-limit` flag is optional. It is used to reduce the time that the ingest takes by limiting the number of certain types of objects loaded. This can be useful for testing purposes.
Collaborator

You can skip the gene function ingest altogether with the `--skip-annotation` flag. The only thing this affects in the development environment is the gene function search. I haven't actually run an ingest from a remote Mongo database in a while, so I don't know what the time delta is between skipping annotation ingest entirely, limiting to something like 100, or a full ingest. I'm wondering if anyone has an idea of the time recently spent on a local ingest.
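The flag mentioned above would be used the same way as `--function-limit` in the command quoted earlier (a sketch, assuming the same service and CLI entry point):

```shell
docker-compose run backend nmdc-server ingest -vv --skip-annotation
```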

pkalita-lbl (Collaborator, Author)

I have run an ingest from a remote MongoDB to my local Postgres within the last few weeks (a month maybe?), but I don't have exact numbers on how much time is spent one way or the other.

3 participants