
Refresh development docs #1441

Open · wants to merge 10 commits into `main`
Conversation

pkalita-lbl (Collaborator)

Summary

These changes are intended to fill in a few holes in the development docs and to adjust the Docker Compose configuration to better support local ingest development.

Details

I recently tried to debug an ingest issue locally and discovered that I needed a few auxiliary files in order for the KEGG step of the ingest to complete. That step looks for files stored in `/data/ingest`. To address that, these changes:

  • Update `docker-compose.yml` to mount a `data/ingest` directory relative to the project root to `/data/ingest` in the backend container
  • Update `.gitignore` to exclude the `data` directory
  • Add instructions to `development.md` for making a local copy of the necessary files (step 4 of the "Running ingest" section)

Next, I wanted to simplify the SSH tunneling required for ingest. Previously the development docs suggested running a separate Docker container to establish the tunnel inside the Docker Compose network. I've never actually done that myself. I think a simpler setup is to establish the tunnel to your local machine (then you can also use e.g. Compass or Studio 3T to query MongoDB interactively) and then use the special `host.docker.internal` DNS name to reach that tunnel from within a Docker container. This is reflected in additions to `.env.example` and step 3 of the "Running ingest" section.
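As a sketch of that local-tunnel setup: the port numbers below are the ones quoted from the docs later in this conversation, and the prod MongoDB hostname is the one from the existing tunnel command quoted further down; the dev hostname and the SSH gateway host are placeholders — check `.env.example` and the development docs for the real values.

```shell
# Forward two local ports through an SSH gateway to the remote MongoDB
# instances. 37018 (dev) and 37019 (prod) match the development docs.
# <dev-mongo-host> and <nersc-ssh-host> are placeholders.
ssh -L 37018:<dev-mongo-host>:27017 \
    -L 37019:mongo-loadbalancer.nmdc.production.svc.spin.nersc.org:27017 \
    "${NERSC_USER}@<nersc-ssh-host>"
```

With the tunnel up, your own machine reaches the databases at `localhost:37018`/`localhost:37019`, and a container can reach the same tunnel via `host.docker.internal`.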

Since a lot of things (fetching database backups for `load-db`, fetching KEGG support files, establishing MongoDB tunnels) rely on NERSC access, I added a new `NERSC_USER` environment variable to `.env.example` and utilize that variable in various commands. I also added a "NERSC Credentials" section to the development docs with information on getting that set up.
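For example, the new entries in a local `.env` might look like this (the username value is a placeholder; `NMDC_MONGO_HOST` is the variable called out in the docs below):

```
# NERSC username used for SSH tunnels, database backups, and KEGG support files
NERSC_USER=your_nersc_username

# Set this when ingesting through a tunnel established on your local machine
NMDC_MONGO_HOST=host.docker.internal
```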

> That command will set up SSH port forwarding such that your computer can access the dev MongoDB server at `localhost:37018` and the prod MongoDB server at `localhost:37019`.

> From within a Docker container `host.docker.internal` can be used to access the `localhost` of your computer. When ingesting from the dev or prod MongoDB instances, be sure to set `NMDC_MONGO_HOST=host.docker.internal` in your `.env` file.
eecavanna (Collaborator) — Nov 12, 2024


Suggested change (add a comma after "Docker container"):

> From within a Docker container, `host.docker.internal` can be used to access the `localhost` of your computer. When ingesting from the dev or prod MongoDB instances, be sure to set `NMDC_MONGO_HOST=host.docker.internal` in your `.env` file.

If I remember correctly, `host.docker.internal` is not available as a special hostname in Docker environments running on Linux hosts (although it definitely is for Docker environments on Mac hosts). I will check my notes and report back.

Collaborator

Correction: `host.docker.internal` can be used in a Docker container running on a Linux host, provided the container has been configured appropriately.

Here's an example of where someone configured a container appropriately:

https://github.com/microbiomedata/nmdc-edge/blob/1c6bc57a6e9969b18c6a4c03a65417809b13a284/docs/docker-compose.prod.yml#L58-L61

I think some of the Kitware team members will be familiar with the situation as I think they use Linux machines for development.
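The linked nmdc-edge configuration uses Docker Compose's `extra_hosts` mapping. A minimal sketch of the same idea (the `backend` service name is assumed to match this repo's `docker-compose.yml`):

```yaml
services:
  backend:
    extra_hosts:
      # On Linux, maps host.docker.internal to the host's gateway IP;
      # Docker Desktop (Mac/Windows) provides this name automatically.
      - "host.docker.internal:host-gateway"
```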

pkalita-lbl (Collaborator, Author) — Nov 12, 2024

Oh that's good to know. It's something that I've always used, but then again I've always worked on a Mac. I am just now realizing that the documentation I referenced in the original PR description is in the Docker Desktop section. So maybe it's not OS-dependent, but rather whether or not you're using Docker Desktop?

Either way, can someone who develops on Linux weigh in here? Will this work or do we need to refine it?

Collaborator

I develop on Linux (Ubuntu 22.04) and have struggled with `host.docker.internal` in the past, although I never knew how to configure it "correctly," so take that with a grain of salt (thanks @eecavanna).

Previously I used the following command to connect to Mongo:

```
docker run --rm -it -p 27017:27017 --network nmdc-server_default --name tunnel \
  kroniak/ssh-client ssh -o StrictHostKeyChecking=no \
  -L 0.0.0.0:27017:mongo-loadbalancer.nmdc.production.svc.spin.nersc.org:27017 \
  [email protected] '/bin/bash -c "while [[ 1 ]]; do echo heartbeat; sleep 300; done"'
```

I can test using `host.docker.internal` combined with `extra_hosts` to see if there's an issue with that.

pkalita-lbl (Collaborator, Author)

Thanks, yeah, please do let me know if you can get `host.docker.internal` working for you (and if so, what else you had to do to make it work).

IMO it's maybe more useful to establish the tunnel locally (as opposed to establishing it in the `nmdc-server_default` Docker network) so that you can use tools like Compass or Studio 3T or (if you're like me) PyCharm itself to interact with Mongo. A lot of NMDC team members (ones who don't work on nmdc-server) do that already. So this was a bit of a stab at team-wide consistency.

eecavanna (Collaborator) left a comment

Thanks for bringing this document up to date and for adding documentation to the example environment configuration file.

I made some suggestions related to terminology consistency. I don't see any deal-breakers, and so am comfortable with this branch being merged in as is. You can "take or leave" any of the suggestions I left.

Comment on lines +66 to +68
### MongoDB Credentials

To connect to the dev or prod MongoDB instances for ingest, you will need your own credentials. If you do not have them, ask a team member to create accounts for you. Then add the credentials to your `.env` file.
Collaborator

I've been using the `org.microbiomedata.data_reader` user for all of my Mongo needs. Do others have their own user? Do those users have additional permissions?

pkalita-lbl (Collaborator, Author)

I have a `pkalita` user on the dev and prod Mongo instances, so I've never used `org.microbiomedata.data_reader`. But now that you mention it, I do recall that account exists. @eecavanna do you have any suggestions here? Should the generic recommendation be to request and use a personal account, or to use the `org.microbiomedata.data_reader` account?

Comment on lines +156 to +160
<!-- TODO: Consider adding `--build` to this command so that Docker Compose builds
the containers, rather than pulling from GHCR (unless you
want to use the versions that happen to currently be on GHCR).
This has to do with the fact that the `docker-compose.yml` file
contains service specs having both an `image` and `build` section. -->
Collaborator

The pulled versions would be the ones that correspond to the `main` branch, no? So for new developers setting up their environment for the first time, I think it'd be okay to omit `--build` from this section, especially since you call out some places further down where rebuilding would be useful/necessary.

pkalita-lbl (Collaborator, Author)

This comment came about when Eric and I were helping Yan get started with nmdc-server; we both realized we weren't entirely sure we understood the behavior of having both `build` and `image` in the definition of a Docker Compose service. But yeah, I think what you said is what I would expect.
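For reference, the two behaviors being discussed roughly map onto these commands (a sketch; `backend` is the service name used elsewhere in this conversation):

```shell
# With both `image:` and `build:` present, plain `up` uses the pulled
# image from GHCR when one is available locally:
docker-compose pull backend
docker-compose up -d backend

# Passing --build forces a local build from the `build:` section instead:
docker-compose up -d --build backend
```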

```
docker-compose run backend nmdc-server ingest -vv --function-limit 100
```

> **Note**: The `--function-limit` flag is optional. It is used to reduce the time that the ingest takes by limiting the number of certain types of objects loaded. This can be useful for testing purposes.
Collaborator

You can skip the gene function ingest altogether with the `--skip-annotation` flag. The only thing this affects in the development environment is the gene function search. I haven't actually run an ingest from a remote Mongo database in a while, so I don't know what the time delta is between skipping annotation ingest entirely, limiting to something like 100, or a full ingest. I'm wondering if anyone has an idea of the time recently spent on a local ingest.
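The flag mentioned above would be used the same way as `--function-limit` in the command quoted earlier (a sketch, assuming the same service and CLI entry point):

```shell
docker-compose run backend nmdc-server ingest -vv --skip-annotation
```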

pkalita-lbl (Collaborator, Author)

I have run an ingest from a remote MongoDB to my local Postgres within the last few weeks (a month maybe?), but I don't have exact numbers on how much time is spent one way or the other.

3 participants