Refresh development docs #1441
base: main
Conversation
…on and running ingest
…-development-docs Incorporate feedback from run-through with new developer
> That command will set up SSH port forwarding such that your computer can access the dev MongoDB server at `localhost:37018` and the prod MongoDB server at `localhost:37019`.
>
> From within a Docker container `host.docker.internal` can be used to access the `localhost` of your computer. When ingesting from the dev or prod MongoDB instances, be sure to set `NMDC_MONGO_HOST=host.docker.internal` in your `.env` file.
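The tunnel command itself isn't shown in this excerpt. As a purely hypothetical sketch of the kind of ssh invocation being described (the prod load balancer hostname is borrowed from a command later in this thread; the dev hostname placeholder and login node are assumptions):

```
# Hypothetical sketch; substitute the real hostnames from the development docs.
ssh -N \
  -L 37018:<dev-mongo-host>:27017 \
  -L 37019:mongo-loadbalancer.nmdc.production.svc.spin.nersc.org:27017 \
  "$NERSC_USER"@dtn01.nersc.gov
```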
Suggested change:
> From within a Docker container, `host.docker.internal` can be used to access the `localhost` of your computer. When ingesting from the dev or prod MongoDB instances, be sure to set `NMDC_MONGO_HOST=host.docker.internal` in your `.env` file.
If I remember correctly, `host.docker.internal` is not the special hostname that Docker environments running on Linux hosts use (although it definitely is for Docker environments on Mac hosts). I will check my notes and report back.
Correction: `host.docker.internal` can be used in a Docker container running on a Linux host, provided the container has been configured appropriately.
Here's an example of where someone configured a container appropriately:
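(The link to that example didn't survive this excerpt. As a stand-in, here is a minimal sketch of the usual mechanism on Linux, using the special `host-gateway` value supported by Docker Engine 20.10+; the linked example may differ in details:)

```yaml
# Compose file sketch: maps host.docker.internal to the host's gateway IP
# inside the container, mirroring Docker Desktop's built-in behavior.
services:
  backend:
    extra_hosts:
      - "host.docker.internal:host-gateway"
```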
I think some of the Kitware team members will be familiar with the situation as I think they use Linux machines for development.
Oh that's good to know. It's something that I've always used, but then again I've always worked on a Mac. I am just now realizing that the documentation I referenced in the original PR description is in the Docker Desktop section. So maybe it's not OS-dependent, but rather whether or not you're using Docker Desktop?
Either way, can someone who develops on Linux weigh in here? Will this work or do we need to refine it?
I develop on Linux (Ubuntu 22.04) and have struggled with `host.docker.internal` in the past (although I never knew how to configure it "correctly," so grain of salt) (thanks @eecavanna).

Previously I used the following command to connect to mongo:

```
docker run --rm -it -p 27017:27017 --network nmdc-server_default --name tunnel kroniak/ssh-client ssh -o StrictHostKeyChecking=no -L 0.0.0.0:27017:mongo-loadbalancer.nmdc.production.svc.spin.nersc.org:27017 [email protected] '/bin/bash -c "while [[ 1 ]]; do echo heartbeat; sleep 300; done"'
```

I can test using `host.docker.internal` combined with `extra_hosts` to see if there's an issue with that.
Thanks, yeah, please do let me know if you can get `host.docker.internal` working for you (and if so, what else you had to do to make it work).

IMO it's maybe more useful to establish the tunnel locally (as opposed to establishing it in the `nmdc-server_default` Docker network) so that you can use tools like Compass or Studio 3T or (if you're like me) PyCharm itself to interact with Mongo. A lot of NMDC team members (ones who don't work on `nmdc-server`) do that already. So this was a bit of a stab at team-wide consistency.
Thanks for bringing this document up to date, and adding documentation to the example environment configuration file.
I made some suggestions related to terminology consistency. I don't see any deal-breakers, and so am comfortable with this branch being merged in as is. You can "take or leave" any of the suggestions I left.
Co-authored-by: eecavanna <[email protected]>
> ### MongoDB Credentials
>
> In order to connect to the dev or prod MongoDB instances for ingest, you will need your own credentials. If you do not have these, ask a team member to create accounts for you. Then add the credentials to your `.env` file.
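The excerpt doesn't show what the credential variables are called. As a loose sketch only (`NMDC_MONGO_HOST` is named in this PR, but the user and password variable names below are hypothetical placeholders):

```
# NMDC_MONGO_HOST is named in this PR; the credential variable names are
# hypothetical placeholders. Check .env.example for the real ones.
NMDC_MONGO_HOST=host.docker.internal
NMDC_MONGO_USER=<your-mongo-username>
NMDC_MONGO_PASSWORD=<your-mongo-password>
```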
I've been using the `org.microbiomedata.data_reader` user for all of my mongo needs. Do others have their own user? Do those users have additional permissions?
I have a `pkalita` user on the dev and prod mongo instances, so I've never used `org.microbiomedata.data_reader`. But now that you mention it, I do recall that account exists. @eecavanna do you have any suggestions here? Should the generic recommendation be to request and use a personal account, or to use the `org.microbiomedata.data_reader` account?
<!-- TODO: Consider adding `--build` to this command so that Docker Compose builds
     the containers, rather than pulling from GHCR (unless you
     want to use the versions that happen to currently be on GHCR).
     This has to do with the fact that the `docker-compose.yml` file
     contains service specs having both an `image` and `build` section. -->
The pulled versions would be the ones that correspond to the `main` branch, no? So for new developers setting up their environment for the first time, I think it'd be okay to omit `--build` from this section, especially since you call out some places further down where rebuilding would be useful/necessary.
This comment came about as Eric and I were helping Yan get started with `nmdc-server`; we both realized we weren't sure we completely understood the behavior of having both `build` and `image` in the definition of a Docker Compose service. But yeah, I think what you said is what I would expect.
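For reference, a hypothetical service spec of the shape being discussed (the image path is illustrative, not necessarily the project's). With both keys present, `docker compose up` reuses an existing image (local or pulled) unless `--build` is passed, in which case Compose builds from the `build` context and tags the result with the `image` name:

```yaml
services:
  backend:
    # Illustrative image path, not necessarily what nmdc-server publishes.
    image: ghcr.io/example/nmdc-server-backend:latest
    # With --build, Compose builds from this context and tags the result
    # with the `image` name above instead of pulling.
    build: .
```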
```
docker-compose run backend nmdc-server ingest -vv --function-limit 100
```

> **Note**: The `--function-limit` flag is optional. It is used to reduce the time that the ingest takes by limiting the number of certain types of objects loaded. This can be useful for testing purposes.
You can skip the gene function ingest altogether with the `--skip-annotation` flag. The only thing this affects for the development environment is the gene function search. I haven't actually run an ingest from a remote mongo database in a while, so I don't know what the time delta is between skipping annotation ingest entirely, limiting to something like 100, or doing a full ingest. I'm wondering if anyone has an idea of recent time spent on local ingest.
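For example, assuming `--skip-annotation` works as described above, the ingest command from the docs would become:

```
docker-compose run backend nmdc-server ingest -vv --skip-annotation
```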
I have run an ingest from a remote MongoDB to my local Postgres within the last few weeks (a month maybe?), but I don't have exact numbers on how much time is spent one way or the other.
Summary
These changes are intended to fill in a few holes in the development docs and to adjust the Docker Compose configuration to better support local ingest development.
Details
I recently tried to debug an ingest issue locally and discovered that I needed a few auxiliary files in order for the KEGG step of the ingest to complete. That step looks for files stored in `/data/ingest`. To address that, these changes:

- update `docker-compose.yml` to mount a `data/ingest` directory (relative to the project root) to `/data/ingest` in the `backend` container
- update `.gitignore` to exclude the `data` directory
- update `development.md` with instructions for making a local copy of the necessary files (step 4 of the "Running ingest" section)

Next, I wanted to simplify the SSH tunneling required for ingest. Previously the development docs suggested running a separate Docker container to establish the tunnel in the Docker Compose network. I've never actually done that myself. I think a simpler setup is to establish the tunnel to your local machine (then you can also use e.g. Compass or Studio 3T to query MongoDB interactively) and then use the special `host.docker.internal` DNS name to use that tunnel from within a Docker container. This is reflected in additions to `.env.example` and step 3 of the "Running ingest" section.

Since a lot of things (fetching database backups for `load-db`, fetching KEGG support files, establishing MongoDB tunnels) rely on NERSC access, I added a new `NERSC_USER` environment variable to `.env.example` and utilize that variable in various commands. I also added a "NERSC Credentials" section to the development docs with information on getting that set up.
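A sketch of the compose change described above (other service settings omitted; the exact syntax in the actual diff may differ):

```yaml
services:
  backend:
    volumes:
      # Mount the project-root data/ingest directory where the KEGG
      # ingest step expects its auxiliary files.
      - ./data/ingest:/data/ingest
```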