
Implement API endpoint that returns TSV report about submissions that are pending review #1443

Draft · wants to merge 7 commits into main

Conversation

@cristina-stonepedraza commented Nov 12, 2024

Add a script to api.py and a query to crud.py to pull information from the submission portal database, and then generate a TSV report of NMDC submissions that have been submitted.

This PR references issue 2047 in nmdc-schema: microbiomedata/nmdc-schema#2047
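For context, here is a minimal, self-contained sketch of the "generate a TSV report" part using only the Python standard library; the column names and sample values below are placeholders, not the PR's actual fields.

import csv
import io

# Placeholder rows standing in for data pulled from the submission portal database.
rows = [
    ("Sample A", "soil", "broad scale value", "local scale value", "medium value"),
    ("Sample B", "water", "broad scale value", "local scale value", "medium value"),
]

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(["Sample Name", "Package", "Broad Scale", "Local Scale", "Medium"])
writer.writerows(rows)

tsv_report = buffer.getvalue()  # this string is what the endpoint would return as its body
print(tsv_report)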

@eecavanna (Collaborator) left a comment

Hi @cristina-stonepedraza, I think this looks great overall! The code looks clean to me and I had no issues following it.

I left some feedback about path, function, and variable naming, as well as one comment about the endpoint description. If you have any questions about these, you can message me here, on Slack, etc.

There is one more thing I want to see in this PR:

  • An automated test

I posted information about that below.

Comment on lines +647 to +650
r"""
Generate a report of NMDC submissions that are submitted pending review to allow review
of submission triads and evaluate for approval
"""

I revised the endpoint description after looking at the code. Here's what I came up with:

Suggested change
- r"""
- Generate a report of NMDC submissions that are submitted pending review to allow review
- of submission triads and evaluate for approval
- """
+ r"""
+ Generate a TSV-formatted report of biosamples belonging to submissions
+ that have a status of "Submitted- Pending Review".
+ The report indicates which environmental package/extension, broad scale,
+ local scale, and medium are specified for each biosample. The report is
+ designed to facilitate the review of submissions by NMDC team members.
+ """

@@ -637,6 +637,97 @@ async def download_zip_file(
)


@router.get(
"/metadata_submission/mixs",

I recommend updating the URL path so that the final part describes what is being GET-ed (i.e. gotten/retrieved via the endpoint). In this case, I think of it as a "report" (I'm not too familiar with MIxS and how it relates to what the report contains; my guess is that it is the name of an ontology that the values in the "environment..." fields come from).

Suggested change
- "/metadata_submission/mixs",
+ "/metadata_submission/mixs_report",


# Get sample names from each sample type
for sample_type in sample_data:
samples = sample_data[sample_type] if sample_type in sample_data else {}

Assuming samples is typically a list (as opposed to a dictionary), I'd recommend defaulting to an empty list (instead of an empty dictionary) here.

Suggested change
- samples = sample_data[sample_type] if sample_type in sample_data else {}
+ samples = sample_data[sample_type] if sample_type in sample_data else []
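For what it's worth, dict.get gives a more concise equivalent. Here is a tiny runnable sketch of the same idea; the sample_data contents and sample type keys are placeholders:

sample_data = {"soil_data": [{"samp_name": "Sample A"}]}  # placeholder input
for sample_type in ("soil_data", "water_data"):
    # Fall back to an empty list when the sample type is absent, mirroring the suggested change.
    samples = sample_data.get(sample_type, [])
    print(sample_type, samples)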

for x in samples:

# Get the sample name
name = x["samp_name"] if "samp_name" in x else {}

Assuming name normally contains a string, I'd recommend defaulting to an empty string here (instead of an empty dictionary).

Suggested change
- name = x["samp_name"] if "samp_name" in x else {}
+ name = x["samp_name"] if "samp_name" in x else ""

I have the same (analogous) thought about broad, local, and medium below.
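The same dict.get shorthand works for these string-valued fields. A tiny sketch with a placeholder record (the same pattern applies to broad, local, and medium):

x = {"samp_name": "Sample A"}  # placeholder biosample record
# Fall back to an empty string when the key is absent, mirroring the suggested change.
name = x.get("samp_name", "")
print(name)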

Comment on lines +686 to +689
# Get the env local scale
local = x["env_local_scale"] if "env_local_scale" in x else {}
local = str(local)
local = local.replace("\t", "").replace("\r", "").replace("\n", "").lstrip("_")

I'd prefer the variable not be named local—for two reasons:

  1. The word "local" can have a special meaning in programming (i.e. local versus global) and I think future readers (of this code or stack traces involving this code) will find it confusing (at first glance).
  2. I think "local" is an adjective describing "scale". Based on that, I think the value this variable will contain is some kind of "scale".

With those things in mind, I'd recommend renaming this variable to local_scale (or even env_local_scale).

I have the same (analogous) thought about broad.
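Applied to the quoted lines, the rename would look roughly like this (keeping the existing default and whitespace-stripping logic as-is; the x record here is a placeholder so the snippet runs on its own):

x = {"env_local_scale": "some local scale value"}  # placeholder biosample record

# Get the env local scale
env_local_scale = x["env_local_scale"] if "env_local_scale" in x else {}
env_local_scale = str(env_local_scale)
env_local_scale = env_local_scale.replace("\t", "").replace("\r", "").replace("\n", "").lstrip("_")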

@@ -686,6 +686,19 @@ def get_query_for_all_submissions(db: Session):
return all_submissions


def get_query_for_submitted_pending_review(db: Session):
r"""
Returns a SQLAlchemy query that can be used to retrieve submissions pending review.

👍 Thanks for including a docstring!

@@ -686,6 +686,19 @@ def get_query_for_all_submissions(db: Session):
return all_submissions


def get_query_for_submitted_pending_review(db: Session):

I'd prefer this be named...

- def get_query_for_submitted_pending_review(db: Session):
+ def get_query_for_submitted_pending_review_submissions(db: Session):

...as a way of indicating that the function can be used to get submissions.
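Since the diff only shows the signature and docstring, here is a rough sketch of the shape such a helper usually has in SQLAlchemy; SubmissionMetadata is an assumed model name (not necessarily what this PR uses), and the status string is taken from the endpoint description suggested above:

from sqlalchemy.orm import Session

def get_query_for_submitted_pending_review_submissions(db: Session):
    r"""
    Returns a SQLAlchemy query that can be used to retrieve submissions pending review.
    """
    # `SubmissionMetadata` is a stand-in for the project's actual submission model.
    return db.query(SubmissionMetadata).filter(
        SubmissionMetadata.status == "Submitted- Pending Review"
    )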

@eecavanna changed the title from "2047 mixs endpoint" to "Implement API endpoint that returns TSV report about submissions that are pending review" on Nov 14, 2024
@eecavanna

Hi @cristina-stonepedraza, here's an example of an automated test that targets an endpoint like this one:

# Imports these tests rely on; `fakes`, `SubmissionEditorRole`, and the fixtures
# (`db`, `client`, `logged_in_user`, `logged_in_admin_user`) come from the existing
# nmdc-server test suite.
from csv import DictReader
from datetime import datetime, timedelta

from fastapi.testclient import TestClient
from sqlalchemy.orm import Session


def test_get_metadata_submissions_report_as_non_admin(
    db: Session, client: TestClient, logged_in_user
):
    response = client.request(method="GET", url="/api/metadata_submission/report")
    assert response.status_code == 403


def test_get_metadata_submissions_report_as_admin(
    db: Session, client: TestClient, logged_in_admin_user
):
    now = datetime.utcnow()

    # Create two submissions, only one of which is owned by the logged-in user.
    logged_in_user = logged_in_admin_user  # allows us to reuse some code snippets
    submission = fakes.MetadataSubmissionFactory(
        author=logged_in_user,
        author_orcid=logged_in_user.orcid,
        created=now,
    )
    fakes.SubmissionRoleFactory(
        submission=submission,
        submission_id=submission.id,
        user_orcid=logged_in_user.orcid,
        role=SubmissionEditorRole.owner,
    )
    other_user = fakes.UserFactory()
    other_submission = fakes.MetadataSubmissionFactory(
        author=other_user,
        author_orcid=other_user.orcid,
        created=now + timedelta(seconds=1),
        metadata_submission={
            "studyForm": {
                "studyName": "My study name",
                "piName": "My PI name",
                "piEmail": "My PI email",
            },
        },
        status="in-progress",
        source_client="field_notes",
    )
    db.commit()

    response = client.request(method="GET", url="/api/metadata_submission/report")
    assert response.status_code == 200

    # Confirm the response payload is a TSV file having the fields and values we expect;
    # i.e. below its header row, it has two data rows, each representing a submission,
    # ordered from most recently-created to least recently-created.
    # Reference: https://docs.python.org/3/library/csv.html#csv.DictReader
    fieldnames = [
        "Submission ID",
        "Author ORCID",
        "Author Name",
        "Study Name",
        "PI Name",
        "PI Email",
        "Source Client",
        "Status",
    ]
    reader = DictReader(response.text.splitlines(), fieldnames=fieldnames, delimiter="\t")
    rows = [row for row in reader]
    assert len(rows) == 3  # includes the header row
    header_row = rows[0]  # gets the header row
    assert len(list(header_row.keys())) == len(fieldnames)
    data_row = rows[1]  # gets the first data row (the most recently-created submission)
    assert data_row["Submission ID"] == str(other_submission.id)
    assert data_row["Author ORCID"] == other_user.orcid
    assert data_row["Author Name"] == other_user.name
    assert data_row["Study Name"] == "My study name"
    assert data_row["PI Name"] == "My PI name"
    assert data_row["PI Email"] == "My PI email"
    assert data_row["Source Client"] == "field_notes"
    assert data_row["Status"] == "in-progress"
    data_row = rows[2]  # gets the second data row
    assert data_row["Submission ID"] == str(submission.id)
    assert data_row["Author ORCID"] == logged_in_user.orcid
    assert data_row["Author Name"] == logged_in_user.name
    assert data_row["Study Name"] == ""
    assert data_row["PI Name"] == ""
    assert data_row["PI Email"] == ""
    assert data_row["Source Client"] == ""  # upstream faker lacks `source_client` attribute
    assert data_row["Status"] == "In Progress"  # matches value in upstream faker

There are actually two tests there because the endpoint in question is only accessible to admins. One of the tests focuses on that aspect (i.e. security).

By the way, I do wonder whether we will restrict access to this reporting endpoint also. You/we can discuss that with @mslarae13.
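If we do decide to restrict it, here is a minimal, self-contained sketch of one way to gate a FastAPI route to admins; get_current_user, the is_admin flag, and the route path are placeholders rather than the project's actual auth helpers:

from fastapi import Depends, FastAPI, HTTPException

app = FastAPI()

def get_current_user():
    # Placeholder dependency; the real app would pull the user from its auth machinery.
    class User:
        is_admin = False
    return User()

def require_admin(user=Depends(get_current_user)):
    # Reject non-admin callers with 403, matching the first test above.
    if not getattr(user, "is_admin", False):
        raise HTTPException(status_code=403, detail="Admins only")
    return user

@app.get("/api/metadata_submission/report")
def get_metadata_submissions_report(user=Depends(require_admin)):
    return "..."  # the TSV payload would be returned here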
