Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Release data 26/03/2021 #8

Merged
merged 3 commits into from
May 5, 2021
Merged

Conversation

ESapenaVentura
Copy link
Collaborator

@ESapenaVentura ESapenaVentura commented Mar 26, 2021

Versions updated:

system/links.json - v3.0.0
module/protocol/matrix.json - v1.0.0
type/protocol/analysis/analysis_protocol.json - v9.2.0
type/file/analysis_file.json - v6.3.0
core/file/file_core.json - v6.2.0
type/file/sequence_file.json - v9.3.0
type/file/supplementary_file.json - v2.3.0
type/file/reference_file.json - v3.3.0
type/file/image_file.json - v2.3.0
module/ontology/data_use_ontology.json - v1.0.0
type/project/project.json - v14.2.0

Issues addressed:

New subgraphs?:

  • tests/links/d7b8cbff-aee9-5a05-a4a1-d8f4e720aee7_2021-01-01T00:00:00.000000Z_90bf705c-d891-5ce2-aa54-094488b445c6.json: Cell suspension --> Analysis files

Copy link

@hannes-ucsc hannes-ucsc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The path to the test data should not encode a date or time or any other version signifier. We want Git to do the version tracking for us. We can't release a new set of test data for every schema release. That would defeat the purpose of letting the Git history reflect how the metadata evolves. The schema evolution should be reflected in actual commit differences to existing files, not just in the accumulation of new metadata files, leaving in place the old files that comply with the old schema.

@aaclan-ebi
Copy link
Collaborator

The path to the test data should not encode a date or time or any other version signifier. We want Git to do the version tracking for us. We can't release a new set of test data for every schema release. That would defeat the purpose of letting the Git history reflect how the metadata evolves. The schema evolution should be reflected in actual commit differences to existing files, not just in the accumulation of new metadata files, leaving in place the old files that comply with the old schema.

Hi @hannes-ucsc, do you mean the version timestamp in the filenames? All timestamp values are using 2021-01-01T00:00:00.000000Z. Instead of 0, we used a "zero" timestamp value which will never change. Is this a problem?

Also, there are 2 commits in the PR, 1st commit contains the base test data, the 2nd commit contains the diff which is what we want.

@hannes-ucsc
Copy link

Sorry, I made a mistake. Disregard the my previous comment. I took the branch name for a directory.

Copy link

@hannes-ucsc hannes-ucsc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding the two commits: It seems that each commit adds a h5ad file. Not sure we need to structure the history that way. We want to be able to correlate a diff in this repository to a diff in the schema repository. So if a schema PR renames an property in a schema, there should be a one or two line diff in the schema PR and as many one line changes in the test data PR as there are instances of that schema in the test data repo.

Anyways, I think the two commits can be squashed simply because the test data project is made up exclusively of matrix files. There are no other sequence of analysis files. There is no initial commit that adds anything worth looking at.

The second commit adds a links.json but also renames the links.json from the first commit. Why does it do that? Shouldn't the second commit simply add a links.json and leave the other one as is? I worry that we have a non-determinism in the UUID generation for the links.json files.

@ESapenaVentura
Copy link
Collaborator Author

We have corrected the problem with the links.json (It was an issue with the processes IDs being auto-generated each time) and re-deployed the test data to match the last changes we pushed to staging (rolling back the project version)\

About the files: I don't think I get the request. Comparing both commits shows the difference with the updated schemas (https://github.com/HumanCellAtlas/schema-test-data/compare/dd0553a98cd4d2eff216a8b9e567a04be85b3e75..7213226d250126a1ac327142663b7221d8506849). In the future, there will only be 1 commit that will compare against master and showcase the updates in the schema, but in this first iteration we also needed to add the baseline data.

Let me know if you have more concerns/queries!

@hannes-ucsc
Copy link

About the files: I don't think I get the request. Comparing both commits shows the difference with the updated schemas

Generally speaking, you can't force push a PR branch and then use the NEW history to question/contradict a claim that I make about the OLD history.

I am writing this after I wrote

#8 (comment)

It looks like you already did exactly what I proposed?

@hannes-ucsc
Copy link

If the answer is yes, then Daniel and I can re-review. He's made good progress on the code that reads this staging area:

DataBiosphere/hca-metadata-api#42

@ESapenaVentura
Copy link
Collaborator Author

ESapenaVentura commented Apr 15, 2021

Generally speaking, you can't force push a PR branch and then use the NEW history to question/contradict a claim that I make about the OLD history.

I am writing this after I wrote

#8 (comment)

It looks like you already did exactly what I proposed?

Sorry about the missunderstanding, I meant the comment about the h5ad files!

About the other question, yes that's what we did!

We have fixed the newlines but there's been a small problem squashing the commits. Apologies, I take full responsibility about this one, I squashed Alegria's fix of the script into the commit with the regenerated test data. If you could please ignore the post_process.py file, everything else should be in place as requested so you can re-review the test data.

We have solved the problem with the squash so now only the updates are showing

Many thanks!

@ESapenaVentura ESapenaVentura force-pushed the release-data-26/03/2021 branch 2 times, most recently from e3f0355 to bd3a9d2 Compare April 15, 2021 15:31
@hannes-ucsc
Copy link

Generally speaking, you can't force push a PR branch and then use the NEW history to question/contradict a claim that I make about the OLD history.
I am writing this after I wrote
#8 (comment)
It looks like you already did exactly what I proposed?

Sorry about the missunderstanding, I meant the comment about the h5ad files!

Which comment? Maybe post a link to it?

@ESapenaVentura
Copy link
Collaborator Author

Generally speaking, you can't force push a PR branch and then use the NEW history to question/contradict a claim that I make about the OLD history.
I am writing this after I wrote
#8 (comment)
It looks like you already did exactly what I proposed?

Sorry about the missunderstanding, I meant the comment about the h5ad files!

Which comment? Maybe post a link to it?

This comment #8 (review)

Copy link

@dsotirho-ucsc dsotirho-ucsc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The file tests/links/37e91a9f-b04e-5313-bbc0-4c394406247e_2021-01-01T00:00:00.000000Z_90bf705c-d891-5ce2-aa54-094488b445c6.json lists process_id for files that appear to be missing from schema-test-data/tests/metadata/process/

@hannes-ucsc
Copy link

Generally speaking, you can't force push a PR branch and then use the NEW history to question/contradict a claim that I make about the OLD history.
I am writing this after I wrote
#8 (comment)
It looks like you already did exactly what I proposed?

Sorry about the missunderstanding, I meant the comment about the h5ad files!

Which comment? Maybe post a link to it?

This comment #8 (review)

That comment does not mention h5ad files.

@ESapenaVentura
Copy link
Collaborator Author

Generally speaking, you can't force push a PR branch and then use the NEW history to question/contradict a claim that I make about the OLD history.
I am writing this after I wrote
#8 (comment)
It looks like you already did exactly what I proposed?

Sorry about the missunderstanding, I meant the comment about the h5ad files!

Which comment? Maybe post a link to it?

This comment #8 (review)

That comment does not mention h5ad files.

It does, in the very first line

@aaclan-ebi
Copy link
Collaborator

@danielsotirhos I am not sure why there were missing metadata files and incorrect links.json filenames. But I've just re-run the post-processor against the original test data and I believe that corrected the issue.

@ESapenaVentura
Copy link
Collaborator Author

@danielsotirhos as alegria pointed out, she has added the missing files and we have corrected the regex.

It should be fine now! Please let us know if you find any further issues

Copy link

@hannes-ucsc hannes-ucsc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#8 (comment)

wasn't addressed. The UUID of two processes is reused as that of links files:

$ find . | grep -E '4da04038-adab-59a9-b6c4-3a61242cc972|d7b8cbff-aee9-5a05-a4a1-d8f4e720aee7'
./tests/metadata/process/4da04038-adab-59a9-b6c4-3a61242cc972_2021-01-01T00:00:00.000000Z.json
./tests/metadata/process/d7b8cbff-aee9-5a05-a4a1-d8f4e720aee7_2021-01-01T00:00:00.000000Z.json
./tests/links/4da04038-adab-59a9-b6c4-3a61242cc972_2021-01-01T00:00:00.000000Z_90bf705c-d891-5ce2-aa54-094488b445c6.json
./tests/links/d7b8cbff-aee9-5a05-a4a1-d8f4e720aee7_2021-01-01T00:00:00.000000Z_90bf705c-d891-5ce2-aa54-094488b445c6.json

Each entity must have a universally unique ID (hence the name UUID), even if they are of different types.

@hannes-ucsc
Copy link

Generally speaking, you can't force push a PR branch and then use the NEW history to question/contradict a claim that I make about the OLD history.
I am writing this after I wrote
#8 (comment)
It looks like you already did exactly what I proposed?

Sorry about the missunderstanding, I meant the comment about the h5ad files!

Which comment? Maybe post a link to it?

This comment #8 (review)

That comment does not mention h5ad files.

It does, in the very first line

Ahh, I stand corrected. That comment was more about the history than the actual files, hence my confusion with your reference to my comment as being about h5ad files. But, technically you are correct.

@aaclan-ebi
Copy link
Collaborator

#8 (comment)

wasn't addressed. The UUID of two processes is reused as that of links files:

$ find . | grep -E '4da04038-adab-59a9-b6c4-3a61242cc972|d7b8cbff-aee9-5a05-a4a1-d8f4e720aee7'
./tests/metadata/process/4da04038-adab-59a9-b6c4-3a61242cc972_2021-01-01T00:00:00.000000Z.json
./tests/metadata/process/d7b8cbff-aee9-5a05-a4a1-d8f4e720aee7_2021-01-01T00:00:00.000000Z.json
./tests/links/4da04038-adab-59a9-b6c4-3a61242cc972_2021-01-01T00:00:00.000000Z_90bf705c-d891-5ce2-aa54-094488b445c6.json
./tests/links/d7b8cbff-aee9-5a05-a4a1-d8f4e720aee7_2021-01-01T00:00:00.000000Z_90bf705c-d891-5ce2-aa54-094488b445c6.json

Each entity must have a universally unique ID (hence the name UUID), even if they are of different types.

Replied here #8 (comment)

Copy link

@ruchim ruchim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

haven't done a deep dive, but approving for fullness.

Copy link

@hannes-ucsc hannes-ucsc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@danielsotirhos could you also please verify whether the reused

Copy link

@aherbst-broad aherbst-broad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Test data update PR related to test data generation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants