Anglo_Saxon_Project #104

Merged
merged 33 commits into from
Oct 4, 2023

93Boy
Contributor

@93Boy 93Boy commented Nov 9, 2022

This is a draft .Janno file based on available data.

@stschiff
Member

OK, perhaps Joscha can still provide the Plink data in time, otherwise it's also in AADR (embarrassingly).

@stschiff
Member

Please make this a draft PR for now.

@nevrome nevrome marked this pull request as draft May 5, 2023 13:58
@stschiff
Member

Hi @93Boy. I have uploaded the genetic data. It contains 8 more rows than your original Janno file. I have now adapted the order of rows in the Janno file to the order in the genotype data. I have added rows with "n/a" for those individuals that were listed in the genotype data but not in the Janno. Perhaps you can check again in the paper tables whether you find information for those samples and add it. If not, please ask Joscha Gretzinger.

Here are a number of todos:

  • Decide what to do with the 8 (?) individuals with missing data ("n/a"), as described above
  • Fill missing genetic source ID information for them. The project ENA ID is listed in the paper
  • Check whether the group IDs are correct. I think they might now differ between the Janno file and the ind-file. Perhaps a solution could be to keep the names from the Janno file, but prefix them with the group names from the ind-file. I don't know where they came from.

If anything is unclear, please ask me and I can inquire further.
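
The row alignment described above can be sketched roughly as follows. This is a minimal illustration, not the procedure actually used in this PR: the function name `align_janno_to_ind`, the column subset, and the toy values are assumptions, and only the individual IDs are taken from this thread.

```python
def align_janno_to_ind(janno_rows, ind_ids, columns):
    """Return Janno rows in the order of ind_ids, filling gaps with "n/a"."""
    by_id = {row["Poseidon_ID"]: row for row in janno_rows}
    aligned = []
    for ind_id in ind_ids:
        if ind_id in by_id:
            aligned.append(by_id[ind_id])
        else:
            # Individual is in the genotype data but missing from the Janno file
            placeholder = {col: "n/a" for col in columns}
            placeholder["Poseidon_ID"] = ind_id
            aligned.append(placeholder)
    return aligned


# Toy data: the "Site" values are placeholders, not real metadata
columns = ["Poseidon_ID", "Group_Name", "Site"]
janno_rows = [
    {"Poseidon_ID": "GRO004", "Group_Name": "ExampleGroup", "Site": "ExampleSite"},
    {"Poseidon_ID": "ADN001", "Group_Name": "Germany_EMA", "Site": "ExampleSite"},
]
ind_ids = ["ADN001", "EAS003", "GRO004"]  # order as in the genotype data

aligned = align_janno_to_ind(janno_rows, ind_ids, columns)
for row in aligned:
    print(row["Poseidon_ID"], row["Group_Name"], row["Site"])
```

The genotype order is treated as fixed and the Janno rows are moved to match it, which mirrors the constraint stated in the comment.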

@93Boy
Contributor Author

93Boy commented May 30, 2023

Sure Stephan I will start working on this now

@93Boy
Contributor Author

93Boy commented May 31, 2023

Hello Stephan, I went through the data and found the following points.

  • I have gone through the n/a fields. These individuals are not in the original dataset, but EAS003 is listed in Table S2.1 and the rest in Table S3.5, which includes F4 stats. No other genotype data was found. I will contact Joscha for further information.
  • I have added the accession ID and downloaded the ENA data to process as a ".ssf" file.
  • The group IDs do not match the .ind file. The group IDs in the .ind file are more informative. For example, the .janno group name for Poseidon_ID 'ADN001' is 'Germany_EMA', while in the .ind file it is 'NGermany_EMA_Anderten'. I think the .ind version is better. Do you want me to change it accordingly?
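
A mismatch check of this kind could look roughly like the sketch below. It assumes the usual three-column EIGENSTRAT .ind layout (individual ID, sex, group); the helper names are hypothetical, and the sex value `U` in the toy line is a placeholder rather than real data.

```python
def parse_ind(text):
    """Parse .ind content: one line per individual with ID, sex and group."""
    groups = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3:
            ind_id, _sex, group = fields
            groups[ind_id] = group
    return groups


def group_mismatches(janno_groups, ind_groups):
    """Map each ID to (janno group, ind group) where the two differ."""
    return {
        ind_id: (janno_groups[ind_id], ind_groups[ind_id])
        for ind_id in janno_groups
        if ind_id in ind_groups and janno_groups[ind_id] != ind_groups[ind_id]
    }


ind_text = "ADN001 U NGermany_EMA_Anderten\n"
janno_groups = {"ADN001": "Germany_EMA"}

mismatches = group_mismatches(janno_groups, parse_ind(ind_text))
print(mismatches)
```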

@stschiff
Member

stschiff commented Jun 1, 2023

Yes, I think it's perhaps easiest if you then just adapt the janno file to the ind-groups.

@93Boy
Contributor Author

93Boy commented Jun 1, 2023

  • I have contacted Joscha. He said EAS003 was once part of the data, but he then removed it in order to publish it separately. He will look into the other IDs as well. However, I didn't find these IDs in the ENA data.
  • I have adapted the group names of the .ind file into the .janno as you mentioned, and the ssf file has also been created.

@stschiff
Member

stschiff commented Jun 5, 2023

Hmm, but that means that EAS003 is part of the genotype files then? That's bad... then we need to remove those I suppose.

@93Boy
Contributor Author

93Boy commented Jun 8, 2023

May I remove these from the genotype data?

@93Boy
Contributor Author

93Boy commented Jun 12, 2023

Hello Stephan, I received an update from Joscha. In summary, we can include all of them in the Poseidon package. All the entries except EAS003 are re-sequences from the Schiffels et al. 2016 paper. May I extract that information from your paper?

@stschiff
Member

Yes of course. Please do so. Once you have filled whatever you can, please report back. I'm happy to fill anything you don't know.

@93Boy
Contributor Author

93Boy commented Jul 7, 2023

@stschiff I have filled the missing data fields from your 2016_SchiffelsNatureCommunications. I would like to note the points below.

  • EAS003 was removed.
  • I haven't found a match for I0791_duplicate in your publication.
  • The group name fields of 2022_Gretzinger_AngloSaxons were named "England_EMA", but in yours the group name is equal to the Poseidon_ID. I kept your format. Should I change it back?

@stschiff stschiff marked this pull request as ready for review July 15, 2023 10:26
@stschiff stschiff self-requested a review July 15, 2023 10:26
@stschiff stschiff self-assigned this Jul 15, 2023
@stschiff
Member

I checked, and it seems that EAS003 was in fact not removed from the genotype data, only from the Janno file. I will put it back.

@stschiff
Member

stschiff commented Aug 11, 2023

OK, I've gone through this. As written above, EAS003 was still part of the genotype data, so I put it back into the Janno and filled the necessary fields in consultation with Joscha.

@93Boy, Please work on the following:

  • you filled "C14" in all the Date_Type entries, but only some have actual C14 dates. Please change all the ones without C14 dates to "contextual". I started, but there are a lot more.
  • I talked to Joscha about I0791_duplicate: Please remove this individual from both the Janno file and the genotype data. To remove it from the genotype data, you will have to use forge, with the syntax -<I0791_duplicate> in the forge string to exclude it.
  • Please check Joscha's original Table S1 from Gretzinger et al. 2022. There is kinship information in there (column AC), particularly identical samples. Please add this information into new Janno columns (http://www.poseidon-adna.org/#/janno_details?id=relations-among-samplesindividuals).
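
The first two todos can be sketched as below. This is only an illustration under assumptions: it treats a missing `Date_C14_Uncal_BP` value as the sign that no C14 date exists, the helper names are hypothetical, and using `2022_Gretzinger_AngloSaxons` as the package name in the forge string is a guess based on how that name appears earlier in this thread. The code only builds the selection string; it does not call trident.

```python
MISSING = ("", "n/a")


def fix_date_type(row):
    """Return the row with Date_Type 'contextual' if no C14 date is recorded."""
    has_c14 = row.get("Date_C14_Uncal_BP", "n/a") not in MISSING
    if row.get("Date_Type") == "C14" and not has_c14:
        return dict(row, Date_Type="contextual")
    return row


def forge_string(package, excluded_individuals):
    """Build a forge selection: *package* plus -<individual> exclusions."""
    parts = [f"*{package}*"]
    parts += [f"-<{ind_id}>" for ind_id in excluded_individuals]
    return ",".join(parts)


no_c14 = fix_date_type({"Date_Type": "C14", "Date_C14_Uncal_BP": "n/a"})
with_c14 = fix_date_type({"Date_Type": "C14", "Date_C14_Uncal_BP": "1500"})
selection = forge_string("2022_Gretzinger_AngloSaxons", ["I0791_duplicate"])

print(no_c14["Date_Type"], with_c14["Date_Type"])
print(selection)
```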

For individuals I0161, I0159, I0769, I0773, I0774, I0777, I0157: These are the ones that you took from Schiffels_2016. I now have some new information on these: they are in fact capture datasets from David's lab of the same individuals that were published in 2016, but they are new datasets. So please

  • update their Janno fields to the values used in the AADR (https://raw.githubusercontent.com/poseidon-framework/aadr-archive/main/AADR_v54_1_p1_1240K_EuropeAncient/AADR_v54_1_p1_1240K_EuropeAncient.janno)
  • add information about them being duplicates to the ones published in Schiffels et al. 2016, using the new Janno columns such as Relation_To and so on, as above.
  • adapt their group names to the ones proposed in Joscha's *.ind file! You have now put in new group names that are not in sync with the ind-file. Please switch back to the ind file. I know you had asked me about this, but I've now changed my mind. They should all follow exactly the ind file.
  • Please add a citation to Gretzinger et al. 2022, to Schiffels et al. 2016 and to the AADR. This then means that all of these must also be part of the bib file.
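
Recording a duplicate via the relation columns might look like the sketch below. It assumes that `Relation_To` and `Relation_Degree` are list columns with `;` as separator and that `identical` is the degree used for duplicates; the function name and the ID `HYPOTHETICAL_2016_ID` are placeholders, since the matching 2016 IDs are not given in this thread.

```python
MISSING = ("", "n/a")


def add_relation(row, other_id, degree):
    """Return a copy of the row with one more Relation_To/Relation_Degree entry."""
    updated = dict(row)
    for column, value in (("Relation_To", other_id), ("Relation_Degree", degree)):
        existing = updated.get(column, "n/a")
        # Start a new list or append to the existing ';'-separated one
        updated[column] = value if existing in MISSING else f"{existing};{value}"
    return updated


row = {"Poseidon_ID": "I0161", "Relation_To": "n/a", "Relation_Degree": "n/a"}
once = add_relation(row, "HYPOTHETICAL_2016_ID", "identical")
twice = add_relation(once, "HYPOTHETICAL_RELATIVE", "first")

print(once["Relation_To"], once["Relation_Degree"])
print(twice["Relation_To"], twice["Relation_Degree"])
```

Both columns are updated together so that the n-th entry in `Relation_Degree` always describes the n-th entry in `Relation_To`.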

Finally, please convert the genotype data to Plink using trident genoconvert.

Let me know if you encounter problems.

@stschiff
Member

@93Boy do you have an update for us?

@stschiff
Member

stschiff commented Sep 4, 2023

Ooookay, so I have finally finalised this Pull Request. It needs to still be reviewed by Joscha Gretzinger, though, which I'll take care of.

For the record, here are a few things that I did:

  • The row order of the Janno and ind files was out of sync. I reordered the Janno, as the genotype data must remain fixed.
  • The janno-file was comma-separated, not tab-separated when I took over, so I changed that.
  • After the change from comma- to tab-separated, the columns in the Janno were for some reason misaligned. I suspect that somewhere along the line, LibreOffice or some other tool confused whitespace with tabs and messed up the alignment. I went through by hand and inserted/deleted columns so that all columns are aligned again.
  • I fixed various issues around dates. There were spurious "CE" which I had to remove to make them strictly numeric.
  • I added the "No collagen" strings to the Date_notes field.
  • I manually added dummy values for the contamination error estimates, to be clarified with Joscha
  • I manually added some lower or upper bounds to the dates where they were missing (to be confirmed by Joscha).
  • I aligned some group names, from England_EMA_Capture to just England_EMA.
  • I added duplicates via Relation_* fields. No other relatives have been added yet.
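
Two of these clean-up steps, the delimiter change and the removal of spurious "CE" suffixes, can be sketched with the standard library as shown below. This is not the manual procedure described above: the helper names are hypothetical, `HID001` is a made-up ID following the HIDXXX pattern, and the column check only detects misaligned rows, it does not repair them.

```python
import csv
import io
import re


def comma_to_tab(text):
    """Convert comma-separated text to tab-separated, checking column counts."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ""
    width = len(rows[0])
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for line_no, row in enumerate(rows, start=1):
        if len(row) != width:
            # A differing column count is a sign of misaligned fields
            raise ValueError(f"line {line_no}: expected {width} columns, found {len(row)}")
        writer.writerow(row)
    return out.getvalue()


def strip_ce(value):
    """Turn values like '400 CE' into '400'; leave anything else unchanged."""
    match = re.fullmatch(r"\s*(-?\d+)\s*CE\s*", value)
    return match.group(1) if match else value


converted = comma_to_tab("Poseidon_ID,Date_BC_AD_Median\nHID001,400 CE\n")
print(converted)
print(strip_ce("400 CE"), strip_ce("n/a"))
```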

@stschiff stschiff marked this pull request as draft September 4, 2023 09:18
@stschiff
Member

stschiff commented Sep 4, 2023

Specific points for Joscha to check:

  • Our supplement gives contamination estimates without error bars. Poseidon requires them, if the estimates themselves are set. I now set them all to a dummy value of 0.001, but it would of course be good if you could actually fill the correct ones if you still have them.
  • The dates in GRO004, GRO006, GRO015, GRO016, GRO020 were given only as upper bound in our Supplement. I have now set their lower bound to 900 in all of these. Please check whether that is appropriate
  • All HIDXXX samples were only given with a date fixed at 400 CE. I have now set this 300-500. Please check whether that is appropriate.
  • I changed England_EMA_Capture to England_EMA. I think the fact that they are capture data is already given in the Janno Capture_Type column, and that they’re the same as my 2016 individuals is given via the relationship field.
  • I changed the group names of the four 1240K duplicates of WGS samples at OAI.
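
The dummy error values and the HIDXXX date range from the points above can be illustrated as follows. This is a sketch under assumptions: the helper names are hypothetical, `HID001` is a made-up ID following the HIDXXX pattern, the contamination value is invented, and taking the midpoint as `Date_BC_AD_Median` is a choice made here for illustration.

```python
MISSING = ("", "n/a")


def fill_contamination_err(row, dummy="0.001"):
    """Set a dummy Contamination_Err where an estimate exists without an error."""
    has_estimate = row.get("Contamination", "n/a") not in MISSING
    has_error = row.get("Contamination_Err", "n/a") not in MISSING
    if has_estimate and not has_error:
        return dict(row, Contamination_Err=dummy)
    return row


def set_date_range(row, start, stop):
    """Set start, stop and midpoint median (integer years, negative = BC)."""
    median = (start + stop) // 2
    return dict(
        row,
        Date_BC_AD_Start=str(start),
        Date_BC_AD_Median=str(median),
        Date_BC_AD_Stop=str(stop),
    )


row = {"Poseidon_ID": "HID001", "Contamination": "0.01", "Contamination_Err": "n/a"}
row = fill_contamination_err(row)
row = set_date_range(row, 300, 500)
print(row)
```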

@stschiff
Member

stschiff commented Sep 4, 2023

@93Boy I think it would be great if you could start filling in the rest of the relationships. I have created the Relation_* columns and filled only the duplicates so far, but there are a lot more, listed in Joscha's Supplement. I suggest you start a new branch, which (as an exception to the rule) branches off from this branch here, so that we can get this merged in even before your task is done, and then merge your bits later as a second branch.

@93Boy
Contributor Author

93Boy commented Sep 5, 2023

@stschiff Yes, I have begun working on the relationship data in a sub-branch, as you mentioned.

@93Boy
Contributor Author

93Boy commented Sep 11, 2023

@stschiff I have updated all the kinship information I found in the supplementary documents.

@stschiff
Member

I think you haven't pushed your changes, @93Boy

@93Boy
Contributor Author

93Boy commented Sep 13, 2023

@stschiff I have created a sub branch named AS_relationship and pushed changes.

@stschiff
Member

Great, I'll take a look.

@stschiff stschiff mentioned this pull request Oct 4, 2023
@stschiff
Member

stschiff commented Oct 4, 2023

OK, after having first merged #136 and then realised that the kinship needs more work, I decided to roll back this PR to the state before the kinship information was merged. I will now await the validation pass (it passed locally) and then merge this in. Some points need addressing by Joscha Gretzinger, but I do not want to wait further; instead, I will have him look through the merged package and eventually provide an update. It's such a big package that I find it important that it gets out now.

@stschiff stschiff marked this pull request as ready for review October 4, 2023 10:59
@stschiff
Member

stschiff commented Oct 4, 2023

OK, so I think this is ready to be merged. @AyGhal do you want to have a look? There are some points, listed above, which need to be checked by Joscha, but given that some time has passed I suggest to merge this now and then release an update once Joscha has checked some of the details in the Janno. This is a large package and it needs to be published yesterday.

@AyGhal AyGhal merged commit 3d04838 into master Oct 4, 2023
1 check passed
@AyGhal
Contributor

AyGhal commented Oct 4, 2023

I had a quick look and it seems okay.


3 participants