masked DNA strings #32

Open
kchu25 opened this issue Oct 14, 2022 · 1 comment

kchu25 commented Oct 14, 2022

Some DNA strings in the datasets consist partially or entirely of masked bases. For example, the 7th sequence in the DemoHumanOrWorm training set (checked via dset[6]) is a string of 'NNNNNNN....NNNN'. Maybe consider extracting the DNA strings from the unmasked genome?
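To get a sense of how widespread this is, a scan along the following lines might help. It is only a sketch: `find_masked` is a hypothetical helper, not part of the package, and it assumes the sequences are available as plain strings. If each dataset item is a (sequence, label) pair, something like `(item[0] for item in dset)` could be passed in instead, but that layout is an assumption here.

```python
def find_masked(sequences):
    """Return (index, n_fraction) for every sequence that contains an 'N'.

    `sequences` is assumed to be an iterable of plain DNA strings.
    """
    hits = []
    for i, seq in enumerate(sequences):
        n_count = seq.upper().count("N")  # count masked bases, ignoring case
        if n_count:
            hits.append((i, n_count / len(seq)))
    return hits


# Toy strings standing in for dataset entries (not real data).
toy = ["ACGT", "NNNN", "ACNN"]
report = find_masked(toy)
for index, fraction in report:
    print(f"sequence {index}: {fraction:.0%} N")
```

Reporting the fraction rather than a yes/no flag makes it easier to tell fully masked sequences apart from ones with only a few ambiguous bases.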

Contributor

simecek commented Oct 14, 2022

I believe we use the unmasked genome, but I will look into that. It might still be that we hit the beginnings or ends of chromosomes, whose sequence is often unknown. Maybe we should check the randomly chosen sequences and remove those that contain long runs of Ns.
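One possible shape for that check, as a rough sketch rather than the actual dataset-building code: measure the longest run of Ns in each sampled sequence and drop the sequence if the run exceeds a threshold. The helper names and the `max_run=10` default are illustrative assumptions, not values from the project.

```python
import re

# Matches any run of one or more N bases, upper or lower case.
N_RUN = re.compile(r"N+", re.IGNORECASE)


def longest_n_run(seq):
    """Length of the longest consecutive run of N in seq (0 if there is none)."""
    return max((len(m.group()) for m in N_RUN.finditer(seq)), default=0)


def keep_sequence(seq, max_run=10):
    """True if seq has no run of N longer than max_run (threshold is a guess)."""
    return longest_n_run(seq) <= max_run


# Toy candidates standing in for randomly sampled intervals (not real data).
candidates = ["ACGT" * 5, "N" * 20, "ACGT" + "N" * 11 + "ACGT", "ACNNGT"]
kept = [s for s in candidates if keep_sequence(s)]
```

Filtering on run length rather than on any occurrence of N keeps sequences with a few isolated ambiguous bases while still removing the all-N cases; a rejected sample could simply be redrawn so the dataset size stays the same.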
