HESTData: provide util to map ensemble ID to gene name #71

Draft
wants to merge 2 commits into
base: develop

Collaborator

@konst-int-i konst-int-i commented Nov 9, 2024

This PR

To dos

  • Might want to make this bi-directional, providing both ensembleID_to_gene and gene_to_ensembleID
  • get_gene_db() looks redundant: it is not needed when the BioMart annotations are queried with use_cache. @pauldoucet, are we good to remove this function? I don't see it used anywhere.
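
For the first to-do, a minimal sketch of what a bi-directional lookup could look like. It is not the PR's implementation: `ENSEMBL_TO_GENE`, `invert_mapping`, and `convert` are hypothetical names, and the three-entry table stands in for the real BioMart annotations.

```python
from typing import Dict, List

# Toy mapping table (assumption: the real one would come from BioMart).
ENSEMBL_TO_GENE: Dict[str, str] = {
    "ENSG00000000003": "TSPAN6",
    "ENSG00000000005": "TNMD",
    "ENSG00000000419": "DPM1",
}


def invert_mapping(mapping: Dict[str, str]) -> Dict[str, str]:
    """Build the reverse lookup (gene name -> Ensembl ID)."""
    inverted: Dict[str, str] = {}
    for ensembl_id, gene_name in mapping.items():
        # If several IDs share one gene name, keep the first ID seen.
        inverted.setdefault(gene_name, ensembl_id)
    return inverted


GENE_TO_ENSEMBL = invert_mapping(ENSEMBL_TO_GENE)


def convert(names: List[str], direction: str) -> List[str]:
    """Translate names in one direction; names without a mapping pass through."""
    if direction == "to_gene":
        table = ENSEMBL_TO_GENE
    elif direction == "to_ensembl":
        table = GENE_TO_ENSEMBL
    else:
        raise ValueError(f"unknown direction: {direction!r}")
    return [table.get(name, name) for name in names]


print(convert(["ENSG00000000003", "ENSG00000000419"], "to_gene"))
print(convert(["TNMD"], "to_ensembl"))
```

Keeping a single table and deriving the reverse lookup from it means the two directions cannot drift apart, but gene names shared by several Ensembl IDs need an explicit tie-break rule, as noted in the comment.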

Run instructions

from hest import iter_hest, ensembleID_to_gene

# Three samples that use Ensembl IDs as var_names
id_list = ['SPA118', 'SPA117', 'SPA116']

for st in iter_hest('/home/iain/kh/ssd/hest_data/', id_list=id_list):
    # Before mapping: var_names are Ensembl IDs
    print(any(var_name.startswith("ENSG") for var_name in st.adata.var_names))
    print(st.adata.var_names[:5])

    st_updated = ensembleID_to_gene(st)

    # After mapping: var_names are gene names
    print(any(var_name.startswith("ENSG") for var_name in st_updated.adata.var_names))
    print(st_updated.adata.var_names[:5])

Expected output:

True
Index(['ENSG00000000003', 'ENSG00000000005', 'ENSG00000000419',
       'ENSG00000000457', 'ENSG00000000460'],
      dtype='object')
False
Index(['TSPAN6', 'TNMD', 'DPM1', 'SCYL3', 'FIRRM'], dtype='object', name='gene_name')
True
Index(['ENSG00000000003', 'ENSG00000000005', 'ENSG00000000419',
       'ENSG00000000457', 'ENSG00000000460'],
      dtype='object')
False
Index(['TSPAN6', 'TNMD', 'DPM1', 'SCYL3', 'FIRRM'], dtype='object', name='gene_name')
True
Index(['ENSG00000000003', 'ENSG00000000005', 'ENSG00000000419',
       'ENSG00000000457', 'ENSG00000000460'],
      dtype='object')
False
Index(['TSPAN6', 'TNMD', 'DPM1', 'SCYL3', 'FIRRM'], dtype='object', name='gene_name')

@konst-int-i konst-int-i added the enhancement New feature or request label Nov 9, 2024
@pauldoucet
Collaborator

Hi @konst-int-i,
Have you tried on TENX24? It gives me 0 valid genes, weird

@konst-int-i
Collaborator Author

> Hi @konst-int-i, Have you tried on TENX24? It gives me 0 valid genes, weird

Yes, that's because TENX24 doesn't contain any Ensembl IDs, and the default behavior was to invalidate genes without a mapping. Changed it in 4e369f9 to keep genes without a mapping.
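
The difference between the two behaviors can be sketched as follows. This is an illustration under assumptions, not the code from 4e369f9: `rename_genes`, its `keep_unmapped` flag, and `TOY_MAPPING` are hypothetical, and the sample names are placeholders for any sample whose var_names are already gene names.

```python
from typing import Dict, List

# Toy Ensembl ID -> gene name table (assumption, stands in for BioMart).
TOY_MAPPING: Dict[str, str] = {
    "ENSG00000000003": "TSPAN6",
    "ENSG00000000005": "TNMD",
}


def rename_genes(
    var_names: List[str],
    mapping: Dict[str, str],
    keep_unmapped: bool = True,
) -> List[str]:
    """Rename Ensembl IDs to gene names.

    Names without a mapping are kept unchanged when keep_unmapped is True
    and dropped otherwise.
    """
    renamed: List[str] = []
    for name in var_names:
        if name in mapping:
            renamed.append(mapping[name])
        elif keep_unmapped:
            renamed.append(name)
    return renamed


# A sample that already uses gene names: nothing matches the mapping keys.
symbols_only = ["TSPAN6", "TNMD", "DPM1"]

print(rename_genes(symbols_only, TOY_MAPPING, keep_unmapped=False))  # []
print(rename_genes(symbols_only, TOY_MAPPING, keep_unmapped=True))   # ['TSPAN6', 'TNMD', 'DPM1']
```

With `keep_unmapped=False`, a sample without Ensembl IDs ends up with no genes at all, which matches the "0 valid genes" symptom reported above.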
