KeyError 'Field <GENE> does not exist in schema' at prune2df step for tutorial #589

Open
tuanpham96 opened this issue Oct 23, 2024 · 4 comments
Labels
bug Something isn't working


Description

I'm running the tutorial and keep getting errors like these at the prune2df step:

Exception: 'KeyError(\'Field "Snora5c" exists 2 times in schema\')'
[...]
Exception: 'KeyError(\'Field "1600002H07Rik" does not exist in schema\')'

Steps to reproduce the behavior

  1. Command run when the error occurred:
Import & Define resources
# import
import os
import sys
import glob
import re

import numpy as np
import pandas as pd

from dask.diagnostics import ProgressBar
from dask.distributed import Client, LocalCluster

from arboreto.utils import load_tf_names
from arboreto.algo import grnboost2, genie3

from ctxcore.rnkdb import FeatherRankingDatabase as RankingDatabase
from pyscenic.utils import modules_from_adjacencies, load_motifs
from pyscenic.prune import prune2df, df2regulons
from pyscenic.aucell import aucell

# define paths
OUTPUT_DATA_FOLDER = "data/grn"
INPUT_EXPR_FILE = 'data/external/geo/GSE60361_C1-3005-Expression.txt'

RESOURCES_DIRECTORY = "data/external/aertslab/resources.aertslab.org/cistarget"
DATABASES_GLOB = os.path.join(
    RESOURCES_DIRECTORY,
    "databases/mus_musculus/mm9/refseq_r45/mc9nr/gene_based/",
    "mm9-*.mc9nr.genes_vs_motifs.rankings.feather"
)

MOTIF_ANNOTATIONS_FNAME = os.path.join(
    RESOURCES_DIRECTORY,
    "motif2tf/motifs-v9-nr.mgi-m0.001-o0.0.tbl"
)

MM_TFS_FNAME = os.path.join(
    RESOURCES_DIRECTORY,
    "tf_lists/allTFs_mm.txt"
)

REGULONS_FNAME = os.path.join(OUTPUT_DATA_FOLDER, "regulons.p")
MOTIFS_FNAME = os.path.join(OUTPUT_DATA_FOLDER, "motifs.csv")

Here's what the resource directory looks like:

data/external/aertslab/resources.aertslab.org/cistarget
├── databases
│   └── mus_musculus
│       ├── mm10
│       │   ├── refseq_r80
│       │   │   ├── mc9nr
│       │   │   │   └── gene_based
│       │   │   └── mc_v10_clust
│       │   │       └── gene_based
│       │   └── screen
│       │       └── mc_v10_clust
│       │           └── region_based
│       └── mm9
│           ├── refseq_r45
│           │   └── mc9nr
│           │       └── gene_based
│           │           ├── mm9-500bp-upstream-10species.mc9nr.genes_vs_motifs.rankings.feather
│           │           ├── mm9-500bp-upstream-10species.mc9nr.genes_vs_motifs.rankings.feather.sha1sum.txt
│           │           ├── mm9-500bp-upstream-7species.mc9nr.genes_vs_motifs.rankings.feather
│           │           ├── mm9-500bp-upstream-7species.mc9nr.genes_vs_motifs.rankings.feather.sha1sum.txt
│           │           ├── mm9-tss-centered-10kb-10species.mc9nr.genes_vs_motifs.rankings.feather
│           │           ├── mm9-tss-centered-10kb-10species.mc9nr.genes_vs_motifs.rankings.feather.sha1sum.txt
│           │           ├── mm9-tss-centered-10kb-7species.mc9nr.genes_vs_motifs.rankings.feather
│           │           ├── mm9-tss-centered-10kb-7species.mc9nr.genes_vs_motifs.rankings.feather.sha1sum.txt
│           │           ├── mm9-tss-centered-5kb-10species.mc9nr.genes_vs_motifs.rankings.feather
│           │           ├── mm9-tss-centered-5kb-10species.mc9nr.genes_vs_motifs.rankings.feather.sha1sum.txt
│           │           ├── mm9-tss-centered-5kb-7species.mc9nr.genes_vs_motifs.rankings.feather
│           │           └── mm9-tss-centered-5kb-7species.mc9nr.genes_vs_motifs.rankings.feather.sha1sum.txt
│           └── refseq_r70
│               └── mc9nr
│                   └── region_based
├── motif2tf
│   ├── motifs-v10nr_clust-nr.chicken-m0.001-o0.0.tbl
│   ├── motifs-v10nr_clust-nr.flybase-m0.001-o0.0.tbl
│   ├── motifs-v10nr_clust-nr.hgnc-m0.001-o0.0.tbl
│   ├── motifs-v10nr_clust-nr.mgi-m0.001-o0.0.tbl
│   ├── motifs-v8-nr.flybase-m0.001-o0.0.tbl
│   ├── motifs-v9-nr.flybase-m0.001-o0.0.tbl
│   ├── motifs-v9-nr.hgnc-m0.001-o0.0.tbl
│   └── motifs-v9-nr.mgi-m0.001-o0.0.tbl
└── tf_lists
    ├── allTFs_dmel.txt
    ├── allTFs_hg38.txt
    └── allTFs_mm.txt
Load data
ex_matrix = pd.read_csv(INPUT_EXPR_FILE, sep='\t', header=0, index_col=0).T
ex_matrix.shape
---
(3005, 19972)
tf_names = load_tf_names(MM_TFS_FNAME)
len(tf_names)
---
1860
db_fnames = glob.glob(DATABASES_GLOB)
def name(fname):
    return os.path.splitext(os.path.basename(fname))[0]
dbs = [RankingDatabase(fname=fname, name=name(fname)) for fname in db_fnames]
dbs
---
[FeatherRankingDatabase(name="mm9-500bp-upstream-10species.mc9nr.genes_vs_motifs.rankings"),
 FeatherRankingDatabase(name="mm9-500bp-upstream-7species.mc9nr.genes_vs_motifs.rankings"),
 FeatherRankingDatabase(name="mm9-tss-centered-10kb-10species.mc9nr.genes_vs_motifs.rankings"),
 FeatherRankingDatabase(name="mm9-tss-centered-10kb-7species.mc9nr.genes_vs_motifs.rankings"),
 FeatherRankingDatabase(name="mm9-tss-centered-5kb-10species.mc9nr.genes_vs_motifs.rankings"),
 FeatherRankingDatabase(name="mm9-tss-centered-5kb-7species.mc9nr.genes_vs_motifs.rankings")]

Then I ran the steps as in the tutorial:

adjacencies = grnboost2(
    ex_matrix, 
    tf_names=tf_names, 
    verbose=True
)

modules = list(modules_from_adjacencies(adjacencies, ex_matrix))

The above steps worked fine. The next step, prune2df, is the one that failed.

Since I'm running on a university HPC, I followed this comment:

with ProgressBar():
    df = prune2df(
        dbs, modules, MOTIF_ANNOTATIONS_FNAME,
        client_or_address=Client(LocalCluster())
    )
  2. Error encountered:

Here's a snippet of the traceback:

KeyError: 'Field "1810058I24Rik" does not exist in schema'
2024-10-23 12:20:05,208 - distributed.worker - WARNING - Compute Failed
Key:       ('modules2df-to_pyarrow_string-cbd9587a233d2930af793e697ea79787', 44894)
Function:  execute_task
args:      ((subgraph_callable-6ab9cdec5cc5a887810c34ced19255d5, (functools.partial(<function modules2df at 0x7f2faeb34430>, module2features_func=functools.partial(<function module2features_auc1st_impl at 0x7f2faeb34c10>, rank_threshold=1500, auc_threshold=0.05, nes_threshold=3.0, filter_for_annotation=True), weighted_recovery=False), FeatherRankingDatabase(name="mm9-tss-centered-5kb-10species.mc9nr.genes_vs_motifs.rankings"), [Regulon(name='Zmat2', gene2weight=frozendict.frozendict({'0610009L18Rik': 0.24334112928456972, '0610010F05Rik': 0.12469187083208143, '0610011F06Rik': 0.07426427896756556, '0610040B10Rik': 0.5023871257709747, '1110002L01Rik': 0.25781507539505055, '1110004E09Rik': 2.0959058794181535, '1110004F10Rik': 0.4213780187493597, '1110008L16Rik': 0.2437094347255712, '1110008P14Rik': 1.2910006242509904, '1110032F04Rik': 0.33099966516163515, '1110035M17Rik': 1.2395167785396992, '1110037F02Rik': 0.9282107870741647, '1110038F14Rik': 1.939926191405302, '1110046J04Rik': 0.2586313695021531, 
kwargs:    {}
Exception: 'KeyError(\'Field "0610010K14Rik" does not exist in schema\')'
Full traceback
/users/<MY-USERNAME>/data/<MY-USERNAME>/conda-env/sequencing/lib/python3.10/site-packages/distributed/node.py:182: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 40923 instead
  warnings.warn(
/users/<MY-USERNAME>/data/<MY-USERNAME>/conda-env/sequencing/lib/python3.10/site-packages/distributed/client.py:3169: UserWarning: Sending large graph of size 67.13 MiB.
This may cause some slowdown.
Consider scattering data ahead of time and using futures.
  warnings.warn(

2024-10-23 12:20:00,160 - pyscenic.transform - WARNING - Less than 80% of the genes in Regulon for Snapc5 could be mapped to mm9-500bp-upstream-7species.mc9nr.genes_vs_motifs.rankings. Skipping this module.

2024-10-23 12:20:00,317 - pyscenic.transform - WARNING - Less than 80% of the genes in Regulon for Zkscan8 could be mapped to mm9-tss-centered-5kb-10species.mc9nr.genes_vs_motifs.rankings. Skipping this module.

2024-10-23 12:20:00,375 - pyscenic.transform - WARNING - Less than 80% of the genes in Regulon for Snai2 could be mapped to mm9-500bp-upstream-7species.mc9nr.genes_vs_motifs.rankings. Skipping this module.

2024-10-23 12:20:00,558 - pyscenic.transform - WARNING - Less than 80% of the genes in Zscan4f could be mapped to mm9-tss-centered-5kb-10species.mc9nr.genes_vs_motifs.rankings. Skipping this module.
2024-10-23 12:20:00,703 - distributed.worker - WARNING - Compute Failed
Key:       ('modules2df-to_pyarrow_string-cbd9587a233d2930af793e697ea79787', 9993)
Function:  execute_task
args:      ((subgraph_callable-6ab9cdec5cc5a887810c34ced19255d5, (functools.partial(<function modules2df at 0x7fb1799ea710>, module2features_func=functools.partial(<function module2features_auc1st_impl at 0x7fb1799ea830>, rank_threshold=1500, auc_threshold=0.05, nes_threshold=3.0, filter_for_annotation=True), weighted_recovery=False), FeatherRankingDatabase(name="mm9-500bp-upstream-7species.mc9nr.genes_vs_motifs.rankings"), [Regulon(name='Regulon for Snai3', gene2weight=frozendict.frozendict({'Nptxr': 3.71858079580093, '4930422G04Rik': 3.4189593778885565, 'Tst': 3.4019598703995646, 'Gm16982': 3.2794141345795023, 'Fam71d': 2.770652596133948, 'Trim65': 2.5165290070963446, '6430571L13Rik': 2.322653157302345, 'Snora68': 2.2677351112931254, 'Mir1983': 2.1049208171740688, '4833412C05Rik': 1.9567400648623148, 'Slfn9': 1.918912519709758, 'Myh14': 1.8155076543676392, 'Adrb3': 1.7655713128799888, 'Psd4': 1.7593517666032166, 'Snora5c': 1.742871015263215, 'Armc2': 1.722555406741925, 'Vmn2r87': 1.697990923076
kwargs:    {}
Exception: 'KeyError(\'Field "Snora5c" exists 2 times in schema\')'

2024-10-23 12:20:00,708 - distributed.worker - WARNING - Compute Failed
Key:       ('modules2df-to_pyarrow_string-cbd9587a233d2930af793e697ea79787', 9994)
Function:  execute_task
args:      ((subgraph_callable-6ab9cdec5cc5a887810c34ced19255d5, (functools.partial(<function modules2df at 0x7fb1799ea710>, module2features_func=functools.partial(<function module2features_auc1st_impl at 0x7fb1799ea830>, rank_threshold=1500, auc_threshold=0.05, nes_threshold=3.0, filter_for_annotation=True), weighted_recovery=False), FeatherRankingDatabase(name="mm9-500bp-upstream-7species.mc9nr.genes_vs_motifs.rankings"), [Regulon(name='Regulon for Snapc4', gene2weight=frozendict.frozendict({'C330006A16Rik': 4.349251118259209, 'C130060C02Rik': 4.208884115579275, 'Gm7854': 3.5410158417095827, 'Sarm1': 2.927046278423663, 'Lenep': 2.785681164068822, 'B3galt6': 2.6131502759106913, 'Gm11202': 2.462704280429404, 'Sgcz': 2.3117112215460534, 'Lrpprc': 1.9038514495386707, 'Nsun7': 1.8296346841767743, 'Gpsm2': 1.8010605529771901, 'Pcnxl3': 1.7919756056901275, 'Gm5801_loc2': 1.7491126804317503, 'Wdr31': 1.570558011998353, 'Cxcl9': 1.556944185120764, 'Kctd8': 1.5382108739561857, 'Cad': 1.5234094823248008, 
kwargs:    {}
Exception: 'KeyError(\'Field "1600002H07Rik" does not exist in schema\')'

[... TRUNCATED DUE TO LIMIT ON GITHUB ISSUE ...]

2024-10-23 12:20:03,485 - distributed.worker - WARNING - Compute Failed
Key:       ('modules2df-to_pyarrow_string-cbd9587a233d2930af793e697ea79787', 9956)
Function:  execute_task
args:      ((subgraph_callable-6ab9cdec5cc5a887810c34ced19255d5, (functools.partial(<function modules2df at 0x7fb1799ea710>, module2features_func=functools.partial(<function module2features_auc1st_impl at 0x7fb1799ea830>, rank_threshold=1500, auc_threshold=0.05, nes_threshold=3.0, filter_for_annotation=True), weighted_recovery=False), FeatherRankingDatabase(name="mm9-500bp-upstream-7species.mc9nr.genes_vs_motifs.rankings"), [Regulon(name='Regulon for Sfpq', gene2weight=frozendict.frozendict({'Erdr1': 3.0321972987058734, 'Zc3h11a': 2.691669912271372, 'Fnbp4': 2.6088463086281473, '1110037F02Rik': 2.0684340440320277, 'Tdrd7': 2.0345506791813444, 'Polr1a': 2.010575150065573, 'Tnrc6c': 1.960275733025441, 'Nup155': 1.8890378889442363, 'Crebzf': 1.8710276416734972, 'Arid2': 1.8643832712747197, '1810026B05Rik': 1.8462831579510215, 'Snrnp70': 1.844349613916035, 'Pprc1': 1.7901300300003178, 'Cntnap5a': 1.7415507727206572, '0610009O20Rik': 1.707972820514252, 'Chd9': 1.6735837131019469, 'Pkd1l3': 1.661729812
kwargs:    {}
Exception: 'KeyError(\'Field "0610030E20Rik" does not exist in schema\')'

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[11], line 2
      1 with ProgressBar():
----> 2     df = prune2df(
      3         dbs, modules, MOTIF_ANNOTATIONS_FNAME,
      4         client_or_address=Client(LocalCluster())
      5     )

File ~/data/<MY-USERNAME>/conda-env/sequencing/lib/python3.10/site-packages/pyscenic/prune.py:424, in prune2df(rnkdbs, modules, motif_annotations_fname, rank_threshold, auc_threshold, nes_threshold, motif_similarity_fdr, orthologuous_identity_threshold, weighted_recovery, client_or_address, num_workers, module_chunksize, filter_for_annotation)
    418 # Create a distributed dataframe from individual delayed objects to avoid out of memory problems.
    419 aggregation_func = (
    420     partial(from_delayed, meta=DF_META_DATA)
    421     if client_or_address != "custom_multiprocessing"
    422     else pd.concat
    423 )
--> 424 return _distributed_calc(
    425     rnkdbs,
    426     modules,
    427     motif_annotations_fname,
    428     transformation_func,
    429     aggregation_func,
    430     motif_similarity_fdr,
    431     orthologuous_identity_threshold,
    432     client_or_address,
    433     num_workers,
    434     module_chunksize,
    435 )

File ~/data/<MY-USERNAME>/conda-env/sequencing/lib/python3.10/site-packages/pyscenic/prune.py:362, in _distributed_calc(rnkdbs, modules, motif_annotations_fname, transform_func, aggregate_func, motif_similarity_fdr, orthologuous_identity_threshold, client_or_address, num_workers, module_chunksize)
    357 client, shutdown_callback = _prepare_client(
    358     client_or_address,
    359     num_workers=num_workers if num_workers else cpu_count(),
    360 )
    361 try:
--> 362     return client.compute(create_graph(client), sync=True)
    363 finally:
    364     shutdown_callback(False)

File ~/data/<MY-USERNAME>/conda-env/sequencing/lib/python3.10/site-packages/distributed/client.py:3502, in Client.compute(self, collections, sync, optimize_graph, workers, allow_other_workers, resources, retries, priority, fifo_timeout, actors, traverse, **kwargs)
   3499         futures.append(arg)
   3501 if sync:
-> 3502     result = self.gather(futures)
   3503 else:
   3504     result = futures

File ~/data/<MY-USERNAME>/conda-env/sequencing/lib/python3.10/site-packages/distributed/client.py:2384, in Client.gather(self, futures, errors, direct, asynchronous)
   2381     local_worker = None
   2383 with shorten_traceback():
-> 2384     return self.sync(
   2385         self._gather,
   2386         futures,
   2387         errors=errors,
   2388         direct=direct,
   2389         local_worker=local_worker,
   2390         asynchronous=asynchronous,
   2391     )

File ~/data/<MY-USERNAME>/conda-env/sequencing/lib/python3.10/site-packages/pyscenic/transform.py:368, in modules2df()
    356 def modules2df(
    357     db: Type[RankingDatabase],
    358     modules: Sequence[Regulon],
   (...)
    365     # to be fixed for the dask framework.
    366     # TODO: Remove this restriction.
    367     return pd.concat(
--> 368         [
    369             module2df(
    370                 db,
    371                 module,
    372                 motif_annotations,
    373                 weighted_recovery,
    374                 False,
    375                 module2features_func,
    376             )
    377             for module in modules
    378         ]
    379     )

File ~/data/<MY-USERNAME>/conda-env/sequencing/lib/python3.10/site-packages/pyscenic/transform.py:369, in <listcomp>()
    356 def modules2df(
    357     db: Type[RankingDatabase],
    358     modules: Sequence[Regulon],
   (...)
    365     # to be fixed for the dask framework.
    366     # TODO: Remove this restriction.
    367     return pd.concat(
    368         [
--> 369             module2df(
    370                 db,
    371                 module,
    372                 motif_annotations,
    373                 weighted_recovery,
    374                 False,
    375                 module2features_func,
    376             )
    377             for module in modules
    378         ]
    379     )

File ~/data/<MY-USERNAME>/conda-env/sequencing/lib/python3.10/site-packages/pyscenic/transform.py:287, in module2df()
    285 # Derive enriched and TF-annotated features for module.
    286 try:
--> 287     df_annotated_features, rccs, rankings, genes, avg2stdrcc = module2features_func(
    288         db, module, motif_annotations, weighted_recovery=weighted_recovery
    289     )
    290 except MemoryError:
    291     LOGGER.error(
    292         'Unable to process "{}" on database "{}" because ran out of memory. Stacktrace:'.format(
    293             module.name, db.name
    294         )
    295     )

File ~/data/<MY-USERNAME>/conda-env/sequencing/lib/python3.10/site-packages/pyscenic/transform.py:176, in module2features_auc1st_impl()
    162 """
    163 Create a dataframe of enriched and annotated features a given ranking database and a co-expression module.
    164 
   (...)
    172 :return: A dataframe with enriched and annotated features.
    173 """
    175 # Load rank of genes from database.
--> 176 df = db.load(module)
    177 features, genes, rankings = df.index.values, df.columns.values, df.values
    178 weights = (
    179     np.asarray([module[gene] for gene in genes])
    180     if weighted_recovery
    181     else np.ones(len(genes))
    182 )

File ~/data/<MY-USERNAME>/conda-env/sequencing/lib/python3.10/site-packages/ctxcore/rnkdb.py:132, in load()
    128 def load(self, gs: GeneSignature) -> pd.DataFrame:
    129     # For some genes in the signature there might not be a rank available in the database.
    130     gene_set = self.geneset.intersection(set(gs.genes))
--> 132     return self.ct_db.subset_to_pandas(
    133         region_or_gene_ids=RegionOrGeneIDs(
    134             region_or_gene_ids=gene_set,
    135             regions_or_genes_type=self.ct_db.all_region_or_gene_ids.type,
    136         )
    137     )

File ~/data/<MY-USERNAME>/conda-env/sequencing/lib/python3.10/site-packages/ctxcore/ctdb.py:789, in subset_to_pandas()
    785 engine = engine if engine else self.engine
    787 # Fetch scores or rankings for input region IDs or gene IDs from cisTarget database file for region IDs or
    788 # gene IDs which were not prefetched in previous calls.
--> 789 self.prefetch(region_or_gene_ids=region_or_gene_ids, engine=engine, sort=True)
    791 if not self.df_cached:
    792     raise RuntimeError(
    793         f"Prefetch failed to retrieve {self.scores_or_rankings} for "
    794         f"{region_or_gene_ids} from cisTarget database "
    795         f'"{self.ct_db_filename}".'
    796     )

File ~/data/<MY-USERNAME>/conda-env/sequencing/lib/python3.10/site-packages/ctxcore/ctdb.py:739, in prefetch()
    734     self._prefetch_as_polars_dataframe(
    735         region_or_gene_ids=region_or_gene_ids, use_pyarrow=True, sort=sort
    736     )
    737 elif engine == "pyarrow":
    738     # Store prefetched data as pyarrow Table (self.df_cached) and read data with pyarrow's native IPC reader.
--> 739     self._prefetch_as_pyarrow_table(
    740         region_or_gene_ids=region_or_gene_ids, sort=sort
    741     )
    742 else:
    743     raise ValueError(
    744         f'Unsupported engine "{engine}" for reading cisTarget database.'
    745     )

File ~/data/<MY-USERNAME>/conda-env/sequencing/lib/python3.10/site-packages/ctxcore/ctdb.py:678, in _prefetch_as_pyarrow_table()
    673 self.region_or_gene_ids_loaded = found_region_or_gene_ids.union(
    674     self.region_or_gene_ids_loaded
    675 )
    677 # Store new pyarrow Table with previously and newly loaded region IDs or gene IDs scores/rankings.
--> 678 self.df_cached = pa_table.select(
    679     (
    680         self.region_or_gene_ids_loaded.sort().ids
    681         if sort
    682         else self.region_or_gene_ids_loaded.ids
    683     )
    684     + (self.all_motif_or_track_ids.type.value,)
    685 )

File ~/data/<MY-USERNAME>/conda-env/sequencing/lib/python3.10/site-packages/pyarrow/table.pxi:4207, in pyarrow.lib.Table.select()
   4205 
   4206         for idx in columns:
-> 4207             idx = self._ensure_integer_index(idx)
   4208             idx = _normalize_index(idx, self.num_columns)
   4209             c_indices.push_back(<int> idx)

File ~/data/<MY-USERNAME>/conda-env/sequencing/lib/python3.10/site-packages/pyarrow/table.pxi:1668, in pyarrow.lib._Tabular._ensure_integer_index()
   1666 
   1667             if len(field_indices) == 0:
-> 1668                 raise KeyError("Field \"{}\" does not exist in schema"
   1669                                .format(i))
   1670             elif len(field_indices) > 1:

KeyError: 'Field "1810058I24Rik" does not exist in schema'
2024-10-23 12:20:05,208 - distributed.worker - WARNING - Compute Failed
Key:       ('modules2df-to_pyarrow_string-cbd9587a233d2930af793e697ea79787', 44894)
Function:  execute_task
args:      ((subgraph_callable-6ab9cdec5cc5a887810c34ced19255d5, (functools.partial(<function modules2df at 0x7f2faeb34430>, module2features_func=functools.partial(<function module2features_auc1st_impl at 0x7f2faeb34c10>, rank_threshold=1500, auc_threshold=0.05, nes_threshold=3.0, filter_for_annotation=True), weighted_recovery=False), FeatherRankingDatabase(name="mm9-tss-centered-5kb-10species.mc9nr.genes_vs_motifs.rankings"), [Regulon(name='Zmat2', gene2weight=frozendict.frozendict({'0610009L18Rik': 0.24334112928456972, '0610010F05Rik': 0.12469187083208143, '0610011F06Rik': 0.07426427896756556, '0610040B10Rik': 0.5023871257709747, '1110002L01Rik': 0.25781507539505055, '1110004E09Rik': 2.0959058794181535, '1110004F10Rik': 0.4213780187493597, '1110008L16Rik': 0.2437094347255712, '1110008P14Rik': 1.2910006242509904, '1110032F04Rik': 0.33099966516163515, '1110035M17Rik': 1.2395167785396992, '1110037F02Rik': 0.9282107870741647, '1110038F14Rik': 1.939926191405302, '1110046J04Rik': 0.2586313695021531, 
kwargs:    {}
Exception: 'KeyError(\'Field "0610010K14Rik" does not exist in schema\')'
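The two messages in the traceback come from the same column lookup in pyarrow's `Table.select`: a requested name that matches no column is reported as "does not exist in schema", and a name that matches more than one column is reported as "exists N times in schema". As a rough way to see which gene names fall into which case, the lookup can be imitated in plain Python. This is only a sketch: `classify_fields` and the toy schema are hypothetical names made up for illustration and are not part of pySCENIC, ctxcore, or pyarrow.

```python
from collections import Counter


def classify_fields(schema_names, requested):
    """Sort requested column names by how often they occur in a schema.

    Mirrors the outcome of a name-based column lookup: zero matches is
    "missing", more than one match is "duplicated", exactly one is "ok".
    """
    counts = Counter(schema_names)
    report = {"ok": [], "missing": [], "duplicated": []}
    for name in requested:
        n = counts.get(name, 0)
        if n == 0:
            report["missing"].append(name)
        elif n > 1:
            report["duplicated"].append(name)
        else:
            report["ok"].append(name)
    return report


# Toy schema (hypothetical): one gene column appears twice.
schema = ["motifs", "Nptxr", "Snora5c", "Snora5c", "Tst"]
report = classify_fields(schema, ["Nptxr", "Snora5c", "1600002H07Rik"])
print(report)
```

For a real database file, the list of column names could be obtained with `pyarrow.feather.read_table(fname).column_names` (assuming pyarrow is installed and the file fits in memory) and passed in as `schema_names`, with the module's genes as `requested`.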

Please complete the following information:

  • pySCENIC version: due to the current numpy issue, I installed via pip git+...
  • Installation method: first created a conda environment (Python 3.10.14), then pip git+...
  • Run environment: Jupyter Notebook on university HPC (1 node, 40 cores, 120 GB RAM)
  • OS: Linux (I believe the HPC uses RHEL/9.2)
  • Package versions:
aiohttp                   3.10.0
anndata                   0.10.8
arboreto                  0.1.6
arrow                     1.3.0
attrs                     23.2.0
boltons                   24.0.0
cloudpickle               3.0.0
ctxcore                   0.2.0
cytoolz                   0.12.3
dask                      2024.2.1
dask-expr                 0.5.3
distributed               2024.2.1
feather-format            0.4.1
frozendict                2.4.4
fsspec                    2024.6.1
interlap                  0.2.7
llvmlite                  0.43.0
loompy                    3.0.7
matplotlib                3.9.2
matplotlib-inline         0.1.7
multiprocessing_on_dill   3.5.0a4
networkx                  3.3
numba                     0.60.0
numexpr                   2.10.1
numpy                     1.26.4
numpy-groupies            0.11.2
pandas                    2.2.2
pandas-flavor             0.6.0
pyarrow                   17.0.0
pyarrow-hotfix            0.6
pyscenic                  0.12.1+8.gd2309fe
requests                  2.32.3
scanpy                    1.10.2
scikit-learn              1.5.1
scipy                     1.14.0
seaborn                   0.13.2
setuptools                71.0.4
tqdm                      4.66.4
umap-learn                0.5.6
@tuanpham96 tuanpham96 added the bug Something isn't working label Oct 23, 2024
@tuanpham96
Author

Update: I also tested with Singularity and got the same error.

# build image & bind path
singularity build pyscenic.sif docker://aertslab/pyscenic_scanpy:0.12.1_1.9.1
export SINGULARITY_BINDPATH="/oscar/home/$USER,/oscar/scratch/$USER,/oscar/data" # this is from our HPC's guide for binding path
# create a shell inside
singularity shell utils/pyscenic.sif

Then, inside the shell, I started an IPython kernel and pasted the same code. The same errors occurred.

Am I defining the right resources? Some pages at the resources URL are marked as deprecated, but I'm not sure which resources I should switch to.
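One way to sanity-check whether a ranking database matches the gene naming of the expression data is to measure how many of the query genes the database knows about, which is roughly what the "Less than 80% of the genes ... could be mapped" warnings in the log are about. The sketch below is an illustration only: `coverage` is a hypothetical helper, the gene sets are toy data, and the exact rule pySCENIC applies internally is not reproduced here.

```python
def coverage(query_genes, database_genes):
    """Return the fraction of unique query genes present in the database."""
    query = set(query_genes)
    if not query:
        return 0.0
    return len(query & set(database_genes)) / len(query)


# Toy data (hypothetical): four of the five module genes are in the database.
db_genes = {"Nptxr", "Tst", "Sarm1", "Cad"}
module_genes = ["Nptxr", "Tst", "Sarm1", "Cad", "Gm7854"]
frac = coverage(module_genes, db_genes)
print(f"{frac:.0%} of module genes found in database")
```

With the objects defined in the issue, something like `coverage(ex_matrix.columns, dbs[0].geneset)` could give a first impression per database, assuming `geneset` is exposed as it appears in the ctxcore traceback. A low value would suggest that the gene symbols in the expression matrix and the database come from different annotation versions.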

@ghuls
Member

ghuls commented Oct 24, 2024

Run the command line version and not the notebook version:
https://pyscenic.readthedocs.io/en/latest/installation.html#docker-podman-and-singularity-apptainer-images

@tuanpham96
Author

tuanpham96 commented Oct 24, 2024

I'm using the Singularity image with the CLI, and it seems to be stuck at the ctx step for more than 2 hours without finishing. I'm using --mode "custom_multiprocessing" --num_workers 40. Is that typical?

@tuanpham96
Author

Never mind. From reading other issues, it seems I need more RAM and fewer cores. With 20 cores and 200 GB, the ctx step seems to finish within 20 to 25 minutes using the Singularity image with "dask_multiprocessing".

Is there a guide about suggested minimum RAM + # cores for each step, given some number of genes / cells / databases?
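As a back-of-the-envelope comparison of the two setups mentioned in this thread, the memory available to each worker can be estimated by dividing total RAM evenly across workers. This is a simplification: `gb_per_worker` is a hypothetical helper, an even split is an assumption, and the first figure assumes the 40-worker run used the 120 GB node described in the issue.

```python
def gb_per_worker(total_ram_gb, num_workers):
    """Estimate RAM per worker, assuming memory is split evenly."""
    if num_workers <= 0:
        raise ValueError("num_workers must be positive")
    return total_ram_gb / num_workers


# Figures taken from the thread; the even split is an assumption.
first = gb_per_worker(120, 40)   # 40 workers on the 120 GB node
second = gb_per_worker(200, 20)  # 20 workers with 200 GB
print(f"40 workers / 120 GB: {first:.1f} GB per worker")
print(f"20 workers / 200 GB: {second:.1f} GB per worker")
```

Under these assumptions, the second configuration gives each worker roughly three times more memory, which is consistent with the observation that fewer cores and more RAM let the step finish.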
