KeyError 'Field <GENE> does not exist in schema' at `prune2df` step for tutorial #589

tuanpham96 · 2024-10-23T17:11:51Z

Description

I'm running the tutorial and keep getting errors like these at the prune2df step:

Exception: 'KeyError(\'Field "Snora5c" exists 2 times in schema\')'
[...]
Exception: 'KeyError(\'Field "1600002H07Rik" does not exist in schema\')'

Steps to reproduce the behavior

Command run when the error occurred:

Import & Define resources

# import
import os
import sys
import glob
import re

import numpy as np
import pandas as pd

from dask.diagnostics import ProgressBar
from dask.distributed import Client, LocalCluster

from arboreto.utils import load_tf_names
from arboreto.algo import grnboost2, genie3

from ctxcore.rnkdb import FeatherRankingDatabase as RankingDatabase
from pyscenic.utils import modules_from_adjacencies, load_motifs
from pyscenic.prune import prune2df, df2regulons
from pyscenic.aucell import aucell

# define paths
OUTPUT_DATA_FOLDER = "data/grn"
INPUT_EXPR_FILE = 'data/external/geo/GSE60361_C1-3005-Expression.txt'

RESOURCES_DIRECTORY = "data/external/aertslab/resources.aertslab.org/cistarget"
DATABASES_GLOB = os.path.join(
    RESOURCES_DIRECTORY,
    "databases/mus_musculus/mm9/refseq_r45/mc9nr/gene_based/",
    "mm9-*.mc9nr.genes_vs_motifs.rankings.feather"
)

MOTIF_ANNOTATIONS_FNAME = os.path.join(
    RESOURCES_DIRECTORY,
    "motif2tf/motifs-v9-nr.mgi-m0.001-o0.0.tbl"
)

MM_TFS_FNAME = os.path.join(
    RESOURCES_DIRECTORY,
    "tf_lists/allTFs_mm.txt"
)

REGULONS_FNAME = os.path.join(OUTPUT_DATA_FOLDER, "regulons.p")
MOTIFS_FNAME = os.path.join(OUTPUT_DATA_FOLDER, "motifs.csv")

Here's what the resource directory looks like:

data/external/aertslab/resources.aertslab.org/cistarget
├── databases
│   └── mus_musculus
│       ├── mm10
│       │   ├── refseq_r80
│       │   │   ├── mc9nr
│       │   │   │   └── gene_based
│       │   │   └── mc_v10_clust
│       │   │       └── gene_based
│       │   └── screen
│       │       └── mc_v10_clust
│       │           └── region_based
│       └── mm9
│           ├── refseq_r45
│           │   └── mc9nr
│           │       └── gene_based
│           │           ├── mm9-500bp-upstream-10species.mc9nr.genes_vs_motifs.rankings.feather
│           │           ├── mm9-500bp-upstream-10species.mc9nr.genes_vs_motifs.rankings.feather.sha1sum.txt
│           │           ├── mm9-500bp-upstream-7species.mc9nr.genes_vs_motifs.rankings.feather
│           │           ├── mm9-500bp-upstream-7species.mc9nr.genes_vs_motifs.rankings.feather.sha1sum.txt
│           │           ├── mm9-tss-centered-10kb-10species.mc9nr.genes_vs_motifs.rankings.feather
│           │           ├── mm9-tss-centered-10kb-10species.mc9nr.genes_vs_motifs.rankings.feather.sha1sum.txt
│           │           ├── mm9-tss-centered-10kb-7species.mc9nr.genes_vs_motifs.rankings.feather
│           │           ├── mm9-tss-centered-10kb-7species.mc9nr.genes_vs_motifs.rankings.feather.sha1sum.txt
│           │           ├── mm9-tss-centered-5kb-10species.mc9nr.genes_vs_motifs.rankings.feather
│           │           ├── mm9-tss-centered-5kb-10species.mc9nr.genes_vs_motifs.rankings.feather.sha1sum.txt
│           │           ├── mm9-tss-centered-5kb-7species.mc9nr.genes_vs_motifs.rankings.feather
│           │           └── mm9-tss-centered-5kb-7species.mc9nr.genes_vs_motifs.rankings.feather.sha1sum.txt
│           └── refseq_r70
│               └── mc9nr
│                   └── region_based
├── motif2tf
│   ├── motifs-v10nr_clust-nr.chicken-m0.001-o0.0.tbl
│   ├── motifs-v10nr_clust-nr.flybase-m0.001-o0.0.tbl
│   ├── motifs-v10nr_clust-nr.hgnc-m0.001-o0.0.tbl
│   ├── motifs-v10nr_clust-nr.mgi-m0.001-o0.0.tbl
│   ├── motifs-v8-nr.flybase-m0.001-o0.0.tbl
│   ├── motifs-v9-nr.flybase-m0.001-o0.0.tbl
│   ├── motifs-v9-nr.hgnc-m0.001-o0.0.tbl
│   └── motifs-v9-nr.mgi-m0.001-o0.0.tbl
└── tf_lists
    ├── allTFs_dmel.txt
    ├── allTFs_hg38.txt
    └── allTFs_mm.txt

Load data

ex_matrix = pd.read_csv(INPUT_EXPR_FILE, sep='\t', header=0, index_col=0).T
ex_matrix.shape
---
(3005, 19972)

tf_names = load_tf_names(MM_TFS_FNAME)
len(tf_names)
---
1860

db_fnames = glob.glob(DATABASES_GLOB)
def name(fname):
    return os.path.splitext(os.path.basename(fname))[0]
dbs = [RankingDatabase(fname=fname, name=name(fname)) for fname in db_fnames]
dbs
---
[FeatherRankingDatabase(name="mm9-500bp-upstream-10species.mc9nr.genes_vs_motifs.rankings"),
 FeatherRankingDatabase(name="mm9-500bp-upstream-7species.mc9nr.genes_vs_motifs.rankings"),
 FeatherRankingDatabase(name="mm9-tss-centered-10kb-10species.mc9nr.genes_vs_motifs.rankings"),
 FeatherRankingDatabase(name="mm9-tss-centered-10kb-7species.mc9nr.genes_vs_motifs.rankings"),
 FeatherRankingDatabase(name="mm9-tss-centered-5kb-10species.mc9nr.genes_vs_motifs.rankings"),
 FeatherRankingDatabase(name="mm9-tss-centered-5kb-7species.mc9nr.genes_vs_motifs.rankings")]
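As a sanity check on the resource definitions, the glob pattern and the name() helper can be replayed without touching the file system. This is a minimal sketch using only the standard library. The file names are a subset copied from the directory listing above, and fnmatch stands in for glob.glob, so it only checks the pattern logic, not what is actually on disk.

```python
import fnmatch
import os

# A subset of the file names from the directory listing above.
listing = [
    "mm9-500bp-upstream-10species.mc9nr.genes_vs_motifs.rankings.feather",
    "mm9-500bp-upstream-10species.mc9nr.genes_vs_motifs.rankings.feather.sha1sum.txt",
    "mm9-tss-centered-5kb-7species.mc9nr.genes_vs_motifs.rankings.feather",
    "mm9-tss-centered-5kb-7species.mc9nr.genes_vs_motifs.rankings.feather.sha1sum.txt",
]

# Last component of DATABASES_GLOB.
pattern = "mm9-*.mc9nr.genes_vs_motifs.rankings.feather"


def name(fname):
    # Same helper as above: drop the directory and the final extension.
    return os.path.splitext(os.path.basename(fname))[0]


# The pattern must match the whole name, so the .sha1sum.txt companions are skipped.
matched = sorted(f for f in listing if fnmatch.fnmatchcase(f, pattern))
db_names = [name(f) for f in matched]

for db_name in db_names:
    print(db_name)
```

Only the .feather files match, and the derived names agree with the FeatherRankingDatabase names printed above, so the checksum files are not being picked up as databases.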

Then I ran the steps as in the tutorial:

adjacencies = grnboost2(
    ex_matrix, 
    tf_names=tf_names, 
    verbose=True
)

modules = list(modules_from_adjacencies(adjacencies, ex_matrix))

The steps above worked fine. The prune2df step is the one that fails.

Since I'm running on a university HPC cluster, I followed this comment:

with ProgressBar():
    df = prune2df(
        dbs, modules, MOTIF_ANNOTATIONS_FNAME,
        client_or_address=Client(LocalCluster())
    )

Error encountered:

Here's a snippet of the traceback:

KeyError: 'Field "1810058I24Rik" does not exist in schema'
2024-10-23 12:20:05,208 - distributed.worker - WARNING - Compute Failed
Key:       ('modules2df-to_pyarrow_string-cbd9587a233d2930af793e697ea79787', 44894)
Function:  execute_task
args:      ((subgraph_callable-6ab9cdec5cc5a887810c34ced19255d5, (functools.partial(<function modules2df at 0x7f2faeb34430>, module2features_func=functools.partial(<function module2features_auc1st_impl at 0x7f2faeb34c10>, rank_threshold=1500, auc_threshold=0.05, nes_threshold=3.0, filter_for_annotation=True), weighted_recovery=False), FeatherRankingDatabase(name="mm9-tss-centered-5kb-10species.mc9nr.genes_vs_motifs.rankings"), [Regulon(name='Zmat2', gene2weight=frozendict.frozendict({'0610009L18Rik': 0.24334112928456972, '0610010F05Rik': 0.12469187083208143, '0610011F06Rik': 0.07426427896756556, '0610040B10Rik': 0.5023871257709747, '1110002L01Rik': 0.25781507539505055, '1110004E09Rik': 2.0959058794181535, '1110004F10Rik': 0.4213780187493597, '1110008L16Rik': 0.2437094347255712, '1110008P14Rik': 1.2910006242509904, '1110032F04Rik': 0.33099966516163515, '1110035M17Rik': 1.2395167785396992, '1110037F02Rik': 0.9282107870741647, '1110038F14Rik': 1.939926191405302, '1110046J04Rik': 0.2586313695021531, 
kwargs:    {}
Exception: 'KeyError(\'Field "0610010K14Rik" does not exist in schema\')'

Full traceback

/users/<MY-USERNAME>/data/<MY-USERNAME>/conda-env/sequencing/lib/python3.10/site-packages/distributed/node.py:182: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 40923 instead
  warnings.warn(
/users/<MY-USERNAME>/data/<MY-USERNAME>/conda-env/sequencing/lib/python3.10/site-packages/distributed/client.py:3169: UserWarning: Sending large graph of size 67.13 MiB.
This may cause some slowdown.
Consider scattering data ahead of time and using futures.
  warnings.warn(

2024-10-23 12:20:00,160 - pyscenic.transform - WARNING - Less than 80% of the genes in Regulon for Snapc5 could be mapped to mm9-500bp-upstream-7species.mc9nr.genes_vs_motifs.rankings. Skipping this module.

2024-10-23 12:20:00,317 - pyscenic.transform - WARNING - Less than 80% of the genes in Regulon for Zkscan8 could be mapped to mm9-tss-centered-5kb-10species.mc9nr.genes_vs_motifs.rankings. Skipping this module.

2024-10-23 12:20:00,375 - pyscenic.transform - WARNING - Less than 80% of the genes in Regulon for Snai2 could be mapped to mm9-500bp-upstream-7species.mc9nr.genes_vs_motifs.rankings. Skipping this module.

2024-10-23 12:20:00,558 - pyscenic.transform - WARNING - Less than 80% of the genes in Zscan4f could be mapped to mm9-tss-centered-5kb-10species.mc9nr.genes_vs_motifs.rankings. Skipping this module.
2024-10-23 12:20:00,703 - distributed.worker - WARNING - Compute Failed
Key:       ('modules2df-to_pyarrow_string-cbd9587a233d2930af793e697ea79787', 9993)
Function:  execute_task
args:      ((subgraph_callable-6ab9cdec5cc5a887810c34ced19255d5, (functools.partial(<function modules2df at 0x7fb1799ea710>, module2features_func=functools.partial(<function module2features_auc1st_impl at 0x7fb1799ea830>, rank_threshold=1500, auc_threshold=0.05, nes_threshold=3.0, filter_for_annotation=True), weighted_recovery=False), FeatherRankingDatabase(name="mm9-500bp-upstream-7species.mc9nr.genes_vs_motifs.rankings"), [Regulon(name='Regulon for Snai3', gene2weight=frozendict.frozendict({'Nptxr': 3.71858079580093, '4930422G04Rik': 3.4189593778885565, 'Tst': 3.4019598703995646, 'Gm16982': 3.2794141345795023, 'Fam71d': 2.770652596133948, 'Trim65': 2.5165290070963446, '6430571L13Rik': 2.322653157302345, 'Snora68': 2.2677351112931254, 'Mir1983': 2.1049208171740688, '4833412C05Rik': 1.9567400648623148, 'Slfn9': 1.918912519709758, 'Myh14': 1.8155076543676392, 'Adrb3': 1.7655713128799888, 'Psd4': 1.7593517666032166, 'Snora5c': 1.742871015263215, 'Armc2': 1.722555406741925, 'Vmn2r87': 1.697990923076
kwargs:    {}
Exception: 'KeyError(\'Field "Snora5c" exists 2 times in schema\')'

2024-10-23 12:20:00,708 - distributed.worker - WARNING - Compute Failed
Key:       ('modules2df-to_pyarrow_string-cbd9587a233d2930af793e697ea79787', 9994)
Function:  execute_task
args:      ((subgraph_callable-6ab9cdec5cc5a887810c34ced19255d5, (functools.partial(<function modules2df at 0x7fb1799ea710>, module2features_func=functools.partial(<function module2features_auc1st_impl at 0x7fb1799ea830>, rank_threshold=1500, auc_threshold=0.05, nes_threshold=3.0, filter_for_annotation=True), weighted_recovery=False), FeatherRankingDatabase(name="mm9-500bp-upstream-7species.mc9nr.genes_vs_motifs.rankings"), [Regulon(name='Regulon for Snapc4', gene2weight=frozendict.frozendict({'C330006A16Rik': 4.349251118259209, 'C130060C02Rik': 4.208884115579275, 'Gm7854': 3.5410158417095827, 'Sarm1': 2.927046278423663, 'Lenep': 2.785681164068822, 'B3galt6': 2.6131502759106913, 'Gm11202': 2.462704280429404, 'Sgcz': 2.3117112215460534, 'Lrpprc': 1.9038514495386707, 'Nsun7': 1.8296346841767743, 'Gpsm2': 1.8010605529771901, 'Pcnxl3': 1.7919756056901275, 'Gm5801_loc2': 1.7491126804317503, 'Wdr31': 1.570558011998353, 'Cxcl9': 1.556944185120764, 'Kctd8': 1.5382108739561857, 'Cad': 1.5234094823248008, 
kwargs:    {}
Exception: 'KeyError(\'Field "1600002H07Rik" does not exist in schema\')'

[... TRUNCATED DUE TO LIMIT ON GITHUB ISSUE ...]

2024-10-23 12:20:03,485 - distributed.worker - WARNING - Compute Failed
Key:       ('modules2df-to_pyarrow_string-cbd9587a233d2930af793e697ea79787', 9956)
Function:  execute_task
args:      ((subgraph_callable-6ab9cdec5cc5a887810c34ced19255d5, (functools.partial(<function modules2df at 0x7fb1799ea710>, module2features_func=functools.partial(<function module2features_auc1st_impl at 0x7fb1799ea830>, rank_threshold=1500, auc_threshold=0.05, nes_threshold=3.0, filter_for_annotation=True), weighted_recovery=False), FeatherRankingDatabase(name="mm9-500bp-upstream-7species.mc9nr.genes_vs_motifs.rankings"), [Regulon(name='Regulon for Sfpq', gene2weight=frozendict.frozendict({'Erdr1': 3.0321972987058734, 'Zc3h11a': 2.691669912271372, 'Fnbp4': 2.6088463086281473, '1110037F02Rik': 2.0684340440320277, 'Tdrd7': 2.0345506791813444, 'Polr1a': 2.010575150065573, 'Tnrc6c': 1.960275733025441, 'Nup155': 1.8890378889442363, 'Crebzf': 1.8710276416734972, 'Arid2': 1.8643832712747197, '1810026B05Rik': 1.8462831579510215, 'Snrnp70': 1.844349613916035, 'Pprc1': 1.7901300300003178, 'Cntnap5a': 1.7415507727206572, '0610009O20Rik': 1.707972820514252, 'Chd9': 1.6735837131019469, 'Pkd1l3': 1.661729812
kwargs:    {}
Exception: 'KeyError(\'Field "0610030E20Rik" does not exist in schema\')'

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[11], line 2
      1 with ProgressBar():
----> 2     df = prune2df(
      3         dbs, modules, MOTIF_ANNOTATIONS_FNAME,
      4         client_or_address=Client(LocalCluster())
      5     )

File ~/data/<MY-USERNAME>/conda-env/sequencing/lib/python3.10/site-packages/pyscenic/prune.py:424, in prune2df(rnkdbs, modules, motif_annotations_fname, rank_threshold, auc_threshold, nes_threshold, motif_similarity_fdr, orthologuous_identity_threshold, weighted_recovery, client_or_address, num_workers, module_chunksize, filter_for_annotation)
    418 # Create a distributed dataframe from individual delayed objects to avoid out of memory problems.
    419 aggregation_func = (
    420     partial(from_delayed, meta=DF_META_DATA)
    421     if client_or_address != "custom_multiprocessing"
    422     else pd.concat
    423 )
--> 424 return _distributed_calc(
    425     rnkdbs,
    426     modules,
    427     motif_annotations_fname,
    428     transformation_func,
    429     aggregation_func,
    430     motif_similarity_fdr,
    431     orthologuous_identity_threshold,
    432     client_or_address,
    433     num_workers,
    434     module_chunksize,
    435 )

File ~/data/<MY-USERNAME>/conda-env/sequencing/lib/python3.10/site-packages/pyscenic/prune.py:362, in _distributed_calc(rnkdbs, modules, motif_annotations_fname, transform_func, aggregate_func, motif_similarity_fdr, orthologuous_identity_threshold, client_or_address, num_workers, module_chunksize)
    357 client, shutdown_callback = _prepare_client(
    358     client_or_address,
    359     num_workers=num_workers if num_workers else cpu_count(),
    360 )
    361 try:
--> 362     return client.compute(create_graph(client), sync=True)
    363 finally:
    364     shutdown_callback(False)

File ~/data/<MY-USERNAME>/conda-env/sequencing/lib/python3.10/site-packages/distributed/client.py:3502, in Client.compute(self, collections, sync, optimize_graph, workers, allow_other_workers, resources, retries, priority, fifo_timeout, actors, traverse, **kwargs)
   3499         futures.append(arg)
   3501 if sync:
-> 3502     result = self.gather(futures)
   3503 else:
   3504     result = futures

File ~/data/<MY-USERNAME>/conda-env/sequencing/lib/python3.10/site-packages/distributed/client.py:2384, in Client.gather(self, futures, errors, direct, asynchronous)
   2381     local_worker = None
   2383 with shorten_traceback():
-> 2384     return self.sync(
   2385         self._gather,
   2386         futures,
   2387         errors=errors,
   2388         direct=direct,
   2389         local_worker=local_worker,
   2390         asynchronous=asynchronous,
   2391     )

File ~/data/<MY-USERNAME>/conda-env/sequencing/lib/python3.10/site-packages/pyscenic/transform.py:368, in modules2df()
    356 def modules2df(
    357     db: Type[RankingDatabase],
    358     modules: Sequence[Regulon],
   (...)
    365     # to be fixed for the dask framework.
    366     # TODO: Remove this restriction.
    367     return pd.concat(
--> 368         [
    369             module2df(
    370                 db,
    371                 module,
    372                 motif_annotations,
    373                 weighted_recovery,
    374                 False,
    375                 module2features_func,
    376             )
    377             for module in modules
    378         ]
    379     )

File ~/data/<MY-USERNAME>/conda-env/sequencing/lib/python3.10/site-packages/pyscenic/transform.py:369, in <listcomp>()
    356 def modules2df(
    357     db: Type[RankingDatabase],
    358     modules: Sequence[Regulon],
   (...)
    365     # to be fixed for the dask framework.
    366     # TODO: Remove this restriction.
    367     return pd.concat(
    368         [
--> 369             module2df(
    370                 db,
    371                 module,
    372                 motif_annotations,
    373                 weighted_recovery,
    374                 False,
    375                 module2features_func,
    376             )
    377             for module in modules
    378         ]
    379     )

File ~/data/<MY-USERNAME>/conda-env/sequencing/lib/python3.10/site-packages/pyscenic/transform.py:287, in module2df()
    285 # Derive enriched and TF-annotated features for module.
    286 try:
--> 287     df_annotated_features, rccs, rankings, genes, avg2stdrcc = module2features_func(
    288         db, module, motif_annotations, weighted_recovery=weighted_recovery
    289     )
    290 except MemoryError:
    291     LOGGER.error(
    292         'Unable to process "{}" on database "{}" because ran out of memory. Stacktrace:'.format(
    293             module.name, db.name
    294         )
    295     )

File ~/data/<MY-USERNAME>/conda-env/sequencing/lib/python3.10/site-packages/pyscenic/transform.py:176, in module2features_auc1st_impl()
    162 """
    163 Create a dataframe of enriched and annotated features a given ranking database and a co-expression module.
    164 
   (...)
    172 :return: A dataframe with enriched and annotated features.
    173 """
    175 # Load rank of genes from database.
--> 176 df = db.load(module)
    177 features, genes, rankings = df.index.values, df.columns.values, df.values
    178 weights = (
    179     np.asarray([module[gene] for gene in genes])
    180     if weighted_recovery
    181     else np.ones(len(genes))
    182 )

File ~/data/<MY-USERNAME>/conda-env/sequencing/lib/python3.10/site-packages/ctxcore/rnkdb.py:132, in load()
    128 def load(self, gs: GeneSignature) -> pd.DataFrame:
    129     # For some genes in the signature there might not be a rank available in the database.
    130     gene_set = self.geneset.intersection(set(gs.genes))
--> 132     return self.ct_db.subset_to_pandas(
    133         region_or_gene_ids=RegionOrGeneIDs(
    134             region_or_gene_ids=gene_set,
    135             regions_or_genes_type=self.ct_db.all_region_or_gene_ids.type,
    136         )
    137     )

File ~/data/<MY-USERNAME>/conda-env/sequencing/lib/python3.10/site-packages/ctxcore/ctdb.py:789, in subset_to_pandas()
    785 engine = engine if engine else self.engine
    787 # Fetch scores or rankings for input region IDs or gene IDs from cisTarget database file for region IDs or
    788 # gene IDs which were not prefetched in previous calls.
--> 789 self.prefetch(region_or_gene_ids=region_or_gene_ids, engine=engine, sort=True)
    791 if not self.df_cached:
    792     raise RuntimeError(
    793         f"Prefetch failed to retrieve {self.scores_or_rankings} for "
    794         f"{region_or_gene_ids} from cisTarget database "
    795         f'"{self.ct_db_filename}".'
    796     )

File ~/data/<MY-USERNAME>/conda-env/sequencing/lib/python3.10/site-packages/ctxcore/ctdb.py:739, in prefetch()
    734     self._prefetch_as_polars_dataframe(
    735         region_or_gene_ids=region_or_gene_ids, use_pyarrow=True, sort=sort
    736     )
    737 elif engine == "pyarrow":
    738     # Store prefetched data as pyarrow Table (self.df_cached) and read data with pyarrow's native IPC reader.
--> 739     self._prefetch_as_pyarrow_table(
    740         region_or_gene_ids=region_or_gene_ids, sort=sort
    741     )
    742 else:
    743     raise ValueError(
    744         f'Unsupported engine "{engine}" for reading cisTarget database.'
    745     )

File ~/data/<MY-USERNAME>/conda-env/sequencing/lib/python3.10/site-packages/ctxcore/ctdb.py:678, in _prefetch_as_pyarrow_table()
    673 self.region_or_gene_ids_loaded = found_region_or_gene_ids.union(
    674     self.region_or_gene_ids_loaded
    675 )
    677 # Store new pyarrow Table with previously and newly loaded region IDs or gene IDs scores/rankings.
--> 678 self.df_cached = pa_table.select(
    679     (
    680         self.region_or_gene_ids_loaded.sort().ids
    681         if sort
    682         else self.region_or_gene_ids_loaded.ids
    683     )
    684     + (self.all_motif_or_track_ids.type.value,)
    685 )

File ~/data/<MY-USERNAME>/conda-env/sequencing/lib/python3.10/site-packages/pyarrow/table.pxi:4207, in pyarrow.lib.Table.select()
   4205 
   4206         for idx in columns:
-> 4207             idx = self._ensure_integer_index(idx)
   4208             idx = _normalize_index(idx, self.num_columns)
   4209             c_indices.push_back(<int> idx)

File ~/data/<MY-USERNAME>/conda-env/sequencing/lib/python3.10/site-packages/pyarrow/table.pxi:1668, in pyarrow.lib._Tabular._ensure_integer_index()
   1666 
   1667             if len(field_indices) == 0:
-> 1668                 raise KeyError("Field \"{}\" does not exist in schema"
   1669                                .format(i))
   1670             elif len(field_indices) > 1:

KeyError: 'Field "1810058I24Rik" does not exist in schema'
2024-10-23 12:20:05,208 - distributed.worker - WARNING - Compute Failed
Key:       ('modules2df-to_pyarrow_string-cbd9587a233d2930af793e697ea79787', 44894)
Function:  execute_task
args:      ((subgraph_callable-6ab9cdec5cc5a887810c34ced19255d5, (functools.partial(<function modules2df at 0x7f2faeb34430>, module2features_func=functools.partial(<function module2features_auc1st_impl at 0x7f2faeb34c10>, rank_threshold=1500, auc_threshold=0.05, nes_threshold=3.0, filter_for_annotation=True), weighted_recovery=False), FeatherRankingDatabase(name="mm9-tss-centered-5kb-10species.mc9nr.genes_vs_motifs.rankings"), [Regulon(name='Zmat2', gene2weight=frozendict.frozendict({'0610009L18Rik': 0.24334112928456972, '0610010F05Rik': 0.12469187083208143, '0610011F06Rik': 0.07426427896756556, '0610040B10Rik': 0.5023871257709747, '1110002L01Rik': 0.25781507539505055, '1110004E09Rik': 2.0959058794181535, '1110004F10Rik': 0.4213780187493597, '1110008L16Rik': 0.2437094347255712, '1110008P14Rik': 1.2910006242509904, '1110032F04Rik': 0.33099966516163515, '1110035M17Rik': 1.2395167785396992, '1110037F02Rik': 0.9282107870741647, '1110038F14Rik': 1.939926191405302, '1110046J04Rik': 0.2586313695021531, 
kwargs:    {}
Exception: 'KeyError(\'Field "0610010K14Rik" does not exist in schema\')'
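The two message variants appear to correspond to the two branches visible in the pyarrow frame above: a requested field that matches no column, and one that matches more than one column. The sketch below reproduces that classification in plain Python for a list of column names. It does not call pyarrow or ctxcore. The classify_fields helper and the sample column list are hypothetical, and only the gene names are taken from the errors above.

```python
from collections import Counter


def classify_fields(schema_names, requested):
    """Sort requested names by how many times each one occurs in schema_names."""
    counts = Counter(schema_names)
    report = {"ok": [], "missing": [], "duplicated": []}
    for field in requested:
        n = counts[field]  # Counter returns 0 for names it has never seen
        if n == 0:
            report["missing"].append(field)
        elif n > 1:
            report["duplicated"].append(field)
        else:
            report["ok"].append(field)
    return report


# Hypothetical column list, for illustration only.
schema_names = ["Snora5c", "Nptxr", "Snora5c", "Tst"]
requested = ["Nptxr", "Snora5c", "1600002H07Rik"]

report = classify_fields(schema_names, requested)
print(report)
```

Running a check like this against the real column names of one database, and against the gene names of one failing module, would show whether the names themselves are the problem or whether the table being queried differs from the file on disk.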

Please complete the following information:

pySCENIC version: due to the current numpy issue, I installed via pip git+...
Installation method: first created a conda environment (Python 3.10.14), then pip git+...
Run environment: Jupyter Notebook on university HPC (1 node, 40 cores, 120g)
OS: Linux (I believe the HPC uses RHEL/9.2)
Package versions:

aiohttp                   3.10.0
anndata                   0.10.8
arboreto                  0.1.6
arrow                     1.3.0
attrs                     23.2.0
boltons                   24.0.0
cloudpickle               3.0.0
ctxcore                   0.2.0
cytoolz                   0.12.3
dask                      2024.2.1
dask-expr                 0.5.3
distributed               2024.2.1
feather-format            0.4.1
frozendict                2.4.4
fsspec                    2024.6.1
interlap                  0.2.7
llvmlite                  0.43.0
loompy                    3.0.7
matplotlib                3.9.2
matplotlib-inline         0.1.7
multiprocessing_on_dill   3.5.0a4
networkx                  3.3
numba                     0.60.0
numexpr                   2.10.1
numpy                     1.26.4
numpy-groupies            0.11.2
pandas                    2.2.2
pandas-flavor             0.6.0
pyarrow                   17.0.0
pyarrow-hotfix            0.6
pyscenic                  0.12.1+8.gd2309fe
requests                  2.32.3
scanpy                    1.10.2
scikit-learn              1.5.1
scipy                     1.14.0
seaborn                   0.13.2
setuptools                71.0.4
tqdm                      4.66.4
umap-learn                0.5.6


tuanpham96 · 2024-10-23T17:41:29Z

Update: I also tested with Singularity and got the same error.

# build image & bind path
singularity build pyscenic.sif docker://aertslab/pyscenic_scanpy:0.12.1_1.9.1
export SINGULARITY_BINDPATH="/oscar/home/$USER,/oscar/scratch/$USER,/oscar/data" # this is from our HPC's guide for binding path
# create a shell inside
singularity shell utils/pyscenic.sif

Then, inside the shell, I started an IPython kernel and pasted in the same code. The same errors occurred.

Am I pointing to the right resources? Some pages under the resources URL are marked as deprecated, but I'm not sure what to replace them with.

ghuls · 2024-10-24T09:53:18Z

Run the command-line version rather than the notebook version:
https://pyscenic.readthedocs.io/en/latest/installation.html#docker-podman-and-singularity-apptainer-images

tuanpham96 · 2024-10-24T19:50:11Z

I'm now using the Singularity image with the CLI, and it has been stuck at the ctx step for more than 2 hours without finishing. I'm using --mode "custom_multiprocessing" --num_workers 40. Is that typical?

tuanpham96 · 2024-10-25T15:44:17Z

Never mind. Based on other issues, it seems I need more RAM and fewer cores. With 20 cores and 200 GB, it finishes within 20 to 25 minutes using the Singularity image with "dask_multiprocessing".
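A rough way to compare the two configurations is the RAM available per worker. The sketch below assumes the stalled run used the 120 GB node from the original report and that memory is split evenly across workers. Both are simplifications, and memory_per_worker_gb is a hypothetical helper, not part of pySCENIC.

```python
def memory_per_worker_gb(total_ram_gb, num_workers):
    """Even-split estimate of RAM per worker, in GB."""
    return total_ram_gb / num_workers


# Stalled run: 40 workers, assumed to be on the 120 GB node.
stalled = memory_per_worker_gb(120, 40)

# Run that finished: 20 workers with 200 GB.
finished = memory_per_worker_gb(200, 20)

print(f"stalled:  {stalled:.1f} GB per worker")
print(f"finished: {finished:.1f} GB per worker")
```

Under these assumptions, the run that finished had roughly three times as much memory per worker as the one that stalled.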

Is there a guide to the suggested minimum RAM and number of cores for each step, given the number of genes, cells, and databases?

tuanpham96 added the bug Something isn't working label Oct 23, 2024
