
[Bug] Issue with running Baysor segmentation on some but not all patches #123

Open · marsdenl opened this issue Sep 12, 2024 · 16 comments

@marsdenl

Hey Quentin,

I've been trying to use the CLI workflow to segment a MERSCOPE dataset. After successfully creating patches, I saw that the formatting of the transcripts.csv file was quite variable across patches. In some patches it was completely fine and segmentation ran well; in others, some rows were shifted or had missing values and I got warning messages, so I removed those rows for simplicity (after which segmentation worked). Finally, some patches failed even after removing the problematic rows, due to:

[16:38:11] Info: Run R493595882
[16:38:12] Info: (2024-09-12) Run Baysor v0.6.2
[16:38:12] Info: Loading data...
ERROR: MethodError: Cannot convert an object of type InlineStrings.String31 to an object of type Float64.

Closest candidates are:
convert(::Type{T}, ::ColorTypes.Gray) where T<:Real
@ ColorTypes C:\Users\marsdenl\.julia\packages\ColorTypes\vpFgh\src\conversions.jl:113
convert(::Type{T}, ::ColorTypes.Gray24) where T<:Real
@ ColorTypes C:\Users\marsdenl\.julia\packages\ColorTypes\vpFgh\src\conversions.jl:114
convert(::Type{T}, ::Ratios.SimpleRatio{S}) where {T<:AbstractFloat, S}
@ Ratios C:\Users\marsdenl\.julia\packages\Ratios\FsiCW\src\Ratios.jl:51
...

Stacktrace:
[1] setindex!(A::Vector{Float64}, x::InlineStrings.String31, i1::Int64)
@ Base .\array.jl:1021
[2] _unsafe_copyto!(dest::Vector{Float64}, doffs::Int64, src::Vector{Union{Missing, InlineStrings.String31}}, soffs::Int64, n::Int64)
@ Base .\array.jl:299
[3] unsafe_copyto!
@ .\array.jl:353 [inlined]
[4] _copyto_impl!
@ .\array.jl:376 [inlined]
[5] copyto!
@ .\array.jl:363 [inlined]
[6] copyto!
@ .\array.jl:385 [inlined]
[7] copyto_axcheck!
@ .\abstractarray.jl:1177 [inlined]
[8] Vector{Float64}(x::Vector{Union{Missing, InlineStrings.String31}})
@ Base .\array.jl:673
[9] convert(::Type{Vector{Float64}}, a::Vector{Union{Missing, InlineStrings.String31}})
@ Base .\array.jl:665
[10] load_df(data_path::String; min_molecules_per_gene::Int64, exclude_genes::Vector{String}, kwargs::@kwargs{x_col::Symbol, y_col::Symbol, z_col::Symbol, gene_col::Symbol, drop_z::Bool, filter_cols::Bool})
@ Baysor.DataLoading C:\Users\marsdenl\.julia\packages\Baysor\vZCu7\src\data_loading\data.jl:84
[11] load_df
@ C:\Users\marsdenl\.julia\packages\Baysor\vZCu7\src\data_loading\data.jl:69 [inlined]
[12] load_df(coordinates::String, data_opts::Baysor.Utils.DataOptions; kwargs::@kwargs{filter_cols::Bool})
@ Baysor.DataLoading C:\Users\marsdenl\.julia\packages\Baysor\vZCu7\src\data_loading\cli_wrappers.jl:165
[13] run(coordinates::String, prior_segmentation::String; config::Baysor.Utils.RunOptions, x_column::String, y_column::String, z_column::String, gene_column::String, min_molecules_per_cell::Int64, scale::Float64, scale_std::String, n_clusters::Int64, prior_segmentation_confidence::Float64, output::String, plot::Bool, save_polygons::String, no_ncv_estimation::Bool, count_matrix_format::String)
@ Baysor.CommandLine C:\Users\marsdenl\.julia\packages\Baysor\vZCu7\src\cli\main.jl:100
[14] run
@ C:\Users\marsdenl\.julia\packages\Baysor\vZCu7\src\cli\main.jl:51 [inlined]
[15] command_main(ARGS::Vector{String})
@ Baysor.CommandLine C:\Users\marsdenl\.julia\packages\Comonicon\F3QqZ\src\codegen\julia.jl:343
[16] command_main()
@ Baysor.CommandLine C:\Users\marsdenl\.julia\packages\Comonicon\F3QqZ\src\codegen\julia.jl:90
[17] command_main(; kwargs::@kwargs{})
@ Baysor C:\Users\marsdenl\.julia\packages\Baysor\vZCu7\src\Baysor.jl:41
[18] top-level scope
@ none:1

Quick visual inspection shows the files look fine. Weirdly enough, the patches that failed with that error came at repeating intervals. For instance, if I generated 11 patches in total, then patches 1, 4, 7 and 10 failed. If I generated 15 patches, patches 2, 6, 10 and 14 failed (one in three patches). Do you have any idea why some patches failed and not others?

Code run:

cd PATH.zarr\.sopa_cache\baysor_boundaries\6
C:\Users\marsdenl\.julia\bin\baysor.cmd run --save-polygons GeoJSON -c config.toml transcripts.csv

  • OS: Windows

Thanks!

@quentinblampey
Collaborator

Hello @marsdenl, thanks for reporting this weird issue. I have never seen this before. Could you share an example of a transcripts.csv file that failed (together with the corresponding config.toml file)?
If it's a problem to share the whole transcripts file, you can also delete some unused columns or obfuscate the gene names.

@marsdenl
Author

Hey @quentinblampey, thanks for getting back to me. Here is the info:

config.txt
limited_file.csv

The transcripts.csv above is the one that contains both the missing-columns error and the Float64 error. For the missing columns, see row 17612, for example. For the Float64 error, I have no idea where it's coming from... I've only attached the first 50,000 rows.
I have managed to make Baysor work independently of sopa on one of my samples, so I think the issue might be linked to patchify itself.

Thank you!

@marsdenl
Author

In case this helps, @quentinblampey, I also tried ComSeg and got a similar issue with poor formatting on some patches but not all patches (some rows have fewer or more columns than expected).
The config.json was the default one from your tutorial. I did Cellpose prior nucleus segmentation as recommended, and that ran fine :)

Code:
sopa patchify comseg region_0.zarr --config-path Z:\Queries\comseg_config_default.json --patch-width-microns 1000 --patch-overlap-microns 100

sopa segmentation comseg region_0.zarr

I get the error on my second (but not first) patch: ParserError: Error tokenizing data. C error: Expected 10 fields in line 112260, saw 14.
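For reference, a minimal sketch that surfaces such malformed rows without aborting the parse, assuming pandas >= 1.4 (the callable form of on_bad_lines requires the Python engine):

import pandas as pd

bad_rows = []

# The callable receives the fields of each malformed line; returning None drops it
df = pd.read_csv(
    "transcripts_cut.csv",
    engine="python",
    on_bad_lines=lambda fields: bad_rows.append(fields),
)

print(f"{len(bad_rows)} malformed rows recorded")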

Here is a portion of the transcripts.csv containing 2 misformatted rows.
transcripts_cut.csv

Config.json
config.txt

Let me know what you think! Thank you very much.

@quentinblampey
Collaborator

Indeed, I think the problem is happening when creating the transcripts.csv file. We do that using dask, which processes each partition in parallel. Maybe, at some point, two processes try to write to the same file, and weird things happen.

Do you know if the issue always happens on the same files if you run it again? Is it random or deterministic?
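A toy illustration of that hypothesis (threads standing in for dask workers; this is not sopa's actual code): two writers appending to the same CSV without a lock can interleave at buffer-flush boundaries, producing shifted or truncated rows, while serializing the appends keeps every line intact.

import threading

lock = threading.Lock()

def append_unsafe(path, tag, n):
    # One open handle per worker: data reaches the file at buffer-flush
    # boundaries, which can land mid-row relative to the other worker.
    with open(path, "a", newline="") as f:
        for i in range(n):
            f.write(f"{tag},{i},{'v' * 60}\n")

def append_safe(path, tag, n):
    # Taking a lock around each append writes every row atomically.
    for i in range(n):
        with lock, open(path, "a", newline="") as f:
            f.write(f"{tag},{i},{'v' * 60}\n")

for target in (append_unsafe, append_safe):
    workers = [threading.Thread(target=target, args=(f"{target.__name__}.csv", tag, 5000)) for tag in ("a", "b")]
    for w in workers:
        w.start()
    for w in workers:
        w.join()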

@marsdenl
Author

I repeated patchify on the same sample with the same config file, then simply counted the number of rows that did not have 9 columns in the patch files (I couldn't test re-segmentation with Baysor on the patches since the Baysor update, same as issue #125). Despite getting the same number of patches every time I patchify, the transcripts.csv for a given patch (say patch 0) has slightly different dimensions, and therefore also a different number of rows that do not have 9 columns. I'm not too sure how to test for the Float64 issue as I don't know where it's coming from...
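For reference, a minimal sketch of that count, assuming a patch file transcripts.csv that should have 9 columns (the csv module tolerates ragged rows, unlike pandas):

import csv

with open("transcripts.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    bad = sum(1 for row in reader if len(row) != 9)

print(f"{bad} rows do not have exactly 9 fields")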

@quentinblampey
Collaborator

Alright, thanks for the details. Can you also let me know if this happens on the toy dataset (--technology uniform)?
I'm trying to reproduce the issue.

@marsdenl
Author

marsdenl commented Sep 24, 2024

Hey, thanks for trying to reproduce the issue :) I just tried with the --technology uniform dataset, but for baysor patchify it only creates one patch, so I don't know if it'll reproduce the issue where some patches are okay and others aren't. Also, since the Baysor update I can't run Baysor segmentation to check if it all runs smoothly on that single patch (see issue #125). Looking at the transcripts file for that patch, though, it seems okay :) Hope this helps. Let me know if I can send anything else over.

@quentinblampey
Collaborator

Can you try to make smaller patches using --patch-width-microns? You can choose any value that creates more than one patch.

@marsdenl
Author

Of course, yes, sorry xD I made 8 patches on the tutorial dataset and ran Baysor (with the changes needed in the config file for Baysor v0.7.0), and it worked perfectly fine. I tried again on my dataset using the same config, and again I get the same errors/warnings:

  1. ERROR: MethodError: Cannot convert an object of type InlineStrings.String31 to an object of type Float64
  2. Warning: thread = 1 warning: only found 9 / 14 columns around data row: 6243. Filling remaining columns with missing

Maybe it's something to do with the formatting of the input detected_transcripts.csv?

Thanks

@quentinblampey
Collaborator

Thanks for trying!

So, now, my hypothesis is that either (1) the detected_transcripts.csv file itself has some issue, or (2) the dask dataframe reads it wrong.

To test these:

  1. Can you read the file (for instance with pandas) and see if it contains some incomplete rows?
  2. Read your raw files with sopa.io.merscope as usual, and then write the full dask dataframe to disk as a CSV. See if this file now has incomplete lines.

@marsdenl
Author

marsdenl commented Sep 26, 2024

Here is what I have done:

Check the initial detected_transcripts.csv:

import pandas as pd

# header=None treats the header line as an ordinary data row
data = pd.read_csv('Z:Queries/Data/Batch5_region0/region_0/detected_transcripts.csv', header=None, low_memory=False)

# rows whose non-NaN field count differs from the expected 11 columns
incomplete_rows = data[data.apply(lambda x: len(x.dropna()) != 11, axis=1)]  # => this gave 0 rows

Check the dask-written CSV:

import pandas as pd
import sopa

sdata = sopa.io.merscope("Z:Queries/Data/Batch5_region0/region_0/")

sdata.points

{'Batch5_region0_region_0_transcripts': Dask DataFrame Structure:
                       x        y    fov  cell_id  barcode_id  transcript_id  global_z               gene  Unnamed: 0
 npartitions=12
                 float64  float64  int64    int64       int64         string   float64  category[unknown]       int64
                     ...      ...    ...      ...         ...            ...       ...                ...         ...
 ...                 ...      ...    ...      ...         ...            ...       ...                ...         ...
                     ...      ...    ...      ...         ...            ...       ...                ...         ...
 Dask Name: assign, 21 graph layers}

points_df = sdata.points['Batch5_region0_region_0_transcripts']
points_df.to_csv('dask_df.csv', index=False, single_file=True)
written_panda_df = pd.read_csv('C:/Users/marsdenl/dask_df.csv')

Check if any row has fewer than 9 columns -> this gave me 0 rows:

written_panda_df[written_panda_df.apply(lambda x: len(x.dropna()) != 9, axis=1)]

Check if any row has more than 9 columns -> this gave me 0 rows:

written_panda_df[written_panda_df.apply(lambda x: len(x) > 9, axis=1)]

At first glance, it seems both files, the initial CSV and the dask-written CSV, look okay in terms of number of columns (although let me know if the code above is not correct). Could it be patchify itself?

@quentinblampey
Collaborator

When checking for NaN values, can you use data.isna().any(axis=1).mean()? This will give you the ratio of rows that contain at least one NaN value. It should be 0 on MERSCOPE data.

Also, in the CSV file you gave me, I saw that the header is weird too, as it has three empty columns at the end, see below:

x,y,Unnamed: 0,transcript_id,barcode_id,cell_id,fov,gene,global_z,cell,,,,

This is also not expected. Do you see something similar in the raw data? Since the header is weird, I suspect something is going wrong before the patches are even made.
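A quick way to check the raw header for stray trailing commas (a sketch; the path is hypothetical):

# Read only the first line and split it into fields
with open("detected_transcripts.csv") as f:
    header = f.readline().rstrip("\r\n").split(",")

print(len(header), header)  # trailing empty strings reveal stray commas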

@marsdenl
Author

Sure thing!

  1. On the original file:

data.isna().any(axis=1).mean()
0.00000010972423337598314

data.isna().any(axis=1).value_counts()
False 9113756
True 1
Name: count, dtype: int64

The only True count being the column names.

  2. On the dask-written file:

written_panda_df.isna().any(axis=1).mean()
0.0

written_panda_df.isna().any(axis=1).value_counts()
False 9113756
Name: count, dtype: int64

Regarding the CSV I sent you, that's not in the original file; I must have added it... Happy to send you the original file by email.

@quentinblampey
Collaborator

The only True count being the column names.

Do you mean that the only row with NaN values is the one with the column names? The command data.isna().any(axis=1).mean() shouldn't count the column names. Are you sure?

Regarding the CSV I sent you, that's not in the original file; I must have added it... Happy to send you the original file by email.

Yes, please, can you send it to me (just the file of the patch that has an issue) at [email protected]?

Btw, I'll be on vacation for two weeks; I'll answer you in mid-October :)

@marsdenl
Author

Sorry, I wasn't super clear. The only row counted by data.isna().any(axis=1).value_counts() is the first row, which corresponds to the column names. The first row of the first column is empty:

[Screenshot 2024-09-29 at 12:36:48: the first cell of the first row is empty]

Just sent it now :)

No worries at all, enjoy your holidays Quentin!

@quentinblampey
Collaborator

Hi @marsdenl, do you also have access to a Linux or macOS machine by any chance? If so, do you have the same issue there? I think this may be a Windows-specific issue, which would explain why I wasn't able to reproduce it.
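One way to probe for a Windows-specific difference is to look at the line terminators of a failing patch file (a sketch; that line endings are involved is an assumption, not a confirmed cause):

# Count CRLF vs bare LF terminators in a patch file (path hypothetical)
with open("transcripts.csv", "rb") as f:
    raw = f.read()

crlf = raw.count(b"\r\n")
bare_lf = raw.count(b"\n") - crlf  # b"\n" counts include those inside CRLF
print(f"CRLF: {crlf}, bare LF: {bare_lf}")  # a mix suggests inconsistent writers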
