
returned data types are hard for me to deal with #441

Open
jkrick opened this issue Oct 16, 2024 · 9 comments

@jkrick

jkrick commented Oct 16, 2024

When working on a cross-match and join with the Pan-STARRS catalogs, the returned dataframe has data types that were hard for me to deal with. They play nicely with pandas, so it was easy to miss that types like double[pyarrow] are not the same as numpy data types. Specifically, I didn't have trouble with them until I passed them to my plotting routines (matplotlib), and then I got strange errors that took me a while to trace back to the data types.

Not sure what to do with this information, but I thought the feedback might be helpful for you to have. My solution is the following, where I take just the columns I want from the lsdb-returned matched_df and convert them to numpy data types. I am sure there are more elegant ways of handling this, but this is one functional way.

```python
df_lc = pd.DataFrame({
    'flux': pd.to_numeric(matched_df['psfFlux'] * 1e3, errors='coerce').astype(np.float64),
    'err': pd.to_numeric(matched_df['psfFluxErr'] * 1e3, errors='coerce').astype(np.float64),
    'time': pd.to_numeric(matched_df['obsTime'], errors='coerce').astype(np.float64),
    'objectid': matched_df['objectid'].astype(np.int64),
    'band': filtername,
    'label': matched_df['label'].astype(str)
}).set_index(["objectid", "label", "band", "time"])
```

@hombit
Contributor

hombit commented Oct 17, 2024

Thank you for the feedback; it is very useful!

We use the pyarrow dtype backend to exactly match the types in the parquet files and to avoid implicit unsafe casts. The numpy backend (the default for pandas) doesn't always match arrow/parquet types: for example, there is no one-to-one match for integer arrays with missing values, which would be silently cast to floats if we loaded them using numpy types.

You can try to cast the entire pandas DataFrame to Numpy dtypes with df.convert_dtypes().

Could you please give an example of matplotlib code that didn’t work for you?

@nevencaplar
Member

I want to echo this, and I can also try to give a better example. It has happened to me a couple of times as well that the outputs are double[pyarrow] types, making code crash later in a way that I did not expect.
Hard to action without concrete examples, which I will also try to provide.

@jkrick
Author

jkrick commented Oct 17, 2024

@hombit, thanks for the tips!
I tried convert_dtypes() instead of the more complicated code above, and it still gives me the error. Full traceback is below in case it helps.

It looks like the problem actually occurs when using stats.zscore to do some sigma clipping and clean up the data before plotting. A minimal example would be to take the code you sent to match and join the Pan-STARRS light curves, then run the following, with the caveat that I may have renamed some columns to match my own search functions:

```python
for cc, (objectid, singleobj_df) in enumerate(ndf.data.groupby('objectid')):
    singleobj = singleobj_df.sort_index().reset_index()
    # Remove rows containing NaN in time, flux, or err
    singleobj = singleobj.dropna(subset=["time", "flux", "err"])
    # Do sigma-clipping per band.
    band_groups = singleobj.groupby("band").flux
    zscore = band_groups.transform(lambda fluxes: np.abs(stats.zscore(fluxes)))
```

```
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
Cell In[7], line 1
----> 1 _ = create_figures(df_lc = df_lc, # either df_lc (serial call) or parallel_df_lc (parallel call)
      2                    show_nbr_figures = 5,  # how many plots do you actually want to see?
      3                    save_output = False ,  # should the resulting plots be saved?
      4                   )

File ~/fornax-demo-notebooks/light_curves/code_src/plot_functions.py:69, in create_figures(df_lc, show_nbr_figures, save_output)
     65 fig, axes = plt.subplot_mosaic(mosaic=[["A"],["A"],["B"]] , figsize=(10,8))
     67 # Iterate over bands and plot light curves.
     68 # IceCube needs to be done last so that we know the y-axis limits.
---> 69 band_groups = _clean_lightcurves(singleobj_df).groupby('band')
     70 max_fluxes = band_groups.flux.max()  # max flux per band
     71 for band, band_df in band_groups:

File ~/fornax-demo-notebooks/light_curves/code_src/plot_functions.py:121, in _clean_lightcurves(singleobj_df)
    119 # Do sigma-clipping per band.
    120 band_groups = singleobj.groupby("band").flux
--> 121 zscore = band_groups.transform(lambda fluxes: np.abs(stats.zscore(fluxes)))
    122 n_points = band_groups.transform("size")  # number of data points in the band
    124 # Keep data points with a zscore < 3 or in a band with less than 10 data points.

File /opt/conda/envs/science_demo/lib/python3.11/site-packages/pandas/core/groupby/generic.py:517, in SeriesGroupBy.transform(self, func, engine, engine_kwargs, *args, **kwargs)
    514 @Substitution(klass="Series", example=__examples_series_doc)
    515 @Appender(_transform_template)
    516 def transform(self, func, *args, engine=None, engine_kwargs=None, **kwargs):
--> 517     return self._transform(
    518         func, *args, engine=engine, engine_kwargs=engine_kwargs, **kwargs
    519     )

File /opt/conda/envs/science_demo/lib/python3.11/site-packages/pandas/core/groupby/groupby.py:2021, in GroupBy._transform(self, func, engine, engine_kwargs, *args, **kwargs)
   2018     warn_alias_replacement(self, orig_func, func)
   2020 if not isinstance(func, str):
-> 2021     return self._transform_general(func, engine, engine_kwargs, *args, **kwargs)
   2023 elif func not in base.transform_kernel_allowlist:
   2024     msg = f"'{func}' is not a valid function name for transform(name)"

File /opt/conda/envs/science_demo/lib/python3.11/site-packages/pandas/core/groupby/generic.py:557, in SeriesGroupBy._transform_general(self, func, engine, engine_kwargs, *args, **kwargs)
    552 for name, group in self._grouper.get_iterator(
    553     self._obj_with_exclusions, axis=self.axis
    554 ):
    555     # this setattr is needed for test_transform_lambda_with_datetimetz
    556     object.__setattr__(group, "name", name)
--> 557     res = func(group, *args, **kwargs)
    559     results.append(klass(res, index=group.index))
    561 # check for empty "results" to avoid concat ValueError

File ~/fornax-demo-notebooks/light_curves/code_src/plot_functions.py:121, in _clean_lightcurves.<locals>.<lambda>(fluxes)
    119 # Do sigma-clipping per band.
    120 band_groups = singleobj.groupby("band").flux
--> 121 zscore = band_groups.transform(lambda fluxes: np.abs(stats.zscore(fluxes)))
    122 n_points = band_groups.transform("size")  # number of data points in the band
    124 # Keep data points with a zscore < 3 or in a band with less than 10 data points.

File /opt/conda/envs/science_demo/lib/python3.11/site-packages/scipy/stats/_stats_py.py:2992, in zscore(a, axis, ddof, nan_policy)
   2910 def zscore(a, axis=0, ddof=0, nan_policy='propagate'):
   2911     """
   2912     Compute the z score.
   (...)
   2990            [-0.91611681, -0.89090508,  1.4983032 ,  0.88731639, -0.5785977 ]])
   2991     """
-> 2992     return zmap(a, a, axis=axis, ddof=ddof, nan_policy=nan_policy)

File /opt/conda/envs/science_demo/lib/python3.11/site-packages/scipy/stats/_stats_py.py:3168, in zmap(scores, compare, axis, ddof, nan_policy)
   3166 # Set std deviations that are 0 to 1 to avoid division by 0.
   3167 std[isconst] = 1.0
-> 3168 z = (scores - mn) / std
   3169 # Set the outputs associated with a constant input to nan.
   3170 z[np.broadcast_to(isconst, z.shape)] = np.nan

File /opt/conda/envs/science_demo/lib/python3.11/site-packages/pandas/core/ops/common.py:76, in _unpack_zerodim_and_defer.<locals>.new_method(self, other)
     72             return NotImplemented
     74 other = item_from_zerodim(other)
---> 76 return method(self, other)

File /opt/conda/envs/science_demo/lib/python3.11/site-packages/pandas/core/arraylike.py:194, in OpsMixin.__sub__(self, other)
    192 @unpack_zerodim_and_defer("__sub__")
    193 def __sub__(self, other):
--> 194     return self._arith_method(other, operator.sub)

File /opt/conda/envs/science_demo/lib/python3.11/site-packages/pandas/core/series.py:6135, in Series._arith_method(self, other, op)
   6133 def _arith_method(self, other, op):
   6134     self, other = self._align_for_op(other)
-> 6135     return base.IndexOpsMixin._arith_method(self, other, op)

File /opt/conda/envs/science_demo/lib/python3.11/site-packages/pandas/core/base.py:1382, in IndexOpsMixin._arith_method(self, other, op)
   1379     rvalues = np.arange(rvalues.start, rvalues.stop, rvalues.step)
   1381 with np.errstate(all="ignore"):
-> 1382     result = ops.arithmetic_op(lvalues, rvalues, op)
   1384 return self._construct_result(result, name=res_name)

File /opt/conda/envs/science_demo/lib/python3.11/site-packages/pandas/core/ops/array_ops.py:273, in arithmetic_op(left, right, op)
    260 # NB: We assume that extract_array and ensure_wrapped_if_datetimelike
    261 #  have already been called on `left` and `right`,
    262 #  and `maybe_prepare_scalar_for_op` has already been called on `right`
    263 # We need to special-case datetime64/timedelta64 dtypes (e.g. because numpy
    264 # casts integer dtypes to timedelta64 when operating with timedelta64 - GH#22390)
    266 if (
    267     should_extension_dispatch(left, right)
    268     or isinstance(right, (Timedelta, BaseOffset, Timestamp))
   (...)
    271     # Timedelta/Timestamp and other custom scalars are included in the check
    272     # because numexpr will fail on it, see GH#31457
--> 273     res_values = op(left, right)
    274 else:
    275     # TODO we should handle EAs consistently and move this check before the if/else
    276     # (https://github.com/pandas-dev/pandas/issues/41165)
    277     # error: Argument 2 to "_bool_arith_check" has incompatible type
    278     # "Union[ExtensionArray, ndarray[Any, Any]]"; expected "ndarray[Any, Any]"
    279     _bool_arith_check(op, left, right)  # type: ignore[arg-type]

File /opt/conda/envs/science_demo/lib/python3.11/site-packages/pandas/core/ops/common.py:76, in _unpack_zerodim_and_defer.<locals>.new_method(self, other)
     72             return NotImplemented
     74 other = item_from_zerodim(other)
---> 76 return method(self, other)

File /opt/conda/envs/science_demo/lib/python3.11/site-packages/pandas/core/arraylike.py:194, in OpsMixin.__sub__(self, other)
    192 @unpack_zerodim_and_defer("__sub__")
    193 def __sub__(self, other):
--> 194     return self._arith_method(other, operator.sub)

File /opt/conda/envs/science_demo/lib/python3.11/site-packages/pandas/core/arrays/arrow/array.py:787, in ArrowExtensionArray._arith_method(self, other, op)
    786 def _arith_method(self, other, op):
--> 787     return self._evaluate_op_method(other, op, ARROW_ARITHMETIC_FUNCS)

File /opt/conda/envs/science_demo/lib/python3.11/site-packages/pandas/core/arrays/arrow/array.py:775, in ArrowExtensionArray._evaluate_op_method(self, other, op, arrow_funcs)
    772 if pc_func is NotImplemented:
    773     raise NotImplementedError(f"{op.__name__} not implemented.")
--> 775 result = pc_func(self._pa_array, other)
    776 return type(self)(result)

File /opt/conda/envs/science_demo/lib/python3.11/site-packages/pyarrow/compute.py:247, in _make_generic_wrapper.<locals>.wrapper(memory_pool, *args)
    245 if args and isinstance(args[0], Expression):
    246     return Expression._call(func_name, list(args))
--> 247 return func.call(args, None, memory_pool)

File /opt/conda/envs/science_demo/lib/python3.11/site-packages/pyarrow/_compute.pyx:385, in pyarrow._compute.Function.call()

File /opt/conda/envs/science_demo/lib/python3.11/site-packages/pyarrow/error.pxi:155, in pyarrow.lib.pyarrow_internal_check_status()

File /opt/conda/envs/science_demo/lib/python3.11/site-packages/pyarrow/error.pxi:92, in pyarrow.lib.check_status()

ArrowInvalid: Array arguments must all be the same length
```

@hombit
Contributor

hombit commented Oct 17, 2024

@jkrick Thank you! I edited your message for better readability.

Do I understand correctly that you’re not using join_nested, but just join? join_nested would provide pre-grouped light curves in a memory- and CPU-efficient way. Moreover, all types would be “standard” numpy types. We absolutely should add more documentation about that.

While I agree that the issue you’re experiencing is related to user experience with LSDB, technically, it seems like a pandas issue to me. Would you mind if I go ahead and open an issue in the pandas repo?

I can reproduce your error with this code:

```python
import pandas as pd
import pyarrow as pa
from scipy.stats import zscore

series = pd.Series([1.0, 2.0, 3.0], dtype=pd.ArrowDtype(pa.float64()))
zscore(series)  # ArrowInvalid: Array arguments must all be the same length
```

If we convert the series to a numpy array or a pyarrow array, the code works as expected:

```python
import numpy as np
import pandas as pd
import pyarrow as pa
from scipy.stats import zscore

series = pd.Series([1.0, 2.0, 3.0], dtype=pd.ArrowDtype(pa.float64()))
np.testing.assert_array_equal(zscore(np.asarray(series)), zscore(pa.array(series)))
```

@jkrick
Author

jkrick commented Oct 17, 2024

Yes, I am using join, not join_nested. I haven't jumped on that learning curve yet.

Thanks for synthesizing my problem; it sounds good to open an issue with pandas.

@hombit
Contributor

hombit commented Oct 18, 2024

pandas-dev/pandas#60073

@hombit
Contributor

hombit commented Oct 28, 2024

And also an Arrow issue:
apache/arrow#44544

@nevencaplar
Member

Is there something on our side that we have to do, or can we close this issue? @hombit

@jkrick
Author

jkrick commented Oct 28, 2024

One suggestion is to document this for users until pandas or arrow get a chance to implement a fix.
