returned data types are hard for me to deal with #441
Comments
Thank you for the feedback; it is very useful! We use the pyarrow dtype backend to exactly match the types in the parquet files and to avoid implicit unsafe casts. The Numpy backend (the default for pandas) doesn't always match arrow/parquet types. For example, there is no one-to-one match for integer arrays with missing values, which would be loosely cast to floats if we loaded them using Numpy types. You can try to cast the entire pandas DataFrame to Numpy dtypes (a sketch of one way to do this is shown below). Could you please give an example of matplotlib code that didn't work for you? |
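A minimal sketch of one way to do such a cast, using a small hypothetical Arrow-backed DataFrame rather than actual LSDB output; it also shows why an integer column with a missing value has to fall back to float64 on the Numpy side:

import numpy as np
import pandas as pd
import pyarrow as pa

# Hypothetical stand-in for an Arrow-backed catalog DataFrame.
df = pd.DataFrame({
    "flux": pd.Series([1.5, 2.5, 3.5], dtype=pd.ArrowDtype(pa.float64())),
    "nobs": pd.Series([10, None, 30], dtype=pd.ArrowDtype(pa.int64())),
})
print(df.dtypes)  # flux: double[pyarrow], nobs: int64[pyarrow]

# Cast the whole frame to Numpy dtypes; the integer column with a missing
# value has no exact Numpy counterpart, so it becomes float64 with NaN.
df_numpy = df.astype({"flux": np.float64, "nobs": np.float64})
print(df_numpy.dtypes)  # flux: float64, nobs: float64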
I want to echo this, and I can also try to give a better example - but it has happened to me a couple of times as well that the outputs are double[pyarrow] types, making code crash later in a way that I did not expect. |
@hombit thanks for the tips. It looks like the problem is actually when using |
@jkrick Thank you! I edited your message for better readability. Do I understand correctly that you're not using

While I agree that the issue you're experiencing is related to user experience with LSDB, technically it seems like a pandas issue to me. Would you mind if I go ahead and open an issue in the pandas repo? I can reproduce your error with this code:

import pandas as pd
import pyarrow as pa
from scipy.stats import zscore

series = pd.Series([1.0, 2.0, 3.0], dtype=pd.ArrowDtype(pa.float64()))
zscore(series)  # ArrowInvalid: Array arguments must all be the same length

If we convert the series to a numpy or a pyarrow array, the code works as expected:

import numpy as np
import pandas as pd
import pyarrow as pa
from scipy.stats import zscore

series = pd.Series([1.0, 2.0, 3.0], dtype=pd.ArrowDtype(pa.float64()))
np.testing.assert_array_equal(zscore(np.asarray(series)), zscore(pa.array(series))) |
Yes, I am using

Thanks for synthesizing my problem; sounds good to open an issue with pandas. |
And also an arrow issue |
Is there something on our side that we have to do, or can we close this issue? @hombit |
One suggestion is to document this for users until pandas or arrow get a chance to implement a fix. |
When working on a cross-match and join with the Pan-STARRS catalogs, the returned dataframe has data types that were hard for me to deal with. They play nicely with pandas, so it was hard for me to notice that things like double[pyarrow] are not the same as numpy data types. Specifically, I didn't have trouble with them until I put them into my plotting routines (matplotlib), and then I got strange errors that took me a while to trace back to the data types.
Not sure what to do with this information, but I thought it might be helpful for you to have the feedback. My solution is the following, where I take just the columns I want from the LSDB-returned matched_df and convert them to numpy data types. I am sure there are other ways of handling this that might be more elegant, but this is one functional way.
df_lc = pd.DataFrame({
    'flux': pd.to_numeric(matched_df['psfFlux'] * 1e3, errors='coerce').astype(np.float64),
    'err': pd.to_numeric(matched_df['psfFluxErr'] * 1e3, errors='coerce').astype(np.float64),
    'time': pd.to_numeric(matched_df['obsTime'], errors='coerce').astype(np.float64),
    'objectid': matched_df['objectid'].astype(np.int64),
    'band': filtername,
    'label': matched_df['label'].astype(str)
}).set_index(["objectid", "label", "band", "time"])
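For comparison, a possibly more compact variant of the same idea is to cast the Arrow-backed columns to Numpy dtypes in a single astype() call. This is only a sketch: it reuses the hypothetical matched_df columns and filtername variable from the snippet above, and it assumes the flux/time columns are already numeric, so the errors='coerce' safety net is dropped.

import numpy as np

df_lc = (
    matched_df[['psfFlux', 'psfFluxErr', 'obsTime', 'objectid', 'label']]
    # Cast the arrow-backed columns to plain Numpy/str dtypes up front.
    .astype({
        'psfFlux': np.float64,
        'psfFluxErr': np.float64,
        'obsTime': np.float64,
        'objectid': np.int64,
        'label': str,
    })
    .rename(columns={'psfFlux': 'flux', 'psfFluxErr': 'err', 'obsTime': 'time'})
    # Apply the unit scaling and add the constant band column.
    .assign(flux=lambda d: d['flux'] * 1e3,
            err=lambda d: d['err'] * 1e3,
            band=filtername)
    .set_index(["objectid", "label", "band", "time"])
)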