PARSynthesizer samples uniformly distributed time series data #2241
Comments
Hi @ardulat without metadata, this might be challenging to debug but let's try!
In general, PARSynthesizer is one of our less mature synthesizers compared to our other single and multi table synthesizers. That alone could be causing this behavior, but it would be great to rule out a few other things first. |
Hi @srinify, thank you for your quick response. Here is what the metadata looks like (I removed the exact column names to preserve privacy):
{
"columns": {
"sequence_id": {
"sdtype": "id"
},
"context_column1": {
"sdtype": "categorical"
},
"context_column2": {
"sdtype": "categorical"
},
"context_column3": {
"sdtype": "categorical"
},
"context_column4": {
"sdtype": "categorical"
},
"context_column5": {
"sdtype": "categorical"
},
"context_column6": {
"sdtype": "numerical"
},
"context_column7": {
"sdtype": "categorical"
},
"time_series_column1": {
"sdtype": "numerical"
},
"time_series_column2": {
"sdtype": "numerical"
},
"time_series_column3": {
"sdtype": "categorical"
},
"time_series_column4": {
"sdtype": "categorical"
},
"time_series_column5": {
"sdtype": "categorical"
},
"time_series_column6": {
"sdtype": "numerical"
},
"time_series_column7": {
"sdtype": "numerical"
},
"time_series_column8": {
"sdtype": "numerical"
},
"time_series_column9": {
"sdtype": "numerical"
},
"time_series_column10": {
"sdtype": "numerical"
},
"time_series_column11": {
"sdtype": "numerical"
},
"time_series_column12": {
"sdtype": "numerical"
},
"time_series_column13__steps": {
"sdtype": "numerical"
},
"time_series_column14": {
"sdtype": "numerical"
},
"time_series_column15": {
"sdtype": "numerical"
},
"time_series_column16": {
"sdtype": "numerical"
},
"time_series_column17": {
"sdtype": "numerical"
},
"time_series_column18": {
"sdtype": "numerical"
},
"time_series_column19": {
"sdtype": "numerical"
},
"time_series_column20": {
"sdtype": "numerical"
},
"time_series_column21": {
"sdtype": "numerical"
},
"date": {
"sdtype": "datetime",
"datetime_format": "%Y-%m-%d"
},
"primary_key": {
"sdtype": "id"
}
},
"METADATA_SPEC_VERSION": "SINGLE_TABLE_V1",
"primary_key": "primary_key",
"sequence_index": "date",
"sequence_key": "sequence_id",
"synthesizer_info": {
"class_name": "PARSynthesizer",
"creation_date": "2024-09-18",
"is_fit": true,
"last_fit_date": "2024-09-18",
"fitted_sdv_version": "1.15.0"
}
}
A few issues with this metadata:
Answering your questions:
|
Hi @srinify, are there any updates on this? |
Hi @npatki, can you please elaborate on this? |
Hi @ardulat apologies for the delay! It seems like there are a few issues here to discuss:
Uniform distribution for some of the time series columns: I will attempt to reproduce this today, but I may run into problems if this issue is very data specific. But let's see.
Context column 6 isn't respecting timestamp boundaries: The issue you linked to has since been fixed -- do you mind avoiding timestamps and just using datetimes directly (with a
Categorical time series columns produce float numbers: Are your original values float numbers (e.g.
Sampled data includes 36% null values: Are these all in a specific column? Or entire rows with null values? And how does this match the null patterns in your real data? |
Hi, @srinify! Thank you for your reply. Further discussion on the issues:
I can't help here since the data I am working with is sensitive and private. I haven't tested
Will do, thanks.
The issue is that some of my columns contain integers, but the model samples are floats. I can do rounding, but I'm not sure if that's the right thing to do.
Apologies for my unclear explanation. The issue is that the date (not data) column, which is a
To sum up, there are a couple of issues, and I will follow your suggestions where applicable. However, the main issue stopping us from using SDV is the produced uniform distributions, hence the issue title. Thank you for your help! |
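The rounding workaround mentioned above can be sketched as follows. This is a minimal illustration in plain Python, not part of SDV: the helper names (`integer_valued_columns`, `round_columns`) and the column names are hypothetical, and it assumes the training and sampled data are available as simple column-to-values mappings. It rounds only the columns that held whole numbers in the training data and leaves nulls and genuinely fractional columns alone.

```python
def integer_valued_columns(columns):
    """Return names of columns whose non-null values are all whole numbers."""
    names = []
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        if present and all(float(v).is_integer() for v in present):
            names.append(name)
    return names


def round_columns(columns, names):
    """Return a copy with the named columns rounded to int; nulls are kept."""
    result = {}
    for name, values in columns.items():
        if name in names:
            # Python's round() uses round-half-to-even for exact .5 values.
            result[name] = [None if v is None else round(v) for v in values]
        else:
            result[name] = list(values)
    return result


# Hypothetical training and sampled columns (names are illustrative only).
training = {"steps": [7230, 7100, None, 6900], "weight": [70.5, 71.0, 69.8, 70.2]}
sampled = {"steps": [7230.05, 7230.33, None, 6899.6], "weight": [70.41, 70.97, 69.9, 70.3]}

int_cols = integer_valued_columns(training)
fixed = round_columns(sampled, int_cols)
print(int_cols)
print(fixed["steps"])
```

Deciding which columns to round from the training data, rather than from the sampled output, avoids accidentally rounding a column that is meant to be fractional. Whether rounding is appropriate still depends on how the column is used downstream.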
Hi @ardulat unfortunately I wasn't able to recreate any of these issues with my own fake dataset that has the same metadata as yours. It's likely my fake dataset is too simple, of course. Let me chat more internally with the team to understand if they've encountered these issues before and what other debugging tips we can try! Also, if you haven't updated to the latest version of SDV, I always recommend trying that to see if any of these issues get resolved :) |
Hi, @srinify. To help debug the issue, I have prepared a toy dataset that clearly shows the issues I previously described here. The CSV file with training data is attached below. The updated metadata is as follows:
Here,
Here is the distribution plot for the training
And the distribution plot for the sampled/synthetic
I hope this helps. Let me know what you think about how to fix the uniform distributions and the rest of the issues. |
Awesome @ardulat I'll take a look today and circle back! Full disclosure though, some (or all) of these might just be issues we need to open, track, and eventually address. |
Hi @srinify! I was able to "fix" the issue. The issue was that SDV samples float values (although the actual data was integers). As a result, the sampled values were unique with frequency equal to 1 (e.g., 7230.05 and 7230.33 map to different bars in the plot). It would be great if SDV sampled integers for the integer column in the training data. Anyway, the issue with uniform distribution is fixed now. However, I experienced the same challenge as in #2230, where the actual and synthetic distributions differed. I will add a comment on the mentioned issue. Here is what the distributions look like: |
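The effect described in this comment can be reproduced without SDV. The sketch below uses made-up data (a normal distribution centred on 7230, chosen only for illustration) and counts occurrences per distinct value, the way a per-value bar plot would. Integer data piles up on a few values, while float samples from the same distribution are almost all unique, so every bar has height 1 and the plot looks uniform. Rounding the floats brings the shape back.

```python
import random
from collections import Counter

random.seed(0)

# Hypothetical "real" column: integers clustered around 7230.
real = [round(random.gauss(7230, 3)) for _ in range(1000)]

# Hypothetical "synthetic" column: same distribution, but sampled as floats.
synthetic = [random.gauss(7230, 3) for _ in range(1000)]


def bar_heights(values):
    """Count occurrences per distinct value, as a per-value bar plot would."""
    return Counter(values)


real_bars = bar_heights(real)
float_bars = bar_heights(synthetic)
rounded_bars = bar_heights(round(v) for v in synthetic)

print("real:    distinct values =", len(real_bars), "tallest bar =", max(real_bars.values()))
print("floats:  distinct values =", len(float_bars), "tallest bar =", max(float_bars.values()))
print("rounded: distinct values =", len(rounded_bars), "tallest bar =", max(rounded_bars.values()))
```

Plotting a histogram with fixed bin edges, instead of one bar per distinct value, is another way to compare integer and float columns on equal footing.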
Hi @ardulat when I tried the PARSynthesizer workflow with the
So if I'm understanding your last issue correctly - the
Regarding #2230, thanks for adding to that thread! PAR is one of our less mature synthesizers, so we're collecting examples that showcase the shortcomings so we can improve it down the road! |
Hi, @srinify! Yes, you are right; the data is all floats. My bad, I didn't notice that. But as far as I remember, even for integer columns, SDV generates float numbers. I guess this is happening due to the absence of integers and floats in the numerical type column. Yes, that is an issue, but it's a minor issue that I can fix by simply rounding generated float numbers. Our major issue now is the differing distributions, which currently limits our usage of SDV's |
Environment details
If you are already running SDV, please indicate the following details about the environment in
which you are running it:
Problem description
I've been using SDV for quite a while now. However, recently, after analyzing the sampled data, I observed weird behavior when sampling time series data. The issue is that the sequential model PARSynthesizer keeps generating uniform distributions for time series data in almost all my columns. I am attaching two plots, which clearly show the difference.
Actual data distribution plot:
Synthetic data distribution plot:
What I already tried
I tried synthesizing on different datasets and with different numbers of epochs. The code snippet related to the model fitting:
I can't share the data or anything related to that (including metadata) since it is sensitive medical data.