PAR can't fit if Range constraint includes a `sequence_index` column #2181

srinify · 2024-08-12T16:48:30Z

Environment Details

SDV version: 1.15.0 (Latest)

Error Description

Fitting a PARSynthesizer that has a Range constraint referencing the sequence_index column raises a KeyError.

Steps to reproduce

!pip install sdv==1.15.0

import pandas as pd
import random
from datetime import datetime, timedelta
from sdv.sequential import PARSynthesizer
from sdv.metadata import SingleTableMetadata

event_start_date = datetime(2024, 1, 1)
event_end_date = datetime(2024, 7, 1)
n = 10

# FirstDate and LatestDate are constant; EventDate is drawn uniformly between the event bounds
start_dates = [datetime(2023, 9, 1).strftime('%Y-%m-%d') for _ in range(n)]
event_dates = [(event_start_date + timedelta(days=random.randint(0, (event_end_date - event_start_date).days))).strftime('%Y-%m-%d') for _ in range(n)]
end_dates = [datetime(2025, 1, 1).strftime('%Y-%m-%d') for _ in range(n)]

s_key = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
val = [51, 53, 54, 55, 56, 12, 13, 14, 15, 16]

df = pd.DataFrame(
    {
        "FirstDate": start_dates,
        "LatestDate": end_dates,
        "EventDate": event_dates,
        "s_key": s_key,
        "val": val
    }
)

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data=df)
metadata.update_column(column_name='s_key', sdtype='id')
metadata.set_sequence_index(column_name="EventDate")
metadata.set_sequence_key(column_name="s_key")

synthesizer = PARSynthesizer(metadata, verbose=True, epochs=5)

master_date_constraint = {
    'constraint_class': 'Range',
    'constraint_parameters': {
        'low_column_name': 'FirstDate',
        'middle_column_name': 'EventDate',
        'high_column_name': 'LatestDate',
        'strict_boundaries': False
    }
}

synthesizer.add_constraints(constraints=[master_date_constraint])

synthesizer.fit(df)

Error:

Colab Notebook to Reproduce

Colab Link


srinify · 2024-08-13T20:05:34Z

Workaround

Suppose you have 3 datetime columns (e.g. FirstDate, EventDate, LatestDate) and want a Range constraint so that synthetic EventDate values fall between the other 2 columns. Instead, you can create date diff columns that replace FirstDate and LatestDate and model those directly in the SDV, without using constraints at all.

Here's some example code that computes date diff columns:

# To replicate my sample data, use the first half of the code in the issue body above

# Compute date diff columns, one for the lower bound and one for the upper bound
df['EventDate'] = pd.to_datetime(df['EventDate'])
df['LowerDiff'] = (pd.to_datetime(df['FirstDate']) - df['EventDate']).dt.days
df['UpperDiff'] = (pd.to_datetime(df['LatestDate']) - df['EventDate']).dt.days

# The diff columns replace FirstDate and LatestDate in the training data
df = df.drop(columns=['FirstDate', 'LatestDate'])

# Re-detect the metadata so it includes the new columns,
# then make sure the diff columns are tagged as numerical
metadata2 = SingleTableMetadata()
metadata2.detect_from_dataframe(data=df)
metadata2.update_column(column_name='s_key', sdtype='id') # Sequence Key column
metadata2.update_column(column_name='LowerDiff', sdtype='numerical')
metadata2.update_column(column_name='UpperDiff', sdtype='numerical')
metadata2.set_sequence_index(column_name="EventDate")
metadata2.set_sequence_key(column_name="s_key")

synthesizer = PARSynthesizer(metadata2, verbose=True, epochs=5)
synthesizer.fit(df)

synthetic_data = synthesizer.sample(10)

# Rebuild FirstDate and LatestDate from EventDate and the diff columns.
# pd.to_datetime covers the case where EventDate is kept as an object / string dtype column
synthetic_data['FirstDate'] = pd.to_datetime(synthetic_data['EventDate']) + pd.to_timedelta(synthetic_data['LowerDiff'], unit='D')

synthetic_data['LatestDate'] = pd.to_datetime(synthetic_data['EventDate']) + pd.to_timedelta(synthetic_data['UpperDiff'], unit='D')
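
Because the diff columns are modeled as ordinary numerical columns, nothing in this workaround forces a synthetic LowerDiff to be non-positive or a synthetic UpperDiff to be non-negative. A minimal sketch, assuming you want to guarantee the ordering after sampling, is to clip the diffs before rebuilding the dates. The `rebuild_bounds` helper and the toy values below are hypothetical and only illustrate the idea; they are not part of the SDV.

```python
import pandas as pd


def rebuild_bounds(synthetic_data: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: clip the diffs, then rebuild FirstDate and LatestDate."""
    out = synthetic_data.copy()
    event = pd.to_datetime(out['EventDate'])
    # LowerDiff = FirstDate - EventDate, so it should never be positive
    lower = out['LowerDiff'].clip(upper=0)
    # UpperDiff = LatestDate - EventDate, so it should never be negative
    upper = out['UpperDiff'].clip(lower=0)
    out['FirstDate'] = event + pd.to_timedelta(lower, unit='D')
    out['LatestDate'] = event + pd.to_timedelta(upper, unit='D')
    return out


# Toy stand-in for synthesizer.sample(...); the second row has out-of-range diffs
toy = pd.DataFrame({
    'EventDate': ['2024-03-01', '2024-05-10'],
    'LowerDiff': [-182, 3],
    'UpperDiff': [306, -2],
})

rebuilt = rebuild_bounds(toy)
event = pd.to_datetime(rebuilt['EventDate'])
in_range = (rebuilt['FirstDate'] <= event) & (event <= rebuilt['LatestDate'])
print(in_range.all())
```

Clipping collapses an out-of-range bound onto EventDate itself. Whether that is acceptable, or whether such rows should be dropped instead, depends on your data.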

MichaelG-Uke · 2024-08-14T07:04:01Z

Hi, thanks for the workaround!
I use a similar approach: I model the actual value scaled to a [0, 1] range and store the lower and upper bounds separately.
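
One way to read the [0, 1] approach, sketched here under assumptions since the comment gives no code: express each EventDate as its relative position between the two bounds, model that fraction, and map it back after sampling. The function names `to_fraction` and `from_fraction` are hypothetical, and the dates are taken from the sample data in the issue body.

```python
from datetime import date, timedelta


def to_fraction(low: date, value: date, high: date) -> float:
    """Relative position of value inside [low, high]: 0.0 at low, 1.0 at high."""
    span = (high - low).days
    if span == 0:
        return 0.0  # degenerate interval: low and high coincide
    return (value - low).days / span


def from_fraction(low: date, fraction: float, high: date) -> date:
    """Inverse mapping; the fraction is clipped so the result always stays in range."""
    clipped = min(max(fraction, 0.0), 1.0)
    return low + timedelta(days=round(clipped * (high - low).days))


low, high = date(2023, 9, 1), date(2025, 1, 1)
event = date(2024, 3, 1)

frac = to_fraction(low, event, high)
restored = from_fraction(low, frac, high)
print(frac, restored)
```

Clipping inside `from_fraction` means a sampled fraction slightly outside [0, 1] still yields a date between the bounds, which is the guarantee the Range constraint was meant to provide.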

srinify added the bug (Something isn't working) and data:sequential (Related to timeseries datasets) labels on Aug 12, 2024

srinify changed the title ~~PAR can't fit if Range constraint includes a sequence_index column~~ PAR can't fit if Range constraint includes a sequence_index column Aug 12, 2024

srinify mentioned this issue Aug 12, 2024

Sequence_index in constraint, PARSynthesizer with Range constraint #2159

Closed
