
What is the best way to download large tables? #5801

Answered by gforsyth
evlaw-ea asked this question in Q&A

Hey @evlaw-ea -- glad to hear that the ETL code porting is working well!!

We are thinking about ways to move data around, but it's a tricky problem (actually a bunch of tricky problems).

1. Stream batches of data from Snowflake such that I can process them individually? Would to_pyarrow_batches be of any help here? (That is, does it download all the data into memory before turning it into a RecordBatchReader? If so, then pass.)

I think this will get around your out-of-memory issues, but it will not be performant. It's going to pull down N tuples at a time, where N is the batch size, and then you can operate on those batches in sequence.
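For concreteness, here is a minimal sketch of that streaming pattern. It uses the DuckDB backend so it runs locally without credentials, but the same `to_pyarrow_batches` call works against a Snowflake connection; the table and column names are made up for illustration.

```python
import ibis

con = ibis.duckdb.connect()  # stand-in for ibis.snowflake.connect(...)
t = con.create_table("big_table", ibis.memtable({"id": list(range(100_000))}))

# to_pyarrow_batches returns a pyarrow.RecordBatchReader; rows come down
# chunk_size at a time rather than the whole table at once.
reader = t.to_pyarrow_batches(chunk_size=10_000)

total = 0
for batch in reader:          # each batch is a pyarrow.RecordBatch
    total += batch.num_rows   # stand-in for real per-batch processing

print(total)  # 100000
```

Because each `RecordBatch` can be dropped once it's processed, peak memory stays around one batch rather than the full table.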

2. Downcast the data type downloaded from Snowflake…
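As a hedged sketch of the downcasting idea (the column names and types below are assumptions, not from the thread): casts written on the Ibis expression compile into the generated SQL, so the narrower types are applied server-side before the data is transferred.

```python
import ibis

con = ibis.duckdb.connect()  # stand-in for a Snowflake connection
t = con.create_table(
    "wide_table",
    ibis.memtable({"id": [1, 2, 3], "amount": [1.5, 2.5, 3.5]}),
)

# Hypothetical columns: shrink int64 -> int32 and float64 -> float32
# before pulling the data down.
t_small = t.mutate(
    id=t.id.cast("int32"),
    amount=t.amount.cast("float32"),
)
print(t_small.schema())  # id: int32, amount: float32

# Downcasting combines with streaming: smaller types mean smaller batches.
reader = t_small.to_pyarrow_batches(chunk_size=10_000)
```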
