SNOW-869536: Iterators from to_local_iterator stop returning results after another query occurs #945
Comments
Apparently the suggestion I made isn't good either, so I wound up writing a function to solve it. For some reason

from snowflake.snowpark._internal.utils import result_set_to_iter
from snowflake.snowpark.dataframe import DataFrame
from snowflake.snowpark.async_job import AsyncJob

def get_iterator_from_df(df: DataFrame, case_sensitive=True):
    """
    This function is a workaround for a bug in Snowpark, allowing iteration over multiple dataframes simultaneously.
    """
    # Async jobs create a new cursor, which is good for us
    async_job: AsyncJob = df.to_local_iterator(block=False)  # type: ignore
    # Not using "async_job.result" because it uses fetchall - effectively collecting everything
    result_meta = async_job._cursor.describe(async_job._query)
    assert result_meta is not None, "Failed to get result metadata"
    async_job._cursor.get_results_from_sfqid(async_job.query_id)
    return result_set_to_iter(
        iter(async_job._cursor),
        result_meta,
        case_sensitive=case_sensitive,
    )

This code will show that it works, and profiling shows that

import cProfile

# "session" is an existing Snowpark Session; "<>" stands for a table name
df_1 = session.table("<>").limit(500000)
df_2 = session.table("<>").limit(400000)

with cProfile.Profile() as pr:
    iterator_1 = get_iterator_from_df(df_1)
    iterator_2 = get_iterator_from_df(df_2, case_sensitive=False)

    print(f"First iterator: {next(iterator_1)}")
    print(f"Second iterator: {next(iterator_2)}")
    print()
    print(f"First iterator: {next(iterator_1)}")
    print(f"Second iterator: {next(iterator_2)}")

pr.print_stats()
I can repro, this is because the same cursor is being reused for both queries.
Please answer these questions before submitting your issue. Thanks!
Python 3.10.8 (main, Oct 13 2022, 09:48:40) [Clang 14.0.0 (clang-1400.0.29.102)]
macOS-13.3.1-arm64-arm-64bit
pip freeze: ... (Snowpark 1.5.1)
We've been using an iterator from to_local_iterator(), and also using the table's schema to parse it.
We expected to iterate over all the rows, but we only iterated over the first one.
Calling df.schema probably caused the Python Snowflake connector to execute another query, so cursor.execute() no longer pointed to our query, rendering the iterator useless.
This probably means that, in general, other queries cannot be run while iterating.
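The suspected failure mode (one cursor shared by the row iterator and a later query) can be illustrated without a Snowflake account. The sketch below is only an analogy using Python's standard sqlite3 module, not Snowpark's actual code; the table t and both helper function names are made up for illustration. The exact symptom differs between drivers, but in both cases the remaining rows of the first query are lost once the cursor is reused.

```python
import sqlite3


def make_connection() -> sqlite3.Connection:
    """Create an in-memory database holding rows 0..4 in a table named t."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (n INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(5)])
    return conn


def iterate_with_shared_cursor(conn: sqlite3.Connection):
    """Start iterating, then run another query on the SAME cursor."""
    cur = conn.cursor()
    cur.execute("SELECT n FROM t ORDER BY n")
    rows = iter(cur)
    first = next(rows)
    # Analogous to df.schema triggering another query on the shared cursor:
    # the pending result set of the first query is replaced.
    cur.execute("SELECT COUNT(*) FROM t")
    return first, list(rows)


def iterate_with_separate_cursors(conn: sqlite3.Connection):
    """Start iterating, then run another query on a DIFFERENT cursor."""
    data_cur = conn.cursor()
    data_cur.execute("SELECT n FROM t ORDER BY n")
    rows = iter(data_cur)
    first = next(rows)
    # The second query gets its own cursor, so the iterator is unaffected.
    other_cur = conn.cursor()
    other_cur.execute("SELECT COUNT(*) FROM t")
    other_cur.fetchall()
    return first, list(rows)
```

With the shared cursor, rows 1 to 4 of the original query are never returned; with separate cursors, the iterator yields all of them. Giving the iterator a cursor of its own is also what the AsyncJob approach relies on.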
Note that there's an easy workaround using AsyncJobs, which makes the iterator fetch results specifically for our query ID, so it stays stable even while other queries are running:
Hard to do with our current environment :(