Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SYNPY-1521] Fixes asDataFrame kwarg Collision Issue #1132

Merged
merged 11 commits into from
Sep 24, 2024

Conversation

BWMac
Copy link
Contributor

@BWMac BWMac commented Sep 20, 2024

Problem:

In #1127, I added the ability to pass pandas.read_csv keyword arguments to CsvFileTable.asDataFrame to give users the ability to have more control over the dataframe that was produced (such as determining which values should be treated as NA). An unintended side-effect of this was that users could unintentionally pass a keyword argument that is already set in the code that calls read_csv in _csv_to_pandas_df, causing an error like this one:

    446 # "ValueError: Only length-1 line terminators supported"
--> 447 df = pd.read_csv(
    448     filepath,
    449     dtype=dtype,
    450     sep=separator,
    451     lineterminator=line_terminator if len(line_terminator) == 1 else None,
    452     quotechar=quote_char,
    453     escapechar=escape_char,
    454     header=0 if contain_headers else None,
    455     skiprows=lines_to_skip,
    456     **kwargs,
    457 )
    458 # parse date columns if exists
    459 if date_columns:

TypeError: pandas.io.parsers.readers.read_csv() got multiple values for keyword argument 'sep'

Solution:

Add logic to _csv_to_pandas_df which creates a keyword arguments dictionary which is then updated to resolve any conflicts while giving precedence to the arguments passed intentionally by the user.

Testing:

  • Tested that the change behaves as expected (works with duplicate keyword arguments) using this script:
import synapseclient
from synapseclient.table import CsvFileTable

syn = synapseclient.Synapse()
syn.login()

import pandas as pd

data = {
    'id': ['1', '2', '3', '4', '5'],
    'test_num': ['1', '2', '3', '4', '5'],
}

test_df = pd.DataFrame(data)
schema = synapseclient.table.Schema(
    name="test_table",
    parent="syn59478569",
    column_names=["id", "version"],
    column_types=["STRING", "STRING"],
)

test_table = CsvFileTable.from_data_frame(df=test_df, schema=schema)

# Data should be read in as expected without errors

test_table_df_sep = test_table.asDataFrame(sep=",")
test_table_df_sep.to_csv("test_table_with_sep.csv", index=False)

test_table_df_quote_char = test_table.asDataFrame(quotechar='"')
test_table_df_quote_char.to_csv("test_table_with_quote_char.csv", index=False)
  • Added unit tests for _csv_to_pandas_df. Previously there were none.

@pep8speaks
Copy link

pep8speaks commented Sep 20, 2024

Hello @BWMac! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 422:89: E501 line too long (92 > 88 characters)
Line 2441:89: E501 line too long (89 > 88 characters)
Line 2443:89: E501 line too long (96 > 88 characters)
Line 2501:89: E501 line too long (95 > 88 characters)
Line 2502:89: E501 line too long (95 > 88 characters)
Line 2503:89: E501 line too long (94 > 88 characters)

Comment last updated at 2024-09-24 21:00:21 UTC

@BWMac BWMac marked this pull request as ready for review September 23, 2024 16:07
@BWMac BWMac requested a review from a team as a code owner September 23, 2024 16:07
synapseclient/table.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@jaymedina jaymedina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! 🔥

Copy link
Contributor

@BryanFauble BryanFauble left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logic looks good to me! Thanks, just 1 comment on how we could improve the docstring.

@BWMac BWMac merged commit 35db9a9 into develop Sep 24, 2024
21 of 23 checks passed
@BWMac BWMac deleted the bwmac/SYNPY-1521/asdataframe-kwarg-fix branch September 24, 2024 22:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants