Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use read_csv_to_dataframe in validation #1419

Merged
merged 19 commits into from
Nov 9, 2023
Merged

Conversation

emlys
Copy link
Member

@emlys emlys commented Sep 26, 2023

Description

Fixes #1379

This sets us up for #327.

  • Split out some of the functionality of utils.read_csv_to_dataframe in a new function, validation.get_validated_dataframe. This new function enforces a contract: given a CSV path and a spec, get_validated_dataframe will either return a dataframe that matches the spec, or raise an error if that's not possible. utils.read_csv_to_dataframe now handles the lower-level CSV parsing details.
  • validation.get_validated_dataframe is used both in validation and in execute. If it raises an error during validation, that error message will be returned as a validation message. If it raises an error in execute, that error will be raised. This is convenient because we get the same message in either context without duplicating.
  • All tables are now read in via utils.read_csv_to_dataframe (indirectly through validation.get_validated_dataframe in most cases, or directly in the few places (HRA criteria table, Wave Energy machine performance table)). This lets us apply the same basic rules to all CSVs, such as encoding, dropping empty rows/columns, and lower-casing column names.

Checklist

  • Updated HISTORY.rst and link to any relevant issue (if these changes are user-facing)
  • Updated the user's guide (if needed)
  • Tested the Workbench UI (if relevant)

@@ -462,6 +462,13 @@
"type": "integer",
"about": "Shore point ID"
},
"R_hab": {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving this up so that the r_hab column is matched before we get to the [HABITAT] pattern, which matches everything.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing this out! Could you add an inline comment about this so we don't forget in the future?

@emlys emlys self-assigned this Nov 7, 2023
@emlys emlys marked this pull request as ready for review November 8, 2023 17:32
@emlys emlys requested a review from phargogh November 8, 2023 17:32
"units": u.kilowatt
},
"hsmax": {
"columns": {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I switched this to columns so that we don't have to validate the individual rows, as discussed on the last team call.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, makes sense! Could you add an inline comment about this so we are sure to remember?

Copy link
Member

@phargogh phargogh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @emlys ! I really like the new contract being enforced here in the validated dataframe ... really cleans things up conceptually.

I just had a few minor comments and suggestions, and then would you object to adding a note to HISTORY about this? I think the main thing that might be worth mentioning there is something along the lines of improving consistency in validation of tables, which should improve the readability of validation errors. Just an idea ... I defer to you on it!

def read_csv_to_dataframe(path, spec, **kwargs):
def read_csv_to_dataframe(path, **kwargs):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that we're removing the spec, I think this docstring becomes much, much simpler! Could you update the docstring to reflect the current state of this function?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh true!

Comment on lines +641 to +642
elif col_spec['type'] == 'boolean':
df[col] = df[col].astype('boolean')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we have an else base case, in case something falls through the other cases?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure! That should never happen with our code because the model specs are tested, but it would be good to have in case of plugins.

src/natcap/invest/validation.py Outdated Show resolved Hide resolved
@@ -462,6 +462,13 @@
"type": "integer",
"about": "Shore point ID"
},
"R_hab": {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing this out! Could you add an inline comment about this so we don't forget in the future?

"units": u.kilowatt
},
"hsmax": {
"columns": {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, makes sense! Could you add an inline comment about this so we are sure to remember?

tests/test_habitat_quality.py Outdated Show resolved Hide resolved
@emlys
Copy link
Member Author

emlys commented Nov 9, 2023

@phargogh I think I addressed all your comments

Copy link
Member

@phargogh phargogh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thank you!

@phargogh phargogh merged commit 6769776 into natcap:main Nov 9, 2023
25 checks passed
@emlys emlys deleted the task/1379 branch October 3, 2024 22:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Use read_csv_to_dataframe in CSV validation
2 participants