Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added functionality to output to parquet #761

Merged
merged 12 commits into from
Jul 26, 2024
Merged

Conversation

karnesh
Copy link
Contributor

@karnesh karnesh commented Apr 23, 2024

PR to add functionality to output flow, velocity and depth to parquet file. The output parquet is a timeseries data(details in Notes). It is required for the TEEHR input. TEEHR comprises of set of tools for hydrologic model and forecast evaluation. Storing the t-route output in parquet format will lead to efficient query of the data. Also, it will help in automation by connecting TEEHR with NextGen water model. CIROH along with Lynker has developed a containerized version of NextGen National Water Model NextGen In A Box (NGIAB). TEEHR uses DuckDB to query the parquet output from NGIAB stored on cloud.

Additions

  • Added functionality to output write flow, velocity and depth to parquet format.

Changes

  • Changed yaml input file to include parquet output format
parquet_output:
    #---------
    parquet_output_folder: output/
    configuration: short_range
    prefix_ids: nex

Testing

  1. The modified code is tested by executing LowerColorado_TX demonstration test.
LowerColorado_TX

Screenshots

  • Here is a screenshot of successful compilation.
compile
  • Parquet output timeserie data from LowerColorado_TX demonstration test.
parquet

Notes

  • The flowveldepth dateframe is modified to create a timeseries containing following variables:
location_id: string
value: double
value_time: timestamp[us]
variable_name: string
configuration: string
units: string
reference_time: timestamp[us]
  • Location_id variable contains nexus IDs and has 'nex-' prepended to it.
  • Configuration (short range, medium range, long range etc.) has to be entered by the user in the input yaml file.

Todos

Checklist

  • PR has an informative and human-readable title
  • Changes are limited to a single goal (no scope creep)
  • Code can be automatically merged (no conflicts)
  • Code follows project standards (link if applicable)
  • Passes all existing automated tests
  • Any change in functionality is tested
  • New functions are documented (with a description, list of inputs, and expected output)
  • Placeholder code is flagged / future todos are captured in comments
  • Visually tested in supported browsers and devices (see checklist below 👇)
  • Project documentation has been updated (including the "Unreleased" section of the CHANGELOG)
  • Reviewers requested with the Reviewers tool ➡️

Testing checklist

Target Environment support

  • Windows
  • Linux
  • Browser

Copy link
Contributor

@hellkite500 hellkite500 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, I like the idea of adding this as an option for possible output formats. There are some details of the implementation that need to be discussed/considered a bit I think, however.

Also, a more general note, but it configuring your local dev environment git to ignore whitespace changes may be helpful on PR's like this which touch many files. It looks like some auto formatting was applied, and the formatting changes can make review more challenging when trying to see all the places in the code which have changed in functionality. If you want to contribute formatting changes, I would suggest putting them all into a single commit on a PR, or an independent PR.

timeseries_df['units'] = timeseries_df['variable_name'].map(variable_to_units_map)
timeseries_df['reference_time'] = start_datetime.date()
timeseries_df['location_id'] = timeseries_df['location_id'].astype('string')
timeseries_df['location_id'] = 'nex-' + timeseries_df['location_id']
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is generically appropriate to label all locations in the routing data frame with a nex- prefix. This may be applicable to routing which uses a hy_features network, but wouldn't be applicable to one which uses a nhd network. Is there any guarantee before this point that this function only gets applied to hy_features network results? Also, even within a hy_features network, the routed segments this ouptut relates to is for the waterbodies which typically have a wb- prefix in the hydrofabric identifiers, not nex. The nexus and waterbody features are related, but destinct concepts.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment below...


df.index.name = 'location_id'
df.reset_index(inplace=True)
timeseries_df = df.melt(id_vars=['location_id'], var_name='var')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if you use value_vars=[ 'q', 'v', 'd' ] as a kwarg to melt, you might have an easier time extracting from un-pivoted table?

The below method seems like there should be a better way besides casting to string, manipulating the string, and recasting to numeric/datetime types.

I'm not sure exactly what the df looks like that is trying to be manipulated at this point, but I would try to consider a different method(s) for extracting the needed data from it.

If for some reason this is the only way, then please comment this implementation to describe what the state of the df is and why this is the way it needs to be manipulated.

Copy link
Contributor Author

@karnesh karnesh May 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked into options and this seems to be the only way. I cannot use value_vars=[ 'q', 'v', 'd' ] as a kwarg to melt because of the format of df column names. Also, the column names needed to be manipulated as strings. Please look at the screenshot below.

parquet t-route

Each column name consists of a time step and a variable name (q, v or d) in string format. The time steps values are used to get the value_time.

timeseries_df = _parquet_output_format_converter(flowveldepth, restart_parameters.get("start_datetime"), dt,
output_parameters["parquet_output"].get("configuration"))

parquet_output_segments_str = ['nex-' + str(segment) for segment in parquet_output_segments]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

another use of nex- prefix that may not be generically appropriate.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would suggest we make it a variable with a default argument in the wrapping function and a value in the yaml.
parquet_output_segments_str = [prefix_str + str(segment) for segment in parquet_output_segments]

We would need to do something more comprehensive to update T-Route to comprehend the naming/labeling of the IDs...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have modified the PR to add user defined value in yaml file for the prefix string.

class StreamOutput(BaseModel):
# NOTE: required if writing StreamOutput files
stream_output_directory: Optional[DirectoryPath] = None
stream_output_time: int = 1
stream_output_type: streamOutput_allowedTypes = ".nc"
stream_output_type:streamOutput_allowedTypes = ".nc"
Copy link
Contributor

@jameshalgren jameshalgren May 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this formatting change from the commit?

Comment on lines 32 to 50
lakeids = np.fromiter(crosswalk.keys(), dtype=int)
lakeids = np.fromiter(crosswalk.keys(), dtype = int)
idxs = target_df.index.to_numpy()
lake_index_intersect = np.intersect1d(
idxs,
lakeids,
return_indices=True
return_indices = True
)

# replace lake ids with link IDs in the target_df index array
linkids = np.fromiter(crosswalk.values(), dtype=int)
linkids = np.fromiter(crosswalk.values(), dtype = int)
idxs[lake_index_intersect[1]] = linkids[lake_index_intersect[2]]

# (re) set the target_df index
target_df.set_index(idxs, inplace=True)
target_df.set_index(idxs, inplace = True)

return target_df


def _parquet_output_format_converter(df, start_datetime, dt, configuration):
def _parquet_output_format_converter(df, start_datetime, dt, configuration, prefix_ids):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can drop most of these format only changes to keep the PR super clean.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have reverted the format changes.

@karnesh karnesh marked this pull request as ready for review May 23, 2024 01:22
@shorvath-noaa
Copy link
Contributor

@hellkite500 @jameshalgren does this PR still need changes in your opinions or can we review/approve?

@jameshalgren
Copy link
Contributor

jameshalgren commented Jul 22, 2024

@hellkite500 @jameshalgren does this PR still need changes in your opinions or can we review/approve?

Thanks for asking -- please proceed with final review. GitHub still shows a conflict -- do you need us to resolve that?

Copy link
Contributor

@shorvath-noaa shorvath-noaa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for the slow review. Can you rebase this on the master branch to resolve conflicts?

@@ -4,13 +4,16 @@
from typing_extensions import Literal
from .types import FilePath, DirectoryPath

streamOutput_allowedTypes = Literal['.csv', '.nc', '.pkl']
streamOutput_allowedTypes = Literal['.csv', '.nc', '.pkl', '.parquet']
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

streamOutput_allowedTypes is specific for the stream_output class of parameters. It looks like you've created a new class, ParquetOutput that doesn't use this. Is this change needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for pointing this out. I have removed .parquet from streamOutput_allowedTypes

@karnesh
Copy link
Contributor Author

karnesh commented Jul 26, 2024

I have rebased the PR to resolve conflicts.

@shorvath-noaa
Copy link
Contributor

@karnesh it looks like you are missing this line in your PR. I think this is why it is failing the GitActions tests. Can you add it in?

@shorvath-noaa shorvath-noaa merged commit 0a5046c into NOAA-OWP:master Jul 26, 2024
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants