Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix discovered issues and improve MESSAGEix-Materials #201

Merged
merged 45 commits into from
Jul 4, 2024
Merged

Conversation

macflo8
Copy link
Contributor

@macflo8 macflo8 commented Jun 21, 2024

  • Removes message_data dependencies (some .yaml and .csv input files).
  • Fixes incorrect imports of previously migrated code in utils/compat/message_data.
  • Removes deprecated/unused variables and code sections.
  • Adresses Further clean up material module #198.
  • Implements data packaging mentioned in Test and improve .model.material #194 (comment).
  • Hide some debugging and project specific CLI commands of material-ix group
  • Moves from using untested helpers in utils/compat/message_data to using tested equivalents of message-ix-models
  • Removes click options from material-ix commands that are already implemented in message_ix_models/cli.py
  • Stores csv files in data/material as text (previously stored with git LFS)
  • General updates to documentation

Details:
Building MESSAGEix-Materials on an existing MESSAGEix-GLOBIOM scenario still requires a few external data file that have previously been stored in a private repository. The respective files are migrated to message-ix-models/data and data paths in the modules updated accordingly in this PR. The files located in util/compat/message_data migrated in the course of #188 contain some incorrect import statements, which are fixed in this PR.

The following files are needed in the build and were therefore added:

The build however still requires proprietary data from IEA Extended Energy Balances to calibrate historical industry activity. The path to the required file can be specified with a newly added CLI option --iea_data_path in the build command.

All input data for MESSAGEix-Materials is now packaged in a .tar.gz file, that can be fetched with mix-models fetch MESSAGEix-Materials.

When executing mix-models material-ix --help only the basic comments related to building/solving/reporting are shown

How to review

Run build/solve/report with material-ix CLI commands and confirm correct execution of each task.

PR checklist

  • [ ] Continuous integration checks all ✅ Not possible due to missing tests
  • [ ] Add or expand tests; coverage checks both ✅ will be handled via Test and improve .model.material #194
  • Add, expand, or update documentation.
  • Update doc/whatsnew.

Copy link

codecov bot commented Jun 26, 2024

Codecov Report

Attention: Patch coverage is 29.32166% with 323 lines in your changes missing coverage. Please review.

Project coverage is 51.9%. Comparing base (229eecb) to head (4814e7a).

Additional details and impacted files
@@           Coverage Diff           @@
##            main    #201     +/-   ##
=======================================
- Coverage   52.2%   51.9%   -0.3%     
=======================================
  Files        141     142      +1     
  Lines      11346   11250     -96     
=======================================
- Hits        5925    5843     -82     
+ Misses      5421    5407     -14     
Files Coverage Δ
message_ix_models/cli.py 93.4% <ø> (ø)
...ssage_ix_models/model/material/data_ammonia_new.py 8.7% <100.0%> (+0.2%) ⬆️
...l/material/material_demand/material_demand_calc.py 17.3% <ø> (-1.6%) ⬇️
...e_ix_models/tests/model/material/test_data_util.py 100.0% <100.0%> (ø)
...s/util/compat/message_data/get_historical_years.py 20.0% <ø> (ø)
...util/compat/message_data/get_optimization_years.py 28.5% <ø> (ø)
message_ix_models/util/pooch.py 42.0% <ø> (ø)
.../compat/message_data/change_technology_lifetime.py 2.5% <0.0%> (ø)
...at/message_data/check_scenario_fix_and_inv_cost.py 6.2% <50.0%> (ø)
...els/util/compat/message_data/update_h2_blending.py 18.7% <66.6%> (+6.9%) ⬆️
... and 9 more

... and 2 files with indirect coverage changes

@macflo8 macflo8 changed the title Fix discovered issues in MESSAGEix-Materials Fix discovered issues and improve MESSAGEix-Materials Jun 27, 2024
@macflo8 macflo8 added bug Something isn't working enh New features or functionality material MESSAGEix-Materials variant labels Jun 27, 2024
@macflo8 macflo8 self-assigned this Jun 27, 2024
@macflo8 macflo8 marked this pull request as ready for review June 27, 2024 09:04
doc/material/index.rst Outdated Show resolved Hide resolved
description: >-
Does not include Saint Helena, Ascension and Tristan da Cunha.
In France-> Mayotte, Reunion data is present post 2010.
IEA MAP: AFRICATOT - (DZA+EGY+LBY+MAR+SDN+SSD+TUN)->IIASA_AFRICA
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's unclear what this description line and the following line mean, or how they are actually/intended to be used in the code:

  • 'AFRICATOT' (for instance) is not defined anywhere.
  • 'IIASA_AFRICA' (for instance) is not defined anywhere.
  • This added file is not mentioned in doc/pkg-data/node.rst.
  • The header comment seems to be exactly the same as the one in R12.yaml.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file was shared by @SiddharthJoshi-Git together with a copy of the preprocessed IEA EWEB 2023. It is used to group and aggregate IEA EWEB data to R12 regions for calibration purposes in MESSAGEix-Materials. Since the file is similar to R12.yaml with main differences in the child values, does it make more sense to merge it into the existing R12.yaml? Otherwise we can also rename the file since R12_SSP_V1.yaml is maybe not very indicative.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is used to group and aggregate IEA EWEB data to R12 regions for calibration purposes in MESSAGEix-Materials.

Can you point to where specifically this happens?

Copy link
Contributor Author

@macflo8 macflo8 Jun 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of this happens in material/data_util.py. The top-level functions are modify_industry_demand() and add_new_ind_hist_act(), which are called in model.material.build(). They read a parquet file containing the EWEB data, map it to R12 and adjust existing demand and historical_activity parametrization. At the moment there are still some hardcoded parts in the whole procedure and inefficient parts (e.g. reading the parquet file more then once instead of passing it to the respective function).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I see. I recall this was alluded to in a MESSAGE meeting recently—that some code, somewhere, is being used to process the IEA EWEB into a Parquet file, including some spatial manipulations that are not in .tools.iea.web. Not having seen that code, I can't guess at what it contains, but from the code in message_ix_models.model.material/this branch, it seems some of these labels like "AFRICATOT" are introduced or used by it.

There are several ways in which this appears to duplicate existing message_ix_models code that has tests and docs and is more concise. We need to be careful to avoid this. However, to avoid expanding scope too much, I won't insist that they be removed/reworked this PR, but will mention them as further TODOs in #194 or subsidiary to it.

As for what to do in this PR:

Per your link, I see the file is used in two calls to this function

def map_iea_db_to_msg_regs(df_iea: pd.DataFrame, reg_map_fname: str) -> pd.DataFrame:
"""
Parameters
----------
df_iea
df containing the IEA energy balances data set
reg_map_fname
name of file used for mapping countries to MESSAGEix regions
Returns
-------
object
"""
file_path = package_data_path("node", reg_map_fname)
yaml_data = read_yaml_file(file_path)
if "World" in yaml_data.keys():
yaml_data.pop("World")
r12_map = {k: v["child"] for k, v in yaml_data.items()}
r12_map_inv = {k: v[0] for k, v in invert_dictionary(r12_map).items()}
df_iea = df_iea.merge(
pd.DataFrame.from_dict(
r12_map_inv, orient="index", columns=["REGION"]
).reset_index(),
left_on="COUNTRY",
right_on="index",
).drop("index", axis=1)
return df_iea

I am guessing here because of the lack of tests, but it appears most of the contents of R12_SSP_V1.yaml are not used. Rather, only the child: field is used to populate a "REGION" column in a data frame that has a "COUNTRY" column. Is that correct?

If so, this can be simplified a lot using existing features. For example, see:


I suspect we can add another ‘annotation’ named, e.g. material-iea-eweb-region and containing these values like "CHINAREG". Then the file can be read with get_codes(), and MappingAdapter can be used to do the relabeling.

I'd be happy to push some commits to make these changes; although again it will be difficult for me to ensure that I preserve behaviour without tests.

Copy link
Member

@khaeru khaeru Jul 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've just pushed the modifications, including one added test. Please have a look and try them out with the private data file.

There was one ambiguous point I couldn't figure out from what's visible on the branch. The existing R12.yaml file contains, for example:

R12_NAM:
  child: [CAN, GUM, PRI, SPM, USA, VIR]

…whereas the added file contained:

R12_NAM:
  child: [CAN, USA]

It's not clear why this is different. Does the private data file contain data for e.g. "GUM" that must be omitted for a correct result? Or does the added YAML file only reflect a subset of codes that actually appear in that private data file?

In the latter case, note that it's harmless to keep "GUM → R12_NAM" in the constructed mapping; since there will be no observations with COUNTRY=GUM, the entry simply goes unused.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I manually tested with the private data file now. It seems that parametrization is identical except for R12_EEU. The private data file contains entries, where COUNTRY == "YUG", until 2021. This means there is a double counting of all countries that formerly belonged to Yugoslavia. According to the IEA EWEB documentation, IEA reports Yugoslavia under two codes: YUGOND and MYUGO. But I don't know how YUG data was derived in the private data file. So we need to either revisit the pre-processing of the EWEB data or modify the mapping in R12.yaml.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing the check. The annotations I added to R12.yaml include:

  material-region: ["*", SCG, YUG, KOSOVO]

The file added on the branch has:

  child: [ALB, BGR, BIH, CZE, EST, HRV, HUN, LTU, LVA, MKD, MNE, POL, ROU, SRB, SVK, SVN, KOSOVO]

Curiously the latter does not include any of YUG, YUGOND, or MYUGO, which really makes me scratch my head.

But anyway, with the goal of identical behaviour, in the former, "*" refers to the list of children for the same node, which I see already includes "YUG". So you could try to edit to:

  material-region: ["*", SCG, KOSOVO]

(don't duplicate, although with the way the dict is constructed, I don't think this will make any change)

or, most verbose:

  material-region: [ALB, BGR, BIH, CZE, EST, HRV, HUN, LTU, LVA, MKD, MNE, POL, ROU, SRB, SVK, SVN, KOSOVO]

…which simply copies the value from the added file—without clarifying why the Yugoslavia-related codes are omitted.

Could you please try those two settings and see what happens?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Applying the second suggested setting leads to the correct parametrization. So this would be a viable solution.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, good to confirm. If this works, then perhaps you can git rebase -i main your branch and put d/drop beside the commit that added the R12_SSP_V1.yaml file (if it was a single commit. If it was added with other files, try e/edit and adjust the commit so the file is not added).

It may also help to:

  • Put a comment in R12.yaml file directly above that line, linking back to this thread or with a summary, in case others are also confused.
  • Expand the header comment in R12.yaml to inform people where they could find the R12_SSP_V1.yaml from which we got that information.

@@ -184,7 +184,7 @@ def broadcast_reduced_df(df, par_name):

for col in yr_col_inp:
yr_cols_codes[col] = literal_eval(df_bc_node[col].values[0])
broadcast_years(df_bc_node, yr_col_out, yr_cols_codes, col)
df_bc_node = broadcast_years(df_bc_node, yr_col_out, yr_cols_codes, col)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function called here has no tests or docstring. Does it do something that can't be done with message_ix_models.util.broadcast()? If so, what?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I renamed that top level function now to a imo better name and added docstrings to all functions in this module. broadcast_years() and broadcast_nodes() use message_ix_models.util.broadcast() to unpivot the rows that are stored in methanol_techno_economic.xlsx.

Copy link
Member

@khaeru khaeru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added some comments inline.

Some other points:

  • Files like message_ix_models/data/material/iea_mappings/*.csv:
    • should be stored as text, not using Git LFS.
    • contain the same phrase "iea_mappings" in their path and filename; this could be simplified.
    • do not appear to be mentioned anywhere in the documentation.
  • message_data.tools.utilities.get_optimization_years() does the same thing as ScenarioInfo.Y. The latter is computed only once, and has tests and documentation. The former is not tested, and hits the database every time it is called. I would prefer that we take the opportunity, when moving code to message_ix_models, to use the better-maintained solution. Similar applies to get_historical_years().

macflo8 added 13 commits July 2, 2024 14:57
Add missing files in new directories and replace private_data_path with
package_data_path

Materials build still had some overseen dependencies on message_data
Apply sorting to MESSAGEix sets based on number of dimensions

ixmp 3.9.0 automatically sorts set lists generated by .set_list()
alphabetically. When adding new items to multidimensional sets they need
 to be added to the basic sets first that do not depend on other sets.
 e.g extending 'cat_addon' requires adding both technologies in
 set 'technology' first.
macflo8 and others added 16 commits July 2, 2024 14:58
Move documentation of base year demand literature to yaml file
- Use top-level CLI options in material-ix
commands
- Replace underscore with hyphens in command names
- Replace print with log
- Add .data_util.get_region_map().
- Use message_ix_model built-ins to read the node codelist.
- Add/expand docstrings, inline comments.
- Drop file name argument from map_iea_db_to_msg_regs();
  adjust usage.
- Define all children in material-region field for R12_EEU explicitly
- Extend documentation in R12.yaml
- Move build function to build.py
- Move CLI commands to separate submodule
- Remove model.material.build.apply_spec()
- Use model.build.apply_spec()
@macflo8
Copy link
Contributor Author

macflo8 commented Jul 2, 2024

* Files like message_ix_models/data/material/iea_mappings/*.csv:
  * should be stored as text, not using Git LFS.
  * contain the same phrase "iea_mappings" in their path and filename; this could be simplified.
  * do not appear to be mentioned anywhere in the documentation.

At the moment our biggest csv file is ~400 KB. Is it okay to store all of them as text?

* `message_data.tools.utilities.get_optimization_years()` does the same thing as [`ScenarioInfo.Y`](https://docs.messageix.org/projects/models/en/latest/api/util.html#message_ix_models.util.scenarioinfo.ScenarioInfo.Y). The latter is computed only once, and has tests and documentation. The former is not tested, and hits the database every time it is called. I would prefer that we take the opportunity, when moving code to `message_ix_models`, to use the better-maintained solution. Similar applies to `get_historical_years()`.

Just to be clear: This means we should refactor the functions that use these helpers and pass a ScenarioInfo instance to them instead, correct?

@khaeru
Copy link
Member

khaeru commented Jul 3, 2024

At the moment our biggest csv file is ~400 KB. Is it okay to store all of them as text?

Generally yes. Git will automatically compress these files. Some further nuances:

  • If the file will be manually adjusted or edited, then it is helpful to see those changes as a diff of a few lines. When stored as plain text, Git can provide this.
  • If the file is wholesale output of another program and is clearly documented ("The file foobar.csv comes from running foo bar --option=X"), and will only ever be entirely replaced, then diffs would not be needed or very useful, and Git LFS can be used.

Just to be clear: This means we should refactor the functions that use these helpers and pass a ScenarioInfo instance to them instead, correct?

  • Only for files that are added or modified in this PR.
  • It's not strictly necessary to entirely refactor or change the signature of the calling function. It's also possible as an intermediate step to replace e.g. with something like:
    model_years = ScenarioInfo(scen).Y
    This is already an improvement; it will later make it easier to see where the ScenarioInfo object can be passed/used instead of the scenario itself. Since we may want to provide tested and documented alternatives for the codes in .util.compat.message_data, let's not invest too much time in tinkering with them.

@macflo8
Copy link
Contributor Author

macflo8 commented Jul 3, 2024

Okay thanks for confirming! I pushed the suggested changes. I think all the requested changes and comments are addressed now.

Copy link
Member

@khaeru khaeru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After some offline discussion with @macflo8, this is good to merge.

In particular, the codecov/patch check failure is acceptable: per the PR description, that will be addressed via tests to be added in a subsequent PR.

@khaeru khaeru merged commit 9c08ad6 into main Jul 4, 2024
25 of 26 checks passed
@khaeru khaeru deleted the fix/materials-W23 branch July 4, 2024 08:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working enh New features or functionality material MESSAGEix-Materials variant
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants