Fix discovered issues and improve MESSAGEix-Materials #201

macflo8 · 2024-06-21T07:45:52Z

Removes message_data dependencies (some .yaml and .csv input files).
Fixes incorrect imports of previously migrated code in utils/compat/message_data.
Removes deprecated/unused variables and code sections.
Addresses Further clean up material module #198.
Implements data packaging mentioned in Test and improve .model.material #194 (comment).
Hides some debugging and project-specific CLI commands of the material-ix group
Moves from using untested helpers in utils/compat/message_data to using tested equivalents of message-ix-models
Removes click options from material-ix commands that are already implemented in message_ix_models/cli.py
Stores csv files in data/material as text (previously stored with git LFS)
General updates to documentation

Details:
Building MESSAGEix-Materials on an existing MESSAGEix-GLOBIOM scenario still requires a few external data files that were previously stored in a private repository. These files are migrated to message-ix-models/data, and the data paths in the modules are updated accordingly in this PR. The files in util/compat/message_data that were migrated in the course of #188 contain some incorrect import statements, which are fixed in this PR.

The following files are needed in the build and were therefore added:

The build, however, still requires proprietary data from the IEA Extended Energy Balances to calibrate historical industry activity. The path to the required file can be specified with a newly added CLI option --iea_data_path in the build command.

All input data for MESSAGEix-Materials is now packaged in a .tar.gz file that can be fetched with mix-models fetch MESSAGEix-Materials.

When executing mix-models material-ix --help, only the basic commands related to building, solving, and reporting are shown.

How to review

Run build/solve/report with material-ix CLI commands and confirm correct execution of each task.

PR checklist

~~[ ] Continuous integration checks all ✅~~ Not possible due to missing tests
~~[ ] Add or expand tests; coverage checks both ✅~~ will be handled via Test and improve .model.material #194
Add, expand, or update documentation.
Update doc/whatsnew.

codecov · 2024-06-26T16:29:51Z

Codecov Report

Attention: Patch coverage is 29.32166% with 323 lines in your changes missing coverage. Please review.

Project coverage is 51.9%. Comparing base (229eecb) to head (4814e7a).

Additional details and impacted files

@@           Coverage Diff           @@
##            main    #201     +/-   ##
=======================================
- Coverage   52.2%   51.9%   -0.3%     
=======================================
  Files        141     142      +1     
  Lines      11346   11250     -96     
=======================================
- Hits        5925    5843     -82     
+ Misses      5421    5407     -14

Files	Coverage Δ
message_ix_models/cli.py	`93.4% <ø> (ø)`
...ssage_ix_models/model/material/data_ammonia_new.py	`8.7% <100.0%> (+0.2%)`	⬆️
...l/material/material_demand/material_demand_calc.py	`17.3% <ø> (-1.6%)`	⬇️
...e_ix_models/tests/model/material/test_data_util.py	`100.0% <100.0%> (ø)`
...s/util/compat/message_data/get_historical_years.py	`20.0% <ø> (ø)`
...util/compat/message_data/get_optimization_years.py	`28.5% <ø> (ø)`
message_ix_models/util/pooch.py	`42.0% <ø> (ø)`
.../compat/message_data/change_technology_lifetime.py	`2.5% <0.0%> (ø)`
...at/message_data/check_scenario_fix_and_inv_cost.py	`6.2% <50.0%> (ø)`
...els/util/compat/message_data/update_h2_blending.py	`18.7% <66.6%> (+6.9%)`	⬆️
... and 9 more

... and 2 files with indirect coverage changes

doc/material/index.rst

khaeru · 2024-06-27T09:47:35Z

message_ix_models/data/node/R12_SSP_V1.yaml

+  description: >-
+    Does not include Saint	Helena,	Ascension	and	Tristan	da	Cunha. 
+    In France-> Mayotte, Reunion data is present post 2010.
+    IEA MAP:  AFRICATOT - (DZA+EGY+LBY+MAR+SDN+SSD+TUN)->IIASA_AFRICA


It's unclear what this description line and the following line mean, or how they are actually/intended to be used in the code:

'AFRICATOT' (for instance) is not defined anywhere.

'IIASA_AFRICA' (for instance) is not defined anywhere.

This added file is not mentioned in doc/pkg-data/node.rst.

The header comment seems to be exactly the same as the one in R12.yaml.

This file was shared by @SiddharthJoshi-Git together with a copy of the preprocessed IEA EWEB 2023. It is used to group and aggregate IEA EWEB data to R12 regions for calibration purposes in MESSAGEix-Materials. Since the file is similar to R12.yaml with main differences in the child values, does it make more sense to merge it into the existing R12.yaml? Otherwise we can also rename the file since R12_SSP_V1.yaml is maybe not very indicative.

It is used to group and aggregate IEA EWEB data to R12 regions for calibration purposes in MESSAGEix-Materials.

Can you point to where specifically this happens?

All of this happens in material/data_util.py. The top-level functions are modify_industry_demand() and add_new_ind_hist_act(), which are called in model.material.build(). They read a parquet file containing the EWEB data, map it to R12 and adjust the existing demand and historical_activity parametrization. At the moment there are still some hardcoded and inefficient parts in the procedure (e.g. reading the parquet file more than once instead of passing it to the respective function).

Okay, I see. I recall this was alluded to in a MESSAGE meeting recently—that some code, somewhere, is being used to process the IEA EWEB into a Parquet file, including some spatial manipulations that are not in .tools.iea.web. Not having seen that code, I can't guess at what it contains, but from the code in message_ix_models.model.material/this branch, it seems some of these labels like "AFRICATOT" are introduced or used by it.

There are several ways in which this appears to duplicate existing message_ix_models code that has tests and docs and is more concise. We need to be careful to avoid this. However, to avoid expanding scope too much, I won't insist that they be removed/reworked in this PR, but will mention them as further TODOs in #194 or subsidiary to it.

As for what to do in this PR:

Per your link, I see the file is used in two calls to this function

message-ix-models/message_ix_models/model/material/data_util.py

Lines 921 to 950 in 36fc249

    def map_iea_db_to_msg_regs(df_iea: pd.DataFrame, reg_map_fname: str) -> pd.DataFrame:
        """

        Parameters
        ----------
        df_iea
            df containing the IEA energy balances data set
        reg_map_fname
            name of file used for mapping countries to MESSAGEix regions
        Returns
        -------
        object

        """
        file_path = package_data_path("node", reg_map_fname)
        yaml_data = read_yaml_file(file_path)
        if "World" in yaml_data.keys():
            yaml_data.pop("World")

        r12_map = {k: v["child"] for k, v in yaml_data.items()}
        r12_map_inv = {k: v[0] for k, v in invert_dictionary(r12_map).items()}

        df_iea = df_iea.merge(
            pd.DataFrame.from_dict(
                r12_map_inv, orient="index", columns=["REGION"]
            ).reset_index(),
            left_on="COUNTRY",
            right_on="index",
        ).drop("index", axis=1)
        return df_iea

I am guessing here because of the lack of tests, but it appears most of the contents of R12_SSP_V1.yaml are not used. Rather, only the child: field is used to populate a "REGION" column in a data frame that has a "COUNTRY" column. Is that correct?
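To make the question concrete, here is a minimal pure-Python sketch of what the function appears to do, assuming that only the `child` field matters. It stands in for the pandas merge and is not the code on the branch: `invert_children` and `add_region` are hypothetical names, and the shortened codelist and the row values are made up for illustration.

```python
def invert_children(nodes: dict) -> dict:
    """Turn {region: {"child": [countries]}} into {country: region}."""
    country_to_region = {}
    for region, info in nodes.items():
        if region == "World":  # the global node is skipped, as in the function above
            continue
        for country in info["child"]:
            # Keep the first region seen for a country, like `v[0]` above
            country_to_region.setdefault(country, region)
    return country_to_region


def add_region(rows: list, country_to_region: dict) -> list:
    """Label each row with a REGION; rows without a match are dropped (inner join)."""
    return [
        {**row, "REGION": country_to_region[row["COUNTRY"]]}
        for row in rows
        if row["COUNTRY"] in country_to_region
    ]


# Hypothetical, heavily shortened codelist and data
nodes = {
    "World": {"child": ["R12_NAM", "R12_EEU"]},
    "R12_NAM": {"child": ["CAN", "USA"]},
    "R12_EEU": {"child": ["POL", "HUN"]},
}
rows = [
    {"COUNTRY": "USA", "value": 1.0},
    {"COUNTRY": "POL", "value": 2.0},
    {"COUNTRY": "XXX", "value": 3.0},  # not in any region, so dropped
]
labelled = add_region(rows, invert_children(nodes))
```

If this reading is right, nothing besides the region names and their `child` lists influences the result.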

If so, this can be simplified a lot using existing features. For example, see:

message-ix-models/message_ix_models/data/node/R12.yaml

Line 19 in dad8f8a

iea-weo-region: Africa

I suspect we can add another ‘annotation’ named, e.g. material-iea-eweb-region and containing these values like "CHINAREG". Then the file can be read with get_codes(), and MappingAdapter can be used to do the relabeling.

I'd be happy to push some commits to make these changes; although again it will be difficult for me to ensure that I preserve behaviour without tests.

I've just pushed the modifications, including one added test. Please have a look and try them out with the private data file.

There was one ambiguous point I couldn't figure out from what's visible on the branch. The existing R12.yaml file contains, for example:

    R12_NAM:
      child: [CAN, GUM, PRI, SPM, USA, VIR]

…whereas the added file contained:

    R12_NAM:
      child: [CAN, USA]

It's not clear why this is different. Does the private data file contain data for e.g. "GUM" that must be omitted for a correct result? Or does the added YAML file only reflect a subset of codes that actually appear in that private data file?

In the latter case, note that it's harmless to keep "GUM → R12_NAM" in the constructed mapping; since there will be no observations with COUNTRY=GUM, the entry simply goes unused.
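A tiny sketch of that point, using plain dictionaries instead of the actual mapping code. The observed country codes are hypothetical; the only assumption is that lookups happen per observed code.

```python
def label_regions(countries, country_to_region):
    """Return the region for each observed country code."""
    return [country_to_region[c] for c in countries]


# Country codes that actually occur in the (hypothetical) data
observed = ["CAN", "USA", "USA"]

subset_map = {"CAN": "R12_NAM", "USA": "R12_NAM"}
superset_map = {**subset_map, "GUM": "R12_NAM", "PRI": "R12_NAM"}

# The extra entries are never looked up, so both mappings give the same labels
same_result = label_regions(observed, subset_map) == label_regions(observed, superset_map)
```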

Thanks, I manually tested with the private data file now. It seems that parametrization is identical except for R12_EEU. The private data file contains entries, where COUNTRY == "YUG", until 2021. This means there is a double counting of all countries that formerly belonged to Yugoslavia. According to the IEA EWEB documentation, IEA reports Yugoslavia under two codes: YUGOND and MYUGO. But I don't know how YUG data was derived in the private data file. So we need to either revisit the pre-processing of the EWEB data or modify the mapping in R12.yaml.
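A toy illustration of the double counting, under the assumption that the YUG rows are an aggregate that overlaps with the successor states. The numbers are made up and are not taken from the IEA data; `regional_total` is a hypothetical helper.

```python
from collections import defaultdict

# Made-up values for one year; NOT taken from the IEA data
observations = [
    ("SRB", 10.0),
    ("HRV", 5.0),
    ("YUG", 15.0),  # aggregate that already covers SRB and HRV in this toy example
]


def regional_total(observations, country_to_region):
    """Sum values per region, skipping codes that are not in the mapping."""
    totals = defaultdict(float)
    for country, value in observations:
        if country in country_to_region:
            totals[country_to_region[country]] += value
    return dict(totals)


with_yug = {"SRB": "R12_EEU", "HRV": "R12_EEU", "YUG": "R12_EEU"}
without_yug = {"SRB": "R12_EEU", "HRV": "R12_EEU"}

total_with = regional_total(observations, with_yug)        # {"R12_EEU": 30.0}
total_without = regional_total(observations, without_yug)  # {"R12_EEU": 15.0}
```

With YUG in the mapping the regional total is twice as large in this example, which is the kind of difference that would show up only for R12_EEU.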

Thanks for doing the check. The annotations I added to R12.yaml include:

material-region: ["*", SCG, YUG, KOSOVO]

The file added on the branch has:

child: [ALB, BGR, BIH, CZE, EST, HRV, HUN, LTU, LVA, MKD, MNE, POL, ROU, SRB, SVK, SVN, KOSOVO]

Curiously the latter does not include any of YUG, YUGOND, or MYUGO, which really makes me scratch my head.

But anyway, with the goal of identical behaviour, in the former, "*" refers to the list of children for the same node, which I see already includes "YUG". So you could try to edit to:

material-region: ["*", SCG, KOSOVO]

(don't duplicate, although with the way the dict is constructed, I don't think this will make any change)

or, most verbose:

material-region: [ALB, BGR, BIH, CZE, EST, HRV, HUN, LTU, LVA, MKD, MNE, POL, ROU, SRB, SVK, SVN, KOSOVO]

…which simply copies the value from the added file—without clarifying why the Yugoslavia-related codes are omitted.

Could you please try those two settings and see what happens?
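To show why the first setting is expected to change nothing while the explicit list does, here is a rough sketch of how a "*" entry could be expanded. This is an assumption about the behaviour described above, not the code in .data_util; `expand_material_region` and the shortened child list are hypothetical.

```python
def expand_material_region(children, annotation):
    """Replace "*" with the node's own children, dropping duplicates in order."""
    expanded = []
    for code in annotation:
        for item in (children if code == "*" else [code]):
            if item not in expanded:
                expanded.append(item)
    return expanded


# Shortened, hypothetical child list for one node; note that it contains YUG
children = ["ALB", "BGR", "YUG"]

first = expand_material_region(children, ["*", "SCG", "YUG", "KOSOVO"])
second = expand_material_region(children, ["*", "SCG", "KOSOVO"])
explicit = expand_material_region(children, ["ALB", "BGR", "KOSOVO"])
```

Here `first` and `second` are identical because "*" already brings in YUG, whereas `explicit` omits it.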

Applying the second suggested setting leads to the correct parametrization. So this would be a viable solution.

Okay, good to confirm. If this works, then perhaps you can run git rebase -i main on your branch and put d/drop beside the commit that added the R12_SSP_V1.yaml file, if it was a single commit. If the file was added together with other files, try e/edit and adjust the commit so the file is not added.

It may also help to:

Put a comment in R12.yaml file directly above that line, linking back to this thread or with a summary, in case others are also confused.

Expand the header comment in R12.yaml to inform people where they could find the R12_SSP_V1.yaml from which we got that information.

message_ix_models/model/material/__init__.py

khaeru · 2024-06-27T09:56:05Z

message_ix_models/model/material/data_methanol_new.py

@@ -184,7 +184,7 @@ def broadcast_reduced_df(df, par_name):

        for col in yr_col_inp:
            yr_cols_codes[col] = literal_eval(df_bc_node[col].values[0])
-            broadcast_years(df_bc_node, yr_col_out, yr_cols_codes, col)
+            df_bc_node = broadcast_years(df_bc_node, yr_col_out, yr_cols_codes, col)


The function called here has no tests or docstring. Does it do something that can't be done with message_ix_models.util.broadcast()? If so, what?

I have now renamed that top-level function to an (imo) better name and added docstrings to all functions in this module. broadcast_years() and broadcast_nodes() use message_ix_models.util.broadcast() to unpivot the rows that are stored in methanol_techno_economic.xlsx.
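The one-line change in the diff suggests that the helper returns a new data frame instead of modifying its argument in place. A simplified sketch of that pattern with plain lists of dictionaries; `expand_years` and the row contents are hypothetical and only mimic the unpivoting of a year list stored as a string:

```python
from ast import literal_eval


def expand_years(rows, column):
    """Return NEW rows, one per year listed in the string stored in `column`."""
    expanded = []
    for row in rows:
        for year in literal_eval(row[column]):
            expanded.append({**row, column: year})
    return expanded


# Hypothetical compact row: the years are stored as a string, as in a spreadsheet cell
rows = [{"technology": "tech_a", "year_vtg": "[2020, 2025, 2030]", "value": 1.5}]

expand_years(rows, "year_vtg")         # result discarded: `rows` is unchanged
unchanged_length = len(rows)

rows = expand_years(rows, "year_vtg")  # result assigned: three rows, one per year
```

Without the assignment the expanded rows are silently lost, which matches the bug fixed here.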

khaeru

I added some comments inline.

Some other points:

Files like message_ix_models/data/material/iea_mappings/*.csv:
- should be stored as text, not using Git LFS.
- contain the same phrase "iea_mappings" in their path and filename; this could be simplified.
- do not appear to be mentioned anywhere in the documentation.

message_data.tools.utilities.get_optimization_years() does the same thing as ScenarioInfo.Y. The latter is computed only once, and has tests and documentation. The former is not tested, and hits the database every time it is called. I would prefer that we take the opportunity, when moving code to message_ix_models, to use the better-maintained solution. Similar applies to get_historical_years().

Apply sorting to MESSAGEix sets based on number of dimensions

ixmp 3.9.0 automatically sorts set lists generated by .set_list() alphabetically. When adding new items to multidimensional sets, the items must first be added to the basic sets that do not depend on other sets, e.g. extending 'cat_addon' requires adding both technologies to the set 'technology' first.
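A small sketch of the ordering idea, assuming each set is described by the list of sets that index it. The index of 'cat_addon' shown here and all element names are illustrative assumptions; `order_by_dimension` is a hypothetical helper, not the function added in this commit.

```python
# Hypothetical subset of sets and the sets that index them
set_index = {
    "cat_addon": ["technology", "type_addon"],
    "technology": [],
    "type_addon": [],
}

# Elements to add; values are illustrative only
new_elements = {
    "cat_addon": [("tech_a", "addon_type_x")],
    "technology": ["tech_a"],
    "type_addon": ["addon_type_x"],
}


def order_by_dimension(names, set_index):
    """Sort set names so that basic sets (no index sets) come first."""
    return sorted(names, key=lambda name: (len(set_index[name]), name))


add_order = order_by_dimension(new_elements, set_index)
```

Alphabetical order alone would put 'cat_addon' first; sorting by the number of dimensions moves it behind the basic sets it depends on.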

macflo8 · 2024-07-02T14:30:49Z

* Files like message_ix_models/data/material/iea_mappings/*.csv:
  * should be stored as text, not using Git LFS.
  * contain the same phrase "iea_mappings" in their path and filename; this could be simplified.
  * do not appear to be mentioned anywhere in the documentation.

At the moment our biggest csv file is ~400 KB. Is it okay to store all of them as text?

* `message_data.tools.utilities.get_optimization_years()` does the same thing as [`ScenarioInfo.Y`](https://docs.messageix.org/projects/models/en/latest/api/util.html#message_ix_models.util.scenarioinfo.ScenarioInfo.Y). The latter is computed only once, and has tests and documentation. The former is not tested, and hits the database every time it is called. I would prefer that we take the opportunity, when moving code to `message_ix_models`, to use the better-maintained solution. Similar applies to `get_historical_years()`.

Just to be clear: This means we should refactor the functions that use these helpers and pass a ScenarioInfo instance to them instead, correct?

khaeru · 2024-07-03T08:59:09Z

At the moment our biggest csv file is ~400 KB. Is it okay to store all of them as text?

Generally yes. Git will automatically compress these files. Some further nuances:

If the file will be manually adjusted or edited, then it is helpful to see those changes as a diff of a few lines. When stored as plain text, Git can provide this.
If the file is wholesale output of another program and is clearly documented ("The file foobar.csv comes from running foo bar --option=X"), and will only ever be entirely replaced, then diffs would not be needed or very useful, and Git LFS can be used.

Just to be clear: This means we should refactor the functions that use these helpers and pass a ScenarioInfo instance to them instead, correct?

Only for files that are added or modified in this PR.
It's not strictly necessary to entirely refactor or change the signature of the calling function. It's also possible as an intermediate step to replace e.g.

message-ix-models/message_ix_models/util/compat/message_data/manual_updates_ENGAGE_SSP2_v417_to_v418.py

Line 419 in c831913

model_years = get_optimization_years(scen)

with something like:
```
model_years = ScenarioInfo(scen).Y
```
This is already an improvement; it will later make it easier to see where the ScenarioInfo object can be passed/used instead of the scenario itself. Since we may want to provide tested and documented alternatives for the codes in .util.compat.message_data, let's not invest too much time in tinkering with them.
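The difference between "hits the database every time" and "computed only once" can be sketched as follows. This is not how ScenarioInfo is implemented; `FakeScenario` and `InfoSketch` are hypothetical stand-ins that only count how often the underlying query runs.

```python
from functools import cached_property


class FakeScenario:
    """Stand-in for a scenario; counts how often the 'database' is queried."""

    def __init__(self, years, first_model_year):
        self._years = years
        self._first_model_year = first_model_year
        self.queries = 0

    def model_years(self):
        self.queries += 1
        return [y for y in self._years if y >= self._first_model_year]


def get_years_uncached(scen):
    """Helper-style access: one query per call."""
    return scen.model_years()


class InfoSketch:
    """Wrapper-style access: the result is computed once and then reused."""

    def __init__(self, scen):
        self._scen = scen

    @cached_property
    def Y(self):
        return self._scen.model_years()


scen = FakeScenario([2010, 2015, 2020, 2025], first_model_year=2020)
for _ in range(3):
    get_years_uncached(scen)
queries_helper = scen.queries  # 3

scen.queries = 0
info = InfoSketch(scen)
for _ in range(3):
    info.Y
queries_info = scen.queries  # 1
```

In this sketch the saving only materializes when the same wrapper object is reused, which is one reason to pass the info object around rather than rebuilding it from the scenario at every call site.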

macflo8 · 2024-07-03T15:02:25Z

Okay thanks for confirming! I pushed the suggested changes. I think all the requested changes and comments are addressed now.

khaeru

After some offline discussion with @macflo8, this is good to merge.

In particular, the codecov/patch check failure is acceptable: per the PR description, that will be addressed via tests to be added in a subsequent PR.

macflo8 force-pushed the fix/materials-W23 branch from 9bbc70e to ea92d38 Compare June 24, 2024 08:18

macflo8 changed the title ~~Fix discovered issues in MESSAGEix-Materials~~ Fix discovered issues and improve MESSAGEix-Materials Jun 27, 2024

macflo8 added the bug, enh, and material labels Jun 27, 2024

macflo8 self-assigned this Jun 27, 2024

macflo8 marked this pull request as ready for review June 27, 2024 09:04

macflo8 requested review from GamzeUnlu95 and khaeru June 27, 2024 09:04

khaeru reviewed Jun 27, 2024

doc/material/index.rst

khaeru reviewed Jun 27, 2024

message_ix_models/model/material/__init__.py

khaeru reviewed Jun 27, 2024

message_ix_models/model/material/__init__.py

khaeru reviewed Jun 27, 2024

macflo8 force-pushed the fix/materials-W23 branch from 9f64c64 to 4347136 Compare July 2, 2024 12:43

macflo8 added 13 commits July 2, 2024 14:57

Fix relative imports in calibration utils

a0791de

Resolve pandas FutureWarning

5512e95

Add missing files for Materials build

f071d04

Add missing files in new directories and replace private_data_path with package_data_path Materials build still had some overseen dependencies on message_data

Apply ruff to util/compat files

d9859bc

Fix off-by-one error in MACRO data preparation

16acf68

Add missing file for UE calibration

5098d5a

Fix bug in new methanol build

ed89710

Handle RuntimeError in MACRO data calculation

f64e802

Resolve pandas FutureWarning

2d61a67

Change pandas options in model.material.data_util

85c7996

Fix incorrect column joining

5eea8d6

Add macro calibration CLI function

f635987

macflo8 and others added 16 commits July 2, 2024 14:58

Add material.tar.gz to .pooch.SOURCE

7af647d

Exclude data/material subdirs from packaging

23e13fb

Add support for .gz files in util/pooch

b6156b7

Remove deprecated demand calculation for ammonia

6385e82

Move documentation of base year demand literature to yaml file

Use util.ixmp for ixmp imports

fe20626

Hide debugging and project specific CLI commands

3e502e3

Undo hard wrapping within sentences

c32b123

Improve material-ix commands

919b267

- Use top-level CLI options in material-ix commands - Replace underscore with hyphens in command names - Replace print with log

Use ScenarioInfo in message_data compat functions

00aef36

Integrate new iea_data_path CLI option

14b602e

Add inline comments to data_methanol_new

309adcb

Test .material.data_util.map_iea_db_to_msg_regs()

4eb4bad

Annotate node/R12.yaml for .model.material

bcdb745

Use R12 node annotations in .material.data_util

3050a2b

- Add .data_util.get_region_map(). - Use message_ix_model built-ins to read the node codelist. - Add/expand docstrings, inline comments. - Drop file name argument from map_iea_db_to_msg_regs(); adjust usage.

Fix R12_EEU IEA EWEB mapping

00852c5

- Define all children in material-region field for R12_EEU explicitly - Extend documentation in R12.yaml

Clean up model.material.__init__

c831913

- Move build function to build.py - Move CLI commands to separate submodule - Remove model.material.build.apply_spec() - Use model.build.apply_spec()

macflo8 force-pushed the fix/materials-W23 branch from 4347136 to c831913 Compare July 2, 2024 13:02

Rename methanol data functions & add docstrings

5f89394

macflo8 added 6 commits July 3, 2024 11:24

Use tested helper functions in util/compat code

2577ad6

Remove separate .gitattributes in data/material

c9b3eff

Restore materials csv files to git from lfs

eab7532

Rename iea mapping files

f8ff701

Update material module documentation

4c2c169

Fix code quality issue

4814e7a

khaeru approved these changes Jul 4, 2024

khaeru merged commit 9c08ad6 into main Jul 4, 2024
25 of 26 checks passed

khaeru deleted the fix/materials-W23 branch July 4, 2024 08:41
