Data transformations vs data mapping #1
@fjuniorr First off, wow! This is an awesome analysis, and it's fantastic to bring in other systems. I feel very lucky that you have agreed to work with us on this. For me, the beauty of this format is:

```python
transformation_spec = {
    "update": [
        {
            "source": {
                "sumi": "80596FA1-6D62-4392-AC71-509E5F73D39E",
            },
            "target": {
                "name": "Zinc, in ground",
                "uuid": "be73218b-18af-492e-96e6-addd309d1e32",
                "context": ["natural resource", "in ground"],
            },
        }
    ]
}
```

I would add the … The unit conversion is trickier; I need more time to think about that. We can't rely on …
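As a rough sketch of how a client might consume such an `update` spec — the function and matching logic here are hypothetical, not part of any existing library — the idea is to match on the `source` fields and overwrite with the `target` fields:

```python
# Hypothetical sketch, not an existing library API: apply an "update"
# transformation spec by matching on "source" fields and overwriting
# the flow with the "target" fields.
def apply_update(spec, flows):
    updated = []
    for flow in flows:
        for rule in spec.get("update", []):
            # A rule matches when every "source" field equals the flow's value
            if all(flow.get(k) == v for k, v in rule["source"].items()):
                flow = {**flow, **rule["target"]}
        updated.append(flow)
    return updated


spec = {
    "update": [
        {
            "source": {"sumi": "80596FA1-6D62-4392-AC71-509E5F73D39E"},
            "target": {
                "name": "Zinc, in ground",
                "uuid": "be73218b-18af-492e-96e6-addd309d1e32",
                "context": ["natural resource", "in ground"],
            },
        }
    ]
}

flows = [{"sumi": "80596FA1-6D62-4392-AC71-509E5F73D39E", "name": "Zinc"}]
result = apply_update(spec, flows)
# result[0] now carries the target name, uuid, and context
```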
Upon some reflection, I think you might be right that there is an implicit schema, but I don't think that is avoidable... Do we have an example of an OpenLCA flow mapping file in the wild? |
@fjuniorr I hope the above didn't sound too negative. I am totally open to having a different workflow for elementary flows than we do for inventory datasets - but my presumption would be negative, as the types of elementary flow mappings seem very similar to me to other types of flow mappings, such as products. For example, we might need to map one megajoule of "natural gas" to one standard cubic meter of "CH4". To me, this feels very similar, and so should have a common approach. |
Thanks @cmutel! The ramp-up cost of switching domains is real, so it's great to hear that I'm not talking nonsense.
The point I was trying to make is that the second system needs to expect to find the information it needs from the fields
👍
Yeah, I saw your note that PRé have one UUID for flows regardless of the subcategory context values. I was using their id because in SimaProv9.4.csv and SProf94_final_substanceList.xlsx, since there is no subcategory context, they are unique. I noticed that in database-1.json we have the subcategory context. Should we work with a "merge" of the two1? I ask because I also couldn't find the subcategory context in your sample
Eventually there will be a schema! IMHO when is the harder question.
Not yet. The
Not at all! :)
Makes total sense. I will keep you posted on how the porting of the Jupyter notebooks is going. Footnotes
|
I think we are in a good place here. I will leave the final decision to you - I suspect that implementing these ideas will tell you the best way in any case. As you said, the way that client software works with the mappings is up to each software system. For me, the most important thing is to have a maintainable, transparent, and reproducible system which can be applied to new lists as they are released, and one which can get buy-in from the community. Pinging @WesIngwersen, @tngTUDOR, @tfardet, @ccomb, @msrocka, @thomassonderegger, @seanjpollard, @jsvgoncalves, @johannesecoinvent; you are in the data war trenches, please feel free to come in with your opinions, or to bring others into the conversation! The context of this discussion is that @fjuniorr is working to rewrite and update simapro_ecoinvent_elementary_flows to have a mapping format and software which can be easily applied to inventory databases. He has a lot of experience in data engineering, including working with Frictionless Data.
Yes, the You can see the available contexts (in one organization's perspective) here: https://glossary.ecoinvent.org/elementary-exchanges/. Of course, these will also evolve over time. Footnotes |
@fjuniorr this looks great, thanks for that! I'm overall in favor of the 2nd proposal in the initial post. What follows is just me trying to figure out the update proposal by @cmutel in his reply, so feel free to ignore if it turns out to be nonsense. I expect that we want, whenever possible, to just match unique ids and let the target software work with the equivalent objects with its own entries and values, converting whatever values should be changed in the exchanges only (e.g. if units vary). So whenever possible:

```python
mapping_spec = {
    "match": [
        {
            "source": {
                "sumi": "80596FA1-6D62-4392-AC71-509E5F73D39E",
            },
            "target": {
                "@id": "be73218b-18af-492e-96e6-addd309d1e32",
            },
            "conversionFactor": 1,
            "MatchCondition": "=",
        },
        ...
    ]
}
```

If there are no unique ids, we supply whatever is necessary to make the match. In the example given by @cmutel, as SimaPro is the one missing the unique ID, I would expect the mapping to look like:

```python
mapping_spec = {
    "match": [
        {
            "source": {
                "sumi": "non-unique-simapro-id",
                "name": ...,
                "context": ...,
            },
            "target": {
                "@id": "unique-ecoinvent-id",
            },
            "conversionFactor": 1,
            "MatchCondition": "=",
        },
        ...
    ]
}
```

but we don't modify any entry, so the software just works with what it has, as expected. |
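To make the client side of such a `match` entry concrete — the function name and flow values below are made up for illustration — the client looks up the target by its unique id and applies the conversion factor to exchange amounts itself:

```python
# Illustrative client-side lookup; the flow values below are made up.
def match_flow(mapping_spec, source_flow):
    """Return (target_id, conversion_factor) for a source flow, or None.

    The mapping only carries metadata; the client applies the conversion
    factor to exchange amounts itself.
    """
    for entry in mapping_spec["match"]:
        if all(source_flow.get(k) == v for k, v in entry["source"].items()):
            return entry["target"]["@id"], entry.get("conversionFactor", 1)
    return None


spec = {
    "match": [
        {
            "source": {
                "sumi": "non-unique-simapro-id",
                "name": "Methane, fossil",
                "context": ["air"],
            },
            "target": {"@id": "unique-ecoinvent-id"},
            "conversionFactor": 1,
            "MatchCondition": "=",
        }
    ]
}

hit = match_flow(
    spec,
    {"sumi": "non-unique-simapro-id", "name": "Methane, fossil", "context": ["air"]},
)
```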
@tfardet you are totally correct, this is a poor example as the "sumi" is not enough to uniquely identify an object in their ontology. |
Agree 100% with what you said @tfardet! @cmutel, at least for now I don't see a problem in switching the current implementation1 between any of the formats that we discussed2, so it might come down to the use cases that we see in the future and alignment with fedelemflowlist and openLCA FlowMap. I've also created an example project that makes use of #11 to generate mappings from SimaPro 9.4 and ecoinvent 3.7 using the existing logic of simapro_ecoinvent_elementary_flows. Footnotes
|
Interesting discussion; some points regarding the openLCA FlowMap schema:
|
Thanks for the pointers and PR @msrocka! |
At least for inspecting the flows that matched, having the usual … |
@fjuniorr I think I understand this better now - before, I was more focused on specific details and wasn't engaging with the fundamental question of mapping versus transformations. If I understand correctly, we are now focused on mappings across systems. That's fine, but I would like your opinion on how to handle transitive mappings. For example, today I was mapping a flow from SimaPro:

```json
{
    "name": "Copper, 0.52% in sulfide, Cu 0.27% and Mo 8.2E-3% in crude ore",
    "categories": [
        "Resources",
        "in ground"
    ],
    "unit": "kg"
}
```

to ecoinvent version 3.10. This didn't match, because there was a change from ecoinvent version 3.9 to version 3.10; the flow is now:

```json
{
    "name": "Copper",
    "categories": [
        "natural resource",
        "in ground"
    ],
    "unit": "kilogram",
    "uuid": "a9ac40a0-9bea-4c48-afa7-66aa6eb90624",
    "CAS number": "007440-50-8",
    "synonyms": []
}
```

How should we get this right? The 3.9 to 3.10 changes, or any set of transformations, should be usable in our generation of mapping files. To me this is blurring the line between transformations and mapping... |
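One possible sketch of using release-to-release changes transitively — the data shapes, the 3.9 flow name, and the "old-39-uuid" value are all placeholders invented to make the idea concrete:

```python
# Assumed data shapes, described in the comments below each dict.
def compose(simapro_to_39, changes_39_to_310):
    """Compose a SimaPro -> ecoinvent 3.9 mapping with the 3.9 -> 3.10
    changes to obtain a SimaPro -> ecoinvent 3.10 mapping."""
    composed = {}
    for source_name, target_39 in simapro_to_39.items():
        # Follow the 3.9 -> 3.10 change if one exists; otherwise the
        # 3.9 flow carries over unchanged.
        composed[source_name] = changes_39_to_310.get(target_39["uuid"], target_39)
    return composed


# SimaPro name -> ecoinvent 3.9 flow ("old-39-uuid" and the 3.9 name
# are placeholders, not real ecoinvent data)
simapro_to_39 = {
    "Copper, 0.52% in sulfide, Cu 0.27% and Mo 8.2E-3% in crude ore": {
        "name": "Copper, in ground",
        "uuid": "old-39-uuid",
    }
}
# ecoinvent 3.9 uuid -> ecoinvent 3.10 flow
changes_39_to_310 = {
    "old-39-uuid": {
        "name": "Copper",
        "uuid": "a9ac40a0-9bea-4c48-afa7-66aa6eb90624",
    }
}

simapro_to_310 = compose(simapro_to_39, changes_39_to_310)
```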
@cmutel first off, I think we definitely need transformations eventually. However, storing only the mappings makes it easier to create specific transformations for different use cases. The methods …
I think we are concentrating on mappings across different systems mainly because:
Let's ensure we're on the same page with your example1. The flow "Copper, 0.52% in sulfide, Cu 0.27% and Mo 8.2E-3% in crude ore" from SimaPro … However, in … The matches to UUID …

Now, how should we address this? I don't think it's a blurring of lines between transformations and mapping. It's more a result of our current approach, where we don't use information from one mapping (like …).

For example, since "Copper, Cu 5.2E-2%, Pt 4.8E-4%, Pd 2.0E-4%, Rh 2.4E-5%, Ni 3.7E-2% in ore" matched UUID …

I'm not sure if the best approach is to save a state of certified mappings across specific systems and versions, or if we should assume the need to update the match rules to accommodate changes in the flow lists. Footnotes
|
OK with the decision that this library is focused on mappings, and that transformations are needed but can be generated from the mappings and are a separate unit of work. |
I don't see any way around this. The lists per system change, and we can't rely on things like …

So I think we need to plan for generic transformations, and a config parameter to load specific mapping data based on the input/output combinations. |
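A minimal sketch of what that config parameter could look like — the file naming scheme and directory layout are invented for illustration, not an existing convention:

```python
import json
from pathlib import Path


def load_mapping(source, target, mappings_dir="mappings"):
    """Load mapping data for a specific (source list, target list) pair.

    Sketch only: assumes one JSON file per input/output combination,
    named like "simapro-9.4_to_ecoinvent-3.10.json" inside mappings_dir.
    """
    path = Path(mappings_dir) / f"{source}_to_{target}.json"
    if not path.exists():
        raise FileNotFoundError(f"No mapping shipped for {source} -> {target}")
    return json.loads(path.read_text())
```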
It would be great if we could keep the match rules valid and generic for every input/output combination, but I agree that this can get messy or downright impossible. |
The `randonneur` data migration verbs are focused on specifying the changes that need to be made in order to convert a source object into a target object. In our case the focus is converting one source flow to a target flow that may differ in some characteristics but otherwise represents the same flow1.

Take for example this one-to-one mapping2 (suffix in ground):

One way to express the conversion using the `randonneur` `update` verb is:

Which would generate …
I think there are two main downsides to this approach.
Firstly, because we are specifying actual conversion rules, we need to impose a schema on the data. In this example I'm using the GLAD flow mapping format, but the problem persists even if the target flow list is always the same (e.g. ecoinvent), because it will reappear when a client application (i.e. LCA software) expects to consume data in a different format.
Secondly, a conversion that keeps all the metadata from the target flow is verbose because all the fields need to be specified.
A more flexible approach, so that we don't need to impose a schema on the target flow, is to encode matching information on the target node rather than transformations. Extra metadata for the transformations needed (such as unit conversions) should also be added, but the client application should do the actual conversion. For example3:
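A minimal sketch of what such an entry could look like — the field names and values here are assumptions, loosely in the GLAD flow mapping style, with the client doing the arithmetic:

```python
# Illustrative only: the entry stores matching information plus conversion
# metadata; the client application performs the actual unit conversion.
mapping_entry = {
    "source": {
        "name": "Zinc",
        "context": ["Resources", "in ground"],
        "unit": "g",
    },
    "target": {
        "uuid": "be73218b-18af-492e-96e6-addd309d1e32",
        "context": ["natural resource", "in ground"],
    },
    "conversionFactor": 0.001,  # source grams -> target kilograms
}


def convert_amount(entry, amount):
    # What a client would do with the metadata: scale the exchange amount.
    return amount * entry["conversionFactor"]
```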
In the randonneur examples, its model of making changes to data is not so problematic because both source and target data already share the same schema.
The challenge of how to generate the mapping information in a way that is reproducible and inspectable for individuals that don't code remains.
Footnotes

1. From (Edelen, et al., 2017) ↩
2. Using this SimaPro excel sheet represented as a dict and this ecoinvent xml represented as a dict using `xmltodict`. ↩
3. It should be noted that it probably makes more sense to reuse the openLCA schema FlowMap than to create a new format. This means that the input and output flow lists need to be standardized. ↩