Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cep27 #259

Open
wants to merge 6 commits into
base: source
Choose a base branch
from
Open

Cep27 #259

wants to merge 6 commits into from

Conversation

jbae11
Copy link
Contributor

@jbae11 jbae11 commented Sep 14, 2017

Abstract

@scopatz
This CEP proposes to restructure the |cyclus| output database structure in order to
reduce the number of tables and redundancy of data, and ultimately reduce the number
of joins required for data analysis. Doing so would reduce the computing time
for end-user analysis, and allow for a clearer, more concise output database.

Copy link
Member

@gonuke gonuke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's probably useful to explicitly mention the need for a major version bump due to this.

The **ExplicitInventory** table and **ExplicitInventoryCompact**
table should be merged to a single table, called **Inventories**,
with the following columns:

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since I'm unfamiliar with the previous tables, can you clarify what is changing here by showing the old table layout? (as you did with the others)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure I completely understand the question, but the old table layout is shown so that the reader can clearly understand what is being modified / removed with the proposed change.

Copy link
Member

@bam241 bam241 Sep 17, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Explicitinventory table corresponds to a table with all the different inventory in all the different facilities:
SimId -- AgentId -- Time -- InventoryName -- NucId -- Quantity

where the ExplicitInventoryCompact looks more like the new "Inventory"
SimId -- AgentId -- Time -- InventoryName -- Quantity -- Composition

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bam241 answered my question - the old table layout was NOT shown for this case and perhaps should be...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed!

============ ==========
SimId uuid
Recipes string
Composition map<int,double>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The previous layout anticipated the desire to select on nuclide in the query, and hence a different column for each NucId. Perhaps this has not emerged in the wild, but it seems that a consequence of this change would make this no longer possible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello, that is a valid point.
Maybe it's necessary for us to define what
For example, if one wants the timeseries mass of Pu239,
the query would be like the following:

SELECT sum(massfrac*quantity) FROM resources
INNER JOIN compositions
ON resources.qualid = compositions.qualid
WHERE nucid=942390000
GROUP BY timecreated

in the newer database structure, it would be:

SELECT quantity, timecreated, composition   FROM materials

followed by a script that processes the result:

Get query results
loop through every row of query results
for every row, 
    look in composition for nucid 94239, get its massfrac value
    multiply that by quantity
    add that to the list of pu239 value timeseries list

So I do assume that it would take a longer time to accomplish
what you mentioned ( and also needs additional scripting outside of the sqlite query)...

You probably know much more than me, but @scopatz and my initial thought was that
this would have more benefit than loss. What do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this will not be optimized for a large calculation (1000-10000 facilities), if you want to see the plutonium inventory in the fleet, you will need to load all the composition, get the informations you need and then re-generate a table.

I would prefer a system that allow us to filter using facility's name and nucid, but I am not sure it is possible without having a gigantic table :(

@bam241
Copy link
Member

bam241 commented Sep 17, 2017

There is a table that is natively in CLASS output file, that might be to consider here:
We have tables that regroup information by prototype definition(sum over the AgendId if they are the same prototypes).

I know that will create redondant informations, but it will also allows analyst to access the inventory of all the PWR_UOX without any computation.

something like:
SimId -- Prototype -- Time -- InventoryName -- NucId -- Quantity

In the general case, this will not be a problem because one usually have few different archetype specialization (I usually have 5-20 different facilities in my calculations)...
This will be an issue when you have a specific prototype declaration for each agents you are using.

This light InventoryTable could provide a quick access to the informations that are need 80% of the time -- at least for me (I generally never access specific agent informations)


I know and understand why we want to avoid having informations in 2 places, or to avoid to store informations that we can retrieve with some computations. But I also believe that the user analysis experience needs to be fast for the simple query (like show me the Pu inventory in my calculation) and I am not sure we can achieve that on a typical US-fleet simulation

@jbae11
Copy link
Contributor Author

jbae11 commented Sep 18, 2017

@bam241 if I understood correctly, would there be multiple rows of nucid& quantity for the same inventoryname&time?

SimId -- PWR_UOX -- 1 -- fuel -- 922350000 -- 123
SimId -- PWR_UOX -- 1 -- fuel -- 922380000 -- 1245
SimId -- PWR_UOX -- 1 -- fuel -- 942390000 -- 124
and so on?

@bam241
Copy link
Member

bam241 commented Sep 18, 2017

Yes, but I am not sure everybody will agree on this one, but having the full inventory could allow quick analysis, and restricting it to Prototypes could limits its size.

maybe also have the capability to disable the table from the input file could be nice, (or maybe making it optional)

@jbae11
Copy link
Contributor Author

jbae11 commented Sep 18, 2017

yeah I was thinking of a case where there are ~30 prototypes (predicting the past), and each of them would get a table. So having it optional would be necessary, I think. Do you want to maybe make some edits to the CEP to pitch that in? (If that's allowed, I'm not sure)

@bam241
Copy link
Member

bam241 commented Sep 18, 2017

I'll wait a bit to see how other react to it. If they agree I'll PR an edit.

30 prototypes is not that much, I have actually about 3000 agent on my FCO simulations, if you track less than 100 nuclides your table will be smaller than the actual ExplicitInventory...

@scopatz
Copy link
Member

scopatz commented Sep 26, 2017

Hello All, I'll discuss the filtering by nuclide question here. There are a few alternatives to what is proposed here.

  1. The current situation, ie a NucId column
  2. Have a Material (or Composition) datatype in type system, and use this instead of map<int, double>
  3. Have a column for each nuclide tracked.

The current situation

Advantages: The storage is still sparse, in the sense that it doesn't store nuclides that don't exist in the material. This method also does allow direct filtering.

Disadvantages: Currently, we store the SimId (uuid, 20 bytes) and the QualId (int, 4 bytes) for every NucId (int, 4 bytes) and MassFrac (double, 8 bytes) that we store. Thus, the overhead from the SimId & QualId is 2/3rds of the total storage size. So in addition to issue of having to roll through a huge number of rows to find the material of interest, the storage is also inefficient.

Material Datatype

We have a fairly sophisticated type system that allows us to store custom classes as unique types. Nothing stops us from having cyclus::Material, cyclus::Product, Resource Buffers, etc from being first class members of the type system.

Advantages: In fact, they already are! These types were added to aid in the C++/Python inter-language communication. Using Material as the type instead of map<int, double> would allow the backends to determine how they wish to store the data in these classes. This could be done in such a way that enables direct filtering by SQLite, etc.

Disadvantages: Someone would actually have to implement these datatypes (or at least Material) for the backends. I could work on the HDF5 one, but that still leaves SQLite. Additionally, I suspect that if the SQLite backend was implemented in such a way as to allow direct filtering with SQL queries that it will end up having similar data overheads as in the current situation option.

A column for every nuclide

Instead of having a long table, we could have a wider table where every nuclide (H1, H2, ... U235, U236...) gets its own column.

Advantages: This has relatively low SimId & QualId overhead compared to the current situation. It allows direct querying of individual nuclides by picking out the column of interest.

Disadvantages: This is not sparse; nuclides that are not present in the Material get a concrete 0.0 value stored in the database. Additionally, this may necessitate a global tracked nuclide set, whereby nuclides in not in this set will not be valid in Cyclus. This is a little antithetical to Cyclus' notion of being robust to whatever an archetype developer wants to supply. Also, for a sufficiently comprehensive set of tracked nuclides (1000 - 4000), there could be technical issues in the backends. SQLite might have a column limit. HDF5 has 64 kb of header data per table, which I have exceeded with the strategy and 1500 nuclide columns (I think) .

proposal

So in summary, the proposed version went in as it did because:

  • map<int, double> is available and implemented in all backends
  • it is sparse,
  • minimizes SimId and QualId overhead.

Yes, it does this at the cost of not being able to directly filter based on NucId. But all things considered, I think this is a worthy trade off. Doing a secondary filter after the material has been pulled out isn't hard, as @gonuke said this feature isn't used much, and it is expensive when it is used.

We can think about other mitigation strategies, such as convenience functions for material filtering or putting in (cy)metrics that do this filtering.

I am open to other options, of course, or if people prefer the material or the column strategy listed here, that would be great to know!

@bam241
Copy link
Member

bam241 commented Sep 26, 2017

this seems reasonable to me.

I would also like to suggest to add an option where we don't track agents and we only differentiate by prototype.
I don't know for you but in most of my simulation, I don't need to track individual agent by only prototypes.
this could reduce a lot the size of the database, and make it cheaper to analyze the output.

even in prediction the past I don't think you need a agent tracking.
(for me, it actually should the default, but I am probably bias by my own applications)

@scopatz
Copy link
Member

scopatz commented Sep 26, 2017

I would also like to suggest to add an option where we don't track agents and we only differentiate by prototype.

You wouldn't be able to restart, but this is probably OK. In general, this is a good suggestion :) Which table do you propose be dropped or altered?

(for me, it actually should the default, but I am probably bias by my own applications)

I think the rule of thumb for the defaults should be that we store everything because we don't know what will be useful. Then if you know what you are doing (or are prepared for the consequences _), you can start turning things off ;)

@bam241
Copy link
Member

bam241 commented Sep 26, 2017

the tables that would need to be altered would be:

  • Resources
  • Materials
  • Products
  • Transactions
  • ResCreators
  • Inventories

the idea would be to replace any AgentId/SenderId/ReceiverId/ParentX into:
Prototype name.
this will imply to recompute the recipes and the compositions of the different materials.

We can also imagine the possibility to have a cohabitation between the Agent's tables and the Prototype's table....

I can submit a PR.

@gonuke
Copy link
Member

gonuke commented Feb 7, 2024

@jbae11 - is this still relevant? If so, can you update the CEP number to CEP29. I think we accidentally took CEP27 so this will need a new number and CEP28 is already in the queue.

We'll also a need a tutorial on what it's about and where the discussion stands.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants