-
Notifications
You must be signed in to change notification settings - Fork 30
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cep27 #259
base: source
Are you sure you want to change the base?
Cep27 #259
Conversation
Updates to CEP27
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's probably useful to explicitly mention the need for a major version bump due to this.
source/cep/cep27.rst
Outdated
The **ExplicitInventory** table and **ExplicitInventoryCompact** | ||
table should be merged to a single table, called **Inventories**, | ||
with the following columns: | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since I'm unfamiliar with the previous tables, can you clarify what is changing here by showing the old table layout? (as you did with the others)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure I completely understand the question, but the old table layout is shown so that the reader can clearly understand what is being modified / removed with the proposed change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Explicitinventory table corresponds to a table with all the different inventory in all the different facilities:
SimId -- AgentId -- Time -- InventoryName -- NucId -- Quantity
where the ExplicitInventoryCompact looks more like the new "Inventory"
SimId -- AgentId -- Time -- InventoryName -- Quantity -- Composition
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bam241 answered my question - the old table layout was NOT shown for this case and perhaps should be...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed!
============ ========== | ||
SimId uuid | ||
Recipes string | ||
Composition map<int,double> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The previous layout anticipated the desire to select on nuclide in the query, and hence a different column for each NucId. Perhaps this has not emerged in the wild, but it seems that a consequence of this change would make this no longer possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello, that is a valid point.
Maybe it's necessary for us to define what
For example, if one wants the timeseries mass of Pu239,
the query would be like the following:
SELECT sum(massfrac*quantity) FROM resources
INNER JOIN compositions
ON resources.qualid = compositions.qualid
WHERE nucid=942390000
GROUP BY timecreated
in the newer database structure, it would be:
SELECT quantity, timecreated, composition FROM materials
followed by a script that processes the result:
Get query results
loop through every row of query results
for every row,
look in composition for nucid 94239, get its massfrac value
multiply that by quantity
add that to the list of pu239 value timeseries list
So I do assume that it would take a longer time to accomplish
what you mentioned ( and also needs additional scripting outside of the sqlite query)...
You probably know much more than me, but @scopatz and my initial thought was that
this would have more benefit than loss. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this will not be optimized for a large calculation (1000-10000 facilities), if you want to see the plutonium inventory in the fleet, you will need to load all the composition, get the informations you need and then re-generate a table.
I would prefer a system that allow us to filter using facility's name and nucid, but I am not sure it is possible without having a gigantic table :(
There is a table that is natively in CLASS output file, that might be to consider here: I know that will create redondant informations, but it will also allows analyst to access the inventory of all the something like: In the general case, this will not be a problem because one usually have few different archetype specialization (I usually have 5-20 different facilities in my calculations)... This light InventoryTable could provide a quick access to the informations that are need 80% of the time -- at least for me (I generally never access specific agent informations) I know and understand why we want to avoid having informations in 2 places, or to avoid to store informations that we can retrieve with some computations. But I also believe that the user analysis experience needs to be fast for the simple query (like show me the Pu inventory in my calculation) and I am not sure we can achieve that on a typical US-fleet simulation |
@bam241 if I understood correctly, would there be multiple rows of nucid& quantity for the same inventoryname&time?
|
Yes, but I am not sure everybody will agree on this one, but having the full inventory could allow quick analysis, and restricting it to Prototypes could limits its size. maybe also have the capability to disable the table from the input file could be nice, (or maybe making it optional) |
yeah I was thinking of a case where there are ~30 prototypes (predicting the past), and each of them would get a table. So having it optional would be necessary, I think. Do you want to maybe make some edits to the CEP to pitch that in? (If that's allowed, I'm not sure) |
I'll wait a bit to see how other react to it. If they agree I'll PR an edit. 30 prototypes is not that much, I have actually about 3000 agent on my FCO simulations, if you track less than 100 nuclides your table will be smaller than the actual ExplicitInventory... |
Hello All, I'll discuss the filtering by nuclide question here. There are a few alternatives to what is proposed here.
The current situationAdvantages: The storage is still sparse, in the sense that it doesn't store nuclides that don't exist in the material. This method also does allow direct filtering. Disadvantages: Currently, we store the SimId (uuid, 20 bytes) and the QualId (int, 4 bytes) for every NucId (int, 4 bytes) and MassFrac (double, 8 bytes) that we store. Thus, the overhead from the SimId & QualId is 2/3rds of the total storage size. So in addition to issue of having to roll through a huge number of rows to find the material of interest, the storage is also inefficient. Material DatatypeWe have a fairly sophisticated type system that allows us to store custom classes as unique types. Nothing stops us from having cyclus::Material, cyclus::Product, Resource Buffers, etc from being first class members of the type system. Advantages: In fact, they already are! These types were added to aid in the C++/Python inter-language communication. Using Material as the type instead of map<int, double> would allow the backends to determine how they wish to store the data in these classes. This could be done in such a way that enables direct filtering by SQLite, etc. Disadvantages: Someone would actually have to implement these datatypes (or at least Material) for the backends. I could work on the HDF5 one, but that still leaves SQLite. Additionally, I suspect that if the SQLite backend was implemented in such a way as to allow direct filtering with SQL queries that it will end up having similar data overheads as in the current situation option. A column for every nuclideInstead of having a long table, we could have a wider table where every nuclide (H1, H2, ... U235, U236...) gets its own column. Advantages: This has relatively low SimId & QualId overhead compared to the current situation. It allows direct querying of individual nuclides by picking out the column of interest. Disadvantages: This is not sparse; nuclides that are not present in the Material get a concrete 0.0 value stored in the database. Additionally, this may necessitate a global tracked nuclide set, whereby nuclides in not in this set will not be valid in Cyclus. This is a little antithetical to Cyclus' notion of being robust to whatever an archetype developer wants to supply. Also, for a sufficiently comprehensive set of tracked nuclides (1000 - 4000), there could be technical issues in the backends. SQLite might have a column limit. HDF5 has 64 kb of header data per table, which I have exceeded with the strategy and 1500 nuclide columns (I think) . proposalSo in summary, the proposed version went in as it did because:
Yes, it does this at the cost of not being able to directly filter based on NucId. But all things considered, I think this is a worthy trade off. Doing a secondary filter after the material has been pulled out isn't hard, as @gonuke said this feature isn't used much, and it is expensive when it is used. We can think about other mitigation strategies, such as convenience functions for material filtering or putting in (cy)metrics that do this filtering. I am open to other options, of course, or if people prefer the material or the column strategy listed here, that would be great to know! |
this seems reasonable to me. I would also like to suggest to add an option where we don't track agents and we only differentiate by prototype. even in prediction the past I don't think you need a agent tracking. |
You wouldn't be able to restart, but this is probably OK. In general, this is a good suggestion :) Which table do you propose be dropped or altered?
I think the rule of thumb for the defaults should be that we store everything because we don't know what will be useful. Then if you know what you are doing (or are prepared for the consequences |
the tables that would need to be altered would be:
the idea would be to replace any We can also imagine the possibility to have a cohabitation between the Agent's tables and the Prototype's table.... I can submit a PR. |
@jbae11 - is this still relevant? If so, can you update the CEP number to CEP29. I think we accidentally took CEP27 so this will need a new number and CEP28 is already in the queue. We'll also a need a tutorial on what it's about and where the discussion stands. |
Abstract
@scopatz
This CEP proposes to restructure the |cyclus| output database structure in order to
reduce the number of tables and redundancy of data, and ultimately reduce the number
of
joins
required for data analysis. Doing so would reduce the computing timefor end-user analysis, and allow for a clearer, more concise output database.