-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merge GenDB and SchemaHelper; use GenDB in pclean binary #212
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall this LGTM. (If we're using pclean.cc to launch GenDB, we should probably move it up a directory and rename it, not necessarily in this PR).
|
||
// Run inference | ||
std::cout << "Running inference ...\n"; | ||
inference_hirm(&prng, &hirm, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we still run HIRM with a PClean-like schema, i.e. Model 6? I think that's something we should try to keep (maybe with a bool flag for HIRM vs full GenDB, and calling incorporate_observations, inference_hirm etc on gendb.hirm)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So not only can we do this, it is currently the only thing we can do -- all inference.cc::inference_gendb does right now is directly call inference_hirm.
But I agree that as we add the ability to inference_gendb to transition entity assignments, we should add flags to keep Model 5 & 6 behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. Could you add those, as Model 7 inference comes together?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure.
Tests pass now. |
My preference, by the way, would be to move gendb.* down into the pclean directory rather than moving pclean.cc up a directory. |
cxx/gendb.cc
Outdated
@@ -60,6 +60,9 @@ void GenDB::incorporate( | |||
// Incorporate the items/value into the query relation. | |||
incorporate_query_relation(prng, query_rel, items, val); | |||
} | |||
|
|||
// Add to the record_class's CRP. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should have a CRP for the Record class. What is the problem, and is there another way to fix it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, the specific problem was that I assumed that there would be a CRP for the record class, and I wrote a test under that assumption, and then I added these lines to make that test pass.
But I also use the record class's CRP in pclean_lib.cc::make_pclean_sample to create a class_item for each new row. I'm open to suggestions for better ways to implement that, but I believe that sampling from the record class's CRP is what the model spec says to do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For make_pclean_sample, I think we should just use a counter for class_item, or have the function take a vector of unique row IDs. I don't think the spec says to sample from the record class's CRP -- there's a one-to-one correspondence between observations and record class entities, and sampling from a record class CRP would result in multiple observations of a single record entity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. When I was talking about the spec, I was talking about the generative model described on page 7 under "PClean entity model".
Since GenDB is a combination of HIRM and PClean, maybe we should do something like cxx/ In any case I think we can do better than the current structure, I'll file an issue to figure it out. |
Two of the integration tests are failing with crashes during inference. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
bool only_final_emissions; | ||
bool record_class_is_clean; | ||
std::map<std::string, std::vector<std::string>> domains; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a comment for "domains"?
DataFrame df; | ||
for (int i = 0; i < num_samples; i++) { | ||
std::map<std::string, std::string> query_values; | ||
WIP_make_pclean_sample(hirm, schema, annotated_domains_for_relations, | ||
prng, &query_values); | ||
make_pclean_sample(prng, gendb, start_row + i, &query_values); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add an assertion that start_row isn't already in the record class, or add a comment that explains the assumption (that entity IDs greater than or equal to start_row aren't already in the record class).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added comment.
Integration tests now pass, post merge with #215 and using new_rows_have_unique_entities=true in pclean_lib::incorporate_observations. |
This pull request is not 100% ready; there are still a few failing tests in pclean_lib_test and gendb_test.
But I thought I would get your thoughts now to make sure we are on the same page about what this is supposed to do.
Will eventually fix #207 and #206