Merge GenDB and SchemaHelper; use GenDB in pclean binary #212

ThomasColthurst · 2024-09-25T20:17:40Z

This pull request is not 100% ready; there are still a few failing tests in pclean_lib_test and gendb_test.

But I thought I would get your thoughts now to make sure we are on the same page about what this is supposed to do.

Will eventually fix #207 and #206

emilyfertig

Overall this LGTM. If we're using pclean.cc to launch GenDB, we should probably move it up a directory and rename it, though not necessarily in this PR.

cxx/gendb.hh

emilyfertig · 2024-09-25T21:05:47Z

cxx/pclean/pclean.cc


  // Run inference
  std::cout << "Running inference ...\n";
-  inference_hirm(&prng, &hirm,


Can we still run HIRM with a PClean-like schema, i.e. Model 6? I think that's something we should try to keep (maybe with a bool flag for HIRM vs. full GenDB, and calling incorporate_observations, inference_hirm, etc. on gendb.hirm).

So not only can we do this, it is currently the only thing we can do -- all inference.cc::inference_gendb does right now is directly call inference_hirm.

But I agree that as we give inference_gendb the ability to transition entity assignments, we should add flags to keep Model 5 & 6 behavior.

Sounds good. Could you add those, as Model 7 inference comes together?
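
A minimal sketch of the kind of flag discussed above, assuming a boolean that switches between HIRM-only inference (Model 5 & 6 behavior) and full GenDB inference. The types HIRMSketch and GenDBSketch, the hirm_only parameter, and the counters are hypothetical stand-ins for illustration; they are not the signatures in inference.cc.

```cpp
#include <cassert>

// Hypothetical stand-ins for the real HIRM and GenDB classes.
struct HIRMSketch {
  int sweeps = 0;  // number of HIRM inference sweeps performed
};

struct GenDBSketch {
  HIRMSketch hirm;
  int entity_moves = 0;  // number of entity-assignment transition steps
};

// Stand-in for inference_hirm: records that the sweeps ran.
void inference_hirm_sketch(HIRMSketch* hirm, int iters) {
  hirm->sweeps += iters;
}

// Hypothetical dispatcher. With hirm_only == true it only delegates to the
// HIRM, which is what the thread says inference_gendb does today. With
// hirm_only == false it also runs a placeholder entity-assignment step.
void inference_gendb_sketch(GenDBSketch* gendb, int iters, bool hirm_only) {
  assert(iters >= 0);
  for (int i = 0; i < iters; ++i) {
    inference_hirm_sketch(&gendb->hirm, 1);
    if (!hirm_only) {
      ++gendb->entity_moves;  // placeholder for transitioning entity assignments
    }
  }
}
```

Keeping the HIRM-only path behind a single flag would let the existing Model 5 & 6 runs stay unchanged while the entity-assignment transitions are developed.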

ThomasColthurst · 2024-09-26T15:20:09Z

Tests pass now.

ThomasColthurst · 2024-09-26T15:23:10Z

My preference, by the way, would be to move gendb.* down into the pclean directory rather than moving pclean.cc up a directory.

emilyfertig · 2024-09-26T15:25:59Z

cxx/gendb.cc

@@ -60,6 +60,9 @@ void GenDB::incorporate(
    // Incorporate the items/value into the query relation.
    incorporate_query_relation(prng, query_rel, items, val);
  }
+
+  // Add to the record_class's CRP.


I don't think we should have a CRP for the Record class. What is the problem, and is there another way to fix it?

Well, the specific problem was that I assumed that there would be a CRP for the record class, and I wrote a test under that assumption, and then I added these lines to make that test pass.

But I also use the record class's CRP in pclean_lib.cc::make_pclean_sample to create a class_item for each new row. I'm open to suggestions for better ways to implement that, but I believe that sampling from the record class's CRP is what the model spec says to do.

For make_pclean_sample, I think we should just use a counter for class_item, or have the function take a vector of unique row IDs. I don't think the spec says to sample from the record class's CRP -- there's a one-to-one correspondence between observations and record class entities, and sampling from a record class CRP would result in multiple observations of a single record entity.

Done. When I was talking about the spec, I was talking about the generative model described on page 7 under "PClean entity model".
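
To illustrate the distinction drawn above, here is a small sketch contrasting counter-based row IDs with draws from a textbook Chinese restaurant process. The function names and the use of std::mt19937 are assumptions made for this example, not the code in pclean_lib.cc or the repository's CRP class.

```cpp
#include <cassert>
#include <numeric>
#include <random>
#include <vector>

// Counter-based IDs: row i of a batch gets entity ID start_row + i, so every
// observation maps to its own record-class entity.
std::vector<int> sequential_row_ids(int start_row, int num_rows) {
  assert(num_rows >= 0);
  std::vector<int> ids(num_rows);
  std::iota(ids.begin(), ids.end(), start_row);
  return ids;
}

// Textbook CRP draw: an existing table t is chosen with weight counts[t], a
// new table with weight alpha. Existing tables can be drawn again, so two
// rows may end up sharing one entity.
int sample_crp_table(std::mt19937* prng, std::vector<int>* counts,
                     double alpha) {
  assert(alpha >= 0.0);
  int table = 0;
  if (!counts->empty()) {
    std::vector<double> weights(counts->begin(), counts->end());
    weights.push_back(alpha);  // last slot stands for "open a new table"
    std::discrete_distribution<int> dist(weights.begin(), weights.end());
    table = dist(*prng);
  }
  if (table == static_cast<int>(counts->size())) {
    counts->push_back(0);  // a new table was opened
  }
  ++(*counts)[table];
  return table;
}
```

With the counter, the IDs are unique by construction. With the CRP, any draw after the first can land on an already occupied table, which is the "multiple observations of a single record entity" outcome described above.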

emilyfertig · 2024-09-26T15:37:16Z

Since GenDB is a combination of HIRM and PClean, maybe we should do something like

cxx/
  pclean/
  hirm/
    relation.hh
    domain.hh
    ...
  distributions/
    ...
  gendb.hh
  gendb_main.cc <- renamed from pclean.cc

In any case, I think we can do better than the current structure; I'll file an issue to figure it out.

ThomasColthurst · 2024-09-26T19:48:20Z

Two of the integration tests are failing with crashes during inference.

emilyfertig

Thanks!

emilyfertig · 2024-09-26T19:52:16Z

cxx/gendb.hh

+  bool only_final_emissions;
+  bool record_class_is_clean;
+  std::map<std::string, std::vector<std::string>> domains;


Could you add a comment for "domains"?

emilyfertig · 2024-09-26T19:58:27Z

cxx/pclean/pclean_lib.cc

  DataFrame df;
  for (int i = 0; i < num_samples; i++) {
     std::map<std::string, std::string> query_values;
-     WIP_make_pclean_sample(hirm, schema, annotated_domains_for_relations,
-                        prng, &query_values);
+     make_pclean_sample(prng, gendb, start_row + i, &query_values);


Maybe add an assertion that start_row isn't already in the record class, or add a comment that explains the assumption (that entity IDs greater than or equal to start_row aren't already in the record class).

Added comment.
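
For reference, the assertion alternative could look roughly like the sketch below. It assumes the record class's existing entity IDs are available as a std::set<int>; that container and the helper name are hypothetical and do not reflect how GenDB stores its entities.

```cpp
#include <cassert>
#include <set>

// Returns true when no existing record-class entity has an ID greater than
// or equal to start_row, i.e. IDs start_row, start_row + 1, ... are unused.
bool rows_are_fresh(const std::set<int>& existing_ids, int start_row) {
  // lower_bound finds the first ID >= start_row; there must be none.
  return existing_ids.lower_bound(start_row) == existing_ids.end();
}

// Hypothetical guard to call before generating samples from start_row on.
void check_start_row(const std::set<int>& existing_ids, int start_row) {
  assert(rows_are_fresh(existing_ids, start_row));
}
```

A comment documents the assumption for readers, while a check like this would also catch a violated assumption in debug builds.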

ThomasColthurst · 2024-10-02T14:25:14Z

Thanks!

Integration tests now pass after merging with #215 and setting new_rows_have_unique_entities=true in pclean_lib::incorporate_observations.

ThomasColthurst added 3 commits September 24, 2024 21:02

Merge GenDB and SchemaHelper and use GenDB in pclean

8b796e5

Finish initial pass of pclean_lib rewrite

a7cd141

Fix build errors

bd8ff32

ThomasColthurst requested a review from emilyfertig September 25, 2024 20:17

Merge with master

f1019e0

emilyfertig reviewed Sep 25, 2024

Fix bugs revealed by tests

842dda6

emilyfertig reviewed Sep 26, 2024

ThomasColthurst added 5 commits September 26, 2024 15:54

Add descriptions to compute_domain_cache and other methods

885d252

Generate pclean samples by row number, not from CRP samples

427be6d

Fix make_pclean_sample to create the correct entities

e3245e7

Remove debug printfs

86d7d43

Comment out failing test for now

50e6c24

emilyfertig approved these changes Sep 26, 2024

Debugging printfs

31f857b

ThomasColthurst mentioned this pull request Sep 30, 2024

Add sample_new parameter to gendb::incorporate #215

Merged

ThomasColthurst added 2 commits October 1, 2024 14:28

Nothing

1300c6a

Resolve merge

57ef714

ThomasColthurst merged commit fca1b5b into master Oct 2, 2024
1 of 2 checks passed

ThomasColthurst deleted the 240924-thomaswc-merge_gendb branch October 2, 2024 14:25
