Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merge GenDB and SchemaHelper; use GenDB in pclean binary #212

Merged
merged 13 commits into from
Oct 2, 2024
3 changes: 3 additions & 0 deletions cxx/gendb.cc
Original file line number Diff line number Diff line change
Expand Up @@ -60,6 +60,9 @@ void GenDB::incorporate(
// Incorporate the items/value into the query relation.
incorporate_query_relation(prng, query_rel, items, val);
}

// Add to the record_class's CRP.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should have a CRP for the Record class. What is the problem, and is there another way to fix it?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, the specific problem was that I assumed that there would be a CRP for the record class, and I wrote a test under that assumption, and then I added these lines to make that test pass.

But I also use the record class's CRP in pclean_lib.cc::make_pclean_sample to create a class_item for each new row. I'm open to suggestions for better ways to implement that, but I believe that sampling from the record class's CRP is what the model spec says to do.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For make_pclean_sample, I think we should just use a counter for class_item, or have the function take a vector of unique row IDs. I don't think the spec says to sample from the record class's CRP -- there's a one-to-one correspondence between observations and record class entities, and sampling from a record class CRP would result in multiple observations of a single record entity.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. When I was talking about the spec, I was talking about the generative model described on page 7 under "PClean entity model".

domain_crps[schema.query.record_class].incorporate(id, id);
}

// This function walks the class_path of the query, populates the global
Expand Down
40 changes: 35 additions & 5 deletions cxx/gendb_test.cc
Original file line number Diff line number Diff line change
Expand Up @@ -15,14 +15,17 @@ struct SchemaTestFixture {
SchemaTestFixture() {
std::stringstream ss(R"""(
class School
name ~ string
name ~ string(maxlength=60)
degree_dist ~ categorical(k=100)

class Physician
school ~ School
degree ~ stringcat(strings="MD PT NP DO PHD")
specialty ~ stringcat(strings="Family Med:Internal Med:Physical Therapy", delim=":")

class City
name ~ string
state ~ stringcat(strings="AL AK AZ AR CA CO CT DE DC FL GA HI ID IL IN IA KS KY LA ME MD MA MI MN MS MO MT NE NV NH NJ NM NY NC ND OH OK OR PA RI SC SD TN TX UT VT VA WA WV WI WY")

class Practice
city ~ City
Expand All @@ -32,9 +35,11 @@ class Record
location ~ Practice

observe
physician.specialty as Specialty
physician.school.name as School
physician.degree as Degree
location.city.name as City
location.city.name as State
from Record
)""");
[[maybe_unused]] bool ok = read_schema(ss, &schema);
Expand All @@ -48,15 +53,30 @@ observe

void setup_gendb(std::mt19937* prng, GenDB& gendb) {
std::map<std::string, ObservationVariant> obs0 = {
{"Specialty", "Family Med"},
{"School", "Massachusetts Institute of Technology"},
{"Degree", "PHD"},
{"City", "Cambrij"}};
{"City", "Cambrij"},
{"State", "WA"}
};
std::map<std::string, ObservationVariant> obs1 = {
{"School", "MIT"}, {"Degree", "MD"}, {"City", "Cambridge"}};
{"Specialty", "Internal Med"},
{"School", "MIT"},
{"Degree", "MD"},
{"City", "Cambridge"},
{"State", "MA"}};
std::map<std::string, ObservationVariant> obs2 = {
{"School", "Tufts"}, {"Degree", "PT"}, {"City", "Boston"}};
{"Specialty", "Physical Therapy"},
{"School", "Tufts"},
{"Degree", "PT"},
{"City", "Boston"},
{"State", "MA"}};
std::map<std::string, ObservationVariant> obs3 = {
{"School", "Boston University"}, {"Degree", "PhD"}, {"City", "Boston"}};
{"Specialty", "Internal Med"},
{"School", "Boston University"},
{"Degree", "PhD"},
{"City", "Boston"},
{"State", "MA"}};

int i = 0;
while (i < 30) {
Expand Down Expand Up @@ -370,6 +390,11 @@ class City
class Person
birth_city ~ City
home_city ~ City

observe
birth_city.name as BirthCity
home_city.name as HomeCity
from Person
)""");
PCleanSchema schema;
[[maybe_unused]] bool ok = read_schema(ss, &schema);
Expand Down Expand Up @@ -403,6 +428,11 @@ class Practice
class Physician
practice ~ Practice
school ~ School

observe
practice.location.name as PracticeCity
school.location.name as SchoolCity
from Physician
)""");
PCleanSchema schema;
[[maybe_unused]] bool ok = read_schema(ss, &schema);
Expand Down
1 change: 1 addition & 0 deletions cxx/pclean/pclean_lib_test.cc
Original file line number Diff line number Diff line change
Expand Up @@ -58,6 +58,7 @@ Pediatrics,Harvard,MD,Seattle,WA

incorporate_observations(&prng, &gendb, df);
BOOST_TEST(gendb.domain_crps["Record"].N == 5);
BOOST_TEST(gendb.domain_crps["Practice"].N == 5);
}

BOOST_AUTO_TEST_CASE(test_make_pclean_samples) {
Expand Down
Loading