Load and incorporate observations in pclean main #120

Merged
merged 17 commits into from
Aug 8, 2024
Changes from 6 commits
23 changes: 23 additions & 0 deletions cxx/pclean/BUILD
@@ -59,12 +59,35 @@ cc_binary(
    name = "pclean",
    srcs = ["pclean.cc"],
    deps = [
        ":csv",
        ":io",
        ":pclean_lib",
        ":schema",
        ":schema_helper",
        "//:cxxopts",
        "//:hirm_lib",
        "//:inference",
        "//:util_io",
    ],
)

cc_library(
    name = "pclean_lib",
    hdrs = ["pclean_lib.hh"],
    srcs = ["pclean_lib.cc"],
    deps = [
        ":csv",
        "//:hirm_lib",
        "//:util_io",
    ],
)

cc_test(
    name = "pclean_lib_test",
    srcs = ["pclean_lib_test.cc"],
    deps = [
        ":pclean_lib",
        "@boost//:test",
    ],
)

11 changes: 10 additions & 1 deletion cxx/pclean/csv.cc
@@ -53,7 +53,16 @@ DataFrame DataFrame::from_csv(
       df.data[col_names[i++]].push_back(part);
     }
     if (!first_line) {
-      assert(i == col_names.size());
+      if (i != col_names.size()) {
+        if (line.back() == ',') {
+          // std::getline is broken and won't let the last field be empty.
+          df.data[col_names[i++]].push_back("");
+        } else {
+          printf("Only found %ld out of %ld expected columns in line\n%s\n",
+                 i, col_names.size(), line.c_str());
+          assert(false);
+        }
+      }
     }
     first_line = false;
   }
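
The trailing-comma case handled in this hunk comes from how `std::getline` treats delimiters: once the final `,` has been consumed, the next call extracts nothing and fails, so an empty last field is never reported. Below is a minimal standalone sketch of that behavior and the same workaround. `split_csv_line` is a hypothetical helper written for illustration, not the project's `DataFrame::from_csv`, and it ignores CSV quoting.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not part of this PR): split one CSV line on commas.
// Quoted fields are not supported; this only illustrates the getline quirk.
std::vector<std::string> split_csv_line(const std::string& line) {
  std::vector<std::string> fields;
  std::istringstream ss(line);
  std::string part;
  while (std::getline(ss, part, ',')) {
    fields.push_back(part);
  }
  // After a trailing delimiter, getline hits end-of-stream without
  // extracting anything, so "a,b," yields only {"a", "b"}.
  // Append the empty last field by hand, as the hunk above does.
  if (!line.empty() && line.back() == ',') {
    fields.push_back("");
  }
  return fields;
}
```

Empty fields in the middle of a line (`a,,c`) need no special handling, because the delimiter itself counts as an extracted character.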
23 changes: 21 additions & 2 deletions cxx/pclean/pclean.cc
@@ -8,9 +8,13 @@
#include <random>

#include "cxxopts.hpp"
#include "irm.hh"
#include "hirm.hh"
#include "inference.hh"
#include "util_io.hh"
#include "pclean/csv.hh"
#include "pclean/io.hh"
#include "pclean/pclean_lib.hh"
#include "pclean/schema.hh"
#include "pclean/schema_helper.hh"

@@ -49,6 +53,7 @@ int main(int argc, char** argv) {
  std::cout << "Reading pclean schema ...\n";
  PCleanSchema pclean_schema;
  std::string schema_fn = result["schema"].as<std::string>();
  std::cout << "Reading schema file from " << schema_fn << "\n";
  if (!read_schema_file(schema_fn, &pclean_schema)) {
    std::cout << "Error reading schema file " << schema_fn << "\n";
  }
@@ -62,16 +67,30 @@ int main(int argc, char** argv) {
  // Read observations
  std::cout << "Reading observations ...\n";
  std::string obs_fn = result["obs"].as<std::string>();
  // TODO(thomaswc): This
  std::cout << "Reading observations file from " << obs_fn << "\n";
  DataFrame df = DataFrame::from_csv(obs_fn);

  // Validate that we have a relation for each observation column.
  for (const auto &col : df.data) {
    if (!hirm_schema.contains(col.first)) {
      printf("Error, could not find HIRM relation for column %s\n",
             col.first.c_str());
      assert(false);
    }
  }

  // Create model
  std::cout << "Creating hirm ...\n";
  HIRM hirm(hirm_schema, &prng);

  // Incorporate observations.
  // TODO(thomaswc): This
  std::cout << "Incorporating observations ...\n";
  T_observations observations = translate_observations(df, hirm_schema);
  T_encoding encoding = encode_observations(hirm_schema, observations);
  incorporate_observations(&prng, &hirm, encoding, observations);

  // Run inference
  std::cout << "Running inference ...\n";
  inference_hirm(&prng, &hirm,
                 result["iters"].as<int>(),
                 result["timeout"].as<int>(),
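
The validation loop in this hunk aborts on the first CSV column that has no HIRM relation, while `translate_observations` (below) logs and skips such columns. As a rough sketch of a middle ground, the check could collect every unmatched column before deciding what to do. `columns_missing_from_schema` is a hypothetical helper, and the column container is assumed to be a `std::map` from column name to string values, which is how `df.data` is used in this diff.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper (not part of this PR): return the names of all columns
// that have no entry in `schema`. Schema can be any map-like type keyed by
// std::string, for example the project's T_schema.
template <typename Schema>
std::vector<std::string> columns_missing_from_schema(
    const std::map<std::string, std::vector<std::string>>& columns,
    const Schema& schema) {
  std::vector<std::string> missing;
  for (const auto& col : columns) {
    if (schema.find(col.first) == schema.end()) {
      missing.push_back(col.first);
    }
  }
  return missing;
}
```

A caller could then print the whole list in one error message, or downgrade it to a warning for index-like columns such as the `Column1` column used in the test CSV.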
43 changes: 43 additions & 0 deletions cxx/pclean/pclean_lib.cc
@@ -0,0 +1,43 @@
// Copyright 2024
// Apache License, Version 2.0, refer to LICENSE.txt

#include "irm.hh"
#include "pclean/csv.hh"
#include "pclean/pclean_lib.hh"

T_observations translate_observations(
    const DataFrame& df, const T_schema &schema) {
  T_observations obs;

  for (const auto& col : df.data) {
    const std::string& col_name = col.first;
    if (!schema.contains(col_name)) {
      printf("Schema does not contain %s, skipping ...\n", col_name.c_str());
      continue;
    }

    const T_relation& trel = schema.at(col_name);
    size_t num_domains = std::visit([&](const auto &r) {
      return r.domains.size();}, trel);

    for (size_t i = 0; i < col.second.size(); ++i) {
      const std::string& val = col.second[i];
      if (val.empty()) {
        // Don't incorporate missing values.
        // TODO(thomaswc): Allow the user to specify other values that mean
        // missing data. ("missing", "NA", "nan", etc.).
        continue;
      }

      std::vector<std::string> entities;
      for (size_t j = 0; j < num_domains; ++j) {
        // Give every row its own universe of unique ids.
Collaborator (@emilyfertig, Aug 6, 2024):

Now that I think about this more, I don't think we should initialize unobserved "observations" here. Either we should sample them from the generative model once we have an HIRM instance (which is what we'll do for Model 7 anyway), or for the purposes of Model 5/6 we should read them in from a file (depending on whether we decide that's necessary with the MIT folks -- sorry I haven't had a chance to post on Slack yet).

For the purposes of this PR, I think we can just omit unobserved values from the observations (i.e. assume only relations with is_observed == true have observations).

Collaborator Author:

You are using the word "unobserved" in a different way than I would use it.

For me, the schema creates the relations, and then the data comes in, and after all the data is in, we can mark a relation as observed or unobserved depending on whether it had any observations.

If I had to guess, you are maybe using "unobserved" to mean something like "a relation for which we will need to initialize and maintain a latent state"?

But just to be clear, every observation here is for a noisy relation generated in the second part of PCleanSchemaHelper::make_hirm_schema -- those are the only ones that correspond to CSV column names. None of those noisy relations will need to have a hidden latent state. Even if somehow all of their data turned out to be missing from the CSV file, there are no other relations that depend on them.

Collaborator (@emilyfertig, Aug 7, 2024):

> For me, the schema creates the relations, and then the data comes in, and after all the data is in, we can mark a relation as observed or unobserved depending on whether it had any observations.

Ok, I think I see. The observed relations should be exactly those defined by the observe statement, so we can declare a relation to be "observed" if it appears in that statement, right? I wouldn't expect there to be data for relations not defined by observe, and if a relation defined by observe doesn't have any observations in the data, I think that's a user error. Do you agree?

> If I had to guess, you are maybe using "unobserved" to mean something like "a relation for which we will need to initialize and maintain a latent state"?

Assuming the observed relations are those defined by the observe statement (and those which have observed data) I think these are equivalent -- everything that isn't observed (which should be the base relations of noisy relations), we have to infer.

Collaborator Author:

> > For me, the schema creates the relations, and then the data comes in, and after all the data is in, we can mark a relation as observed or unobserved depending on whether it had any observations.
>
> Ok, I think I see. The observed relations should be exactly those defined by the observe statement, so we can declare a relation to be "observed" if it appears in that statement, right?

Well, again, I think the best/safest/most intuitive thing to do is to call a relation observed iff we see data for it. So in line with that, I would mark relations as is_observed inside the incorporate_observations function.

But since we don't do that currently (is_observed is marked in load_observations), I've done the next best thing and declared the relations created from the "observe" clause in the schema as is_observed in this pull request.

> I wouldn't expect there to be data for relations not defined by observe, and if a relation defined by observe doesn't have any observations in the data, I think that's a user error. Do you agree?

Not quite. If I have a schema and a csv file, and then I drop a column from the csv file, I wouldn't call it user error to run pclean on the combination, even though there would be an observe relation without any observations. I think it would be quite easy to support that use case, so we probably should (while perhaps agreeing that supporting it isn't the most important priority for a research codebase).

> > If I had to guess, you are maybe using "unobserved" to mean something like "a relation for which we will need to initialize and maintain a latent state"?
>
> Assuming the observed relations are those defined by the observe statement (and those which have observed data) I think these are equivalent -- everything that isn't observed (which should be the base relations of noisy relations), we have to infer.

But just to be clear, every observation here is for a noisy relation generated in the second part of PCleanSchemaHelper::make_hirm_schema -- those are the only ones that correspond to CSV column names. None of those noisy relations will need to have a hidden latent state. Even if somehow all of their data turned out to be missing from the CSV file, there are no other relations that depend on them.
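
To make the dropped-column case from this exchange concrete, here is a small sketch that lists the relations named in the "observe" clause that ended up with no data. The `Observations` alias is a simplified stand-in for `T_observations`, and `declared_without_data` is a hypothetical helper; neither is taken from the codebase.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <tuple>
#include <vector>

// Simplified stand-in for T_observations (an assumption for illustration):
// relation name -> list of (entity ids, observed value).
using Observation = std::tuple<std::vector<std::string>, std::string>;
using Observations = std::map<std::string, std::vector<Observation>>;

// Hypothetical helper: relations declared in the "observe" clause that
// received no observations, e.g. because their CSV column was dropped.
std::vector<std::string> declared_without_data(
    const std::set<std::string>& declared, const Observations& obs) {
  std::vector<std::string> result;
  for (const std::string& name : declared) {
    auto it = obs.find(name);
    if (it == obs.end() || it->second.empty()) {
      result.push_back(name);
    }
  }
  return result;
}
```

Whether an empty result list should be required (the "user error" reading) or a non-empty one merely logged (the "dropped column is fine" reading) is exactly the policy question debated above; the helper supports either.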

Collaborator Author:

I would also add that this conversation now has almost nothing to do with this pull request, and should probably be moved to a different channel (like a meeting or GitHub issue).

Collaborator:

> But just to be clear, every observation here is for a noisy relation generated in the second part of PCleanSchemaHelper::make_hirm_schema -- those are the only ones that correspond to CSV column names. None of those noisy relations will need to have a hidden latent state. Even if somehow all of their data turned out to be missing from the CSV file, there are no other relations that depend on them.

Could you add a TODO to sample the non-index domains from a CRP prior instead of assuming each entry is a unique entity? (We might want to rethink this more substantially for Model 7, i.e. whether it makes sense to initialize the entities as part of the data ingestion process or elsewhere. Other inferred values, e.g. latent values of a noisy relation, are initialized during incorporate, which might make more sense).
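
For reference, drawing entity assignments from a Chinese restaurant process (CRP) prior can be sketched as below. This is a generic illustration of the suggestion, not code from this PR or from the HIRM library: `sample_crp` and its parameters are hypothetical names, and `alpha` is the usual CRP concentration parameter. Item `i` joins an existing cluster with probability proportional to that cluster's size, or starts a new one with probability proportional to `alpha`.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical sketch: sample cluster (entity) labels for n items from a
// CRP prior with concentration alpha. Labels are 0, 1, 2, ... in order of
// first appearance.
std::vector<std::size_t> sample_crp(std::size_t n, double alpha,
                                    std::mt19937* prng) {
  assert(alpha > 0.0);
  std::vector<std::size_t> assignments;
  // weights[k] is the size of cluster k; the last slot holds alpha and
  // stands for "open a new cluster".
  std::vector<double> weights = {alpha};
  for (std::size_t i = 0; i < n; ++i) {
    std::discrete_distribution<int> pick(weights.begin(), weights.end());
    std::size_t k = static_cast<std::size_t>(pick(*prng));
    if (k + 1 == weights.size()) {
      weights.back() = 1.0;      // The new cluster now has one member.
      weights.push_back(alpha);  // Re-add the "new cluster" slot.
    } else {
      weights[k] += 1.0;
    }
    assignments.push_back(k);
  }
  return assignments;
}
```

With this scheme two rows can share an entity in a non-index domain, unlike the current code, which gives row `i` the id `std::to_string(i)` in every domain.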

Collaborator Author:

Added a TODO to discuss and consider other options.

Collaborator:

> I would also add that this conversation now has almost nothing to do with this pull request, and should probably be moved to a different channel (like a meeting or github issue).

For me, this conversation was very relevant, and essential in allowing me to verify that this PR does the right thing, so I appreciate you taking the time to clarify.

Collaborator:

> But since we don't do that currently (is_observed is marked in load_observations), I've done the next best thing and declared the relations created from the "observe" clause in the schema as is_observed in this pull request.

I was going to file a GitHub issue, but I think we should leave this as-is. Downstream in HIRM !is_observed implies the value is latent and needs to be inferred, so setting is_observed according to the "observe" clause gives the right semantics.

> > I wouldn't expect there to be data for relations not defined by observe, and if a relation defined by observe doesn't have any observations in the data, I think that's a user error. Do you agree?
>
> Not quite. If I have a schema and a csv file, and then I drop a column from the csv file, I wouldn't call it user error to run pclean on the combination, even though there would be an observe relation without any observations. I think it would be quite easy to support that use case, so we probably should (while perhaps agreeing that supporting it isn't the most important priority for a research codebase).

That's fine, I was thinking it made sense to support a more limited notion of valid inputs before potentially supporting something more general, but I agree it doesn't matter much here.

        // TODO(thomaswc): Correctly handle the case when a row makes
        // references to two or more different entities of the same type.
        entities.push_back(std::to_string(i));
      }
      obs[col_name].push_back(std::make_tuple(entities, val));
    }
  }
  return obs;
}
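
The core of `translate_observations` can be exercised without the project headers. The sketch below mirrors its logic under simplifying assumptions: `Observations` stands in for `T_observations`, and a plain map from relation name to domain count replaces `T_schema` and the `std::visit` lookup. `translate_sketch` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <tuple>
#include <vector>

// Simplified stand-ins for the project's types (assumptions for illustration).
using Column = std::vector<std::string>;
using Observation = std::tuple<std::vector<std::string>, std::string>;
using Observations = std::map<std::string, std::vector<Observation>>;

// `arity` maps each relation name to its number of domains.
Observations translate_sketch(const std::map<std::string, Column>& columns,
                              const std::map<std::string, std::size_t>& arity) {
  Observations obs;
  for (const auto& col : columns) {
    auto it = arity.find(col.first);
    if (it == arity.end()) continue;  // Column has no relation: skip it.
    for (std::size_t row = 0; row < col.second.size(); ++row) {
      const std::string& val = col.second[row];
      if (val.empty()) continue;  // Missing value: emit no observation.
      // Every domain of the relation gets the row index as its entity id.
      std::vector<std::string> entities(it->second, std::to_string(row));
      obs[col.first].emplace_back(entities, val);
    }
  }
  return obs;
}
```

Because the entity id is the row index, an observation keeps its original row number even when earlier rows in the same column were skipped as missing.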
16 changes: 16 additions & 0 deletions cxx/pclean/pclean_lib.hh
@@ -0,0 +1,16 @@
// Copyright 2024
// Apache License, Version 2.0, refer to LICENSE.txt

#pragma once

#include "irm.hh"
#include "util_io.hh"
#include "pclean/csv.hh"
#include "pclean/pclean_lib.hh"

// For each non-missing value in the DataFrame df, create an
// observation in the returned T_observations. The column name of the value
// is used as the relation name, and each entity in each domain is given
// its own unique value.
T_observations translate_observations(
    const DataFrame& df, const T_schema &schema);
57 changes: 57 additions & 0 deletions cxx/pclean/pclean_lib_test.cc
@@ -0,0 +1,57 @@
#define BOOST_TEST_MODULE test pclean_csv

#include "pclean/pclean_lib.hh"
#include <sstream>
#include <boost/test/included/unit_test.hpp>
namespace tt = boost::test_tools;

BOOST_AUTO_TEST_CASE(test_translate_observations) {
  std::stringstream ss(R"""(Column1,Room Type,Monthly Rent,County,State
0,studio,,Mahoning County,OH
1,4br,2152.0,,NV
2,1br,1267.0,Gwinnett County,
)""");

  DataFrame df = DataFrame::from_csv(ss);

  std::map<std::string, std::string> state_params = {{"strings", "AL AK AZ AR CA CO CT DE DC FL GA HI ID IL IN IA KS KY LA ME MD MA MI MN MS MO MT NE NV NH NJ NM NY NC ND OH OK OR PA RI SC SD TN TX UT VT VA WA WV WI WY"}};
  std::map<std::string, std::string> br_params = {{"strings", "1br 2br 3br 4br studio"}};

  T_schema schema = {
    {"County:name",
     T_clean_relation{{"County"}, false, DistributionSpec("bigram")}},
    {"County:state",
     T_clean_relation{{"County"}, false, DistributionSpec("stringcat", state_params)}},
    {"Room Type",
     T_clean_relation{{"Obs"}, false, DistributionSpec("stringcat", br_params)}},
    {"Monthly Rent",
     T_clean_relation{{"Obs"}, false, DistributionSpec("normal")}},
    {"County",
Collaborator:

For clarity, could you give the "County" and "State" relations names that disambiguate them from the domains? These names would be output from the schema converter as "Obs:county:name" and "Obs:state:name", right?

Collaborator Author:

I've prepended a "d" to all of the domain names for clarity.

The schema converter would give these relation names of "County" and "State", actually -- for noisy relations coming from an "observe x as y" line, the relation name is always taken as the "y", so that it can match the csv column name.

     T_noisy_relation{{"County", "Obs"}, false, EmissionSpec("bigram"), "County:name"}},
    {"State",
     T_noisy_relation{{"County", "Obs"}, false, EmissionSpec("bigram"), "County:state"}}};
Collaborator:

For some of the T_noisy_relations, is_observed should be true, right?

Collaborator Author:

translate_observations doesn't use that field of the schema at all, so I don't think it matters at all for the purposes of this test.

Collaborator:

I think it matters a little for self-documentation, so I'd prefer to set it to true just for readability/consistency. In the HIRM data-ingesting code, these are set to true after observations of this relation are encountered.

Collaborator Author:

Done.

Collaborator (@emilyfertig, Aug 7, 2024):

Thanks (sorry for the back and forth, there's a lot going on in this model that's subtle but I think we're getting towards a shared understanding).

It looks to me like the HIRM schema in this test is implementing the following PClean-like schema, is that right?

class County
  name ~ bigram
  state ~ stringcat

Obs
  county ~ County
  rent ~ normal
  room ~ stringcat

Is this what schema_helper.make_hirm_schema() would output? If so, it doesn't quite conform to how I understand Model 5 (and I missed this in earlier code reviews). I would expect the HIRM schema from the above PClean schema to look like this:

  T_schema schema = {
    {"Obs:county:name",
      T_clean_relation{{"dCounty"}, false, DistributionSpec("bigram")}},
    {"Obs:county:state",
      T_clean_relation{{"dCounty"}, false, DistributionSpec("stringcat", state_params)}},
    {"Obs:room type",
      T_clean_relation{{"dObs"}, false, DistributionSpec("stringcat", br_params)}},
    {"Obs:monthly rent",
      T_clean_relation{{"dObs"}, false, DistributionSpec("normal")}},
    {"County",
      T_noisy_relation{{"dCounty", "dObs"}, true, EmissionSpec("bigram"), "Obs:county:name"}},
    {"State",
      T_noisy_relation{{"dCounty", "dObs"}, true, EmissionSpec("bigram"), "Obs:county:state"}},
    {"Room Type",
      T_noisy_relation{{"dObs"}, true, EmissionSpec(...), "Obs:room type"}},
    {"Monthly Rent",
      T_noisy_relation{{"dObs"}, true, EmissionSpec(...), "Obs:monthly rent"}},
};

I think this was the source of a lot of my earlier confusion. According to my reading of Model 5, the Obs/Record class should be treated like all of the other latent classes in terms of defining the clean relations. What's special about it is that there's a one-to-one correspondence between its entities and rows of the observed data. The columns of the observed data are all represented by noisy relations, and all of them have "Obs" as an input domain. If you look at the paragraph on page 10 of the Overleaf that begins "Given all this information", I think this is what it implies (in particular, we need an $I_{r_{C.a}}$ for $C = C*$ and the current code omits that).

If you agree, could you make that change in a follow-up PR?

Collaborator:

Actually, thinking about this more, I think my read is wrong and you're right, that the observations that come directly from $C*$ are clean. No changes are needed here or to make_hirm_schema(), and I'm much less confused now.

Collaborator Author:

It's hard to say for sure, since the schema you give doesn't parse. I think you might mean

`class County
name ~ string
state ~ stringcat(strings="...")

class Obs
county ~ County
rent ~ real
room ~ stringcat(strings="...")

observe
county.name as County
county.state as State
room as "Room Type"
rent as "Monthly Rent"
from Obs
`

But anyway, the schema used in pclean_lib_test is not quite what make_hirm_schema would output. Sorry if that caused confusion, but I was optimizing for something simple that would exercise translate_observations.

From the above schema, make_hirm_schema would output four clean_relations named "County:name", "County:state", "Obs:rent" and "Obs:room", and four noisy_relations named "County", "State", "Room Type" and "Monthly Rent". Other than the slight name differences, they are basically the same as you have in your expected output.
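
The naming rule described here can be sketched as a path walk: start at the observed class, follow each reference field to the class it points at, and join the final class with the last attribute. The code below is only an illustration of that description; `ClassRefs` and `base_relation_name` are hypothetical and do not reflect how `PCleanSchemaHelper::make_hirm_schema` is actually implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical representation: refs[cls][field] is the class that the
// reference field `field` of class `cls` points at. Scalar attributes
// are not listed.
using ClassRefs = std::map<std::string, std::map<std::string, std::string>>;

// Name of the clean (base) relation behind an observed path such as
// county.name, starting from class `cls`.
std::string base_relation_name(const ClassRefs& refs, std::string cls,
                               const std::vector<std::string>& path) {
  assert(!path.empty());
  for (std::size_t i = 0; i + 1 < path.size(); ++i) {
    // at() throws std::out_of_range if the class or field is unknown.
    cls = refs.at(cls).at(path[i]);
  }
  return cls + ":" + path.back();
}
```

Under this sketch, `county.name as County` from `Obs` has base relation `County:name`, and `room as "Room Type"` has base relation `Obs:room`; in both cases the noisy relation takes the name after `as`, so that it matches the CSV column.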

Collaborator:

Sorry for writing a non-parsable schema, I just wanted to verify the class definitions, and you're right, I was assuming there was an observation for each attribute.

Thanks for clarifying what make_hirm_schema would output. The Model 5 section of the overleaf seems ambiguous as to whether we should have noisy observations for attributes in $C*$ (as make_hirm_schema implements) or whether $C*$ is assumed to contain clean observations (as in the test you wrote). We should ask them to clarify (and fix it in a follow-up if need be, not this PR).


  T_observations obs = translate_observations(df, schema);

  // Relations not corresponding to columns should be un-observed.
  BOOST_TEST(!obs.contains("County:name"));
  BOOST_TEST(!obs.contains("County:state"));

  BOOST_TEST(obs["Room Type"].size() == 3);
  BOOST_TEST(obs["Monthly Rent"].size() == 2);
  BOOST_TEST(obs["County"].size() == 2);
  BOOST_TEST(obs["State"].size() == 2);

  BOOST_TEST(std::get<0>(obs["Room Type"][0]).size() == 1);
  BOOST_TEST(std::get<1>(obs["Room Type"][0]) == "studio");

  BOOST_TEST(std::get<0>(obs["Monthly Rent"][0]).size() == 1);
  BOOST_TEST(std::get<1>(obs["Monthly Rent"][0]) == "2152.0");

  BOOST_TEST(std::get<0>(obs["County"][0]).size() == 2);
  BOOST_TEST(std::get<1>(obs["County"][0]) == "Mahoning County");

  BOOST_TEST(std::get<0>(obs["State"][0]).size() == 2);
  BOOST_TEST(std::get<1>(obs["State"][0]) == "OH");
}