Load and incorporate observations in pclean main #120

Merged
merged 17 commits into from
Aug 8, 2024
Changes from 3 commits
2 changes: 2 additions & 0 deletions cxx/pclean/BUILD
@@ -59,12 +59,14 @@ cc_binary(
name = "pclean",
srcs = ["pclean.cc"],
deps = [
":csv",
":io",
":schema",
":schema_helper",
"//:cxxopts",
"//:hirm_lib",
"//:inference",
"//:util_io",
],
)

57 changes: 55 additions & 2 deletions cxx/pclean/pclean.cc
@@ -8,12 +8,49 @@
#include <random>

#include "cxxopts.hpp"
#include "irm.hh"
#include "hirm.hh"
#include "inference.hh"
#include "util_io.hh"
#include "pclean/csv.hh"
#include "pclean/io.hh"
#include "pclean/schema.hh"
#include "pclean/schema_helper.hh"

T_observations translate_observations(
const DataFrame& df, const T_schema &schema) {
T_observations obs;

for (const auto& col : df.data) {
const std::string& col_name = col.first;
Collaborator

We generally haven't been using const on local variables. I don't have a strong opinion either way, but I think we should be consistent in using it everywhere, nowhere, or in some places according to rules we agree on (up to now we've defaulted to "nowhere"). WDYT?

The Google style guide says "Using const on local variables is neither encouraged nor discouraged."

Collaborator Author

I don't care much about this, but I do see some other uses of this in the code base, like on lines 267, 306 and 423 of util_io.cc.

And we use const all the time on for-loop variables, and those are local variables too!

I guess my overall opinion is that this might be an area where some inconsistency is fine.

Collaborator (@emilyfertig, Aug 5, 2024)

Inconsistency is fine by me too, I guess I'd just prefer that it feel less arbitrary. Are there any loose guidelines you'd propose?

Edit: I meant "random" not "arbitrary," some arbitrariness is probably inevitable.

Collaborator Author

I guess my suggestion would be to lean towards using const when you can and when it would help.

So for example, if a local variable is only alive for a few lines and it is clear what is being done to it, the const doesn't help very much. But for a medium sized or longer function, annotating something as const can give the reader some valuable info about what is going on.

I guess the other guideline I would suggest is that consistency at the function level is more important than any sort of global consistency. That is: if my function has three variables and I mark two of them as const, then there is a slight presumption that the third non-const variable gets mutated somewhere. So don't do that unless it is.
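
As a rough illustration of the function-level guideline above, here is a hypothetical helper (`count_non_empty` is not code from this PR). Every local that is never reassigned is marked const, so the one non-const local stands out as the thing being mutated:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper, not part of the PR: counts the non-empty cells in a
// column. Every local that is never reassigned is const, so the single
// non-const local (`count`) signals that it is the one being mutated.
std::size_t count_non_empty(const std::vector<std::string>& column) {
  const std::size_t num_rows = column.size();
  std::size_t count = 0;  // Deliberately non-const: updated in the loop below.
  for (std::size_t i = 0; i < num_rows; ++i) {
    const std::string& val = column[i];
    if (!val.empty()) {
      ++count;
    }
  }
  return count;
}
```

If `count` were also marked const here the function would not compile, which is the point: the annotation only carries information when it is applied to everything it can be applied to within the function.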

const T_relation& trel = schema.at(col_name);
size_t num_domains;
std::visit([&](const auto &r) {
num_domains = r.domains.size();
}, trel);

for (size_t i = 0; i < col.second.size(); ++i) {
const std::string& val = col.second[i];
if (val.empty()) {
// Don't incorporate missing values.
// TODO(thomaswc): Allow the user to specify other values that mean
// missing data. ("missing", "NA", "nan", etc.).
continue;
}

std::vector<std::string> entities;
for (size_t j = 0; j < num_domains; ++j) {
// Assume that each row of the dataframe is its own entity, *and*
Collaborator

I don't understand this block of code -- it looks like we're assuming that each entity in each domain has the same (sequential) value for each observation, so the table looks like:

D1 D2 D3 val
 0  0  0   x
 1  1  1   y
 2  2  2   z
...

is that right? For Model 5, don't we want to read these in from the observations?

Also, related to the comment, I think we do want to assume that each row of the observation dataframe is its own entity (and is indexed by a primary key domain, like #138 describes), but we don't necessarily want to assume that all ancestor entities are distinct from those of any other entity (adding the index domain should let us avoid that).
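
To make the table above concrete, here is a minimal standalone sketch of the entity assignment performed by the loop under review. `Observation` and `sequential_entities` are hypothetical stand-ins for an entry of `T_observations` and the body of `translate_observations`; the real types live in the HIRM library and may differ.

```cpp
#include <cstddef>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical stand-in for one entry of T_observations: the entity names
// (one per domain of the relation) paired with the observed value.
using Observation = std::tuple<std::vector<std::string>, std::string>;

// Mirrors the loop under review: row i gets entity "i" in every domain,
// and rows with an empty value are skipped as missing.
std::vector<Observation> sequential_entities(
    const std::vector<std::string>& column, std::size_t num_domains) {
  std::vector<Observation> out;
  for (std::size_t i = 0; i < column.size(); ++i) {
    if (column[i].empty()) {
      continue;  // Missing value: not incorporated.
    }
    const std::vector<std::string> entities(num_domains, std::to_string(i));
    out.push_back(std::make_tuple(entities, column[i]));
  }
  return out;
}
```

For a column holding x, y, z and a relation with three domains, row 1 is assigned the entities ("1", "1", "1") with value y, which is the second row of the table.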

Collaborator Author

It is true that in Model 5 we need to assume which entities exist and what their relationships are. I guess we could create a file with that information, but given that we never know that information "in real life", I think it is better to generate it programmatically. When we get to Model 7, we will need to generate the entity assignments programmatically anyway (and then allow them to make transitions).

Just to make sure I'm being clear, let me give a concrete example: the assets/rents_dirty.csv input file, which looks like:
Column1,Room Type,Monthly Rent,County,State
0,studio,486.0,Mahoning County,OH
1,4br,2152.0,Clark County,NV
2,1br,1267.0,Gwinnett County,GA
...

There are two classes in the model: Obs and County. I agree with you that it is reasonable to assume that each row corresponds to its own Obs entity. (For now! For Model 7, we will want the ability to sometimes say that the model thinks that two rows are duplicates of the same underlying entity; that's one of the cleaning operations we want a PClean-type program to be able to do.) But we aren't given the County entity assignment for each row, and it's not trivial to guess it either, as sometimes either or both of the county name and state fields are missing. Given that, I think that it is reasonable to initialize the County entity assignment to be unique for each row (which is what my code currently does).

Collaborator

I think for C*/the "Record" class, we always want to assume that each row is a separate entity and is in one-to-one correspondence with rows of the observed table. A clearer example might be the schema you've defined in other tests, which has as C*

class Record
  physian ~ Physician
  practice ~ Practice

The entities of Record are in one-to-one correspondence with the observations, but the physicians and practices they refer to are duplicated.
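
One way to picture this is the sketch below. It assumes each referenced entity can be identified by an exact key, which real dirty data does not guarantee, and `EntityInterner` is a name made up for illustration rather than part of the PR or the PClean schema code. Each row would get a fresh Record id, while a referenced class such as Physician reuses the id of any key it has already seen.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical helper: hands out one entity id per distinct key, so repeated
// keys map to the same entity.
class EntityInterner {
 public:
  // Returns the id already assigned to `key`, or assigns the next free id.
  std::size_t id_for(const std::string& key) {
    return ids_.try_emplace(key, ids_.size()).first->second;
  }

  // Number of distinct entities seen so far.
  std::size_t num_entities() const { return ids_.size(); }

 private:
  std::unordered_map<std::string, std::size_t> ids_;
};
```

With one interner per referenced class, two rows that name the same physician share a Physician entity even though their Record entities differ.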

For testing Model 5, I think (though I'm not sure) that we want to assume we know the entities and their relationships "in real life." This seems important for testing that entity clustering is as we expect (we could also hand-build HIRM-like schemas/observations that replicate PClean-style databases, but it would be nice if we could just define PClean-like ones directly). We should probably clarify this with the MIT folks -- I'll post a Slack message.

For sampling, eventually we'll want to sample from a County CRP instead of assuming they're unique, using something like the HIRM sampling method introduced in the 080124-emilyaf-sample-hirm branch (WIP).
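
For reference, a single draw from a Chinese restaurant process can be sketched as below. This is a generic textbook CRP, not the HIRM sampling code from the branch mentioned above; `sample_crp_entity`, `counts`, and `alpha` are names chosen here for illustration.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Generic CRP draw (illustrative; not the HIRM implementation).
// counts[k] is the number of rows already assigned to entity k, and
// alpha > 0 is the concentration parameter. With n the sum of counts, this
// returns an existing index k with probability counts[k] / (n + alpha), or
// counts.size() (meaning "create a new entity") with probability
// alpha / (n + alpha).
std::size_t sample_crp_entity(const std::vector<std::size_t>& counts,
                              double alpha, std::mt19937* prng) {
  std::vector<double> weights(counts.begin(), counts.end());
  weights.push_back(alpha);  // Weight of opening a new entity.
  std::discrete_distribution<std::size_t> dist(weights.begin(), weights.end());
  return dist(*prng);
}
```

Calling this once per row, and incrementing `counts` (or appending a 1 when the returned index equals `counts.size()`), would let several rows share a County entity instead of each row getting its own.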

Collaborator Author

I've posted about it on Slack.

My preference would be to check this code in as-is (or possibly with a TODO or issue to change it depending on the result of the Slack conversation). What do you think?

// that all of its ancestor entities are distinct from those of any
// other entity.
entities.push_back(std::to_string(i));
}
obs[col_name].push_back(std::make_tuple(entities, val));
}
}
return obs;
}

int main(int argc, char** argv) {
cxxopts::Options options(
"pclean", "Run HIRM from a PClean schema");
@@ -48,25 +85,41 @@ int main(int argc, char** argv) {
// Read schema
PCleanSchema pclean_schema;
std::string schema_fn = result["schema"].as<std::string>();
std::cout << "Reading schema file from " << schema_fn << "\n";
if (!read_schema_file(schema_fn, &pclean_schema)) {
std::cout << "Error reading schema file " << schema_fn << "\n";
}

// Translate schema
std::cout << "Translating schema ...\n";
PCleanSchemaHelper schema_helper(pclean_schema);
T_schema hirm_schema = schema_helper.make_hirm_schema();

// Read observations
std::string obs_fn = result["obs"].as<std::string>();
// TODO(thomaswc): This
std::cout << "Reading observations file from " << obs_fn << "\n";
DataFrame df = DataFrame::from_csv(obs_fn);

// Validate that we have a relation for each observation column.
for (const auto &col : df.data) {
if (!hirm_schema.contains(col.first)) {
printf("Error, could not find HIRM relation for column %s\n",
col.first.c_str());
assert(false);
}
}

// Create model
HIRM hirm(hirm_schema, &prng);

// Incorporate observations.
// TODO(thomaswc): This
std::cout << "Incorporating observations ...\n";
T_observations observations = translate_observations(df, hirm_schema);
T_encoding encoding = encode_observations(hirm_schema, observations);
incorporate_observations(&prng, &hirm, encoding, observations);

// Run inference
std::cout << "Running inference ...\n";
inference_hirm(&prng, &hirm,
result["iters"].as<int>(),
result["timeout"].as<int>(),