Population variables lose case distinctions #546

riastradh-probcomp · 2017-05-02T21:02:58Z

This isn't itself quite a bug, but downstream things sometimes expect the case to match from queries, where sqlite3 preserves the original case of the variable in the underlying table, and from the results of bayesdb_variable_names.

We should fix everything downstream to do case-insensitive comparisons (e.g.: probcomp/iventure#38), but until then, we can work around those mistakes by using the original case in the bayesdb_variable table. Won't help with existing bdb files, but will suppress downstream mistakes for any new bdb files.

The text was updated successfully, but these errors were encountered:

Github issue #546

fsaad · 2017-06-02T15:10:37Z

Just hit this issue as a bug proper.

        bdb.execute('''
            CREATE POPULATION satellites_p for satellites_t WITH SCHEMA (
                GUESS STATTYPES FOR (*)
            )
        ''')
        bdb.execute('''
            CREATE METAMODEL satellites_m FOR satellites_p
                    WITH BASELINE crosscat (
                OVERRIDE GENERATIVE MODEL FOR
                    anticipated_lifetime
                GIVEN
                    class_of_orbit,
                    perigee_km,
                    apogee_km,
                    period_minutes,
                    date_of_launch,
                    launch_mass_kg,
                    dry_mass_kg
                USING ordinary_least_squares
            )
        ''')
bayeslite.exception.BQLError: Unmodellable variables:
set(['date_of_launch', 'apogee_km', 'class_of_orbit',
    'launch_mass_kg', 'period_minutes', 'dry_mass_kg', 'perigee_km'
])

And the failure happens here:
https://github.com/probcomp/bayeslite/blob/master/src/metamodels/cgpm_metamodel.py#L1378-L1382
where printing needed and modeled gives:

modeled = set([u'Power_watts', u'longitude_radians_of_geo', u'Contractor',
    u'Period_minutes', u'Country_of_Operator', u'Launch_Vehicle', u'Users',
    u'Apogee_km', u'Operator_Owner', 'anticipated_lifetime',
    u'Anticipated_Lifetime', u'Purpose', u'Dry_Mass_kg', u'Launch_Site',
    u'Perigee_km', u'Date_of_Launch', u'Class_of_Orbit', u'Eccentricity',
    u'Inclination_radians', u'Type_of_Orbit', u'Source_Used_for_Orbital_Data',
    u'Launch_Mass_kg', u'Country_of_Contractor'
])

needed = set(['date_of_launch', 'apogee_km', 'class_of_orbit',
    'launch_mass_kg', 'period_minutes', 'dry_mass_kg', 'perigee_km'
])

Also notice duplicate of anticipated_lifetime and Anticipated_lifetime (the regression variable). Not sure why this bug has not manifested before.

fsaad · 2017-06-02T15:17:48Z

Actually, what broke this behavior was @riastradh-probcomp's fix d7c5b3f to this issue, which appears to have caused the regression.

Update: git revert d7c5b3f indeed fixed the issue with Latent variables reported above. Surprised why nothing in the test suite failed after the commit in question (probably because all cases were lower uniformly).

fsaad · 2017-06-05T17:27:08Z

@riastradh-probcomp I'm going to revert this commit --- it has completely broken the create schema logic in https://github.com/probcomp/bayeslite/blob/master/src/metamodels/cgpm_metamodel.py#L1043, and refactoring that complex state machine to is likely to be more effort and introduce more bugs than updating downstream clients to either use the variable name from the base table and/or do case insensitive comparisons.

riastradh-probcomp · 2017-06-06T19:06:37Z

This state machine already appears to be problematic from a cursory glance. But it doesn't look like it needs refactoring -- it looks like it just needs a few casefolds inserted judiciously: replace clause.var(s) by casefold(clause.var)(s), and casefold the output of bayesdb_variable_names.

riastradh-probcomp · 2017-06-06T19:12:45Z

Here's an example I expect to fail with the state machine as is:

create table t(x, y);
create population p for t;
create metamodel m for p with baseline crosscat(
    latent Z categorical;
    override model for y, z given x using piecewise
)

jar600 · 2017-08-29T19:29:25Z

I'm running against this problem too. I want to do a JOIN of a dependence probability table and a table read from a CSV file, and the cases aren't matching. If there's some reason to keep this (mis)feature, maybe it could be a CREATE POPULATION option, so you could decide at that point whether you want folding or not?

riastradh-probcomp · 2017-08-29T19:40:15Z

Date: Tue, 29 Aug 2017 19:29:28 +0000 (UTC) From: Jonathan Rees probably ***@***.***> I'm running against this problem too. I want to do a JOIN of a dependence probability table and a table read from a CSV file, and the cases aren't matching. If there's some reason to keep this (mis)feature, maybe it could be a CREATE POPULATION option, so you could decide at that point whether you want folding or not?

I don't think adding more configuration knobs is a good idea. It seems to me: (a) The table should retain the case of the CREATE TABLE columns. (b) The population should retain the case of the CREATE POPULATION variables. (c) Everything in bayeslite should be identified case-insensitively. Since the implementation of JOIN is deferred to sqlite3, I am surprised that you are hitting anything about mismatched cases, because all names in sqlite3 are identified case-insensitively. What reproduces the symptom you're referring to?

jar600 · 2017-08-29T19:50:14Z

I think I made an assumption, because it just didn't occur to me that sqlite3 would do case insensitive comparison. I was seeing a very slow JOIN and thought this was a likely cause. But if what you say it true, then it has a different cause.

Is = case insensitive as well?

riastradh-probcomp · 2017-08-29T20:04:44Z

Date: Tue, 29 Aug 2017 12:50:17 -0700 From: Jonathan Rees probably ***@***.***> I think I made an assumption, because it just didn't occur to me that sqlite3 would do case insensitive comparison. I was seeing a very slow JOIN and thought this was a likely cause. But if what you say it true, then it has a different cause. Is = case insensitive as well?

On columns declared COLLATE NOCASE, yes. The semantics is described at <https://www.sqlite.org/datatype3.html#collation>. The default is COLLATE BINARY, which is case-sensitive. In the bayeslite metadata tables, e.g. bayesdb_population, names (population names, variable names, &c.) are stored with COLLATE NOCASE -- see src/schema.py for details.

riastradh-probcomp added a commit that referenced this issue May 2, 2017

Test variable case preservation -- Github issue #546.

bad8042

riastradh-probcomp added a commit that referenced this issue May 2, 2017

Preserve table column case in population variables.

d7c5b3f

Github issue #546

fsaad mentioned this issue May 6, 2018

Remove unneeded casefolds in LoomBackend.simulate_joint #618

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Population variables lose case distinctions #546

Population variables lose case distinctions #546

riastradh-probcomp commented May 2, 2017

fsaad commented Jun 2, 2017 •

edited

Loading

fsaad commented Jun 2, 2017 •

edited

Loading

fsaad commented Jun 5, 2017

riastradh-probcomp commented Jun 6, 2017

riastradh-probcomp commented Jun 6, 2017

jar600 commented Aug 29, 2017

riastradh-probcomp commented Aug 29, 2017 via email

jar600 commented Aug 29, 2017

riastradh-probcomp commented Aug 29, 2017 via email

Population variables lose case distinctions #546

Population variables lose case distinctions #546

Comments

riastradh-probcomp commented May 2, 2017

fsaad commented Jun 2, 2017 • edited Loading

fsaad commented Jun 2, 2017 • edited Loading

fsaad commented Jun 5, 2017

riastradh-probcomp commented Jun 6, 2017

riastradh-probcomp commented Jun 6, 2017

jar600 commented Aug 29, 2017

riastradh-probcomp commented Aug 29, 2017 via email

jar600 commented Aug 29, 2017

riastradh-probcomp commented Aug 29, 2017 via email

fsaad commented Jun 2, 2017 •

edited

Loading

fsaad commented Jun 2, 2017 •

edited

Loading