Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Overload == and isequal for checking equality of Terms and Schemas #174

Open
wants to merge 7 commits into
base: master
Choose a base branch
from
Open

Conversation

ghost
Copy link

@ghost ghost commented Feb 22, 2020

This PR aims to add functionality to compare two schema using == operator and isequal function. There is no difference between the two methods except that of the treatment of floating point numbers by == and isequal. Two schemas are equal if they are comprised by the same terms. All the terms to the two schema should be common. Every term in the first schema should have an equivalent term in the second schema.

  • Constant terms are equal if the same is number represented by them.
  • Formula terms are equal if the two terms have same LHS components and same RHS components respectively.
  • Function terms are equal if the same function applied is to both and same arguments are passed to both.
  • Interaction terms are equal if the same terms are present in the interaction for both terms.
  • Intercept terms are equal if they have the same width.
  • Continuous terms are equal if all invariants are equal.
  • Categorical terms are equal if the width and contrast matrices of the terms are equal respectively.
  • Terms with differently named variables are always unequal.

A special case may arise where the currently proposed implementation does not function properly. Example:-
df1 = (y = y, a = 1:9, b = b, c = repeat(["d", "e", "f"], 3))
df2 = (y = [df1.y; df1.y], a = [1:9; 1:9], b = [b; b], c = [df1.c; df1.c])
are unequal because the variance of the Continuous terms varies inversely with number of data points.

This PR will close #173.

@ghost ghost requested a review from kleinschmidt February 22, 2020 10:11
Copy link
Member

@kleinschmidt kleinschmidt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks pretty good but needs some work. At a high level, could you say more about what the desired behavior is in the summary of the pull request at the top? When ARE two terms equal? When are two schema equal? Are there any potentially surprising situations where two things would NOT be equal? Is the behavior of isequal and == different?

Also, there are a lot of changes here that don't have anything to do with checking equality, both whitespace/formatting, but also some functional stuff (at least one which breaks an existing test). Can you remove all the changes that aren't directly required to implement the proposed behavior for == and isequal? And make sure that the existing tests all run. Best to check that locally but the travis CI checks will also tell you that (you can view the builds on the

src/terms.jl Outdated Show resolved Hide resolved
test/schema.jl Show resolved Hide resolved
@ghost
Copy link
Author

ghost commented Feb 29, 2020

Hi, Thanks for reviewing. I'll do the changes and add a summary of expected behaviour.

@codecov-io
Copy link

codecov-io commented Mar 9, 2020

Codecov Report

Merging #174 into master will increase coverage by 0.4%.
The diff coverage is 90%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master     #174     +/-   ##
=========================================
+ Coverage   86.47%   86.87%   +0.4%     
=========================================
  Files           9        9             
  Lines         510      564     +54     
=========================================
+ Hits          441      490     +49     
- Misses         69       74      +5
Impacted Files Coverage Δ
src/schema.jl 84% <100%> (+2.01%) ⬆️
src/terms.jl 87.21% <81.25%> (-0.82%) ⬇️
src/contrasts.jl 87.95% <0%> (+0.45%) ⬆️
src/temporal_terms.jl 84% <0%> (+4%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 749fdcd...ee16796. Read the comment docs.

Copy link
Member

@kleinschmidt kleinschmidt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking very close! I think the last thing (which I didn't think of before) is that there aren't tests for some of the added == methods, in particular the FunctionTerm and MatrixTerms.

src/schema.jl Outdated Show resolved Hide resolved
src/schema.jl Outdated Show resolved Hide resolved
src/terms.jl Outdated Show resolved Hide resolved
test/schema.jl Outdated Show resolved Hide resolved
test/schema.jl Outdated Show resolved Hide resolved
test/schema.jl Show resolved Hide resolved
src/terms.jl Outdated Show resolved Hide resolved
src/terms.jl Outdated Show resolved Hide resolved
src/terms.jl Outdated Show resolved Hide resolved
src/terms.jl Outdated Show resolved Hide resolved
@ghost
Copy link
Author

ghost commented Mar 16, 2020

I have fixed all things you asked except adding tests for Matrix terms. I do not know how to create a matrix term or how to include one in a formula. Can you please give an example of one such term ?

@ghost ghost requested a review from kleinschmidt March 16, 2020 08:15
Co-Authored-By: Dave Kleinschmidt <[email protected]>
@kleinschmidt
Copy link
Member

They come from doing apply_schema, which creates them on the RHS of the formula.

@kleinschmidt
Copy link
Member

New tests still not passing.

@ghost
Copy link
Author

ghost commented Mar 31, 2020

Hi! I have fixed the broken tests. The latest build is now passing.

@test schema(df) != schema(df7)
@test schema(df) != schema(df8)
@test schema(df8) != schema(df)
@test apply_schema(f, schema(df)) == apply_schema(f, schema(df5))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First I was confused because schema(df) != schema(df5), so the result of applying them to the same formula shouldn't be equal. But then I realized that f has already had a schema applied to it above, so this isn't testing what I thought it was testing. Which makes me wonder, what is the purpose of this test? apply_schema(@formula(y ~ 1 + a + log(b) + c + b&c), schema(df5)) would produce an error because y isn't found in df5. But since you've already applied a schema with y in it above, you're actually just re-applying the schema terms that ARE found. I'm not sure that's a particularly useful test, unless you were specifically trying to test this edge case... so can you explain what the goal of this test is?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Overload == and isequal to compare schemas
3 participants