Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug: t1.count() + t2.count() only uses t1 #7655

Closed
1 task done
NickCrews opened this issue Dec 4, 2023 · 11 comments
Closed
1 task done

bug: t1.count() + t2.count() only uses t1 #7655

NickCrews opened this issue Dec 4, 2023 · 11 comments
Labels
bug Incorrect behavior inside of ibis

Comments

@NickCrews
Copy link
Contributor

NickCrews commented Dec 4, 2023

What happened?

import ibis

ibis.options.interactive = True

good_islands = ibis.memtable({"island": ["Torgersen", "Biscoe"]})
t = ibis.examples.penguins.fetch()
t.island.value_counts()
# ┏━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
# ┃ island    ┃ island_count ┃
# ┡━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
# │ string    │ int64        │
# ├───────────┼──────────────┤
# │ Dream     │          124 │
# │ Torgersen │           52 │
# │ Biscoe    │          168 │
# └───────────┴──────────────┘
t.count()
# 344


good = t.join(good_islands, "island", how="left_semi")
bad = t.join(good_islands, "island", how="anti")

good.count()
# 220 as expected

bad.count()
# 124 as expected

union = good.count() + bad.count()
union
# 440, not 344 as expected

ibis.show_sql(union)
# SELECT
#   COUNT(*) + COUNT(*) AS "Add(CountStar(), CountStar())"
# FROM (
#   SELECT
#     t0.species AS species,
#     t0.island AS island,
#     t0.bill_length_mm AS bill_length_mm,
#     t0.bill_depth_mm AS bill_depth_mm,
#     t0.flipper_length_mm AS flipper_length_mm,
#     t0.body_mass_g AS body_mass_g,
#     t0.sex AS sex,
#     t0.year AS year
#   FROM main.penguins AS t0
#   WHERE
#     EXISTS(
#       SELECT
#         1
#       FROM _ibis_pandas_memtable_jg47nwdavjft7diic45wxhxpxa AS t1
#       WHERE
#         t0.island = t1.island
#     )
# ) AS anon_1

If I use two unrelated tables:

t1 = ibis.examples.Cushings.fetch()
t2 = ibis.examples.penguins.fetch()
t1.count() + t2.count()

then I get RelationError: Selection expressions don't fully originate from dependencies of the table expression. instead. I would expect this to work.

What version of ibis are you using?

7.1.0

What backend(s) are you using, if any?

NA

Relevant log output

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
@NickCrews NickCrews added the bug Incorrect behavior inside of ibis label Dec 4, 2023
@NickCrews NickCrews changed the title bug: t1.count() + t2.count() uses values from t1 bug: t1.count() + t2.count() only uses t1 Dec 4, 2023
@cpcloud
Copy link
Member

cpcloud commented Dec 6, 2023

What would be the equivalent SQL for t1.count() + t2.count()?

@NickCrews
Copy link
Contributor Author

NickCrews commented Dec 6, 2023

This does actually seem to work in shell.duckdb.org to add scalar to scalar:

SELECT
   (SELECT count(*) FROM 'https://shell.duckdb.org/data/tpch/0_01/parquet/lineitem.parquet') +
   (SELECT count(*) FROM 'https://shell.duckdb.org/data/tpch/0_01/parquet/customer.parquet');

or to add a scalar with a vector:

SELECT
  (SELECT count(*) FROM 'https://shell.duckdb.org/data/tpch/0_01/parquet/lineitem.parquet') + c_custkey
FROM 'https://shell.duckdb.org/data/tpch/0_01/parquet/customer.parquet';

But perhaps this doesn't work in all backends?

@cpcloud
Copy link
Member

cpcloud commented Jun 27, 2024

This is now achievable with our new-in-9.0 as_scalar() method, used here to explicitly construct a subquery:

In [28]: from ibis.interactive import *
    ...:
    ...: t1 = ibis.examples.Cushings.fetch()
    ...: t2 = ibis.examples.penguins.fetch()
    ...: expr = t1.count().as_scalar() + t2.count().as_scalar()

In [29]: expr
Out[29]:
┌───────────────┐
│ np.int64(371) │
└───────────────┘

In [30]: ibis.to_sql(expr, dialect="duckdb")
Out[30]:
SELECT
  (
    SELECT
      COUNT(*) AS "CountStar(Cushings)"
    FROM "Cushings" AS "t0"
  ) + (
    SELECT
      COUNT(*) AS "CountStar(penguins)"
    FROM "penguins" AS "t1"
  ) AS "Add(ScalarSubquery(), ScalarSubquery())"

The naked t1.count() + t2.count() does have a bizarre failure mode, so I'm looking into that, but this issue can be closed.

@cpcloud cpcloud closed this as completed Jun 27, 2024
@NickCrews
Copy link
Contributor Author

@cpcloud isn't .count() always a scalar? I definitely assumed it was, I think (no evidence) that others will think the same thing, especially if coming from pandas/python world, where semantically thye think t.count() ~= len(t). So could we make it so that at the end of def count(self): we finish with a .as_scalar(), so both of my examples just work out of the box?

@NickCrews NickCrews reopened this Jun 27, 2024
@cpcloud
Copy link
Member

cpcloud commented Jun 27, 2024

count() is still tied to a table, and Ibis can't easily determine the intent with such an expression (using it combination with other essentially random unrelated expressions).

@cpcloud
Copy link
Member

cpcloud commented Jun 27, 2024

This issue also isn't super well motivated.

Is there some concrete use case here?

@cpcloud
Copy link
Member

cpcloud commented Jun 27, 2024

Also your examples using sql are exactly what calling as_scalar() will produce.

@cpcloud
Copy link
Member

cpcloud commented Jun 27, 2024

I think this is best we can do for now without a more strongly motivated use case or reasons why combining two scalar expressions in arbitrary ways must "just work".

It might help to list out the ways in which scalars can currently be used and then define semantics for those operations.

@NickCrews
Copy link
Contributor Author

Not at a computer right now to test, but one example of a scalar from one table needing to interact with another table is a "manual autoincrement": ibis.union(t_new.mutate(id=ibis.row_number() + t_old.id.max()), t_old)

@NickCrews
Copy link
Contributor Author

In SQL, is my assumption correct that you can combine two scalar expressions in arbitrary ways?

@ncclementi
Copy link
Contributor

Closing since this now works when doing.

In [8]: union = good.count().as_scalar() + bad.count().as_scalar()

In [9]: union
Out[9]: 
┌─────┐
│ 344 │
└─────┘

xref: #10124 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Incorrect behavior inside of ibis
Projects
Archived in project
Development

No branches or pull requests

3 participants