Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BigQuery driver seems does not cast all numerical values equally to strings #1835

Closed
philippefutureboy opened this issue Jan 20, 2021 · 3 comments
Assignees
Labels
bug Something isn't working driver:bigquery Issues related to the BigQuery driver Roadmap: 2021 Q1 Cube.js team roadmap for Q1 of 2021.

Comments

@philippefutureboy
Copy link
Contributor

philippefutureboy commented Jan 20, 2021

EDIT: I realize that I am two minor versions behind, so I will update to 0.25 to check if the issue is solved
EDIT 2: I confirm the bug is still present on version 0.25.21.

Describe the bug

Six days ago, we upgraded our version of the @cubejs-backend/bigquery-driver from "0.23.8" to "0.23.14" and yesterday we stumbled on a type cast error for the results of the query which caused our following code to fail due to a type assumption.

Following a discussion on slack with @paveltiunov, turns out that Cube is transitioning towards returning all types as string for simplicity and for supporting large numbers.

The problem here is twofolds:

  1. This is a fairly major change which can introduce breakage in production systems, and thus should at least be a minor patch, and I'd even go as far as say as part of a major patch. If the team chooses to use a different approach to versioning than the standard semver, it would be important to express it somewhere that cannot be missed.
  2. The feature does not seem to be working reliably; two sum typed measures resulted in different types - one a string and one a float.

To Reproduce
Steps to reproduce the behavior:

Given a BigQuery view with two sum typed measures:

SELECT 
  `schedule_track`.clinic_key `schedule_track__clinic_key`, 
  CAST(`calendar`.date_key AS TIMESTAMP) `calendar__date_key`, 
  sum(
    `schedule_track_facts`.amount_cad
  ) `schedule_track_facts__charges_amount`, 
  sum(
    `schedule_track_objective_calendar`.charge_per_open_time_slot_hour_objective * `schedule_track_objective_calendar`._open_time_slot_hours_value
  ) `schedule_track_objective_calendar__charges_amount_objective` 
FROM ...

the result returned is like the following:

{
      "schedule_track.clinic_key": "cli91dcb088415e4d2d8bd058655d2",
      "calendar.date_key": "2021-01-06T00:00:00.000",
      "schedule_track_facts.charges_amount": "8796.5",
      "schedule_track_objective_calendar.charges_amount_objective": 8534.475
},

where "schedule_track_facts.charges_amount" should be a float or "schedule_track_objective_calendar.charges_amount_objective" a string, but not both different types.

Schema:

cube(`schedule_track_facts`, {
  sql: `
    SELECT
        schedule_track_calendar_view.clinic_key,
        schedule_track_calendar_view.date_key,
        schedule_track_calendar_view.schedule_track_key,
        schedule_track_facts_keyview.* EXCEPT(clinic_key, date_key, schedule_track_key)
    FROM
        \`${project}\`.${dataset_key}.schedule_track_calendar_view
        LEFT JOIN
        \`${project}\`.${dataset_key}.schedule_track_facts_keyview
            ON schedule_track_calendar_view.clinic_key = schedule_track_facts_keyview.clinic_key
            AND schedule_track_calendar_view.date_key = schedule_track_facts_keyview.date_key
            AND schedule_track_calendar_view.schedule_track_key = schedule_track_facts_keyview.schedule_track_key
  `,
  // ...
  measures: {
    // ...
    charges_amount: {
      sql: `${CUBE}.amount_cad`,
      type: `sum`,
    },
  },
})

Expected behavior

All FLOAT64, INT64, NUMERIC types should be returned as the same type in the resulting records, be it string or number.
It is very possible that the problem is that amount_cad is NUMERIC NULLABLE whereas

`schedule_track_objective_calendar`.charge_per_open_time_slot_hour_objective * `schedule_track_objective_calendar`._open_time_slot_hours_value

may be implicitly returned as FLOAT64 by the BQ engine due to _open_time_slot_hours_value being implicitly FLOAT NULLABLE.

Learning on our side

Moving forward we will cast explicitly all of our types coming out of Cube.js (or any other service) to the expected type so that we can handle them properly in pandas, and catch unexpected type conversions earlier in the code.

@philippefutureboy
Copy link
Contributor Author

philippefutureboy commented Jan 21, 2021

Given that you are planning to move towards standardization of your data exports as Map<string, string>, could you also consider adding the complete type information for the returned data?
Right now in the annotation.measures[key].type and annotation.dimensions[key].type you seem to have string, number and time, which leave some precious information out if I want to dynamically cast that data. Could you add a property like native_type which would include the type of the field from the driver itself? Then I could parse a float differently than an int, than a bool, etc.

@hassankhan
Copy link
Contributor

Hi @philippefutureboy 👋

I've created a "parent" issue to track this and other similar issues. Thank you for filing the issue 🙌

@hassankhan hassankhan added the bug Something isn't working label Jan 25, 2021
@keydunov keydunov added the Roadmap: 2021 Q1 Cube.js team roadmap for Q1 of 2021. label Mar 4, 2021
@hassankhan hassankhan added the driver:bigquery Issues related to the BigQuery driver label Apr 26, 2021
@igorlukanin
Copy link
Member

Hey everyone! I was able to confirm that this behavior is still something that is present in Cube as of version v0.36.2. However, I believe that it has since become a known issue rather than a bug. Also, the castNumerics option really helps to mitigate this in many cases:

Thus, I've created a separate issue to change this in the future: #8741

With that, I think it's fine to close this issue. Please feel free to reopen if needed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working driver:bigquery Issues related to the BigQuery driver Roadmap: 2021 Q1 Cube.js team roadmap for Q1 of 2021.
Projects
None yet
Development

No branches or pull requests

5 participants