
support for standard aggregate methods (sum/avg/min/max) #393

Open
sam0x17 opened this issue Apr 14, 2020 · 19 comments

sam0x17 commented Apr 14, 2020

I have a monkey patch I've been using locally for this for a while. The tricky part is handling types correctly, as Executor::Value and Executor::MultiValue are tricky to work with. In particular, this has to work with timestamps as well as with numeric columns.

The different aggregate functions have some special caveats when it comes to types:

  • avg can return a float when the column is an int type
  • avg can return nil when there are no records that match the query
  • min and max can return nil when there are no records that match the query
  • sum defaults to 0 in the appropriate type when there are no records
  • min and max can apply to timestamps as well as to float and int columns
  • Int32 vs Int64 and Float32 vs Float64 create some issues -- I try to stick with the larger type to be safe -- we might have to be more specific with this to handle things like BIGINT

Because of all the type constraints, this may have to have type-specific versions, e.g. min for ints, min_f for floats, min_t for time, etc., and I'd be fine with that.

In Rails these are called minimum / maximum / average / sum.

My expectation would be the ability to do things like:

Widget.where(:created_at, :gt, Time.now - 10.days).avg(:units_sold) # => Float64?
Widget.min(:created_at) # => Time?
Product.where(cool: true).sum(:coolness) # => Int64 (defaults to 0)
Comment.where(deleted_at: nil).max(:downvotes) # => Int64?

Possibly we would provide a second argument allowing one to specify a default value, e.g.:
User.minimum(:created_at, Time.utc) # => Time (defaults to Time.utc if no records found)

I'm happy to submit a PR for this, just wanted to collect some feedback first.

@sam0x17 sam0x17 changed the title support for standard aggregate methods (sum / avg / min / max) support for standard aggregate methods (sum/avg/min/max) Apr 14, 2020

sam0x17 commented Apr 14, 2020

This is the current monkey patch I am using. Note I had to change the base Value(Model, Scalar) class to accept nil as a default value.

module Granite::Query::Executor
  class Value(Model, Scalar)
    include Shared

    def initialize(@sql : String, @args = [] of Granite::Columns::Type, @default : Scalar = nil)
    end

    def run : Scalar
      log @sql, @args
      # db.scalar raises when a query returns 0 results, so I'm using query_one?
      # https://github.com/crystal-lang/crystal-db/blob/7d30e9f50e478cb6404d16d2ce91e639b6f9c476/src/db/statement.cr#L18

      # raise "No default provided" if @default.nil?

      Model.adapter.open do |db|
        db.query_one?(@sql, args: @args, as: Scalar) || @default #.not_nil!
      end
    end

    delegate :<, :>, :<=, :>=, to: :run
    delegate :to_i, :to_s, :to_t, to: :run
  end
end

macro define_granite_aggregate(name, aggregate, type_union, inner_type_union, default=nil, resolver=nil)
  abstract class Granite::Query::Assembler::Base(Model)
    def {{name.id}}(column : String | Symbol) : Executor::Value(Model, {{inner_type_union}})
      sql = build_sql do |s|
        s << "SELECT {{aggregate.id}}(#{column})"
        s << "FROM #{table_name}"
        s << where
        s << group_by
        s << order(use_default_order: false)
        s << limit
        s << offset
      end
  
      #if group_by
      #  Executor::MultiValue(Model, {{type_union}}).new sql, numbered_parameters, default: {{default}}
      #else
        Executor::Value(Model, {{inner_type_union}}).new sql, numbered_parameters, default: {{default}}
      #end
    end
  end
  
  abstract class Granite::Query::Builder(Model)
    def {{name.id}}(column : Symbol | String, default : Time | Int64 | Int32 | Float64 | Nil = nil) : {{type_union}}
      val = assembler.{{name.id}}(column).run
      return default if val.nil?
      {% if resolver %}
      val = val.{{resolver.id}}
      {% end %}
      val
    end
  end
end

define_granite_aggregate(:min_t, :MIN, Time?, Time?)
define_granite_aggregate(:max_t, :MAX, Time?, Time?)
define_granite_aggregate(:sum, :SUM, Int64, Int64 | Int32, 0_i64, :to_i64)
define_granite_aggregate(:min, :MIN, Int64?, Int64 | Int32 | Nil, nil, :to_i64)
define_granite_aggregate(:max, :MAX, Int64?, Int64 | Int32 | Nil, nil, :to_i64)
define_granite_aggregate(:avg, :AVG, Float64?, Float64 | PG::Numeric | Nil, nil, :to_f64)
define_granite_aggregate(:sum_f, :SUM, Float64, Float64 | PG::Numeric, 0_f64, :to_f64)
define_granite_aggregate(:min_f, :MIN, Float64?, Float64 | PG::Numeric | Nil, nil, :to_f64)
define_granite_aggregate(:max_f, :MAX, Float64?, Float64 | PG::Numeric | Nil, nil, :to_f64)


sam0x17 commented Apr 14, 2020

Note that this could be extended to work with GROUP BY, and that I'm using PG::Numeric where I should be using something more generic.

@robacarp (Member) commented:

This may be a situation where the different drivers have different offerings. For example, I know the way that postgres does group by statements is different than the way mysql does it, and that affects the way the select is built.

I like the idea of adding these more advanced aggregates, but I think that historically Granite has been careful about maintaining the top level DSL to be something uniform across the drivers. If people want the full flexibility of their database they have to drop down to a lower level.


sam0x17 commented Apr 15, 2020

Yeah, on that note I've been thinking of taking my ideas and making my own ORM.

@damianham (Contributor) commented:

Hey Sam, just a thought: what about a higher-level ORM that delegates everything except the aggregate methods to one of the supported ORMs? I'm not sure how that would be constructed at the moment, but it just struck me that a higher-level ORM providing all kinds of additional functionality, like these aggregate methods, on top of the existing ORMs might be a good idea.

@drujensen (Member) commented:

The way I envisioned people using Granite to handle aggregate queries is to build a new read-only model for each query. Maybe this is too laborious a solution.

class UserAgeByCountry < Granite::Base
  select "SELECT country, count(age) AS cnt_age, min(age) AS min_age, max(age) AS max_age, avg(age) AS avg_age, sum(age) AS sum_age FROM users GROUP BY country"

  column country : String
  column cnt_age : Int64
  column min_age : Int64
  column max_age : Int64
  column avg_age : Float64
  column sum_age : Int64
end
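Presumably, usage of such a read-only model would then look like an ordinary query (a sketch, untested; it assumes the custom select aliases its result columns to match the `column` names declared on the model):

```crystal
# Each row of the grouped result maps onto one UserAgeByCountry instance,
# so the aggregates are read like normal typed columns.
UserAgeByCountry.all.each do |row|
  puts "#{row.country}: #{row.cnt_age} users, avg age #{row.avg_age}"
end
```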


drujensen commented Apr 15, 2020

What if we created new column types for each aggregate, each taking a type T?

class UserAgeByCountry < Granite::Base
  table :users

  column country : String, group_by: true
  column cnt_age : Count(:age, Int64)
  column min_age : Min(:age, Int64)
  column max_age : Max(:age, Int64)
  column avg_age : Avg(:age, Int64)
  column sum_age : Sum(:age, Int64)
end


sam0x17 commented Apr 15, 2020

@drujensen I noticed this when I went looking around for how to do this. For my purposes this would be super impractical as there are literally hundreds of places I need to use aggregate methods (e.g. for the stats page in an admin control panel), and usually I am using complex where/scope chains, and then slapping an aggregate onto the end of those. If I had to create custom view models for each of these scoping situations, I would have 10-20 of them for each model.


drujensen commented Apr 15, 2020

Thanks @Blacksmoke16 and @sam0x17 for the feedback. I am torn on whether an ORM should return arbitrary query results or whether that should be a separate product, since it's no longer mapping results to an object. What this PR is doing is creating a DSL for SQL and expecting an array of named tuples with no type checking, etc. It's not leveraging the mapping at all. However, I realize this is a much-needed feature and see other ORMs doing similar solutions.


Blacksmoke16 commented Apr 15, 2020

If anything you could do something like (untested):

class User < Granite::Base
  ...

  def self.max_age : Int32
    self.scalar("SELECT MAX(age) FROM #{self.table_name}")
  end
end

Or you could really generalize it; then you have context, which would help with how to set things up.
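One hedged way to generalize that idea, reusing the `Model.adapter.open` / `query_one?` pattern from the monkey patch earlier in the thread (a hypothetical helper, untested; the function and column names are interpolated directly into SQL, so they must never come from user input):

```crystal
class User < Granite::Base
  # ...

  # Run an aggregate function over an integer column.
  # Returns nil when no rows match (e.g. MAX over an empty table).
  def self.aggregate_i64(function : String, column : String) : Int64?
    sql = "SELECT #{function}(#{column}) FROM #{self.table_name}"
    self.adapter.open do |db|
      db.query_one?(sql, as: Int64?)
    end
  end
end

# User.aggregate_i64("MAX", "age")
```

The trade-off noted below still applies: built this way, the helper cannot pick up a where/group_by/order chain from the query builder.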


drujensen commented Apr 15, 2020

@Blacksmoke16 you wouldn't be able to append a where, group_by, limit, or order_by to that.


drujensen commented Apr 15, 2020

The Lucky Framework separates the table mapping and the query into two separate objects. Maybe that is something to consider architecturally instead of combining the two concerns into the same object. wdyt?

@damianham (Contributor) commented:

Could this problem be solved with virtual models? I am not sure if this would work, but if it did it could be very useful. For instance, User.all gives you an array of User records, and in the view you can reference a property of the model such as user.email. However, if you use #map to modify the data sent to the web client, you get an array of NamedTuples, and user.email gives 'undefined method'. So what would be cool would be virtual models that provide a view on the underlying model with a different field set (and thus property methods) as well as additional methods such as aggregators. Any method not defined in the virtual model would delegate to the underlying model.

I don't want to have virtual model classes, but rather to be able to create them in a single line of code, e.g.:

users = User.all.map{ |user| VirtualModel(User.class, properties: [name : String, email : String, created_at : DateTime]).new }

then users.to_json only gives name, email and created_at in the result and this works for both json results and also for ecr/slang templates.

For aggregation, how about:

stats = VirtualModel.all(User.class,
  properties: [
    country : String,
    cnt_age : Int64,
    min_age : Int64,
    max_age : Int64,
    avg_age : Int64,
    sum_age : Int64,
  ],
  query: "SELECT country, count(age) cnt_age, min(age) min_age, max(age) max_age, avg(age) avg_age, sum(age) sum_age FROM users GROUP BY country"
)

@drujensen (Member) commented:

@damianham I think what you're proposing is pretty much the same as what I proposed, correct? Curious why you wouldn't want to declare it as a class instead. You can leverage the query builder if you use the existing model that sits on top of your query instead.


sam0x17 commented Apr 16, 2020

@drujensen side note: is it possible to apply a .scalar like the one @Blacksmoke16 used above, but to a where chain? Because if so, my life becomes a bit easier.

@damianham (Contributor) commented:

> @damianham I think what you're proposing is pretty much the same as what I proposed, correct? Curious why you wouldn't want to declare it as a class instead. You can leverage the query builder if you use the existing model that sits on top of your query instead.

I was just throwing the idea around to see what your thoughts were. I think it is similar to what you proposed, except you don't populate your project with lots of classes; instead there is a single virtual model class that is part of the framework, and you instantiate virtual instances inside your controllers.

I did something similar to this a long time ago in RoR where I used method_missing to get fields from an attribute hash that was populated with columns from the result set record.

@drujensen (Member) commented:

Thanks @damianham I understand. So your idea is to have more of a smart object that can dynamically generate the properties. Ruby is good at that stuff but this might be tough to pull off using macros. I personally prefer having well defined objects for queries, but I see how you may just want to grab the max(:age) without creating a whole class and cluttering the code base.

@sam0x17 There is an executor that runs for scalar values, though I don't recall when it is invoked vs the list executor.


sam0x17 commented Apr 19, 2020

The biggest difficulty is if you have a complex scope/query chain, and you need to grab a max or an avg off of that.


sam0x17 commented May 1, 2020

So here's a real scenario where not having these makes things extremely difficult:

Let's say I have a User model with an Int32 field called upvotes, a country field, a state field, and a city field, and I want to know the average number of upvotes for users in a particular city, state, and country, created in January...

class User < Granite::Base
  connection pg
  table users

  column id : Int64, primary: true
  column email : String
  column password_hash : String
  column upvotes : Int32
  column city : String?
  column state : String?
  column country : String?

  timestamps
end

I would expect to just be able to do something like:

User.where(:created_at, :gt, Time.utc - 4.months)
  .where(:created_at, :lt, Time.utc - 3.months)
  .where(country: "US")
  .where(state: "MD")
  .where(city: "Baltimore")
  .avg(:upvotes)

This is an incredibly common sort of construction for people building web apps, and in particular for people coming from Rails where it is easy to chain aggregates onto complex where chains.

The current proposed solution is to make a custom version of the model (sort of like a materialized "view") that houses a custom select clause, something like UsersBetweenDateInSpecificLocationView. This would be sort of OK if I could just have that custom view inherit from my base Granite model, but inheritance doesn't work in Granite, so I have to completely recreate the model with all the fields involved in the query just to get the answer to some aggregate.

One solution that would work would be the ability to have multiple custom select clauses per model (e.g. allowing a name to be passed along with the macro, giving named custom select scopes), though I'd still prefer to just have working chainable aggregates.
