Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simple Functions #12635

Open
findepi opened this issue Sep 26, 2024 · 8 comments
Open

Simple Functions #12635

findepi opened this issue Sep 26, 2024 · 8 comments
Labels
enhancement New feature or request

Comments

@findepi
Copy link
Member

findepi commented Sep 26, 2024

Is your feature request related to a problem or challenge?

Verbosity

Currently implementing a scalar function is a pretty involved process. For example a simple function calculating greatest common divisor looks like this (a lot of detail omitted for brevity):

impl ScalarUDFImpl for GcdFunc {  
    fn name(&self) -> &str {
        "gcd"
    }

    fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType> {
        Ok(Int64)
    }

    fn invoke(&self, args: &[ColumnarValue]) -> Result<ColumnarValue> {
        make_scalar_function(gcd, vec![])(args)
    }
}

/// Gcd SQL function
fn gcd(args: &[ArrayRef]) -> Result<ArrayRef> {
    match args[0].data_type() {
        Int64 => {
            let arg1 = downcast_arg!(&args[0], "x", Int64Array);
            let arg2 = downcast_arg!(&args[1], "y", Int64Array);

            Ok(arg1
                .iter()
                .zip(arg2.iter())
                .map(|(a1, a2)| match (a1, a2) {
                    (Some(a1), Some(a2)) => Ok(Some(compute_gcd(a1, a2)?)),
                    _ => Ok(None),
                })
                .collect::<Result<Int64Array>>()
                .map(Arc::new)? as ArrayRef)
        }
        other => exec_err!("Unsupported data type {other:?} for function gcd"),
    }
}

/// Computes greatest common divisor using Binary GCD algorithm.
pub fn compute_gcd(x: i64, y: i64) -> Result<i64> {
   ...
}

This function is still "relatively simple" because:

  • it handles only int64 type (has no overloads)
    • a more complicated function would calculate return type based in input types in return_type and again in invoke
  • doesn't not support constant x or constant y parameters
    • the make_scalar_function helper expands scalar constant argument into an array, thus gcd(x, 5) is less efficient than gcd(x, y) because it first needs to allocate a temporary buffer for 5s repeated times the batch length
  • it doesn't handle dictionaries or run-end (REE / RLE) encoded values
    • i doubt people frequently calculate gcd on partitioning keys, but this is just an example
  • it doesn't handle all-non-null case, so probably misses optimization/vectorization opportunity

It should be possible to express functions using simple row-level operations, because in query engines most function logic is structured like that anyway (compute_gcd in the example).

Local data structures

Currently DataFusion functions are singletons plugged into the execution engine. They have no way to store and reuse buffers or compiled regular expressions, etc.
Thread local storage is not an option because DataFusion, when embedded, does not control creation of application threads.

Describe the solution you'd like

Simple

It should be possible to separate function logic from the mechanics.
Exemplary GCD function needs to provide fn compute_gcd(x: i64, y: i64) -> Result<i64> and the rest of boilerplate should not be hand-written for every function separately.

It should be possible to implement a function that accepts string values, without having to deal with the 5 different runtime representations that can carry string values: StringArray, LargeStringArray, StringViewArray, DictionaryArray<Key>, RunArray<Key> (maybe more than 5 because they are recursive in theory: can DictionaryArray contain a DictionaryArray? can it contain RunArray?)

Focused

Because SQL is statically typed, it is necessary to select function overload during planning time, so that return type is also known (the return_type function). The invoke function needs to implement same logic again. This process should be less error-prone: once the function is bound to the query call site, its signature is known and the implementation should not need to do type checks.

Performant / Efficient

It should be the compiler's / framework's job to provide vectorization, without hand-writing same logic in every function separately.

It should be the compiler's / framework's to do null checks, providing efficient tight loops for the all-non-null case without having to write such logic in every function separately.

It should be possible to write efficient functions that need thread local data structures, for example for regular expressions, without having to use thread locals and/or shared pools which introduce contention.

Describe alternatives you've considered

However, direct use of the library is not possible because

The library could still be part of the solution, but doesn't have to be and it's a non-goal to use a particular library.

Additional context

@findepi findepi added the enhancement New feature or request label Sep 26, 2024
@findepi
Copy link
Member Author

findepi commented Sep 26, 2024

@findepi
Copy link
Member Author

findepi commented Sep 26, 2024

FYI: I am doing some experiments how this could look like

@alamb
Copy link
Contributor

alamb commented Sep 26, 2024

I think the idea of making it easier to write functions that include specialized implementations for different types is a great idea. This would likely both make our code faster (more specializations, more uniformly handle constants) as well as reduce the replication (there are several different patterns for dispatch / specialization now -- consolidating would be really nice)

@alamb
Copy link
Contributor

alamb commented Sep 26, 2024

Currently DataFusion functions are singletons plugged into the execution engine. They have no way to store and reuse buffers or compiled regular expressions, etc.

here is some prior work on the precompiled regular expressions: #11146 (note compiling the regex hasn't ever shown up in any performance trace I did, probably because the cost of actually matching is so much bigger than the cost of compiling and the cost of compiling is spread over 8192 rows

@comphead
Copy link
Contributor

Thanks @findepi I think this process go through iterations, and easier than was before but still far from perfect.

The ScalarUDFImpl common trait is already a huge help, but for example dynamic typing, like MAX/MIN/GREATEST when return type depends on input is gonna be challenging, in Trino it seems also implemented in runtime rather than static.

Rust can benefit from macros and generate functions with all possible output types combinations, but this is just 1 problem.

I'm totally down if we can make it easier

@alamb
Copy link
Contributor

alamb commented Sep 27, 2024

Just to be clear, what I am imagining comes out of this work is:

  1. No changes to ScalarUDFImpl
  2. Some new way (generic functions, macros, etc) that would make creating the specialized function implementations (perhaps following the pattern of declarative macros on Prototype implementing DataFusion functions / operators using arrow-udf liibrary #11413)

@findepi
Copy link
Member Author

findepi commented Sep 28, 2024

FYI i touched upon the topic of types on DataFusion meetup in Belgrade yesterday.
The slides are here if anyone is interested: https://docs.google.com/presentation/d/1VW_JCGbN22lrGUOMRvUXGpAmlJopbG02hn_SDYJouiY . It was an attempt to summarize why we need both: simpler types (#11513), more types (#12644), and simple function "SDK" (#12635).
The document has comments disabled to avoid diverging the discussion from the github issue.

@alamb
Copy link
Contributor

alamb commented Sep 28, 2024

. It was an attempt to summarize why we need both: simpler types (#11513), more types (#12644), and simple function "SDK" (#12635).

It was also a very nice and understandable talk

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants