Function names & URIs #634
I'm not sure I expect a solid answer but figured it would be good to solicit discussion and opinions.
I'm currently facing some design choices related to this and this is roughly where my thinking is at this point: TLDR at bottom...
Generally, uniquely qualifying function names is not a problem that consumers have to deal with because the extension references are already resolved in the plan they are receiving. For plan producers, it seems some specific preferences do need to be adopted in order to produce plans under practical constraints. Specifically, the input most producers are serializing into substrait will typically contain the function name and input types, but not the URI. Generally speaking this is coming from SQL but dataframes have the same issue.
In the statement In general I think this is reasonable. If a plan producer wants to shadow the "default" I think making this arbitrary choice is awkward but in practice most consumers don't do this at all which is a source of bugs. The original issue (#631) is an example and it looks like ibis-substrait has a similar silent bug where the second occurrence of a (
TLDR: Given all this, I think one way to help simplify implementations would be to ensure that all functions defined under https://github.com/substrait-io/substrait/blob/main/extensions have a unique
Thoughts on this?
Consumers generally need to map the fully qualified name to an actual function implementation though. Here is a simple python example showing the pattern I usually see:
Notice that URIs are not involved (because I don't commonly see it). This can lead to a problem. If a producer decides to shadow "add" and creates
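As a rough illustration of the name-only pattern described above (this is not the commenter's original example), a consumer might keep a registry keyed by function name alone. The registry contents and the `example.com` URI are assumptions made for this sketch.

```python
import operator

# Hypothetical name-only registry: the extension URI is never consulted.
NAME_ONLY_REGISTRY = {
    "add": operator.add,
    "subtract": operator.sub,
}


def resolve_by_name(function_name):
    """Look up an implementation using only the function name."""
    return NAME_ONLY_REGISTRY[function_name]


# A reference to "add" from a made-up extension file (which could define
# different semantics) still resolves to the built-in addition.
shadowed_reference = {"uri": "https://example.com/my_functions.yaml", "name": "add"}
impl = resolve_by_name(shadowed_reference["name"])
print(impl(2, 3))
```

Because the URI is dropped before the lookup, the consumer has no way to notice that the producer meant a different "add".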
I do not agree the choice is arbitrary. For example, let's decide we want to fix the above consumer to properly implement function mapping:
Now, if the producer provides However, if the producer arbitrarily decides on
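One way such a fixed consumer could look, as a minimal sketch: key the registry on the extension URI together with the function name and argument types, and reject anything that is not registered. The registry layout, the `UnsupportedFunctionError` name, and the `example.com` URI are assumptions for illustration; only the `functions_arithmetic.yaml` URL comes from this thread.

```python
import operator

ARITHMETIC_URI = (
    "https://github.com/substrait-io/substrait/blob/main/extensions/"
    "functions_arithmetic.yaml"
)

# Hypothetical registry keyed by (URI, function name, argument types).
REGISTRY = {
    (ARITHMETIC_URI, "add", ("i64", "i64")): operator.add,
}


class UnsupportedFunctionError(LookupError):
    """Raised when no implementation matches the fully qualified name."""


def resolve(uri, name, arg_types):
    """Resolve an implementation only when URI, name, and arguments all match."""
    key = (uri, name, tuple(arg_types))
    try:
        return REGISTRY[key]
    except KeyError:
        raise UnsupportedFunctionError(f"no implementation for {key!r}") from None


print(resolve(ARITHMETIC_URI, "add", ["i64", "i64"])(2, 3))
```

With this shape, an "add" coming from an unknown URI fails loudly instead of silently running the default implementation.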
How would you know if a URI is user-defined or not?
I partly agree. This is also called out here:
I think it is technically legal for a function signature to be duplicated as long as the duplicate is in a different yaml file. This is because the yaml filename should be part of the URI. However, I would still discourage it, and am fine saying we don't want it, since it can be misleading.
Thanks for the response @westonpace. I pretty much agree with all of this, and would like to clarify that I don't think there's actually any ambiguity in the spec as it is currently defined: a function is uniquely specified by ( In practice many implementations don't handle URIs correctly, but the implications are different on the producer vs the consumer side. Consider an implementation executing SQL queries via substrait:
I believe you were focusing on the right-hand side in your response, where a substrait message is mapped to a physical plan for the query engine to execute. The reason I say there isn't ambiguity on the consumer side is because the plan it receives should already have an explicit URI in the
On the left side of the workflow a SQL query is getting planned and then serialized into a substrait message. Consider the query
This only gets confusing IMO if we were to (hypothetically) define another extension file in the main substrait repo called for instance
So substrait-java doesn't ignore the field, as I discovered during our version update internally in which plans using
Another example where URIs are very important. Substrait defines the following:
    - name: "avg"
      impls:
        - args:
            - name: x
              value: i64
          options:
            overflow:
              values: [ SILENT, SATURATE, ERROR ]
          nullability: DECLARED_OUTPUT
          decomposable: MANY
          intermediate: "STRUCT<i64,i64>"
          return: i64?
However, we might want to have a variant of
    - name: "avg"
      impls:
        - args:
            - name: x
              value: i64
          options:
            overflow:
              values: [ SILENT, SATURATE, ERROR ]
          nullability: DECLARED_OUTPUT
          decomposable: MANY
          intermediate: "STRUCT<i64,i64>"
          return: fp64? # <-- RETURNS FLOATING POINT
in A consumer supporting Postgres style
More generally, a producer should only support a function if its implementation semantics match what is given in the function extension definition. A consumer should only use functions that match the semantics that they are trying to encode.
I actually agree that this is somewhat arbitrary in that a generic producer can choose any version that matches the semantics they are trying to encode. I would argue within the core substrait extensions we should avoid having functions that are identical except for URI.
I'm definitely rhs focused as well 😅 For your example starting from SQL, consider something like:
    SELECT avg(<i64 column>) FROM ...
From the SQL side of things, this
This is a good point @vbarua, especially since Currently some producers are auto-generating function definitions from the contents of https://github.com/substrait-io/substrait/blob/main/extensions. The function mapping for the SQL query above works fine as long
    import fancy_client_that_uses_substrait as fc
    res_i64 = fc.query("SELECT avg(order_quantity) FROM orders")
    res_fp64 = fc.query("SELECT avg(order_quantity) FROM orders", dialect="postgres")
In this case the producer has the explicit information it needs to choose the correct extension file (if it assumes
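A minimal sketch of how a producer could turn an explicit dialect hint like the one above into a choice of extension file. The lookup table, the `choose_avg_extension` helper, and the Postgres file URI are hypothetical; only the core `functions_arithmetic.yaml` URL and the two return types come from this thread.

```python
# Path in the core Substrait repository, as quoted in this thread.
CORE_ARITHMETIC_URI = (
    "https://github.com/substrait-io/substrait/blob/main/extensions/"
    "functions_arithmetic.yaml"
)
# Hypothetical dialect-specific file, invented for this sketch.
POSTGRES_ARITHMETIC_URI = (
    "https://example.com/extensions/functions_arithmetic_postgres.yaml"
)

# Hypothetical table: dialect -> (extension URI, return type of avg over i64).
AVG_I64_BY_DIALECT = {
    "default": (CORE_ARITHMETIC_URI, "i64?"),
    "postgres": (POSTGRES_ARITHMETIC_URI, "fp64?"),
}


def choose_avg_extension(dialect="default"):
    """Return the (URI, return type) the producer should encode for avg(i64)."""
    try:
        return AVG_I64_BY_DIALECT[dialect]
    except KeyError:
        raise ValueError(f"unknown dialect: {dialect!r}") from None


uri, return_type = choose_avg_extension("postgres")
print(return_type)
```

The point of the sketch is only that the dialect, not the function name, carries the information needed to pick between two otherwise identical signatures.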
I am late to the party here and totally missed this thread. I need to figure out where these notification emails are going...
Not a fan. The intention has always been to create common function libraries in the substrait core project. So, postgres-functions, snowflake-functions, etc. In fact, I'm actually working on snowflake-functions right now and we're starting to introduce them. Forcing all the files to have different names for functions will just mean everyone is going to have to create mappings between differently named functions (yes, we called this avg_snowflake but you probably know it as avg).
Not really true. substrait-java (and Isthmus) use the local paths (e.g. "/functions_arithmetic.yaml"). A decent amount of work/infrastructure was put in place to support that pattern.
I agree that consumers and producers are ignoring the URI part of the spec and we have to adapt as best we can. I propose we add a few behaviors to incorporate laziness without excessively watering down the specification. The reality right now is that people could be lazy because they aren't that interested in binding to the right type patterns, e.g. output type derivation differences. So:
Thoughts?
This isn't quite what I meant; I agree with most of what you wrote. If I remember correctly, I think at the time I was asking about the "significance" of extensions defined in https://github.com/substrait-io/substrait/blob/main/extensions compared to arbitrary URIs.
The examples you named are good ones to consider in this context, e.g. postgres-functions, snowflake-functions. It would be helpful to have a name for these "sets" of extensions. I don't think this concept exists currently in the substrait spec. IMHO it would be helpful if function signatures are distinct within one of these system-specific sets. So postgres-functions can only define
I think these system-specific collections are a great idea, but it does bring back the question about the ones in https://github.com/substrait-io/substrait/blob/main/extensions. They aren't specific to any system. Are they intended to be a neutral set of "canonical" extensions that other system-specific collections map to/from? (This has been my interpretation.)
It would be nice if it were possible to point to a "base-uri" and register functions within that set. Some examples:
If |
How do avg_snowflake and avg differ? Are there just missing options? Yes, most of the consumers don't handle options today but would handling them solve the problem more cleanly? As to naming, the point of the URI scheme is to namespace the functions so having different prefixes should be different. Another place we might need different function namespaces might be scalar vs aggregate vs window functions. It's entirely possible one would want an avg that works as two or more of those types.
Checking my understanding. It sounds like both @jacques-n and @joellubi are describing a new concept that is a "collection of extensions" (my brain immediately wants to use "dialect" but I'll stick with "collection" for this comment at least).
Jacques is proposing an "ordering" suggesting that there can be duplicates in a collection but the order of the files determines how they are handled. Joel is proposing no duplicates within a collection. Both approaches solve the "lazy consumer" problem by allowing consumers to ignore URIs and putting the burden on producers (or in between components) to construct the appropriate collection that describes the consumer's behavior.
If I had to pick then I think I'm partial to the ordering approach. Mainly because an "ordering" allows you to build up a collection in layers. I could start with the base "core substrait" layer and then add a "postgres layer" which doesn't have to be complete, but both adds new functions and replaces existing functions.
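The two collection ideas summarized above can be sketched side by side. This is only an illustration under assumptions: the file names, the signature strings, and both helper functions are hypothetical and are not part of the Substrait spec.

```python
def build_layered_collection(layers):
    """Ordering approach: later layers add to or replace earlier ones.

    `layers` is an ordered list of (extension URI, list of signatures).
    Returns a mapping from signature to the URI that wins.
    """
    resolved = {}
    for uri, signatures in layers:
        for signature in signatures:
            resolved[signature] = uri  # a later layer overrides an earlier one
    return resolved


def find_duplicates(layers):
    """No-duplicates approach: report signatures defined by more than one file."""
    first_seen = {}
    duplicates = {}
    for uri, signatures in layers:
        for signature in signatures:
            if signature in first_seen:
                duplicates.setdefault(signature, [first_seen[signature]]).append(uri)
            else:
                first_seen[signature] = uri
    return duplicates


# Hypothetical layers: a core layer, then a partial Postgres layer on top.
core_layer = ("core/functions_arithmetic.yaml", ["add:i64_i64", "avg:i64"])
postgres_layer = ("postgres/functions_arithmetic.yaml", ["avg:i64"])
layers = [core_layer, postgres_layer]

collection = build_layered_collection(layers)
print(collection["avg:i64"])
print(find_duplicates(layers))
```

Under the ordering approach the duplicate `avg:i64` is resolved by position; under the no-duplicates approach the same input would be rejected as an invalid collection.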
We talked about this today in the sync. My sense of general consensus: The spec is fine. The fact that people are ignoring it isn't a problem we need to fix in the spec. Adding multiple files with the function signature (name + args) is fine. If that's an accurate synopsis of people's thinking, I think we close this out as "not a problem". I do think it is worthwhile to open tickets for "what is valid format for URI" but the main gist of this conversation has been more about the first point @westonpace made above which I think we're concluding is a non-issue.
Hi, I just found this thread, as I've been running into these problems. As @westonpace noted at the start, there is a wide variety in how these are used; what I think is missing is a high-level flow of how these URIs are meant to be handled. Say I'm implementing the ability to run substrait plans on my new WibbleDB (TM): when I encounter an extension URI, what am I actually supposed to do with it? There are a lot of extension files already defined. Should these be loaded locally, and how are they versioned?
This is in relation to some discussion that came up in #631
My understanding is that a fully qualified function name is the triple:
Therefore, yes, these are different functions (because the function uri is different). However, there are (at least) three problems that have never really been fully resolved:
The major producers that I am aware of today (isthmus, duckdb, ibis) either set the function URI to undefined, the empty string, or / (I think we actually have all three behaviors across the three producers 😮💨).
Correspondingly, consumers tend to ignore this field. The one exception I'm aware of is Acero, which will tolerate /, empty string, and undefined (Acero goes into a "fallback" mode where it does name-only matching and will match any registered function with the same name regardless of the URI) but which will accept URLs of the form https://github.com/substrait-io/substrait/blob/main/extensions/functions_arithmetic.yaml and also has a special URI, urn:arrow:substrait_simple_extension_function, which means "use the arrow compute function with the given name" (this is how we support UDFs).
There are several choices. For example:
My preference is the former, for practical reasons.
This is discussed in more detail in #274
But the basic question is "what if we have tons of users and we decide to make a change to some function?"