Port of substrait-spark module from Gluten #271
f7a5ed7 to 157276e
I suspect this will be one of the topics for the community meeting later today if you're interested.
}
developers {
developer {
// TBD Get the list of
Hmm, we should have this somewhere.
I just copied the pom section from the other build files in the project. We could just remove the developer section entirely.
It's worth noting that this will be the first instance of Scala in substrait-java. I'll let those that use this code more often decide if that's a bad thing or not.
Side question: If this code was magically available in Java would it still be usable by existing callers?
Yes, I didn't know whether this would be an issue. The other PR that I referenced was also written in Scala, and that didn't seem to be an issue in any of the comments. This code originally came from the Gluten project, which is all in Scala. Spark itself is also written in Scala.
We are using this library from Java without any issues.
Side question: If this code was magically available in Java would it still be usable by existing callers?
I don't think it would make any difference from an end-user point of view.
Interested party here!
I think we have needed this PR for a long time. It won't be possible to do this in Java (AFAIU, because of the low-level Spark APIs involved), and creating a substrait-scala just for this doesn't make much sense. I guess it would be fine to add this code here.
cc @vbarua
To echo @andrew-coleman's comments: the only issue I've seen with calling the library from Java is that following the links in VSCode drops me into decompiled Scala code, which is not that readable in my view.
But that's it, so altogether not bad at all :-)
assertSqlSubstraitRelRoundTrip(
"select l_partkey from lineitem where l_shipdate < date '1998-01-01' " +
"order by l_shipdate asc, l_discount desc")
// assertSqlSubstraitRelRoundTrip(
Any idea why these are commented out?
Good question. There are three tests commented out here. When I uncommented them, the first one failed (because the offset clause has not been implemented yet) but the other two passed. So I'll push a new commit with the comments removed and the failing test in a separate ignore block.
There are a few tests that are ignored due to unimplemented features. I plan to work on these in the coming weeks.
FWIW, we also have an internal fork of this, and I've done some amount of work on it. If this gets included here, I'll be happy to add in my changes as relevant; otherwise I'll look at open-sourcing our fork at some point.
There are likely some annoyances it may cause (like deciding/aligning the code formatting for both), but as long as this is a separate subproject it should not affect the Java-only parts in any way really afaik.
Probably yes, but given Spark (esp. Catalyst) is written mostly in Scala, it's just much easier to traverse the Spark plans in Scala than in Java.
Awesome! Then we should combine our efforts.
Probably worth adding that I've got several examples of using this library with Spark. I need to get these approved before I can share them, and I'm currently writing up docs for this.
* a, max(b) + 1 from table group by a</code>, We need create [[Project]] on top of [[Aggregate]]
* to correctly support it.
*
* TODO: support [[Rollup]] and [[GroupingSets]]
maybe we can create issues for these to be added later?
Pleased to see the positive discussion in the 2024-06-19 Substrait Community Sync Meeting regarding this PR. Is there anything I can do to help progress this?
I think it makes sense for this code to live in substrait-java. I took a cursory pass at the code and it seems fine. While it could be nicer, I think it's the kind of thing that we can improve with time and I don't want to block the initial PR.
I can potentially help with reviews of this in the future, though it's been ~8 years since I touched Scala, and that was a very different style (think cats-effect). Which is to say it could take me a bit to ramp back up.
I did have one question about the build which I left a comment on.
# limitations under the License.
%YAML 1.2
---
scalar_functions:
meta/future: it's potentially desirable to bring extensions like this into the core spec, either as core extensions or as Spark-specific functions. Not for today though.
be74cd8 to a04d72d
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.spark.substrait
Another minor thing I just noticed. Both this class and ToSubstraitType are under org.apache.spark.substrait instead of io.substrait.spark like everything else. Is that intentional?
It appears that in SparkTypeUtil, the sameType method is accessing a private member of org.apache.spark.sql.types.DataType 🤮 so it has to be in the org.apache.spark namespace.
The other file though doesn't look like it needs to be there so I have moved it into io.substrait.spark.
If we're happy with this for now, I'll try and recode this bit at a later date.
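For readers unfamiliar with the constraint mentioned here, the following is a minimal sketch of Scala's package-qualified `private[...]` visibility, which is one common reason a helper ends up declared under another project's package. It assumes the member in question uses this kind of qualified visibility. All names in it (`org.example.engine`, `Kind`, `sameKind`, `KindUtil`, `com.example.app`) are hypothetical stand-ins, not Spark's or this PR's actual code.

```scala
// Hypothetical library package, standing in for a third-party namespace.
package org.example.engine {
  class Kind(val name: String) {
    // Qualified private: callable only from code inside package
    // org.example.engine, including its sub-packages.
    private[engine] def sameKind(other: Kind): Boolean = name == other.name
  }
}

// A helper placed in a sub-package of the library's namespace, so the
// qualified-private member above is visible to it.
package org.example.engine.bridge {
  import org.example.engine.Kind

  object KindUtil {
    def same(a: Kind, b: Kind): Boolean = a.sameKind(b)
  }
}

// Code outside org.example.engine can use the public helper, but a direct
// call to a.sameKind(b) from this package would be rejected by the compiler.
package com.example.app {
  import org.example.engine.Kind
  import org.example.engine.bridge.KindUtil

  object Main {
    def main(args: Array[String]): Unit = {
      val same = KindUtil.same(new Kind("int"), new Kind("int"))
      println(same)
    }
  }
}
```

Keeping only the thin wrapper in the foreign namespace, and everything else in the project's own package, limits how much code depends on the other library's internals.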
That's definitely 🤮, but we can improve that in the future. Thanks for looking into it.
This module was part of the gluten project and subsequently removed. It is useful for converting spark query plans to and from substrait.
Signed-off-by: andrew-coleman <[email protected]>
- uncomment unit tests
- move ToSubstraitType class to package io.substrait.spark
Signed-off-by: andrew-coleman <[email protected]>
a04d72d to 6394dfc
I've rebased this on latest commit in
As discussed in the chain of comments here, I’m offering up this PR for your consideration.
I can take no credit for its implementation; it is a copy of the substrait-spark module that was part of the Gluten repo before it was removed, with a few minor corrections and additions. We are using this as part of a project here (IBM) and would like to see it remain available under the Apache ecosystem. It is useful for converting Spark query plans to and from Substrait.