Collaboration on data analytics workloads in MLIR #506

Open
harsh-nod opened this issue Jun 2, 2022 · 12 comments

@harsh-nod

Hi folks @ingomueller-net @webmiche,

@bsarden-rivos and I are interested in running data analytics workloads (as found in popular frameworks such as pandas) on IREE. To do that, we have been trying to flesh out a path from pandas to MLIR. I have a simple prototype that takes an element-wise addition and lowers it to linalg here: https://github.com/nod-ai/pandas-mlir. Recently, though, based on @bsarden-rivos's findings, we have been thinking of using Substrait (https://substrait.io), and more specifically the ibis-substrait compiler (https://github.com/ibis-project/ibis-substrait), as a starting point for lowering to MLIR (linalg on tensors).

Looking at your commits in this repo, it seems you are exploring an alternative path to MLIR, so we would love to chat and brainstorm with you about your project goals, roadmap, and current state, and about how we can align efforts and collaborate.

Thanks and looking forward to collaborating with you all,
Harsh

@nicolasvasilache
Contributor

Hello @harsh-nod, thanks for sharing your plans!

Would you be available to meet next week with @webmiche?
Since Harsh is on the West Coast, I could meet Thursday 5-7pm CEST (8-10am PST) or after 8pm CEST / 11am PST.

Are you available in any of these slots?

@harsh-nod
Author

Hi @nicolasvasilache ,

Unfortunately, I am out of the office on Thursday the 9th, but 8am PST works for me on Mon, Tue, or Wed of next week. Alternatively, if only Thursdays work, I can do Thursday, June 16 at 8am PST. Do any of those days work for you?

@webmiche
Collaborator

webmiche commented Jun 4, 2022

Hey, that sounds interesting, looking forward to the meeting!

In terms of timing, I can't make Thursday the 9th since I will be doing military service on Thursday and Friday. Other than that, I can make 8am PST on any day except Fridays.

@nicolasvasilache
Contributor

Ok, let's do June 16th at 8am PST?
I can also do other days, but this one has already been identified as working for everyone.

@bsarden-rivos

bsarden-rivos commented Jun 6, 2022

Hi all, I'm excited for our chat! I can also do other days, but earlier than June 16th works better for me, since I have some free cycles to work on this early this week and next week. My schedule is wide open; how does Monday (6/13) at 8am work for folks? A few questions for @webmiche in the meantime:

  1. Where would be the best place to start contributing? Looking at some of the PRs in flight, I'm also interested in running a TPC-H query through MLIR, but I'm not sure where to start.
  2. What would be the best path for running a query e2e? Does extending alp and the AlpRuntime to execute a query make sense, or is there already something in the works that I can help flesh out?

@harsh-nod
Author

Unfortunately I am out of the office on Monday 6/13, 14 and 15. If we want something sooner, I can meet 8am PST tomorrow 6/7 or the day after 6/8?

@bsarden-rivos

bsarden-rivos commented Jun 6, 2022

> I can meet 8am PST tomorrow 6/7 or the day after 6/8?

Either time works for me!

@webmiche
Collaborator

webmiche commented Jun 7, 2022

> Unfortunately I am out of the office on Monday 6/13, 14 and 15. If we want something sooner, I can meet 8am PST tomorrow 6/7 or the day after 6/8?

For me, that time window would only work tomorrow (6/8).

> 1. Where would be the best place to start contributing? Looking at some of the PRs in flight, I'm also interested in running a TPC-H query through MLIR, but I'm not sure where to start.

I think it would be very useful for our meeting if you could look through the TPC-H queries and think a bit about some of the challenges of modeling and running them with MLIR. I think I found "hard to solve" problems for most of them, and I feel that Q6 is the most reasonable one to get running first, but I would welcome a second opinion.
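For reference, here is a rough sketch of what Q6 computes, written as whole-column NumPy operations. It is only an illustration, not code from this repo: the `tpch_q6` helper and the dict-of-arrays layout for `lineitem` are assumptions, and the constants are the commonly used default substitution parameters for Q6 (shipped in 1994, discount within 0.06 ± 0.01, quantity below 24).

```python
import numpy as np


def tpch_q6(lineitem):
    """Sketch of TPC-H Q6 over a dict of 1-D NumPy column arrays (assumed layout)."""
    shipdate = lineitem["l_shipdate"]
    discount = lineitem["l_discount"]
    quantity = lineitem["l_quantity"]
    price = lineitem["l_extendedprice"]

    # The WHERE clause becomes a single boolean mask over all rows.
    keep = (
        (shipdate >= np.datetime64("1994-01-01"))
        & (shipdate < np.datetime64("1995-01-01"))
        & (discount >= 0.05)
        & (discount <= 0.07)
        & (quantity < 24)
    )

    # SUM(l_extendedprice * l_discount) over the selected rows:
    # an element-wise multiply, a select, and a reduction.
    return float(np.sum(np.where(keep, price * discount, 0.0)))


# Tiny made-up table: only the first and last rows satisfy every predicate.
lineitem = {
    "l_shipdate": np.array(
        ["1994-03-01", "1995-01-01", "1994-06-01", "1994-06-01", "1994-12-31"],
        dtype="datetime64[D]",
    ),
    "l_discount": np.array([0.06, 0.06, 0.10, 0.06, 0.06]),
    "l_quantity": np.array([10.0, 10.0, 10.0, 24.0, 5.0]),
    "l_extendedprice": np.array([100.0, 200.0, 300.0, 400.0, 50.0]),
}
revenue = tpch_q6(lineitem)
```

Written this way, the query has no joins, grouping, or string handling: it is element-wise comparisons and a multiply followed by one reduction.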

> 2. What would be the best path for running a query e2e? Does extending alp and the AlpRuntime to execute a query make sense, or is there already something in the works that I can help flesh out?

This is still very much an open question. The broad idea is that, since pandas stores data in columnar form and these columns are NumPy arrays, we extract the NumPy arrays from pandas and pass them to our MLIR functions (find the file here). This approach piggybacks on parts of the sandbox. AFAIK, we have not yet developed a more concrete or complete idea of how this should look in the end.
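A minimal sketch of that column-extraction idea, under stated assumptions: `extract_columns` and `elementwise_add` are hypothetical names, and `elementwise_add` is a plain NumPy stand-in for a compiled MLIR function, not the actual entry point used in the sandbox.

```python
import numpy as np
import pandas as pd


def extract_columns(df, names):
    """Pull the named columns out of a DataFrame as contiguous NumPy arrays."""
    return [np.ascontiguousarray(df[name].to_numpy()) for name in names]


def elementwise_add(lhs, rhs):
    """Hypothetical stand-in for a compiled kernel that takes array arguments."""
    out = np.empty_like(lhs)
    np.add(lhs, rhs, out=out)
    return out


df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})

# Extract the raw column buffers, run the kernel, and attach the result
# back to the DataFrame as a new column.
a, b = extract_columns(df, ["a", "b"])
df["sum"] = elementwise_add(a, b)
```

Keeping the kernel interface at the level of flat arrays means the pandas side only has to hand over columns and reattach results.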

@nicolasvasilache
Contributor

6/8 at 8am PST is great, I'll post a link here

@harsh-nod
Author

harsh-nod commented Jun 7, 2022

Not sure if you all have read this (it just came out a few days ago), but I found an interesting paper on implementing relational operators in PyTorch and running TPC-H queries (including Q6), where they outperform DuckDB: Query Processing on Tensor Computation Runtimes

@nicolasvasilache
Contributor

Here is the link for today's meeting.
Video call link: https://meet.google.com/ndw-fzsv-hqb
Or dial: (CH) +41 31 560 24 00 PIN: 295 558 240 8107#
More phone numbers: https://tel.meet/ndw-fzsv-hqb?pin=2955582408107

@harsh-nod
Author

I'm trying to join the meeting but it's stuck at "Asking to join...".
