Notes from Data Prep Kit workshop at IBM techXchange conference in Las Vegas (Oct 21, 2024) #758

sujee · 2024-10-30T15:57:09Z


This was an in-person workshop that I ran at the IBM techXchange conference in Las Vegas (Oct 21, 2024).

The workshop was in 2 parts (2 hrs total):

Part 1 - DPK Intro - showcasing core features of DPK
Part 2 - RAG application

I used the newly released IBM-GRANITE-3.0 model for RAG. It worked really well.

The intro notebooks (Part 1) run on Google Colab. A lot of attendees ran these notebooks on Colab along with me.
This was great, as it gave them a pretty good idea of what DPK can be used for without having to set up their laptops.

The RAG application is designed to run in a local Python env. Some attendees started setting up their local Python env during the workshop, but pip install over conference wifi was slow (to be expected).

Afterwards, a few came up to me and chatted about their use cases.

Notes:

1. There is a good amount of interest in what DPK can do for them.
   We need to keep getting the word out.
2. We need to make as many example notebooks as possible Colab friendly (ready to run on Colab with a single click). I have been doing this and will continue to advocate for it.
3. One attendee tried using their own PDFs and remarked that the pdf2parquet conversion seems to be slow. This is a known issue, and we may want to prioritize investigating it.
   [Bug] improve performance of pdf2parquet #573
4. There was also interest in extracting tables from PDFs and processing OCR forms.
   I think docling can do these already; we need to create some tutorials (on my radar).
5. There was interest in the PII remover. We need to have a good tutorial.
6. Native Windows support. I think we are pretty close to achieving this now.
   [Bug] unable to install release 0.2.1 on windows (native) #644
7. There was a question about integration with InstructLab. This is something worth highlighting (tutorial, workshop, etc.).
8. Interest in processing HTML and Excel spreadsheets.
   We can process HTML now, but the Excel request was interesting.
9. There was a question about how we can keep external metadata about documents, and how we can use it for vector search / RAG.

agoyal26 · 2024-11-12T13:00:16Z

Collaborator

@sujee I am looking at the action items:

1. We are spreading the word; it's ongoing.
2. There is an effort to have Colab friendly notebooks for all transforms.
3. You have opened an issue.
4. Do you want to open an issue and assign it to yourself?
5. PII: working with a different team to build a demo.
6. In progress.
7. Let's discuss and figure out its priority in the team meeting.
8. Same as above.
9. Same as above.
