Replies: 1 comment
@sujee I am looking at action items:
Notes from Data Prep Kit workshop at IBM techXchange conference in Las Vegas (Oct 21, 2024)
This was an in-person workshop that I ran at the IBM techXchange conference in Las Vegas (Oct 21, 2024).
The workshop was in 2 parts (2 hrs total):
Part 1 - DPK intro, showcasing core features of DPK
Part 2 - RAG application
I used the newly released IBM-GRANITE-3.0 model for RAG. It worked really well.
The intro notebooks (part 1) run on Google Colab. A lot of attendees ran these notebooks on Colab along with me.
This was great, as it gave them a pretty good idea of what DPK can be used for without having to set up their laptops.
The RAG application is designed to run in a local Python env. Some attendees started setting up their local Python env during the workshop, but pip install over conference wifi was slow (to be expected).
Afterwards, a few came up to me and chatted about their use cases.
Notes:
There is a good amount of interest in what DPK can do for them.
We need to keep getting the word out.
We need to make as many example notebooks as possible Colab friendly (ready to run on Colab with a single click). I have been doing this and will continue to advocate for it.
One attendee tried using their own PDFs and remarked that the pdf2parquet conversion seems to be slow. This is a known issue, and we may want to prioritize investigating it.
[Bug] improve performance of pdf2parquet #573
There was also interest in extracting tables from PDFs and processing forms with OCR.
I think docling can do these already; we need to create some tutorials (on my radar).
There was interest in the PII remover. We need to have a good tutorial.
Native Windows support. I think we are pretty close to achieving this now.
[Bug] unable to install release 0.2.1 on windows (native) #644
There was a question about integration with InstructLab. This is something worth highlighting (tutorial, workshop, etc.).
There was interest in processing HTML and Excel spreadsheets.
We can process HTML now, but the Excel request was interesting.
There was a question about how we can keep external metadata about documents, and how we can use it for vector search / RAG.