Template for single transform notebook examples #754

shahrokhDaijavad · 2024-10-29T20:55:26Z

Search before asking

I searched the issues and found no similar issues.

Component

Other

Feature

As related to the second task in #753, we need a notebook template as a starting point.

Are you willing to submit a PR?

Yes I am willing to submit a PR!

touma-I · 2024-11-05T16:49:13Z

@Bytes-Explorer You had mentioned you had a template we can use. Thanks

Bytes-Explorer · 2024-11-06T12:45:29Z

Team: please see these notebooks from one of my previous projects: https://github.ibm.com/data-readiness-for-ai/dart/tree/offering_dev/notebooks

shahrokhDaijavad · 2024-11-06T16:17:33Z

@agoyal26 Please discuss with @sujee and finalize and show us the outcome.

agoyal26 · 2024-11-06T16:40:41Z

@sujee has a lot of experience giving engaging DPK demos over the past 2 months, showcasing multiple RAG notebooks, so we want to leverage his learnings and feedback to come up with a sample template that is simple and engaging for a new user.

sujee · 2024-11-06T17:03:32Z

> @agoyal26 Please discuss with @sujee and finalize and show us the outcome.

@agoyal26 and I will be working on this 👍

agoyal26 · 2024-11-07T12:38:37Z

I am going to jot down here my first proposal for the template, along with best practices as discussed with @sujee and some open points for discussion.

Best Practices

Ideally, include a graphical/flowchart-style representation of the data transformation pipeline
The notebook should be Google Colab friendly
Please add a requirements.txt so users can see all packages in one place
Number and label each notebook cell for easy reference
Have one separate section right after the imports for setting config parameters; they should not be spread throughout the notebook
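
As a rough illustration of the "single config section" and "Colab friendly" points above, a config cell could look something like the sketch below. The `NotebookConfig` class, its field names, and the `input`/`output` folder names are hypothetical and are not part of DPK; checking `sys.modules` for `google.colab` is a common heuristic for detecting a Colab runtime, not an official guarantee.

```python
import sys
from dataclasses import dataclass
from pathlib import Path


def running_in_colab() -> bool:
    """Heuristic: Colab runtimes load the google.colab package at startup."""
    return "google.colab" in sys.modules


@dataclass(frozen=True)
class NotebookConfig:
    """All tunable parameters in one place (hypothetical field names)."""
    input_dir: Path
    output_dir: Path
    in_colab: bool


def make_config(base_dir: str = ".") -> NotebookConfig:
    """Build the config once, right after the imports cell."""
    base = Path(base_dir)
    return NotebookConfig(
        input_dir=base / "input",
        output_dir=base / "output",
        in_colab=running_in_colab(),
    )


config = make_config()
print(config)
```

Later cells would then read only from `config`, so a user who wants to change a path or parameter has a single cell to edit.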

Open Questions

Should there be 2 notebooks (one Jupyter/conda based and one Google Colab based), or should they be integrated into 1?
For notebooks, do we suggest pip installing the modules, or do we assume people have done a git clone? @sujee suggested the latter, as it avoids installing packages on users' machines
Do we check in the output of cells? Should we display the contents of the output parquet files?
What is the best way to show progress through the notebook, so that users can skim and learn if they don't want to run the notebook for now?

agoyal26 · 2024-11-07T13:02:17Z

Proposed structure for notebook:

Import Libraries: Import necessary libraries and any additional dependencies for visualization or analysis.
Set up Config parameters
Import and Load input Data
Apply the module to the input data and demonstrate its usage, with clear comments about parameters
Output/Analysis of output data
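
One possible way to turn the structure above into a reusable starting point is to generate an empty notebook skeleton with one numbered heading and one placeholder code cell per step. This is only a sketch under stated assumptions: it builds the notebook JSON by hand with the standard library (nbformat 4 layout), and the helper names, the "Step N" heading style, and the notebook title are illustrative, not an agreed template.

```python
import json

# Section titles follow the proposed structure above.
SECTIONS = [
    "Import libraries",
    "Set up config parameters",
    "Import and load input data",
    "Apply the module to the input data",
    "Output / analysis of output data",
]


def markdown_cell(text: str) -> dict:
    """Minimal nbformat 4 markdown cell."""
    return {"cell_type": "markdown", "metadata": {}, "source": text}


def code_cell(text: str) -> dict:
    """Minimal nbformat 4 code cell with no outputs yet."""
    return {
        "cell_type": "code",
        "metadata": {},
        "execution_count": None,
        "outputs": [],
        "source": text,
    }


def build_template(title: str) -> dict:
    """Return a notebook dict with a numbered heading and a stub cell per step."""
    cells = [markdown_cell(f"# {title}")]
    for number, section in enumerate(SECTIONS, start=1):
        cells.append(markdown_cell(f"## Step {number}: {section}"))
        cells.append(code_cell(f"# TODO: {section.lower()}"))
    return {"cells": cells, "metadata": {}, "nbformat": 4, "nbformat_minor": 4}


notebook = build_template("Single transform example")
notebook_json = json.dumps(notebook, indent=1)
print(len(notebook["cells"]), "cells generated")
```

Writing `notebook_json` to a file with an `.ipynb` extension should give a notebook that opens in Jupyter, which transform owners could then fill in step by step.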

sujee · 2024-11-07T19:31:24Z

If you want to see a sample notebook here is my attempt 😄 : https://github.com/IBM/data-prep-kit/blob/dev/examples/notebooks/intro/dpk_intro_1_python.ipynb

A unified notebook that runs both locally and on Colab
Workflow diagram at the top of the notebook
Numbered steps for easy reference (Step 4, Step 4.1, etc.)
Outputs are checked in, so users can skim the notebook and see what's going on (without running it)
I am printing out intermediate results (parquet) as I go along, so we can track the transformations.

sujee · 2024-11-07T19:31:51Z

@agoyal26 should we move this into a discussion?

shahrokhDaijavad · 2024-11-07T20:35:57Z

Thanks, @agoyal26 and @sujee. This is a good discussion, and I like https://github.com/IBM/data-prep-kit/blob/dev/examples/notebooks/intro/dpk_intro_1_python.ipynb as a model to use. Of course, for a "single" transform, the notebook will be simpler than this example: after the diagram, setup, and configuration, there would be one step for data ingestion (zip, pdf, html) and conversion to parquet, and then a second step for running a single transform (parquet to parquet).
@sujee You are right that this would ideally go to the "discussion," but to keep it actionable, let's keep it in the issues.

agoyal26 · 2024-11-13T14:24:35Z

Team - please make your suggestions so that we can close on a template and share with transform owners.
@Bytes-Explorer @touma-I

touma-I · 2024-11-13T23:34:02Z

@agoyal26 I had to do one today for @sujee to use for his html2parquet and, in fact, I ended up very close to what you were proposing above: #754 (comment)

Here is the link to the notebook I did : https://github.com/touma-I/data-prep-kit-pkg/blob/html2parquet-example/transforms/language/html2parquet/notebooks/html2parquet.ipynb

Here is how I adapted your proposal:

pip install the required packages
Import Libraries: Import necessary libraries and any additional dependencies for visualization or analysis.
Set up Config parameters
Invoke run-time
Output/Analysis of output data

I am a big proponent of keeping things simple/minimal for this first iteration.

cc: @shahrokhDaijavad @Bytes-Explorer

shahrokhDaijavad · 2024-11-14T00:01:35Z

@touma-I This is a nice, simple template, and I am all for it.
For transform owners who do parquet to parquet, should the input to the notebook be parquet or should the notebook have extra initial cells for starting with pdf or html, converting to parquet, before running the transform?

agoyal26 · 2024-11-14T04:26:55Z

I think we should include the additional cells, as the initial data will usually not be in parquet format.

touma-I · 2024-11-14T10:54:34Z

@shahrokhDaijavad Today, all transforms work the same way: they have an input folder and they produce an output folder. Most of the transforms expect files with a .parquet extension in the input folder, except for the ingest transforms, such as html2parquet, pdf2parquet, code2parquet, etc., which accept .html, .pdf, .py, etc. So the structure should still be the same for all examples; only the type of files in the input folder is different.
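
To make the "same structure, different file types" point concrete, here is a minimal sketch of a notebook cell that picks up input files by extension. The `list_input_files` helper is hypothetical and is not a DPK API; it only illustrates that an example for an ingest transform and one for a parquet-to-parquet transform could differ in a single parameter.

```python
import tempfile
from pathlib import Path


def list_input_files(input_dir: Path, extensions: tuple[str, ...] = (".parquet",)) -> list[Path]:
    """Return the files in input_dir whose suffix matches one of the extensions."""
    return sorted(
        path
        for path in input_dir.iterdir()
        if path.is_file() and path.suffix.lower() in extensions
    )


# Demo with a throwaway input folder holding mixed file types.
with tempfile.TemporaryDirectory() as tmp:
    input_dir = Path(tmp) / "input"
    input_dir.mkdir()
    for name in ("a.parquet", "b.html", "c.pdf"):
        (input_dir / name).touch()

    # A parquet-to-parquet example would use the default extension...
    parquet_files = [p.name for p in list_input_files(input_dir)]
    # ...while an html2parquet-style example would only change the extension.
    html_files = [p.name for p in list_input_files(input_dir, (".html",))]
    print(parquet_files, html_files)
```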

shahrokhDaijavad · 2024-11-14T16:20:24Z

@Ryan-Gordon-314159 Based on all the discussion above and the two notebook examples that @touma-I and @sujee are linking to, I think we have all the ingredients to build a nice template.

shahrokhDaijavad · 2024-11-14T23:05:06Z

I have been discussing this with @touma-I today, and we have concluded that the best way for us to make fast progress is to use an iterative method, i.e., instead of waiting for a "complete" template, start with a simple functional notebook, a la what Maroun did for html2parquet (https://github.com/touma-I/data-prep-kit-pkg/blob/html2parquet-example/transforms/language/html2parquet/notebooks/html2parquet.ipynb), with an added explanation of what each cell does. Then, in the next iteration, all the niceties (a diagram, being able to run both locally and on Colab, nicer formatting, etc.) can come. This decision is also influenced by the discussions we have had with Tsuzuku-san (PR #790) and Michele (PR #800) about adding some example code to the README files that they have finished, and by their argument that adding such code is redundant if a notebook will be added.

sujee · 2024-11-15T01:18:47Z

Yes, 100%. Let's not get bogged down in creating the 'perfect template'. We have some good examples already. We can iterate quickly.

Bytes-Explorer · 2024-11-15T04:03:11Z

I agree, no need to boil the ocean. One small suggestion I have is to add some comments before each cell and a few lines at the top describing the functionality being demonstrated in the notebook. @shahrokhDaijavad I believe you are already thinking of it.

I would suggest making the notebook Colab compatible if possible; that really helps when we do hands-on demos.

shahrokhDaijavad · 2024-11-15T06:18:20Z

Sure, @Bytes-Explorer. I like the idea of making each notebook Colab compatible, even in the first iteration.

touma-I · 2024-11-15T12:20:50Z

@shahrokhDaijavad @Bytes-Explorer I would keep the Colab requirement as a nice-to-have but not a must-have. I agree it is easy to do, but we might hit some issues down the road. I think the ask here is for the developer to show us how their code is used in a notebook with as few constraints as possible.

Bytes-Explorer · 2024-11-15T13:26:04Z

We have some different perspectives here, which is good, and it means we need some discussion on what the ROI from doing this work would be. I have started a thread on an internal channel, as that would be a good way to gather feedback from other users and from people who have done socialisation activities with DPK in the past.

shahrokhDaijavad added the enhancement New feature or request label Oct 29, 2024

touma-I assigned Bytes-Explorer Nov 5, 2024

touma-I assigned agoyal26 Nov 5, 2024

touma-I added the simplify-DPK label Nov 6, 2024

shahrokhDaijavad mentioned this issue Nov 11, 2024 in "Uniform documentation and example Notebooks for all transforms!" #753 (Open, 2 tasks)
