Update README docs for language transforms #800

Open
wants to merge 4 commits into
base: dev

dolfim-ibm
Member

Why are these changes needed?

Updates for the pdf2parquet, doc_chunk and text_encoder transforms.

Related issue number (if any).

#753

Signed-off-by: Michele Dolfi <[email protected]>
Collaborator

@dolfim-ibm @shahrokhDaijavad What do you think of adding a section like the one below to show how a user can invoke the transform after a pip install (as an alternative to cloning the repo):

import ast
import sys

from data_processing.runtime.pure_python import PythonTransformLauncher
from data_processing.utils import ParamsUtils
from pdf2parquet_transform_python import Pdf2ParquetPythonTransformConfiguration

# Local input and output folders for the transform
local_conf = {
    "input_folder": "input",
    "output_folder": "output",
}
params = {
    "data_local_config": ParamsUtils.convert_to_ast(local_conf),
    "data_files_to_use": ast.literal_eval("['.pdf','.docx','.pptx','.zip']"),
}
# Pass the parameters to the launcher as command-line arguments
sys.argv = ParamsUtils.dict_to_req(d=params)
launcher = PythonTransformLauncher(runtime_config=Pdf2ParquetPythonTransformConfiguration())
launcher.launch()

Member

@shahrokhDaijavad shahrokhDaijavad Nov 14, 2024

Nice job, @dolfim-ibm! The README files for all three transforms look great, and they follow the template.

What @touma-I is suggesting is to add these lines of code to the "Code example" section, which currently links to the upcoming Notebook example. These lines, together with the pip install, will be used in the Notebook, but they could also be used in a Python example that is not a Notebook. I am OK either way: 1) wait for the Notebook, or 2) add the lines now.

@dolfim-ibm Please don't pick option 1 just because it is easier for you! Maroun's question is how useful it would be to have these lines.

Member Author

@touma-I @shahrokhDaijavad I was actually already adding the code block, but then I realized it was exactly, one-to-one, the content of the example script. Rather than maintaining multiple copies of it (with a high risk of them becoming outdated), I think linking to the example is still OK.

Honestly, I think the best approach is to plan for a documentation engine that can embed working code examples, and to ensure in CI that those examples are actually executed.
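
To make the CI idea concrete, here is a minimal sketch of what such a check might look like, assuming the README examples live in fenced `python` blocks. The helper names (`extract_python_blocks`, `run_blocks`) are hypothetical and not part of this repository or of any documentation engine; a real setup would more likely rely on an existing doc-testing tool.

```python
import re

# Built at runtime so this snippet can itself sit inside a fenced block.
TICKS = "`" * 3

# Matches a fenced block opened with the "python" info string (assumption:
# README examples use exactly this fence style, starting at column 0).
BLOCK_RE = re.compile(
    rf"^{TICKS}python[ \t]*\n(.*?)^{TICKS}[ \t]*$",
    re.DOTALL | re.MULTILINE,
)


def extract_python_blocks(markdown: str) -> list[str]:
    """Return the source of every fenced python block, in document order."""
    return [match.group(1) for match in BLOCK_RE.finditer(markdown)]


def run_blocks(markdown: str) -> list[tuple[int, str]]:
    """Execute each block in a fresh namespace and collect the failures.

    Returns a list of (block_index, error_message); an empty list means
    every example ran without raising.
    """
    failures = []
    for index, source in enumerate(extract_python_blocks(markdown)):
        namespace = {"__name__": "__readme_example__"}
        try:
            code = compile(source, f"<readme-block-{index}>", "exec")
            exec(code, namespace)
        except Exception as exc:  # report the failure instead of aborting
            failures.append((index, f"{type(exc).__name__}: {exc}"))
    return failures
```

A CI job could read each transform's README, call `run_blocks` on its text, and fail the build if the returned list is non-empty. Because this executes whatever code is in the README, it should only be run on trusted content from the repository itself.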


3 participants