Optimize powerpoint file extraction #1

Open
nleroy917 opened this issue May 14, 2024 · 3 comments
Labels
enhancement New feature or request

Comments

@nleroy917
Owner

The PowerPoint text extraction leaves a lot to be desired. It's oversimplified and doesn't find text that isn't directly in the ppt/slides/ directory. Should it look in other parts of the file too?
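
For illustration, here is a minimal sketch of what "looking beyond ppt/slides/" could mean at the ZIP-entry level. It is not the crate's actual code. The function names `may_contain_text` and `candidate_parts` are hypothetical, and the prefix list is an assumption based on the part names PowerPoint conventionally writes (speaker notes, SmartArt data, charts). It is not exhaustive, and a strict implementation would follow the package relationships instead of guessing from paths.

```rust
/// Returns true if a ZIP entry name inside a .pptx looks like a part that
/// may hold user-visible text. The prefixes are conventional part names
/// (an assumption), not a complete list taken from the specification.
fn may_contain_text(entry_name: &str) -> bool {
    const TEXT_PART_PREFIXES: [&str; 4] = [
        "ppt/slides/",      // slide bodies
        "ppt/notesSlides/", // speaker notes
        "ppt/diagrams/",    // SmartArt data
        "ppt/charts/",      // chart titles and labels
    ];
    // Relationship files end in ".rels", so the ".xml" check also skips them.
    entry_name.ends_with(".xml")
        && TEXT_PART_PREFIXES
            .iter()
            .any(|prefix| entry_name.starts_with(prefix))
}

/// Filters a list of archive entry names down to the candidate text parts,
/// preserving their original order.
fn candidate_parts<'a>(entry_names: &[&'a str]) -> Vec<&'a str> {
    entry_names
        .iter()
        .copied()
        .filter(|name| may_contain_text(name))
        .collect()
}
```

Slide layouts and masters are left out on purpose here, since they mostly carry placeholder prompts rather than the presentation's own text. Whether to include them is a design choice.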

@nleroy917
Owner Author

@Isaac-D-Cohen let me know if you have thoughts. Would love contributions! I'm winging all of this. Big Rust noob.

@Isaac-D-Cohen

Thanks for opening this issue! Yeah, I agree that it should find text anywhere on the slides. I also found an example of a slide with text directly on it that the text extraction feature doesn't find:
test4.pptx

I don't know how to go about this though. I'm an even bigger noob, having just learned Rust this spring semester in one of my classes. But I would guess we need to find out where in a PPTX text can legally be located. It seems really daunting though.
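
One piece of the "where can text live" question has a fairly compact answer: in DrawingML, the visible text of a run sits inside an `<a:t>` element, wherever the enclosing shape, table cell, or group appears. The sketch below is a naive, std-only scan for those elements. It is an illustration rather than the crate's implementation: `extract_text_runs` and `unescape` are hypothetical names, it assumes the conventional `a:` namespace prefix, and it only decodes the five predefined XML entities (no numeric character references). A real extractor would use a proper XML parser.

```rust
/// Decodes the five predefined XML entities. `&amp;` is handled last so
/// that an escaped ampersand is not decoded twice.
fn unescape(text: &str) -> String {
    text.replace("&lt;", "<")
        .replace("&gt;", ">")
        .replace("&quot;", "\"")
        .replace("&apos;", "'")
        .replace("&amp;", "&")
}

/// Collects the contents of every `<a:t>...</a:t>` element in `xml`,
/// in document order. Assumes the DrawingML namespace uses the `a:` prefix.
fn extract_text_runs(xml: &str) -> Vec<String> {
    let mut runs = Vec::new();
    let mut rest = xml;
    while let Some(start) = rest.find("<a:t") {
        let after_name = &rest[start + 4..];
        // Accept only `<a:t>` or `<a:t attr=...>`; this skips `<a:tbl>`,
        // `<a:tc>`, and the attribute-less self-closing form `<a:t/>`.
        if !(after_name.starts_with('>') || after_name.starts_with(' ')) {
            rest = after_name;
            continue;
        }
        let Some(open_end) = after_name.find('>') else { break };
        // A self-closing tag with attributes has no text content.
        if after_name[..open_end].ends_with('/') {
            rest = &after_name[open_end + 1..];
            continue;
        }
        let body = &after_name[open_end + 1..];
        let Some(close) = body.find("</a:t>") else { break };
        runs.push(unescape(&body[..close]));
        rest = &body[close + "</a:t>".len()..];
    }
    runs
}
```

Because the scan ignores the surrounding structure, it picks up text in grouped shapes and table cells as well as plain text boxes, at the cost of losing paragraph boundaries.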

@nleroy917
Owner Author

> It seems really daunting though.

Yeah, the Open XML specification is absurd. To really figure it all out, we'd probably have to read the actual documentation for PresentationML.

Realistically, the extractor will just have to be updated incrementally, parsing more of the format as the crate matures.


2 participants