This repository has been archived by the owner on May 15, 2024. It is now read-only.

Extracting text from all slides #8

Open
nleroy917 opened this issue May 13, 2024 · 7 comments

Comments

@nleroy917

Hello! I know you are not actively maintaining this crate anymore, and that's ok, but I was trying to use it to extract some text from PowerPoint decks. I'm really interested in a pure-Rust implementation for speed + portability (I'm trying to compile to WASM and run it in the browser).

Anyways... I got pretty close, but I am stuck trying to match on TextRun enums. Here is my code:

    fn extract(data: &[u8]) -> Result<String, anyhow::Error> {
        // create temp file to read from
        let mut file = NamedTempFile::new()?;
        file.write_all(data)?;

        // read pptx file
        let path = file.into_temp_path();
        let pptx = PPTXDocument::from_file(&path).unwrap();

        // start with empty text
        let mut text = String::new();

        // iterate over slides
        for (_, slide) in &pptx.slide_map {
            for shape in slide.common_slide_data.shape_tree.shape_array.iter() {
                match shape {
                    msoffice_pptx::pml::ShapeGroup::Shape(s) => {
                        match &s.text_body {
                            Some(text) => {
                                for paragraph in text.paragraph_array.iter() {
                                    for text_run in paragraph.text_run_list.iter() {
                                        //
                                        // I am stuck here.. can't import the proper enum to match on
                                        //
                                        match text_run {
                                            _ => ()
                                        }
                                    }
                                }
                            },
                            None => ()
                        }
                    },
                    msoffice_pptx::pml::ShapeGroup::GroupShape(_) => todo!(),
                    msoffice_pptx::pml::ShapeGroup::GraphicFrame(_) => (),
                    msoffice_pptx::pml::ShapeGroup::Connector(_) => (),
                    msoffice_pptx::pml::ShapeGroup::Picture(_) => (),
                    msoffice_pptx::pml::ShapeGroup::ContentPart(_) => (),
                }
            }
        }
        Ok("".to_string())
    }

It seems like the proper enums are inside `msoffice_shared`... but I can't import them.

Any help is appreciated!!!
@Isaac-D-Cohen

Isaac-D-Cohen commented May 14, 2024

Major coincidence, but I happen to be trying the same exact thing!! I have a script that I use to search slides in my classes at school (so when my professor claims she covered something in class that wasn't covered, I can call her BS) and yesterday it occurred to me that maybe I can make it better by translating it to Rust. I found this crate, but I also can't get it to work, for a different reason.

First, here's a solution to your issue: The `TextRun` type resides in a different crate. Do `cargo add msoffice_shared` and then put `use msoffice_shared::drawingml::TextRun;` at the top of your file. The type there is `TextRun`. You can match like this:

match &s.text_body {
    Some(text) => {
        for paragraph in text.paragraph_array.iter() {
            for text_run in paragraph.text_run_list.iter() {
                match text_run {
                    TextRun::RegularTextRun(r) => println!("{}", r.text),
                    // line breaks carry no text
                    TextRun::LineBreak(_) => (),
                    TextRun::TextField(t) => match &t.text {
                        Some(text) => println!("{}", text),
                        None => (),
                    },
                }
            }
        }
    }
    None => (),
}
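If you want the text back as a `String` rather than printed, the same match can push into a buffer. Here is a minimal sketch of that idea. Note that `RegularTextRun`, `LineBreak`, `TextField` and `TextRun` below are hypothetical stand-ins I wrote so the snippet compiles on its own; they only mirror the shape of the enum described above and are not the real `msoffice_shared` definitions. `push_paragraph` is likewise a made-up helper name.

```rust
// Hypothetical stand-ins for the msoffice_shared::drawingml types.
// Only the enum shape from the match above is mirrored here.
struct RegularTextRun {
    text: String,
}
struct LineBreak;
struct TextField {
    text: Option<String>,
}
enum TextRun {
    RegularTextRun(RegularTextRun),
    LineBreak(LineBreak),
    TextField(TextField),
}

/// Appends the text of one paragraph's runs to `out`, then a newline.
fn push_paragraph(runs: &[TextRun], out: &mut String) {
    for run in runs {
        match run {
            TextRun::RegularTextRun(r) => out.push_str(&r.text),
            // Treat an explicit line break as a newline in the output.
            TextRun::LineBreak(_) => out.push('\n'),
            TextRun::TextField(t) => {
                if let Some(text) = &t.text {
                    out.push_str(text);
                }
            }
        }
    }
    // Separate paragraphs with a newline.
    out.push('\n');
}
```

In the original `extract` function you would call something like this once per paragraph and return the accumulated buffer instead of `Ok("".to_string())`.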

Now, I have a test slide that seems to make it panic. See attachment test3.pptx. Would anyone know why? It's a regular slide from one of my classes.

Also, there is a fork of this repo that updates the dependencies. Can the original author pull in those changes?

@Isaac-D-Cohen

I see on another thread @dam4rus says this repo is no longer maintained. That's sad. We may be out of luck with that broken test case.

@nleroy917
Author

> First, here's a solution to your issue: The `TextRun` type resides in a different crate. Do `cargo add msoffice_shared` and then put `use msoffice_shared::drawingml::TextRun;` at the top of your file. The type there is `TextRun`. You can match like this:

Yeah, I noticed this and tried adding that crate... but it didn't seem to work. I probably had something wrong; it was really late.

Regardless, I ended up abandoning this and wrote my own solution. I found a crate called dotext, which extracts text from a bunch of file formats, but it also seems unmaintained. However, it claims to support .pptx files, so I wanted to see how it was done there. If you look at the code, it's rather simple, and it uses two (decently) well-maintained crates: zip and xml. This is the code:

impl MsDoc<Pptx> for Pptx {
    fn open<P: AsRef<Path>>(path: P) -> io::Result<Pptx> {
        let file = File::open(path.as_ref())?;
        let mut archive = ZipArchive::new(file)?;

        let mut xml_data = String::new();

        for i in 0..archive.len() {
            let mut c_file = archive.by_index(i).unwrap();
            if c_file.name().starts_with("ppt/slides") {
                let mut _buff = String::new();
                c_file.read_to_string(&mut _buff);
                xml_data += _buff.as_str();
            }
        }

        let mut buf = Vec::new();
        let mut txt = Vec::new();

        if xml_data.len() > 0 {
            let mut to_read = false;
            let mut xml_reader = Reader::from_str(xml_data.as_ref());
            loop {
                match xml_reader.read_event(&mut buf) {
                    Ok(Event::Start(ref e)) => match e.name() {
                        b"a:p" => {
                            to_read = true;
                            txt.push("\n".to_string());
                        }
                        b"a:t" => {
                            to_read = true;
                        }
                        _ => (),
                    },
                    Ok(Event::Text(e)) => {
                        if to_read {
                            let text = e.unescape_and_decode(&xml_reader).unwrap();
                            txt.push(text);
                            to_read = false;
                        }
                    }
                    Ok(Event::Eof) => break, // exits the loop when reaching end of file
                    Err(e) => {
                        return Err(io::Error::new(
                            io::ErrorKind::Other,
                            format!(
                                "Error at position {}: {:?}",
                                xml_reader.buffer_position(),
                                e
                            ),
                        ))
                    }
                    _ => (),
                }
            }
        }

        Ok(Pptx {
            path: path.as_ref().to_path_buf(),
            data: Cursor::new(txt.join("")),
        })
    }
}

So I just copied that over and tweaked it until I got it to work with my setup, and dropped this crate since it's unmaintained. Perhaps you could do the same.
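To make the core idea concrete without pulling in any crates: the interesting part is just "start a new line at every `a:p`, and collect the text inside every `a:t`". Below is a rough std-only sketch of that step on a single slide's XML string. `slide_text` and `unescape` are names I made up for illustration, and the hand-rolled tag scanning is an assumption-laden simplification (no CDATA, no numeric character references, no `>` inside attribute values); a real XML reader, as in the code above, is the safer choice.

```rust
/// Replaces the five predefined XML entities. `&amp;` goes last so that
/// an escaped ampersand is not unescaped twice.
fn unescape(s: &str) -> String {
    s.replace("&lt;", "<")
        .replace("&gt;", ">")
        .replace("&quot;", "\"")
        .replace("&apos;", "'")
        .replace("&amp;", "&")
}

/// Collects the text of every `<a:t>` element in one slide's XML,
/// starting a new line at each `<a:p>` paragraph after the first.
fn slide_text(xml: &str) -> String {
    let mut out = String::new();
    let mut rest = xml;
    while let Some(lt) = rest.find('<') {
        let after = &rest[lt + 1..];
        let Some(gt) = after.find('>') else { break };
        let tag = &after[..gt]; // e.g. `a:rPr lang="en-US"/` or `/a:t`
        let body = &after[gt + 1..]; // everything after this tag
        // Element name: text up to the first space or '/'.
        // Closing tags start with '/', so their name comes out empty.
        let name = tag
            .split(|c: char| c.is_whitespace() || c == '/')
            .next()
            .unwrap_or("");
        let self_closing = tag.ends_with('/');
        if name == "a:p" && !self_closing {
            if !out.is_empty() {
                out.push('\n');
            }
            rest = body;
        } else if name == "a:t" && !self_closing {
            // Take everything up to the matching close tag as text.
            let end = body.find("</a:t>").unwrap_or(body.len());
            out.push_str(&unescape(&body[..end]));
            rest = &body[end..];
        } else {
            rest = body;
        }
    }
    out
}
```

You would call `slide_text` once per slide XML file read out of the archive and join the results.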

Alternatively... I wonder if you could just convert the PowerPoint to a PDF and then use pdf-extract to get the text out for you. That one is well-maintained and has worked great for me so far.

@Isaac-D-Cohen

Thanks for the suggestion! After messing around a bit more, it seems like the PPTX standard is dauntingly complicated. Text on a slide could be anywhere in the extracted directory tree of XMLs. How are you going about searching all the possible locations? In some of my test cases I have text in the diagrams/drawing*.xml. Do you systematically search? Or do you just search all files?

@Isaac-D-Cohen

Isaac-D-Cohen commented May 14, 2024

I decided that maybe I can just use your crate. But yours doesn't find text that isn't directly in the ppt/slides/ directory either. Is it supposed to?
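One middle ground between "only ppt/slides/" and "search every file" would be a filter over the zip entry names. This is only a sketch: the list of prefixes is my assumption about where text tends to live (slides, the diagrams directory I mentioned above, and speaker notes), not something taken from the spec, and `may_contain_text` is a hypothetical helper name.

```rust
/// Returns true if a zip entry name looks like an XML part that may hold
/// slide text. The prefix list is an assumption; extend it as needed.
fn may_contain_text(entry_name: &str) -> bool {
    const PREFIXES: [&str; 3] = ["ppt/slides/", "ppt/diagrams/", "ppt/notesSlides/"];
    entry_name.ends_with(".xml")
        // Relationship parts live under _rels/ and carry no slide text.
        && !entry_name.contains("/_rels/")
        && PREFIXES.iter().any(|p| entry_name.starts_with(p))
}
```

A check like this could replace the `starts_with("ppt/slides")` test in the loop over archive entries.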

@nleroy917
Author

Probably! I opened an issue over on my crate for this: nleroy917/textractor#1

We can discuss there so we aren't polluting the discussion on this repo...

@dam4rus
Owner

dam4rus commented May 15, 2024

Thanks for opening the issue. As @Isaac-D-Cohen already mentioned, this repo is and will remain unmaintained. I originally planned to move the whole code base to a common Office Open XML file format deserializer (https://github.com/dam4rus/oox-rs). If I ever continue working on this, it will be in that repository. I abandoned these crates because I no longer work with office documents and there was not much interest in them from the Rust community. I have seen some interest lately (Rust and WASM are getting more popular, I guess), but these crates would need some major refactoring, so I do not think I will restart work on them.

Anyway, good luck on your endeavours with your crate, since this can be a pretty complex problem to solve.
