Extracting text from all slides #8

nleroy917 · 2024-05-13T23:40:55Z

Hello! I know you are not actively maintaining this crate anymore, and that's ok, but I was trying to use it to extract some text from PowerPoint decks. I'm really interested in a pure-Rust implementation for speed and portability (I'm trying to compile to WASM so it can run in the browser).

Anyways... I got pretty close, but I am stuck trying to match on TextRun enums. Here is my code:

    fn extract(data: &[u8]) -> Result<String, anyhow::Error> {
        // create temp file to read from
        let mut file = NamedTempFile::new()?;
        file.write_all(data)?;

        // read pptx file
        let path = file.into_temp_path();
        let pptx = PPTXDocument::from_file(&path).unwrap();

        // start with empty text
        let mut text = String::new();

        // iterate over slides
        for (_, slide) in &pptx.slide_map {
            for shape in slide.common_slide_data.shape_tree.shape_array.iter() {
                match shape {
                    msoffice_pptx::pml::ShapeGroup::Shape(s) => {
                        match &s.text_body {
                            Some(text) => {
                                for paragraph in text.paragraph_array.iter() {
                                    for text_run in paragraph.text_run_list.iter() {
                                        //
                                        // I am stuck here.. can't import the proper enum to match on
                                        //
                                        match text_run {
                                            _ => ()
                                        }
                                    }
                                }
                            },
                            None => ()
                        }
                    },
                    msoffice_pptx::pml::ShapeGroup::GroupShape(_) => todo!(),
                    msoffice_pptx::pml::ShapeGroup::GraphicFrame(_) => (),
                    msoffice_pptx::pml::ShapeGroup::Connector(_) => (),
                    msoffice_pptx::pml::ShapeGroup::Picture(_) => (),
                    msoffice_pptx::pml::ShapeGroup::ContentPart(_) => (),
                }
            }
        }
        Ok("".to_string())
    }

It seems like the proper enums are inside `msoffice_shared`... but I can't import them.

Any help is appreciated!!!

Isaac-D-Cohen · 2024-05-14T19:13:50Z

Major coincidence, but I happen to be trying the same exact thing!! I have a script that I use to search slides in my classes at school (so when my professor claims she covered something in class that wasn't covered, I can call her BS) and yesterday it occurred to me that maybe I can make it better by translating it to Rust. I found this crate, but I also can't get it to work, for a different reason.

First, here's a solution to your issue: the `TextRun` type resides in a different crate. Run `cargo add msoffice_shared` and then put `use msoffice_shared::drawingml::TextRun;` at the top of your file. You can then match like this:

match &s.text_body {
    Some(text) => {
        for paragraph in text.paragraph_array.iter() {
            for text_run in paragraph.text_run_list.iter() {
                match text_run {
                    TextRun::RegularTextRun(r) => println!("{}", r.text),
                    // line breaks carry no text
                    TextRun::LineBreak(_) => (),
                    // a field's text is optional
                    TextRun::TextField(t) => match &t.text {
                        Some(text) => println!("{}", text),
                        None => (),
                    },
                }
            }
        }
    },
    None => (),
}
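To return the text instead of printing it, the same match can append to a `String`. The sketch below is only an illustration: the `RegularTextRun`, `TextField`, and `TextRun` types are local stand-ins that mirror the variant and field names used in this thread, not the real `msoffice_shared::drawingml` definitions (for example, the stand-in `LineBreak` carries no payload), and `push_paragraph` is a hypothetical helper name.

```rust
// Local stand-ins for illustration only. The real types live in
// msoffice_shared::drawingml and may differ in detail.
struct RegularTextRun {
    text: String,
}

struct TextField {
    text: Option<String>,
}

enum TextRun {
    RegularTextRun(RegularTextRun),
    LineBreak, // simplified: the real variant carries a payload
    TextField(TextField),
}

/// Appends the text of every run in one paragraph to `out`,
/// then ends the paragraph with a newline.
fn push_paragraph(runs: &[TextRun], out: &mut String) {
    for run in runs {
        match run {
            TextRun::RegularTextRun(r) => out.push_str(&r.text),
            // treat an explicit line break as a newline
            TextRun::LineBreak => out.push('\n'),
            // a field may have no text at all
            TextRun::TextField(f) => {
                if let Some(t) = &f.text {
                    out.push_str(t);
                }
            }
        }
    }
    out.push('\n');
}
```

Calling a helper like this once per paragraph, with one shared `String`, would let `extract` return the accumulated text rather than an empty string.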

Now, I have a test slide that seems to make it panic. See the attachment test3.pptx. Would anyone know why? It's a regular slide from one of my classes.

Also, there is a fork of this repo that updates the dependencies. Can the original author pull in those changes?

Isaac-D-Cohen · 2024-05-14T19:41:07Z

I see on another thread @dam4rus says this repo is no longer maintained. That's sad. We may be out of luck with that broken test case.

nleroy917 · 2024-05-14T20:35:51Z

> First, here's a solution to your issue: The TextRun type resides in a different crate. Do `cargo add msoffice_shared` and then put `use msoffice_shared::drawingml::TextRun;` at the top of your file. The type there is TextRun. You can match like this:

Yeah, I noticed this and tried adding that crate... but it didn't seem to work. I probably had something wrong; it was really late.

Regardless, I ended up abandoning this and wrote my own solution. I found a crate called dotext, which extracts text from a bunch of file formats, but it also seems unmaintained. However, it claims to support .pptx files, so I wanted to see how it was done there. If you look at the code, it's rather simple, and it uses two (decently) well-maintained crates: zip and xml. This is the code:

impl MsDoc<Pptx> for Pptx {
    fn open<P: AsRef<Path>>(path: P) -> io::Result<Pptx> {
        let file = File::open(path.as_ref())?;
        let mut archive = ZipArchive::new(file)?;

        let mut xml_data = String::new();

        for i in 0..archive.len() {
            let mut c_file = archive.by_index(i).unwrap();
            if c_file.name().starts_with("ppt/slides") {
                let mut _buff = String::new();
                c_file.read_to_string(&mut _buff);
                xml_data += _buff.as_str();
            }
        }

        let mut buf = Vec::new();
        let mut txt = Vec::new();

        if xml_data.len() > 0 {
            let mut to_read = false;
            let mut xml_reader = Reader::from_str(xml_data.as_ref());
            loop {
                match xml_reader.read_event(&mut buf) {
                    Ok(Event::Start(ref e)) => match e.name() {
                        b"a:p" => {
                            to_read = true;
                            txt.push("\n".to_string());
                        }
                        b"a:t" => {
                            to_read = true;
                        }
                        _ => (),
                    },
                    Ok(Event::Text(e)) => {
                        if to_read {
                            let text = e.unescape_and_decode(&xml_reader).unwrap();
                            txt.push(text);
                            to_read = false;
                        }
                    }
                    Ok(Event::Eof) => break, // exits the loop when reaching end of file
                    Err(e) => {
                        return Err(io::Error::new(
                            io::ErrorKind::Other,
                            format!(
                                "Error at position {}: {:?}",
                                xml_reader.buffer_position(),
                                e
                            ),
                        ))
                    }
                    _ => (),
                }
            }
        }

        Ok(Pptx {
            path: path.as_ref().to_path_buf(),
            data: Cursor::new(txt.join("")),
        })
    }
}

So I just copied that over and tweaked it until I got it to work with my setup, and dropped this crate since it's unmaintained. Perhaps you could do the same.
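The core of that approach is small: start a new line at every `a:p` paragraph and keep the character data inside every `a:t` element. Here is a rough std-only sketch of that idea for a single slide's XML string. It uses plain string scanning rather than a real XML parser, so it assumes no comments, CDATA sections, numeric character references, or `>` inside attribute values; `slide_text` and `unescape` are hypothetical names, and this is not the dotext implementation.

```rust
/// Replaces the five predefined XML entities. `&amp;` goes last so that
/// an escaped ampersand is not unescaped twice.
fn unescape(s: &str) -> String {
    s.replace("&lt;", "<")
        .replace("&gt;", ">")
        .replace("&quot;", "\"")
        .replace("&apos;", "'")
        .replace("&amp;", "&")
}

/// Collects the text of every `<a:t>` element, separating `<a:p>`
/// paragraphs with a newline.
fn slide_text(xml: &str) -> String {
    let mut out = String::new();
    let mut rest = xml;
    while let Some(lt) = rest.find('<') {
        rest = &rest[lt + 1..];
        let gt = match rest.find('>') {
            Some(i) => i,
            None => break, // unterminated tag: stop scanning
        };
        let tag = &rest[..gt];
        rest = &rest[gt + 1..];
        // Element name is the tag up to whitespace or '/'.
        // Closing tags start with '/', so their name comes out empty.
        let name = tag
            .split(|c: char| c.is_whitespace() || c == '/')
            .next()
            .unwrap_or("");
        match name {
            // every paragraph after the first starts on a new line
            "a:p" if !out.is_empty() => out.push('\n'),
            // non-empty text element: copy everything up to its end tag
            "a:t" if !tag.ends_with('/') => {
                let end = rest.find("</a:t>").unwrap_or(rest.len());
                out.push_str(&unescape(&rest[..end]));
                rest = &rest[end..];
            }
            _ => {}
        }
    }
    out
}
```

Matching the exact element name keeps `a:pPr` and `a:tab` from being mistaken for `a:p` and `a:t`. For real files a proper XML parser is the safer choice.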

Alternatively... I wonder if you could just convert the PowerPoint to a PDF and then use pdf-extract to get the text out for you. That one is well-maintained and has worked great for me so far.

Isaac-D-Cohen · 2024-05-14T21:37:41Z

Thanks for the suggestion! After messing around a bit more, it seems like the PPTX standard is dauntingly complicated. Text on a slide could be anywhere in the extracted directory tree of XMLs. How are you going about searching all the possible locations? In some of my test cases I have text in `diagrams/drawing*.xml`. Do you systematically search? Or do you just search all files?
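One middle ground between "only `ppt/slides/`" and "every file" is an explicit filter over the archive's entry names. The sketch below is a heuristic under stated assumptions, not a complete map of where a PPTX can hold text: it accepts slide parts plus the `ppt/diagrams/drawing*.xml` parts mentioned above, and the function names are hypothetical. It also sorts slides numerically, since archive order is not guaranteed to follow slide order and a plain `starts_with("ppt/slides")` check also matches the `ppt/slides/_rels/*.rels` files.

```rust
/// Returns the slide number if `name` is a slide part such as
/// `ppt/slides/slide12.xml`. Relationship files under
/// `ppt/slides/_rels/` are rejected.
fn slide_number(name: &str) -> Option<u32> {
    name.strip_prefix("ppt/slides/slide")?
        .strip_suffix(".xml")?
        .parse()
        .ok()
}

/// Heuristic: does this archive entry plausibly hold slide text?
/// Other directories could be added here in the same way.
fn may_contain_text(name: &str) -> bool {
    slide_number(name).is_some()
        || (name.starts_with("ppt/diagrams/drawing") && name.ends_with(".xml"))
}

/// Keeps only the text-bearing entries, slides first in numeric order,
/// then everything else alphabetically.
fn text_parts<'a>(names: &[&'a str]) -> Vec<&'a str> {
    let mut parts: Vec<&str> = names
        .iter()
        .copied()
        .filter(|n| may_contain_text(n))
        .collect();
    parts.sort_by_key(|n| (slide_number(n).is_none(), slide_number(n), *n));
    parts
}
```

The entry names would come from whatever zip reader is in use; each selected part can then be passed through the same `a:t` extraction.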

Isaac-D-Cohen · 2024-05-14T21:47:49Z

I decided that maybe I can just use your crate. But yours doesn't find text that isn't directly in the ppt/slides/ directory either. Is it supposed to?

nleroy917 · 2024-05-14T22:24:06Z

Probably! I opened an issue over on my crate for this: nleroy917/textractor#1

We can discuss there so we aren't polluting the discussion on this repo...

dam4rus · 2024-05-15T06:10:02Z

Thanks for opening the issue. As @Isaac-D-Cohen already mentioned, this repo is and will remain unmaintained, since I originally planned to move the whole code base to a common Office Open XML file format deserializer, which can be found here: https://github.com/dam4rus/oox-rs. If I ever continue working on this, it will be in that repository. I abandoned these crates because I no longer work with office documents and there was not much interest in them from the Rust community. I have seen some interest lately (Rust and WASM are getting more popular, I guess), but these crates would need some major refactoring, so I do not think I will restart working on them.

Anyway, good luck on your endeavours with your crate, since this can be a pretty complex problem to solve.
