Extracting text from all slides #8
Comments
Major coincidence, but I happen to be trying the same exact thing!! I have a script that I use to search slides in my classes at school (so when my professor claims she covered something in class that wasn't covered, I can call her BS) and yesterday it occurred to me that maybe I can make it better by translating it to Rust. I found this crate, but I also can't get it to work, for a different reason. First, here's a solution to your issue: The TextRun type resides in a different crate. Do
Now, I have a test slide that seems to make it panic. See attachment test3.pptx. Would anyone know why? It's a regular slide from one of my classes. Also, there is a fork of this repo that updates the dependencies. Can the original author pull in those changes?
I see on another thread that @dam4rus says this repo is no longer maintained. That's sad. We may be out of luck with that broken test case.
Yeah, I noticed this and tried adding that crate... but it didn't seem to work. I probably had something wrong; it was really late. Regardless, I ended up abandoning this and wrote my own solution. I found this crate called
impl MsDoc<Pptx> for Pptx {
    fn open<P: AsRef<Path>>(path: P) -> io::Result<Pptx> {
        let file = File::open(path.as_ref())?;
        let mut archive = ZipArchive::new(file)?;
        let mut xml_data = String::new();
        for i in 0..archive.len() {
            let mut c_file = archive.by_index(i).unwrap();
            if c_file.name().starts_with("ppt/slides") {
                let mut _buff = String::new();
                c_file.read_to_string(&mut _buff);
                xml_data += _buff.as_str();
            }
        }
        let mut buf = Vec::new();
        let mut txt = Vec::new();
        if xml_data.len() > 0 {
            let mut to_read = false;
            let mut xml_reader = Reader::from_str(xml_data.as_ref());
            loop {
                match xml_reader.read_event(&mut buf) {
                    Ok(Event::Start(ref e)) => match e.name() {
                        b"a:p" => {
                            to_read = true;
                            txt.push("\n".to_string());
                        }
                        b"a:t" => {
                            to_read = true;
                        }
                        _ => (),
                    },
                    Ok(Event::Text(e)) => {
                        if to_read {
                            let text = e.unescape_and_decode(&xml_reader).unwrap();
                            txt.push(text);
                            to_read = false;
                        }
                    }
                    Ok(Event::Eof) => break, // exits the loop when reaching end of file
                    Err(e) => {
                        return Err(io::Error::new(
                            io::ErrorKind::Other,
                            format!(
                                "Error at position {}: {:?}",
                                xml_reader.buffer_position(),
                                e
                            ),
                        ))
                    }
                    _ => (),
                }
            }
        }
        Ok(Pptx {
            path: path.as_ref().to_path_buf(),
            data: Cursor::new(txt.join("")),
        })
    }
}
So I just copied that over and tweaked it until it worked with my setup, and dropped this crate since it's unmaintained. Perhaps you could do the same. Alternatively... I wonder if you could just convert the PowerPoint to a PDF and then use
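For anyone who wants to try the idea without pulling in any crates, here is a minimal, std-only sketch of the same approach: walk the tags in one slide's XML, start a new line at each `a:p` paragraph, and collect the text inside `a:t` runs. The function names `extract_slide_text` and `unescape_basic` are made up for illustration, and this is not the code from the snippet above. It assumes the slide XML has already been read out of the archive into a string, and it does not handle comments, CDATA, numeric character references, or a `>` inside an attribute value, so a real XML parser is the safer choice for anything beyond experiments.

```rust
/// Decode the five predefined XML entities. Numeric character references
/// (e.g. `&#233;`) are left untouched in this sketch.
fn unescape_basic(s: &str) -> String {
    s.replace("&lt;", "<")
        .replace("&gt;", ">")
        .replace("&quot;", "\"")
        .replace("&apos;", "'")
        .replace("&amp;", "&") // last, so "&amp;lt;" becomes "&lt;" and not "<"
}

/// Collect the text of every `<a:t>` run in one slide's XML, starting a new
/// line at each `<a:p>` paragraph. Hypothetical helper, not a library API.
fn extract_slide_text(xml: &str) -> String {
    let mut out = String::new();
    let mut in_run = false; // true while we are between <a:t> and </a:t>
    let mut rest = xml;

    while let Some(lt) = rest.find('<') {
        // Anything before the next tag is character data; keep it only
        // when it sits inside a text run.
        if in_run {
            out.push_str(&unescape_basic(&rest[..lt]));
        }

        let after = &rest[lt + 1..];
        let Some(gt) = after.find('>') else { break };
        let tag = &after[..gt];

        let closing = tag.starts_with('/');
        let self_closing = tag.ends_with('/');
        // Exact element name, so `a:pPr` or `a:tbl` are not mistaken
        // for `a:p` or `a:t`.
        let name = tag
            .trim_start_matches('/')
            .split(|c: char| c.is_whitespace() || c == '/')
            .next()
            .unwrap_or("");

        match (name, closing) {
            ("a:p", false) => {
                // Separate paragraphs with a newline, but do not lead with one.
                if !out.is_empty() {
                    out.push('\n');
                }
            }
            ("a:t", false) => in_run = !self_closing,
            ("a:t", true) => in_run = false,
            _ => {}
        }

        rest = &after[gt + 1..];
    }
    out
}
```

Calling `extract_slide_text` once per slide part and joining the results keeps slides separate, instead of concatenating every XML file into one string before parsing.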
Thanks for the suggestion! After messing around a bit more, it seems like the PPTX standard is dauntingly complicated. Text on a slide could be anywhere in the extracted directory tree of XMLs. How are you going about searching all the possible locations? In some of my test cases I have text in the diagrams/drawing*.xml. Do you systematically search? Or do you just search all files?
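One possible middle ground between "search only ppt/slides/" and "search every file" is to filter archive entry names against a short list of directories that tend to hold text. The sketch below only works on entry names, so it needs no crates. The prefix list and the names `TEXT_PART_PREFIXES`, `may_contain_text`, and `select_text_parts` are assumptions for illustration: the list is based on the directories mentioned in this thread plus speaker notes, and it is not a complete survey of where PowerPoint can store text.

```rust
/// Hypothetical list of archive directories whose XML parts may carry text.
/// An assumption for illustration, not an exhaustive list.
const TEXT_PART_PREFIXES: [&str; 3] = [
    "ppt/slides/",      // the slides themselves
    "ppt/notesSlides/", // speaker notes
    "ppt/diagrams/",    // diagram parts such as drawing*.xml
];

/// True if the entry is an XML part under one of the listed directories.
/// Requiring the `.xml` suffix skips relationship files such as
/// `ppt/slides/_rels/slide1.xml.rels`.
fn may_contain_text(entry_name: &str) -> bool {
    entry_name.ends_with(".xml")
        && TEXT_PART_PREFIXES
            .iter()
            .any(|prefix| entry_name.starts_with(prefix))
}

/// Keep only the entries worth scanning, preserving archive order.
fn select_text_parts<'a>(entry_names: &[&'a str]) -> Vec<&'a str> {
    entry_names
        .iter()
        .copied()
        .filter(|name| may_contain_text(name))
        .collect()
}
```

A stricter approach would follow each slide's relationship file to find the parts it actually references, at the cost of parsing the `.rels` files as well.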
I decided that maybe I can just use your crate. But yours doesn't find text that isn't directly in the ppt/slides/ directory either. Is it supposed to?
Probably! I opened an issue over on my crate for this: nleroy917/textractor#1 We can discuss there so we aren't polluting the discussion on this repo...
Thanks for opening the issue. As @Isaac-D-Cohen already mentioned, this repo is and will remain unmaintained, since I originally planned to move the whole code base to a common Office Open XML file format deserializer, which can be found here: https://github.com/dam4rus/oox-rs. If I ever continue working on this, it will be in that repository. I abandoned these crates because I no longer work with office documents and there was not much interest in them from the Rust community. I have seen some interest nowadays (Rust and WASM are getting more popular, I guess), but these crates would need some major refactoring, so I do not think I will restart working on them. Anyway, good luck on your endeavours with your crate, since this can be a pretty complex problem to solve.
Hello! I know you are not actively maintaining this crate anymore, and that's ok, but I was trying to use it to extract some text from PowerPoint decks. I'm really interested in a pure-Rust implementation for speed + portability (trying to just compile to WASM to do it in the browser).
Anyways... I got pretty close, but I am stuck trying to match on TextRun enums. Here is my code: