Llamaparse error when parsing docx file #1363
Checking this |
Fixed in #1364, for the workaround you have to pass |
My use case is loading data from a bucket, so I have a buffer (unlike my simplified example above). So I'm using Ex:
I don't believe there's a way to pass the filename to this call or to the |
I have to leave this ticket to the LlamaParse side; I cannot do anything more here. |
/cc @hexapode |
I think I should close this. I double-checked on StackBlitz that it now works. LlamaParse had an internal upgrade that fixes this. |
Please try this. If you have any more issues, please let me know: https://stackblitz.com/edit/stackblitz-starters-k137wi?file=index.js |
Awesome. Thank you! With default config it works. The Multi-modal version fails:
|
Could you give me the parameters and maybe some sample data? |
import { LlamaParseReader } from "llamaindex";
import fs from "fs";
import { ParserLanguages } from "@llamaindex/cloud/api/dist";

export type LlamaParseReaderParams = Partial<Omit<LlamaParseReader, "language" | "apiKey">> & {
  language?: ParserLanguages | ParserLanguages[] | undefined;
  apiKey?: string | undefined;
}

async function main() {
  const path = "/tmp/sample.docx";
  if (!fs.existsSync(path)) {
    console.error(`File ${path} does not exist`);
    process.exit(1);
  } else {
    console.log(`File ${path} exists`);
  }
  const apiKey = process.env.LLAMAINDEX_KEY;
  const vendorMultimodalApiKey = process.env.LI_ANTHROPIC_KEY;
  const params : LlamaParseReaderParams = {
    verbose: true,
    parsingInstruction: "Extract the text from the document along with any details of images and tables. This is a document for a course and a very detailed description of the contents of the images is important.",
    fastMode: false,
    gpt4oMode: false,
    useVendorMultimodalModel: true,
    vendorMultimodalModelName: "anthropic-sonnet-3.5",
    vendorMultimodalApiKey: vendorMultimodalApiKey,
    premiumMode: true,
    resultType: "markdown",
    apiKey: apiKey,
    doNotCache: true,
  };
  // set up the llamaparse reader
  const reader = new LlamaParseReader(params);
  const buffer = fs.readFileSync(path);
  const documents = await reader.loadDataAsContent(new Uint8Array(buffer));
  let allText = "";
  documents.forEach(doc => {
    allText += doc.text;
  });
  console.log(allText);
}
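// Log only when main() actually rejects; the original chained .then() ran on every exit and printed "error undefined".
main().catch((e) => {
  console.error("error", e);
});
Using this file: |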
I think this is the same issue: the docx is being parsed as a PDF. |
For now there's a workaround:
const magic = [80, 75, 3, 4];
let documents;
if (buffer[0] === magic[0] && buffer[1] === magic[1] && buffer[2] === magic[2] && buffer[3] === magic[3]) {
  documents = await reader.loadDataAsContent(new Uint8Array(buffer), 'filename.docx');
} |
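The byte check in the workaround above can be pulled into a small helper. This is a minimal sketch, not part of LlamaIndex: `hasZipSignature` and `filenameHint` are hypothetical names introduced here for illustration. The four bytes 80, 75, 3, 4 are the ZIP local file header signature ("PK\x03\x04"), so the check matches any ZIP-based file (docx, xlsx, pptx, plain zip), not only docx. The sketch assumes the caller already knows its ZIP-signed buffers are docx files.

```typescript
// ZIP local file header signature: "PK\x03\x04" (decimal 80, 75, 3, 4).
const ZIP_MAGIC = [0x50, 0x4b, 0x03, 0x04] as const;

/** Returns true when the buffer starts with the ZIP signature. */
function hasZipSignature(buffer: Uint8Array): boolean {
  if (buffer.length < ZIP_MAGIC.length) return false;
  return ZIP_MAGIC.every((byte, i) => buffer[i] === byte);
}

/**
 * Hypothetical helper: chooses a filename hint for a raw buffer.
 * Assumption: every ZIP-signed buffer the caller handles is a .docx.
 * Returns the fallback (possibly undefined) for anything else.
 */
function filenameHint(buffer: Uint8Array, fallback?: string): string | undefined {
  return hasZipSignature(buffer) ? "filename.docx" : fallback;
}
```

The result of `filenameHint(buffer)` would then be passed as the second argument to `reader.loadDataAsContent`, as in the workaround above. Buffers shorter than four bytes are treated as non-ZIP rather than compared against `undefined`.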
Llamaparse parsing for docx doesn't work in 0.7.3. This works via the web UI which appears to use the public API.
I had hoped 1340 would address this but it has not.
Demonstration code. (Change the file name and the api key env.)