Index PDF and MS Office documents #2

mdecimus · 2022-09-15T08:36:09Z

There is poor support in the Rust ecosystem for extracting text from PDF and MS Office documents.
Look for alternatives.

dumblob · 2022-10-04T07:39:37Z

On one hand this is a useful feature. However, I see one important issue with it, and it is important enough that there should be a checkbox to disable this feature.

Namely, this "content indexing" is largely unreliable in practice. Documents vary so wildly in their formats, producers, etc. that one cannot guarantee they will be indexed properly. It gets much worse once you realize that only a tiny subset of document types/formats can be indexed. Once a user finds out that one document got indexed, she automatically assumes that all documents (incl. every single format ever made on this planet) will be indexed and searchable. Nothing could be further from reality.

To sum up:

1. content indexing is useful
2. content indexing has to be optional (default on, but see point (3))
3. content indexing has to be fully explicit in the user-facing UI: when searching a mailbox, there has to be a checkbox explicitly listing the document formats whose content will be searched
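
The three points above can be sketched as a small settings type. This is only an illustration under stated assumptions: `ContentIndexing`, `should_index`, and `searchable_formats` are hypothetical names that do not come from the project, and the two MIME types in the default list are placeholders, not a statement of what is actually supported.

```rust
use std::collections::BTreeSet;

/// Hypothetical settings for attachment content indexing.
/// None of these names come from the project; they only illustrate the points above.
#[derive(Debug, Clone)]
pub struct ContentIndexing {
    /// Master switch: on by default, but the user can turn it off.
    pub enabled: bool,
    /// MIME types whose text can actually be extracted (placeholder values).
    pub formats: BTreeSet<&'static str>,
}

impl Default for ContentIndexing {
    fn default() -> Self {
        Self {
            enabled: true,
            formats: BTreeSet::from([
                "application/pdf",
                "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
            ]),
        }
    }
}

impl ContentIndexing {
    /// True only when indexing is enabled and the format is explicitly listed.
    /// The comparison ignores ASCII case, since MIME types are case-insensitive.
    pub fn should_index(&self, mime_type: &str) -> bool {
        let normalized = mime_type.to_ascii_lowercase();
        self.enabled && self.formats.contains(normalized.as_str())
    }

    /// Formats a search UI could list next to its "search attachment content"
    /// checkbox, so the user never has to guess what is covered.
    pub fn searchable_formats(&self) -> Vec<&'static str> {
        if self.enabled {
            self.formats.iter().copied().collect()
        } else {
            Vec::new()
        }
    }
}
```

With such a type, the indexer would call `should_index` for each attachment, and the search UI would render `searchable_formats()` verbatim. An unlisted format is then visibly not searchable rather than silently skipped.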

yodatak · 2022-11-18T17:51:43Z

Hello, we could use an interface to ripgrep or ripgrep-all. I use it to search in PDFs and docs.

yodatak · 2022-11-18T17:52:35Z

https://github.com/phiresky/ripgrep-all

cptspff · 2022-12-12T08:28:16Z

How about supporting attachment parsing with an optional external solution like TikaServer? It's easy to implement, battle-tested and relatively complete.

Also, it's probably a good idea to offer a "limit content indexing to known contacts" setting due to the inherently risky business of content scanning. It's my understanding jmap-server will handle contacts as well in the future - please correct me if I'm wrong.

mdecimus · 2022-12-13T08:24:06Z

@cptspff and @yodatak I want to avoid using external software (unless it is available as a library), since that would require users to install and maintain yet another component. But I am going to take a look at Tika and ripgrep to see how they're doing it; perhaps they have some of their functionality available as a library. Thanks!

mdecimus added the "enhancement" (New feature or request) label on Sep 15, 2022
