examples/index_search_example.go
This program shows how to use the APIs in index_search.go to:
- create indexes over PDF files,
- search those indexes using full-text search, and
- mark up PDF files with the locations of the search matches on pages.
It supports three types of index:
- On-disk. These can be as large as your disk but are slower.
- In-memory with the index stored in a Go struct. Faster but limited to (virtual) memory size.
- In-memory with the index serialized to a []byte. Useful for non-Go callers such as web apps.
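The difference between the second and third index types can be sketched with the standard library alone. The `pageIndex` type, its `Postings` field and the `toBytes`/`fromBytes` helpers below are hypothetical and are not the types in index_search.go. `encoding/gob` is used only to keep the sketch dependency-free and is an assumption, not the repo's serialization format.

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// pageIndex is a hypothetical in-memory index held in a Go struct:
// it maps a term to the IDs of the pages that contain it.
type pageIndex struct {
	Postings map[string][]string
}

// toBytes serializes the index to a []byte so that it can be stored
// or passed to a caller that cannot hold a Go struct.
func (idx *pageIndex) toBytes() ([]byte, error) {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(idx); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// fromBytes rebuilds an index from a buffer produced by toBytes.
func fromBytes(data []byte) (*pageIndex, error) {
	idx := &pageIndex{}
	if err := gob.NewDecoder(bytes.NewReader(data)).Decode(idx); err != nil {
		return nil, err
	}
	return idx, nil
}

func main() {
	idx := &pageIndex{Postings: map[string][]string{"type1": {"doc.710", "doc.271"}}}
	data, err := idx.toBytes()
	if err != nil {
		panic(err)
	}
	restored, err := fromBytes(data)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d bytes, %d pages for term %q\n", len(data), len(restored.Postings["type1"]), "type1")
}
```

The struct form is the one to search directly from Go. The []byte form trades a decode step for the ability to move the index across a process or language boundary.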
The repo also has a series of example programs for full-text search on PDF files in pure Go. They use UniDoc for PDF parsing and Bleve for search.
The simple programs explore the UniDoc and Bleve libraries.
cd $GOPATH/src
mkdir -p github.com/unidoc
cd github.com/unidoc
git clone https://github.com/peterwilliams97/unidoc
cd unidoc
git checkout v3.imagemark
go get github.com/blevesearch/bleve/...
go get github.com/blevesearch/snowballstem
go get github.com/kljensen/snowball
go get github.com/willf/bitset
go get github.com/couchbase/moss
go get github.com/syndtr/goleveldb/leveldb
go get github.com/rcrowley/go-metrics
cd $GOPATH/src/github.com/blevesearch/
git clone https://github.com/peterwilliams97/bleve
cd bleve
git checkout explore.01
cd $GOPATH/src/github.com/blevesearch/
git clone https://github.com/peterwilliams97/blevex
cd blevex
brew update
brew install flatbuffers --HEAD
go get github.com/google/flatbuffers/go
cd $GOPATH/src/github.com/peterwilliams97/pdf-search/serial
flatc -g doc_page_locations.fbs
pushd $GOPATH/src/github.com/peterwilliams97/pdf-search/serial/cmd/locations
go run main.go
go test -test.bench .
- simple_index.go: Index some PDFs.
- simple_search.go: Full-text search the PDFs indexed by `simple_index.go`.
e.g.
go run simple_index.go ~/testdata/adobe/PDF32000_2008.pdf
go run simple_search.go Adobe PDF
gives
searchResults=755 matches, showing 1 through 10, took 5.396772ms
1. 1faa31928e.284 (0.414407)
PDF 32000-1:2008
Table 119 – Character collections for predefined CMaps, by PDF version (continued)
CMAP PDF 1.2 PDF 1.3 PDF 1.4 PDF 1.5
GBK2K-H/V — — Adobe-GB1-4 Adobe-GB1-4
UniGB-UCS2-H/V — Adobe-…
2. 1faa31928e.6 (0.322457)
PDF 32000-1:2008
Foreword
On January 29, 2007, Adobe Systems Incorporated announced it’s intention to release the full Portable
Document Format (PDF) 1.7 specification to the American National Standa…
...
or
go run simple_index.go ~/testdata/adobe/*.pdf
go run simple_search.go Type1
gives
searchResults=220 matches, showing 1 through 10, took 12.175059ms
1. 1faa31928e.710 (0.721218)
…bj
7 0 obj
<< /Type /Font
/Subtype /Type1
/Name /F1
/BaseFont /Helvetica
/Encoding /MacRomanEncoding
endobj
2. 1faa31928e.271 (0.427241)
…ut this encodingdoes play a role as a default encoding(as shown in Table 114). The regular encodings
used for Lat
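The hit IDs in the output above, such as 1faa31928e.284, look like a document identifier and a page number joined by a dot. That reading is an assumption drawn from the sample output, not something the programs document. Under that assumption, a minimal sketch for splitting an ID (the `splitHitID` helper is hypothetical):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// splitHitID splits a hit ID such as "1faa31928e.284" at its last dot.
// ASSUMPTION: the part before the dot identifies the PDF file and the
// part after it is a page number. The real ID format may differ.
func splitHitID(id string) (doc string, page int, err error) {
	i := strings.LastIndex(id, ".")
	if i < 0 {
		return "", 0, fmt.Errorf("no '.' in hit ID %q", id)
	}
	page, err = strconv.Atoi(id[i+1:])
	if err != nil {
		return "", 0, fmt.Errorf("bad page number in hit ID %q: %v", id, err)
	}
	return id[:i], page, nil
}

func main() {
	doc, page, err := splitHitID("1faa31928e.284")
	if err != nil {
		panic(err)
	}
	fmt.Printf("doc=%s page=%d\n", doc, page)
}
```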
- concurrent_index_doc.go: Index PDFs concurrently. Granularity is PDF file.
- concurrent_index_page.go: Index PDFs concurrently. Granularity is PDF page.
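The two granularities differ only in what counts as one unit of work. The sketch below is a generic worker pool and is not the code in either program. The `job` type, `indexConcurrently`, `errUnreadable` and the `index` callback are hypothetical stand-ins for the real text extraction and indexing.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errUnreadable is a hypothetical error for a PDF that cannot be parsed.
var errUnreadable = errors.New("unreadable PDF")

// job is one unit of work. With file granularity a job is a whole PDF
// and page is unused. With page granularity it is one page of a PDF.
type job struct {
	path string
	page int
}

// indexConcurrently runs index over jobs on numWorkers goroutines and
// returns the number of jobs that succeeded. Failed jobs are skipped.
func indexConcurrently(jobs []job, numWorkers int, index func(job) error) int {
	if numWorkers < 1 {
		numWorkers = 1
	}
	queue := make(chan job)
	var wg sync.WaitGroup
	var mu sync.Mutex
	done := 0
	for w := 0; w < numWorkers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := range queue {
				if err := index(j); err != nil {
					continue
				}
				mu.Lock()
				done++
				mu.Unlock()
			}
		}()
	}
	for _, j := range jobs {
		queue <- j
	}
	close(queue)
	wg.Wait()
	return done
}

func main() {
	// Page granularity: one job per page.
	jobs := []job{{"a.pdf", 1}, {"a.pdf", 2}, {"broken.pdf", 1}}
	n := indexConcurrently(jobs, 2, func(j job) error {
		if j.path == "broken.pdf" {
			return errUnreadable
		}
		return nil
	})
	fmt.Printf("indexed %d of %d jobs\n", n, len(jobs))
}
```

Page granularity keeps all workers busy when one PDF is much larger than the rest, at the cost of more jobs to schedule.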
https://rwinslow.com/posts/use-flatbuffers-in-golang/
https://github.com/google/flatbuffers/blob/master/tests/go_test.go
https://google.github.io/flatbuffers/flatbuffers_guide_use_go.html
- Move to unidoc:v3
- Make go gettable
- Add tests
- Markup PDF files
- Locations -> Positions
- Efficient position encoding: line Y and height, start of word
- ReadDocPageLocations: gather all (doc, page) entries, open the necessary docs and return the pages.
- parts => strings.Builder https://golang.org/pkg/strings/#Builder
- Offset => Start, End
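The last two items can be illustrated together. This is a sketch of one possible reading, not the planned change: it assumes "parts" is a slice of text fragments joined into page text, and that "Start, End" means a half-open byte range in that text. The `span` type and `buildText` function are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// span is a hypothetical half-open byte range [Start, End) in the built text.
type span struct {
	Start, End int
}

// buildText joins parts with single spaces using strings.Builder and
// records where each part lands in the result.
func buildText(parts []string) (string, []span) {
	var b strings.Builder
	spans := make([]span, 0, len(parts))
	for i, p := range parts {
		if i > 0 {
			b.WriteByte(' ')
		}
		start := b.Len()
		b.WriteString(p)
		spans = append(spans, span{Start: start, End: b.Len()})
	}
	return b.String(), spans
}

func main() {
	text, spans := buildText([]string{"Adobe", "PDF", "Type1"})
	for _, s := range spans {
		fmt.Printf("[%d,%d) %q\n", s.Start, s.End, text[s.Start:s.End])
	}
}
```

Storing both ends of the range lets a caller slice the text directly without knowing the length of the matched word.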
https://www.youtube.com/watch?v=RsOIiW_Ec4c 45:40
https://www.hathitrust.org/blogs/large-scale-search/tale-two-solrs-0
https://www.hathitrust.org/full-text-search-features-and-analysis