Documentation improvements (#28)
* Update README.md

A few simple documentation changes to the README.

* notes about parseDelimitedFrom() and writeDelimitedTo() cf #27

* make a note about d-gaps and writeDelimitedTo

Co-authored-by: Craig Macdonald <[email protected]>
JMMackenzie and cmacdonald authored Apr 12, 2021
1 parent fb99d94 commit edb959f
Showing 2 changed files with 14 additions and 1 deletion.
12 changes: 12 additions & 0 deletions README.md
@@ -55,3 +55,15 @@ A CIFF export can be ingested into a number of different search systems.
+ [PISA](https://github.com/pisa-engine/pisa) via the [PISA CIFF Binaries](https://github.com/pisa-engine/ciff).
+ [OldDog](https://github.com/chriskamphuis/olddog) by [creating csv files through CIFF](https://github.com/Chriskamphuis/olddog/blob/master/src/main/java/nl/ru/convert/CiffToCsv.java)
+ [Terrier](http://terrier.org) via the [Terrier-CIFF plugin](https://github.com/terrierteam/terrier-ciff)

## Tips for writing your own CIFF Importer / Exporter

The systems above all provide concrete examples of taking an existing CIFF structure and converting it into a different (internal) index format.
Most of the data/structures within the CIFF are quite straightforward and self-documenting. However, there are a few important details which
should be noted.

1. The default CIFF exports come from Anserini. Those exports are engineered to encode document identifiers *as deltas (d-gaps).* Hence, when decoding a CIFF structure, care needs to be taken to recover the original identifiers by computing a prefix sum across each postings list. See the discussion [here](https://github.com/osirrc/ciff/issues/19).

2. Since Anserini is based on Lucene, it is important to note that document lengths are encoded in a lossy manner. This means that the document lengths recorded in the `DocRecord` structure are *approximate* - see the discussion [here](https://github.com/osirrc/ciff/issues/21).

3. Multiple records are stored in a single file using Java protobuf's `parseDelimitedFrom()` and `writeDelimitedTo()` methods. Unfortunately, these methods are not available in all other language bindings. They can be reimplemented by reading/writing the byte size of each record as a varint immediately before the record itself - see the discussion [here](https://github.com/osirrc/ciff/issues/27).
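
As a rough illustration of points 1 and 3, the following Python sketch reads and writes varint-delimited records and recovers document identifiers from d-gaps with a prefix sum. It works on raw bytes only and does not depend on the protobuf package. All function names (`read_varint`, `write_varint`, `read_delimited`, `write_delimited`, `decode_dgaps`) are hypothetical helpers made up for this example and are not part of CIFF or of any protobuf binding.

```python
import io
from itertools import accumulate
from typing import BinaryIO, Iterable, Iterator, List, Optional


def read_varint(stream: BinaryIO) -> Optional[int]:
    """Read one base-128 varint; return None on a clean end of stream."""
    result = 0
    shift = 0
    while True:
        byte = stream.read(1)
        if not byte:
            if shift == 0:
                return None  # nothing left to read
            raise EOFError("stream ended inside a varint")
        result |= (byte[0] & 0x7F) << shift  # low 7 bits carry data
        if not byte[0] & 0x80:               # high bit clear: last byte
            return result
        shift += 7


def write_varint(stream: BinaryIO, value: int) -> None:
    """Write a non-negative integer as a base-128 varint."""
    while True:
        bits = value & 0x7F
        value >>= 7
        if value:
            stream.write(bytes([bits | 0x80]))  # more bytes follow
        else:
            stream.write(bytes([bits]))
            return


def read_delimited(stream: BinaryIO) -> Iterator[bytes]:
    """Yield the raw bytes of each size-prefixed record until end of stream."""
    while True:
        size = read_varint(stream)
        if size is None:
            return
        payload = stream.read(size)
        if len(payload) != size:
            raise EOFError("stream ended inside a record")
        yield payload


def write_delimited(stream: BinaryIO, payload: bytes) -> None:
    """Write one record prefixed with its varint-encoded byte size."""
    write_varint(stream, len(payload))
    stream.write(payload)


def decode_dgaps(gaps: Iterable[int]) -> List[int]:
    """Recover document identifiers from d-gaps via a prefix sum."""
    return list(accumulate(gaps))
```

Each payload yielded by `read_delimited` is one serialized message, which can then be parsed with the generated message class for the language in use (for example, `ParseFromString` in Python). A real importer would likely read the `Header` first and then read exactly the number of `PostingsList` and `DocRecord` messages it specifies, rather than reading until the end of the stream as this sketch does.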
3 changes: 2 additions & 1 deletion src/main/protobuf/CommonIndexFileFormat.proto
@@ -6,6 +6,7 @@ package io.osirrc.ciff;
// - A Header protobuf message,
// - Exactly the number of PostingsList messages specified in the num_postings_lists field of the Header
// - Exactly the number of DocRecord messages specified in the num_doc_records field of the Header
// Each message is written using message.writeDelimitedTo(), which prefixes each message with its varint encoded size.
// The protobuf messages are defined below.

// This is the CIFF header. It always comes first.
@@ -37,7 +38,7 @@ message Header {

// An individual posting.
message Posting {
int32 docid = 1;
int32 docid = 1; //the *delta-gap* compressed docid
int32 tf = 2;
}
