Documentation improvements (#28)

* Update README.md

  A few simple documentation changes to the README.

* notes about parseDelimitedFrom() and writeDelimitedTo() cf #27
* make a note about d-gaps and writeDelimitedTo

Co-authored-by: Craig Macdonald <[email protected]>
osirrc · Apr 12, 2021 · edb959f
1 parent fb99d94
commit edb959f
Showing 2 changed files with 14 additions and 1 deletion.
diff --git a/README.md b/README.md
@@ -55,3 +55,15 @@ A CIFF export can be ingested into a number of different search systems.
 + [PISA](https://github.com/pisa-engine/pisa) via the [PISA CIFF Binaries](https://github.com/pisa-engine/ciff).
 + [OldDog](https://github.com/chriskamphuis/olddog) by [creating csv files through CIFF](https://github.com/Chriskamphuis/olddog/blob/master/src/main/java/nl/ru/convert/CiffToCsv.java)
 + [Terrier](http://terrier.org) via the [Terrier-CIFF plugin](https://github.com/terrierteam/terrier-ciff)
+
+##  Tips for writing your own CIFF Importer / Exporter
+
+The systems above all provide concrete examples of taking an existing CIFF structure and converting it into a different (internal) index format.
+Most of the data/structures within the CIFF are quite straightforward and self-documenting. However, there are a few important details which
+should be noted.
+
+1. The default CIFF exports come from Anserini. Those exports are engineered to encode document identifiers *as deltas (d-gaps).* Hence, when decoding a CIFF structure, care needs to be taken to recover the original identifiers by computing a prefix sum across each postings list. See the discussion [here](https://github.com/osirrc/ciff/issues/19).
+
+2. Since Anserini is based on Lucene, it is important to note that document lengths are encoded in a lossy manner. This means that the document lengths recorded in the `DocRecord` structure are *approximate* - see the discussion [here](https://github.com/osirrc/ciff/issues/21).
+
+3. Multiple records are stored in a single file using Java protobuf's parseDelimitedFrom() and writeDelimitedTo() methods. Unfortunately, these methods are not available in the bindings for other languages. They can be trivially reimplemented by reading/writing the byte size of each record as a varint - see the discussion [here](https://github.com/osirrc/ciff/issues/27).
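
The length-prefixed framing described in tip 3 can be sketched in Python without any protobuf dependency. This is a minimal illustration, not the CIFF reference code: the helper names (`write_varint`, `read_varint`, `write_delimited`, `read_delimited`) are hypothetical, and the payloads are treated as opaque bytes. In a real importer, each payload would presumably be handed to the message class generated by `protoc` from the `.proto` file.

```python
import io
from typing import BinaryIO, Iterator, Optional


def write_varint(stream: BinaryIO, value: int) -> None:
    """Write a non-negative integer as a base-128 varint (low 7 bits first)."""
    if value < 0:
        raise ValueError("a length prefix must be non-negative")
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            stream.write(bytes([byte | 0x80]))  # continuation bit: more bytes follow
        else:
            stream.write(bytes([byte]))
            return


def read_varint(stream: BinaryIO) -> Optional[int]:
    """Read one varint; return None if the stream is already at its end."""
    result = 0
    shift = 0
    while True:
        raw = stream.read(1)
        if not raw:
            if shift == 0:
                return None  # clean end of stream between records
            raise EOFError("stream ended inside a varint")
        byte = raw[0]
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result
        shift += 7


def write_delimited(stream: BinaryIO, payload: bytes) -> None:
    """Write one record: varint byte size, then the serialized bytes."""
    write_varint(stream, len(payload))
    stream.write(payload)


def read_delimited(stream: BinaryIO) -> Iterator[bytes]:
    """Yield the raw bytes of each length-prefixed record in the stream."""
    while True:
        size = read_varint(stream)
        if size is None:
            return
        payload = stream.read(size)
        if len(payload) != size:
            raise EOFError("truncated record")
        yield payload
```

When reading an actual CIFF file, the header's record counts say how many messages of each type to expect, so a reader would typically pull exactly that many records rather than iterating to end of stream.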
diff --git a/src/main/protobuf/CommonIndexFileFormat.proto b/src/main/protobuf/CommonIndexFileFormat.proto
@@ -6,6 +6,7 @@ package io.osirrc.ciff;
 //  - A Header protobuf message,
 //  - Exactly the number of PostingsList messages specified in the num_postings_lists field of the Header
 //  - Exactly the number of DocRecord messages specified in the num_doc_records field of the Header
+// Each message is written using message.writeDelimitedTo(), which prefixes each message with its varint encoded size.
 // The protobuf messages are defined below.
 
 // This is the CIFF header. It always comes first.
@@ -37,7 +38,7 @@ message Header {
 
 // An individual posting.
 message Posting {
-  int32 docid = 1;
+  int32 docid = 1; //the *delta-gap* compressed docid
   int32 tf = 2;
 }
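
The d-gap encoding noted on the `docid` field can be undone with a running sum over each postings list. The sketch below assumes the first value in a list is the absolute document identifier (a gap from zero) and every later value is the difference from its predecessor; `decode_docids` and `encode_docids` are hypothetical helper names, and plain integer lists stand in for the `docid` fields of parsed `Posting` messages.

```python
from itertools import accumulate
from typing import Iterable, List


def decode_docids(gaps: Iterable[int]) -> List[int]:
    """Recover absolute docids from d-gaps via a prefix sum."""
    return list(accumulate(gaps))


def encode_docids(docids: Iterable[int]) -> List[int]:
    """Inverse operation: turn sorted absolute docids into d-gaps."""
    gaps = []
    previous = 0
    for docid in docids:
        gaps.append(docid - previous)  # assumes docids are in ascending order
        previous = docid
    return gaps


# Gaps 3, 2, 5 correspond to docids 3, 3+2, 3+2+5.
print(decode_docids([3, 2, 5]))  # [3, 5, 10]
```

The prefix sum restarts for every postings list; gaps are not carried over from one term to the next.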