diff --git a/README.md b/README.md
index ded0b13..341b58a 100644
--- a/README.md
+++ b/README.md
@@ -1,2 +1,83 @@
 # cuckoo-filter
 cuckoo-filter go implement. Custom by you
+
+Ported from [efficient/cuckoofilter](https://github.com/efficient/cuckoofilter)
+
+[中文文档](./README_ZH.md)
+
+Overview
+--------
+A cuckoo filter is a Bloom filter replacement for approximate set-membership queries. While Bloom filters are well-known space-efficient data structures for answering queries like "is item x in the set?", they do not support deletion. Their variants that enable deletion (like counting Bloom filters) usually require much more space.
+
+Cuckoo filters provide the flexibility to add and remove items dynamically. A cuckoo filter is based on cuckoo hashing (hence the name). It is essentially a cuckoo hash table storing each key's fingerprint. Cuckoo hash tables can be highly compact, so a cuckoo filter can use less space than conventional Bloom filters for applications that require low false positive rates (< 3%).
+
+For details about the algorithm, and for citations, please use:
+
+["Cuckoo Filter: Practically Better Than Bloom"](http://www.cs.cmu.edu/~binfan/papers/conext14_cuckoofilter.pdf) in proceedings of ACM CoNEXT 2014 by Bin Fan, Dave Andersen and Michael Kaminsky
+
+## Implementation details
+
+The paper cited above leaves several parameters open to choice:
+
+1. Bucket size (b): number of fingerprints per bucket
+2. Fingerprint size (f): number of bits of the key's hash stored in each fingerprint
+
+In other implementations:
+
+- [seiflotfy/cuckoofilter](https://github.com/seiflotfy/cuckoofilter) uses b=4, f=8 bits, which corresponds to a false positive rate of `r ~= 0.03`.
+- [panmari/cuckoofilter](https://github.com/panmari/cuckoofilter) uses b=4, f=16 bits, which corresponds to a false positive rate of `r ~= 0.0001`.
+- [irfansharif/cfilter](https://github.com/irfansharif/cfilter) lets you adjust b and f, but f can only be a multiple of 8, that is, a whole number of bytes.
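+
+In this implementation, you can set b and f to any value you want. The semi-sorting buckets described in the paper are also available, which can save 1 bit per item.
+
+##### Why is customization important?
+
+According to the paper:
+
+- Different bucket sizes result in different filter load factors, that is, the maximum occupancy rate of the filter
+- Different bucket sizes suit different target false positive rates
+- To keep the same false positive rate, a bigger bucket size requires a bigger fingerprint size
+
+Given a target false positive rate of `r`:
+
+> when r > 0.002, having two entries per bucket yields slightly better results than using four entries per bucket; when r decreases to 0.00001 < r ≤ 0.002, four entries per bucket minimizes space.
+
+With a bucket size `b`, they suggest choosing the fingerprint size `f` using
+
+    f >= log2(2b/r) bits
+
+At the same time, note that the load factor is 84%, 95%, or 98% when using bucket size b = 2, 4, or 8, respectively.
+
+##### To learn more about choosing parameters, refer to section 5 of the paper
+
+Note: generally b = 8 is enough. Without more supporting data, we suggest choosing b from 2, 4, or 8. The maximum f is 32 bits.
+
+## Example usage:
+
+``` go
+package main
+
+import (
+	"fmt"
+	"github.com/linvon/cuckoo-filter"
+)
+
+func main() {
+	cf := cuckoo.NewFilter(4, 9, 3900, cuckoo.TableTypePacked)
+	fmt.Println(cf.Info())
+	fmt.Println(cf.FalsePositiveRate())
+
+	a := []byte("A")
+	cf.Add(a)
+	fmt.Println(cf.Contain(a))
+	fmt.Println(cf.Size())
+
+	b := cf.Encode() // serialize the filter to bytes
+	ncf, _ := cuckoo.Decode(b) // rebuild a filter from the bytes; error ignored for brevity
+	fmt.Println(ncf.Contain(a))
+
+	cf.Delete(a)
+	fmt.Println(cf.Size())
+}
+```
+
diff --git a/README_ZH.md b/README_ZH.md
new file mode 100644
index 0000000..ae2e89b
--- /dev/null
+++ b/README_ZH.md
@@ -0,0 +1,83 @@
+# cuckoo-filter
+A Go implementation of cuckoo-filter, customizable to your configuration
+
+Ported from [efficient/cuckoofilter](https://github.com/efficient/cuckoofilter)
+
+[English Version](./README.md)
+
+Overview
+--------
+A cuckoo filter is a data structure that replaces the Bloom filter for approximate set-membership queries. The Bloom filter is a well-known, highly space-efficient data structure for answering questions like "is x in the set?", but it does not support deletion. Its variants that support deletion (such as counting Bloom filters) usually require more space.
+
+Cuckoo filters can add and remove items dynamically. A cuckoo filter is based on cuckoo hashing (which is why it is called a cuckoo filter). It is essentially a cuckoo hash table that stores the fingerprint of each key. Cuckoo hash tables can be very compact, so for applications that require a low false positive rate (< 3%), a cuckoo filter can save more space than a conventional Bloom filter.
+
+For details about the algorithm, and for citations, see:
+
+["Cuckoo Filter: Practically Better Than Bloom"](http://www.cs.cmu.edu/~binfan/papers/conext14_cuckoofilter.pdf) in proceedings of ACM CoNEXT 2014 by Bin Fan, Dave Andersen and Michael Kaminsky
+
+## Implementation details
+
+The paper above leaves several parameters to choose:
+
+1. Bucket size (b): how many fingerprints one bucket stores
+2. Fingerprint size (f): the number of bits of the key's hash value stored in each fingerprint
+
+In other implementations:
+
+- [seiflotfy/cuckoofilter](https://github.com/seiflotfy/cuckoofilter) uses b=4, f=8 bits, giving a false positive rate of approximately `r ~= 0.03`.
+- [panmari/cuckoofilter](https://github.com/panmari/cuckoofilter) uses b=4, f=16 bits, giving a false positive rate of approximately `r ~= 0.0001`.
+- [irfansharif/cfilter](https://github.com/irfansharif/cfilter) lets you adjust b and f, but f can only be set to a multiple of 8, that is, in units of bytes.
+
+In this implementation, you can set b and f to any value you want, and the semi-sorting buckets mentioned in the paper are also available; this method saves one bit per item.
+
+##### Why is customization important?
+
+According to the paper:
+
+- Different bucket sizes produce different filter load factors, which represent the maximum space utilization of the filter
+- Different bucket sizes suit different target false positive rates
+- To keep the false positive rate unchanged, the larger the bucket size, the larger the fingerprint size required
+
+Suppose the false positive rate we need is `r`:
+
+> When r > 0.002, having two entries per bucket yields slightly better results than using four entries per bucket; when r decreases to 0.00001 < r ≤ 0.002, four entries per bucket minimizes space.
+
+With a bucket size `b`, they suggest choosing the fingerprint size `f` using
+
+    f >= log2(2b/r) bits
+
+Also note that with bucket sizes b = 2, 4, or 8, the corresponding load factors are 84%, 95%, or 98%.
+
+##### To learn more about choosing parameters, refer to section 5 of the paper
+
+Note: in general b = 8 is enough. Since there is no further data to support larger values, we suggest choosing the bucket size from 2, 4, and 8. The maximum f is 32 bits.
+
+## Example usage:
+
+``` go
+package main
+
+import (
+	"fmt"
+	"github.com/linvon/cuckoo-filter"
+)
+
+func main() {
+	cf := cuckoo.NewFilter(4, 9, 3900, cuckoo.TableTypePacked)
+	fmt.Println(cf.Info())
+	fmt.Println(cf.FalsePositiveRate())
+
+	a := []byte("A")
+	cf.Add(a)
+	fmt.Println(cf.Contain(a))
+	fmt.Println(cf.Size())
+
+	b := cf.Encode() // serialize the filter to bytes
+	ncf, _ := cuckoo.Decode(b) // rebuild a filter from the bytes; error ignored for brevity
+	fmt.Println(ncf.Contain(a))
+
+	cf.Delete(a)
+	fmt.Println(cf.Size())
+}
+```
+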
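As a rough illustration of the sizing rule quoted in the README above, `f >= log2(2b/r)`, the Go sketch below computes the smallest whole-bit fingerprint size for each suggested bucket size, then divides it by the load factor the README quotes to estimate the bits spent per stored item. The helper name `minFingerprintBits` and the target rate `r = 0.001` are assumptions chosen for illustration; neither is part of the `cuckoo-filter` package.

```go
package main

import (
	"fmt"
	"math"
)

// minFingerprintBits is a hypothetical helper (not part of the library).
// It returns the smallest integer f satisfying f >= log2(2b/r),
// the rule quoted in the README.
func minFingerprintBits(b int, r float64) int {
	return int(math.Ceil(math.Log2(2 * float64(b) / r)))
}

func main() {
	// Load factors as quoted in the README for b = 2, 4, 8.
	loadFactor := map[int]float64{2: 0.84, 4: 0.95, 8: 0.98}

	r := 0.001 // assumed target false positive rate
	for _, b := range []int{2, 4, 8} {
		f := minFingerprintBits(b, r)
		// Estimated space per item: fingerprint bits divided by load factor.
		perItem := float64(f) / loadFactor[b]
		fmt.Printf("b=%d f=%d bits, ~%.2f bits per item\n", b, f, perItem)
	}
}
```

With these assumed inputs, b = 4 comes out smallest (about 13.68 bits per item, versus about 14.29 for both b = 2 and b = 8), which is consistent with the quoted guidance that four entries per bucket minimizes space for 0.00001 < r ≤ 0.002. The estimate ignores the 1 bit per item that semi-sorting can save.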