Skip to content

Go implementation of the AE chunking algorithm.

License

Notifications You must be signed in to change notification settings

mg98/ae-chunker-go

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

21 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

AE Chunker (GO)

GoDoc Test codecov Go Report Card License

ae-chunker-go is a best-effort Go implementation of the chunking algorithm presented in AE: An Asymmetric Extremum Content Defined Chunking Algorithm for Fast and Bandwidth-Efficient Data Deduplication by Yucheng Zhang et al. (PDF).

Install

go get -u github.com/mg98/ae-chunker-go

Example

import (
    "bytes"
    "fmt"
    "io"
    "log"
    "math/rand"
    "time"
    "github.com/ae-chunker-go"
)

func main() {
    data := make([]byte, 1024*1024)  // 1 MiB
    rnd := rand.New(rand.NewSource(time.Now().Unix()))
    if _, err := rnd.Read(data); err != nil {
        log.Fatal(err)
    }

    chunker := ae.NewChunker(bytes.NewReader(data), &ae.Options{
    	AverageSize: 256*1024,  // 256 KiB
    	MaxSize: 512*1024,      // 512 KiB
    })
    var chunks [][]byte
    for {
    	chunk, err := chunker.NextBytes()
    	if err == io.EOF {
    		break
        } else if err != nil {
        	log.Fatal(err)
        }
        chunks = append(chunks, chunk)
    }
    
    fmt.Printf(
        "Data divided into %d chunks. First chunk is %d bytes.\n",
        len(chunks),
        len(chunks[0]),
    )
    // Example output: Data divided into 5 chunks. First chunk is 224098 bytes.
}

Benchmarks

Performance

The task was to divide 100 MiB of random bytes into chunks with an average size of 256 KiB (CPU: Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz).

Chunking Algorithm Speed Processed Bytes Allocated Bytes Distinct Mem. Alloc.
ae-chunker-go 168 sec/op 622.94 MB/s 507.27 MB/op 7769 allocs/op
fastcdc-go 88 sec/op 1194.74 MB/s 2.10 MB/op 3 allocs/op
go-ipfs-chunker (Rabin) 414 sec/op 253.54 MB/s 108.83 MB/op 1192 allocs/op
go-ipfs-chunker (Buzhash) 81 sec/op 1288.27 MB/s 106.48 MB/op 406 allocs/op
go-ipfs-chunker (Fixed Size) 22 sec/op 4773.13 MB/s 104.87 MB/op 405 allocs/op

Deduplication Efficiency

This metric measures how well deduplication performs with multiple versions of a file. More precisely, we define the Deduplication Elimination Ratio (DER) as the ratio of the size of the input data to the size of the altered data (the higher the better).

For the evaluation, the uncompressed TAR archives of 20 consecutive versions of the GCC source code were used (a total of 12 GB). The algorithms were run configured for an average chunk size of 8 KB and (where applicable) a maximum chunk size of 16 KB. Because the Buzhash library does not support flexible chunk sizes the tests were repeated with 256 KB average and 512 KB max size for better comparison. Generally, smaller chunk sizes make better deduplication.

Chunking Algorithm DER (8K/16K) DER (256K/512K)
ae-chunker-go 1.056510 1.002392
fastcdc-go 1.000643 1.000000
go-ipfs-chunker (Rabin) 1.354034 1.058422
go-ipfs-chunker (Buzhash) n/a 1.083399
go-ipfs-chunker (Fixed Size) 1.032097 1.000579

Chunk Size Variance

The following plots show the chunk size distribution on a set of random bytes of 1 GiB. The algorithm was run with the options &ae.Options{AverageSize: 256*1024} and &ae.Options{AverageSize: 256*1024, MaxSize: 512*1024}, respectively.