Use multiple Kubernetes pods to deduplicate a dataset that is larger than the memory of any single pod.
Accomplished the goal, but not in a clean or easily reproducible manner. Abandoned the project before wrapping it up neatly due to the complexity of testing.
I was frustrated by non-deterministic (dataset-dependent) OOM errors in Spark that were difficult to unit test. This got me thinking about algorithms that adapt based on their memory usage (similar to a Kubernetes horizontal pod autoscaler).
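As a rough illustration of the core idea (not the exact design used in the project), records can be hash-partitioned so that every duplicate of a key lands on the same pod, and each pod only has to hold its own partition's keys in memory. The names below (`Record`, `partition_for`, `dedupe_partition`) are hypothetical, and the partition count is the knob an adaptive version would turn when a pod's working set grows too large.

```rust
use std::collections::BTreeSet;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// A hypothetical record type; only the key matters for deduplication.
#[derive(Hash, PartialEq, Eq, PartialOrd, Ord, Clone)]
struct Record {
    key: String,
}

/// Route a record to one of `num_partitions` pods by hashing its key.
/// Every duplicate of a key lands on the same pod, so each pod can
/// deduplicate its partition independently with an in-memory set.
fn partition_for(record: &Record, num_partitions: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    record.key.hash(&mut hasher);
    hasher.finish() % num_partitions
}

/// Per-pod deduplication of the records routed to this partition.
fn dedupe_partition(records: impl Iterator<Item = Record>) -> Vec<Record> {
    let mut seen = BTreeSet::new();
    records.filter(|r| seen.insert(r.clone())).collect()
}

fn main() {
    let records = vec![
        Record { key: "a".into() },
        Record { key: "b".into() },
        Record { key: "a".into() },
    ];
    // With 4 partitions, both copies of "a" route to the same pod, so the
    // duplicate is caught even though no single pod sees every record.
    for r in &records {
        println!("key {} -> partition {}", r.key, partition_for(r, 4));
    }
    let unique = dedupe_partition(records.into_iter());
    println!("{} unique records", unique.len());
}
```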
- A proxy is a much better design for this task
- Sending and receiving serialized data over the wire (see the framing sketch after this list)
- Started with TCP, but transitioned to gRPC
- Debugging and avoiding deadlocks
- Use of Kubernetes client libraries (see the pod-listing sketch after this list)
- Building an abstraction on top of a BTreeMap (see the sketch after this list)
- The challenges of making generic async traits (see the sketch after this list)
- Designing a contract that is self-referencing and tolerates downtime
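Sending serialized data over raw TCP meant doing the message framing by hand; a minimal sketch of that, assuming tokio and bincode (crate choices not stated in the original notes; gRPC handles this framing for you, which is part of why the project moved to it):

```rust
use serde::{Deserialize, Serialize};
use tokio::io::{AsyncReadExt, AsyncWriteExt};
use tokio::net::TcpStream;

/// A hypothetical message exchanged between pods.
#[derive(Serialize, Deserialize, Debug)]
struct KeyBatch {
    keys: Vec<String>,
}

/// Write one message: a u32 length prefix followed by the bincode payload.
async fn send_msg(stream: &mut TcpStream, msg: &KeyBatch) -> anyhow::Result<()> {
    let payload = bincode::serialize(msg)?;
    stream.write_u32(payload.len() as u32).await?;
    stream.write_all(&payload).await?;
    Ok(())
}

/// Read one message by first reading the length prefix, then exactly that
/// many bytes. Without the prefix, the reader cannot tell where one
/// serialized message ends and the next begins.
async fn recv_msg(stream: &mut TcpStream) -> anyhow::Result<KeyBatch> {
    let len = stream.read_u32().await? as usize;
    let mut buf = vec![0u8; len];
    stream.read_exact(&mut buf).await?;
    Ok(bincode::deserialize(&buf)?)
}
```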
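The Kubernetes client usage looked roughly like this kind of call (a sketch assuming the kube and k8s-openapi crates; the "app=dedupe-worker" label selector is a hypothetical name): listing the peer pods so each worker can discover the others.

```rust
use k8s_openapi::api::core::v1::Pod;
use kube::{api::ListParams, Api, Client};

/// List the pods in a namespace that carry a given label, e.g. the other
/// workers participating in the deduplication job.
async fn worker_pods(namespace: &str) -> anyhow::Result<Vec<String>> {
    let client = Client::try_default().await?;
    let pods: Api<Pod> = Api::namespaced(client, namespace);
    let params = ListParams::default().labels("app=dedupe-worker");
    let list = pods.list(&params).await?;
    Ok(list
        .items
        .into_iter()
        .filter_map(|p| p.metadata.name)
        .collect())
}
```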
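The BTreeMap abstraction was essentially about knowing how much the map is holding and acting on it before the kernel does; a minimal sketch (the byte-counting and threshold logic here are simplified assumptions, not the project's actual accounting):

```rust
use std::collections::BTreeMap;

/// A BTreeMap wrapper that keeps a rough running estimate of how many
/// bytes it holds, so callers can split or spill a partition instead of
/// discovering the limit via an OOM kill.
struct BoundedMap {
    inner: BTreeMap<String, Vec<u8>>,
    approx_bytes: usize,
    limit_bytes: usize,
}

impl BoundedMap {
    fn new(limit_bytes: usize) -> Self {
        Self { inner: BTreeMap::new(), approx_bytes: 0, limit_bytes }
    }

    /// Insert a key/value pair, returning true if the map is now over its
    /// soft limit and should be split or flushed by the caller.
    fn insert(&mut self, key: String, value: Vec<u8>) -> bool {
        let key_len = key.len();
        let val_len = value.len();
        match self.inner.insert(key, value) {
            // Replacing an existing value: the key was already counted.
            Some(old) => self.approx_bytes = self.approx_bytes + val_len - old.len(),
            None => self.approx_bytes += key_len + val_len,
        }
        self.approx_bytes > self.limit_bytes
    }

    fn get(&self, key: &str) -> Option<&Vec<u8>> {
        self.inner.get(key)
    }
}
```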
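The generic async trait difficulty comes from wanting one trait that both a local store and a remote (proxied) store can implement, so the dedupe logic does not care where a partition lives. A sketch using the async-trait crate (an assumption; native async fn in traits is another option on recent Rust, with its own object-safety limitations):

```rust
use async_trait::async_trait;
use std::collections::BTreeSet;

/// One trait for both the local and the remote key store.
#[async_trait]
trait KeyStore: Send + Sync {
    /// Returns true if the key had not been seen before.
    async fn insert(&mut self, key: String) -> anyhow::Result<bool>;
}

/// Local implementation backed by an in-memory set of keys.
/// A RemoteStore proxying the same call over gRPC would implement
/// the same trait.
struct LocalStore {
    keys: BTreeSet<String>,
}

#[async_trait]
impl KeyStore for LocalStore {
    async fn insert(&mut self, key: String) -> anyhow::Result<bool> {
        Ok(self.keys.insert(key))
    }
}
```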