Merge pull request #8 from aswinkarthik93/benchmark
Add comparisons
Aswin Karthik authored Apr 28, 2018
2 parents c066986 + 80c9ab9 commit 2f97dc6
Showing 3 changed files with 80 additions and 3 deletions.
4 changes: 2 additions & 2 deletions README.md
@@ -12,10 +12,10 @@ A Blazingly fast diff tool for comparing csv files.

Csvdiff is a difftool to compute changes between two csv files.

* It is not a traditional diff tool. It is most suitable for comparing csv files dumped from database tables.
* It is not a traditional diff tool. It is most suitable for comparing csv files dumped from database tables. GNU diff is several times faster at plain line-by-line comparison (about 7x in the benchmark).
* Supports specifying group of columns as primary-key.
* Supports selective comparison of fields in a row.
* Process a million records csv in under 2 seconds
* Compares CSVs of a million records in about 2 seconds. Comparisons and benchmarks [here](/benchmark).

## Demo

66 changes: 65 additions & 1 deletion benchmark/README.md
@@ -1,4 +1,68 @@
## Benchmark Results
## Comparison with other tools


### Setup

* Uses the Majestic Million data set. (Source in the credits section.)
* Both files have 998390 rows and 12 columns.
* The two files differ by exactly one modification.
* Ran on an Intel Core i7 (2.5 GHz, 4 cores) with 16 GB RAM.
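
A pair of files like this can be produced by copying the base data and changing a single field. The sketch below shows the idea on a tiny in-memory stand-in. The helper name `with_one_modification` and the sample rows are made up for illustration, and this is not necessarily how `majestic_million_diff.csv` was created.

```python
import csv
import io


def with_one_modification(rows, row_index, column, new_value):
    """Return a copy of rows in which exactly one field has been changed."""
    modified = [dict(row) for row in rows]  # copy so the base rows stay intact
    modified[row_index][column] = new_value
    return modified


# Tiny stand-in for the real data (hypothetical values).
base_text = "id,col-1,col-2\n1,a,x\n2,b,y\n3,c,z\n"
base_rows = list(csv.DictReader(io.StringIO(base_text)))

diff_rows = with_one_modification(base_rows, row_index=1, column="col-1", new_value="B")

# Count how many rows differ between the two versions.
changed = sum(1 for old, new in zip(base_rows, diff_rows) if old != new)
print(changed)
```

With one changed field in one row, every tool in the comparison should report a single modification and no additions.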

0. csvdiff (this tool) : *0m2.085s*

```bash
time csvdiff run -b majestic_million.csv -d majestic_million_diff.csv

# Additions: 0
# Modifications: 1

real 0m2.085s
user 0m3.861s
sys 0m0.340s
```

1. [data.table](https://github.com/Rdatatable/data.table) : *0m4.284s*

* Join both CSVs on the `id` column.
* Check each pair of matching columns for inequality.
* The R script is in [data-table.r](/benchmark/data-table.r). (I am new to R, so it can probably be written better.)

```bash
time Rscript data-table.r

real 0m4.284s
user 0m3.887s
sys 0m0.284s
```
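
For readers who do not use R, the join-then-compare approach can be sketched in plain Python. This is only an illustration of the two steps listed above, not a port of the R script. The function name `changed_ids` and the sample data are assumptions, and rows that exist only in the second file are skipped here.

```python
import csv
import io


def changed_ids(base_text, delta_text, key="id"):
    """Join two CSVs on `key` and return the keys whose other columns differ."""
    # Index the base file by its key column (the "join" step).
    base = {row[key]: row for row in csv.DictReader(io.StringIO(base_text))}

    changed = []
    for row in csv.DictReader(io.StringIO(delta_text)):
        old = base.get(row[key])
        if old is None:
            continue  # no match in the base file, so nothing to compare
        # The "inequality" step: compare every non-key column.
        if any(row[col] != old[col] for col in row if col != key):
            changed.append(row[key])
    return changed


# Hypothetical sample data: row 2 is modified, row 3 is new.
base_csv = "id,col-1,col-2\n1,a,x\n2,b,y\n"
delta_csv = "id,col-1,col-2\n1,a,x\n2,b,Y\n3,c,z\n"

result = changed_ids(base_csv, delta_csv)
print(result)
```

The dictionary lookup plays the role of the keyed merge, so the order of rows in either file does not matter.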

2. [csvdiff](https://pypi.org/project/csvdiff/) written in Python : *0m48.115s*

```bash
time csvdiff --style=summary id majestic_million.csv majestic_million_diff.csv
0 rows removed (0.0%)
0 rows added (0.0%)
1 rows changed (0.0%)

real 0m48.115s
user 0m42.895s
sys 0m3.948s
```

3. GNU diff (Fastest) : *0m0.297s*

* Seems the fastest. Couldn't even come close here.
* However, it does a line-by-line diff. It does not support compound keys in a CSV or selective comparison of columns. Hence the disclaimer: csvdiff cannot be used as a generic diff tool.
* On another note, let's see if we can reach this speed.

```bash
time diff majestic_million.csv majestic_million_diff.csv

real 0m0.297s
user 0m0.144s
sys 0m0.147s
```
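
To make the difference from a line-by-line diff concrete, here is a minimal Python sketch of a keyed comparison with a compound key and a selected set of columns. It hashes the key columns and the compared columns separately and then compares the hashes. The names `digest_rows` and `keyed_diff`, the SHA-1 choice, and the sample rows are all assumptions for illustration; this is not csvdiff's actual implementation.

```python
import csv
import hashlib
import io


def _digest(fields):
    # A unit separator keeps ("ab", "c") distinct from ("a", "bc").
    return hashlib.sha1("\x1f".join(fields).encode("utf-8")).hexdigest()


def digest_rows(text, key_positions, value_positions):
    """Map a digest of the key columns to a digest of the compared columns."""
    digests = {}
    for record in csv.reader(io.StringIO(text)):
        key = _digest([record[i] for i in key_positions])
        digests[key] = _digest([record[i] for i in value_positions])
    return digests


def keyed_diff(base_text, delta_text, key_positions, value_positions):
    """Return (additions, modifications) of delta relative to base."""
    base = digest_rows(base_text, key_positions, value_positions)
    delta = digest_rows(delta_text, key_positions, value_positions)
    additions = sum(1 for k in delta if k not in base)
    modifications = sum(1 for k, v in delta.items() if k in base and base[k] != v)
    return additions, modifications


# Hypothetical data: rows are reordered, one value changes, one row is new.
base_csv = "1,eu,alice,10\n2,eu,bob,20\n"
delta_csv = "2,eu,bob,20\n1,eu,alice,99\n3,us,carol,30\n"

# Compound key = columns 0 and 1; compare only column 2.
name_only = keyed_diff(base_csv, delta_csv, [0, 1], [2])
# Same key; compare columns 2 and 3.
all_values = keyed_diff(base_csv, delta_csv, [0, 1], [2, 3])
print(name_only, all_values)
```

A line-by-line diff would flag the reordered rows as changes, while the keyed comparison only reports the new row and, when column 3 is included, the one modified value.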

## Go Benchmark Results

Benchmark test can be found [here](https://github.com/aswinkarthik93/csvdiff/blob/master/pkg/digest/digest_benchmark_test.go).

13 changes: 13 additions & 0 deletions benchmark/data-table.r
@@ -0,0 +1,13 @@
library(data.table)

csv1 = fread('majestic_million.csv')
csv2 = fread('majestic_million_diff.csv')

setkey(csv1,id)
setkey(csv2,id)

result <- merge(csv2, csv1, all.x=TRUE)

diff <- result[result$"col-1.x" != result$"col-1.y" | result$"col-2.x" != result$"col-2.y" | result$"col-3.x" != result$"col-3.y" | result$"col-4.x" != result$"col-4.y" | result$"col-5.x" != result$"col-5.y" | result$"col-6.x" != result$"col-6.y" | result$"col-7.x" != result$"col-7.y" | result$"col-8.x" != result$"col-8.y" | result$"col-9.x" != result$"col-9.y" | result$"col-10.x" != result$"col-10.y" | result$"col-11.x" != result$"col-11.y"]

diff
