From 80c9ab9b3f3c28291bb3e2430f868d3ec776b96f Mon Sep 17 00:00:00 2001
From: aswinkarthik93
Date: Sat, 28 Apr 2018 19:10:52 +0530
Subject: [PATCH] Add comparisons

---
 README.md              |  4 +--
 benchmark/README.md    | 66 +++++++++++++++++++++++++++++++++++++++++-
 benchmark/data-table.r | 13 +++++++++
 3 files changed, 80 insertions(+), 3 deletions(-)
 create mode 100644 benchmark/data-table.r

diff --git a/README.md b/README.md
index b8c5a81..8fc2a75 100644
--- a/README.md
+++ b/README.md
@@ -12,10 +12,10 @@ A Blazingly fast diff tool for comparing csv files.
 
 Csvdiff is a difftool to compute changes between two csv files.
 
-* It is not a traditional diff tool. It is most suitable for comparing csv files dumped from database tables.
+* It is not a traditional diff tool. It is most suitable for comparing csv files dumped from database tables. GNU diff tool is orders of magnitude faster on comparing line by line.
 * Supports specifying group of columns as primary-key.
 * Supports selective comparison of fields in a row.
-* Process a million records csv in under 2 seconds
+* Compares csvs of million records csv in under 2 seconds. Comparisons and benchmarks [here](/benchmark).
 
 ## Demo
 
diff --git a/benchmark/README.md b/benchmark/README.md
index 0313777..e4d4dcd 100644
--- a/benchmark/README.md
+++ b/benchmark/README.md
@@ -1,4 +1,68 @@
-## Benchmark Results
+## Comparison with other tools
+
+
+### Setup
+
+* Using the majestic million data. (Source in credits section)
+* Both files have 998390 rows and 12 columns.
+* Only one modification between both files.
+* Ran on Processor: Intel Core i7 2.5 GHz 4 cores 16 GB RAM
+
+0. csvdiff (this tool) : *0m2.085s*
+
+```bash
+time csvdiff run -b majestic_million.csv -d majestic_million_diff.csv
+
+# Additions: 0
+# Modifications: 1
+
+real 0m2.085s
+user 0m3.861s
+sys 0m0.340s
+```
+
+1. [data.table](https://github.com/Rdatatable/data.table) : *0m4.284s*
+
+    * Join both csvs using `id` column.
+    * Check inequality between both columns
+    * Rscript in [data-table.r](/benchmark/data-table.r) (Can it be written better? New to R)
+
+```bash
+time Rscript data-table.r
+
+real 0m4.284s
+user 0m3.887s
+sys 0m0.284s
+```
+
+2. [csvdiff](https://pypi.org/project/csvdiff/) written in Python : *0m48.115s*
+
+```bash
+time csvdiff --style=summary id majestic_million.csv majestic_million_diff.csv
+0 rows removed (0.0%)
+0 rows added (0.0%)
+1 rows changed (0.0%)
+
+real 0m48.115s
+user 0m42.895s
+sys 0m3.948s
+```
+
+3. GNU diff (Fastest) : *0m0.297s*
+
+    * Seems the fastest. Couldn't even come close here.
+    * However, it does line by line diff. Does not support compound keys of a csv or selective compare of columns. Hence the disclaimer, cannot be used a generic diff tool.
+    * On another note, lets see if we can reach this.
+
+```bash
+time diff majestic_million.csv majestic_million_diff.csv
+
+real 0m0.297s
+user 0m0.144s
+sys 0m0.147s
+```
+
+## Go Benchmark Results
 
 Benchmark test can be found [here](https://github.com/aswinkarthik93/csvdiff/blob/master/pkg/digest/digest_benchmark_test.go).
 
diff --git a/benchmark/data-table.r b/benchmark/data-table.r
new file mode 100644
index 0000000..6dd8fcf
--- /dev/null
+++ b/benchmark/data-table.r
@@ -0,0 +1,13 @@
+library(data.table)
+
+csv1 = fread('majestic_million.csv')
+csv2 = fread('majestic_million_diff.csv')
+
+setkey(csv1,id)
+setkey(csv2,id)
+
+result <- merge(csv2, csv1, all.x=TRUE)
+
+diff <- result[result$"col-1.x" != result$"col-1.y" | result$"col-2.x" != result$"col-2.y" | result$"col-3.x" != result$"col-3.y" | result$"col-4.x" != result$"col-4.y" | result$"col-5.x" != result$"col-5.y" | result$"col-6.x" != result$"col-6.y" | result$"col-7.x" != result$"col-7.y" | result$"col-8.x" != result$"col-8.y" | result$"col-9.x" != result$"col-9.y" | result$"col-10.x" != result$"col-10.y" | result$"col-11.x" != result$"col-11.y"]
+
+diff