Merge pull request #9 from aswinkarthik93/improve-performance
Improve performance
Aswin Karthik authored Apr 29, 2018
2 parents 2f97dc6 + beadab5 commit 0b29a09
Showing 21 changed files with 550 additions and 422 deletions.
4 changes: 3 additions & 1 deletion .gitignore
@@ -19,4 +19,6 @@ out/

# End of https://www.gitignore.io/api/go

vendor/
vendor/

majestic_million*.csv
42 changes: 30 additions & 12 deletions README.md
@@ -24,11 +24,10 @@ Csvdiff is a difftool to compute changes between two csv files.
## Usage

```bash
$ csvdiff run --base base.csv --delta delta.csv
$ csvdiff base.csv delta.csv
# Additions: 1
...

# Modifications: 20
# Rows:
...
```

@@ -37,29 +36,29 @@ $ csvdiff run --base base.csv --delta delta.csv
- For MacOS

```bash
curl -sL https://github.com/aswinkarthik93/csvdiff/releases/download/v0.1.2/csvdiff_0.1.2_darwin_amd64.tar.gz | tar xfz -
curl -sL https://github.com/aswinkarthik93/csvdiff/releases/download/v1.0.0/csvdiff_1.0.0_darwin_amd64.tar.gz | tar xfz -
```

- For centos

```bash
yum install https://github.com/aswinkarthik93/csvdiff/releases/download/v0.1.2/csvdiff_0.1.2_linux_64-bit.rpm
yum install https://github.com/aswinkarthik93/csvdiff/releases/download/v1.0.0/csvdiff_1.0.0_linux_64-bit.rpm
```

- For debian

```
curl -sL https://github.com/aswinkarthik93/csvdiff/releases/download/v0.1.2/csvdiff_0.1.2_linux_64-bit.deb -O
curl -sL https://github.com/aswinkarthik93/csvdiff/releases/download/v1.0.0/csvdiff_1.0.0_linux_64-bit.deb -O
dpkg --install csvdiff_*_linux_64-bit.deb
```

- For Linux

```bash
curl -sL https://github.com/aswinkarthik93/csvdiff/releases/download/v0.1.2/csvdiff_0.1.2_linux_amd64.tar.gz | tar xfz -
curl -sL https://github.com/aswinkarthik93/csvdiff/releases/download/v1.0.0/csvdiff_1.0.0_linux_amd64.tar.gz | tar xfz -
```

- For [Windows](https://github.com/aswinkarthik93/csvdiff/releases/download/v0.1.2/csvdiff_0.1.2_windows_amd64.tar.gz)
- For [Windows](https://github.com/aswinkarthik93/csvdiff/releases/download/v1.0.0/csvdiff_1.0.0_windows_amd64.tar.gz)

- Build using Go

@@ -85,22 +84,41 @@ go get -u github.com/aswinkarthik93/csvdiff

## Miscellaneous features

- By default, it marks each row as ADDED or MODIFIED by appending a new column at the end.

```bash
% csvdiff examples/base-small.csv examples/delta-small.csv
Additions 1
Modifications 1
Rows:
24564,907,completely-newsite.com,com,19827,32902,completely-newsite.com,com,1621,909,19787,32822,ADDED
69,1048,aol.com,com,97543,225532,aol.com,com,70,49,97328,224491,MODIFIED
```

- The `--primary-key` is an integer array. Specify comma-separated positions if the table has a compound key. Using this primary key, it can figure out modifications. If the primary key changes, it is an addition.

```bash
% csvdiff run --base base.csv --delta delta.csv --primary-key 0,1
% csvdiff base.csv delta.csv --primary-key 0,1
```

- If you want to compare only a few columns in the csv when computing the hash,

```bash
% csvdiff run --base base.csv --delta delta.csv --primary-key 0,1 --value-columns 2
% csvdiff base.csv delta.csv --primary-key 0,1 --columns 2
```

- **Additions** and **Modifications** can be written to files directly instead of STDOUT.
- Supports JSON format for post processing

```bash
% csvdiff run --base base.csv --delta delta.csv --additions additions.csv --modifications modifications.csv
% csvdiff examples/base-small.csv examples/delta-small.csv --format json
{
"Additions": [
"24564,907,completely-newsite.com,com,19827,32902,completely-newsite.com,com,1621,909,19787,32822"
],
"Modifications": [
"69,1048,aol.com,com,97543,225532,aol.com,com,70,49,97328,224491"
]
}
```

## Build locally
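
The primary-key rule described in the README hunk above (an unseen key is an addition, a known key with different values is a modification) can be illustrated with a small sketch. This is not csvdiff's implementation: the helper names `keyOf`, `valueOf`, and `classify` are hypothetical, and a plain map lookup stands in for the hashing the README mentions.

```go
package main

import (
	"fmt"
	"strings"
)

// keyOf joins the columns at the given positions into one lookup key.
// A NUL separator avoids collisions such as ("a,b", "c") vs ("a", "b,c").
// Positions are assumed to be valid indexes into row.
func keyOf(row []string, positions []int) string {
	parts := make([]string, 0, len(positions))
	for _, p := range positions {
		parts = append(parts, row[p])
	}
	return strings.Join(parts, "\x00")
}

// valueOf returns the part of the row that is compared for modifications:
// the selected columns, or the whole row when none are selected.
func valueOf(row []string, columns []int) string {
	if len(columns) == 0 {
		return strings.Join(row, "\x00")
	}
	return keyOf(row, columns)
}

// classify (hypothetical helper) splits delta rows into additions and
// modifications relative to base, using primaryKey to match rows.
func classify(base, delta [][]string, primaryKey, columns []int) (additions, modifications [][]string) {
	seen := make(map[string]string, len(base))
	for _, row := range base {
		seen[keyOf(row, primaryKey)] = valueOf(row, columns)
	}
	for _, row := range delta {
		old, ok := seen[keyOf(row, primaryKey)]
		switch {
		case !ok:
			additions = append(additions, row) // key not present in base
		case old != valueOf(row, columns):
			modifications = append(modifications, row) // same key, different values
		}
	}
	return additions, modifications
}

func main() {
	base := [][]string{{"1", "a", "x"}, {"2", "b", "y"}}
	delta := [][]string{{"1", "a", "x"}, {"2", "b", "z"}, {"3", "c", "w"}}

	added, modified := classify(base, delta, []int{0}, nil)
	fmt.Println(added, modified) // prints: [[3 c w]] [[2 b z]]
}
```

Passing `[]int{1}` as the compared columns in this sketch would report no modifications, because only column 1 is compared and it is unchanged for key `2`.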
47 changes: 25 additions & 22 deletions benchmark/README.md
@@ -1,26 +1,29 @@
## Comparison with other tools
# Comparison with other tools


### Setup
## Setup

* Using the majestic million data. (Source in credits section)
* Both files have 998390 rows and 12 columns.
* Only one modification between both files.
* Ran on Processor: Intel Core i7 2.5 GHz 4 cores 16 GB RAM

0. csvdiff (this tool) : *0m2.085s*

```bash
time csvdiff run -b majestic_million.csv -d majestic_million_diff.csv
## Baseline

# Additions: 0
# Modifications: 1
0. csvdiff (this tool) : *0m1.159s*

real 0m2.085s
user 0m3.861s
sys 0m0.340s
```bash
time csvdiff majestic_million.csv majestic_million_diff.csv
Additions 0
Modifications 1
...

real 0m1.159s
user 0m2.167s
sys 0m0.222s
```

## Other tools

1. [data.table](https://github.com/Rdatatable/data.table) : *0m4.284s*

* Join both csvs using `id` column.
@@ -71,13 +74,13 @@ $ cd ./pkg/digest
$ go test -bench=. -v -benchmem -benchtime=5s -cover
```

| | | | | |
| ---------------------------- | ---------- | ----------------------- | -------------------- | ------------------- |
| BenchmarkCreate1-8 | 2000000 | 5967 ns/op | 5474 B/op | 21 allocs/op |
| BenchmarkCreate10-8 | 500000 | 16251 ns/op | 10889 B/op | 94 allocs/op |
| BenchmarkCreate100-8 | 100000 | 114219 ns/op | 67139 B/op | 829 allocs/op |
| BenchmarkCreate1000-8 | 10000 | 1042723 ns/op | 674239 B/op | 8078 allocs/op |
| BenchmarkCreate10000-8 | 1000 | 10386850 ns/op | 6533806 B/op | 80306 allocs/op |
| BenchmarkCreate100000-8 | 100 | 108740944 ns/op | 64206718 B/op | 804208 allocs/op |
| BenchmarkCreate1000000-8 | 5 | 1161730558 ns/op | 672048142 B/op | 8039026 allocs/op |
| BenchmarkCreate10000000-8 | 1 | 12721982424 ns/op | 6549111872 B/op| 80308455 allocs/op |
```
BenchmarkCreate1-8 200000 31794 ns/op 116163 B/op 24 allocs/op
BenchmarkCreate10-8 200000 43351 ns/op 119993 B/op 79 allocs/op
BenchmarkCreate100-8 50000 142645 ns/op 160577 B/op 634 allocs/op
BenchmarkCreate1000-8 10000 907308 ns/op 621694 B/op 6085 allocs/op
BenchmarkCreate10000-8 1000 7998083 ns/op 5117977 B/op 60345 allocs/op
BenchmarkCreate100000-8 100 81260585 ns/op 49106849 B/op 604563 allocs/op
BenchmarkCreate1000000-8 10 788485738 ns/op 520115434 B/op 6042650 allocs/op
BenchmarkCreate10000000-8 1 7878009695 ns/op 5029061632 B/op 60346535 allocs/op
```
69 changes: 29 additions & 40 deletions cmd/config.go
@@ -1,9 +1,8 @@
package cmd

import (
"io"
"log"
"os"
"errors"
"strings"

"github.com/aswinkarthik93/csvdiff/pkg/digest"
)
@@ -18,10 +17,7 @@ func init() {
type Config struct {
PrimaryKeyPositions []int
ValueColumnPositions []int
Base string
Delta string
Additions string
Modifications string
Format string
}

// GetPrimaryKeys is to return the --primary-key flags as digest.Positions array.
@@ -40,45 +36,38 @@ func (c *Config) GetValueColumns() digest.Positions {
return []int{}
}

// GetBaseReader returns an io.Reader for the base file.
func (c *Config) GetBaseReader() io.Reader {
return getReader(c.Base)
}

// GetDeltaReader returns an io.Reader for the delta file.
func (c *Config) GetDeltaReader() io.Reader {
return getReader(c.Delta)
}

// AdditionsWriter gives the output stream for the additions in delta csv.
func (c *Config) AdditionsWriter() io.WriteCloser {
return getWriter(c.Additions)
}

// ModificationsWriter gives the output stream for the modifications in delta csv.
func (c *Config) ModificationsWriter() io.WriteCloser {
return getWriter(c.Modifications)
}
// Validate validates the config object
// and returns error if not valid.
func (c *Config) Validate() error {
allFormats := []string{rowmark, jsonFormat}

func getReader(filename string) io.Reader {
file, err := os.Open(filename)
formatValid := false
for _, format := range allFormats {
if strings.ToLower(c.Format) == format {
formatValid = true
}
}

if err != nil {
log.Fatal(err)
if !formatValid {
return errors.New("Specified format is not valid")
}

return file
return nil
}

func getWriter(outputStream string) io.WriteCloser {
if outputStream != "STDOUT" {
file, err := os.Create(outputStream)

if err != nil {
log.Fatal(err)
}
const (
rowmark = "rowmark"
jsonFormat = "json"
)

return file
// Formatter instantiates a new formatter
// based on config.Format
func (c *Config) Formatter() Formatter {
format := strings.ToLower(c.Format)
if format == rowmark {
return &RowMarkFormatter{}
} else if format == jsonFormat {
return &JSONFormatter{}
}
return os.Stdout
return &RowMarkFormatter{}
}
56 changes: 49 additions & 7 deletions cmd/config_test.go
@@ -1,30 +1,72 @@
package cmd
package cmd_test

import (
"testing"

"github.com/aswinkarthik93/csvdiff/cmd"
"github.com/aswinkarthik93/csvdiff/pkg/digest"
"github.com/stretchr/testify/assert"
)

func TestPrimaryKeyPositions(t *testing.T) {
config := Config{PrimaryKeyPositions: []int{0, 1}}
config := cmd.Config{PrimaryKeyPositions: []int{0, 1}}
assert.Equal(t, digest.Positions([]int{0, 1}), config.GetPrimaryKeys())

config = Config{PrimaryKeyPositions: []int{}}
config = cmd.Config{PrimaryKeyPositions: []int{}}
assert.Equal(t, digest.Positions([]int{0}), config.GetPrimaryKeys())

config = Config{}
config = cmd.Config{}
assert.Equal(t, digest.Positions([]int{0}), config.GetPrimaryKeys())
}

func TestValueColumnPositions(t *testing.T) {
config := Config{ValueColumnPositions: []int{0, 1}}
config := cmd.Config{ValueColumnPositions: []int{0, 1}}
assert.Equal(t, digest.Positions([]int{0, 1}), config.GetValueColumns())

config = Config{ValueColumnPositions: []int{}}
config = cmd.Config{ValueColumnPositions: []int{}}
assert.Equal(t, digest.Positions([]int{}), config.GetValueColumns())

config = Config{}
config = cmd.Config{}
assert.Equal(t, digest.Positions([]int{}), config.GetValueColumns())
}

func TestConfigValidate(t *testing.T) {
var config *cmd.Config

config = &cmd.Config{}
assert.Error(t, config.Validate())

config = &cmd.Config{Format: "rowmark"}
assert.NoError(t, config.Validate())

config = &cmd.Config{Format: "rowMARK"}
assert.NoError(t, config.Validate())

config = &cmd.Config{Format: "json"}
assert.NoError(t, config.Validate())
}

func TestDefaultConfigFormatter(t *testing.T) {
config := &cmd.Config{}

formatter, ok := config.Formatter().(*cmd.RowMarkFormatter)

assert.True(t, ok)
assert.NotNil(t, formatter)
}

func TestConfigFormatter(t *testing.T) {
var config *cmd.Config
var formatter cmd.Formatter
var ok bool

config = &cmd.Config{Format: "rowmark"}
formatter, ok = config.Formatter().(*cmd.RowMarkFormatter)
assert.True(t, ok)
assert.NotNil(t, formatter)

config = &cmd.Config{Format: "json"}
formatter, ok = config.Formatter().(*cmd.JSONFormatter)
assert.True(t, ok)
assert.NotNil(t, formatter)
}
48 changes: 48 additions & 0 deletions cmd/formatter.go
@@ -0,0 +1,48 @@
package cmd

import (
"encoding/json"
"fmt"
"io"

"github.com/aswinkarthik93/csvdiff/pkg/digest"
)

// Formatter defines the interface through which differences
// can be formatted and displayed
type Formatter interface {
Format(digest.Difference, io.Writer)
}

// RowMarkFormatter formats diff by marking each row as
// ADDED/MODIFIED. It appends the mark to each row as a new column.
type RowMarkFormatter struct{}

// Format writes the diff to w
func (f *RowMarkFormatter) Format(diff digest.Difference, w io.Writer) {
fmt.Fprintf(w, "Additions %d\n", len(diff.Additions))
fmt.Fprintf(w, "Modifications %d\n", len(diff.Modifications))
fmt.Fprintf(w, "Rows:\n")

for _, added := range diff.Additions {
fmt.Fprintf(w, "%s,%s\n", added, "ADDED")
}

for _, modified := range diff.Modifications {
fmt.Fprintf(w, "%s,%s\n", modified, "MODIFIED")
}
}

// JSONFormatter formats the diff as a JSON object
type JSONFormatter struct{}

// Format writes the diff to w as JSON
func (f *JSONFormatter) Format(diff digest.Difference, w io.Writer) {
data, err := json.MarshalIndent(diff, "", " ")

if err != nil {
panic(err)
}

w.Write(data)
}