Skip to content

Latest commit

 

History

History
110 lines (77 loc) · 3.35 KB

README.md

File metadata and controls

110 lines (77 loc) · 3.35 KB

mangle

-- import "github.com/grugnog/mangle"

Package mangle is a sanitization / data masking library for Go (golang).

Purpose

This library provides functionality to sanitize text and HTML data. This can be integrated into tools that export or manipulate databases/files so that confidential data is not exposed to staging, development or testing systems.

Getting Started

Install mangle:

go get github.com/grugnog/mangle

Try the command line tool:

go get github.com/grugnog/mangle/manglefile

# Generate a list of words using your system dictionary and download "War and Peace" for testing.
aspell -d en dump master | aspell -l en expand > corpus.txt
wget http://www.gutenberg.org/cache/epub/2600/pg2600.txt

# Run the sample text through manglefile via stdin and stdout.
cat pg2600.txt | manglefile -corpus=corpus.txt -secret=replace-with-a-secure-passphrase | less

For basic usage of mangle as a Go library for strings, see the Example below. For usage using io.Reader/io.Writer for text and html, see the manglefile source code.

Description

Secure - every non-punctuation, non-tag word is replaced, with no reuse of source words.

Fast - around 350,000 words/sec ("War and Peace" mangled in < 2s).

Deterministic - words are selected from a corpus deterministically - subsequent runs will result in the same word replacements as long as the user secret is not changed. This allows automated tests to run against the sanitized data, as well as for efficient rsync's of mangled versions of large slowly changing data sets.

Accurate - maintains a high level of resemblance to the source text, to allow for realistic testing of search indexes, page layout etc. Replacement words are all natural words from a user defined corpus. Source text word length, punctuation, title case and all caps, HTML tags and attributes are maintained.

Usage

func BuildCorpus

func BuildCorpus(scanner *bufio.Scanner) ([255][]string, error)

BuildCorpus is a helper function that reads a bufio.Scanner of words and returns an array of word lengths, each containing an array of words of that length.

func ReadCorpus

func ReadCorpus(filepath string) ([255][]string, error)

ReadCorpus is a helper function that opens and reads a corpus file of words and returns an array of word lengths, each containing an array of words of that length.

type Mangle

type Mangle struct {
	// Corpus of words to use as replacements. An array of word lengths, each
	// containing an array of words of that length.
	Corpus [255][]string
	// A sufficiently long secret, used as a salt so rainbow tables cannot be
	// used to reverse the hashes.
	Secret string
}

Mangle is used to configure an instance prior to mangling.

func (Mangle) MangleHTML

func (m Mangle) MangleHTML(r io.Reader, w io.Writer) error

MangleHTML operates on HTML using an io interface, preserving all HTML tags (including tag attributes), but mangling all content around and between tags.

func (Mangle) MangleIO

func (m Mangle) MangleIO(r io.Reader, w io.Writer) error

MangleIO operates on an io interface, parsing as plain text, and is preferable for long strings.

func (Mangle) MangleString

func (m Mangle) MangleString(s string) string

MangleString operates on strings, and is preferable if you have many short strings to operate on.