dtl
is the diff template library written in C++. The name of template is derived C++'s Template.
- Features
- Getting started
- Compare two strings
- Compare two data has arbitrary type
- Merge three sequences
- Patch function
- Difference as Unified Format
- Compare large sequences
- Unserious difference
- Calculate only Edit Distance
- Algorithm
- Computational complexity
- Comparison when difference between two sequences is very large
- Implementations with various programming languages
- Examples
- strdiff
- intdiff
- unidiff
- unistrdiff
- strdiff3
- intdiff3
- patch
- fpatch
- Running tests
- Building test programs
- Running test programs
- Old commit histories
- License
dtl
provides the functions for comparing two sequences have arbitrary type. But sequences must support random access_iterator.
To start using this library, all you need to do is include dtl.hpp
.
#include "dtl/dtl.hpp"
First of all, calculate the difference between two strings.
typedef char elem;
typedef std::string sequence;
sequence A("abc");
sequence B("abd");
dtl::Diff< elem, sequence > d(A, B);
d.compose();
When the above code is run, dtl
calculates the difference between A and B as Edit Distance and LCS and SES.
The meaning of these three terms is below.
Edit Distance | Edit Distance is numerical value for declaring a difference between two sequences. |
---|---|
LCS | LCS stands for Longest Common Subsequence. |
SES | SES stands for Shortest Edit Script. I mean SES is the shortest course of action for tranlating one sequence into another sequence. |
If one sequence is "abc" and another sequence is "abd", Edit Distance and LCS and SES is below.
Edit Distance | 2 |
---|---|
LCS | ab |
SES | C a C b D c A d |
- 「C」:Common
- 「D」:Delete
- 「A」:ADD
If you want to know in more detail, please see examples/strdiff.cpp.
This calculates Edit Distance and LCS and SES of two strings received as command line arguments and prints each.
When one string is "abc" and another string "abd", the output of strdiff
is below.
$ ./strdiff abc abd
editDistance:2
LCS:ab
SES
a
b
-c
+d
$
dtl
can compare data has aribtrary type because of the C++'s template.
But the compared data type must support the random access_iterator.
In the previous example, the string data compared,
dtl
can also compare two int vectors like the example below.
int a[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
int b[10] = {3, 5, 1, 4, 5, 1, 7, 9, 6, 10};
std::vector<int> A(&a[0], &a[10]);
std::vector<int> B(&b[0], &b[10]);
dtl::Diff< int > d(A, B);
d.compose();
If you want to know in more detail, please see examples/intdiff.cpp.
dtl
has the diff3 function.
This function is that dtl
merges three sequences.
Additionally dtl
detects the confliction.
typedef char elem;
typedef std::string sequence;
sequence A("qqqabc");
sequence B("abc");
sequence C("abcdef");
dtl::Diff3<elem, sequence> diff3(A, B, C);
diff3.compose();
if (!diff3.merge()) {
std::cerr << "conflict." << std::endl;
return -1;
}
std::cout << "result:" << diff3.getMergedSequence() << std::endl;
When the above code is run, the output is below.
result:qqqabcdef
If you want to know in more detail, please see examples/strdiff3.cpp.
dtl
can also translates one sequence to another sequence with SES.
typedef char elem;
typedef std::string sequence;
sequence A("abc");
sequence B("abd");
dtl::Diff<elem, sequence> d(A, B);
d.compose();
string s1(A);
string s2 = d.patch(s1);
When the above code is run, s2 becomes "abd". The SES of A("abc") and B("abd") is below.
Common a
Common b
Delete c
Add d
The patch function translates a sequence as argument with SES. For this example, "abc" is translated to "abd" with above SES.
Please see dtl's header files about the data structure of SES.
dtl
can also treat difference as Unified Format. See the example below.
typedef char elem;
typedef std::string sequence;
sequence A("acbdeaqqqqqqqcbed");
sequence B("acebdabbqqqqqqqabed");
dtl::Diff<elem, sequence > d(A, B);
d.compose(); // construct an edit distance and LCS and SES
d.composeUnifiedHunks(); // construct a difference as Unified Format with SES.
d.printUnifiedFormat(); // print a difference as Unified Format.
The difference as Unified Format of "acbdeaqqqqqqqcbed" and "acebdabbqqqqqqqabed" is below.
@@ -1,9 +1,11 @@
a
c
+e
b
d
-e
a
+b
+b
q
q
q
@@ -11,7 +13,7 @@
q
q
q
-c
+a
b
e
d
The data structure Unified Format is below.
/**
* Structure of Unified Format Hunk
*/
template <typename sesElem>
struct uniHunk {
int a, b, c, d; // @@ -a,b +c,d @@
std::vector<sesElem> common[2]; // anteroposterior commons on changes
std::vector<sesElem> change; // changes
int inc_dec_count; // count of increace and decrease
};
The actual blocks of Unified Format is this structure's vector.
If you want to know in more detail, please see examples/unistrdiff.cpp and examples/unidiff.cpp and dtl's header files.
In addtion, dtl
has the function translates one sequence to another sequence with Unified Format.
typedef char elem;
typedef std::string sequence;
sequence A("abc");
sequence B("abd");
dtl::Diff<elem, sequence> d(A, B);
d.compose();
d.composeUnifiedHunks()
string s1(A);
string s2 = d.UniPatch(s1);
When the above code is run, s2 becomes "abd". The uniPatch function translates a sequence as argument with Unified Format blocks.
For this example, "abc" is translated to "abd" with the Unified Format block below.
@@ -1,3 +1,3 @@
a
b
-c
+d
When compare two large sequences, dtl
can optimizes the calculation of difference with the onHuge function.
This function is available when the compared data type is std::vector.
When you use this function, you may call this function before calling compose function.
typedef char elem;
typedef std::vector<elem> sequence;
sequence A;
sequence B;
/* ・・・ */
dtl::Diff< elem, sequence > d(A, B);
d.onHuge();
d.compose();
The calculation of difference is very heavy.
dtl
uses An O(NP) Sequence Comparison Algorithm.
Though this Algorithm is sufficiently fast, when difference between two sequences is very large,
the calculation of LCS and SES needs massive amounts of memory.
dtl
avoids above-described problem by dividing each sequence into plural subsequences
and joining the difference of each subsequence finally.
As this way repeats allocating massive amounts of memory,
dtl
provides other way. It is the way of calculating unserious difference.
For example, The normal SES of "abc" and "abd" is below.
Common a
Common b
Delete c
Add d
The unserious SES of "abc" and "abd" is below.
Delete a
Delete b
Delete c
Add a
Add b
Add d
Of course, when "abc" and "abd" are compared with dtl
, above difference is not derived.
dtl
calculates the unserious difference when dtl
judges the calculation of LCS and SES
needs massive amounts of memory and unserious difference function is ON.
dtl
joins the calculated difference before dtl
judges it and unserious difference finally.
As a result, all difference is not unserious difference when unserious difference function is ON.
When you use this function, you may call this function before calling compose function.
typedef char elem;
typedef std::string sequence;
sequence A("abc");
sequence B("abd");
dtl::Diff< elem, sequence > d(A, B);
d.onUnserious();
d.compose();
As using onOnlyEditDistance, dtl
calculates the only edit distance.
If you need only edit distance, you may use this function, because the calculation of edit distance is lighter than the calculation of LCS and SES.
When you use this function, you may call this function before calling compose function.
typedef char elem;
typedef std::string sequence;
sequence A("abc");
sequence B("abd");
dtl::Diff< elem, sequence > d(A, B);
d.onOnlyEditDistance();
d.compose();
The algorithm dtl
uses is based on "An O(NP) Sequence Comparison Algorithm" by described by Sun Wu, Udi Manber and Gene Myers.
An O(NP) Sequence Comparison Algorithm(following, Wu's O(NP) Algorithm) is the efficient algorithm for comparing two sequences.
The computational complexity of Wu's O(NP) Algorithm is averagely O(N+PD), in the worst case, is O(NP).
Calculating LCS and SES efficiently at any time is a little difficult.
Because that the calculation of LCS and SES needs massive amounts of memory when a difference between two sequences is very large.
The program uses that algorithm don't consider that will burst in the worst case.
dtl
avoids above-described problem by dividing each sequence into plural subsequences and joining the difference of each subsequence finally. (This feature is supported after version 0.04)
There are the Wu's O(NP) Algorithm implementations with various programming languages below.
https://github.com/cubicdaiya/onp
There are examples in dtl/examples.
dtl
uses SCons for building examples and tests. If you build and run examples and tests, install SCons.
strdiff
calculates a difference between two string sequences, but multi byte is not supported.
$ cd dtl/examples
$ scons strdiff
$ ./strdiff acbdeacbed acebdabbabed
editDistance:6
LCS:acbdabed
SES
a
c
+ e
b
d
- e
a
- c
b
+ b
+ a
+ b
e
d
$
intdiff
calculates a diffrence between two int arrays sequences.
$ cd dtl/examples
$ scons intdiff
$ ./intdiff # There are data in intdiff.cpp
1 2 3 4 5 6 7 8 9 10
3 5 1 4 5 1 7 9 6 10
editDistance:8
LCS: 3 4 5 7 9 10
SES
- 1
- 2
3
+ 5
+ 1
4
5
- 6
+ 1
7
- 8
9
+ 6
10
$
unidiff
calculates a diffrence between two text file sequences,
and output the difference between files with unified format.
$ cd dtl/examples
$ scons unidiff
$ cat a.txt
a
e
c
z
z
d
e
f
a
b
c
d
e
f
g
h
i
$ cat b.txt
a
d
e
c
f
e
a
b
c
d
e
f
g
h
i
$ ./unidiff a.txt b.txt
--- a.txt 2008-08-26 07:03:28 +0900
+++ b.txt 2008-08-26 03:02:42 +0900
@@ -1,11 +1,9 @@
a
-e
-c
-z
-z
d
e
+c
f
+e
a
b
c
$
unistrdiff
calculates a diffrence between two string sequences.
and output the difference between strings with unified format.
$ cd dtl/examples
$ scons unistrdiff
$ ./unistrdiff acbdeacbed acebdabbabed
editDistance:6
LCS:acbdabed
@@ -1,10 +1,12 @@
a
c
+e
b
d
-e
a
-c
b
+b
+a
+b
e
d
$
strdiff3
merges three string sequence and output the merged sequence.
When the confliction has occured, output the string "conflict.".
$ cd dtl/examples
$ scons strdiff3
$ ./strdiff3 qabc abc abcdef
result:qabcdef
$
There is a output below when conflict occured.
$ ./strdiff3 adc abc aec
conflict.
$
intdiff3
merges three integer sequence(vector) and output the merged sequence.
$ cd dtl/examples
$ scons intdiff3
$ ./intdiff3
a:1 2 3 4 5 6 7 3 9 10
b:1 2 3 4 5 6 7 8 9 10
c:1 2 3 9 5 6 7 8 9 10
s:1 2 3 9 5 6 7 3 9 10
intdiff3 OK
$
patch
is the test program. Supposing that there are two strings is called by A and B,
patch
translates A to B with Shortest Edit Script or unified format difference.
$ cd dtl/examples
$ scons patch
$ ./patch abc abd
before:abc
after :abd
patch successed
before:abc
after :abd
unipatch successed
$
fpatch
is the test program. Supposing that there are two files is called by A and B,
fpatch
translates A to B with Shortest Edit Script or unified format difference.
$ cd dtl/examples
$ scons fpatch
$ cat a.txt
a
e
c
z
z
d
e
f
a
b
c
d
e
f
g
h
i
$ cat b.txt
$ cat b.txt
a
d
e
c
f
e
a
b
c
d
e
f
g
h
i
$ ./fpatch a.txt b.txt
fpatch successed
unipatch successed
$
dtl
uses googletest and SCons with testing dtl-self.
If you build test programs for dtl
, run scons
in test direcotry.
$ scons
If you run all tests for dtl
, run 'scons check' in test direcotry. (it is necessary that gtest is compiled)
$ scons check
If you run sectional tests, you may exeucte dtl_test
directly after you run scons
.
Following command is the example for testing only Strdifftest.
$ ./dtl_test --gtest_filter='Strdifftest.*'
--gtest-filters
is the function of googletest. googletest has many useful functions for testing software flexibly.
If you want to know other functions of googletest, run ./dtl_test --help
.
Please see cubicdaiya/dtl-legacy.
Please read the file COPYING.