forked from tlwg/libdatrie
-
Notifications
You must be signed in to change notification settings - Fork 0
/
README.migration
182 lines (135 loc) · 7.23 KB
/
README.migration
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
MIGRATION FROM 0.1.X TO 0.2.X
0.2.x breaks 0.1.x interoperability in many ways, to allow more use cases, and
to provide more storage capacity.
1. Binary Data Changes
1.1 All Trie Data in Single File
No more splitting of a trie into '{trie-name}.sbm', '{trie-name}.br' and
'{trie-name}.tl'. All parts are now stored in a single file, '{trie-name}.tri'.
Note, however, that a '{trie-name}.abm' (a renamed version of '{trie-name}.sbm'
after Unicode support) is still needed on first creation. But once created,
the '{trie-name}.tri' will incorporate the alphabet map data, and no
'{trie-name}.abm' is required in later uses. It will even be ignored if exists.
1.2 32-Bit Node Index
To accommodate larger word lists, trie node indices are now 32 bits, instead of
16 bits. This means 32,767 times capacity compared to the old format.
Therefore, the data size are doubled in general when migrating from old format,
but it can now hold exponentially more entries.
In addition, the tail block lengths are now 16 bits, instead of 8 bits, making
it possible to store longer suffixes, for dictionaries of extremely long words.
1.3 No Backward Compatibility
For simplicity of the code, it was decided not to read/write old format files.
If you still prefer using the old format, just stay with the old version. If
you like to gain more support from the new version, you can migrate your old
data by first dumping your dictionary with 0.1.x trietool into text file and
then creating the new dictionary with the dumped word list. Or if you already
have the word list, that makes things a lot easier. Just create the dictionary
with the new trietool.
Data Migration Steps:
a. If you have the word list source, just skip to next step. Otherwise, you
can dump the old data with 0.1.x trietool:
$ trietool {trie-name} list > words.lst
b. Prepare '{trie-name}.abm', listing ranges of characters used in the word
list, in terms of Unicode values. For example, for an English and Thai
dictionary:
[0x0041,0x005a]
[0x0061,0x007a]
[0x0e01,0x0e3a]
[0x0e40,0x0e4e]
c. Generate new trie with 0.2.x trietool-0.2. For example:
$ trietool-0.2 {trie-name} add-list -e TIS-620 words.lst
In this example, the '-e TIS-620' indicates that the 'words.lst' file
contains TIS-620 encoded text, which is most likely for word lists dumped
from the old trie with 8-bit Thai character code as the key encoding.
Replace it with your old encoding as necessary, such as ISO-8859-1 or the
like. If '-e' option is omitted, current locale encoding is assumed.
See trietool-0.2 man page for details.
2. API Changes
2.1 Non-File Trie Usage
In datrie 0.1.x, every trie was associated with a set of files. Now, this is
not only reduced to a single file, but zero file is also possible. That is, a
new trie can be created in memory, added words, removed words, queried words,
and then disposed without writing data to any file. Meanwhile, saving to file
is still possible.
Scenario 1: Loading trie from file, using it read-only.
1a. Open trie with trie_new_from_file(path).
1b. Use it.
1c. On exit:
- Close it with trie_free().
Scenario 2: Loading trie from file, updating file when finished.
2a. Open trie with trie_new_from_file(path).
2b. Use/update it.
2c. On exit:
- If trie_is_dirty(), then trie_save().
- Close it with trie_free().
Scenario 3: Create a new trie, saving it when finished.
3a. Prepare an alphabet map:
- Create new alphabet map with alpha_map_new().
- Add ranges with alpha_map_add_range().
3b. Create new trie with trie_new(alpha_map).
3c. Free the alphabet map with alpha_map_free().
3d. Use/update the trie.
3e. On exit:
- If trie_is_dirty(), then trie_save().
- Close the trie with trie_free().
Scenario 4: Create temporary trie, disposing it when finished.
4a. Prepare an alphabet map:
- Create new alphabet map with alpha_map_new().
- Add ranges with alpha_map_add_range().
4b. Create new trie with trie_new(alpha_map).
4c. Free the alphabet map with alpha_map_free().
4d. Use/update the trie.
4e. On exit:
- Close the trie with trie_free().
2.2 No More SBTrie
In datrie 0.1.x, SBTrie provided a wrapper to Trie implementation, converting
between real character codes and trie internal codes. This was for compactness,
as continuous character code range can cause more compact sparse table
allocation, while the real alphabet set needs not be continuous. However, in
datrie 0.2.x, this mapping feature has been merged into Trie class, to reduce
call layers. So, there is no SBTrie any more. You can call Trie directly in the
same way you called SBTrie in 0.1.x.
2.3 Characters are Now Unicode
datrie was previously planned to support multiple kinds of character encodings,
with only single-byte encoding as the available implementation for the time
being.
However, as there have been many requests for Unicode support, it seems to be
the most useful choice, into which all other encodings can be converted.
Furthermore, as datrie is mostly used in program's critical path, having too
many layers can contribute to being a bottleneck. So, only Unicode is accepted
in this version. It's now the application's duty to convert its keys into
Unicode before passing them to datrie. This should also allow any kind of
possible caching.
2.4 New Public APIs for Alphabet Map
As AlphaMap (alphabet map) is now necessary for creating a new empty trie, the
APIs for manipulating this data is now exposed to the public scope. See
<datrie/alpha-map.h> for the details.
2.5 Extensions to TrieState
trie_state_copy()
As part of performance profiling, allocating and freeing TrieState is found
to eat up CPU time at some degree. So, reusing existing TrieState where
possible does help. This function is added for copying TrieState data, as a
better alternative than trie_state_clone().
trie_state_is_single()
Sometimes, checking if a TrieState is a leaf state is too expensive for
program's critical path. It needs to check both whether the state is in a
non-branching path, that is, whether it is in a suffix node, and whether it
can be walked by a terminator. When a program only needs to check for the
former fact and not the latter, this method is at disposal.
3. Changes to TrieTool
3.1 Renaming
To allow co-existence with 0.1.x trietool, 0.2.x trietool is named
trietool-0.2.
3.2 '*.abm' Instead of '*.sbm'
As SBTrie has been eliminated in datrie 0.2.x, the corresponding '*.sbm'
(single-byte map) input file is also obsoleted. It is now renamed to '*.abm'
(alphabet map). Its format is also redefined to be Unicode-based. All alphabet
character ranges are defined in Unicode.
Besides, the '*.abm' file is required only once at trie creation time. It is
not needed at deployment, as the alphabet map is already included in the single
trie file.
3.3 Encoding Conversion Support
As datrie is now Unicode-based, conversion from other encodings can be useful.
This is possible for word list operations, namely add-list and delete-list, by
the additional '-e {enc}' or '--encoding {enc}' option. This option specifies
the character encoding of the word list file. And trietool-0.2 will convert the
contents to Unicode on-the-fly.