==================================
Migrating from PyTables 1.x to 2.x
==================================
:Author: Francesc Alted i Abad
:Contact: [email protected]
:Author: Ivan Vilata i Balaguer
:Contact: [email protected]
This document describes a series of issues that you must keep in mind when
migrating from the PyTables 1.x series to the PyTables 2.x series.
New type system
===============
In PyTables 2.x all the data types for leaves are described through a couple
of classes:
- ``Atom``: Describes homogeneous types of the atomic components in ``*Array``
objects (``Array``, ``CArray``, ``EArray`` and ``VLArray``).
- ``Description``: Describes (possibly nested) heterogeneous types in
``Table`` objects.
So, in order to upgrade to the new type system, you must make the following
replacements:
- ``*Array.stype`` --> ``*Array.atom.type`` (PyTables type)
- ``*Array.type`` --> ``*Array.atom.dtype`` (NumPy type)
- ``*Array.itemsize`` --> ``*Array.atom.itemsize`` (the size of the item)
Furthermore, the PyTables types (previously called "string types") have
changed to better adapt to NumPy conventions. The following changes have been
applied:
- PyTables types are now written in lower case, so 'Type' becomes 'type'. For
example, 'Int64' now becomes 'int64'.
- 'CharType' --> 'string'
- 'Complex32', 'Complex64' --> 'complex64', 'complex128'. Note that the
numeric part of a 'complex' type refers now to the *size in bits* of the
type and not to the precision, as before.
See Appendix I of the Users' Manual on supported data types for more
information on the new PyTables types.
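The renaming rules above can be collected into a small helper. The following
is a minimal sketch, not part of PyTables: the name ``migrate_type_name`` is
hypothetical, and the function only encodes the rules listed in this section
(the 'CharType' and complex special cases, and lower-casing for everything
else).

```python
# Hypothetical helper (not part of PyTables): translate a 1.x "string type"
# name into its 2.x PyTables type name, using only the rules listed above.

# Names that do not follow the plain lower-casing rule.
_SPECIAL_CASES = {
    'CharType': 'string',
    # In 2.x the number is the total size in bits, not the precision.
    'Complex32': 'complex64',
    'Complex64': 'complex128',
}


def migrate_type_name(old_name):
    """Return the 2.x type name for a 1.x type name."""
    if old_name in _SPECIAL_CASES:
        return _SPECIAL_CASES[old_name]
    # General rule: 2.x type names are written in lower case.
    return old_name.lower()


print(migrate_type_name('Int64'))      # int64
print(migrate_type_name('Complex32'))  # complex64
```

The special cases are checked first because simply lower-casing 'Complex32'
would produce a name that is not one of the 2.x complex types listed above.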
Important changes in ``Atom`` specification
===========================================
- The ``dtype`` argument of ``EnumAtom`` and ``EnumCol`` constructors
has been replaced by the ``base`` argument, which can take a
full-blown atom, although it accepts bare PyTables types as well.
This is a *mandatory* argument now.
- ``vlstring`` pseudo-atoms used in ``VLArray`` nodes no longer imply UTF-8
  (nor any other) encoding; they only store and load *raw strings of bytes*.
  All encoding and decoding is left to the user. Be warned that reading old
  files may yield raw UTF-8 encoded strings, which may be converted back to
  Unicode in this way::

      unistr = vlarray[index].decode('utf-8')

  If you need to work with variable-length Unicode strings, you may want to
  use the new ``vlunicode`` pseudo-atom, which fully supports Unicode strings
  with no encoding hassles.
- Finally, ``Atom`` and ``Col`` are now abstract classes, so you can't use
them to create atoms or column definitions of an arbitrary type. If you
know the particular type you need, use the proper subclass; otherwise, use
the ``Atom.from_*()`` or ``Col.from_*()`` factory methods. See the section
on declarative classes in the reference.
You are also advised to avoid using the inheritance of atoms to check for
their kind or type; for that purpose, use their ``kind`` and ``type``
attributes.
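To illustrate what "encoding and decoding is left to the user" means for
``vlstring`` data, here is a minimal sketch that assumes Python 3 and uses
plain ``bytes`` objects as stand-ins for the raw values read from a file. No
PyTables call is involved, and the sample strings are made up.

```python
# Illustration only: plain byte strings stand in for the raw values that a
# ``vlstring`` array would return when reading an old file.
raw_rows = ['caf\u00e9'.encode('utf-8'), 'na\u00efve'.encode('utf-8')]

# Decoding is now the caller's job.
decoded = [raw.decode('utf-8') for raw in raw_rows]

# So is encoding, before storing text as raw bytes again.
reencoded = [text.encode('utf-8') for text in decoded]

# The round trip is lossless as long as the same encoding is used both ways.
assert reencoded == raw_rows

# The byte length and the character length differ for non-ASCII text.
print(len(raw_rows[0]), len(decoded[0]))  # 5 4
```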
New query system
================
- In-kernel conditions, since they are now based on Numexpr, must be written
  *as strings*. For example, a condition that in 1.x was stated as::

      result = [row['col2'] for row in table.where(table.cols.col1 == 1)]

  now should read::

      result = [row['col2'] for row in table.where('col1 == 1')]

  That means that complex selections are now possible::

      result = [ row['col2'] for row in
                 table.where('(col1 == 1) & (col3**4 > 1)') ]

- [Pro] For the same reason, conditions for indexed columns must be written as
  strings as well.
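Since conditions are now plain strings, they can be assembled with ordinary
string operations before being passed to ``Table.where()``. The following is
a minimal sketch; ``all_of`` is a hypothetical helper, not a PyTables
function.

```python
# Hypothetical helper (not part of PyTables): combine several condition
# strings into one in-kernel condition string.

def all_of(*conditions):
    """Join sub-conditions with ``&``, wrapping each one in parentheses.

    The parentheses matter: ``&`` binds more tightly than comparison
    operators such as ``==``, so ``col1 == 1 & col3 > 1`` would not mean
    what it appears to mean.
    """
    if not conditions:
        raise ValueError("at least one condition is required")
    return ' & '.join('(%s)' % cond for cond in conditions)


condition = all_of('col1 == 1', 'col3**4 > 1')
print(condition)  # (col1 == 1) & (col3**4 > 1)
```

The resulting string is the same one used in the complex selection example
above, so it could be passed to ``table.where(condition)`` in the same way.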
[Pro] New indexing system
=========================
The indexing system has been rewritten from scratch for PyTables 2.0
Pro Edition (http://www.pytables.com/moin/PyTablesPro). As a result, indexes
created with PyTables 1.x cannot be used by 2.x: you can still access the
actual data in your files, but you won't be able to take advantage of any
improvement in speed.
When using the Pro version, you will be offered the possibility to
automatically re-create indexes stored in PyTables 1.x format in the new 2.0
format by using the ``ptrepack`` utility.
New meanings for atom shape and ``*Array`` shape argument
=========================================================
With PyTables 1.x, the atom shape served different purposes depending on
the context in which it was used. For example, in ``createEArray()``, the
shape of the atom specified the *dataset shape* of the object on disk, while in
``CArray`` the same atom shape specified the *chunk shape* of the
dataset on disk. Moreover, for ``VLArray`` objects, the very same atom shape
specified the *type shape* of the data type. As you can see, all of this was
quite a mess.
Starting with PyTables 2.x, an ``Atom`` only specifies properties of the data
type (à la ``VLArray`` in 1.x). This leaves the door open for specifying
multidimensional data types (that can be part of another layer of
multidimensional datasets) in a consistent way across all the ``*Array``
objects in PyTables.
As a consequence of this, the ``File.createCArray()`` and
``File.createVLArray()`` methods have received new parameters that make it
possible to specify the shapes of the datasets as well as chunk sizes (in
fact, it is now possible to specify the latter for all the chunked leaves, see
below). Please keep this in mind during the migration process.
Another consequence is that, now that the meaning of the atom shape is clearly
defined, the atom has been chosen as the main object to describe homogeneous
data types in PyTables. See the Users' Manual for more info on this.
New argument ``chunkshape`` of chunked leaves
=============================================
It is now possible to specify the chunk shape for all the chunked leaves in
PyTables (all except ``Array``). With PyTables 1.x this value was
automatically calculated so as to achieve decent results in most
situations. However, users may want to specify their own chunk
shape based on their own needs (although this should be done only by advanced
users).
Of course, if this parameter is not specified, a sensible default is
calculated based on the size of the leaf (which is the recommended option).
A new attribute called ``chunkshape`` has been added to all leaves. It is
read-only (you can't change the size of chunks once you have created a leaf),
but it can be useful for inspection by advanced users.
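When inspecting a chunk shape, it can help to translate it into a size in
bytes. The following is a minimal sketch under stated assumptions:
``chunk_nbytes`` is a hypothetical helper, not a PyTables function. In
practice the inputs would come from something like ``leaf.chunkshape`` and,
for ``*Array`` objects with scalar atoms, ``leaf.atom.itemsize``. The values
used here are made up for illustration.

```python
# Hypothetical helper (not part of PyTables): estimate the size in bytes of
# one chunk from a chunk shape and the size in bytes of each element.

def chunk_nbytes(chunkshape, itemsize):
    """Return the number of elements in the chunk multiplied by itemsize."""
    nelements = 1
    for dim in chunkshape:
        nelements *= dim
    return nelements * itemsize


# Example values chosen for illustration only: a (128, 64) chunk of
# 8-byte items (for instance, 'float64' elements).
print(chunk_nbytes((128, 64), 8))  # 65536
```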
New flavor specification
========================
As of 2.x, flavors can *only* be set through the ``flavor`` attribute of
leaves, and they are *persistent*, so changing a flavor requires that the file
be writable.
Flavors can no longer be set through ``File.create*()`` methods, nor the
``flavor`` argument previously found in some ``Table`` methods, nor through
``Atom`` constructors or the ``_v_flavor`` attribute of descriptions.
System attributes can be deleted now
====================================
The protection against removing system attributes (like ``FILTERS``,
``FLAVOR`` or ``CLASS``, to name only a few) has been completely removed. It
is now the responsibility of the user to make proper use of this freedom.
With this, users can get rid of all proprietary PyTables attributes if they
want to (for example, to make a file look more like a native HDF5 one).
Byteorder issues
================
Now, all the data coming from reads and internal buffers is always converted
on-the-fly, if needed, to the *native* byteorder. This represents a big
advantage in terms of speed when operating with objects coming from files that
were created on machines whose byte ordering differs from the native one.
Besides, all leaf constructors have received a new ``byteorder`` parameter
that allows specifying the byteorder of data on disk. In particular, a
``_v_byteorder`` entry in a Table description is no longer honored and you
should use the aforementioned ``byteorder`` parameter.
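To make the notion of byte order concrete, the following sketch uses only the
standard ``struct`` and ``sys`` modules. It is an illustration of what reading
data in the wrong byte order looks like, not a description of how PyTables
performs the conversion internally.

```python
import struct
import sys

# Illustration only (this is not PyTables code): the same integer stored
# with the two possible byte orders.
value = 1
big_endian_bytes = struct.pack('>i', value)     # as written on a big-endian machine
little_endian_bytes = struct.pack('<i', value)  # as written on a little-endian machine

# Interpreting the bytes with the byte order they were written in always
# recovers the original value, whatever the native order of this machine is.
assert struct.unpack('>i', big_endian_bytes)[0] == 1
assert struct.unpack('<i', little_endian_bytes)[0] == 1

# Interpreting big-endian bytes as little-endian gives a different number.
misread = struct.unpack('<i', big_endian_bytes)[0]

print(sys.byteorder)  # 'little' or 'big', depending on the machine
print(misread)        # 16777216
```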
Tunable internal buffer sizes
=============================
You can change the size of the internal buffers for I/O purposes of PyTables
by changing the value of the new public attribute ``nrowsinbuf`` that is
present in all leaves. By default, this contains a sensible value so as to
achieve a good balance between speed and memory consumption. Change it with
care, since a poor choice can lead to unwanted results (very slow I/O, huge
memory consumption...).
Changes to module names
=======================
If your application is directly accessing modules under the ``tables``
package, you need to know that *the names of all modules are now all in
lowercase*. This allows one to tell apart the ``tables.Array`` *class* from
the ``tables.array`` *module* (which was also called ``tables.Array`` before).
This includes subpackages like ``tables.nodes.FileNode``.
On top of that, more-or-less independent modules have also been renamed and
some of them grouped into subpackages. The most important are:
- The ``tables.netcdf3`` subpackage replaces the old ``tables.NetCDF`` module.
- The ``tables.nra`` subpackage replaces the old ``nestedrecords.py`` with the
implementation of the ``NestedRecArray`` class.
Also, the ``tables.misc`` package includes utility modules which do not depend
on PyTables.
Other changes
=============
- ``Filters.complib`` is ``None`` for filter properties created with
``complevel=0`` (i.e. disabled compression, which is the default).
- 'non-relevant' --> 'irrelevant' (applied to byteorders)
- ``Table.colstypes`` --> ``Table.coltypes``
- ``Table.coltypes`` --> ``Table.coldtypes``
- Added ``Table.coldescr``, dictionary of the ``Col`` descriptions.
- ``Table.colshapes`` has disappeared. You can get it this way::

      colshapes = dict( (name, col.shape)
                        for (name, col) in table.coldescr.items() )

- ``Table.colitemsizes`` has disappeared. You can get it this way::

      colitemsizes = dict( (name, col.itemsize)
                           for (name, col) in table.coldescr.items() )
- ``Description._v_totalsize`` --> ``Description._v_itemsize``
- ``Description._v_itemsizes`` and ``Description._v_totalsizes`` have
disappeared.
- ``Leaf._v_chunksize`` --> ``Leaf.chunkshape``
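The one-to-one renames in the list above could be applied mechanically to
existing source code. The following is a minimal sketch, not a PyTables tool:
``rename_attributes`` is a hypothetical helper, it covers only the plain
renames listed above, and a purely textual rename is at best a first step that
should be reviewed by hand.

```python
import re

# Hypothetical helper (not part of PyTables): apply the attribute renames
# listed above to a piece of source code.
_RENAMES = {
    'colstypes': 'coltypes',
    'coltypes': 'coldtypes',
    '_v_totalsize': '_v_itemsize',
    '_v_chunksize': 'chunkshape',
}

# Match any old name used as an attribute (after a dot) as a whole word.
_PATTERN = re.compile(
    r'(?<=\.)(%s)\b' % '|'.join(re.escape(name) for name in _RENAMES))


def rename_attributes(source):
    """Rename 1.x attribute names in ``source`` in a single pass."""
    return _PATTERN.sub(lambda match: _RENAMES[match.group(1)], source)


old_code = "a = table.colstypes; b = table.coltypes; c = leaf._v_chunksize"
print(rename_attributes(old_code))
# a = table.coltypes; b = table.coldtypes; c = leaf.chunkshape
```

The single pass is deliberate. Because ``colstypes`` becomes ``coltypes`` and
``coltypes`` in turn becomes ``coldtypes``, applying the two replacements one
after the other in the wrong order would turn an old ``colstypes`` into
``coldtypes``.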
----
**Enjoy data!**
-- The PyTables Team
.. Local Variables:
.. mode: rst
.. coding: utf-8
.. fill-column: 78
.. End: