Skip to content

altaris/turbo-broccoli

Repository files navigation

TurboBroccoli 🥦

Repository PyPI License Code style hehe Documentation

JSON (de)serialization extensions, originally aimed at numpy and tensorflow objects, but now supporting a wide range of objects.

Installation

pip install turbo-broccoli

Usage

To/from string

import numpy as np
import turbo_broccoli as tb

obj = {"an_array": np.array([[1, 2], [3, 4]], dtype="float32")}
tb.to_json(obj)

produces the following string (modulo indentation and the value of $.an_array.data.data):

{
  "an_array": {
    "__type__": "numpy.ndarray",
    "__version__": 5,
    "data": {
      "__type__": "bytes",
      "__version__": 3,
      "data": "QAAAAAAAAAB7ImRhd..."
    }
  }
}

For deserialization, simply use

tb.from_json(json_string)

To/from file

Simply replace turbo_broccoli.to_json and turbo_broccoli.from_json with turbo_broccoli.save_json and turbo_broccoli.load_json:

import numpy as np
import turbo_broccoli as tb

obj = {"an_array": np.array([[1, 2], [3, 4]], dtype="float32")}
tb.save_json(obj, "foo/bar/foobar.json")

...

obj = tb.load_json("foo/bar/foobar.json")

It is also possible to read/write compressed (with zlib) JSON files:

tb.save_json(obj, "foo/bar/foobar.json.gz")

...

obj = tb.load_json("foo/bar/foobar.json.gz")

The behaviour of turbo_broccoli.to_json and turbo_broccoli.from_json can be tweaked by using contexts. For example, to set a encryption/decryption key for secret types:

import nacl.secret
import nacl.utils
import turbo_broccoli as tb

key = nacl.utils.random(nacl.secret.SecretBox.KEY_SIZE)
ctx = tb.Context(nacl_shared_key=key)
obj = {"user": "alice", "password": tb.SecretStr("dolphin")}
doc = tb.to_json(obj, ctx)

...

obj = tb.from_json(doc, ctx)

The behaviour of turbo_broccoli.save_json and turbo_broccoli.load_json can be tweaked in a similar manner. For convenience, the argument of the context can be passed directly to the method instead of creating a context object manually:

import nacl.secret
import nacl.utils
import turbo_broccoli as tb

key = nacl.utils.random(nacl.secret.SecretBox.KEY_SIZE)
obj = {"user": "alice", "password": tb.SecretStr("dolphin")}
tb.save_json(obj, "foo/bar/foobar.json", nacl_shared_key=key)

See the documentation.

A turbo_broccoli.GuardedBlockHandler "guards" a block of code, meaning it prevents it from being executed if it has been in the past. Check out the documentation for some examples.

A mix of joblib.Parallel and turbo_broccoli.GuardedBlockHandler: a turbo_broccoli.Parallel object can be used to execute jobs in parallel, but those whose results have already been obtained in the past are skipped. See the documentation for some examples.

Custom encoders/decoders

You can register you own custom encoders and decoders using turbo_broccoli.register_encoder and turbo_broccoli.register_decoder:

import turbo_broccoli as tb

class MyClass:
    a: int
    b: np.ndarray
    c: np.ndarray
    def __init__(self, a: int, b: np.ndarray):
        self.a, self.b = a, b
        self.c = a + b

def encoder_c(obj: MyClass, ctx: tb.Context) -> dict:
    # If you register a decoder, you must include the key "__type__" and it
    # must have value "user.<name_of_type>"
    #       ↓
    return {"__type__": "user.MyClass", "a": obj.a, "b": obj.b}

def decoder_c(obj: dict, ctx: tb.Context) -> MyClass:
    return MyClass(obj["a"], obj["b"])

tb.register_encoder(encoder_c, "MyClass")
tb.register_decoder(decoder_c, "MyClass")

An encoder (for MyClass) is a function that takes two arguments: an object of type MyClass and a turbo_broccoli.Context, and returns a dict. That dict must contain objects that can be further serialized using TurboBroccoli (which includes all supported types and any other type for which you registered an encoder). The return dict needs not be flat.

If you register a decoder for MyClass (as in the example above), the dict must contain the key/value "__type__": "user.MyClass".

A decoder (for MyClass) is a function that takes two arguments: a dict and a turbo_broccoli.Context, and returns an object of type MyClass. The dict's values have already been deserialized.

Artifacts

If an object inside obj is too large to be embedded inside the JSON file (e.g. a large numpy array), then an artifact file is created:

import numpy as np
import turbo_broccoli as tb

obj = {"an_array": np.random.rand(1000, 1000)}
tb.save_json(obj, "foo/bar/foobar.json")

produces the JSON file

{
  "an_array": {
    "__type__": "numpy.ndarray",
    "__version__": 5,
    "data": {
      "__type__": "bytes",
      "__version__": 3,
      "id": "1e6dff28-5e26-44df-9e7a-75bc726ce9aa"
    }
  }
}

and a file foo/bar/foobar.1e6dff28-5e26-44df-9e7a-75bc726ce9aa.tb containing the array data. The artifact directory can be explicitely specified by setting it in the serialization context or by setting the TB_ARTIFACT_PATH environment variable (see below.). The code for loading the JSON file does not change:

obj = tb.load_json("foo/bar/foobar.json")

If using turbo_broccoli.to_json, since there is no output file path specified, the artifacts are storied in a temporary directory instead:

import numpy as np
import turbo_broccoli as tb

obj = {"an_array": np.random.rand(1000, 1000)}
doc = tb.to_json(obj)
# An artifact has been created somewhere in e.g. /tmp

Since no information about this directory is stored in the output JSON string, it is not possible to load doc using turbo_broccoli.from_json. If deserialization is necessary, instantiate a context:

import numpy as np
import turbo_broccoli as tb

ctx = tb.Context()
obj = {"an_array": np.random.rand(1000, 1000)}
doc = tb.to_json(obj, ctx)
# An artifact has been created in ctx.artifact_path

...

obj = tb.from_json(doc, ctx)

Supported types

Basic types

Generic objects

Serialization only. A generic object is an object that has the __turbo_broccoli__ attribute. This attribute is expected to be a list of attributes whose values will be serialized. For example,

class C:
    __turbo_broccoli__ = ["a", "b"]
    a: int
    b: int
    c: int

x = C()
x.a, x.b, x.c = 42, 43, 44
tb.to_json(x)

produces the following string:

{"a": 42,"b": 43}

Registered attributes can of course have any type supported by TurboBroccoli, such as numpy arrays. Registered attributes can be @property methods.

numpy.number, numpy.ndarray with numerical dtype, and numpy.dtype.

pandas.DataFrame and pandas.Series, but with the following limitations:

  • the following dtypes are not supported: complex, object, timedelta;

  • the column / series names cannot be ints or int-strings: the following are not acceptable

    df = pd.DataFrame([[1, 2], [3, 4]])
    df = pd.DataFrame([[1, 2], [3, 4]], columns=["0", "1"])

tensorflow.Tensor with numerical dtype, but not tensorflow.RaggedTensor.

  • torch.Tensor, Warning: loaded tensors are automatically placed on the CPU and gradients are lost;

  • torch.nn.Module, don't forget to register your module type using a turbo_broccoli.Context:

    # Serialization
    class MyModule(torch.nn.Module):
      ...
    
    module = MyModule()  # Must be instantiable without arguments
    doc = tb.to_json({"module": module})
    
    # Deserialization
    ctx = tb.Context(pytorch_module_types=[MyModule])
    module = tb.from_json(doc, ctx)

    Warning: It is not possible to register and deserialize standard pytorch module containers directly. Wrap them in your own custom module class. For following is not acceptable

    import turbo_broccoli as tb
    import torch
    
    module = torch.nn.Sequential(
        torch.nn.Linear(4, 2),
        torch.nn.ReLU(),
        torch.nn.Linear(2, 1),
        torch.nn.ReLU(),
    )
    obj = {"module": module}
    doc = tb.to_json(obj)  # works, but...
    tb.from_json(a, ctx)  # does't work

    but the following works:

    class MyModule(torch.nn.Module):
      module: torch.nn.Sequential  # Wrapped sequential
    
      def __init__(self):
          super().__init__()
          self.module = torch.nn.Sequential(
              torch.nn.Linear(4, 2),
              torch.nn.ReLU(),
              torch.nn.Linear(2, 1),
              torch.nn.ReLU(),
          )
    
      ...
    
    module = MyModule()  # Must be instantiable without arguments
    doc = tb.to_json({"module": module})
    
    ctx = tb.Context(pytorch_module_types=[MyModule])
    module = tb.from_json(doc, ctx)

    To circumvent all these limitations, use custom encoders / decoders.

  • torch.utils.data.ConcatDataset, torch.utils.data.StackDataset, torch.utils.data.Subset, torch.utils.data.TensorDataset, as long as the nested structure of datasets ultimately lead to torch.utils.data.TensorDatasets (e.g. a subset of a stack of subsets of tensor datasets is supported)

Just scipy.sparse.csr_matrix. ^^"

sklearn estimators (i.e. that inherit from sklean.base.BaseEstimator). Supported estimators are: AdaBoostClassifier, AdaBoostRegressor, AdditiveChi2Sampler, AffinityPropagation, AgglomerativeClustering, ARDRegression, BayesianGaussianMixture, BayesianRidge, BernoulliNB, BernoulliRBM, Binarizer, CategoricalNB, CCA, ClassifierChain, ComplementNB, DBSCAN, DecisionTreeClassifier, DecisionTreeRegressor, DictionaryLearning, ElasticNet, EllipticEnvelope, EmpiricalCovariance, ExtraTreeClassifier, ExtraTreeRegressor, ExtraTreesClassifier, ExtraTreesRegressor, FactorAnalysis, FeatureUnion, GaussianMixture, GaussianNB, GaussianRandomProjection, GraphicalLasso, HuberRegressor, IncrementalPCA, IsolationForest, Isomap, KernelCenterer, KernelDensity, KernelPCA, KernelRidge, KMeans, KNeighborsClassifier, KNeighborsRegressor, KNNImputer, LabelBinarizer, LabelEncoder, LabelPropagation, LabelSpreading, Lars, Lasso, LassoLars, LassoLarsIC, LatentDirichletAllocation, LedoitWolf, LinearDiscriminantAnalysis, LinearRegression, LinearSVC, LinearSVR, LocallyLinearEmbedding, LocalOutlierFactor, LogisticRegression, MaxAbsScaler, MDS, MeanShift, MinCovDet, MiniBatchDictionaryLearning, MiniBatchKMeans, MiniBatchSparsePCA, MinMaxScaler, MissingIndicator, MLPClassifier, MLPRegressor, MultiLabelBinarizer, MultinomialNB, MultiOutputClassifier, MultiOutputRegressor, MultiTaskElasticNet, MultiTaskLasso, NearestCentroid, NearestNeighbors, NeighborhoodComponentsAnalysis, NMF, Normalizer, NuSVC, NuSVR, Nystroem, OAS, OneClassSVM, OneVsOneClassifier, OneVsRestClassifier, OPTICS, OrthogonalMatchingPursuit, PassiveAggressiveRegressor, PCA, Pipeline, PLSCanonical, PLSRegression, PLSSVD, PolynomialCountSketch, PolynomialFeatures, PowerTransformer, QuadraticDiscriminantAnalysis, QuantileRegressor, QuantileTransformer, RadiusNeighborsClassifier, RadiusNeighborsRegressor, RandomForestClassifier, RandomForestRegressor, RANSACRegressor, RBFSampler, RegressorChain, RFE, RFECV, Ridge, RidgeClassifier, RobustScaler, SelectFromModel, SelfTrainingClassifier, SGDRegressor, ShrunkCovariance, SimpleImputer, SkewedChi2Sampler, SparsePCA, SparseRandomProjection, SpectralBiclustering, SpectralClustering, SpectralCoclustering, SpectralEmbedding, StackingClassifier, StackingRegressor, StandardScaler, SVC, SVC, SVR, SVR, TheilSenRegressor, TruncatedSVD, TSNE, VarianceThreshold, VotingClassifier, VotingRegressor.

Doesn't work with:

  • All CV classes because the score_ attribute is a dict indexed with np.int64, which json.JSONEncoder._iterencode_dict rejects.

  • Everything that is parametrized by an arbitrary object/callable/estimator: FunctionTransformer, TransformedTargetRegressor.

  • Other classes that have non JSON-serializable attributes:

    Class Non-serializable attr.
    Birch _CFNode
    BisectingKMeans function
    ColumnTransformer slice
    GammaRegressor HalfGammaLoss
    GaussianProcessClassifier Product
    GaussianProcessRegressor Sum
    IsotonicRegression interp1d
    OutputCodeClassifier _ConstantPredictor
    Perceptron Hinge
    PoissonRegressor HalfPoissonLoss
    SGDClassifier Hinge
    SGDOneClassSVM Hinge
    SplineTransformer BSpline
    TweedieRegressor HalfTweedieLossIdentity
  • Other errors:

    • FastICA: I'm not sure why...

    • BaggingClassifier: IndexError: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices.

    • GradientBoostingClassifier, GradientBoostingRegressor, RandomTreesEmbedding, KBinsDiscretizer: Exception: dtype object is not covered.

    • HistGradientBoostingClassifier: Problems with deserialization of _BinMapper object?

    • PassiveAggressiveClassifier: some unknown label type error...

    • SequentialFeatureSelector: Problem with the unit test itself ^^"

    • KNeighborsTransformer: A serialized-deserialized instance seems to fit_transform an array to a sparse matrix whereas the original object returns an array?

    • RadiusNeighborsTransformer: Inverse problem from KNeighborsTransformer.

Bokeh figures and models.

All NetworkX graph objects.

Basic Python types can be wrapped in their corresponding secret type according to the following table

Python type Secret type
dict turbo_broccoli.SecretDict
float turbo_broccoli.SecretFloat
int turbo_broccoli.SecretInt
list turbo_broccoli.SecretList
str turbo_broccoli.SecretStr

The secret value can be recovered with the get_secret_value method. At serialization, the this value will be encrypted. For example,

## See https://pynacl.readthedocs.io/en/latest/secret/#key
import nacl.secret
import nacl.utils
import turbo_broccoli as tb

key = nacl.utils.random(nacl.secret.SecretBox.KEY_SIZE)
ctx = tb.Context(nacl_shared_key=key)
obj = {"user": "alice", "password": tb.SecretStr("dolphin")}
tb.to_json(obj, ctx)

produces the following string (modulo indentation and modulo the encrypted content):

{
  "user": "alice",
  "password": {
    "__type__": "secret",
    "__version__": 2,
    "data": {
      "__type__": "bytes",
      "__version__": 3,
      "data": "gbRXF3hq9Q9hIQ9Xz+WdGKYP5meJ4eTmlFt0r0Ov3PV64065plk6RqsFUcynSOqHzA=="
    }
  }
}

Deserialization decrypts the secrets, but they stay wrapped inside the secret types above. If the wrong key is provided, an exception is raised. If no key is provided, the secret values are replaced by a turbo_broccoli.LockedSecret. Internally, TurboBroccoli uses pynacl's SecretBox. Warning: In the case of SecretDict and SecretList, the values contained within must be JSON-serializable without TurboBroccoli. The following is not acceptable:

import nacl.secret
import nacl.utils
import numpy as np
import turbo_broccoli as tb

key = nacl.utils.random(nacl.secret.SecretBox.KEY_SIZE)
ctx = tb.Context(nacl_shared_key=key)
obj = {"data": tb.SecretList([np.array([1, 2, 3])])}
tb.to_json(obj, ctx)

See also the TB_SHARED_KEY environment variable below.

Embedded dict/lists

Sometimes, it may be useful to store part of a document in its own file and have referrenced in the main file. This is possible using EmbeddedDict and EmbeddedList. For example,

from turbo_broccoli import save_json, EmbeddedDict

data = {"a": 1, "b": EmbeddedDict({"c": 2, "d": 3})}
save_json(data, "data.json")

will result in a data.json file containing

{
  "a": 1,
  "b": {
    "__type__": "embedded.dict",
    "__version__": 1,
    "id": "4ea0b3f3-f3e4-42bd-9db9-1e4e0b9f4fae"
  }
}

(modulo indentation and the id), and an artefact file data.4ea0b3f3-f3e4-42bd-9db9-1e4e0b9f4fae.json containing

{"c": 2, "d": 3}

 External data

If you are serializing/deserializing from a file, you can use turbo_broccoli.ExternalData to point to data contained in another file without integrating it to the current document.

For example, let's say you want to create foo/bar.json where the a key points to the data contained in foo/foooooo/data.np:

import turbo_broccoli as tb

document = {
  "a": tb.ExternalData("foooooo/data.np"),
  ...
}

# data.np is loaded when creating the ExternalData object
print(document["a"].data)

# Saving
tb.save_json(document, "foo/bar.json")

# Loading
document2 = tb.load_json("foo/bar.json")

from numpy.testing import assert_array_equal
assert_array_equal(document["a"].data, document2["a"].data)

Warnings:

  • document["a"].data is read only, the following will have no effect on foo/foooooo/data.np:

    document["a"].data += 1
    tb.save_json(document, "foo/bar.json")
  • When serializing/deserializing a ExternalData object, an actual JSON document must be involved. In particular, using tb.to_json or tb.from_json is not possible.

  • The external data file's path must be a subpath of the output/intput JSON file, and provided either relative to the output/intput JSON file, or in absolute form:

    # OK, relative
    document = {"a": tb.ExternalData("foooooo/data.np")}
    tb.save_json(document, "foo/bar.json")
    
    # OK, absolute
    document = {"a": tb.ExternalData("/home/alice/foo/foooooo/data.np")}
    tb.save_json(document, "foo/bar.json")
    
    # ERROR, not subpath
    document = {"a": tb.ExternalData("/home/alice/data.np")}
    tb.save_json(document, "/home/alice/foo/bar.json")

Environment variables

Some behaviors of TurboBroccoli can be tweaked by setting specific environment variables. If you want to modify these parameters programatically, do not do so by modifying os.environ. Rather, use a turbo_broccoli.Context.

  • TB_ARTIFACT_PATH (default: output JSON file's parent directory): During serialization, TurboBroccoli may create artifacts to which the JSON object will point to. The artifacts will be stored in TB_ARTIFACT_PATH if specified.

  • TB_KERAS_FORMAT (default: tf, valid values are keras, tf, and h5): The serialization format for keras models. If h5 or tf is used, an artifact following said format will be created in TB_ARTIFACT_PATH. If json is used, the model will be contained in the JSON document (anthough the weights may be in artifacts if they are too large).

  • TB_MAX_NBYTES (default: 8000): The maximum byte size of a python object beyond which serialization will produce an artifact instead of storing it in the JSON document. This does not limit the size of the overall JSON document though. 8000 bytes should be enough for a numpy array of 1000 float64s to be stored in-document.

  • TB_NODECODE (default: empty): Comma-separated list of types to not deserialize, for example bytes,numpy.ndarray. Excludable types are:

    • bokeh, bokeh.buffer, bokeh.generic,

    • bytes, Warning excluding bytes will also exclude bokeh, numpy.ndarray, pytorch.module, pytorch.tensor, secret, tensorflow.tensor,

    • collections, collections.deque, collections.namedtuple, collections.set,

    • dataclass, dataclass.<dataclass_name> (case sensitive),

    • datetime, datetime.datetime, datetime.time, datetime.timedelta,

    • dict (this only prevents decoding dicts with non-string keys),

    • embedded, embedded.dict, embedded.list,

    • external,

    • generic,

    • keras, keras.model, keras.layer, keras.loss, keras.metric, keras.optimizer,

    • networkx, networkx.graph,

    • numpy, numpy.ndarray, numpy.number, numpy.dtype, numpy.random_state,

    • pandas, pandas.dataframe, pandas.series, Warning: excluding pandas.dataframe will also exclude pandas.series,

    • pathlib, pathlib.path, Warning: excluding pathlib.path will also exclude external,

    • pytorch, pytorch.tensor, pytorch.module, pytorch.concatdataset, pytorch.stackdataset, pytorch.subset, pytorch.tensordataset

    • scipy, scipy.csr_matrix,

    • secret,

    • sklearn, sklearn.estimator, sklearn.estimator.<estimator name> (case sensitive, see the list of supported sklearn estimators below),

    • tensorflow, tensorflow.sparse_tensor, tensorflow.tensor, tensorflow.variable.

    • uuid

    • external

  • TB_SHARED_KEY (default: empty): Secret key used to encrypt/decrypt secrets. The encryption uses pynacl's SecretBox. An exception is raised when attempting to serialize a secret type while no key is set.

Contributing

Dependencies

  • python3.10 or newer;

  • requirements.txt for runtime dependencies;

  • requirements.dev.txt for development dependencies.

  • make (optional);

Simply run

virtualenv venv -p python3.10
. ./venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
pip install -r requirements.dev.txt

Documentation

Simply run

make docs

This will generate the HTML doc of the project, and the index file should be at docs/index.html. To have it directly in your browser, run

make docs-browser

Code quality

Don't forget to run

make

to format the code following black, typecheck it using mypy, and check it against coding standards using pylint.

Unit tests

Run

make test

to have pytest run the unit tests in tests/.

Credits

This project takes inspiration from Crimson-Crow/json-numpy.