Thoughts about improving Tick object creation time #542

ghill2 · 2022-01-07T00:46:37Z

ghill2
Jan 7, 2022
Collaborator

As I work on checking my strategy rules, it takes a considerable amount of time to initialize Tick objects before adding them to the engine. (29,168,849 items, 1 yr of data)

I understand initializing 29,168,849 items is not a light task and should take some time. However, what are your thoughts on how to speed up the Tick object creation, such as bypassing expensive Python function calls or creating the objects on the C level in separate threads?

I am aware of the catalog's streaming functionality, but I'm trying to avoid adding time to the backtest.

Thankyou
George

limx0 · 2022-01-07T01:22:00Z

limx0
Jan 7, 2022
Collaborator

Hey @ghill2 how are you creating the Ticks? Are you loading from the catalog or parquet? Or are you parsing them directly from some raw data (csv etc).

There's a couple of things we can do here, but I'm keen to hear a little more about how you're loading the data if you don't mind sharing?

Also how long are we actually talking?

0 replies

ghill2 · 2022-01-08T00:28:40Z

ghill2
Jan 8, 2022
Collaborator Author

Apologies for the lack of detail in my inital post!
I'm just loading from feather files into a dataframe. The creation of the Tick objects takes 416secs on my machine running version 1.136.0

from nautilus_trader.model.data.bar import BarSpecification
from nautilus_trader.model.enums import VenueType, AccountType, OMSType, BarAggregation, AggregationSource
from nautilus_trader.model.enums import PriceType
from nautilus_trader.backtest.data.providers import TestInstrumentProvider
from nautilus_trader.model.identifiers import Venue
from nautilus_trader.backtest.data.wranglers import QuoteTickDataWrangler
from time import perf_counter

import pandas as pd

start_date = pd.Timestamp('2019-01-01', tz='UTC').to_pydatetime()
end_date =  pd.Timestamp('2020-01-01', tz='UTC').to_pydatetime()
instrument = TestInstrumentProvider.default_fx_ccy("EUR/USD", venue=Venue("SIM"))

ticks_df = pd.read_feather('/data/EURUSD-2019-T1.feather')
print(ticks_df.dtypes)
ticks_df.set_index('date',inplace=True)
print(ticks_df)

wrangler = QuoteTickDataWrangler(instrument=instrument)

start = perf_counter()
tick_objs = wrangler.process(ticks_df)
stop = perf_counter()
print(f"Elapsed time {stop-start} secs")

date datetime64[ns]
bid float64
ask float64
volume int64
dtype: object
bid ask volume
date
2019-01-01 22:02:37.254 1.14598 1.14682 4
2019-01-01 22:02:38.590 1.14599 1.14682 2
2019-01-01 22:02:39.138 1.14599 1.14684 4
2019-01-01 22:02:55.787 1.14598 1.14684 4
2019-01-01 22:03:02.060 1.14598 1.14684 4
... ... ... ...
2019-12-31 21:59:33.224 1.12093 1.12154 1
2019-12-31 21:59:43.025 1.12098 1.12154 1
2019-12-31 21:59:56.279 1.12088 1.12154 1
2019-12-31 21:59:57.829 1.12076 1.12154 1
2019-12-31 22:00:00.120 1.12072 1.12154 1

[29168849 rows x 3 columns]
Elapsed time 416.1686198259995 secs

0 replies

limx0 · 2022-01-08T04:07:02Z

limx0
Jan 8, 2022
Collaborator

Hey @ghill2 - thanks so much for the detailed example! I'll grab some similar data and do some performance testing.

I think saving as parquet will provide some speed ups but we need to spend a little time looking at this - give us a couple of days and we'll come back with some more details

0 replies

squire-of-milverton · 2022-01-21T20:17:14Z

squire-of-milverton
Jan 21, 2022

Also interested in any solutions or ideas for this

0 replies

limx0 · 2022-01-23T10:09:57Z

limx0
Jan 23, 2022
Collaborator

I had a bit of a look at this, but theres nothing obvious here other than creating objects (classes) is pretty slow in python (even though we're actually doing this in cython).

You might see some speed up doing some caching using pickle and reusing the objects (though in my tests it didn't seem to be a huge improvement).

The next steps here would be refactoring the class creation and using some more cython tricks (memory pools or other optimisations), or storing this data in parquet, streaming it up via the cython parquet api in a background process. This will still involve pickling the objects - which I think will still be a pretty big blocker.

@cjdsellers may have some more thoughts on this

0 replies

cjdsellers · 2022-01-23T11:44:16Z

cjdsellers
Jan 23, 2022
Maintainer

@limx0 has given a good summary on where we are at here.

I think we would see the best improvements leveraging a lower level parquet API, and possibly bypassing the standard Python class initialization. I'm currently investigating re-writing some of the core objects using the PyO3 Rust bindings for Python. I don't have any hard figures that this would be any faster than Cython however. Anecdotally others have seen a 3x speedup of PyO3 over Cython, however it all depends on what is being measured and compared too.

Another idea is chunking and parallelizing the object creation using some multiprocessing.

So in any case, @limx0 and I are looking at the parquet Cython API nearer term, which could pay off performance wise for object creation. More medium term expect to see some Rust coming into the codebase.

0 replies

yohplala · 2022-01-23T17:18:35Z

yohplala
Jan 23, 2022

Hi @cjdsellers and @limx0 ,

For my information, once this list of QuoteTick objects is created, how / where is it used?
If each quote is a single object, I am understanding that subsequent process does not make use of vectorized functions. Is this the correct understanding?

Thanks for your help,
Bests,

4 replies

cjdsellers Jan 23, 2022
Maintainer

Hi @yohplala

This is correct, since the platform is event-driven, the tick objects are encapsulating the data in a type safe way. For a backtest all data objects are organizied into a monotonic stream, and then fed through the main loop inside the backtest engine one at a time (in C).

This sort of processing is more realistic than vectorized methods (all of the core components are the same for live or backtest), however it is much slower performance wise than broadcasting over a pandas DataFrame etc. Many things simply can't be simulated with vectorized methods though, especially for more complicated strategies.

yohplala Jan 23, 2022

Thanks @cjdsellers!

Many things simply can't be simulated with vectorized methods though, especially for more complicated strategies.

Possibly. I have been successful at doing very complex things with numba's @guvectorize but I am no expert in this domain, and maybe @guvectorize is pseudo-vectorization. (yet it is lightspeed made easy to use :))

But what I can notice here is that doing list(map()) to create the object list books the RAM incrementally instead of all at once would we create a numpy array. This can also contribute to slow down things.

An idea, perhaps not good.

at this step, why not create a 'meta' object, which would be the dataframe itself (i.e. the collection of 'yet to be' QuoteTick instances), only checking columns (that all components are provided, with correct dtypes and so on)? Its creation would be transparent, it already exists.
it could have special accessor, and most notably when used in Nautilus core components, it would yield (and at this time materialize) the individual QuoteTick instances? I am aware it pushes the trouble only few steps later, but memory wise, we could keep the memory space taken by one of the instance at an iteration, and re-use it for the next object instance at next iteration (each instance take same space, 7 int64/float64). This way, we could save:
- memory (only one instance of QuoteTick is existing at a time, not the full list - we still have the full dataframe, so memory saving is reduced, I agree - but combined with Dask to load data as it is needed from parquet files, it can be further improved)
- and also we would save some time for booking the memory.
  A parameter would be needed in the generator/accessor function to keep the pointer to this memory buffer.

The impact in the main loop would be instead of iterating the list, use this generator (a method that is actually an accessor to the original dataframe, and that yields one row at a time as a QuoteTick object)

Again, I am no expert, it is easy to say all sort of things, and is a completely different world to actually implement them.
Only proposing a vague idea.

yohplala Jan 23, 2022

Super naive idea :):

is a list actually needed?
map is a generator. Could it be forwarded to the main loop without being touched?
(memory still need to be booked for each instance, but the time seen @ghill2 would be spread over the backtesting time)

ghill2 Jan 23, 2022
Collaborator Author

Thanks for the input @yohplala
I'm aware that initializing the entire list of ticks is not actually needed as the system can process a test in batches, but even with batch processing it just spreads the tick initialization time throughout the test. (like you mentioned)

cjdsellers · 2022-01-23T19:58:45Z

cjdsellers
Jan 23, 2022
Maintainer

These are all good thoughts.

Actually the initial versions of the platform generated the data objects 'on the fly' similar to your suggestions.

Something to bear in mind is that right now the design is very simple, all data objects are built up front and then sorted, its made development and debugging much easier. Delaying the instantiations and sorting 'on the fly' could introduce a host of bugs, in fact we've seen some previously with this method.

The problem with the idea of a single quote tick instance is there can be thousands of quote ticks existing in a running system, queued up in caches, indicators, bar builders etc, so we can't simply hold one reference and update the backing struct in C. However this is possibly a good path, to hold only the number of total quote ticks the system needs at any one time, and recycle the objects by simply reassigning the backing struct members (this would have to be coded extremely carefully though, and departs from the simple design mentioned above). Object pools and recycling are a common pattern in high performance OO systems, so this deserves further thought and investigation. Now that the platform is becoming more stable and mature, introducing complexity for the trade-off of performance gains could start to be worth it.

Your thoughts on list vs map are interesting, however we need to know up front all of the timestamps so we can sort the data stream in order (the stream can and often does include any class which inherits Data which includes a ts_init which is used to sort on). Actually those data wranglers were put together just to get something going, I thought at a later time I would come up with something more performant however here we are.

If you're able to run any performance tests and show improvements, then we're more than willing to merge any improvements you might be able to make at the wrangler level.

1 reply

yohplala Jan 23, 2022

Thanks for the detailed answer @cjdsellers , I appreciate it very much. I am always attracted by performance challenge, but I will unfortunately not delve into that one in the short term. Yet, I am keeping it in mind! :)
Thanks again!

ghill2 · 2022-01-23T21:23:48Z

ghill2
Jan 23, 2022
Collaborator Author

I really appreciate the input on this @cjdsellers @limx0 @yohplala

I agree about the slow Python object initialization. My test showed that 5.5 min (78%) of the total test time is calling the Python Decimal object init method (_value attribute of BaseDecimal class).

Will using the low level pyarrow api avoid this?

Also, what are your thoughts about initializing the Python decimal objects in C or C++ instead? Would this approach:

Avoid calling the expensive init method of the Python Decimal class
Avoid the GIL to parallelize the Decimal object initialization over threads not processes to avoid serialization/deserialization

I understand this approach doesn't avoid every init method but from my understand the only way to avoid all of them would be (like you suggested) to re-code the Quantity, Price and Base in Rust so they can initialized using faster calls and with parallelization.

1 reply

limx0 Jan 23, 2022
Collaborator

Thank you @ghill2 for your input also! Always much appreciated.

I think this is something to be seriously considered - given it takes up such a large portion of the time and these are super low level objects with a tiny surface area, it makes sense to me (rust or c/c++)

cjdsellers · 2022-01-23T23:22:43Z

cjdsellers
Jan 23, 2022
Maintainer

My current plan is to replace the BaseDecimal value objects with a Rust structs which use a fixed precision int64 as the _value.

Then PyO3 can expose this to Python (keeping the same API), and we avoid the overhead of that decimal.Decimal initialization.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Thoughts about improving Tick object creation time #542

{{title}}

Replies: 10 comments 6 replies

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

Select a reply

Thoughts about improving Tick object creation time #542

ghill2 Jan 7, 2022 Collaborator

Replies: 10 comments · 6 replies

limx0 Jan 7, 2022 Collaborator

ghill2 Jan 8, 2022 Collaborator Author

limx0 Jan 8, 2022 Collaborator

squire-of-milverton Jan 21, 2022

limx0 Jan 23, 2022 Collaborator

cjdsellers Jan 23, 2022 Maintainer

yohplala Jan 23, 2022

cjdsellers Jan 23, 2022 Maintainer

yohplala Jan 23, 2022

yohplala Jan 23, 2022

ghill2 Jan 23, 2022 Collaborator Author

cjdsellers Jan 23, 2022 Maintainer

yohplala Jan 23, 2022

ghill2 Jan 23, 2022 Collaborator Author

limx0 Jan 23, 2022 Collaborator

cjdsellers Jan 23, 2022 Maintainer

ghill2
Jan 7, 2022
Collaborator

Replies: 10 comments 6 replies

limx0
Jan 7, 2022
Collaborator

ghill2
Jan 8, 2022
Collaborator Author

limx0
Jan 8, 2022
Collaborator

squire-of-milverton
Jan 21, 2022

limx0
Jan 23, 2022
Collaborator

cjdsellers
Jan 23, 2022
Maintainer

yohplala
Jan 23, 2022

cjdsellers Jan 23, 2022
Maintainer

ghill2 Jan 23, 2022
Collaborator Author

cjdsellers
Jan 23, 2022
Maintainer

ghill2
Jan 23, 2022
Collaborator Author

limx0 Jan 23, 2022
Collaborator

cjdsellers
Jan 23, 2022
Maintainer