Major Bruker code rework #343
Comments
This sounds good and makes a lot of sense.
Maybe this should be removed, because its reasoning is possibly no longer valid: there is now more tooling to build C extensions and it is significantly easier.
While parts of the original rationale for this rule may no longer be valid, other parts remain valid, and, as @ericpre points out, there are new reasons to privilege pure Python code. For me, the main interest in requiring a pure Python version of the code is that it improves its long-term maintainability. It is too early to know its true impact, but if mojo delivers on its promises, the pure Python version of the code may be better able to stand the test of time. Would it be possible to implement the new version in pure Python, speed it up with numba, and, if memory management becomes a limitation, build the Cython version from the Python code?
Does numba recompile the code every time it is launched?
By default, it will compile at runtime, the first time the function is used. Caching can be used to avoid compiling at runtime: https://numba.readthedocs.io/en/stable/developer/caching.html
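For illustration, a minimal sketch of what opting into that on-disk cache looks like (assuming numba is installed; the function itself is just a stand-in, not code from this project):

```python
from numba import njit

# cache=True stores the compiled machine code on disk (next to the source),
# so later interpreter sessions reuse it; only the very first call pays the
# JIT compilation cost.
@njit(cache=True)
def total(values):
    s = 0.0
    for v in values:
        s += v
    return s
```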
Well, this escalated rather quickly and I have a definitive answer. First, this attempt is not "from scratch" but a rework, a tidying up, a code refactoring. The cython code to parse the embedded binary cubes encoded in packed Delphi Pascal arrays has been implemented for 8 years and has sat as the only cython extension of the library all this time; it is the code which has practically needed the least fixing. (Albeit I noticed a significant improvement in compiled binary size, down to ~200 kB from ~1 MB 8 years ago.) There is nothing new to create there, as it works perfectly and is a time-proven solution. The test suite already exercises both the cython and python implementations, as the same parsing function is accessible as a single python function; thus it is actually easy to test three cases: cython, pure python, and the python version wrapped with numba.
Let me show the numba failure:

```
def py_parse_hypermap(virtual_file, shape, dtype, downsample=1):
<source elided>
        buffer1 = next(iter_data)
        height, width = strct_unp("<ii", buffer1[:8])
        ^

This error may have been caused by the following argument(s):
- argument 0: Cannot determine Numba type of <class 'rsciio.bruker.sfslib.SFSTreeItem'>
```

cython has no problems using python objects, where numba falls flat on its face trying. It is all nice and shiny in the set-up examples... but throw real work at it and it gives up. SFSTreeItem feeds the bytestring buffer to the parser. Dumba is numb enough not to figure out that the final product of that class is fed into struct.unpack, which accepts only and only bytestrings - seems too hard to figure out. So much for miraculous intelligent tools...
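For context, this class of failure is easy to reproduce outside the project. A minimal sketch (the Reader class here is a hypothetical stand-in for SFSTreeItem, not the real rsciio object): numba's nopython mode must infer a native type for every argument, and an arbitrary Python object defeats that inference before the function body is even considered.

```python
import numba

class Reader:
    """Stand-in for an arbitrary Python object that feeds bytes to a parser."""
    def next_block(self):
        return b"\x00" * 8

@numba.njit
def parse(reader):
    # numba cannot infer a native type for 'Reader', so compilation fails
    return len(reader.next_block())

parse(Reader())
# raises numba.core.errors.TypingError:
#   Cannot determine Numba type of <class '__main__.Reader'>
```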
Getting back to the benchmark times:
One thing needs to be taken into consideration: those benchmark times also contain the time spent reading from disk, so it is not 5.8 vs 90 (which would give x15.5); about 2 seconds are spent in both cases reading from disk. Thus the competing implementations are 3.8 vs 88, which makes cython about 23 times faster.
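A quick sanity check of that correction, assuming the quoted ~2 s of disk I/O applies equally to both runs:

```python
io_time = 2.0                       # seconds spent reading from disk in either run (assumed)
cython_total, python_total = 5.8, 90.0

raw_ratio = python_total / cython_total                      # ~15.5x on the raw totals
parse_ratio = (python_total - io_time) / (cython_total - io_time)  # ~23.2x once I/O is excluded
print(round(raw_ratio, 1), round(parse_ratio, 1))
```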
I was thinking the same and I agree: the cython implementation has been very low maintenance since it was implemented (off the top of my head, there may be some deprecation warning which appeared with a recent version of cython) - less than numba, for example. In the context of this refactor, it makes sense to keep it, but at the same time we should also keep the pure python implementation!
Major Bruker Code refactoring and expansion
While it is preferred to focus issues and PRs on single features and incremental additions, I find that the Bruker part of the code has become overgrown with confusing cruft. At this point, making small bug fixes (e.g. #326) will only grow the unnecessary complexity of the code. Thus I think this is the right moment to lean in and do a major refactoring and clean-up of the Bruker code. It should allow bringing in new features much more easily and help others with troubleshooting and contributing new features.
General Plan
SFS
- move the SFS-related code out of _api.py into sfslib.py (naming allusion to zlib; alternative naming considerations: libsfs, glibsfs (rather not, if we change license from GPL), libresfs, minisfs (as there is limited functionality and only reading of files)). sfslib should not be private, as the code should be easily accessible for other Bruker file inspection and development work.
- get rid of any mention of "chunk/-s"; replace it in docstrings and variable names with "block/-s", as the initially used terminology brings in unnecessary confusion. SFS blocks are file system blocks, whereas dask chunks are an organisation of arrays.
- tidy up naming, optimize
- add a content random-access method and a pre-parser (a light parser which jumps through all zlib-deflated block headers (which needs to be done sequentially) and makes a pointer table of the zlib blocks); see the sketch after this list.
- setuptools allows making a console script absolutely easily - develop an unsfs script (an unsfs function in sfslib.py) which will extract all files from sfs-type files (.bcf, .pan) into a default-created or provided directory; a packaging sketch also follows this list. An easy command-line tool takes away all need for bizarre explanations of how to extract files using SFSReader. Easing this up will ease testing, as we will no longer need to care about too-big test files – we will extract the header from any provided bcf file and test only the fraction of code in charge of its parsing on a small XML file (the header). It will also make inspecting other kinds of Bruker files easier.
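To make the pre-parser / pointer-table idea concrete, here is a minimal sketch. The block-header layout used here (a 4-byte little-endian compressed size before each payload) is made up for illustration and is not the real SFS on-disk format; the point is only the pattern: one unavoidable sequential pass builds a table of offsets, after which any block can be reached directly.

```python
import zlib

def build_block_table(fh, first_offset, n_blocks):
    """Single sequential pass over the compressed blocks, recording where each
    payload starts so later reads can seek straight to the block of interest
    instead of inflating everything from the beginning."""
    table = []                     # entries: (payload_offset, compressed_size)
    offset = first_offset
    for _ in range(n_blocks):
        fh.seek(offset)
        comp_size = int.from_bytes(fh.read(4), "little")  # hypothetical header field
        table.append((offset + 4, comp_size))
        offset += 4 + comp_size
    return table

def read_block(fh, table, index):
    """Random access: jump to one block via the pointer table and inflate it."""
    payload_offset, comp_size = table[index]
    fh.seek(payload_offset)
    return zlib.decompress(fh.read(comp_size))
```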
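As for the console script, a setuptools entry point is essentially a one-liner. The unsfs name and the sfslib module path are the ones proposed above, not existing code, and the project may declare this in pyproject.toml rather than setup.py - this is only a sketch of the mechanism:

```python
from setuptools import setup

setup(
    name="rosettasciio",
    # ... rest of the packaging metadata ...
    entry_points={
        # installs an `unsfs` executable that calls rsciio.bruker.sfslib.unsfs()
        "console_scripts": ["unsfs = rsciio.bruker.sfslib:unsfs"],
    },
)
```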
Bruker BCF parsers
- drop guess_hv; replace it with a function guess_tem_sem (moved to rsciio.utils.tools) which would guess not by HV, but by the EDS detector elevation angle (this could later be applied to other formats if exspy is impossible to make drop the unnecessary EDS_SEM/EDS_TEM subdivision).
- gather the parsed entries into the original_metadata dict, and use a unified mapping to map existing entries to the hyperspy-expected metadata dict; a sketch of such a mapping follows this list. The mapping will allow skipping explicit if/else checking; currently there is a chain of functions generating bits of metadata.
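A minimal sketch of the mapping idea (the key paths and helper below are made up for illustration, not the actual rsciio metadata layout): one table of source-path to destination-path pairs replaces a chain of if/else metadata builders.

```python
# (key path in original_metadata) -> (key path in the hyperspy-style metadata dict)
METADATA_MAP = {
    ("Microscope", "HV"): ("Acquisition_instrument", "TEM", "beam_energy"),
    ("DSP", "RealTime"): ("Acquisition_instrument", "TEM", "Detector", "EDS", "real_time"),
}

def apply_mapping(original_metadata, mapping):
    metadata = {}
    for src_path, dst_path in mapping.items():
        node = original_metadata
        try:
            for key in src_path:             # walk the source path
                node = node[key]
        except (KeyError, TypeError):
            continue                         # entry absent: nothing to map, no if/else needed
        target = metadata
        for key in dst_path[:-1]:            # create the destination nesting on demand
            target = target.setdefault(key, {})
        target[dst_path[-1]] = node
    return metadata
```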
Then there is a question:
The documentation states that numba should be used where fast code is required, and that cython should be used only if a python version is implemented alongside. To be honest, in the last 8 years I have not come across a situation where the cython extension (which, btw, is distributed as C code) was problematic to install on different computers and toaster-like machines. I had no issue getting it working even on a linux subsystem on an Android tablet!
I would happily get rid of the alternative slow python functions, or would move them to an external file to unclutter _api.py. Also, I would want the lazy random-access implementation to use only cython.
Why won't I write these in pure python and use numba decorators? Because numba's memory management is harder to control, especially in situations where the data is larger than memory - and this implementation will address exactly that situation. For memory consumption cython is predictable; numba is less so.