
Major Bruker code rework #343

Open
sem-geologist opened this issue Dec 11, 2024 · 7 comments

@sem-geologist
Contributor

sem-geologist commented Dec 11, 2024

Major Bruker Code refactoring and expansion

While it is preferable to focus issues and PRs on single features and incremental additions, I find that the Bruker part of the code has become overgrown with confusing cruft. At this point, small bug fixes (e.g. #326) would only add unnecessary complexity to the code. I therefore think this is the right moment to do a major refactoring and clean-up of the Bruker code. It should make it much easier to bring in new features and should help others with troubleshooting and contributing.

General Plan

SFS
  • move all SFS code out of _api.py into sfslib.py (the name alludes to zlib; alternative naming candidates: libsfs, glibsfs (rather not if we change the license from GPL), libresfs, minisfs (since functionality is limited to reading files)). sfslib should not be private, as the code should be easily accessible for other Bruker file inspection and development work.
  • refactor SFS code:
    - get rid of any mention of "chunk/-s" in docstrings and variable names and replace it with "block/-s", as the initially used terminology causes unnecessary confusion: SFS blocks are file system blocks, whereas dask chunks are an organisation of arrays.
    - tidy up naming, optimize
  • expand SFS code:
    - add a random access method for content and a pre-parser (a light parser which jumps through all zlib-deflated block headers (this has to be done sequentially) and builds a pointer table of the zlib blocks)
    - setuptools makes it trivial to register console scripts – develop an unsfs script (an unsfs function in sfslib.py, as sketched below) which extracts all files from SFS-type containers (.bcf, .pan) into a default or user-provided directory. An easy command line tool removes the need for convoluted explanations of how to extract files with the SFS reader. It will also ease testing: we will no longer need to care about overly large test files, because we can extract the header from any provided bcf file and test the fraction of code responsible for parsing it on a small XML file (the header) only. It will also make it easier to inspect other kinds of Bruker files.
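As an illustration of the setuptools mechanism, registering such a console script could look like the sketch below (the rsciio.bruker.sfslib:unsfs entry point is only the proposed target, it does not exist yet):

    # setup.py sketch: registering the proposed `unsfs` command as a console script.
    # "rsciio.bruker.sfslib:unsfs" is the hypothetical entry point suggested above;
    # the function would parse command line arguments and extract the SFS contents.
    from setuptools import setup

    setup(
        entry_points={
            "console_scripts": [
                "unsfs = rsciio.bruker.sfslib:unsfs",
            ],
        },
    )

After installation, running `unsfs some_file.bcf` would then be available on the command line without any extra wiring.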
Bruker BCF parsers
  • get rid of guess_hv; replace it with a guess_tem_sem function (moved to rsciio.utils.tools) which decides not by HV but by the EDS detector's elevation angle (this could later be applied to other formats if exspy cannot be persuaded to drop the unnecessary EDS_SEM/EDS_TEM subdivision).
  • refactor the code and get rid of the numerous object creations and attributes. Instead, get all data into the original_metadata dict and use a unified mapping to map existing entries to the metadata dict expected by hyperspy. The mapping will allow skipping explicit if/else checks; currently there is a chain of functions each generating bits of metadata.
  • parse the pixel address table (a readily present file in bcf v2) or build one (for bcf v1). That is the first step toward random access to pixel data and improved lazy capabilities.
  • Include the FWHM, which can be read from the Sigma of the signal (which is actually the variance (sigma^2) at 0 keV). I investigated and found that Bruker uses a modified Fiori equation. I am not surprised: Fiori derived it decades ago, when there were no SDD detectors and counting electronics were worse. The only difference between the Fiori equation and the one used by Bruker is the factor 2.5, which Bruker slightly modifies to 2.455 (see the sketch after this list).
  • Include live time estimation from the 0 keV pulser.
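A minimal sketch of that FWHM relation as I read it (assumptions: energies in eV, the stored Bruker "Sigma" is the variance sigma0^2 of the detector response at 0 keV, and FWHM = 2*sqrt(2*ln 2)*sigma for a Gaussian peak):

    import math

    BRUKER_FIORI_FACTOR = 2.455  # the classic Fiori equation uses 2.5

    def fwhm_at_energy(sigma0_sq, energy_eV, factor=BRUKER_FIORI_FACTOR):
        """Approximate detector FWHM in eV at `energy_eV` from the variance at 0 keV."""
        fwhm0_sq = 8.0 * math.log(2.0) * sigma0_sq  # FWHM(0)**2 from sigma0**2
        return math.sqrt(fwhm0_sq + factor * energy_eV)

For example, fwhm_at_energy(sigma0_sq, 5898.0) would give the estimated resolution near Mn K-alpha.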

Then there is a question:
The documentation states that numba should be used where fast code is required and cython should be used only if a python version is implemented alongside it. To be honest, in the last 8 years I have not come across a situation where the cython extension (which, by the way, is distributed as C code) was problematic to install on different computers and toaster-like machines. I had no trouble getting it working even on a linux subsystem on an Android tablet!
I would happily get rid of the alternative slow python functions, or move them to an external file to unclutter _api.py. Also, for the lazy random access implementation, I would want to use only cython.

Why won't I write these in pure python and use numba decorators? Because its memory management is harder to control, especially in situations where the data is larger than memory, which this implementation will need to handle. In terms of memory consumption cython is predictable, numba less so.

@ericpre
Member

ericpre commented Dec 11, 2024

This sounds good and makes a lot of sense.

The documentation states that numba should be used where fast code is required and cython should be used only if a python version is implemented alongside it.

Maybe this should be removed, because its reasoning is possibly not valid anymore: there is now more tooling to build C extensions and it is significantly easier.
There may still be a benefit to keeping the slow python functions so that they can run on pyodide, which is useful for teaching purposes or for people to try things without having to install anything. Currently, the installation on pyodide doesn't use the C extension because it is not straightforward to build and distribute. This is likely to improve in the near future: tools like cibuildwheel can already produce pyodide builds, but pypi.org can't handle them! Alternatively, we could contribute to a package in the pyodide distribution.

@francisco-dlp
Member

francisco-dlp commented Dec 12, 2024

While parts of the original rationale for this rule may no longer be valid, other parts remain valid, and, as @ericpre points out, there are new reasons to privilege pure Python code. For me, the main interest in requiring a pure Python version of the code is that it improves its long-term maintainability. It is too early to know its true impact, but if mojo delivers on its promises, the pure Python version of the code may be better able to stand the test of time.

Would it be possible to implement the new version in pure Python, speed it up with numba, and if memory management becomes a limitation, build the Cython version from the Python code?

@sem-geologist
Contributor Author

For me, the main interest in requiring a pure Python version of the code is that it improves its long-term maintainability. It is too early to know its true impact, but if mojo delivers on its promises, the pure Python version of the code may be better able to stand the test of time.

Does numba recompile the code every time it is launched?

@ericpre
Member

ericpre commented Dec 12, 2024

By default, it will compile at runtime, the first time the function is used. Caching can be used to avoid compiling at runtime: https://numba.readthedocs.io/en/stable/developer/caching.html
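For illustration, enabling the on-disk cache is a one-argument change (per the numba documentation linked above):

    from numba import njit

    @njit(cache=True)  # compiled on first call; the machine code is then cached on disk
    def sum_squares(arr):  # e.g. a numpy array
        total = 0.0
        for x in arr:
            total += x * x
        return total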

@sem-geologist
Contributor Author

sem-geologist commented Dec 12, 2024

By default, it will compile at runtime, the first time the function is used. Caching can be used to avoid compiling at runtime: https://numba.readthedocs.io/en/stable/developer/caching.html

Would it be possible to implement the new version in pure Python, speed it up with numba, and if memory management becomes a limitation, build the Cython version from the Python code?

Well, that escalated rather quickly, and I have a definitive answer.

First, this attempt is not "from scratch" but a rework: tidying up and refactoring the code. The cython code that parses the embedded binary cubes encoded in packed Delphi/Pascal arrays was implemented 8 years ago, has been the library's only cython extension all this time, and is the piece of code that has needed the least fixing (although I did notice a significant improvement in the compiled binary size, down to ~200 kB from ~1 MB 8 years ago). There is nothing new to create there, as it works perfectly and is a time-proven solution. The test suite already tests both the cython and python implementations, since the same parsing function is accessible as a single python function; thus it is actually easy to test 3 cases: cython, pure python, and python wrapped with numba via the @jit_ifnumba decorator (a sketch of such a decorator is at the end of this comment). Initially I also wanted to measure the memory consumption of numba, but first let me show an execution time comparison for a largish file (a 118 MB XRF bcf):

implementation                   time from t=0 to full load of EDS hypercube [s]
cython                           5.81
python                           90.02
python wrapped with numba @jit   -Inf (failed to execute)

Let me show the error from the numba failure:

def py_parse_hypermap(virtual_file, shape, dtype, downsample=1):
    <source elided>
    buffer1 = next(iter_data)
    height, width = strct_unp("<ii", buffer1[:8])
    ^ 

This error may have been caused by the following argument(s):
- argument 0: Cannot determine Numba type of <class 'rsciio.bruker.sfslib.SFSTreeItem'>

cython has no problem using python objects, where numba falls flat on its face trying. It is all nice and shiny in the setup examples... but throw real work at it and it gives up. SFSTreeItem feeds the bytestring buffer to the parser. Dumba is too numb to figure out that the final product of that class is fed into struct.unpack, which accepts bytestrings and only bytestrings - apparently too hard to figure out. So much for miraculous intelligent tools...
Solutions? A) find out how to jit_ifnumba a class; B) refactor py_parse_hypermap() to accept an iterator (but won't dumba have a problem with that too?); C) forget about jit_ifdumba and use the time for more productive activities (ignoring dumba also means 20 MB less bloat - the amount conda had to download to install it).
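For reference, the jit-if-available pattern behind a decorator like @jit_ifnumba can be sketched as follows (this is only an illustration; the actual helper used in rsciio may differ in details):

    try:
        import numba
    except ImportError:  # numba is an optional dependency
        numba = None

    def jit_ifnumba(**numba_kwargs):
        """Apply numba.njit when numba is installed, otherwise return the function unchanged."""
        def decorator(func):
            if numba is None:
                return func  # pure python fallback
            return numba.njit(**numba_kwargs)(func)
        return decorator

A function decorated with @jit_ifnumba(cache=True) then runs as plain python when numba is missing and as compiled code when it is present.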

@sem-geologist
Contributor Author

Getting back to the benchmark times:

implementation   total execution time [s]
cython           5.81
python           90.02

One thing needs to be taken into consideration: those benchmark times also include the time spent reading from disk. So it is not 5.8 vs 90 (a factor of ~15.5); about 2 seconds are spent reading from disk in both cases, so the competing implementations are really 3.8 vs 88, which makes cython about 23 times faster.

@ericpre
Member

ericpre commented Dec 13, 2024

I was thinking the same and I agree: the cython implementation has been very low maintenance since it was implemented (off the top of my head, there may be a deprecation warning that appeared with a recent version of cython) - less than numba, for example.
Numba has its limitations, but its main advantages are that it is much easier to use for any contributor (no need to learn cython) and it allows having the pure python and accelerated versions with very minimal code changes. This allows us to make pure python wheels, which have their use cases (some are mentioned above).

In the context of this refactor, it makes sense to keep the cython implementation, but at the same time we should also keep the pure python implementation!
