-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Speed up and extend normalizations #47
Comments
Cytotransform uses the FastLogicle C++ implementation from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4761345/. How do you feel about making cytotransform a dependency of pytometry? Might be easier to manage rather than ingesting the C++ code into this repo? It also accepts a Pandas DataFrame as input so a wrapper for anndata should be easy to create. I'm happy to pick this one up. |
I'm happy to make CytoTransform a dependency of pytometry. The wrapper should be relatively easy because most data matrices for flow data are dense. My only concern is that the pandas DataFrame is less memory efficient than a numpy 2D matrix, which might become problematic as soon as you use 10M+ cells. So I think that using the 2D numpy matrix as input would be preferable, especially when the C++ implementation will convert the input to a 2D array anyways. |
That makes perfect sense. It will be a simple refactor to cytotransform to fix this since under the hood it acts on the numpy array of the pandas dataframe anyway. I'll refactor cytotransform base class to accept a dataframe or numpy array and skip wrangling the output into a dataframe when numpy arrays are provided as input - see burtonrj/cytotransform#5 |
Hi Maren and Ross, It's great to see the Python cytometry ecosystem growing. However, I'm concerned that we may be unnecessarily fracturing our efforts. It seems we have 3 different implementations of transformations here. Perhaps there is a way to work together? Kind regards, |
Hi Scott, Best, |
Maren, Most transformations in FlowKit are modeled after the GatingML 2.0 specification, and their implementation is defined in the FlowUtils package. The exception is the FlowJo biex transformation, as you've discovered. The history of FlowUtils comes from inheriting the original fcm code base, which I broke apart with the hope that the individual parts would be more useful to others. I realize the requirements / needs for projects may differ, but the transformation functions are one of those rare, well-defined requirements that seem better served by many eyes on one set of implementations. I can move the FJ biex xform to FlowUtils if that would help. The reason I kept it separate was the added dependency of SciPy, where FlowUtils has the singular dependency of NumPy. And, I see Ross has parallelized some of the other xforms in the CytoTransform package. I also thought about doing this as the parallelization for transforming FCM events is obvious, but multiprocessing (mp) in Python has a double-edge. Nesting mp in Python is "here be dragons" territory. I use mp in FlowKit for processing multiple FCS samples using the same gating strategy, with the rationale that a user would unlikely mp different gating strategies. Adding mp to the transformations that are necessary for the gating functions threw a wrench in the works. Still, it is an interesting idea Ross has for making the xform classes optionally mp. Enough rambling by me, curious what you both think and what ideas you have. -Scott |
Hi guys! Thanks for the input Scott! Much appreciated and I too agree transformations is an area where we should all be aligned regardless of what else we are doing - maybe also on things like compensations, various utilities, gate definitions, but we can discuss that another time - then we should allow the user some choice around what frameworks they want to work in. Maren is doing great work expanding cytometry into the scanpy ecosystem and I am right behind her on that! In fact I even hope to retire CytoPy and point people this direction. As for cytotransform, I took inspiration from your work Scott and tried to make it GatingML 2.0 compatible and also performing testing against the GatingML 2.0 definitions (https://github.com/burtonrj/cytotransform/blob/main/tests/test_transformer.py). The mp approach is an ambitious attempt to brute force transforms on enormous datasets....but mp in Python is a swamp I will agree (hopefully to be fixed in 3.13 onwards https://peps.python.org/pep-0703/). Recognising that mp is not always desirable, cytotransform will just execute the function in a single process if |
Oh, also I've had far less installation woes with the FastLogicle.cpp implementation coupled with PyBind11. I haven't compared the performance of FastLogicle vs your C implementation though @whitews, not sure which is faster or more precise. |
Hi there, thank you all for your suggestions. I might be more taciturn, but I enjoy the discussion.
|
Maren & Ross, It sounds like we are pretty much on the same page. What if I move the transform classes from FlowKit to FlowUtils, along with moving the FlowJo Biex implementation there. This adds SciPy as a dependency, which is why it was not done previously, but does put all the transforms in one place. Ross, would you be available to help maintain FlowUtils if we did this? We could add the optional mp as well. I do feel that the more eyes we have on these critical implementations the better, inline with the Torvalds quote "given enough eyeballs, all bugs are shallow". One other point is that the Sample class in FlowKit does allow for a dictionary of transforms mapped to individual channels. The Gate classes are also designed to be used independently from the heavy Session & Workspace classes. Not sure if it was clear, but FlowKit is more or less feature complete. It will never expand to have clustering algos, a GUI, etc. This was a decision made early on after being a bit stung by the algo chase building ReFlow. |
Hi Scott, that sounds good to me. It would be great if the FlowJo Biexp implementation would also be in FlowUtils. |
Hi all, I'm happy with this call, I was going to have to maintain cytotransform anyway so it makes no difference to me, other than I won't be maintaining it on my own - completely agree with you @whitews, the more eyes the better! Let me know next steps RE changes to FlowUtils. |
Great, I've started a discussion on the FlowUtils repo here: We can plan the changes there (and avoid spamming this issue :) |
Bumping this as I plan to release new versions of FlowUtils and FlowKit soon to support Python 3.12. I've discussed the proposed changes with my colleagues at Duke and have the go ahead. We are also considering moving the base Gate classes to FlowUtils, as these are also well-defined and would likely be helpful to others...though there are some implementation details to be worked out, particularly around the ratio dimension. |
Do you also plan to add autologicle normalisation, like used in the R package cytofkit (cytofkit::autoLgcl, https://github.com/JinmiaoChenLab/cytofkit/blob/master/R/cytof_preProcess.R)? |
I haven't personally looked into parameter optimisation for logicle transform, I've always left parameters as options for the user. I would say we could incorporate this but first we need to develop a consensus in methodologies which (as discussed above) appears to be FlowUtils |
@alefrol638 The autologicle normalization function is now implemented in |
I would be in favor of unifying normalization functions between the different ecosystems, which if I followed this discussion correctly, means relying on
I think multiprocessing is important when moving towards datasets with 10s to 100s of millions of cells. I agree that Python multiprocessing is not perfect. Ideally the C functions can be paralellized (where applicable), or numba could be an option for this kind of code. I also made good experiences with replacing python multiprocessing with joblib.Parallel which is more robust and supports different backends, including |
I agree on unifying the well-defined components like the transforms and that it is not ideal to have such a critical repo with a single person. Equally important IMO is the stability of such a well-defined codebase, i.e. there's no need to change it unless something is broken. On that point, I did make a local branch of FlowUtils with class-based transforms (something mentioned earlier in the thread). In the end, I decided against it. It was awkward to have an end user of FlowKit use those "pass through" classes, as the docstrings are in a second tier import. Seems easy enough to wrap the transform functions in a class as close to the end user as possible. The important part is to unify the underlying math of the transforms. One other point, related to #65, is FlowUtils already has inverse functions for all the transforms. There are also a few other functions, including compensation and a GatingML compatible For optimization, I think it would be useful to define some requirements / expectations prior to any implementation. Some data sets and benchmarks would be good as part of that as well. There are several ways to implement this, each with pros and cons. A good case can be made that parallelization could be a downstream responsibility, it is rather easy to batch transforms per sample or even on a more granular level. Just a proposal, and I'd need to verify this on my end, but what if we move FlowUtils to scverse? I'd be in support of this if we can agree to keep it stable and focused, avoiding breaking changes for downstream packages such as pytometry and FlowKit. |
Fully agree.
I'd be up for that, but this is something I would like to ask in the core team first. I'll put it on the agenda for next week.
We now have some infrastructure for continuous benchmarking in place for scverse repos. See in action e.g. here: Wherever it makes sense, I would be slightly in favor of implementing paralellization in the central repo. As discussed above, it's trivial to do parallelization, but doing it well might not be that easy. |
If you need help with the benchmark setup, ask @ilan-gold or me. It’s easy to set up but you might need help with permissions. It’s also currently in alpha state so expect it to break sometimes. |
Hi @whitews, core team is fine with moving FlowUtils to scverse. One concern that came up is that the release process is well documented (we don't have a lot of experience with packages that have C extensions) and ideally automated via GitHub actions. So we all agree that this is the way forward and pytometry will replace its normalization code with FlowUtils? @mbuttner @quentinblampey? |
Hi @grst Thank you for the update. I am happy with the suggested action to replace pytometry's normalization code with FlowUtils. |
Hello @grst, I'm also fine with this! One small question though: since the Python package is called |
@grst OK, so how would this work? Would I remain the lead on FlowUtils? |
Of course! You remain (co-)owner of the repository. You can also add a section to the readme to make clear you are the lead developer. I guess @quentinblampey would get commit rights, too. Additionally all scverse core developers will have committ rights in theory, but not use them without your approval as long as you are reachable. |
To initiate the transfer it's probably easiest if you make me an owner of the flowutils repo. |
Current normalization tools in pytometry use the FlowKit implementation and are FlowJo input compatible, but could use a speed up, especially the functions for logicle and biexponential transformations. The Cytotransform package implements commonly used transformations and is much faster. An adaptation of these functions to the anndata format would be amazing.
The text was updated successfully, but these errors were encountered: