
Add get_jump_image_iter, fix tqdm #44

Open · wants to merge 5 commits into main
Conversation

HugoHakem
Collaborator

1) Motivation:

  • get_jump_image does not support iterables as input, which prevents loading every image associated with a list of desired metadata.

2) Solution:

  • Instead of simply looping over get_jump_image, get_jump_image_iter is introduced; it builds on batch_processing and parallel to load images in a threaded fashion.

A) Description of the Solution

Which input?

  1. metadata: (pl.DataFrame):

Metadata information is often stored in a pl.DataFrame (for instance, the output of get_item_location_info). Naturally, get_jump_image_iter therefore takes as input a pl.DataFrame containing exclusively this information, in the following order (consistent with get_jump_image):

(source, batch, plate, well)

  2. channel: List[str] (the desired channels)
  3. site: List[str] (the desired sites)
  4. correction: str = 'Orig' (the desired correction)
  5. print_progress: bool = True (whether to print the progress of the work with tqdm)

Which output?

  1. features: (pl.DataFrame):
    A polars DataFrame storing every piece of metadata (including channel, site and correction) plus the array containing the image information.
  2. work_fail: (List[tuple]):
    get_jump_image proved to fail and raise the following error:

"More than one site found"

This seems to be an issue with the data itself. So when the work fails, the tuple of inputs leading to the failure is stored in work_fail.
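Downstream, the two cases can be told apart by tuple length alone; a toy sketch with dummy data, assuming seven input fields per call (not the actual implementation):

```python
# Dummy results: a "success" tuple carries one extra element (the image),
# a "failure" tuple carries only the inputs. Values here are illustrative.
results = [
    ("source_1", "batch_A", "plate_1", "A01", "DNA", "1", "Orig", "img_array"),
    ("source_2", "batch_B", "plate_2", "B02", "DNA", "1", "Orig"),  # failed call
]

input_len = 7  # (source, batch, plate, well, channel, site, correction)
work_success = [r for r in results if len(r) > input_len]
work_fail = [r for r in results if len(r) == input_len]
```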

B) Subsequent modifications to enable the Solution

  1. To get past the error:

"More than one site found"

the function try_function has been created:

  • If it passes for a function f: return a tuple storing the input and the output: (*input, f(input))
  • Else, return only the input.

  2. A modification has been suggested for parallel and batch_processing to support tqdm.
    The tqdm position used to be enforced with this parameter:

position=job_idx

It was not behaving as desired in Jupyter notebooks. I therefore suggest using

position=0, desc=f"worker #{job_idx}", leave=True, disable=not print_progress

This prints every tqdm bar on the same line, and whenever one worker is done, the remaining bars are updated on the next line.
This is not ideal, but this solution still makes it possible to see both where the workers are in their process and which workers are done.
Other solutions exist on the web, but they only show which workers are done. In my opinion this is not as useful, since workers go at roughly the same pace; if the work of each worker is substantial, it will take a long time before any update appears anyway.

parallel and batch_processing are therefore modified to support the print_progress variable.
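A minimal sketch of the try_function behavior described above (the exact implementation in jump_portrait.utils may differ):

```python
def try_function(f):
    """Wrap f so that a failing call returns its inputs instead of raising.

    Sketch only: the real try_function in jump_portrait.utils may differ.
    """
    # This assumes the parameters are packed in a tuple
    def wrapper(parameters):
        try:
            # Success: tuple of function parameters + its result
            return (*parameters, f(*parameters))
        except Exception:
            # Failure: only the function parameters
            return parameters
    return wrapper
```

A success tuple is one element longer than its input tuple, which is what the later sort/groupby on len(x) relies on to separate work_fail from the successes.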

3) Test

The function has been tested using the following code:

metadata_pre = get_item_location_info("MYT1")
features_pre, work_fail = get_jump_image_iter(
    metadata_pre.select(pl.col(["Metadata_Source", "Metadata_Batch",
                                "Metadata_Plate", "Metadata_Well"])),
    channel=["DNA"],  # other options: 'ER', 'AGP', 'Mito', 'RNA'
    site=["1"],
    correction="Orig",  # other options: None, 'Illum'
    print_progress=True)

4) Questions

  1. Do we want get_jump_image_iter to be more flexible about the input DataFrame?
  2. There would be no need for try_function if the "More than one site found" issue in the data were addressed. Is there a way to tackle this issue?
  3. Is the tqdm solution satisfying?

@HugoHakem HugoHakem added enhancement New feature or request portrait labels Aug 19, 2024
from jump_portrait.utils import batch_processing, parallel

from jump_portrait.utils import batch_processing, parallel, try_function
from typing import List
Collaborator:

This can be replaced by just 'list' since Python 3.9 IIRC, so no need for the extra import.
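For illustration (hypothetical function name), the builtin generic works directly on Python ≥ 3.9:

```python
# Python >= 3.9: the builtin `list` supports subscripting, no typing.List needed
def normalize_channels(channel: list[str]) -> list[str]:
    return [c.upper() for c in channel]
```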

Comment on lines 128 to 129
Load jump image associated to metadata in a threaded fashion.
----------
Collaborator:

@@ -72,7 +75,7 @@ def parallel(
jobs = len(iterable)
slices = slice_iterable(iterable, jobs)
result = Parallel(n_jobs=jobs, timeout=timeout)(
delayed(func)(chunk, idx, *args, **kwargs)
delayed(func)(chunk, idx, print_progress, *args, **kwargs)
Collaborator:

print_progress is a bit too verbose. Let us rename it to "verbose" to follow conventions.

for item in item_list:
# pbar.set_description(f"Processing {item}")
for item in tqdm(item_list, position=0, leave=True,
disable=not print_progress,
Collaborator:

I think 'leave' may cause troubles. We should test it on the command line, by running it in a script, and alongside notebooks (which it was not supporting originally).

Collaborator Author:

I will try both and let you know. From what I remember, it allows it to work both in notebooks and scripts.

Comment on lines 133 to 138
channel(List[str]): list of channel desired
Must be in ['DNA', 'ER', 'AGP', 'Mito', 'RNA']
site(List[str]): list of site desired
For compound, must be in ['1' - '6']
For ORF, CRISPR, must be in ['1' - '9']
correction(str): Must be 'Illum' or 'Orig'
Collaborator:

The docstrings should be easy to read for humans, so syntax like List[str] is not ideal; replace it with 'list of strings'.

print_progress=print_progress)

img_list = sorted(img_list, key=lambda x: len(x))
fail_success = {k: list(g) for k, g in groupby(img_list, key=lambda x: len(x))}
Collaborator:

Nice use of groupby

@@ -119,6 +120,52 @@ def get_jump_image(
return result


def get_jump_image_iter(metadata: pl.DataFrame, channel: List[str],
Collaborator:

rename get_jump_image_iter to get_jump_image_batch

@@ -84,7 +85,7 @@ def get_jump_image(
Site identifier (also called foci), default is 1.
correction : str
Whether or not to use corrected data. It does not by default.
apply_illum : bool
apply_correction : bool
When correction=="Illum" apply Illum correction on original image.
Collaborator:

(I forgot to update the argument description). Please make it "When apply_correction==...."


img_list = sorted(img_list, key=lambda x: len(x))
fail_success = {k: list(g) for k, g in groupby(img_list, key=lambda x: len(x))}
if len(fail_success) == 1:
Collaborator:

This is the same as 'if len(fail_success):'

Collaborator

@afermg afermg left a comment

There are a few things necessary:

  • format the files with ruff
  • make docstrings human-legible
  • replace print_progress with verbose
  • run tests so things run (we should automate this at some point)
  • fix the issues indicated in the per-line comments
  • the batched image function will need some refactoring, as I specified in the comment under that function. The general idea is that we do not test by default whether the input yielded images or not. This messes up the order, and silent errors are hard to debug. I suggest giving users the option to ignore errors and using that to decide whether or not to wrap the get_jump_image function in a try-except block.

My main concern with the try-except wrapper around the batcher is that the interface is different from the normal get_image... stuff. On the other hand, it makes sense if we are batching a ton of images.

My solution is to pass "ignore_errors" as an argument (false by default) and then wrap the function in a try-except (or not) based on that argument. This changes the shape of the output, so the user must be conscious of making that decision.

Comment on lines 148 to 149
iterable = [(*metadata.row(i), ch, s, correction)
for i in range(metadata.shape[0]) for s in site for ch in channel]
Collaborator:

Replace triple-nested loop with itertools.product
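A sketch of the suggestion, with plain lists standing in for the DataFrame rows:

```python
from itertools import product

# Stand-ins for the metadata rows, site and channel from the snippet above
rows = [("source_1", "batch_A", "plate_1", "A01")]
site = ["1", "2"]
channel = ["DNA", "ER"]
correction = "Orig"

# One product call instead of a triple-nested comprehension;
# channel varies fastest, matching the original loop order
iterable = [(*row, ch, s, correction)
            for row, s, ch in product(rows, site, channel)]
```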

If it success, return a tuple of function parameters + its results
If it fails, return the function parameters
'''
# This assume parameters are packed in a tuple
Collaborator:

"This assumes"

Comment on lines 160 to 163
features = pl.DataFrame(img_success,
schema=["Metadata_Source", "Metadata_Batch", "Metadata_Plate", "Metadata_Well",
"channel", "site", "correction",
"img"])
Collaborator:

Do not put numpy arrays inside DataFrames. If you want to return a set of data plus metadata, return them as tuples. Normally I would suggest doing so by stacking all images, but I know they don't all have the same size, so let us use a tuple of (image, meta) pairs. A DataFrame is overkill for this. Just specify what metadata is included in the output in a comment at the end of the function.
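The suggested output shape could be sketched like this (illustrative values; nested lists stand in for image arrays of unequal shape):

```python
# A list of (metadata, image) pairs instead of a DataFrame holding arrays.
# Each metadata tuple is (source, batch, plate, well, channel, site, correction).
img_success = [
    (("source_1", "batch_A", "plate_1", "A01", "DNA", "1", "Orig"),
     [[0, 1], [1, 0]]),          # 2x2 image stand-in
    (("source_1", "batch_A", "plate_1", "A02", "DNA", "1", "Orig"),
     [[1, 1, 1]]),               # 1x3 image stand-in: shapes may differ
]

# Grouping or ordering is still possible with plain Python
wells = [meta[3] for meta, _ in img_success]
```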

Collaborator Author:

My main motivation for using a DataFrame is that it allows me to do some grouping if I need to, or to order things easily. But understood, I will remove that.

Comment on lines 30 to 32
from jump_portrait.utils import batch_processing, parallel

from jump_portrait.utils import batch_processing, parallel, try_function
from typing import List
from itertools import groupby
Collaborator:

Sort imports. Ruff should fix it automatically

@afermg afermg mentioned this pull request Sep 4, 2024
@@ -373,3 +362,11 @@ def get_gene_images(
)

return images

metadata_pre = get_item_location_info("MYT1")
Collaborator:

I think these were not supposed to be here. These are the lines for testing from the readme, right?
