Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible to avoid translating tuple parameter values as strings, or else allow a user-supplied Translator? #86

Open
blakeNaccarato opened this issue Nov 11, 2023 · 8 comments

Comments

@blakeNaccarato
Copy link

Parametrizing a notebook on parameters={"list_param": [0,1], "tuple_param": (0,1)} will result in list_param value being injected as a list, but the tuple_param value being injected as the stringified representation of a tuple, e.g. "(0, 1)" (literally with the quotes). I understand that ploomber-engine strives to match the papermill API, so maybe this is intentional to maintain that compatibility, but is there a chance for the parameter translator to faithfully inject tuples?

Maybe a more sensible request would be to allow the user to pass a custom Translator to PloomberClient, which could pass the custom Translator subclass to parametrize_notebook (below), allowing the user to customize behavior, in the case that it is desired to mimic the papermill API surface by default?

Contents of translate

Here we see that tuple falls through to the "last resort" translator which stringifies it.

@classmethod
def translate(cls, val):
"""Translate each of the standard json/yaml types to appropiate objects."""
if val is None:
return cls.translate_none(val)
elif isinstance(val, str):
return cls.translate_str(val)
# Needs to be before integer checks
elif isinstance(val, bool):
return cls.translate_bool(val)
elif isinstance(val, int):
return cls.translate_int(val)
elif isinstance(val, float):
return cls.translate_float(val)
elif isinstance(val, dict):
return cls.translate_dict(val)
elif isinstance(val, list):
return cls.translate_list(val)
# Use this generic translation as a last resort
return cls.translate_escaped_str(val)

Abbreviated contents of parametrize_notebook

def parametrize_notebook(nb, parameters):
"""Add parameters to a notebook object"""
...
params_translated = translate_parameters(
parameters=parameters, comment="Injected parameters"
)

@edublancas
Copy link
Contributor

We want to keep compatibility with papermill (although I agree that converting tuples into strings is an odd API choice).

I'm open to extending the API so users can customize the translator, what's the use case for this? wouldn't it be easier to convert the tuple into a list before executing the notebook?

@blakeNaccarato
Copy link
Author

blakeNaccarato commented Nov 21, 2023

We want to keep compatibility with papermill

Yeah, the papermill compatibility guarantee is understandable.

wouldn't it be easier to convert the tuple into a list before executing the notebook?

Yes, that's what I ended up doing in my case.

I'm open to extending the API so users can customize the translator, what's the use case for this?

I think I'm hitting these surprises because I'm interacting with namespaces of parametrized notebooks, a less-common use-case. I chose ploomber-engine instead of Ploomber proper to keep it lightweight and more adaptable to the patterns I'm using (particularly in testing).

I am injecting parameters and handling namespaces in calling code. One rationale is to keep file-IO logic out of the notebook logic. In tests, I pass different parameters through a notebook and assert certain namespace results.

Allowing a substitute translator would probably be better than special-casing the existing translator. I figured it could be a slippery-slope if you do tuple today, then someone wants slices, and so-on. An alternative extension point would allow users to pass an override mapping of types to handler functions, which wouldn't require subclassing Translator.

But a reasonable choice is also "the translator works the way that it does, we'll make sure the surprises are well-documented". It's really not that big of a deal to be limited to str, bool, int, float, dict, and list. If you like, I'm open to authoring the documentation change if that's the path chosen. I could also manage the feature implementation if that's the choice, once an approach is decided.

As an aside, get_namespace() doesn't directly accept parametrization like execute_notebook()does. So this requires importing some ploomber_engine internals, ploomber_engine._util.parametrize_notebook. Facilitating that on get_namespace() would be really cool!

I've provided more implementation details of this use-case below.

Details

Function get_nb_ns(), getting parametrized namespace using ploomber-engine internals

Also allows returning a subset of the notebook namespace, e.g. usually just a few attributes for manipulation in module code.

def get_nb_ns(
    nb: str, params: Params = NO_PARAMS, attributes: Attributes = NO_ATTRS
) -> SimpleNamespace:
    """Get notebook namespace, optionally parametrizing it."""
    nb_client = get_nb_client(nb)
    # We can't just `nb_client.execute(params=...)` since`nb_client.get_namespace()`
    # would execute all over again
    if params:
        parametrize_notebook(nb_client._nb, parameters=params)  # noqa: SLF001
    namespace = nb_client.get_namespace()
    return SimpleNamespace(
        **({attr: namespace[attr] for attr in attributes} if attributes else namespace)
    )

Implementation of export_contours()

The below export_contours() is run concurrently for different inputs via ProcessPoolExecutor to run the notebook for different conditions and write a resulting namespace object to HDF. This facilitates "sandwiching" the notebook logic with parameters and file I/O.

def export_contours(params: Params):
    """Export contours."""
    dest = EXP_DATA / "contours"
    ns = get_nb_ns(nb=read_exp_nb("find_contours"), params=params)
    dest.mkdir(exist_ok=True)
    path = (dest / f"contours_{ns.PATH_TIME}").with_suffix(".h5")
    ns.contours.to_hdf(path, key="contours", complib="zlib", complevel=9)

Some tests exercising notebook namespaces (parametrization boilerplate omitted)

...
def test_custom_features(ns, ax):
    ns.objects.plot(ax=ax)
    assert_index_equal(ns.objects.columns, ns.my_objects.columns, check_order=False)  # type: ignore
...
def test_find_collapse(ns):
    objects = ns.nondimensionalized_departing_long_lived_objects
    assert objects["Dimensionless bubble diameter"].min() < 0.2
...
def test_get_thermal_data(ns):
    assert ns.data.subcool.mean() == pytest.approx(3.65, abs=0.01, rel=0.01)

@edublancas
Copy link
Contributor

Thanks for the detailed answer!

I think we can break down the action items:

  1. Document how the parameter translation works so users know which data types are supported (and that any unsupported one is converted into a string)
  2. We can adopt a subclass approach to customize PloomberClient (see code below)
  3. If you also want to work on adding parametrization support for get_namespace, feel free to! Just open a new issue so we can track progress separately

Subclassing PloomberClient:

class MyCustomClient(PloomberClient):

    def translate_parameters(self, parameters):
        # custom translation logic

client = MyCustomClient(...)
client.execute(...)

This would imply modifying PloomberClient so it uses a default translator but subclasses could override this by re-writing a translate_parameters method. Thoughts?

@blakeNaccarato
Copy link
Author

blakeNaccarato commented Nov 22, 2023

Thanks for the detailed answer!

Thanks! It was so detailed, that I got self-conscious and folded half of it under <details> 😁. I think I'm about to outdo myself with this essay of a response, though. I've numbered relevant bullet points so you can refer to them in shorthand, e.g. (3), or help me decide the necessary ones per acceptance critera for the current work.

Also, it's nearly Thanksgiving, so know our communication will likely pause abruptly. I plan to work on it a bit over the weekend, but have enough to do in dev environment setup/documentation that I won't get into the implementation details this week, I'm sure.

  • (1) If I create this as a draft PR, will that avoid noise for maintainers and assignments? I don't want to create noise until the PR is ready or I explicitly reach out with further questions.
  • (2) If I need to mention one of the maintainers with an issue, should I ping you or someone else? Here or in the PR?

If you also want to work on adding parametrization support for get_namespace, feel free to! Just open a new issue so we can track progress separately

(3) I'll open an issue on this sometime soon, but I'll prioritize working on the current matter first. If someone else comes along and wants to work the other PR in the meantime, that's fine, too.

Plan

(4) Unless otherwise noted, I'll assume the following work happens in one PR. Let me know if you recommend a different grouping/sequence of work, e.g. if certain tasks should wait for a follow-up.

Documentation

Document how the parameter translation works so users know which data types are supported (and that any unsupported one is converted into a string)

  • (5) PloomberClient API doc will grow a translate_parameters method. This could be the source of truth documenting default parameter translation behavior. Or should there be a different source of truth that this and the following points back to?
  • (6) Also document compatible overrides of the class as features are implemented.
  • (7) Other places, e.g. in execute_notebook API reference, refer to parameters like "Arbitrary keyword arguments to pass to the notebook parameters." Reword this or add an admonition/warning below the basic description, linking to the source of truth detailing potentially surprising default behaviors and how to override.
  • (8) Also link to the source of truth in the User Guide section on parametrizing notebooks, but otherwise don't materially change that guide.
  • (9) Add a CHANGELOG entry.
  • (10) Separate PR? Once this is implemented, a user guide page detailing an example of sub-classing PloomberClient and modifying execute_notebook? Could be considered part of the optional demo recommendation.
  • (11) Separate PR? Should any Changed in version ... placeholders be added in this work/PR, or is that handled in a different process?

Modification of PloomberClient

We can adopt a subclass approach to customize PloomberClient. [...] This would imply modifying PloomberClient so it uses a default translator but subclasses could override this by re-writing a translate_parameters method. Thoughts?

  • (12) Write a default translate_parameters method.
    • (12a) Since ploomber_engine._translator.translate_parameters() is already used in a few places, should PloomberClient.translate_parameters() just call that existing function?
    • (12b) Or should it instantiate ploomber_engine._translator.PythonTranslator directly?
    • (12c) A different approach?
  • (13) Should existing calls to parameter translation throughout the codebase be changed to use the PloomberClient.translate_parameters() method instead? Worried about leaving dangling logic pointing to prior implementation.
  • (14) How can current exposed functionality like execute_notebook() operate on a custom user PloomberClient? It seems it's instantiated now via INIT_FUNCTION, so a client with custom translation won't be able to be used here? Should a client argument be added to such functions, allowing a custom client in? This will likely be incompatible with PloomberMemoryProfilerClient unless users actually subclass that one. I guess documentation in (7) is tied to this change.
  • (15) Separate PR? Verify existing subclasses of PloomberClient still work (already covered by tests?), e.g. PloomberLogger (especially since its own execute method override has a FIXME indicating duplicating some logic from PloomberClient) and PloomberMemoryProfilerClient. Maybe these are already sufficiently isolated and won't be affected as long as they don't override the translate_parameters method.
  • (16) If users are encouraged to subclass PloomberClient for translate_parameters handling, are there any breaking uses I should be aware of while implementing this method? I'm particularly interested in how a given client instance gets its hooks, like self.on_notebook_start and self.on_cell_* (used here). Can a user subclass inadvertently mess up those hooks? Or am I mixing these up, it seems maybe PloomberNotebookClient is actually not involved here?

Tests

  • (17) Write tests that exercise a basic custom subclass of PloomberClient, probably in tests/test_ipython.py.
  • (18) Separate PR? If necessary, also write tests that check function of existing PloomberClient subclasses detailed in (14).

Conclusion

This work will wire up the papermill-consistent translation to a translate_parameters method on PloomberClient, and expose this method for overriding through subclassing. Unexpected interactions are likely wherever existing subclasses of PloomberClient are in use, and it's not exactly clear to me how to handle those right now.

Maybe this work can be broken into a few PRs. Most important is the proposed feature, directly-related documentation, tests, and exposing this feature through various helper functions like execute_notebook(). I think that enabling custom translation is worth the effort, though.

@edublancas
Copy link
Contributor

@neelasha23: looping you so you're aware of this

(1) If I create this as a draft PR, will that avoid noise for maintainers and assignments? I don't want to create noise until the PR is ready or I explicitly reach out with further questions.

That's fine, opening a draft PR won't create any noise

(2) If I need to mention one of the maintainers with an issue, should I ping you or someone else? Here or in the PR?

you can tag me and @neelasha23, she'll be reviewing your code as well

(11) Separate PR? Should any Changed in version ... placeholders be added in this work/PR, or is that handled in a different process?

Please add an "added in version" in the docstring for the new translate_parameters method

(12) Write a default translate_parameters method.

we can use ploomber_engine._translator.translate_parameters()

(13) Should existing calls to parameter translation throughout the codebase be changed to use the PloomberClient.translate_parameters() method instead? Worried about leaving dangling logic pointing to prior implementation.

Unsure about this, where in the codebase are we calling the translate parameters? I'm tempted to keep those as is since calling PloomberClient.translate_parameters() should only happen inside the PloomberClient class. But let me know where you find other call instances

(14) How can current exposed functionality like execute_notebook() operate on a custom user PloomberClient? It seems it's instantiated now via INIT_FUNCTION, so a client with custom translation won't be able to be used here? Should a client argument be added to such functions, allowing a custom client in? This will likely be incompatible with PloomberMemoryProfilerClient unless users actually subclass that one. I guess documentation in (7) is tied to this change.

Fair point. I think let's leave this point out of the initial work. We can think of ways for users to call execute_notebook with a custom client. But let's not add too many things to a single PR.

(15) Separate PR? Verify existing subclasses of PloomberClient still work (already covered by tests?), e.g. PloomberLogger (especially since its own execute method override has a FIXME indicating duplicating some logic from PloomberClient) and PloomberMemoryProfilerClient. Maybe these are already sufficiently isolated and won't be affected as long as they don't override the translate_parameters method.

Good catch. If you feel like there are missing unit tests, please add them. This can be the first thing to tackle, as it'll ensure you that your changes aren't breaking existing functionality.

(16) If users are encouraged to subclass PloomberClient for translate_parameters handling, are there any breaking uses I should be aware of while implementing this method? I'm particularly interested in how a given client instance gets its hooks, like self.on_notebook_start and self.on_cell_* (used here). Can a user subclass inadvertently mess up those hooks? Or am I mixing these up, it seems maybe PloomberNotebookClient is actually not involved here?

Right PloomberNotebookClient is not involved. PloomberNotebookClient is a legacy client that derives from nbclient, this was our first implementation, before we implement a notebook executor from scratch. We consider this legacy code and might remove it in the future.


I hope I didn't miss anything important! Feel free to tag me in your draft PR if you have questions

@blakeNaccarato
Copy link
Author

Thanks for the thorough response. I created a draft PR, so I'll continue the discussion over there and ping you both as needed for clarification. First I'll establish a rough scope, set up my environment, and write tests covering existing subclass usage as needed, before getting into the actual implementation.

@edublancas
Copy link
Contributor

sounds good!

@blakeNaccarato
Copy link
Author

It looks like time has gotten away from me here, so I figured I'd chime in to say I'm still alive. I haven't gotten around to the implementation, though I do still intend to, and would like to put time into it this month. If any passers-by see this and are eager to work on this themselves, just @ me and I'll try to respond promptly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants