python computing supplement #24

Open
bmreiniger opened this issue Dec 23, 2023 · 18 comments

Comments

@bmreiniger

I'd be interested in helping with a python computing supplement.

Did you have a format in mind? It seems likely that after the setup section, most sections could be tightly coupled between the R and python versions, which suggests maybe having two independent repositories isn't ideal? I think Quarto supports panelsets (as "tabsets"); that strikes me as a nice way to display the two, but also would mean both codes should be updated when a change is made.

One other thing that would be nice to decide on early: which python plotting library to use? plotnine mimics ggplot, matplotlib is already used by sklearn+pandas, others are slicker...

@topepo
Contributor

topepo commented Dec 24, 2023

We would make a computing-python repo and keep the same css and organization. The owner of that repo would have to decide whether Jupyter Notebooks or a more markdown-based approach is the most Pythonic.

For libraries... ¯\_(ツ)_/¯ I'd like to avoid extra complexity but would defer to the python community for those decisions.

@mermast

mermast commented Dec 26, 2023

I would also like to work on the python supplement. @bmreiniger, may we collaborate on this?

@ddixonAI

I'll also toss in my hat for a collaboration on sklearn/python code. Could be a fun project!

@lcrmorin

I'd like to work on this too. I have a decent knowledge of python for ML (Kaggle Notebooks GM). As already mentioned, it is difficult to imagine working without pandas, sklearn, and matplotlib. If plotnine is being considered as a replacement for matplotlib, I should also mention polars, which has a grammar closer to the tidyverse and is significantly better than pandas.

@mermast

mermast commented Dec 29, 2023

There is another plotting option, lets-plot

@sulphatet

I too would like to work on the python code. @bmreiniger, let's collaborate on this?

@topepo
Contributor

topepo commented Jan 2, 2024

I'll recant my previous statement:

> I'd like to avoid extra complexity but would defer to the python community for those decisions.

Use whatever libraries you see fit. We use a ton of R packages to make the book (that's the way R is); use anything that you think makes the best results.

@topepo
Contributor

topepo commented Jan 2, 2024

I suggest creating a starter repo using the structure and styling of the computing-tidymodels repo. Once you get all the Python bits set up, ping me.

@topepo
Contributor

topepo commented Jan 2, 2024

Also, I can export the data sets to a format that is easier for Python to ingest. What do you suggest? csv?

@bmreiniger
Author

bmreiniger commented Jan 3, 2024

I'll probably be more useful on content, but I have a little site deployment experience; when I get some time I'll draft something. If anybody else knows more and/or has more time, jump in. My first thoughts:

  0. Quarto in the same repo as tidyverse coding, with panelsets. I still think this is attractive enough to do a demo of. On the other hand, fully rebuilding the site would require both an R and a python env...
  1. Quarto with qmd files and python snippets. This mirrors the tidyverse version the closest, and styling should be trivially very close as well.
  2. Quarto with ipynb files. Nice that the jupyter notebooks could be downloaded and executed directly, but git diffs will be unpleasant.
  3. sphinx-gallery I think is how sklearn generates its examples. Straight python means easy diffs and easily runnable, markup in comments for text sections. But styling will be harder, I imagine.
  4. ...?

As for data format, csv is probably fine. At least until something comes up to suggest otherwise.

On plotting, I'd lean toward starting out with matplotlib (and using the plotting functionality of pandas and sklearn), and if anyone can make much nicer plots much easier with another package, then make a PR for us all to look at. Similarly, I'd start with pandas, but if @lcrmorin or others can make something look nicer (or much faster, even for the toy datasets I imagine we'll have here?) using polars then let's see that and decide together.
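
As a rough illustration of the "matplotlib via pandas" starting point, here is a minimal sketch. The data and the column names `predictor` and `outcome` are made up for the example and are not from the book's data sets. The point is only that pandas plotting returns a regular matplotlib `Axes`, so any further styling stays in plain matplotlib.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display

import numpy as np
import pandas as pd

# Synthetic toy data; both column names are hypothetical.
rng = np.random.default_rng(0)
df = pd.DataFrame({"predictor": rng.uniform(0, 10, size=100)})
df["outcome"] = 1.5 * df["predictor"] + rng.normal(size=100)

# pandas delegates to matplotlib and hands back the Axes it drew on.
ax = df.plot.scatter(x="predictor", y="outcome", alpha=0.6)

# From here on it is ordinary matplotlib.
ax.set_title("Toy scatter plot via pandas + matplotlib")
ax.figure.tight_layout()
```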

@bmreiniger
Author

bmreiniger commented Jan 3, 2024

@ddixonAI

ddixonAI commented Jan 3, 2024

I like that! Sphinx-gallery from option 3 looks nice as well, but this is an area I'm not well versed in, so I don't have a strong opinion.

On the subject of plotting libraries, another option I'm fond of is using the Seaborn objects API: https://seaborn.pydata.org/tutorial/objects_interface.html

This allows one to approximate a ggplot-like grammar of graphics using method chaining. As it says in the docs, it's still early in development but might be worth trying out.

@topepo
Contributor

topepo commented Jan 4, 2024

We've experimented with side-by-side R/python code and I've never seen it work all that well. I think that it should be Python only.

Based on other things that I've done, many of the people consuming the main site and these computing pages are not going to be well versed in Python or R. We'll need to strike a balance between helpful content for beginners and more experienced readers (including "how to install" docs).

That said, I think that @bmreiniger's options 1 and 2 are good (but I've never seen Sphinx-gallery until now and don't know if that works with Quarto).

@topepo
Contributor

topepo commented Jan 4, 2024

The demo looks good!

There are some nice Posit Python packages for tables and interactivity and many others unrelated to Posit (obviously).

Data splitting. sklearn’s train_test_split doesn’t support stratifying on a continuous outcome.

I was asked to discuss a PR or maybe a pip about this pre-pandemic. ¯\_(ツ)_/¯

There will be a lot of inconsistencies where R or Python have different (or more extensive) capabilities. It doesn't have to be a perfect reproduction of what is on the main site.

@lcrmorin

lcrmorin commented Jan 5, 2024

The best way to go is usually to stratify by pd.cut(df.target, n_grp, labels=False)... Regarding code translation, I have found LLMs to be very good at the task. It might be interesting to try this solution.
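
A minimal sketch of that binning workaround, using synthetic data. The column name `target` and the value of `n_grp` are placeholders for the example, not anything from the book's data sets.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data with a continuous outcome; `target` is a hypothetical column.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x": rng.normal(size=200),
    "target": rng.uniform(0, 100, size=200),
})

n_grp = 4  # number of bins; an arbitrary choice for this sketch

# Discretize the outcome into equal-width bins and keep the integer bin codes.
# For a skewed outcome, pd.qcut (equal-count bins) may be the safer choice,
# because train_test_split raises an error if any bin has fewer than 2 rows.
bins = pd.cut(df["target"], n_grp, labels=False)

# Stratify on the bin codes rather than on the continuous outcome itself.
train, test = train_test_split(
    df, test_size=0.25, stratify=bins, random_state=0
)
```

The bins are only used to drive the split; the continuous `target` column is left untouched in both `train` and `test`.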

@bmreiniger
Author

bmreiniger commented Jan 6, 2024

I think Sphinx would be used instead of Quarto. I'd like to put the same sort of demo together for that, but I suspect it'll end up being a similar amount of setup/work, with a very slight benefit of being pure .py scripts, and the detriment of being styled very differently from the rest of the project (barring a lot of work in defining a sphinx style/template).

I had some trouble getting renv set up, but now have a working demo of R+python in tabsets. Since it's in a branch of this repo, I don't know how to most readily make it viewable; you can download the html and view it here. But (1) it requires managing both envs (python inside of reticulate), (2) during render both sets of code run, effectively doubling the runtime and memory usage, and (3) switching between the rendered tabsets makes the rest of the page jump around when they're of different lengths; so I agree with @topepo that it's not worth it.

So it seems approach (1) is probably best, and I'll try to clean it up, complete with a python env. (Maybe I'll still demo sphinx for the sake of having done it.) So, another early question: which environment manager? I'd suggest conda or Pipenv; I find conda more intuitive, and Pipenv more rigorous.

@topepo
Contributor

topepo commented Jan 8, 2024

I want to keep the repos on Quarto just so that they are in one format. :-/

You can use Jupyter notebooks or basic Python chunks; you won't need R for anything.

@bmreiniger
Author

I've got a start in my org, if folks want to collaborate there. Ideally at some point it'd get moved under the aml4td org (with a name change)?
https://github.com/bmreiniger/aml4td-computing-python
