Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DataFrameGroupBy.idxmin() returns DataFrame, documentation says Series #15275

Open
pganssle opened this issue Jan 31, 2017 · 10 comments
Open

DataFrameGroupBy.idxmin() returns DataFrame, documentation says Series #15275

pganssle opened this issue Jan 31, 2017 · 10 comments

Comments

@pganssle
Copy link
Contributor

pganssle commented Jan 31, 2017

Code Sample, a copy-pastable example if possible

import pandas as pd
df = pd.DataFrame([[0, 0],
                  [3, 0],
                  [1, 1]], index=list('ABC'), columns=list('ab'))

gby = df.groupby(by='b')

print(type(gby))            # <class 'pandas.core.groupby.DataFrameGroupBy'>
print(type(gby.idxmin()))   # <class 'pandas.core.frame.DataFrame'>

Problem description

According to the documentation, this is supposed to output a pandas.core.Series. To me, that seems to be what makes sense, but I'm not sure how or why this ended up returning a DataFrame. Is this just an issue with the documentation, or is it an issue with the code?

Edit: Ah, I understand why this returns a DataFrame now - if you have multiple columns the idxmin() might be different for each column. Seems that the documentation needs to be updated. I can make a PR if that's appropriate.

Output of pd.show_versions()

INSTALLED VERSIONS ------------------ commit: None python: 3.6.0.final.0 python-bits: 64 OS: Linux OS-release: 4.9.6-1-ARCH machine: x86_64 processor: byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8

pandas: 0.19.2
nose: None
pip: 9.0.1
setuptools: 34.0.3
Cython: 0.25.2
numpy: 1.12.0
scipy: None
statsmodels: None
xarray: None
IPython: 5.1.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
boto: None
pandas_datareader: None

@jreback
Copy link
Contributor

jreback commented Feb 1, 2017

The documentation is actually right. What happens is that .groupby uses the doc string for certain 'whitelisted' methods, IOW, methods that are effectively called like

df.groupby(...).apply(lambda x: getattr(x, method)(...)) so in this case
df.groupby(...).apply(lambda x: x.idxmin())

then the results are inferred to be of the appropriate shape. Since this is a groupby, you get a Series for each group which are then stacked.

So the result is correct and the method of generating the document is 'correct', but this is a method that generates a Series (and not a scalar return), so .apply infers the correct output shape.

So we would need to update the templates a bit to fix this. A PR is welcome for this. I suspect there are several other methods that have this issue as well, but you would have to scan the doc-strings I think to see.

we actually just had some discussion w.r.t. this for .describe #15260 (comment) and in #15272 so these are going to be solved differently (but the idea is the same)

@pganssle
Copy link
Contributor Author

pganssle commented Feb 1, 2017

OK, I think I understand what you are saying, but I'm a bit confused still about what the "expected result" should be here. If I'm reading you correctly, you're saying that the templates need to be changed such that DataFrameGroupby.idxmin() (and, presumably, idxmax() and possibly a host of other methods where the same thing happens) explicitly states that it returns a DataFrame, right?

I'm just confused because you say that "the documentation is actually right", but then go on to explain why the documentation says the wrong thing. I'm guessing that you meant that it's the right documentation for x.idxmin(), but that when it gets inherited by things that stack it, the final documentation is wrong.

@jreback
Copy link
Contributor

jreback commented Feb 1, 2017

@pganssle yes, I mean the groupby documention is wrong, but the source itself (DataFrame.idxmin/Series.idxmin) is right. We simply use that documentation.

We need a slightly more sophisticated strategy of using the source documentation, but changing the return type to what it actually is for some groupby methods. A way to do this would be to make the return type a replaceable parameter in the doc-string for Series/Dataframe (and then replace it locally there), and use that here as a template.

A bit of work, but I think we need to go down this route to avoid copy-pasting doc-strings, IOW, they will be defined in a singular place (on the source method, e.g. Series.idxmin or DataFrame.idxmin / even these could combined also), then using that in groupby with some Substitution parameters.

@pganssle
Copy link
Contributor Author

pganssle commented Feb 1, 2017

OK, yeah, that's what I thought. I agree on the point of documentation, by the way - I've always felt like documentation tools (at least in Python, which is where I'm most familiar with them) were lacking in support for the DRY (don't repeat yourself) principle. You end up having to do some funky stuff like creating metaclasses or dynamically generate docstrings on import (which makes things a bit harder when navigating the code instead of looking at the generated documentation).

I'll take a look at how the templating looks and see if I can find an easy way to either make the return type a replaceable parameter, infer it from the code or both.

@jreback
Copy link
Contributor

jreback commented Feb 1, 2017

@pganssle yep, tooling is not great, though using Appender and Substitution we get pretty far.

thanks for looking into this.

@jreback jreback modified the milestones: 0.20.0, Next Major Release Mar 23, 2017
@pganssle
Copy link
Contributor Author

Hm... This is quite complicated, I'm not quite sure how Appender and Substitution are supposed to work in the case of inherited docstrings, because the top-level documentation itself has to be displayed, meaning the substitution has to take place there. In all derived functions, there's nothing to substitute.

I think the best way to handle this is to modify Substitution and Appender to preserve the original template versions to allow for function-specific docstring modifications.

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@Dr-Irv
Copy link
Contributor

Dr-Irv commented Feb 22, 2023

It turns out that SeriesGroupBy.idxmin() and SeriesGroupBy.idxmax() are not even showing up in the docs, so that should be fixed as well.

@rhshadrach
Copy link
Member

It turns out that SeriesGroupBy.idxmin() and SeriesGroupBy.idxmax() are not even showing up in the docs, so that should be fixed as well.

This has been fixed.

https://pandas.pydata.org/docs/dev/reference/api/pandas.core.groupby.SeriesGroupBy.idxmax.html

The DataFrameGroupBy.idxmin / max still shows it returning a Series. PRs to fix are welcome!

@akourieh
Copy link

Hello I would like to help, it seems the issue is here:

        Returns
        -------
        Series
            Indexes of minima along the specified axis.

this is in the code in pandas/pandas/core/frame.py for idxmin.

Should I make a PR by updating the docstring to Dataframe in that file ?

@rhshadrach
Copy link
Member

I don't think so - the issue is in pandas.core.groupby.generic.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants