DataFrameGroupBy.idxmin() returns DataFrame, documentation says Series #15275

pganssle · 2017-01-31T23:36:06Z

Code Sample, a copy-pastable example if possible

import pandas as pd
df = pd.DataFrame([[0, 0],
                  [3, 0],
                  [1, 1]], index=list('ABC'), columns=list('ab'))

gby = df.groupby(by='b')

print(type(gby))            # <class 'pandas.core.groupby.DataFrameGroupBy'>
print(type(gby.idxmin()))   # <class 'pandas.core.frame.DataFrame'>

Problem description

According to the documentation, this is supposed to output a pandas.core.Series. To me, that seems to be what makes sense, but I'm not sure how or why this ended up returning a DataFrame. Is this just an issue with the documentation, or is it an issue with the code?

Edit: Ah, I understand why this returns a DataFrame now - if you have multiple columns the idxmin() might be different for each column. Seems that the documentation needs to be updated. I can make a PR if that's appropriate.

Output of `pd.show_versions()`

INSTALLED VERSIONS ------------------ commit: None python: 3.6.0.final.0 python-bits: 64 OS: Linux OS-release: 4.9.6-1-ARCH machine: x86_64 processor: byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8

pandas: 0.19.2
nose: None
pip: 9.0.1
setuptools: 34.0.3
Cython: 0.25.2
numpy: 1.12.0
scipy: None
statsmodels: None
xarray: None
IPython: 5.1.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
boto: None
pandas_datareader: None

The text was updated successfully, but these errors were encountered:

jreback · 2017-02-01T14:03:10Z

The documentation is actually right. What happens is that .groupby uses the doc string for certain 'whitelisted' methods, IOW, methods that are effectively called like

df.groupby(...).apply(lambda x: getattr(x, method)(...)) so in this case
df.groupby(...).apply(lambda x: x.idxmin())

then the results are inferred to be of the appropriate shape. Since this is a groupby, you get a Series for each group which are then stacked.

So the result is correct and the method of generating the document is 'correct', but this is a method that generates a Series (and not a scalar return), so .apply infers the correct output shape.

So we would need to update the templates a bit to fix this. A PR is welcome for this. I suspect there are several other methods that have this issue as well, but you would have to scan the doc-strings I think to see.

we actually just had some discussion w.r.t. this for .describe #15260 (comment) and in #15272 so these are going to be solved differently (but the idea is the same)

pganssle · 2017-02-01T14:33:43Z

OK, I think I understand what you are saying, but I'm a bit confused still about what the "expected result" should be here. If I'm reading you correctly, you're saying that the templates need to be changed such that DataFrameGroupby.idxmin() (and, presumably, idxmax() and possibly a host of other methods where the same thing happens) explicitly states that it returns a DataFrame, right?

I'm just confused because you say that "the documentation is actually right", but then go on to explain why the documentation says the wrong thing. I'm guessing that you meant that it's the right documentation for x.idxmin(), but that when it gets inherited by things that stack it, the final documentation is wrong.

jreback · 2017-02-01T14:50:46Z

@pganssle yes, I mean the groupby documention is wrong, but the source itself (DataFrame.idxmin/Series.idxmin) is right. We simply use that documentation.

We need a slightly more sophisticated strategy of using the source documentation, but changing the return type to what it actually is for some groupby methods. A way to do this would be to make the return type a replaceable parameter in the doc-string for Series/Dataframe (and then replace it locally there), and use that here as a template.

A bit of work, but I think we need to go down this route to avoid copy-pasting doc-strings, IOW, they will be defined in a singular place (on the source method, e.g. Series.idxmin or DataFrame.idxmin / even these could combined also), then using that in groupby with some Substitution parameters.

pganssle · 2017-02-01T15:16:04Z

OK, yeah, that's what I thought. I agree on the point of documentation, by the way - I've always felt like documentation tools (at least in Python, which is where I'm most familiar with them) were lacking in support for the DRY (don't repeat yourself) principle. You end up having to do some funky stuff like creating metaclasses or dynamically generate docstrings on import (which makes things a bit harder when navigating the code instead of looking at the generated documentation).

I'll take a look at how the templating looks and see if I can find an easy way to either make the return type a replaceable parameter, infer it from the code or both.

jreback · 2017-02-01T15:21:27Z

@pganssle yep, tooling is not great, though using Appender and Substitution we get pretty far.

thanks for looking into this.

pganssle · 2017-05-23T23:49:53Z

Hm... This is quite complicated, I'm not quite sure how Appender and Substitution are supposed to work in the case of inherited docstrings, because the top-level documentation itself has to be displayed, meaning the substitution has to take place there. In all derived functions, there's nothing to substitute.

I think the best way to handle this is to modify Substitution and Appender to preserve the original template versions to allow for function-specific docstring modifications.

Dr-Irv · 2023-02-22T13:42:01Z

It turns out that SeriesGroupBy.idxmin() and SeriesGroupBy.idxmax() are not even showing up in the docs, so that should be fixed as well.

rhshadrach · 2024-11-10T17:04:14Z

It turns out that SeriesGroupBy.idxmin() and SeriesGroupBy.idxmax() are not even showing up in the docs, so that should be fixed as well.

This has been fixed.

https://pandas.pydata.org/docs/dev/reference/api/pandas.core.groupby.SeriesGroupBy.idxmax.html

The DataFrameGroupBy.idxmin / max still shows it returning a Series. PRs to fix are welcome!

akourieh · 2024-11-14T19:13:54Z

Hello I would like to help, it seems the issue is here:

        Returns
        -------
        Series
            Indexes of minima along the specified axis.

this is in the code in pandas/pandas/core/frame.py for idxmin.

Should I make a PR by updating the docstring to Dataframe in that file ?

rhshadrach · 2024-11-17T12:54:32Z

I don't think so - the issue is in pandas.core.groupby.generic.

jreback added Difficulty Intermediate Docs Groupby labels Feb 1, 2017

jreback added this to the 0.20.0 milestone Feb 1, 2017

jreback modified the milestones: 0.20.0, Next Major Release Mar 23, 2017

jbrockmendel removed Effort Medium labels Oct 21, 2019

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022

rhshadrach added the good first issue label Nov 10, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DataFrameGroupBy.idxmin() returns DataFrame, documentation says Series #15275

DataFrameGroupBy.idxmin() returns DataFrame, documentation says Series #15275

pganssle commented Jan 31, 2017 •

edited

Loading

jreback commented Feb 1, 2017

pganssle commented Feb 1, 2017

jreback commented Feb 1, 2017

pganssle commented Feb 1, 2017

jreback commented Feb 1, 2017

pganssle commented May 23, 2017

Dr-Irv commented Feb 22, 2023

rhshadrach commented Nov 10, 2024

akourieh commented Nov 14, 2024

rhshadrach commented Nov 17, 2024

DataFrameGroupBy.idxmin() returns DataFrame, documentation says Series #15275

DataFrameGroupBy.idxmin() returns DataFrame, documentation says Series #15275

Comments

pganssle commented Jan 31, 2017 • edited Loading

Code Sample, a copy-pastable example if possible

Problem description

Output of pd.show_versions()

jreback commented Feb 1, 2017

pganssle commented Feb 1, 2017

jreback commented Feb 1, 2017

pganssle commented Feb 1, 2017

jreback commented Feb 1, 2017

pganssle commented May 23, 2017

Dr-Irv commented Feb 22, 2023

rhshadrach commented Nov 10, 2024

akourieh commented Nov 14, 2024

rhshadrach commented Nov 17, 2024

pganssle commented Jan 31, 2017 •

edited

Loading

Output of `pd.show_versions()`