ENH: use '=' width in multirow with compatible column types #60281

Open
2 of 3 tasks
tuetenk0pp opened this issue Nov 11, 2024 · 7 comments
Labels
Enhancement, IO LaTeX, Styler

Comments

@tuetenk0pp

tuetenk0pp commented Nov 11, 2024

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

It would be nice to be able to use the = width option with \multirow. This could perhaps be decided automatically from the column type.

Feature Description

Add logic to _parse_latex_header_span() to check the column type:

IF columntype ONE OF l, c, r
    RETURN f"\\multirow[{multirow_align}]{{{rowspan}}}{{*}}{{{display_val}}}"
ELSE
    RETURN f"\\multirow[{multirow_align}]{{{rowspan}}}{{=}}{{{display_val}}}"
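
Read against the multirow documentation quoted under Additional Context, the intent is = for columns with a defined width and * for the standard l, c and r columns. A minimal Python sketch of that branch; multirow_cell and its col_type parameter are hypothetical names for illustration, not the actual signature of _parse_latex_header_span():

```python
def multirow_cell(display_val, rowspan, col_type, multirow_align="t"):
    """Build a \\multirow cell, choosing the width argument from the column type.

    col_type is assumed to be a single-letter column specifier such as
    "l", "c", "r" or "p" (hypothetical parameter, not part of pandas).
    """
    # l, c and r have no defined width, so keep the natural width "*".
    width = "*" if col_type in {"l", "c", "r"} else "="
    return f"\\multirow[{multirow_align}]{{{rowspan}}}{{{width}}}{{{display_val}}}"


print(multirow_cell("Farmers Market", 4, "p"))
print(multirow_cell("Farmers Market", 4, "l"))
```
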

Alternative Solutions

One could also find/replace the return value of pandas.DataFrame.to_latex manually:

import pandas as pd

df = pd.read_excel("input.xlsx")
df = df.set_index(df.columns[:3].tolist())
df = df.sort_index()

# multirow is an option of to_latex(), not of sort_index()
latex = df.to_latex(multirow=True)
latex = latex.replace("{*}", "{=}")

with open("output.tex", "w") as f:
    f.write(latex)

Additional Context

return f"\\multirow[{multirow_align}]{{{rowspan}}}{{*}}{{{display_val}}}"

See this snippet from the multirow documentation:

The width can also be given as = when the \multirow entry is given in a column that has a defined width, for example in a p{} column, an X column in tabularx or a L, C, R or J column in a tabulary environment. The text will be set in a \parbox of that width. If you give “=” in other situations, you will get strange results (usually a too wide column).

tuetenk0pp added the Enhancement and Needs Triage labels Nov 11, 2024
@rhshadrach
Member

@tuetenk0pp - can you include a reproducible example? We cannot reproduce your example without your input.xlsx file. It would be preferable for the example not to depend on external files at all, e.g.

df = pd.DataFrame(...)

rhshadrach added the Needs Info and IO LaTeX labels and removed the Needs Triage label Nov 11, 2024
@tuetenk0pp
Author

@rhshadrach, here is a more in-depth example of what is happening:

>>> import pandas as pd
>>> d = {'location': ['Supermarket', 'Supermarket', 'Supermarket', 'Supermarket', 'Drugstore', 'Drugstore', 'Drugstore', 'Drugstore', 'Farmers Market', 'Farmers Market', 'Farmers Market', 'Farmers Market'], 'category': ['Produce', 'Produce', 'Dry Goods', 'Dry Goods', 'Personal Care', 'Personal Care', 'Medicine', 'Medicine', 'Vegetables', 'Vegetables', 'Fruits', 'Fruits'], 'item': ['Apples', 'Bananas', 'Rice', 'Pasta', 'Shampoo', 'Toothpaste', 'Pain Reliever', 'Cough Sirup', 'Carrots', 'Spinach', 'Strawberrys', 'Oranges'], 'count': [6, 12, 2, 1, 1, 1, 1, 1, 1, 2, 1, 6]}
>>> df = pd.DataFrame(data=d)
>>> df = df.set_index(df.columns[:3].tolist())
>>> df = df.sort_index()

The DataFrame now looks like this:

>>> df
                                            count
location       category      item
Drugstore      Medicine      Cough Sirup        1
                             Pain Reliever      1
               Personal Care Shampoo            1
                             Toothpaste         1
Farmers Market Fruits        Oranges            6
                             Strawberrys        1
               Vegetables    Carrots            1
                             Spinach            2
Supermarket    Dry Goods     Pasta              1
                             Rice               2
               Produce       Apples             6
                             Bananas           12

Now I want to export the DataFrame to LaTeX:

>>> latex = r"""
... \documentclass[border=0.5cm]{standalone}
...
... \usepackage{array}
... \usepackage{multirow}
... \usepackage{booktabs}
... 
... \begin{document}
... """
>>> column_format: str = r'p{2cm}<{\raggedright} p{2cm}<{\raggedright} p{2cm}<{\raggedright} r'
>>> latex += '\n' + df.to_latex(column_format=column_format, multirow=True) + '\n' + r'\end{document}'

The LaTeX Code now looks like this:

>>> print(latex)
\documentclass[border=0.5cm]{standalone}

\usepackage{array}
\usepackage{multirow}
\usepackage{booktabs}

\begin{document}

\begin{tabular}{p{2cm}<{\raggedright} p{2cm}<{\raggedright} p{2cm}<{\raggedright} r}
\toprule
 &  &  & count \\
location & category & item &  \\
\midrule
\multirow[t]{4}{*}{Drugstore} & \multirow[t]{2}{*}{Medicine} & Cough Sirup & 1 \\
 &  & Pain Reliever & 1 \\
\cline{2-4}
 & \multirow[t]{2}{*}{Personal Care} & Shampoo & 1 \\
 &  & Toothpaste & 1 \\
\cline{1-4} \cline{2-4}
\multirow[t]{4}{*}{Farmers Market} & \multirow[t]{2}{*}{Fruits} & Oranges & 6 \\
 &  & Strawberrys & 1 \\
\cline{2-4}
 & \multirow[t]{2}{*}{Vegetables} & Carrots & 1 \\
 &  & Spinach & 2 \\
\cline{1-4} \cline{2-4}
\multirow[t]{4}{*}{Supermarket} & \multirow[t]{2}{*}{Dry Goods} & Pasta & 1 \\
 &  & Rice & 2 \\
\cline{2-4}
 & \multirow[t]{2}{*}{Produce} & Apples & 6 \\
 &  & Bananas & 12 \\
\cline{1-4} \cline{2-4}
\bottomrule
\end{tabular}

\end{document}

Here I have a screenshot of the compiled document:
[Image: screenshot of the compiled document]

As you can see, the multirow cells do not respect the fixed column width. The Farmers Market and Personal Care strings should wrap. This can be fixed by replacing the * with = in the width argument of \multirow{<nrows>}{<width>}{<text>}:

latex_wrap = latex.replace('{*}', '{=}')

The LaTeX Code now looks like this:

>>> print(latex_wrap)
\documentclass[border=0.5cm]{standalone}

\usepackage{array}
\usepackage{multirow}
\usepackage{booktabs}

\begin{document}

\begin{tabular}{p{2cm}<{\raggedright} p{2cm}<{\raggedright} p{2cm}<{\raggedright} r}
\toprule
 &  &  & count \\
location & category & item &  \\
\midrule
\multirow[t]{4}{=}{Drugstore} & \multirow[t]{2}{=}{Medicine} & Cough Sirup & 1 \\
 &  & Pain Reliever & 1 \\
\cline{2-4}
 & \multirow[t]{2}{=}{Personal Care} & Shampoo & 1 \\
 &  & Toothpaste & 1 \\
\cline{1-4} \cline{2-4}
\multirow[t]{4}{=}{Farmers Market} & \multirow[t]{2}{=}{Fruits} & Oranges & 6 \\
 &  & Strawberrys & 1 \\
\cline{2-4}
 & \multirow[t]{2}{=}{Vegetables} & Carrots & 1 \\
 &  & Spinach & 2 \\
\cline{1-4} \cline{2-4}
\multirow[t]{4}{=}{Supermarket} & \multirow[t]{2}{=}{Dry Goods} & Pasta & 1 \\
 &  & Rice & 2 \\
\cline{2-4}
 & \multirow[t]{2}{=}{Produce} & Apples & 6 \\
 &  & Bananas & 12 \\
\cline{1-4} \cline{2-4}
\bottomrule
\end{tabular}

\end{document}

As you can see, the issue no longer exists in the compiled document. The Farmers Market and Personal Care strings now wrap into the next line.
[Image: screenshot of the compiled document with wrapped multirow cells]

Note

Finding every * and replacing it with = becomes much harder once not every column containing multirows has a fixed width, so the workaround as presented would no longer work. This is why I think this should actually become a feature. Manipulating other portions of the LaTeX output, e.g. replacing the tabular environment with a tabularx environment, remains simple.
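
To illustrate why the mixed case is more awkward, the blanket replace could be narrowed to specific columns. This is a rough sketch only, assuming cells are separated by unescaped & characters and that the caller already knows the zero-based positions of the fixed-width columns; widen_multirows and fixed_width_columns are hypothetical names, not part of pandas:

```python
import re


def widen_multirows(latex: str, fixed_width_columns: set[int]) -> str:
    """Replace {*} with {=} only in \\multirow cells of the given columns."""
    out = []
    for line in latex.split("\n"):
        # Split on "&" but not on the escaped form "\&".
        cells = re.split(r"(?<!\\)&", line)
        for i, cell in enumerate(cells):
            if i in fixed_width_columns and r"\multirow" in cell:
                # Only the first {*} is touched; a literal "{*}" inside the
                # cell text itself is not handled by this sketch.
                cells[i] = cell.replace("{*}", "{=}", 1)
        out.append("&".join(cells))
    return "\n".join(out)


row = r"\multirow[t]{4}{*}{Drugstore} & \multirow[t]{2}{*}{Medicine} & Cough Sirup & 1 \\"
print(widen_multirows(row, {0}))
```

Even this needs the column positions to be supplied by hand, which is the information to_latex() already has available through column_format.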

Here is the summarized code for reproduction:

import pandas as pd
import subprocess

d = {'location': ['Supermarket', 'Supermarket', 'Supermarket', 'Supermarket', 'Drugstore', 'Drugstore', 'Drugstore', 'Drugstore', 'Farmers Market', 'Farmers Market', 'Farmers Market', 'Farmers Market'], 'category': ['Produce', 'Produce', 'Dry Goods', 'Dry Goods', 'Personal Care', 'Personal Care', 'Medicine', 'Medicine', 'Vegetables', 'Vegetables', 'Fruits', 'Fruits'], 'item': ['Apples', 'Bananas', 'Rice', 'Pasta', 'Shampoo', 'Toothpaste', 'Pain Reliever', 'Cough Sirup', 'Carrots', 'Spinach', 'Strawberrys', 'Oranges'], 'count': [6, 12, 2, 1, 1, 1, 1, 1, 1, 2, 1, 6]}

df = pd.DataFrame(data=d)

df = df.set_index(df.columns[:3].tolist())
df = df.sort_index()
print(df)

latex = r"""
\documentclass[border=0.5cm]{standalone}

\usepackage{array}
\usepackage{multirow}
\usepackage{booktabs}

\begin{document}
"""

column_format: str = r'p{2cm}<{\raggedright} p{2cm}<{\raggedright} p{2cm}<{\raggedright} r'

latex += '\n' + df.to_latex(column_format=column_format, multirow=True) + '\n' + r'\end{document}'

latex_flex = latex.replace('{*}', '{=}')

with open('multirow.tex', 'w') as f:
    f.write(latex)

with open('multirow_flex.tex', 'w') as f:
    f.write(latex_flex)

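for file in ['multirow.tex', 'multirow_flex.tex']:
    cmd = ['latexmk', '-lualatex', '-interaction=nonstopmode', '-file-line-error', '-silent', '-auxdir=./tmp', file]

    try:
        # check=True raises CalledProcessError if latexmk exits with an error
        subprocess.run(cmd, check=True)
    except (OSError, subprocess.CalledProcessError) as exc:
        print(f'could not run command: {cmd} ({exc})')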

@rhshadrach
Member

Thanks for the additional information! From the OP, you have the logic:

IF columntype NOT ONE OF l, c, r

I do not understand branching on this condition, can you elaborate here?

@tuetenk0pp
Author

The multirow documentation says:

The width can also be given as = when the \multirow entry is given in a column that has a defined width, for example in a p{} column, an X column in tabularx or a L, C, R or J column in a tabulary environment. The text will be set in a \parbox of that width. If you give “=” in other situations, you will get strange results (usually a too wide column).

So multirow is able to adjust the width of the cell if the column has a defined width. This is generally the case for the LaTeX column types mentioned in the quote, but not for the standard column types l, c and r. The column type S from the siunitx package does not define a column width either.

That's why I thought it would be a good idea to check the column type (from the column_format argument of to_latex()) and return the \multirow ... string accordingly.

With the added S column type the logic would look like this:

IF columntype ONE OF l, c, r, S
    RETURN f"\\multirow[{multirow_align}]{{{rowspan}}}{{*}}{{{display_val}}}"
ELSE
    RETURN f"\\multirow[{multirow_align}]{{{rowspan}}}{{=}}{{{display_val}}}"

Or a safer approach that uses = only for column types known to have a defined width (those in the quote above, plus m{} and b{} from the array package) and does not support any others:

IF columntype ONE OF p{}, m{}, b{}, X, L, C, R, J
    RETURN f"\\multirow[{multirow_align}]{{{rowspan}}}{{=}}{{{display_val}}}"
ELSE
    RETURN f"\\multirow[{multirow_align}]{{{rowspan}}}{{*}}{{{display_val}}}"
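
The safer variant could be sketched in Python as follows. This assumes a simple column_format made of single-letter column types with optional brace arguments and >{}, <{}, @{}, !{} decorations; repetition syntax such as *{3}{c} and optional [..] arguments are not handled. The names column_types, multirow_width and FIXED_WIDTH_TYPES are hypothetical and not part of pandas:

```python
# Column types assumed to carry a defined width (hypothetical whitelist).
FIXED_WIDTH_TYPES = {"p", "m", "b", "X", "L", "C", "R", "J"}


def _skip_group(spec: str, i: int) -> int:
    """Return the index just past the brace group that starts at spec[i]."""
    depth = 0
    while i < len(spec):
        if spec[i] == "{":
            depth += 1
        elif spec[i] == "}":
            depth -= 1
            if depth == 0:
                return i + 1
        i += 1
    raise ValueError("unbalanced braces in column_format")


def column_types(column_format: str) -> list[str]:
    """Extract one type letter per column from a simple column_format."""
    types = []
    i = 0
    while i < len(column_format):
        ch = column_format[i]
        if ch.isspace() or ch == "|":
            i += 1
        elif ch in "<>@!":
            # Decorations such as <{\raggedright} do not start a column.
            i = _skip_group(column_format, i + 1)
        elif ch.isalpha():
            types.append(ch)
            i += 1
            # Skip a width argument such as the {2cm} in p{2cm}.
            if i < len(column_format) and column_format[i] == "{":
                i = _skip_group(column_format, i)
        else:
            raise ValueError(f"unsupported token {ch!r} in column_format")
    return types


def multirow_width(col_type: str) -> str:
    """Return "=" for whitelisted fixed-width column types, else "*"."""
    return "=" if col_type in FIXED_WIDTH_TYPES else "*"


fmt = r"p{2cm}<{\raggedright} p{2cm}<{\raggedright} p{2cm}<{\raggedright} r"
print([multirow_width(t) for t in column_types(fmt)])
```

With a whitelist, an unknown or user-defined column type falls back to *, which matches the current output.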

@rhshadrach
Member

Thanks, PRs to fix are welcome!

cc @attack68

rhshadrach removed the Needs Info label Nov 17, 2024
@attack68
Contributor

What happens if "=" is given in all cases? Does it still compile even for 'standard' columns?

attack68 added the Styler label Nov 17, 2024
@tuetenk0pp
Author

What happens if "=" is given in all cases? Does it still compile even for 'standard' columns?

The documentation states:

If you give “=” in other situations, you will get strange results (usually a too wide column).
