Best way to log a Pandas DataFrame #1265

Open
vladjohnson opened this issue Dec 27, 2024 · 4 comments
Labels
question Further information is requested

Comments

@vladjohnson

What would be the best way to log a Pandas DataFrame without experiencing spacing issues? Thank you!

@Delgan
Owner

Delgan commented Dec 30, 2024

Hi @vladjohnson.

Can you please clarify the "spacing issues" you're encountering?

The following code produces a readable table in the logs (note I just added "\n" in the message):

import pandas
from loguru import logger

data = {
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["New York", "Los Angeles", "Chicago"],
}

df = pandas.DataFrame(data)

logger.info("Created DataFrame:\n{}", df)
2024-12-30 20:14:04.571 | INFO     | __main__:<module>:13 - Created DataFrame:
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago

@Delgan Delgan added the question Further information is requested label Dec 30, 2024
@CesarArroyo09

CesarArroyo09 commented Jan 7, 2025

Data practitioner here. I work heavily with pandas DataFrames, and I noticed something odd when using the info() method. So I put together a quick script covering the different ways we might want to log information about a DataFrame.

The first thing is: under no circumstances should you try to log the full DataFrame. This is just not what logging is for.

The entire Python script is attached as text file: main.txt. I used the TMDB dataset at TMDB Movies, version 349.

Libraries versions

python==3.11
pandas==2.2.3
loguru==0.7.3

TL;DR

Passing the DataFrame itself, or the result of the head or describe methods, works for logging. These all return actual DataFrame (or Series) objects, so loguru formats them with their string representation, which is the same table you see when printing them.

Using the info method requires some extra work so that the information is kept in the log file.

Default pandas output

The way suggested by @Delgan.

logger.info("TMDB dataset:\n{}", tmdb)

This is going to produce something like:

2025-01-07 13:55:02.998 | INFO     | __main__:main:13 - TMDB dataset:
              id                                              title  ...  imdb_votes                       poster_path
0              2                                              Ariel  ...      8870.0  /ojDg0PGvs6R9xYFodRct2kdI6wC.jpg
1              3                                Shadows in Paradise  ...      7654.0  /nj01hspawPof0mJmlgfjuLyJuRN.jpg
2              5                                         Four Rooms  ...    113089.0  /pyCk5JgtRZwRxnXwfrvyzukaKue.jpg
3              6                                     Judgment Night  ...     19456.0  /3rvvpS9YPM5HB2f4HYiNiJVtdam.jpg
4              8                   Life in Loops (A Megacities RMX)  ...       284.0  /7ln81BRnPR2wqxuITZxEciCe1lc.jpg
...          ...                                                ...  ...         ...                               ...
1039107  1413701  The Biggest Joke In The NFL: Why The Jacksonvi...  ...         NaN                               NaN
1039108  1413702                                Sleep While I Drive  ...         NaN                               NaN
1039109  1413703  Cast & crew IMDbPro Randy Feltface: Smug Druggles  ...        88.0                               NaN
1039110  2662126                                                NaN  ...         NaN                               NaN
1039111  5180730                                                NaN  ...         NaN                               NaN

[1039112 rows x 28 columns]

pd.DataFrame.info method

According to the documentation this returns None and just prints the results.

So trying something like:

logger.info("Info for the TMDB dataset:\n{}", tmdb.info())

will produce:

2025-01-07 13:55:03.459 | INFO     | __main__:main:16 - Info for the TMDB dataset:
None

You will see the output in your terminal, because the method does run and prints to standard output, but it is not caught by loguru and does not reach the log files you added.

Catching the printed value of the pd.DataFrame.info method

There is a workaround for the problem described in the previous section: redirect standard output into an in-memory buffer while info() runs.

import contextlib
import io

with contextlib.redirect_stdout(io.StringIO()) as new_stdout:
    tmdb.info()
logger.info("Info for the TMDB dataset:\n{}", new_stdout.getvalue())

This will produce the expected:

2025-01-07 13:55:03.867 | INFO     | __main__:main:22 - Info for the TMDB dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1039112 entries, 0 to 1039111
Data columns (total 28 columns):
 #   Column                   Non-Null Count    Dtype  
---  ------                   --------------    -----  
 0   id                       1039112 non-null  int64  
 1   title                    1039100 non-null  object 
 2   vote_average             1039110 non-null  float64
 3   vote_count               1039110 non-null  float64
 4   status                   1039110 non-null  object 
 5   release_date             922634 non-null   object 
 6   revenue                  1039110 non-null  float64
 7   runtime                  1039110 non-null  float64
 8   budget                   1039110 non-null  float64
 9   imdb_id                  598669 non-null   object 
 10  original_language        1039110 non-null  object 
 11  original_title           1039101 non-null  object 
 12  overview                 855498 non-null   object 
 13  popularity               1039110 non-null  float64
 14  tagline                  155198 non-null   object 
 15  genres                   738855 non-null   object 
 16  production_companies     481095 non-null   object 
 17  production_countries     628698 non-null   object 
 18  spoken_languages         641389 non-null   object 
 19  cast                     694867 non-null   object 
 20  director                 852787 non-null   object 
 21  director_of_photography  251268 non-null   object 
 22  writers                  505325 non-null   object 
 23  producers                332553 non-null   object 
 24  music_composer           102444 non-null   object 
 25  imdb_rating              434737 non-null   float64
 26  imdb_votes               434737 non-null   float64
 27  poster_path              739231 non-null   object 
dtypes: float64(8), int64(1), object(19)
memory usage: 222.0+ MB

head and describe methods

I would not recommend using these methods directly on large DataFrames, as the output will be truncated, or the formatting will be so cumbersome that it is extremely difficult to read.

In the attached script I use them on a pd.Series that results from aggregating the pd.DataFrame. These are the patterns:

logger.info("Top 10 directors by total revenue:\n{}", revenue_per_director.head(10))
logger.info(
    "Info about the revenue per director:\n{}", revenue_per_director.describe()
)

@Delgan
Owner

Delgan commented Jan 7, 2025

Thanks for these extensive guidelines, @CesarArroyo09!

Note that to capture the output of DataFrame.info(), it is also possible to use the buf argument, as hinted in their documentation.

@CesarArroyo09

My pleasure @Delgan! loguru was a very pleasant discovery. Great job!

@vladjohnson Hope the information is clarifying!

I would only add that when collecting the results of the .info method through the buf parameter, each call appends its output to the StringIO object instead of replacing what is already there.

For example:

import io

from loguru import logger

buffer = io.StringIO()

# Assume df1 and df2 already exist

df1.info(buf=buffer)
logger.info(buffer.getvalue()) # This shows the result of `df1.info()`

df2.info(buf=buffer) # This will append the print to buffer
logger.info(buffer.getvalue()) # This logs both `df1.info()` and `df2.info()`

This can be avoided in 2 ways:

  1. Use the context manager as described in my other comment.
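  2. Keep just one instance buffer=StringIO() but empty it before adding new information: buffer.seek(0) followed by buffer.truncate(0). Calling truncate(0) alone leaves the write position where it was, so the next write is padded with null characters.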


3 participants