Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Suffix in pathlib is not behaving like a file extension #121347

Open
bbilly1 opened this issue Jul 3, 2024 · 10 comments
Open

Suffix in pathlib is not behaving like a file extension #121347

bbilly1 opened this issue Jul 3, 2024 · 10 comments
Labels
topic-pathlib type-bug An unexpected behavior, bug, or error

Comments

@bbilly1
Copy link

bbilly1 commented Jul 3, 2024

Bug report

Bug description:

According to the docs here a suffix is defined as:

The file extension of the final component, if any

But pathlib doesn't behave as expected, illustrating that on Python 3.12.4:

>>> Path("Mr. Smith resume for review").suffix
'. Smith resume for review'

This is particularly problematic for methods like with_suffix. According to the docs here that should:

Return a new path with the suffix changed. If the original path doesn’t have a suffix, the new suffix is appended instead. If the suffix is an empty string, the original suffix is removed

But as established above, that results in:

>>> Path("Mr. Smith resume for review").with_suffix(".pdf")
PosixPath('Mr.pdf')

I've characterized that as a bug as that is not working as described, but could also be a matter of documentation improvement.

I see a few ways:

  1. Implement add_suffix, or an argument to with_suffix, but I know that has been discussed before and was ultimately decided against. Also this is a matter of how suffix is defined, not how it's processed.
  2. Sanity check what a suffix can be, e.g. suffix can't have white space. But ultimately difficult without making assumption.
  3. Adjust the documentation: Clarify that suffix is not equal to file extension but something like last segment separated by a dot or similar. Maybe even with a warning that this can have unexpected behavior as showcased above.

CPython versions tested on:

3.12

Operating systems tested on:

Linux, Windows

@bbilly1 bbilly1 added the type-bug An unexpected behavior, bug, or error label Jul 3, 2024
@barneygale
Copy link
Contributor

How are you defining "file extension"?

@bbilly1
Copy link
Author

bbilly1 commented Jul 4, 2024

Common denominator I see is: "The file extension defines what kind of file it is."

Some definitions I have found:

  • Wikipedia: A filename extension, file name extension or file extension is a suffix to the name of a computer file (for example, .txt, .docx, .md). The extension indicates a characteristic of the file contents or its intended use.
  • microsoft: Windows file names have two parts separated by a period: first, the file name, and second, a three- or four-character extension that defines the file type. In expenses.xlsx, for example, the first part of the file name is expenses and the extension is xlsx.
    Extensions tell your computer which application created or can open the file and which icon to use for the file. For example, the docx extension tells your computer that Microsoft Word can open the file and to display a Word icon when you view it in File Explorer.
  • lenovo: A file extension is a three or four-letter code that appears at the end of a filename and indicates the type of file it is. For example, .txt stands for text files, .jpg stands for image files and .docx stands for Microsoft Word documents. By knowing what kind of file, it is, your computer will be able to correctly open the file using the correct program.
  • wordnik: noun A short series of letters and/or numerals at the end of a personal computer filename, used to indicate the type of file and the software that will be required to operate or open it.
  • techterms: A file extension (or simply "extension") is the suffix at the end of a filename that indicates what type of file it is. For example, in the filename "myreport.txt," the .TXT is the file extension. It indicates the file is a text document. Some other examples include .DOCX, which is used for Microsoft Word documents, and .PSD, which is the standard file extension for Photoshop documents.

@barneygale
Copy link
Contributor

Common denominator I see is: "The file extension defines what kind of file it is."

That's only true on Windows. On other operating systems, file extensions are an indicator and nothing more. I can rename an .mp3 file to .jpg on Linux and play it just fine.

Microsoft's definition ("three- or four-character extension") excludes some valid extensions like .a, .so and .patch, but fails to exclude file extensions with spaces such as . Smith resume for review

@barneygale
Copy link
Contributor

The point I'm getting at is that there is no standard definition of a file extension!

Since #82805 was solved, pathlib's suffix splitting works exactly like os.path.splitext(). A non-empty suffix starts with a dot and contains at most one dot, and a non-empty suffix must be preceded by a stem that contains at least one non-dot character.

@bbilly1
Copy link
Author

bbilly1 commented Jul 4, 2024

I get what you are saying. But even on Linux, that will depend on the implementation. E.g. xdg-open will happily ignore the file extension. But common GUI file browsers will not (tested on Thunar).

As there is no standard definition, I'd suggest to either define it, or avoid using the term all together and use the implementation as definition, e.g. last segment separated by a dot.

Even though it's not authoritatively defined, I'd argue, the above with_suffix example is unexpected behavior.

@eryksun
Copy link
Contributor

eryksun commented Jul 4, 2024

Microsoft's definition ("three- or four-character extension") excludes some valid extensions like .a, .so and .patch, but fails to exclude file extensions with spaces such as . Smith resume for review

The Windows shell API currently supports permanently associating a programmatic identifier (ProgID) with any file extension that does not include white space characters and that has a length from 1 to 198 characters (not including the dot). Thus the API supports ".a", ".so", and ".patch" as normal file extensions. If a file has no extension, or if the extension is longer than 198 characters or contains white space characters, then the API displays an open-with dialog that allows opening the file with an application just once instead of setting a permanent association.

@barneygale
Copy link
Contributor

barneygale commented Jul 4, 2024

Hum, that does give some legitimacy to the idea of forbidding whitespace in file extensions in pathlib.

@bbilly1
Copy link
Author

bbilly1 commented Jul 6, 2024

I think ideal behavior would be, that if you are trying to create a suffix with whitespace, throw an error. If you are trying to access a suffix, if there is whitespace, it's not a suffix but just a regular part of the filename.

@barneygale
Copy link
Contributor

The existing behaviour has some advantages:

  1. It's what existing users have come to expect (incumbent advantage)
  2. It's consistent with os.path.splitext()
  3. It's consistent between Windows and non-Windows paths when operating on a basename

So I'm weakly -1 on this idea, but happy to hear opinions from others.

@bbilly1
Copy link
Author

bbilly1 commented Oct 14, 2024

It's what existing users have come to expect

I respectfully disagree. I'd argue based on the with_suffix example above, this is unexpected. At least it was for me, so obviously anecdotal, but I'm an existing user. :-)

But yes, as an experienced developer you and I know about that pitfall, so we can account for it. Probably avoid using with_suffix and use string concatenation to append the file extension. Or check for any existing . dots in the filename before using with_suffix.

As with with_suffix it is super easy to shoot yourself into the foot. Bad code to illustrate my point:

from pathlib import Path

files = [
    Path("Mr. Smith resume for review v1"),
    Path("Mr. Smith resume for review v2"),
    Path("Mr. Smith resume for review v3"),
]

for path in files:
    path.rename(path.with_suffix(".pdf"))

Now you'll have deleted/overwritten v1 and v2 and only have v3 that is now called Mr.pdf.

But I just checked, it looks like the docs were updated in the mean time: #106650.

So that wording is much better and describes the behavior more accurately.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic-pathlib type-bug An unexpected behavior, bug, or error
Projects
None yet
Development

No branches or pull requests

3 participants