Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Docs] Clarify usage of include_package_data/package_data/exclude_package_data on package data files #4643

Merged
merged 21 commits into from
Sep 26, 2024

Conversation

DanielYang59
Copy link
Contributor

@DanielYang59 DanielYang59 commented Sep 10, 2024

Summary of changes

@DanielYang59 DanielYang59 changed the title Clarify pack data doc [Docs] Clarify usage of include-package-data on inclusion of package data files Sep 10, 2024
@@ -35,16 +37,34 @@ For example, if the package tree looks like this::
├── data1.txt
└── data2.txt

and you supply this configuration:
When at least one of the following conditions are met:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The include_package_data keyword would only have an effect when condition is met, perhaps it's good to state these upfront?

Copy link
Contributor

@abravalheri abravalheri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for your contribution!

Comment on lines 5 to 12
Old packaging installation methods in the Python ecosystem have
traditionally allowed the inclusion of "data files" (files beyond
:ref:`the default set <manifest>` ), which are placed in a platform-specific
location. However, the most common use case for data files distributed
with a package is for use *by* the package, usually by including the
data files **inside the package directory**.

``Setuptools`` focuses on this most common type of data files and offers three ways
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your suggestion. I understand that your main objective was to clarify what “data files” are.

While this intention is very appreciated and welcome, the proposed parenthesis seem to suggest: data file = file not included in the "default set". This could lead to confusion, as what defines a data file is not fundamentally related to whether a file is included in the default set or not. For example, we could change setuptools to start including .json files automatically, but that would not make them more or less “data file”-y.

Could you please have a look at the following suggestion (which adds a new paragraph before the original text)? Would this address your concerns?

Suggested change
Old packaging installation methods in the Python ecosystem have
traditionally allowed the inclusion of "data files" (files beyond
:ref:`the default set <manifest>` ), which are placed in a platform-specific
location. However, the most common use case for data files distributed
with a package is for use *by* the package, usually by including the
data files **inside the package directory**.
``Setuptools`` focuses on this most common type of data files and offers three ways
In the Python ecosystem, the term "data files" is used in various complex scenarios
and can have nuanced meanings.
For the purposes of this documentation, we define "data files" as non-Python files
that are installed alongside Python modules and packages on the user's machine
when they install a :term:`distribution <Distribution Package>` from PyPI
or via a ``.whl`` file.
These files are typically intended for use at runtime by the package itself or
to influence the behavior of other packages or systems.
They may also be referred to as "resource files."
Old packaging installation methods in the Python ecosystem
have traditionally allowed installation of "data files", which
are placed in a platform-specific location. However, the most common use case
for data files distributed with a package is for use *by* the package, usually
by including the data files **inside the package directory**.
Setuptools focuses on this most common type of data files and offers three ways

Does this look good to you?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your suggestion. I understand that your main objective was to clarify what “data files” are.

Yes exactly. Because this is the starting sentence of the Data Files Support section, and we have to make sure users understand what is considered "data file" in the first place, such that they could decide whether they even need these keywords at all.

For example C source code/READMEs are not "Python" files, but we don't need to declare them to include them into the package. And therefore we might need to clarify the following definition of "data files":

we define "data files" as non-Python files

the proposed parenthesis seem to suggest: data file = file not included in the "default set". This could lead to confusion, as what defines a data file is not fundamentally related to whether a file is included in the default set or not.

Thanks a ton for the clarification! Yes my sketch indeed would need more discussion and check.

They may also be referred to as "resource files."

Personally I think the following terminologies are already very confusing: "package data", "data file", "source distribution", "files inside the wheel", and their difference is pretty unclear, and can we somehow define/clarify them, and perhaps avoid introducing more synonyms ("resource files" in this case)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no problems with omitting resource files, I just mentioned before because other packages started calling them that way (e.g. importlib_resources). Please feel free to use and modify my suggestion to align with your vision in the contribution.


For example C source code/READMEs are not "Python" files, but we don't need to declare them to include them into the package.

Yes, in this case C/README files are not data files, because they don't fit in the remaining part of the suggested definition:

... that are installed alongside Python modules and packages on the user's machine
when they install a :term:`distribution <Distribution Package>` from PyPI
or via a ``.whl`` file.
These files are typically intended for use at runtime by the package itself or
to influence the behavior of other packages or systems.

C-files are not installed along side Python files in the user's machine and also are not intended to be used in runtime, right?


I just noticed this from official Python docs:

MANIFEST.in does not affect binary distributions such as wheels.

That is not 100% precise is it? MANIFEST.in determines what goes into the sdist, and then the contents of the sdist influence what goes into the wheel (specially when include-package-data = true which is the default)... So there is a potential indirect effect there. That is because the build process works more or less like in the mermaidjs diagram below1:

graph LR
    src[fa:fa-laptop-code source code] --> b1(fa:fa-hammer)
    
    subgraph build process
      b1 --> sdist(fa:fa-file-code sdist) --> b2(fa:fa-hammer)
      b2 --> wheel(fa:fa-box wheel)
       
      build([build dependencies]) --> b1
      build([build dependencies]) --> b2
    end

    subgraph installation process
        wheel --> pkg
        pkg(fa:fa-box-open) --> inst[fa:fa-python installed packages]
        runtime([runtime dependencies]) --> inst
    end

    classDef deps fill:#2aa198,stroke:#333
    classDef env fill:#6c71c4,stroke:#333
    build:::deps
    runtime:::deps
Loading

Footnotes

  1. The "build process" creates distribution artifacts/packages. sdist distributions are meant to be platform independent (but may contain varying levels of optimisation - e.g. compiling Python to C via Cython). wheel distributions contain the final files meant to be directly copied to the user's machine, and therefore may be platform specific.

docs/userguide/datafiles.rst Outdated Show resolved Hide resolved
docs/userguide/datafiles.rst Outdated Show resolved Hide resolved
@DanielYang59
Copy link
Contributor Author

I just noticed this from official Python docs:

MANIFEST.in does not affect binary distributions such as wheels.

For the purposes of this documentation, we define "data files" as non-Python files
that are installed alongside Python modules and packages on the user's machine
when they install a :term:`distribution <Distribution Package>` from PyPI
or via a ``.whl`` file.
Copy link
Contributor Author

@DanielYang59 DanielYang59 Sep 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would tag and explain something I want to change (I find myself lose track of what I was originally intended if I overthink), please free feel to comment.

I believe the source (PyPI vs file) is not important here, but whether we're installing from a binary distribution (.whl) or source distribution that matters.

Copy link
Contributor

@abravalheri abravalheri Sep 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are probably better off if we don't define data files in terms of source distributions or source trees... Because then we enter in muddy waters (as source distributions and source tree may have any kinds of files).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are probably better off if we don't define data files in terms of source distributions or source trees

Sorry it looks like I might have caused confusion here. I was suggesting:

- When they install a distribution from PyPI or via a `.whl` file
+ When they install a distribution from source distribution or via a `.whl` file

Because install from PyPI is not important here, because by install from PyPI, we could install both from .whl file (if wheel is built for that particular platform) or source distribution. I hope I didn't misunderstand anything?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the former phrase is better for the definition than the second. Alternatively only the following is also fine:

- When they install a distribution from source distribution or via a `.whl` file
+ when they install a distribution via a `.whl` file

I believe We should not define define data files as per source distribution. A Python package always need to be transformed into a wheel, so the files inside of the wheel is what actually matters.

In the previous version install from PyPI is being used as an equivalent of pip install package, which in the end of the day downloads (preferentially) a .whl file.

@DanielYang59
Copy link
Contributor Author

DanielYang59 commented Sep 19, 2024

@abravalheri I believe I have included everything I want to include, perhaps it's time to polish the language and check the correctness and finalize this PR?

Some questions unrelated to this PR:

  1. The following file name has spaces in it, is it intended or maybe rename it to a proper Unix name (escaping spaces looks quite messy)?: https://github.com/pypa/setuptools/blob/main/docs/python%202%20sunset.rst
  2. I got the following notifications while building docs locally, perhaps update conf.py accordingly?
intersphinx inventory has moved: https://twine.readthedocs.io/en/stable//objects.inv -> https://twine.readthedocs.io/en/stable/objects.inv
inventory <https://build.pypa.io/en/latest> contains multiple definitions for std:label:python--m-build--v
intersphinx inventory has moved: https://packaging.python.org/en/latest//objects.inv -> https://packaging.python.org/en/latest/objects.inv
intersphinx inventory has moved: https://packaging.pypa.io/en/latest//objects.inv -> https://packaging.pypa.io/en/latest/objects.inv

@DanielYang59 DanielYang59 marked this pull request as ready for review September 19, 2024 03:17
@DanielYang59 DanielYang59 changed the title [Docs] Clarify usage of include-package-data on inclusion of package data files [Docs] Clarify usage of include-package-data/package_data/exclude_package_data on package data files Sep 19, 2024
@DanielYang59 DanielYang59 changed the title [Docs] Clarify usage of include-package-data/package_data/exclude_package_data on package data files [Docs] Clarify usage of include_package_data/package_data/exclude_package_data on package data files Sep 19, 2024
@abravalheri abravalheri merged commit adb8374 into pypa:main Sep 26, 2024
22 checks passed
@abravalheri
Copy link
Contributor

Thank you very much

@DanielYang59
Copy link
Contributor Author

No problem, thanks a ton for guiding me through all these changes :) I learned a lot!

@abravalheri
Copy link
Contributor

  1. The following file name has spaces in it, is it intended or maybe rename it to a proper Unix name (escaping spaces looks quite messy)?: https://github.com/pypa/setuptools/blob/main/docs/python%202%20sunset.rst

I believe that this one should be OK, spaces are valid characters in Linus files, also if there are links to this URL we should avoid changing it to avoid invalidating the links.

inventory https://build.pypa.io/en/latest contains multiple definitions for std:label:python--m-build--v

There is not much we can do about this one, because the definitions are maintained by build.

The rest should be fixed in #4661.

@DanielYang59
Copy link
Contributor Author

Cool! Thanks for letting me know. You're an amazing reviewer, thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants