classification
Title: package_data only allows one glob per-package
Type: behavior Stage: resolved
Components: Distutils2 Versions: Python 3.3, 3rd party
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: eric.araujo Nosy List: alexis, eric.araujo, erik.bray, jkloth, paul.moore, python-dev, tarek
Priority: release blocker Keywords: easy, patch

Created on 2011-04-08 22:23 by erik.bray, last changed 2012-02-06 16:44 by eric.araujo. This issue is now closed.

Files
File name Uploaded Description Edit
fix-package_data-multivalue-cpy33.diff eric.araujo, 2012-01-17 17:02 review
fix-package_data-multivalue-d2.diff eric.araujo, 2012-01-17 17:02
Repositories containing patches
https://bitbucket.org/embray/distutils2
Messages (20)
msg133346 - (view) Author: Erik Bray (erik.bray) * Date: 2011-04-08 22:23
In distutils the package_data option can be supplied a list of glob patterns for each package.  distutils2 currently only supports one glob per package.

This could easily be fixed by simply allowing more than one `package_name = pattern` value in the package_data option.  For example:

package_data =
    mypackage = templates/*.html
    mypackage = static/css/*.css
    mypackage.subpackage = templates/*.html
    ...
msg133388 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2011-04-09 13:57
What’s the syntax described in the docs?

  package_data =
      mypackage = templates/*.html static/css/*.css

or

  package_data =
      mypackage = templates/*.html
      mypackage = static/css/*.css

?

On a related subject, I think the new resources system should obsolete data_files and package_data.  Putting files in arbitrary, OS-specific places (data_files) or alongside the Python files (package_data) was wrong, let’s move away from that.
msg133405 - (view) Author: Erik Bray (erik.bray) * Date: 2011-04-09 18:09
As far as I've been able to tell there is no proposed syntax in the docs specifically for package_data.  The docs for the resources option seems to suggest separating globs with spaces, which would be fine by me (wouldn't allow paths that contain spaces, but that's a bad idea anyways).

I think that allowing one glob string on each line is more readable, but I'm not too attached either way.  Another possibility would be to allow line breaks in the value, for example:

package data = 
    mypackage = templates/*.html
                static/css/*.css

But that's getting a little more complex syntax-wise.

Agreed on getting rid of data_files--it's dangerous.  But package_data I find very useful sometimes, and I don't think it's always wrong.  At the very least, it's not clear to me how the above use case is intended to be replaced.
msg146947 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2011-11-03 17:12
> As far as I've been able to tell there is no proposed syntax in the docs specifically for
> package_data.
Right.  I’ve only found an example in d2’s own setup.cfg:

package_data =
    distutils2._backport = sysconfig.cfg
    distutils2.command = wininst*.exe
    distutils2.tests = xxmodule.c

(Note that there’s a bug in build_py or config so you can’t test this right now.)

> The docs for the resources option seems to suggest separating globs with spaces, which would
> be fine by me (wouldn't allow paths that contain spaces, but that's a bad idea anyways).
Agreed.

> I think that allowing one glob string on each line is more readable
Definitely.

> Another possibility would be to allow line breaks in the value [...]
> But that's getting a little more complex syntax-wise.
Truly!

> Agreed on getting rid of data_files--it's dangerous.
Actually the resources system redefines data_files (it was even called datafiles.py and DATAFILES at one time, and it’s still handled by install_data; I think that was a better naming scheme).  distutils-style data_files were practically unusable: System packagers were unhappy because they wanted to control the location of installed files, Python authors were unhappy because they could not access the files from their code.  I think that’s why package_data took off.

> But package_data I find very useful sometimes, and I don't think it's always wrong.
OS maintainers like the Debian project strongly disagree :)

> At the very least, it's not clear to me how the above use case is intended to be replaced.
The point of resources is that it redefines data_files and is as easy to use as package_data.  If you have for example a project named Spam with a Python package spamlib and a file in templates/log.txt, your code just needs to do this:

   from packaging.database import get_file
   with get_file('Spam', 'templates/log.txt') as fp:
      ...

When run from an uninstalled checkout, for example when developing, the file will be found in the checkout.  When run after being installed on a Debian system, the file will be found in /usr/local/share/spam/templates/log.txt.  Each Python installation can decide (through sysconfig.cfg) where to install things.
msg146960 - (view) Author: Erik Bray (erik.bray) * Date: 2011-11-03 18:49
> When run from an uninstalled checkout, for example when developing, the 
> file will be found in the checkout.  When run after being installed on 
> a Debian system, the file will be found in /usr/local/share
> /spam/templates/log.txt.  Each Python installation can decide (through 
> sysconfig.cfg) where to install things.

Got it!  I think when I first submitted this issue my primary concern was where the resource files live in different development/deployment contexts.  Looking at sysconfig.cfg makes it all pretty clear, though it was hard to find this information a while ago.

So what exactly is the verdict then?  Keep package_data for now, but encourage resources, and dump data_files all together?  Or...?
msg146990 - (view) Author: Paul Moore (paul.moore) * (Python committer) Date: 2011-11-03 23:17
I see what you're saying - and looking through sysconfig.cfg, I can see how things are expected to be laid out (and how I'd configure it if I didn't like it :-))

But as far as I can see, there's no way in packaging to describe a module that works like sysconfig itself! (A module with a resource which is expected to live alongside the .py file). Installing the cfg file to {purelib} would work in practice, but if the distribution containing sysconfig was later changed to include a C extension, then the code would get silently switched to install into platlib, and the .cfg file would no longer be alongside the .py file.

(This is only actually the case when purebase and base differ - which never happens on Windows, and I don't really understand when it would happen on Unix. So I can't really say how likely and/or important this possibility is. But it's certainly there.)

I'd like to see an extra category that was guaranteed to point to where the code files were going to be installed (i.e., the same as whichever one of platlib or purelib the packaging installer chooses). With that, it would be possible (but still clumsy) to implement package data reliably.

OK, having said all that, I do think that saying "package_data was wrong, let’s move away from that" is a bit user-unfriendly. Whether it causes issues for OS distributors, or is "wrong" as a matter of principle, people do use it, a lot. It's the basis of the whole egg concept (a package and its data as a single self-contained object), it's used by distutils itself as you mention, as well as in a number of other places in the stdlib. If you don't support it, I predict that people are simply going to invent more and more elaborate hacks to emulate it.

Actually, given that packaging should allow a hook to find out where a given file is going to be installed (if it isn't, expect a feature request :-)), it should be reasonably easy to write a hook to add a custom category for this purpose. But I'd argue that it's better to put the feature in packaging directly, rather than see the same workaround implemented over and over again.
msg147691 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2011-11-15 17:27
> Looking at sysconfig.cfg makes it all pretty clear, though it was hard to find this
> information a while ago.
Yeah, you had to be here when the resources code was committed, or to stumble on it while reading the source code.  It’s one of the big things I will document as part of #12779.

> So what exactly is the verdict then?  Keep package_data for now, but encourage resources,
> and dump data_files all together?  Or...?
The reverse: resources are the new way of installing data_files.  package_data is obsoleted by that system and removed.

> But as far as I can see, there's no way in packaging to describe a module that works like sysconfig
> itself! (A module with a resource which is expected to live alongside the .py file).
For bootstrapping reasons, sysconfig.cfg needs to be the one exception to the rule.  (Yes, I think about moving distutils.cfg and wininst-*.exe files according to sysconfig —and pydoc data and venv shell scripts!  I will make sure we reach agreement on python-dev first, of course :)

The point is that it is not possible to have a data file living alongside the py file.  It is by design.  The new system caters to downstream packagers while trying not to be bothersome to Python authors.  Why do you care where your data file is installed, as long as your code can get it?

There is one argument against this system: It means that your code has a runtime dependency on a function provided by the packaging tool.  This is one of my griefs against setuptools.  The difference here is that the dist-info dir is an accepted PEP and packaging is in the stdlib, and that the dependency is minimal (it does not make you learn a parallel import system for example).  However, one drawback is that we expect and support a transition period where people have parallel setup.cfg and setup.py (with the setup.py possibly taking its info from setup.cfg), so adding support for d2 will not only require adding a setup.cfg and modularizing complex setup.py into hooks: it will require code editions to have code branches for database.get_file and __file__ + open.  That won’t look good.

> Installing the cfg file to {purelib} would work in practice, but if the distribution containing
> sysconfig was later changed to include a C extension, then the code would get silently switched to
> install into platlib, and the .cfg file would no longer be alongside the .py file.
> 
> (This is only actually the case when purebase and base differ - which never happens on Windows, and
> I don't really understand when it would happen on Unix. So I can't really say how likely and/or
> important this possibility is. But it's certainly there.)
There is a distinction between prefix and exec-prefix.  I think it may come from a time where you’d have one tree served for many machines by an NFS server and architecture-specific trees on a local filesystem.

> [...]
> OK, having said all that, I do think that saying "package_data was wrong, let’s move away from that"
> is a bit user-unfriendly. Whether it causes issues for OS distributors, or is "wrong" as a matter of
> principle, people do use it, a lot.
Because they had no alternative: you couldn’t predict where data_files would end up, so package_data + __file__ was the universal kludge.

> It's the basis of the whole egg concept (a package and its data as a single self-contained object)
distutils2 does not support the egg concept.  In a post-PEP 376 world, an installed distribution can be one or more modules with a dist-info directory, one or more packages with a dist-info directory, etc.

> If you don't support it, I predict that people are simply going to invent more and more elaborate
> hacks to emulate it.
Interesting.  I have to go now, but be sure I’ll think over this, bring it up on pydev and make sure the situation is acceptable before Py 3.3b1 / d2 1.0b1.
msg147699 - (view) Author: Paul Moore (paul.moore) * (Python committer) Date: 2011-11-15 19:27
One important point - in the "new world" where data files living
alongside code is unsupported, the bdist_msi and bdist_wininst
installers need to be updated to install data files as needed. This
may work already (I'll do some tests to see how well when I get back
to my PC with dev builds on it) but right now bdist_wininst for
example does not support the full generality of the resource mechanism
(and it's not clear to me how it can be made to...) so there will be
restrictions on what can be allowed until the wininst format is
updated.

For installs from source, the suggested "new world" is fine. But
without a binary install process that's as flexible as a source
install, the current "alongside the code" approach has the benefit of
working well with the existing installer technology.
msg148277 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2011-11-24 16:34
I need this to package test files for distutils2’s own next release, so pragmatism/compatibility will win for the short term :)

I’ll fix it together with #13463 and #5302.
msg151379 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2012-01-16 16:23
Copied from #13712:

[I]
Problem is that the setup.cfg syntax does not define how to give more than one value.  If it’s judged acceptable to disallow paths with embedded spaces, we could do something like this:

[files]
package_data =
    spam = first second third

Otherwise we’d need to use multiple lines (requested in #5302) [actually no, it was this bug #11805]

[files]
package_data =
    spam = first
    spam = second
    spam = third

We probably don’t want that.  An intermediate idea:

[files]
package_data =
    spam = first
           second
           third

Not sure this would be the nicest thing for people to write, and for us (me) to extend the setup.cfg parser for.


[Erik Bray]
FWIW, I'm for the first option for specifying package_data:

[files]
package_data =
    spam = first second third

I'm pretty sure this is how I ended up implementing it in d2to1, since I needed this functionality.

Theoretically spaces could be supported with an escape sequence, but I don't think that's worth complicating things for if package_data is deprecated anyways.  I'm all for making it difficult for anyone trying to include filenames with spaces in their source code.


So, I think allowing spaces and newlines to separate multiple values will be the nicest (spaces only would require users to cram all there file specs on one line).
msg151381 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2012-01-16 16:29
(To clarify quotes in my message: First is a copied message from me, marked [I], then one form Erik, and the last paragraph is me again.  Sleep good, should get more of that.)
msg151477 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2012-01-17 17:02
Here are patches for CPython and the d2 repo.  There is no doc update as the setupscript page still talks about setup.py and the setupcfg spec does not mention package_data at all (the negationists resources-promoters at work :), so this can wait for my big doc updates.  The changes to test_command_{build_py,sdist} are unrelated, I just found some fixes to do after grepping for package_data and I’ll commit them separately.

Please review.  (I’ll be offline until the end of the month FYI, but I’ll commit after that.)
msg151478 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2012-01-17 17:09
Note that my proposed syntax does not allow something equivalent to {'': [etc.]} in distutils, i.e. package data for top-level modules.  I think this is okay: modules should not install data in their installation dir.  I don’t think it was widely used, but maybe I should ask on distutils-sig or survey setup.py scripts on PyPI to be sure.
msg151543 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2012-01-18 15:03
I figured it would let people comment on the syntax I propose more easily if I extracted it from the patch.  Here’s the example:

  package_data =
      cheese = data/templates/* doc/*
        doc/images/*.png

We have a package name, equals sign, then specs/globs separated by whitespace, including newlines (without any indent rules for the continuation lines, the code just adds them to the last seen package name if there is no equals sign in the lign).
msg151554 - (view) Author: Erik Bray (erik.bray) * Date: 2012-01-18 17:33
This patch works for me, and I'm happy with the syntax.  Thanks!
msg152667 - (view) Author: Roundup Robot (python-dev) Date: 2012-02-05 09:49
New changeset 2c08bf9aca22 by Éric Araujo in branch 'default':
Allow multiple values for package_data in setup.cfg (#11805).
http://hg.python.org/cpython/rev/2c08bf9aca22
msg152675 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2012-02-05 10:56
> Note that my proposed syntax does not allow something equivalent to {'': [etc.]} in
> distutils, i.e. package data for top-level modules.  I think this is okay: modules should
> not install data in their installation dir.  I don’t think it was widely used

I tested: This construct was allowed but did not actually cause any file to be installed (i.e. no crash, no-op), so we’re not removing any functionality here.
msg152678 - (view) Author: Roundup Robot (python-dev) Date: 2012-02-05 11:23
New changeset 83a0985c7aad by Éric Araujo in branch 'default':
Allow multiple values for package_data in setup.cfg (#11805).
http://hg.python.org/distutils2/rev/83a0985c7aad

New changeset ea717d8e71d0 by Éric Araujo in branch 'python3':
Merge fixes for #13901, #11805, #13712 and other improvements
http://hg.python.org/distutils2/rev/ea717d8e71d0
msg152679 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2012-02-05 11:26
Thanks for the help Erik.
msg152745 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2012-02-06 16:44
(I forgot that distutils2’s setup.py has its own fork of the conversion functions, for bootstrapping reasons, so I need to port the fix there.)
History
Date User Action Args
2012-02-06 16:44:25eric.araujosetmessages: + msg152745
2012-02-05 11:26:18eric.araujosetstatus: open -> closed
resolution: fixed
messages: + msg152679

stage: needs patch -> resolved
2012-02-05 11:23:58python-devsetmessages: + msg152678
2012-02-05 10:56:48eric.araujosetmessages: + msg152675
2012-02-05 09:49:21python-devsetnosy: + python-dev
messages: + msg152667
2012-02-04 07:04:25eric.araujolinkissue13712 dependencies
2012-01-18 17:33:16erik.braysetmessages: + msg151554
2012-01-18 15:03:18eric.araujosetmessages: + msg151543
2012-01-17 17:09:46eric.araujosetmessages: + msg151478
2012-01-17 17:02:55eric.araujosetfiles: + fix-package_data-multivalue-d2.diff
2012-01-17 17:02:38eric.araujosetfiles: + fix-package_data-multivalue-cpy33.diff
keywords: + patch
2012-01-17 17:02:01eric.araujosetmessages: + msg151477
2012-01-16 16:29:44eric.araujosetmessages: + msg151381
2012-01-16 16:23:33eric.araujosetmessages: + msg151379
2011-11-24 16:34:34eric.araujosetpriority: normal -> release blocker

messages: + msg148277
2011-11-24 16:34:14eric.araujolinkissue13463 dependencies
2011-11-15 19:27:12paul.mooresetmessages: + msg147699
2011-11-15 17:27:30eric.araujosetkeywords: + easy
assignee: tarek -> eric.araujo
messages: + msg147691

stage: needs patch
2011-11-03 23:17:39paul.mooresetnosy: + paul.moore
messages: + msg146990
2011-11-03 18:49:07erik.braysetmessages: + msg146960
2011-11-03 17:12:08eric.araujosetmessages: + msg146947
2011-09-07 15:53:15jklothsetnosy: + jkloth
2011-04-09 18:09:37erik.braysetmessages: + msg133405
2011-04-09 13:57:20eric.araujosetmessages: + msg133388
versions: + 3rd party, Python 3.3
2011-04-08 22:23:48erik.braycreate