Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[config] Support more precise build-year copyright substitution #12451

Closed
jayaddison opened this issue Jun 20, 2024 · 10 comments · Fixed by #12516
Closed

[config] Support more precise build-year copyright substitution #12451

jayaddison opened this issue Jun 20, 2024 · 10 comments · Fixed by #12516
Milestone

Comments

@jayaddison
Copy link
Contributor

Describe the bug

Projects can configure multiline copyright notices (ref #4925), a useful feature for projects that have transitions between copyright holders over time.

However, the current config year-substitution logic enabled when SOURCE_DATE_EPOCH is configured will attempt to pattern-match and substitute the end-year from the build-year in all of the copyright lines.

I'm not sure that's intended; I think that we should only apply the substitution for the most recent (which for an iterable I suppose means the final) line in the notice.

How to Reproduce

conf.py

copyright = [
    '2000-2001 - first copyright holder',
    '2002-2003 - second copyright holder',
]

index.rst

Example Document

Expected

['2000-2001 - first copyright holder', '2002-2024 - second copyright holder']

Actual

['2000-2024 - first copyright holder', '2002-2024 - second copyright holder']

Environment Information

Platform:              linux; (Linux-6.8.12-arm64-aarch64-with-glibc2.38)
Python version:        3.11.9 (main, Apr 10 2024, 13:16:36) [GCC 13.2.0])
Python implementation: CPython
Sphinx version:        7.4.0+/6471027d1
Docutils version:      0.21.2
Jinja2 version:        3.1.4
Pygments version:      2.18.0

Sphinx extensions

N/A

Additional context

Discovered while investigating whether multiline copyright config could be a way to reduce risk of rewriting unintended copyright lines re: any interaction between matplotlib/matplotlib#28418 and #12450.

@jayaddison
Copy link
Contributor Author

Although I feel like I want to make adjustments here and may have ideas about how to do that, I'd appreciate input from other maintainers/developers because churn/incorrect fixes here could potentially be more disruptive than simply leaving the code-is.

AFAIK we don't currently have a way for downstream projects to entirely disable the dynamic copyright substitution logic when SOURCE_DATE_EPOCH is enabled.

The matplotlib project is not affected; they do have a multi-statement copyright notice, but all the statements are configured in a single str (not a list) copyright config value, and, crucially, the first line of that text uses a Unicode en-dash instead of an ASCII dash (minus symbol), which Sphinx doesn't match on during substitution.

@jayaddison
Copy link
Contributor Author

Also: maintainers: please feel free to convert this into a GitHub discussion if that's a more appropriate venue for this thread.

@jayaddison
Copy link
Contributor Author

I'm not convinced that my assertion that only the most-recent/last copyright line should be updated is always true.

Although updating every line does seem incorrect in some situations, it seems unclear/intractable how to determine which lines to update. This could be a reason why some projects have chosen to avoid the substitution logic by using en-dash.

In terms of ideas for routes towards improvements, though -- and bearing in mind that, as far as possible, we should avoid breaking existing projects:

Perhaps we could provide a {{current_year}} helper variable so that documentation projects that do want dynamic year values could opt-into a documented, somewhat-supported way to dynamically retrieve that build-time value (with support for SOURCE_DATE_EPOCH).

Projects that do want dynamic copyright notices could then use that variable in either their single-line or multi-line copyright notices, as appropriate for their use case(s).

As mentioned in a previous thread I'm not too keen on dynamic copyright notices personally, but we do have practical examples of Sphinx projects that use them, and a way to retrieve the current year could help us support those cases. If I were to implement that, I'd document the substitution variable with a note to indicate that it is available for use, but not necessarily recommended.

@jayaddison jayaddison changed the title [config] Build-year copyright substitution should only occur for the most recent copyright line. [config] Support more configurable/precise build-year copyright substitution. Jun 29, 2024
@jayaddison
Copy link
Contributor Author

I'd propose a combination of two changes:

  1. Add a config setting to allow downstream sites to disable copyright year substitution entirely if they choose.
  2. Add the ability to substitute a specific {{ current_year }} pattern within the copyright notice, so that sites can opt-in to current-year substitution with more control, compatible with both single-line and multi-line copyright notices.

I'll attempt to draft a pull request for this soon (within the next week or two).

@jayaddison
Copy link
Contributor Author

Exploratory thoughts: in some ways I think that using the last-modified-time (aka mtime) of the most recent source document would be a better value to use during current-year copyright notice substitution. However, from practical experience, and for understandable convenience, many Sphinx projects build directly from git clone (checkout of source control) of their documentation sources -- and that cloning process does not preserve the last-modification time of files (in other words: a file last edited in source control two years ago, when retrieved in a clone of the containing repository today, will have today's date in its timestamps (created and modified)).

(context: if we are to add current_year replacement -- something I'm weighing up the pros and cons and moral qualms about - it would be good to make that value reasonably-accurate in standard/typical use cases, and relatively difficult to unwittingly build with a nonsensical value. a nice side-effect (although not the design goal) of using latest-modified-file time as the value would be that the emitted current-year would reflect the last time the source work was indicated as modified)

Although technically feasible, I don't think that it would make sense to attempt to add code to determine whether the source files are contained in a version control system (in order to detect modification times) -- doing that would mean that we end up maintaining code to handle an arbitrary and incomplete list of version control systems.

@jayaddison
Copy link
Contributor Author

An incomplete, approximated code search for path:conf.py /^copyright.*today/ using GitHub code search today (20240705) returns over four thousand results. No doubt some of those are duplicates, but even so I think this indicates that there is a relatively frequent desire/use-case for current-year in Sphinx copyright notice config.

One reason that I checked that is to determine whether a harsher/breaking-change approach of disabling dynamic copyright notices would be acceptable. Initially I feel that it is not -- and also it would simply have the effect of pushing the problem elsewhere rather than solving it.

Given that context, the current use of SOURCE_DATE_EPOCH-based replacement to achieve replacement selectively in the case of reproducible builds seems like a pragmatic approach.

Possible improvements remaining in my mind:

  • Could/should we detect dynamic vs static copyright config, and only perform substitution in the dynamic case?
  • In the case of multiline statements could/should we detect end-years not to modify? (for example: perhaps we should only update the largest year values?)

There is also a risk of trying to be too clever and creating unnecessary complexity.

@jayaddison
Copy link
Contributor Author

* Could/should we detect dynamic vs static `copyright` config, and only perform substitution in the dynamic case?

Based on my understanding of Python: detecting statically-declared strings may be possible --we could locate and parse the conf.py and then apply some heuristic/AST checks -- but figuring out whether an f-string, for example, actually varies or not in practice is not feasible. So: it would be somewhat complex, and would have some false-positives (string considered dynamic although it may not vary). On the plus side, this would allow many simple, statically-configured projects to build reproducibly and correctly without modifications.

* In the case of multiline statements could/should we detect end-years _not_ to modify?  (for example: perhaps we should only update the largest year values?)

In the general case this seems intractable, because there's very little that we can infer from copyright year ranges in multiline statements. Did contributions stop because of a change of ownership? Or simply because a contributor no longer provides edits to a project? We can't know those during isolated computerized builds from source.

However, it's possible that we could improve the substitution heuristics; for example: we could determine the system-clock year and only perform replacement of years in the notice that match that. The only edge case I can think of there is dynamic copyright values that insert a different year -- but I can't think of a valid reason for a project to do that.

@jayaddison
Copy link
Contributor Author

jayaddison commented Jul 5, 2024

From my mental notes, three ideas remain valid, and here's how I rate them currently:

Edit: update checklist x3

@jayaddison
Copy link
Contributor Author

jayaddison commented Jul 5, 2024

From my mental notes, three ideas remain valid, and here's how I rate them currently:

An additional fourth idea that also remains valid:

  • Allow the year-substitution logic to be disabled entirely; perhaps by providing a default-enabled config setting. This is likely to be a rarely-used setting, but should be low complexity to implement and I imagine that it could provide a useful escape hatch (hypothetical scenario: quirky replacement is occurring during reproducible builds => action: disable the proposed year-substitution setting, and switch to a static copyright notice definition if currently using a dynamic definition). (update: drafted in copyright: Apply substitutions only to current-year entries, and disallow future year insertion #12516 we decided not to add this functionality)

Edit: update x2

@jayaddison
Copy link
Contributor Author

Quoting from the bugreport description:

conf.py

copyright = [
    '2000-2001 - first copyright holder',
    '2002-2003 - second copyright holder',
]

index.rst

Example Document

Expected

['2000-2001 - first copyright holder', '2002-2024 - second copyright holder']

Actual

['2000-2024 - first copyright holder', '2002-2024 - second copyright holder']

I now think that improved example/expected/actual cases -- including a delineation of static and dynamic copyright notice lines -- would be:

conf.py

from datetime import datetime

copyright = [
     '2000-2001 - first copyright holder',
     '2002-2003 - second copyright holder',
     f"2003-{datetime.utcnow().strftime('%Y')} - third copyright holder",
]

Expected

['2000-2001 - first copyright holder', '2002-2003 - second copyright holder', '2003-2024 - third copyright holder']

Actual

['2000-2024 - first copyright holder', '2002-2024 - second copyright holder', '2003-2024 - third copyright holder']

Or to translate into a description: try only to substitute year values in copyright notices that appear to have been calculated dynamically.

@jayaddison jayaddison changed the title [config] Support more configurable/precise build-year copyright substitution. [config] Support more precise build-year copyright substitution Oct 4, 2024
@AA-Turner AA-Turner added this to the 8.1.0 milestone Oct 6, 2024
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 6, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
2 participants