Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix #5499: Include in pip's User-Agent whether it looks like pip is in CI #6273

Merged
merged 2 commits into from
Feb 24, 2019

Conversation

cjerdonek
Copy link
Member

Fix #5499.

@cjerdonek
Copy link
Member Author

By the way, when writing this PR I noticed that test_user_agent() wasn't actually testing anything because the assert was missing.

So that is fixed now, too.

@cjerdonek cjerdonek force-pushed the issue-5499-detect-ci-for-user-agent branch from ca4679a to 3580a5b Compare February 16, 2019 23:48
@cjerdonek cjerdonek added type: enhancement Improvements to functionality C: download About fetching data from PyPI and other sources type: maintenance Related to Development and Maintenance Processes labels Feb 16, 2019
@cjerdonek cjerdonek force-pushed the issue-5499-detect-ci-for-user-agent branch from 3580a5b to 3d69354 Compare February 17, 2019 00:00
Copy link

@gaborbernat gaborbernat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should document inline what keys cover what ci managers, maybe link to their documentation. Also maybe this will require them to be whitelisted in tox.

@cjerdonek
Copy link
Member Author

@gaborbernat I can provide examples for each key, but it may be hard to be comprehensive, and it can change over time as CI companies go in and out of business. Links to the online documentation of companies can also change. What about linking to the original issue number where there is more info?

Regarding Tox, it wouldn't be required as the purpose of this PR is to provide some indication. There can certainly be false negatives, but it is unlikely for there to be false positives. In that sense, this information would provide a lower bound for CI hits.

@gaborbernat
Copy link

Well, my worry is more along the lines did we leave some out? When should we remove some when CI companies migrate. As it is the list seems like an arbitrary element, does this cover all popular CIs? (at least per the latest user survey? + Github Actions)

PS. tox is always written with small t.

Return whether it looks like pip is running under CI.
"""
return (
'BUILD_BUILDID' in os.environ or

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems kinda repetitive, maybe instead any(key in os.environ for key in ('A', 'B))

user_agent = PipSession().headers["User-Agent"]

assert user_agent.startswith("pip/%s" % pip.__version__)
assert '"ci":true' in user_agent or '"ci":false' in user_agent

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to test exhaustively here, some kind of parametrize per key exists/missing 🤔 as per this implementation the test would pass if someone writes either return True or return False inside looks like CI 🤔 such regressions would go without much notice

@cjerdonek
Copy link
Member Author

Well, my worry is more along the lines did we leave some out?

As I said, it can't hurt if something is left out. At worst, you won't get as much information. Also, it doesn't hurt to leave in a variable. I will add a code comment so you can see some of the CI systems that are covered.

@gaborbernat
Copy link

@cjerdonek I was thinking to provide a list of covered CIs more along the lines to keep the list easy to maintain, easy to see if we left out some important ones or not. 👍 Granted we can't cover all, but we should strive for as much as possible to make these numbers relevant.

@cjerdonek cjerdonek force-pushed the issue-5499-detect-ci-for-user-agent branch from 3d69354 to 883421e Compare February 17, 2019 14:17
@dstufft
Copy link
Member

dstufft commented Feb 17, 2019

My main concern with something like this is whether "false negatives" matter or not. Right now it's very easy to say that the download stats represent all of downloads from PyPI and that we don't attempt to interpret the results, leaving that up to people who are instead actually querying the data.

I think that by explicitly stating that some downloads are known to be from CI, the implication most people will assume is that the other downloads are not from CI, but the truth is the other downloads we just don't know if they're from CI or not. I think this is made worse by the fact this PR currently implements the "other" case as specifying ci = False when in reality it's more like ci = None (or no CI at all).

It might also be time to formalize pip's user agent in a PEP and make it generic so that other projects can opt-in to providing the same information.

@cjerdonek
Copy link
Member Author

Updated.

@cjerdonek
Copy link
Member Author

cjerdonek commented Feb 17, 2019

I think that by explicitly stating that some downloads are known to be from CI, the implication most people will assume is that the other downloads are not from CI,

Isn't this just a matter of documentation, or choosing a different name for the key? Like what about "may-be-ci" or "possible-ci" or "looks-like-ci" etc? It doesn't really matter to me. What would you prefer for the name? Re: the "other" case, the false value is meant to convey that pip ran its check and that it returned false.

[Edit: Or "has-ci-env-var"]

@pradyunsg
Copy link
Member

Maybe ci = "unknown".

@pradyunsg
Copy link
Member

It might also be time to formalize pip's user agent in a PEP and make it generic so that other projects can opt-in to providing the same information.

+1 to this idea.

@dstufft
Copy link
Member

dstufft commented Feb 17, 2019

I think the name is OK as is. I’m just raising it as a possible concern. I do think that the False value is particularly misleading though, would would prefer either an explicit None or just omitting the key all together.

@cjerdonek
Copy link
Member Author

would would prefer either an explicit None or just omitting the key all together.

Okay, I'll change it to None, then, so it's easier to know when the check has been run.

@cjerdonek cjerdonek force-pushed the issue-5499-detect-ci-for-user-agent branch 4 times, most recently from 391af7b to 14a6aaa Compare February 17, 2019 15:15
Copy link

@gaborbernat gaborbernat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slight change suggestions to make it easier to maintain, but otherwise looks good.

tests/unit/test_download.py Outdated Show resolved Hide resolved
tests/unit/test_download.py Outdated Show resolved Hide resolved
src/pip/_internal/download.py Outdated Show resolved Hide resolved
@cjerdonek
Copy link
Member Author

Thanks a lot for your detailed reviews and suggestions, @gaborbernat.

Copy link
Contributor

@ssbarnea ssbarnea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a much easier way to detect automation: sys.stdout.isatty().

isatty will be true only for terminals, so any scripts/ci will not use it. Not neet to try to be too smart and guess every possible CI systems in the wild

@pradyunsg
Copy link
Member

isatty will be true only for terminals, so any scripts/ci will not use it.

On Travis CI, it does return True.

Not sure what the statement after that was for -- I'll politely suggest you read the discussion in the corresponding issue for this.

@cjerdonek
Copy link
Member Author

@ssbarnea That suggestion was already raised here: #5499 (comment)
The problem with it is that it can have false positives, which means we can’t rely on it to say anything definitive. The environment variable approach on the other hand wouldn’t, which means we can use it to get a reliable lower bound.

@cjerdonek
Copy link
Member Author

I suppose it wouldn’t hurt to add a code comment mentioning for future maintainers why that approach wasn’t used.

@cjerdonek cjerdonek force-pushed the issue-5499-detect-ci-for-user-agent branch from cf1205d to a229f11 Compare February 19, 2019 03:28
@cjerdonek
Copy link
Member Author

I added a code comment saying why we don't check for a tty.

Copy link
Member

@pradyunsg pradyunsg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM tho, I suggest waiting for another review on this.

@ssbarnea
Copy link
Contributor

AFAIK Travis is the only CI that simulates a TTY during execution so I would find this logic as needing less maintenance: just check for travis envvar, for others it will work fine with istty().

@pradyunsg
Copy link
Member

I wouldn't mind including a separate UA item for isatty(). Let's keep that separate from the CI env-var check.

@cjerdonek
Copy link
Member Author

@ssbarnea There are many CI systems (including probably many we don’t know about), and their behavior can also change over time, so I don’t think we can say that for sure. It would be nice not to introduce a source of false results. I like @pradyunsg’s suggestion to make it another improvement to consider.

@ssbarnea
Copy link
Contributor

I think we got lost into details here: I though that we wanted to know if pip was used in an interactive way (where user can interact with it, like answering a prompt...) or not. Any CD/CI would count as an non-interactive executions mode, but the same applies for scripted runs on individual machines if the output is processed by another tool (including a simple redirection to a file.

If I do some-cmd 2>&1 >result.log it also counts as non-interactive execution of some-cmd, same as CI and I can assure you that there is not "CI" variable present in this way. Be sure that some-cmd will return False for isatty() when run like this.

So, I would not say that using isatty() would lower the quality of differentiation but the opposite, it would make the split better.

For start I could see clear problems with current proposed change:

  • Does not detect Zuul CI - ZUUL_CHANGE is one of the vars defined by it
  • Choles if CI=false or no, or 0. I bet there are people doing this somewhere.

@cjerdonek
Copy link
Member Author

I though that we wanted to know if pip was used in an interactive way (where user can interact with it, like answering a prompt...) or not.

My understanding of the original issue was to provide the ability to filter out (at least a good portion of) pip invocations due to automation. In my view, non-interactive executions on individual machines should not be included in this group.

In any case, regardless of what constitutes "automation," the PR that I submitted here is focused only on detecting a CI environment (which is also why the key was named ci). A key-value storing whether a tty is present could perhaps also be useful, but it is not the subject of this PR and so should be considered separately IMO as @pradyunsg and I have already suggested.

Regarding ZUUL_CHANGE and the possibility of CI=false, etc, I'm willing to make changes like those, but I'd like to hear first from those who already approved the change so I don't make unnecessary changes, or make changes that would cause them no longer to approve. Also, I would like to emphasize that this PR doesn't have to be complete with respect to detecting CI for the information to be useful. This is also mentioned in the code comments. Even if Zuul CI or other systems we're not aware of were left out, it would still let us filter out a large number of CI runs.

@cjerdonek
Copy link
Member Author

With three approvals now from PyPA members (as well as myself), I'm going to merge this now. I'm happy to consider either or both of @ssbarnea's two suggestions (ZUUL_CHANGE and the possibility of CI=false, etc) as a follow-on change if any others think it would be useful.

Storing isatty() in the User Agent as an additional piece of info is something else that can be considered as a separate PR. (I'm not planning to create that PR myself though.)

@cjerdonek cjerdonek merged commit 821247d into pypa:master Feb 24, 2019
@cjerdonek cjerdonek deleted the issue-5499-detect-ci-for-user-agent branch February 24, 2019 22:25
@pradyunsg
Copy link
Member

Sounds good to me @cjerdonek! :)

@lock
Copy link

lock bot commented May 28, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot added the auto-locked Outdated issues that have been locked by automation label May 28, 2019
@lock lock bot locked as resolved and limited conversation to collaborators May 28, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
auto-locked Outdated issues that have been locked by automation C: download About fetching data from PyPI and other sources type: enhancement Improvements to functionality type: maintenance Related to Development and Maintenance Processes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Differentiating organic vs automated installations
6 participants