Exclude certain languages from repo stats #11672
I wonder if this should be better done upstream in go-enry.
Doing this upstream would mean significant differences between linguist and go-enry, and I believe that is not the aim of the project. Additionally, GitHub also performs the same exclusion outside of linguist's scope (otherwise we'd be seeing a lot of YAML repos instead of JavaScript ones). Ref: github-linguist/linguist#4348 github-linguist/linguist#4445 github-linguist/linguist#4459 github-linguist/linguist@3bc8185
Frankly, I find the reasoning given by linguist insufficient, considering that every other lockfile in existence is excluded EXCEPT for That being said I definitely agree that Either way, I think this PR is still valid, as it aims to improve code statistics detection by removing unnecessary noise, bringing it closer to GitLab's and GitHub's implementations.
I'd definitely agree that The argument that yaml is easier to diff than json is also silly: I can easily visually diff both for small changes, and for bigger ones I skip over both of them.
Maybe add a comment describing that the reason for the existence of this list is to overrule linguist/enry's behaviour where it is considered suboptimal.
What I worry about, though, is this: let's say I have a repo for an Ansible role. These are almost purely YAML, and those files are basically equivalent to code. If I introduce a bash script into said repo, would it no longer count any YAML at all? Maybe some sort of filename-based blacklist would be better. It should only exclude files that are definitely always generated. Users could define their own classifications via gitattributes.
That isn't strictly true; the existence of the list can be owed to linguist's poor choice, because that's what personally made me write the code, but in general the purpose of the list, now and in the future, is not only to overrule but also to remove noise. I would rather not implement any file-based approach, lest we redo linguist/enry on our own. I advise you to push an example repository to GitHub and see how they treat it. My bet is that they will ignore YAML just like my implementation.
I found a good example here: https://github.com/ansible/ansible-examples
GitLab outright refuses to even acknowledge YAML, even if it's the only language in the repository: https://gitlab.com/maksold/ansible-role-certbot
So you're saying this exclusion is not part of linguist but some hackery that GitHub does after the files were processed by linguist? I wonder if one can overrule this via gitattributes. Maybe instead of a plain exclusion list, we want to specify the files as generated so enry will see them as such; it'd be more compatible and user-overridable. Edit: created https://github.com/silverwind/ansible-examples, so far no change.
Yes, you can overrule it for linguist. It will make GitHub show the full diff, as it won't treat these files as generated/vendored, but it doesn't seem to affect language statistics. GitLab doesn't use linguist at all, and I'm not sure you can even override it there.
These ansible files aren't considered generated or vendored by linguist in the first place, so
Generally, GitHub uses linguist for a few different things. One of them is deciding whether a diff should be collapsed; another is the language bar statistics. For the language bar, they apply additional logic to keep certain languages from showing up in almost every repo.
Ok, I guess the notion behind this is that these files would never represent source code. That's certainly not true for YAML or SVG, but those are kind of grey areas. BTW, yarn lockfile v1 is not strictly YAML format; only the v2 version is. Still, most mechanisms seem to pick it up as YAML for some reason.
Actually linguist explicitly marks
Co-authored-by: silverwind <[email protected]>
Would this be better served as a configurable option?
That would allow us to close #10266 (alternatively you could add TOML to that list, but I feel like a per-repo config file might be more future-proof).
I'm generally opposed to the idea of configuring language statistics per-repo in yet-another-dot-file(r). My idea is that Gitea should properly identify the language of repositories by default, without forcing the user to apply additional configuration. Remember, the function of language statistics is also to let repos be looked up in search by their language. We cannot rely on every repository owner doing that, and it also raises the question of what to do with migrations. I strongly believe the expected behavior is for language statistics to work out of the box, without requiring additional configuration for every repository. Added TOML to list.
Although I'm generally opposed to requiring yet another dotfile, as I think we should generally just get things right the first time, I think having an option to override behaviour is always a good idea. That being said, that can be done as a different PR.
No need for an additional dot file. GitHub does language stats fine-tuning using
I won't oppose any PR that adds such ability via
Deprecated by #11681
For one reason or another, Linguist sometimes decides that it is better not to mark a file as generated, so that its diff is not suppressed on GitHub.
One such example is yarn.lock, often present in Node repositories that use Yarn as their package manager. Due to its nature, it was decided not to mark it as a generated file, since its diff is easy to inspect for security purposes (lockfile poisoning).
However, despite that, GitHub still does not include such files in its language bar, unless they are the only files present in the repository.
Further tests revealed that GitHub in fact explicitly ignores certain languages unless they have 100% presence in the repository. There is no publicly available list of such exclusions; however, prominent examples are JSON and YAML. You can verify this by trying to search GitHub repositories by language:.
This PR aims to add support for similar logic in Gitea. The list of languages present right now might not be complete; the only real way to find out would be to test every language GitHub knows about.
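The exclusion rule described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from this PR: the names excludedLanguages and filterLanguageStats are hypothetical, the list contains only the languages named in this thread (JSON, YAML, TOML), and keeping the original statistics when nothing else remains is one guess at how "100% presence" could be handled.

```go
package main

import "fmt"

// excludedLanguages is a hypothetical exclusion list containing only the
// languages mentioned in this discussion; the real list may differ.
var excludedLanguages = map[string]struct{}{
	"JSON": {},
	"YAML": {},
	"TOML": {},
}

// filterLanguageStats returns a copy of sizes (language name -> bytes)
// without the excluded languages. If every detected language is excluded,
// the original statistics are returned unchanged, so that a repository
// made up only of such files still gets a language (an assumption).
func filterLanguageStats(sizes map[string]int64) map[string]int64 {
	filtered := make(map[string]int64, len(sizes))
	for lang, size := range sizes {
		if _, skip := excludedLanguages[lang]; !skip {
			filtered[lang] = size
		}
	}
	if len(filtered) == 0 {
		// Nothing but excluded languages: fall back to the full stats.
		for lang, size := range sizes {
			filtered[lang] = size
		}
	}
	return filtered
}

func main() {
	// A Node repository where the lockfile dwarfs the actual source code.
	mixed := map[string]int64{"JavaScript": 1200, "YAML": 90000}
	fmt.Println(filterLanguageStats(mixed))

	// An Ansible-style repository that contains only YAML.
	yamlOnly := map[string]int64{"YAML": 5000}
	fmt.Println(filterLanguageStats(yamlOnly))
}
```

With this rule, the Ansible role scenario raised earlier in the thread behaves as feared: adding a single bash script to a YAML-only repository would drop YAML from the statistics entirely.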