Complete initial pilot data collection: Phase 1 #6

Open
bhsims opened this issue Aug 3, 2020 · 1 comment
@bhsims
Member

bhsims commented Aug 3, 2020

Execute RepoScanner with latest repo list and view results.

@bhsims bhsims changed the title RepoScan pilot study RepoScanner pilot study Aug 3, 2020
@bhsims bhsims changed the title RepoScanner pilot study Complete pilot data collection Aug 3, 2020
@bhsims bhsims changed the title Complete pilot data collection Complete initial pilot data collection Aug 3, 2020
@elaineraybourn
Member

Data collection is to be considered in different phases. The first pilot phase explores "contributor" activity in ECP repos. Data are scraped from GitHub (and, when available, Bitbucket and GitLab). The definitions guiding the Phase 1 (Tier 1) data collection are posted in Issue #7 and are also included in this comment, below the proposed pilot data collection steps:

  1. Determine (number of) unique contributors to ECP repos (and, by definition, to ECP projects; see below)
  2. Determine (number of) unique contributors who have contributed to 2 or more ECP projects
  3. Determine the rank order (greatest to smallest) of contributor network rankings
  4. Identify the repos with the greatest number of cross-project contributor network rankings
  5. Identify the repos with the smallest number of cross-project contributor network rankings
  6. ...
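
As a rough illustration of steps 1, 2, 4, and 5, here is a minimal Python sketch. It assumes the scraped data can be reduced to `(contributor_id, project, repo)` records, one per commit; that record layout, the sample names, and the function names are all hypothetical and are not RepoScanner output. It also assumes that steps 4 and 5 mean counting how many cross-project contributors each repo has, which is only one possible reading.

```python
from collections import defaultdict

# Hypothetical commit records: (contributor_id, project, repo).
# All names are made up for illustration; they are not real ECP data.
COMMITS = [
    ("alice", "ProjA", "projA/core"),
    ("alice", "ProjB", "projB/solver"),
    ("bob", "ProjA", "projA/core"),
    ("bob", "ProjA", "projA/docs"),
    ("carol", "ProjB", "projB/solver"),
    ("carol", "ProjC", "projC/io"),
    ("carol", "ProjA", "projA/docs"),
]


def projects_by_contributor(commits):
    """Step 1: map each unique contributor to the set of projects they committed to."""
    projects = defaultdict(set)
    for contributor, project, _repo in commits:
        projects[contributor].add(project)
    return dict(projects)


def cross_project_contributors(commits, minimum=2):
    """Step 2: contributors with commits in at least `minimum` distinct projects."""
    mapping = projects_by_contributor(commits)
    return {c for c, projects in mapping.items() if len(projects) >= minimum}


def repos_by_cross_project_contributors(commits):
    """Steps 4 and 5 (assumed reading): repos ordered from most to fewest
    cross-project contributors, with ties broken by repo name."""
    cross = cross_project_contributors(commits)
    per_repo = defaultdict(set)
    for contributor, _project, repo in commits:
        per_repo[repo]  # make sure repos with no cross-project contributors still appear
        if contributor in cross:
            per_repo[repo].add(contributor)
    counts = [(repo, len(people)) for repo, people in per_repo.items()]
    return sorted(counts, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    print(len(projects_by_contributor(COMMITS)), "unique contributors")
    print(sorted(cross_project_contributors(COMMITS)), "contribute to 2+ projects")
    for repo, count in repos_by_cross_project_contributors(COMMITS):
        print(repo, count)
```

The first entries of the sorted list answer step 4 and the last entries answer step 5.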

Definitions
In our research study document we use "contributor" in place of "author". In developing our contributor (or "author") classification scheme for data analysis, we need to be clear about the scope of the analysis and how we are defining terms. I propose the following definitions to guide data analysis:
Phase 1 (Tier 1 -- lowest level of analysis)
Repo is defined as a GitHub repository. For the purposes of this Phase, we are interested primarily in repos that are associated with ECP projects.
Commit is defined as a save (of the current state, or snapshot) of the repository.
Contributor is defined as a unique user ID with 1 or more commits to 1 or more repos attributed to an ECP project.
Contributor ranking is defined as the number of commits. The greater the number of commits, the greater the ranking of the contributor.
Contribution is defined as a commit generated by a human and, potentially, a commit by a bot created by a human.
Cross-repo contribution is defined as one or more commits by a unique contributor to two or more different repos.
Cross-project contribution is defined as one or more commits to one or more ECP project repos. A project is defined as a formal collection of repos (e.g., ADTM, ALPINE). If a project has multiple repos, a single commit to any one of them is enough to count as a contribution to that ECP project.
Contributor network ranking is defined as the number of commits in repos that are attributed to a number of different ECP project repos. The greater the number of commits and the greater the number of ECP project repos, the higher the ranking of the individual contributor.
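
To make the two ranking definitions concrete, here is a small Python sketch under stated assumptions. The `(contributor_id, project, repo)` record layout and all names are hypothetical. The definitions above do not say how commit count and repo count should be combined into a single network ranking, so this sketch assumes one possible ordering: sort by number of distinct ECP project repos first, then by number of commits. Other weightings would be equally consistent with the wording.

```python
from collections import Counter, defaultdict

# Hypothetical commit records: (contributor_id, project, repo), one per commit.
# All names are made up for illustration; they are not real ECP data.
COMMITS = [
    ("alice", "ProjA", "projA/core"),
    ("alice", "ProjA", "projA/core"),
    ("alice", "ProjB", "projB/solver"),
    ("bob", "ProjA", "projA/core"),
    ("bob", "ProjA", "projA/core"),
    ("bob", "ProjA", "projA/core"),
    ("bob", "ProjA", "projA/core"),
    ("carol", "ProjB", "projB/solver"),
    ("carol", "ProjC", "projC/io"),
]


def contributor_ranking(commits):
    """Contributor ranking: contributors ordered by commit count, highest first."""
    counts = Counter(contributor for contributor, _project, _repo in commits)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def contributor_network_ranking(commits):
    """Contributor network ranking (assumed ordering): more distinct repos ranks
    higher, and commit count breaks ties. Returns (contributor, repos, commits)."""
    repos = defaultdict(set)
    counts = Counter()
    for contributor, _project, repo in commits:
        repos[contributor].add(repo)
        counts[contributor] += 1
    rows = [(c, len(repos[c]), counts[c]) for c in counts]
    return sorted(rows, key=lambda row: (-row[1], -row[2], row[0]))


if __name__ == "__main__":
    print(contributor_ranking(COMMITS))
    print(contributor_network_ranking(COMMITS))
```

With this sample data the two rankings disagree: the contributor with the most commits works in a single repo, so they lead the contributor ranking but come last in the network ranking.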

@elaineraybourn elaineraybourn changed the title Complete initial pilot data collection Complete initial pilot data collection: Phase 1 Aug 4, 2020