From bfe1a39d655d86c47c195cf5de83d678a8ec597b Mon Sep 17 00:00:00 2001
From: h1alexbel
Date: Thu, 9 May 2024 13:50:43 +0300
Subject: [PATCH 1/4] feat(#307): filter out, typo

---
 tex/report.tex | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/tex/report.tex b/tex/report.tex
index b563b60e..1813843b 100644
--- a/tex/report.tex
+++ b/tex/report.tex
@@ -86,7 +86,7 @@ \section{Motivation}\label{sec:motivation}
 their research results, paper authors must somehow guarantee that the source
 code used at the time of research remains available and intact throughout the
 paper's lifetime. One obvious solution would be to make copies of the
-repositories being extracted and then host them somewhere they are "forever"
+repositories being extracted and then host them somewhere they are ``forever''
 available.
 
 Second, research methods typically involve filtering out certain types of files
@@ -136,6 +136,8 @@ \section{Methodology}\label{sec:method}
   \item Fetch open repositories from GitHub, which have \ff{java} language
     tag, have reasonably big but not too big number of stars, and are
     of certain minimum size;
+  \item Filter out repositories contain samples, instead real project,
+    framework or library.
   \item Remove files without \ff{.java} extension, Java files with syntax errors,
     supplementary files such as \ff{package-info.java} and \ff{module-info.java},
     files with very long lines, and unit tests;

From 0cc376d82d868a80eff71d638fa6af1cc8b068b9 Mon Sep 17 00:00:00 2001
From: h1alexbel
Date: Thu, 9 May 2024 13:55:41 +0300
Subject: [PATCH 2/4] doc(#275): mit or apache

---
 tex/report.tex | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/tex/report.tex b/tex/report.tex
index 1813843b..26bb23d3 100644
--- a/tex/report.tex
+++ b/tex/report.tex
@@ -134,9 +134,9 @@ \section{Methodology}\label{sec:method}
 Python, Ruby, and Bash, which do exactly the following:
 \begin{itemize}
   \item Fetch open repositories from GitHub, which have \ff{java} language
-    tag, have reasonably big but not too big number of stars, and are
-    of certain minimum size;
-  \item Filter out repositories contain samples, instead real project,
+    tag, have reasonably big but not too big number of stars, have either MIT or Apache License,
+    and are of certain minimum size;
+  \item Filter out repositories those contain samples, instead real project,
     framework or library.
   \item Remove files without \ff{.java} extension, Java files with syntax errors,
     supplementary files such as \ff{package-info.java} and \ff{module-info.java},
@@ -153,7 +153,6 @@ \section{Methodology}\label{sec:method}
 We believe that our method is ethical, as it utilizes data from publicly
 available sources, thereby avoiding any infringement of copyright.
 
-% Would be great to include only repositories with MIT and Apache license, see https://github.com/yegor256/cam/issues/275
 
 \section{Results}\label{sec:results}
 

From 44bdcbf8258281a4149ca7b0dc1e98aac397af4c Mon Sep 17 00:00:00 2001
From: h1alexbel
Date: Thu, 9 May 2024 15:12:05 +0300
Subject: [PATCH 3/4] feat(#307): more

---
 tex/report.tex | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/tex/report.tex b/tex/report.tex
index 26bb23d3..a0e73524 100644
--- a/tex/report.tex
+++ b/tex/report.tex
@@ -137,7 +137,8 @@ \section{Methodology}\label{sec:method}
     tag, have reasonably big but not too big number of stars, have either MIT or Apache License,
     and are of certain minimum size;
   \item Filter out repositories those contain samples, instead real project,
-    framework or library.
+    framework or library utilizing Machine Learning techniques like text
+    classification.
   \item Remove files without \ff{.java} extension, Java files with syntax errors,
     supplementary files such as \ff{package-info.java} and \ff{module-info.java},
     files with very long lines, and unit tests;
@@ -161,6 +162,7 @@ \section{Results}\label{sec:results}
 \iexec{cat "${TARGET}/temp/repo-details.tex"}
 The full list of them is in the \ff{repositories.csv} file.
 The \ff{hashes.csv} file has a list of Git hashes of their latest commits.
+Predictions of whether each repository is a sample are stored in the \ff{predictions.csv} file.
 
 The filtering process was the following:
 

From 8d351b15277cfeaecf7d6870ad9f7d41f37eea7b Mon Sep 17 00:00:00 2001
From: h1alexbel
Date: Fri, 10 May 2024 09:44:07 +0300
Subject: [PATCH 4/4] feat(#307): license filter in separate item, repo link, more details

---
 tex/report.tex | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/tex/report.tex b/tex/report.tex
index a0e73524..275f165a 100644
--- a/tex/report.tex
+++ b/tex/report.tex
@@ -134,11 +134,12 @@ \section{Methodology}\label{sec:method}
 Python, Ruby, and Bash, which do exactly the following:
 \begin{itemize}
   \item Fetch open repositories from GitHub, which have \ff{java} language
-    tag, have reasonably big but not too big number of stars, have either MIT or Apache License,
-    and are of certain minimum size;
+    tag, have reasonably big but not too big number of stars, and are of certain minimum size;
+  \item Filter out repositories that have a license other than MIT or Apache License.
   \item Filter out repositories those contain samples, instead real project,
-    framework or library utilizing Machine Learning techniques like text
-    classification.
+    framework or library by using \ff{samples-filter}\footnote{\url{https://github.com/h1alexbel/samples-filter}},
+    which uses text classification to predict the class (real or sample)
+    each repository belongs to.
   \item Remove files without \ff{.java} extension, Java files with syntax errors,
     supplementary files such as \ff{package-info.java} and \ff{module-info.java},
     files with very long lines, and unit tests;
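
Reviewer note: the file-removal item in the Methodology list above could be sketched in Python roughly as follows. This is only a minimal illustration under stated assumptions, not the code used in the repository: the function name `keep_file`, the 1024-character line limit, and the naming convention used to spot unit tests are all hypothetical. Detection of Java syntax errors is left out because it would require a Java parser.

```python
from pathlib import PurePosixPath

# Hypothetical threshold: the report only says "very long lines".
MAX_LINE_LENGTH = 1024

# Supplementary files named explicitly in the report.
SUPPLEMENTARY = {"package-info.java", "module-info.java"}


def keep_file(path: str, text: str, max_line_length: int = MAX_LINE_LENGTH) -> bool:
    """Return True if a file survives the removal rules sketched here."""
    name = PurePosixPath(path).name
    # Rule 1: only files with the .java extension are kept.
    if not name.endswith(".java"):
        return False
    # Rule 2: supplementary files are removed.
    if name in SUPPLEMENTARY:
        return False
    # Rule 3: unit tests are removed (assumed naming and layout convention).
    if name.endswith("Test.java") or "/src/test/" in f"/{path}":
        return False
    # Rule 4: files with very long lines are removed.
    if any(len(line) > max_line_length for line in text.splitlines()):
        return False
    return True
```

For instance, `keep_file("src/main/java/App.java", "class App {}\n")` passes all four checks, while a `package-info.java` file or anything under `src/test/` is rejected.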