diff --git a/paper/paper.pdf b/paper/paper.pdf index a7455ed..95c3dd5 100644 Binary files a/paper/paper.pdf and b/paper/paper.pdf differ diff --git a/paper/paper.qmd b/paper/paper.qmd index 324eae2..8d207a6 100644 --- a/paper/paper.qmd +++ b/paper/paper.qmd @@ -35,23 +35,23 @@ Other than bad programming practices [@trisovic:2022], the main computing barrie [^r]: In this paper, we will focus on R, a popular programming language used frequently in various computational fields (e.g. computational social science, bioinformatics). -In reality, the impact of Component A is relative weak as mainstream, open source programming languages and their software libraries are usually cross platform. In modern computational research, Linux is the de-facto operating system in high performance computing environments (e.g. Slurm). Instead, the impact of Components B, C, and D is far higher. Component D is the most volatile among them all as there are many possible combinations of R packages and versions. Software update with breaking changes (even in a dependency) might render existing shared code using those changed features not executable or not producing the same result anymore. Also, software obsolescence is a common place, especially when academic software is often not well maintained due to lack of incentives [@merow:2023:B]. +In reality, the impact of Component A is relatively weak as mainstream, open source programming languages and their software libraries are usually cross platform. In modern computational research, Linux is the de-facto operating system in high performance computing environments (e.g. Slurm). Instead, the impact of Components B, C, and D is much higher. Component D is the most volatile among them all as there are many possible combinations of R packages and versions. Software updates with breaking changes (even in a dependency) might render existing shared code using those changed features not executable or not producing the same result anymore. 
Also, software obsolescence is commonplace, especially since academic software is often not well maintained due to lack of incentives [@merow:2023:B]. The DevOps (software development and IT operations) community is also confronted with this problem. The issue is usually half-jokingly referred to as "it works on my machine"-problem [@valstar:2020:UDS, a software works on someone's local machine but is not working anymore when deployed to the production system, indicates the software tacitly depends on the computational environment of the local machine]. A partial solution to this problem from the DevOps community is called *containerization*. In essence, to containerize is to develop and deploy the software together with all the libraries and the operating system in an OS-level virtualization environment. In this way, software dependency issues can be resolved inside the isolated virtualized software environment and independent of what is installed on the local computer. Docker is a popular choice in the DevOps world for containerization. -To build a container, one needs to write in plain text a declarative description of the required computational environment. Inside this declarative description, it should pin down all four Components mentioned above. For Docker, it is in the form of a plain text file called `Dockerfile`. This `Dockerfile` is then used as the recipe to build a Docker image, where the four Components are assembled. Then, one can launch a container with the built Docker image. +To build a container, one needs to write a plain text declarative description of the required computational environment. Inside this declarative description, it should pin down all four Components mentioned above. For Docker, it is in the form of a plain text file called `Dockerfile`. This `Dockerfile` is then used as the recipe to build a Docker image, where the four Components are assembled. Then, one can launch a container with the built Docker image. 
-There has been many papers written on how containerization solutions such as Docker can be helpful also to foster computational reproducibility of science [e.g. @nuest:2019;@peikert:2021:RDA;@boettiger:2017:IR]. Although tutorials are available [e.g. @nuest:2019], providing a declarative description of the computational environment in the form of Dockerfile is far from the standard code sharing practice. This might probably be due to most scientists do not have the (DevOps) skill to craft a Dockerfile [@kim:2018:E]. But there are many tools available to automate the process [e.g. @nuest:2019]. The case in point described in this paper, `rang`, is among one of them. And we argue that `rang` is the only easy-to-use solution available that can pin down and restore all four components without the reliance on any commercial service such as MRAN. +There have been many papers written on how containerization solutions such as Docker can also help foster computational reproducibility of science [e.g. @nuest:2019;@peikert:2021:RDA;@boettiger:2017:IR]. Although tutorials are available [e.g. @nuest:2019], providing a declarative description of the computational environment in the form of a Dockerfile is far from standard code sharing practice. This might be because most scientists lack the (DevOps) skills to create a Dockerfile [@kim:2018:E]. But there are many tools available to automate the process [e.g. @nuest:2019]. The case in point described in this paper, `rang`, is one of them. We argue that `rang` is the only easy-to-use solution available that can pin down and restore all four components without relying on any commercial service such as MRAN. ## Existing solutions -`renv` [@renvrpkg] (and its derivatives such as `jetpack` and its predecessor `packrat`) takes a similar approach to Python's `virtualenv` and Ruby's `Gem` to pin down the exact version of R packages using a "lock file". 
Other solutions such as `checkpoint` [@checkpointrpkg] depends on the availability of The Microsoft R Application Network (MRAN, a time-stamped daily backup of CRAN), which will be shut down on July 1st, 2023. `groundhog` [@groundhogrpkg] used to be depending on MRAN but with a plan to switch to their home-grown R package repository. These solution can effectively pin down Component C and D. But they can only restore component D. Also, for solution dependent on MRAN, there is a limit on how far back can this reproducibility go, because MRAN can only go as far back as September 17, 2014. Also, it covers only CRAN packages. +`renv` [@renvrpkg] (and its derivatives such as `jetpack` and its predecessor `packrat`) takes a similar approach to Python's `virtualenv` and Ruby's `Gem` to pin down the exact version of R packages using a "lock file". Other solutions such as `checkpoint` [@checkpointrpkg] depend on the availability of The Microsoft R Application Network (MRAN, a time-stamped daily backup of CRAN), which will be shut down on July 1st, 2023. `groundhog` [@groundhogrpkg] used to depend on MRAN but has a plan to switch to their home-grown R package repository. These solutions can effectively pin down Components C and D. But they can only restore Component D. Also, for solutions depending on MRAN, there is a limit on how far back this reproducibility can go, since MRAN can only go back as far as September 17, 2014. Additionally, MRAN only covers CRAN packages. -`containerit` [@nuest:2019] takes the current state of the computational environment and document it as a Dockerfile. `containerit` makes an assumption that Component A has a weak influence on computational reproducibility and therefore defaults to Linux-based Rocker images. In this way, it fixes Component A. But `containerit` does not pin down the exact version of R packages. Therefore, it can pin down components A, B, C, but only a part of component D. 
`dockta` is another containerization solution that can potentially pin down all components due to the fact that MRAN is used. But it also suffers from the same limitations mentioned above. +`containerit` [@nuest:2019] takes the current state of the computational environment and documents it as a Dockerfile. `containerit` makes the assumption that Component A has a weak influence on computational reproducibility and therefore defaults to Linux-based Rocker images. In this way, it fixes Component A. But `containerit` does not pin down the exact version of R packages. Therefore, it can pin down Components A, B, and C, but only a part of Component D. `dockta` is another containerization solution that can potentially pin down all components because it uses MRAN. But it also suffers from the same limitations mentioned above. -It is also worth mention that MRAN is not the only archival service. Posit also provides a free (*gratis*) time-stamped daily backup of CRAN and Bioconductor (a series of repositories of R package for bioinformatics and computational biology) called Posit Public Package Manager (https://packagemanager.rstudio.com/client/#/repos/2/packages/). It can goes as far back as October 10, 2017. +It is also worth mentioning that MRAN is not the only archival service. Posit also provides a free (*gratis*) time-stamped daily backup of CRAN and Bioconductor (a series of repositories of R packages for bioinformatics and computational biology) called Posit Public Package Manager (https://packagemanager.rstudio.com/client/#/repos/2/packages/). It can go as far back as October 10, 2017. -These solutions are better for prospective usage, i.e. using them now to ensure the reproducibility of the current research for future researchers. `rang` mostly targets retrospective usage, i.e. using `rang` to reconstruct historical R computational environments which the declarative descriptions are not available. One can think of `rang` as an archaeological tool. 
In this realm, we could not find any existing solution targeting specifically for R but yet not depending on MRAN. +These solutions are better for prospective usage, i.e. using them now to ensure the reproducibility of the current research for future researchers. `rang` mostly targets retrospective usage, i.e. using `rang` to reconstruct historical R computational environments for which the declarative descriptions are not available. One can think of `rang` as an archaeological tool. In this realm, we could not find any existing solution targeting R specifically which does not currently depend on MRAN. ## Structure of this paper @@ -76,9 +76,9 @@ The resolved result is an S3 object called `rang`. The information contained in dockerize(graph, output_dir = "docker") ``` -For R >= 3.1, the images from the Rocker project are used [@boettiger:2017:IR]. For R < 3.1 but >= 1.3.1, a custom image based on Debian is used. As of writing, `rang` does not support R < 1.3.1, i.e. snapshot date earlier than 2001-08-31 (which is 13 years earlier than all solutions depend on MRAN). There are two features of `dockerize()` that are important for future reproducibility. +For R >= 3.1, the images from the Rocker project are used [@boettiger:2017:IR]. For R < 3.1 but >= 1.3.1, a custom image based on Debian is used. As of writing, `rang` does not support R < 1.3.1, i.e. snapshot date earlier than 2001-08-31 (which is 13 years earlier than all solutions depending on MRAN). There are two features of `dockerize()` that are important for future reproducibility. -1. By default, the container building process downloads source packages from their sources and then compiles them. This step depends on the future availability of R packages on CRAN (which is extremely likely to be the case in the near future, given the continuous availability since 1997-04-23) [^1]m Bioconductor, and Github. However, it is also possible to cache (or archive) the source packages now. 
The archived R packages can then be used instead during the building process. The significance of this step in terms of long-term computational reproducibility will be discussed in Section 4. +1. By default, the container building process downloads source packages from their sources and then compiles them. This step depends on the future availability of R packages on CRAN (which is extremely likely to be the case in the near future, given the continuous availability since 1997-04-23) [^1], Bioconductor, and Github. However, it is also possible to cache (or archive) the source packages now. The archived R packages can then be used instead during the building process. The significance of this step in terms of long-term computational reproducibility will be discussed in Section 4. [^1]: https://stat.ethz.ch/pipermail/r-announce/1997/000001.html @@ -239,7 +239,7 @@ cd materials Rscript fn_5.R ``` -The same file can also be "rescued" by `rang`. +The same file can thus also be "rescued" by `rang`. [^groundhog]: http://datacolada.org/100 @@ -266,7 +266,7 @@ The software paper of the R package `ptproc` was published in 2003 and introduce [^taskview]: https://cran.r-project.org/web/views/SpatioTemporal.html -Even with this over-a-decade removal and new packages with similar functionalities have been created, there is evidence that `ptproc` is still being sought for. As late as 2017, there are blog posts on how to install the long obsolete package on modern R [^blog]. The package is extremely challenging to install on a modern R system because the package was written before the introduction of name space management in R 1.7.0 [@RN-2003-001]. In other words, the available tarball file from the original author's website does not contain a `NAMESPACE` file as all other modern R packages do. +Even with this over-a-decade removal and new packages with similar functionalities have been created, there is evidence that `ptproc` is still being sought for. 
As late as 2017, there are blog posts on how to install the long obsolete package on modern versions of R [^blog]. The package is extremely challenging to install on a modern R system because the package was written before the introduction of name space management in R 1.7.0 [@RN-2003-001]. In other words, the available tarball file from the original author's website does not contain a `NAMESPACE` file as all other modern R packages do. The oldest version of R that `rang` can support, as of writing, is R 1.3.1. `rang` is probably the only solution available that can support the 1.x series of R (i.e. before 2004-10-04). Similar to the case of `maxent` above, a Dockerfile to assemble a Docker image with `ptproc` installed can be generated with two lines of code. @@ -322,9 +322,9 @@ The file `peng.Rout` contains the execution results of the script from inside th ## Recover a removed Bioconductor package -Also similar to CRAN, packages can get removed over time on Bioconductor. The Bioconductor package `Sushi` has been deprecated by the original authors and is removed from Bioconductor version 3.16 (2022-11-02). `Sushi` is a data visualization tool for genomic data and was used in many online tutorials and scientific papers, including the original paper announcing the package by the original authors [@phanstiel:2014:S]. +Similar to CRAN, packages can also be removed over time from Bioconductor. The Bioconductor package `Sushi` has been deprecated by the original authors and is removed from Bioconductor version 3.16 (2022-11-02). `Sushi` is a data visualization tool for genomic data and was used in many online tutorials and scientific papers, including the original paper announcing the package by the original authors [@phanstiel:2014:S]. -`rang` has native support for Bioconductor packages since 0.2. We obtained the R script `"PaperFigure.R"` from the Github repo of `Sushi` [^sushi], which generates the figure in @phanstiel:2014:S. 
Similar to the above case of `ptproc`, we made a completely automated BASH script to run `"PaperFigure.R"` and get the generated figure out of the container (@fig-figure2). We made no modification to `"PaperFigure.R"`. +`rang` has native support for Bioconductor packages since version 0.2. We obtained the R script `"PaperFigure.R"` from the Github repository of `Sushi` [^sushi], which generates the figure in @phanstiel:2014:S. Similar to the above case of `ptproc`, we made a completely automated BASH script to run `"PaperFigure.R"` and get the generated figure out of the container (@fig-figure2). We made no modification to `"PaperFigure.R"`. ```sh Rscript -e "require(rang); dockerize(resolve('Sushi', '2014-06-05'), @@ -349,7 +349,7 @@ knitr::include_graphics("sushi_figure1.pdf", dpi = 300) # Preparing research compendia with long-term computational reproducibility -The above six examples show how useful it is for `rang` reconstructing tricky computational environments which have not been completely declared in the literature. Although we position `rang` mostly as an archaeological tool, we also think that `rang` can also be used to prepare research compendia of current research. We can't predict the future but research compendia generated by `rang` would probably have long-term computational reproducibility. +The above six examples show how powerful `rang` is to reconstruct tricky computational environments which have not been completely declared in the literature. Although we position `rang` mostly as an archaeological tool, we think that `rang` can also be used to prepare research compendia of current research. We can't predict the future but research compendia generated by `rang` would probably have long-term computational reproducibility. To demonstrate this point, we took the recent paper by @oser:2022:HPE. 
This paper was selected because 1) the paper was published in *Political Communication*, a high impact journal that awards Open Science Badges; 2) shared data and R code are available; and most importantly, 3) the shared R code is well-written. In the repository of this paper, we based on the materials shared by @oser:2022:HPE and prepared a research compendium that should have long-term computational reproducibility. The research compendium is similar to the Executable Compendium suggested by the Turing way. @@ -405,9 +405,9 @@ oserdocker/ oserimg.tar.gz ``` -In this executable compendium, only the first four elements are essential. The directory `oserdocker` (116 MB) contains cached R packages, Dockerfile, and a verbatim copy of the directory `meta-analysis/` to be transferred into the Docker image. That can be regenerated by running `make resolve`. However, having this directory preserved insures against the situations that some R packages used in the project were no longer available or any of the information providers used by `rang` for resolving the dependency relationships were not available. (Or in the rare circumstance of `rang` is no longer available.) +In this executable compendium, only the first four elements are essential. The directory `oserdocker` (116 MB) contains cached R packages, a Dockerfile, and a verbatim copy of the directory `meta-analysis/` to be transferred into the Docker image. That can be regenerated by running `make resolve`. However, having this directory preserved insures against the situations that some R packages used in the project were no longer available or any of the information providers used by `rang` for resolving the dependency relationships were not available. (Or in the rare circumstance of `rang` is no longer available.) -`oserimg.tar.gz` (667 MB) is a backup copy of the Docker image. This can be regenerated by running `make export`. 
Preserving this file insurces against all the situations mentioned above, but also the situations of Docker Hub and the software repositories used by the dockerized operating system being not available. When `oserimg.tar.gz` is available, it is possible to run `make rebuild` and `make render` even without internet access (provided that Docker and `make` have been preinstalled). Of course, there is still an extremely rare situation where Docker (the program) itself is no longer available [^make]. However, it is possible to convert the image file for use on other containerization solutions such as Singularity [^singularity], if Docker is really not available anymore. +`oserimg.tar.gz` (667 MB) is a backup copy of the Docker image. This can be regenerated by running `make export`. Preserving this file insures against all the situations mentioned above, but also the situations of Docker Hub and the software repositories used by the dockerized operating system being not available. When `oserimg.tar.gz` is available, it is possible to run `make rebuild` and `make render` even without internet access (provided that Docker and `make` have been installed before). Of course, there is still an extremely rare situation where Docker (the program) itself is no longer available [^make]. However, it is possible to convert the image file for use on other containerization solutions such as Singularity [^singularity], if Docker is really not available anymore. Sharing of research artifacts less than 1G is not as challenging as it used to be. Zenodo, for example, allows the sharing of 50G of files. Therefore, sharing of the last two components of the executable compendium prepared with `rang` is at least possible on Zenodo. However, for data repositories with more restrictions on data size, sharing the executable compendium without the last two parts could be considered sufficient. 
For that, run `make` will make the default target `all` and generate all the things needed for reproducing the analysis inside a container. @@ -421,10 +421,10 @@ The above `Makefile` is general enough that one can reuse it by just modifying h # Concluding remarks -This paper presents `rang`, a solution to (re)construct R computational environments based on Docker. As the six examples in Section 3 shown, `rang` can be used archaeologically to rerun some old code, many of them not executable without the analytic and reconstruction processes facilitated by `rang`. These retrospectively use cases demonstrate how versatile `rang` is. `rang` is also helpful for prospective usage, as demonstrated in Section 4 whereby an executable compendium is created. +This paper presents `rang`, a solution to (re)construct R computational environments based on Docker. As the six examples in Section 3 show, `rang` can be used archaeologically to rerun old code, many of them not executable without the analytic and reconstruction processes facilitated by `rang`. These retrospective use cases demonstrate how versatile `rang` is. `rang` is also helpful for prospective usage, as demonstrated in Section 4 whereby an executable compendium is created. There are still many features that we did not mention in this paper. `rang` is built with interoperability in mind. As of writing, `rang` is interoperable with existing R packages such as `renv` and R built-in `sessionInfo()`. Also, the `rang` object can be used for network analysis with R packages such as `igraph`. -Computational reproducibility is a complex topic and as in all of these complex topic, there is no silver bullet [@canon:2019:CPR]. All solutions have their trade-offs. The (re)construction process based on `rang` takes notably more time than other solutions because all packages are compiled from source. 
`rang` trades computational efficiency of this often one-off (re)constructing process for correctness, backward compatibility and independence from any commercial backups of software repositories such as MRAN. There are also other limitations. In the Vignette of `rang` (https://cran.r-project.org/web/packages/rang/vignettes/faq.html), we list out all of these limitations as well as possible mitigation. +Computational reproducibility is a complex topic and, as with all complex topics, there is no silver bullet [@canon:2019:CPR]. All solutions have their trade-offs. The (re)construction process based on `rang` takes notably more time than other solutions because all packages are compiled from source. `rang` trades computational efficiency of this often one-off (re)constructing process for correctness, backward compatibility and independence from any commercial backups of software repositories such as MRAN. There are also other limitations. In the Vignette of `rang` (https://cran.r-project.org/web/packages/rang/vignettes/faq.html), we list all of these limitations as well as possible mitigations. # References