A pipeline for phylogenetic diversity analysis of GBIF-mediated data

vmikk/PhyloNext

PhyloNext - PD (Phylogenetic Diversity) in the cloud

DOI: 10.1186/s12862-024-02256-9

PhyloNext is an automated pipeline for the analysis of phylogenetic diversity using GBIF occurrence data, species phylogenies from Open Tree of Life, and the Biodiverse software.

Introduction

The pipeline brings together two critical research data infrastructures, the Global Biodiversity Information Facility (GBIF) and Open Tree of Life (OToL), to make them more accessible to non-experts.

The pipeline is built using Nextflow, a workflow tool that runs tasks across multiple compute infrastructures in a portable manner. It uses Docker containers, which makes installation trivial and results highly reproducible. The Nextflow DSL2 implementation of this pipeline uses one container per process, which makes it much easier to maintain and update software dependencies.

The pipeline can be launched in a cloud environment (e.g., Microsoft Azure, Amazon Web Services, or Google Cloud).

Pipeline summary

  1. Filtering of GBIF species occurrences for various taxonomic clades and geographic areas
  2. Removal of non-terrestrial records and spatial outliers (using density-based clustering)
  3. Preparation of a phylogenetic tree (currently, only pre-constructed phylogenetic trees are available; with the update of OToL, phylogenetic trees will be downloaded automatically using the API) and name-matching with GBIF species keys
  4. Spatial binning of species occurrences using Uber’s H3 system (hexagonal hierarchical spatial index)
  5. Estimation of phylogenetic diversity and endemism indices using the Biodiverse program
  6. Visualization of the obtained results
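
As a rough illustration of how these steps map onto command-line options, the sketch below assembles an argument list stage by stage and prints the resulting command instead of running it. The option names come from the pipeline's help message; the `/path/to/...` locations and the chosen values are placeholders, not recommended settings.

```shell
#!/bin/sh
# Assemble a PhyloNext argument list step by step (illustrative values only).
set --   # start from an empty argument list

# Step 1: taxonomic and geographic filtering of GBIF occurrences
# (both paths are hypothetical placeholders)
set -- "$@" --input "/path/to/occurrence.parquet/" --outdir "results"
set -- "$@" --classis "Mammalia" --country "DE,PL,CZ" --minyear 2000

# Step 2: removal of spatial outliers with density-based clustering
set -- "$@" --dbscan true --dbscanepsilon 700 --dbscanminpts 3

# Step 3: phylogenetic tree (hypothetical path to a Newick file)
set -- "$@" --phytree "/path/to/Mammals.nwk"

# Step 4: spatial binning with the H3 grid
set -- "$@" --h3resolution 4

# Step 5: diversity and endemism indices with randomisations
set -- "$@" --indices "calc_richness,calc_pd,calc_pe" \
            --randname "rand_structured" --iterations 100

# Step 6: static visualization of selected variables
set -- "$@" --plotvar "RICHNESS_ALL,PD,PD_P" --plotformat "png"

# Print the assembled command for inspection
printf 'nextflow run vmikk/phylonext -r main'
printf ' %s' "$@"
printf '\n'
```

Building the list with `set --` keeps each step's options next to the comment that explains them, so a step can be adjusted or dropped without touching the others.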

Quick Start

An example command to run the pipeline:

nextflow run vmikk/phylonext -r main \
  --input "/mnt/GBIF/Parquet/2022-01-01/occurrence.parquet/" \
  --classis "Mammalia" --family  "Felidae,Canidae" \
  --country "DE,PL,CZ"  \
  --minyear 2000  \
  --dbscan true  \
  --phytree $(realpath "${HOME}/.nextflow/assets/vmikk/phylonext/test_data/phy_trees/Mammals.nwk") \
  --iterations 100  \
  -resume
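
The same options can also be kept in a parameter file and passed with Nextflow's `-params-file` option. The sketch below writes such a file. It assumes the usual Nextflow convention that YAML keys are the option names without the leading dashes; the file name `Mammals.yaml` follows the example in the help message, and the tree path is built from `${HOME}` directly rather than with `realpath`.

```shell
#!/bin/sh
# Write the Quick Start options to a parameter file.
# The heredoc delimiter is unquoted so that ${HOME} expands to an absolute path.
cat > Mammals.yaml <<EOF
input: "/mnt/GBIF/Parquet/2022-01-01/occurrence.parquet/"
classis: "Mammalia"
family: "Felidae,Canidae"
country: "DE,PL,CZ"
minyear: 2000
dbscan: true
phytree: "${HOME}/.nextflow/assets/vmikk/phylonext/test_data/phy_trees/Mammals.nwk"
iterations: 100
EOF

# Show the result for a quick visual check
cat Mammals.yaml
```

The run then shortens to `nextflow run vmikk/phylonext -r main -params-file Mammals.yaml -resume`, and the file can be kept under version control alongside the results.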

Web GUI

To make the PhyloNext pipeline easier to explore, Thomas Stjernegaard Jeppesen has developed a web-based graphical user interface (GUI).

The GUI is available at https://phylonext.gbif.org/.

NB! To access the GUI, users must have a GBIF user account. To register an account, please visit https://www.gbif.org/.

Documentation

The PhyloNext pipeline comes with documentation about the pipeline usage at https://phylonext.github.io/.

The main pipeline parameters and outputs are described in the documentation.

To show a help message, run `nextflow run vmikk/phylonext -r main --help`:

=====================================================================
PhyloNext: GBIF phylogenetic diversity pipeline :  Version 1.4.0
=====================================================================

Pipeline Usage:
To run the pipeline, enter the following in the command line:
    nextflow run vmikk/phylonext -r main --input ... --outdir ...

Options:
REQUIRED:
    --input               Path to the directory with parquet files (GBIF occurrence dump)
    --outdir              The output directory where the results will be saved
OPTIONAL:
    --phylum              Phylum to analyze (multiple comma-separated values allowed); e.g., "Chordata"
    --classis             Class to analyze (multiple comma-separated values allowed); e.g., "Mammalia"
    --order               Order to analyze (multiple comma-separated values allowed); e.g., "Carnivora"
    --family              Family to analyze (multiple comma-separated values allowed); e.g., "Felidae,Canidae"
    --genus               Genus to analyze (multiple comma-separated values allowed); e.g., "Felis,Canis,Lynx"
    --specieskeys         Custom list of GBIF specieskeys (file with a single column, with header)

    --phytree             Custom phylogenetic tree
    --taxgroup            Specific taxonomy group in Open Tree of Life (default, "All_life")
    --phylabels           Type of tip labels on a phylogenetic tree ("OTT" or "Latin")
    --maxage              Manually assign root age for a tree obtained from Open Tree of Life; e.g., 127
    --phyloonly           Prune Open Tree tips for which there are no phylogenetic inputs; logical, default, false

    --country             Country code, ISO 3166 (multiple comma-separated values allowed); e.g., "DE,PL,CZ"
    --latmin              Minimum latitude of species occurrences (decimal degrees); e.g., 5.1
    --latmax              Maximum latitude of species occurrences (decimal degrees); e.g., 15.5
    --lonmin              Minimum longitude of species occurrences (decimal degrees); e.g., 47.0
    --lonmax              Maximum longitude of species occurrences (decimal degrees); e.g., 55.5
    --minyear             Minimum year of record's occurrences; default, 1945
    --maxyear             Maximum year of record's occurrences; default, none
    --coordprecision      Coordinate precision threshold (less than maximum allowed value; default, 0.1)
    --coorduncertainty    Maximum allowed coordinate uncertainty, meters (default, 10000)
    --coorduncertaintyexclude Black list of coordinate uncertainty values (default, "301,3036,999,9999")
    --basisofrecordinclude Basis of record to include from the data; e.g., "PRESERVED_SPECIMEN"
    --basisofrecordexclude Basis of record to exclude from the data; e.g., "FOSSIL_SPECIMEN,LIVING_SPECIMEN"
    --polygon             Custom area of interest (a file with polygons in GeoPackage format)
    --wgsrpd              Polygons of World Geographical Regions; e.g., "pipeline_data/WGSRPD.RData"
    --regions             Names of World Geographical Regions; e.g., "L1_EUROPE,L1_ASIA_TEMPERATE"
    --noextinct           File with extinct species specieskeys for their removal (file with a single column, with header)
    --excludehuman        Logical, exclude genus "Homo" from occurrence data (default, true)
    --roundcoords         Numeric, round spatial coordinates to N decimal places, to reduce the dataset size (default, 2; set to negative to disable rounding)
    --h3resolution        Spatial resolution of the H3 geospatial indexing system; e.g., 4

    --dbscan              Logical, remove spatial outliers with density-based clustering; e.g., "false"
    --dbscannoccurrences  Minimum species occurrence to perform DBSCAN; e.g., 30
    --dbscanepsilon       DBSCAN parameter epsilon, km; e.g., "700"
    --dbscanminpts        DBSCAN min number of points; e.g., "3"

    --terrestrial         Land polygon for removal of non-terrestrial occurrences; e.g., "pipeline_data/Land_Buffered_025_dgr.RData"
    --rmcountrycentroids  Polygons with country and province centroids; e.g., "pipeline_data/CC_CountryCentroids_buf_1000m.RData"
    --rmcountrycapitals   Polygons with country capitals; e.g., "pipeline_data/CC_Capitals_buf_10000m.RData"
    --rminstitutions      Polygons with biological institutions and museums; e.g., "pipeline_data/CC_Institutions_buf_100m.RData"
    --rmurban             Polygons with urban areas; e.g., "pipeline_data/CC_Urban.RData"

    --deriveddataset      Prepare a list of DOIs for the datasets used (default, true)

    --indices             Comma-separated list of diversity and endemism indices; e.g., "calc_richness,calc_pd,calc_pe"
    --randname            Randomisation scheme type; e.g., "rand_structured"
    --iterations          Number of randomisation iterations; e.g., 1000
    --biodiversethreads   Number of Biodiverse threads; e.g., 10
    --randconstrain       Polygons to perform spatially constrained randomization (GeoPackage format)

Leaflet interactive visualization:
    --leaflet_var         Variables to plot; e.g., "RICHNESS_ALL,PD,SES_PD,PD_P,ENDW_WE,SES_ENDW_WE,PE_WE,SES_PE_WE,CANAPE,Redundancy"
    --leaflet_canapesuper Include the `superendemism` class in CANAPE results (default, false)
    --leaflet_color       Color scheme for continuous variables (default, "RdYlBu")
    --leaflet_palette     Color palette for continuous variables (default, "quantile")
    --leaflet_bins        Number of color bins for continuous variables (default, 5)
    --leaflet_sescolor    Color scheme for standardized effect sizes, SES (default, "threat"; alternative - "hotspots")
    --leaflet_redundancy  Redundancy threshold for hiding the grid cells with low number of records (default, 0 = display all grid cells)

Static visualization:
    --plotvar             Variables to plot (multiple comma-separated values allowed); e.g., "RICHNESS_ALL,PD,PD_P"
    --plottype            Plot type
    --plotformat          Plot format (jpg,pdf,png)
    --plotwidth           Plot width (default, 18 inches)
    --plotheight          Plot height (default, 18 inches)
    --plotunits           Plot size units (in,cm)
    --world               World basemap

NEXTFLOW-SPECIFIC:
    -qs                   Queue size (max number of processes that can be executed in parallel); e.g., 8
    -w                    Path to the working directory to store intermediate results (default, "./work")
    -resume               Execute the pipeline using the cached results. Useful to continue executions that were stopped by an error
    -profile              Configuration profile; e.g., "docker"
    -params-file          Parameter file in YAML or JSON format (e.g., "Mammals.yaml")
    -c / -C               Configuration file (`-C` ignores all default values) (default, "nextflow.config")
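
As a small usage sketch for the bounding-box options listed above, the script below checks that the minimum values are below the maximum values before composing the command. The coordinates are the example values from the help message, the `bbox` variable is a hypothetical helper, and the `--input ... --outdir ...` placeholders are left elided as in the usage line.

```shell
#!/bin/sh
# Hypothetical bounding box, using the example values from the help message
latmin=5.1;  latmax=15.5
lonmin=47.0; lonmax=55.5

# Guard against swapped bounds before launching a long run
# (awk is used because POSIX sh has no floating-point comparison)
if ! awk -v a="$latmin" -v b="$latmax" -v c="$lonmin" -v d="$lonmax" \
     'BEGIN { exit !(a < b && c < d) }'; then
  echo "Bounding box is empty: check the min/max values" >&2
  exit 1
fi

# Compose the spatial filter and print the command for inspection
bbox="--latmin $latmin --latmax $latmax --lonmin $lonmin --lonmax $lonmax"
echo "nextflow run vmikk/phylonext -r main --input ... --outdir ... $bbox --h3resolution 4"
```

Checking the bounds up front is only a convenience: it catches a swapped pair of values before any occurrence data are read.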

Source code for the documentation can be found at https://github.com/PhyloNext/phylonext.github.io.

Credits

The PhyloNext pipeline was developed by Vladimir Mikryukov and Kessy Abarenkov.

The Biodiverse program and the Perl scripts accompanying PhyloNext were written by Shawn Laffan (Laffan et al., 2010).

Scripts for getting an induced subtree from the Open Tree of Life were developed by Emily Jane McTavish.

We thank the following people for their extensive assistance in the development of this pipeline: Joe Miller, Shawn Laffan, Tim Robertson, Emily Jane McTavish, John Waller, Thomas Stjernegaard Jeppesen, and Matthew Blissett.

We are also very grateful to Manuele Simi and the nf-core community for helpful advice on the development of this pipeline.

For more details, please see the Acknowledgments section in the docs.

Funding

The work is supported by a grant “PD (Phylogenetic Diversity) in the Cloud” to GBIF Supplemental funds from the GEO-Microsoft Planetary Computer Programme.

Contributions and Support

If you would like to contribute to this pipeline, please see the contributing guidelines.

For further information or help, don't hesitate to file an issue on GitHub.

Future plans

Citations

If you use the PhyloNext pipeline for your analysis, please cite it as:

Mikryukov V, Abarenkov K, Laffan S, Robertson T, McTavish EJ, Jeppesen TS, Waller J, Blissett M, Kõljalg U, Miller JT (2024). PhyloNext: A pipeline for phylogenetic diversity analysis of GBIF-mediated data. BMC Ecology and Evolution, 24(1), 76. DOI: 10.1186/s12862-024-02256-9

Laffan SW, Lubarsky E, Rosauer DF (2010). Biodiverse, a tool for the spatial analysis of biological and related diversity. Ecography, 33, 643-647. DOI: 10.1111/j.1600-0587.2010.06237.x

An extensive list of references for the tools used by the pipeline can be found in the Citations section in the documentation.