This pangenome graph evaluation pipeline measures the reconstruction accuracy of a pangenome graph (in the variation graph model). Its goal is to give guidance in finding the best pangenome graph construction tool for given input data and a given task.
It has six phases:
- (sample preparation) -- SHORT DESCRIPTION. TODO.
- splitfa: (split sequences) -- SHORT DESCRIPTION. TODO.
- (pick a subset of random sequences) -- SHORT DESCRIPTION. TODO.
- GraphAligner: (alignment) -- SHORT DESCRIPTION. TODO.
- peanut: (alignment evaluation) -- SHORT DESCRIPTION. TODO.
- beehave.R: (plot evaluation results) -- SHORT DESCRIPTION. TODO.
Clone this repository:
git clone --recursive
cd pgge
Create a pangenome graph and its consensus graphs using pggb, storing the results in the pggb_yeast directory:
pggb -i data/yeast/cerevisiae.pan.fa.gz -t 16 -s 50000 -p 90 -n 5 -Y "#" -k 8 -B 10000000 -w 30000 -I 0.7 -o pggb_yeast -W
Evaluate the consensus graphs stored in the pggb_yeast directory:
./pgge -g "pggb_yeast/*consensus*.gfa" -f data/yeast/cerevisiae.pan.fa.gz -t 16 -r scripts/beehave.R -l 100000 -s 50000 -o pgge_yeast
Make sure that you include the opening and closing " around the -g pattern in the command line; otherwise the shell expands the glob pattern before pgge sees it, and the pattern can't be resolved. For a single input GFA, the quotes are not required.
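The effect of the quotes can be illustrated with a small shell sketch. The temporary directory and file names are made up for the demonstration. It only shows that the shell expands an unquoted pattern into several arguments, while a quoted pattern is passed on as a single string.

```shell
# Hypothetical demo directory with two files that match the pattern.
tmp=$(mktemp -d)
touch "$tmp/a.consensus@10.gfa" "$tmp/b.consensus@100.gfa"

# Print how many arguments a command receives.
count_args() { echo "$#"; }

unquoted=$(count_args $tmp/*consensus*.gfa)    # the shell expands the pattern
quoted=$(count_args "$tmp/*consensus*.gfa")    # the pattern stays one string

echo "unquoted: $unquoted argument(s), quoted: $quoted argument(s)"
rm -rf "$tmp"
```

With the quotes in place, the program itself receives the pattern and can decide how to resolve it.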
Optionally, you can set -b to write the unmapped regions to BED.
If you want to enable random subsampling to reduce alignment time, you can select either -p/--subsample-percentage or -u/--subsample-number.
pgge summarizes results by sample name. If your FASTA file contains several sequences that belong to the same sample, the results will only contain one line of metrics for that sample, for example for S288C. This is useful if you have contig sequences in your FASTA and want to summarize by sample name. pgge always splits each sequence name at the delimiter and takes the first entry in the resulting split as the sample name.
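The grouping rule can be sketched in a few lines of shell. This is not pgge's own code. The delimiter `#` is an assumption based on the -Y "#" option in the pggb command above, and the sequence names are hypothetical.

```shell
# Assumption: '#' separates the sample name from the rest of the sequence name.
sample_name() {
  local seq_name=$1
  # Remove everything from the first '#' onward.
  echo "${seq_name%%#*}"
}

# Hypothetical sequence names: two contigs of one sample, one of another.
printf '%s\n' 'S288C#1' 'S288C#2' 'SK1#1' |
  while read -r name; do sample_name "$name"; done |
  sort -u
```

Both S288C contigs collapse to the single sample name S288C, which is why the results contain only one line of metrics per sample.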
pgge was designed for processing the results of pggb. If you are evaluating your own data not originating from pggb, it is recommended to set the -n/--input-graph-names parameter to ensure the final PNG is labeled correctly. This parameter requires a TSV with 2 columns:
- The name of the original input graph.
- The name to display in the PNG.
The following is an example for the yeast data set:
cerevisiae.pan.fa.pggb-W-s50000-l150000-p90-n5-a0-K16-k8.seqwish-w30000-j5000-e5000-I0.7.smooth.consensus@10000::y:0:1000000.gfa 10k::y:0:1000k
cerevisiae.pan.fa.pggb-W-s50000-l150000-p90-n5-a0-K16-k8.seqwish-w30000-j5000-e5000-I0.7.smooth.consensus@1000::y:0:1000000.gfa 1k::y:0:1000k
cerevisiae.pan.fa.pggb-W-s50000-l150000-p90-n5-a0-K16-k8.seqwish-w30000-j5000-e5000-I0.7.smooth.consensus@100::y:0:1000000.gfa 0.1k::y:0:1000k
cerevisiae.pan.fa.pggb-W-s50000-l150000-p90-n5-a0-K16-k8.seqwish-w30000-j5000-e5000-I0.7.smooth.consensus@10::y:0:1000000.gfa 0.01k::y:0:1000k
cerevisiae.pan.fa.pggb-W-s50000-p90-n5-a0-K16-k8.seqwish-w30000-j5000-e5000-I0.7.smooth.consensus@10000:y.gfa 10k:y
cerevisiae.pan.fa.pggb-W-s50000-p90-n5-a0-K16-k8.seqwish-w30000-j5000-e5000-I0.7.smooth.consensus@1000:y.gfa 1k:y
cerevisiae.pan.fa.pggb-W-s50000-p90-n5-a0-K16-k8.seqwish-w30000-j5000-e5000-I0.7.smooth.consensus@100:y.gfa 0.1k:y
cerevisiae.pan.fa.pggb-W-s50000-p90-n5-a0-K16-k8.seqwish-w30000-j5000-e5000-I0.7.smooth.consensus@10:y.gfa 0.01k:y
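Such a file can also be generated instead of typed by hand. The sketch below is one possible way to do it. The helper names `short_label` and `make_names_tsv`, the shortened file names, and the labeling rule (keep the part between "consensus@" and ".gfa") are all hypothetical. It writes the bare file name into the first column, as in the example above; whether pgge also accepts a full path there is not stated here.

```shell
# Derive a display label from a consensus graph file name by keeping only
# the part between "consensus@" and ".gfa" (a hypothetical naming rule).
short_label() {
  local base=${1##*/}               # drop any leading directory
  local label=${base##*consensus@}  # drop everything up to "consensus@"
  echo "${label%.gfa}"              # drop the file extension
}

# Print one "graph name<TAB>display name" line per GFA file argument.
make_names_tsv() {
  local gfa
  for gfa in "$@"; do
    printf '%s\t%s\n' "${gfa##*/}" "$(short_label "$gfa")"
  done
}

# Hypothetical, shortened file names.
make_names_tsv 'pggb_yeast/x.smooth.consensus@10000:y.gfa' \
               'pggb_yeast/x.smooth.consensus@10:y.gfa'
```

Redirecting the output to a file yields a two-column TSV that could then be passed via -n/--input-graph-names.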
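The output is written to pgge_yeast/pgge-l100000-s50000.tsv in a tab-delimited format. Each row holds the sample name, the consensus graph (cons.jump), the alignment identity, and the metrics qsc, uniq, multi, and nonaln:
cat pgge_yeast/pgge-l100000-s50000.tsv | column -t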
DBVPG6044 10000:y 0.994253 0.9882487336244542 0.9878719650655022 0.00037676855895196504 0.011751266375545851
DBVPG6044 1000:y 0.99429 0.9905872052401746 0.9902261572052402 0.00036104803493449783 0.009412794759825328
DBVPG6044 100:y 0.994346 0.9920169432314411 0.9917783406113537 0.00023860262008733625 0.007983056768558951
DBVPG6044 10:y 0.994804 0.9931444978165939 0.9930238427947599 0.00012065502183406113 0.0068555021834061135
DBVPG6765 10000:y 0.992895 0.984453537117904 0.9841169868995633 0.00033655021834061135 0.01554646288209607
DBVPG6765 1000:y 0.992816 0.9851402620087336 0.9847942358078603 0.00034602620087336247 0.014859737991266376
DBVPG6765 100:y 0.992857 0.9850624454148471 0.9848960262008734 0.00016641921397379914 0.014937554585152838
DBVPG6765 10:y 0.993555 0.9918473362445415 0.9916482969432314 0.00019903930131004368 0.008152663755458514
S288C 10000:y 0.993815 0.9840108085106383 0.9836786808510638 0.0003321276595744681 0.015989191489361704
S288C 1000:y 0.993819 0.9856704255319149 0.9854043829787235 0.0002660425531914894 0.014329574468085107
S288C 100:y 0.994008 0.9880367234042553 0.9878560425531915 0.0001806808510638298 0.011963276595744681
S288C 10:y 0.994503 0.9923237872340426 0.9922378723404255 0.00008591489361702127 0.007676212765957447
SK1 10000:y 0.994393 0.98889 0.9882832467532467 0.0006067532467532468 0.01111
SK1 1000:y 0.994355 0.9909501731601732 0.9903794805194805 0.0005706926406926407 0.00904982683982684
SK1 100:y 0.994531 0.9920734632034632 0.9916370995670996 0.00043636363636363637 0.007926536796536796
SK1 10:y 0.99508 0.9932187878787879 0.9930288311688311 0.00018995670995670995 0.006781212121212121
UWOPS034614 10000:y 0.993074 0.9854122807017544 0.9849935964912281 0.0004186842105263158 0.014587719298245615
UWOPS034614 1000:y 0.993131 0.9833553070175438 0.9829300438596491 0.00042526315789473683 0.01664469298245614
UWOPS034614 100:y 0.99331 0.9884982894736842 0.9883386842105263 0.00015960526315789473 0.01150171052631579
UWOPS034614 10:y 0.993955 0.9915775 0.991506403508772 0.00007109649122807018 0.0084225
Y12 10000:y 0.994867 0.9878221834061135 0.9873637554585153 0.00045842794759825325 0.012177816593886464
Y12 1000:y 0.994863 0.9892637554585153 0.9888601746724891 0.0004035807860262009 0.010736244541484715
Y12 100:y 0.994997 0.9919159388646288 0.9917003056768559 0.00021563318777292578 0.00808406113537118
Y12 10:y 0.995301 0.9941565065502184 0.9939880786026201 0.00016842794759825327 0.00584349344978166
YPS128 10000:y 0.995545 0.9904155895196507 0.9902230567685589 0.00019253275109170305 0.009584410480349345
YPS128 1000:y 0.99559 0.9909312663755458 0.9907755458515284 0.00015572052401746726 0.009068733624454149
YPS128 100:y 0.995676 0.993891615720524 0.9938559825327511 0.000035633187772925764 0.006108384279475983
YPS128 10:y 0.99591 0.995602576419214 0.995569519650655 0.000033056768558951964 0.004397423580786026
The first numeric column is the alignment identity, derived from the alignment identity GAF field of GraphAligner. All other metrics are described in the metrics section of peanut.
pgge also generates a visualization of the results: pgge_yeast/pgge-l100000-s50000.tsv.png
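In the rows printed above, two relations between the metrics can be observed: qsc equals uniq plus multi, and nonaln equals 1 minus qsc. The following sketch checks them for a results file. The function name `check_metrics` is hypothetical, and the column positions (4: qsc, 5: uniq, 6: multi, 7: nonaln) are read off the example output and are an assumption about the file layout, not a documented guarantee.

```shell
# Check, for every data row, that qsc = uniq + multi and nonaln = 1 - qsc.
# Exits non-zero if any row deviates by more than a small tolerance.
check_metrics() {
  awk -F'\t' '
    function abs(x) { return x < 0 ? -x : x }
    $4 ~ /^[0-9]/ {                      # skip a header line, if present
      if (abs($5 + $6 - $4) > 1e-9 || abs(1 - $4 - $7) > 1e-9) {
        print "inconsistent: " $1 " " $2
        bad = 1
      }
    }
    END { exit bad + 0 }
  ' "$1"
}

# Two rows copied from the output above, written to a temporary file.
tsv=$(mktemp)
printf 'DBVPG6044\t10000:y\t0.994253\t0.9882487336244542\t0.9878719650655022\t0.00037676855895196504\t0.011751266375545851\n' > "$tsv"
printf 'SK1\t10000:y\t0.994393\t0.98889\t0.9882832467532467\t0.0006067532467532468\t0.01111\n' >> "$tsv"

check_metrics "$tsv" && echo "metrics are consistent"
rm -f "$tsv"
```

A check like this can serve as a quick sanity test before the values are plotted or compared across graphs.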
To simplify installation and versioning, we have an automated GitHub action that pushes the current docker build to the GitHub registry. To use it, first pull the current image:
docker pull
Or if you want to pull a specific snapshot from
docker pull
After changing into the pgge directory
git clone --recursive
cd pgge
you can run the container using the DRB1-3123 example provided in this repo:
docker run -it -v ${PWD}/data/:/data pangenome/pgge "pgge -g '/data/HLA/DRB1-3123/*.consensus*.gfa' -f /data/HLA/DRB1-3123/DRB1-3123.fa -r /scripts/beehave.R -t 16 -o /data/HLA/DRB1-3123/pgge_docker -l 1000 -s 1000 -p 100"
In contrast to running pgge directly from the command line, when running in a docker container we have to use ' instead of " around the pattern to ensure that it is parsed properly.
The -v argument of docker run always expects a full path: if you intend to pass a host directory, use an absolute path. This is taken care of by using ${PWD}.
If you want to experiment around, you can build a docker image locally using the Dockerfile:
docker build -t ${USER}/pgge:latest .
Staying in the pgge directory, we can run pgge with the locally built image:
docker run -it -v ${PWD}/data/:/data ${USER}/pgge "pgge -g '/data/HLA/DRB1-3123/*.consensus*.gfa' -f /data/HLA/DRB1-3123/DRB1-3123.fa -r /scripts/beehave.R -t 16 -o /data/HLA/DRB1-3123/pgge_docker -l 1000 -s 1000 -p 100"
- pgge should accept a list of GFA files as input (path/to/files/*.consensus*.gfa) and output the summarized results in one PNG.
- Integrate an option to prepare the input FASTA.
- Add the possibility to split the input by sample name. Later re-use that information in the final result. THIS IS THE NEW DEFAULT.
- Add R script to visualize the result.
- Explain.
- Add option to directly start from a GAF file.
- Add output-folder option.
- Add possibility to input several GAF files. Make sure the user can input a list of samples for the GAFs.
- The user should be able to select options for GraphAligner.
- Add a toolchain that compares the query alignments with the exact nodes they aligned to in the graph.
- Add Dockerfile.
- Add a CI building the Dockerfile and emitting evaluation metrics for all tools using data.
- Add usage examples for further tools, including SibeliaZ.
- Integrate into the nf-core/pangenome pipeline.
Simon Heumos, Andrea Guarracino, Erik Garrison, Christian Fischer