jdhayhurst edited this page Aug 9, 2023 · 9 revisions

GWAS sumstats service

A micro-service for handling summary statistics submissions as part of the GWAS submission/deposition service. The endpoints are not intended to be exposed publicly. Instead, they are intended to be called by the deposition backend when a sumstats submission is made.

It handles the uploaded summary statistics files, validates them, reports errors to the deposition backend and puts valid files in the queue for sumstats file harmonisation and publishing on the FTP.

It is a Flask app exposing the endpoints listed below for registering sumstats, getting sumstats validation statuses and updating sumstats. Operations with the sumstats service are done at the submission level rather than the study level (a submission can contain many studies). Celery worker(s) perform the validation tasks in the background. They can run on any machine where the app is installed and the RabbitMQ queue is reachable.

Specific documentation for GWAS Catalog: https://www.ebi.ac.uk/seqdb/confluence/display/GOCI/Summary+Statistics+Service

Local installation

  • Requires: RabbitMQ and Python 3.6
  • Clone the repository
    • git clone https://github.com/EBISPOT/gwas-sumstats-service.git
    • cd gwas-sumstats-service
  • Set up environment
    • virtualenv --python=python3.6 .env
    • source .env/bin/activate
  • Install
    • pip install .
    • pip install -r requirements.txt

Run the tests

  • Run this to set up a RabbitMQ server, run the tests, and tear it all down:
  • tox

Run as a flask app

  • Spin up a RabbitMQ server on the port (BROKER_PORT) specified in the config, e.g.
    • rabbitmq-server
  • Start the Flask app with gunicorn; it is served at http://localhost:8000
    • from gwas-sumstats-service:
    • gunicorn -b 0.0.0.0:8000 sumstats_service.app:app --log-level=debug
  • Start a celery worker for the database side
    • from gwas-sumstats-service:
    • celery -A sumstats_service.app.celery worker --loglevel=debug --queues=postval
  • Start a celery worker for the validation side
    • from gwas-sumstats-service:
    • celery -A sumstats_service.app.celery worker --loglevel=debug --queues=preval

Run with Docker-compose

  • Spin up the Flask and RabbitMQ and Celery docker containers
    • clone repo as above
    • docker-compose build
    • docker-compose up
  • Start up a celery worker on the machine validating and storing the files
    • follow the local installation as above
    • set BROKER_HOST in config.py to the RabbitMQ host, e.g. localhost
    • celery -A sumstats_service.app.celery worker --queues=preval --loglevel=debug

Deploy with helm (kubernetes)

  • First, deploy rabbitmq using helm
    • helm install --name rabbitmq --namespace rabbitmq --set rabbitmq.username=<user>,service.type=NodePort,service.nodePort=<port> stable/rabbitmq
  • create kubernetes secrets for the ssh keys and Globus
    • kubectl --kubeconfig=<path to config> -n <namespace> create secret generic ssh-keys --from-file=id_rsa=<path/to/id_rsa> --from-file=id_rsa.pub=<path/to/id_rsa.pub> --from-file=known_hosts=<path/to/known_hosts>
    • kubectl --kubeconfig=<path to config> -n gwas create secret generic globus --from-file=refresh-tokens.json=<path/to/refresh-tokens.json>
  • deploy the sumstats service
    • helm install --name gwas-sumstats k8chart/ --wait
  • Start a celery worker from docker
    • docker run -it -d --name sumstats -v /path/to/data/:$INSTALL_PATH/sumstats_service/data -e CELERY_USER=<user> -e CELERY_PASSWORD=<pwd> -e QUEUE_HOST=<host ip> -e QUEUE_PORT=<port> gwas-sumstats-service:latest /bin/bash
    • docker exec sumstats celery -A sumstats_service.app.celery worker --loglevel=debug --queues=preval

Endpoints

/v1/sum-stats (POST)

Register a submission of summary stats. This triggers the summary stats validation. A callback ID is returned, which is used to retrieve the sumstats validation status from the /v1/sum-stats/<callbackID> (GET) endpoint.

POST a payload with the sumstats metadata.

Payload object:

{
	"skipValidation": <Boolean>, # Skip validation entirely: do not look for files or publish any
	"minrows": <Int>, # Minimum number of rows for the sumstats files to be deemed valid
	"forceValid": <Boolean>, # Force the files to be valid
	"zeroPvalue": <Boolean>, # Allow p-values of zero
	"requestEntries": [{
		"id": <study ID>,
		"filePath": <sumstats file path>,
		"md5": <md5 checksum of sumstats file>,
		"assembly": <genome assembly>,
		"readme": <author readme>, # optional
		"entryUUID": <Globus endpoint UUID>
	}]
}
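
Each entry carries the md5 checksum of its sumstats file. As a minimal sketch, assuming the checksum is the standard hex MD5 digest of the file contents (the function name `md5_checksum` is hypothetical and not part of the service), it could be computed like this:

```python
import hashlib


def md5_checksum(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of the file at `path`.

    Hypothetical helper: the file is read in chunks so that large
    summary statistics files do not have to fit in memory.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The returned string would go in the "md5" field of the corresponding requestEntries item.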

Example POST method:

# request
curl -i -H "Content-Type: application/json" -X POST -d '{"requestEntries":[{"id":"abc123","filePath":"formatted_test.tsv","md5":"16e89d9993cad683c3857d754595cb28","assembly":"GRCh38", "readme":"optional text", "entryUUID": "curator_sumstats"},{"id":"bcd234","filePath":"formatted_test.tsv","md5":"16e89d9993cad683c3857d","assembly":"GRCh38", "entryUUID": "curator_sumstats"}]}' http://localhost:8000/v1/sum-stats


# response
HTTP/1.0 201 CREATED
Content-Type: application/json
Content-Length: 26
Server: Werkzeug/0.15.4 Python/3.6.5
Date: Wed, 17 Jul 2019 15:15:23 GMT

{"callbackID": "TiQS2yxV"}
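
The same registration call can be sketched in Python using only the standard library. This is an illustration rather than the deposition backend's actual client: the function names `build_registration_request` and `register_submission` are hypothetical, and the top-level flags are assumed to be optional because the curl example above omits them.

```python
import json
import urllib.request


def build_registration_request(base_url, entries, **options):
    """Build (but do not send) the POST request for /v1/sum-stats.

    `entries` is a list of dicts shaped like the requestEntries items;
    `options` holds optional top-level flags such as minrows or zeroPvalue.
    """
    payload = {"requestEntries": list(entries)}
    payload.update(options)
    return urllib.request.Request(
        f"{base_url}/v1/sum-stats",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def register_submission(base_url, entries, **options):
    """Send the registration request and return the callback ID."""
    request = build_registration_request(base_url, entries, **options)
    with urllib.request.urlopen(request) as response:
        return json.load(response)["callbackID"]
```

For example, `register_submission("http://localhost:8000", entries, minrows=10)` would send the payload to a locally running service and return the callback ID from the response body.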

/v1/sum-stats/<callbackID> (GET)

Request the validation status of a summary stats submission, using the callback ID returned by the POST.

Response object:

{
  "callbackID": <callbackID>,
  "completed": <submission validation status>, # boolean
  "statusList": [ # list of studies/sumstats validation statuses within submission
    {
      "id": <study ID>,
      "status": <validation status>, # options: "VALID"|"INVALID"|"RETRIEVING"
      "error": <validation error message>, # error message string OR null
      "gcst": <GWAS study accession> # optional
    },
    {
      "id": <study ID>,
      "status": <validation status>,
      "error": <validation error message>,
      "gcst": <GWAS study accession>
    }
  ]
}

Example GET method (using callback id from above):

# request
curl http://localhost:8000/v1/sum-stats/TiQS2yxV

# response
{
  "callbackID": "TiQS2yxV",
  "completed": false,
  "statusList": [
    {
      "id": "abc123",
      "status": "VALID",
      "error": null
    },
    {
      "id": "bcd234",
      "status": "INVALID",
      "error": "md5sum did not match the one provided"
    }
  ]
}
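
A caller typically needs to know whether validation has finished and which studies failed. The sketch below assumes only the response shape documented above; the function name `summarise_status` is hypothetical.

```python
import json


def summarise_status(response_body):
    """Return (completed, errors) for a GET /v1/sum-stats/<callbackID> response.

    `errors` maps the ID of each study whose status is "INVALID"
    to its validation error message.
    """
    status = json.loads(response_body)
    errors = {
        study["id"]: study["error"]
        for study in status["statusList"]
        if study["status"] == "INVALID"
    }
    return status["completed"], errors
```

Applied to the example response above, this reports that the submission is not yet completed and that study bcd234 failed the md5 check.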

/v1/sum-stats/<callbackID> (PUT)

Update the already registered sumstats submission with GCST accessions. This request will trigger two actions:

  1. Assign GCSTs to the studies in the submission
  2. Stage the summary stats in the submission for publication on the FTP and queue for harmonisation

Payload object:

{
	"pmid": <pubmed ID>, # optional
	"authorName": <author name>, # optional: (FullNameStandard)
	"studyList": [{
			"id": <study ID>, 
			"gcst": <GCST accession ID> 
		},
		{
			"id": <study ID>,
			"gcst": <GCST accession ID>
		}
	]
}

Example PUT request:

# request
curl -i -H "Content-Type: application/json" -X PUT -d ' {"studyList": [{"id": "xyz321","gcst": "GCST123456"},{"id": "abc123","gcst":"GCST234567"}]}' http://localhost:8000/v1/sum-stats/TiQS2yxV
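
An equivalent PUT request can be built in Python with the standard library. This is a sketch under stated assumptions: the function name `build_accession_request` and its parameters are hypothetical, and the optional pmid and authorName fields are only included when supplied.

```python
import json
import urllib.request


def build_accession_request(base_url, callback_id, accessions,
                            pmid=None, author_name=None):
    """Build (but do not send) the PUT request that assigns GCST accessions.

    `accessions` maps study IDs to GCST accession IDs.
    """
    payload = {
        "studyList": [
            {"id": study_id, "gcst": gcst}
            for study_id, gcst in accessions.items()
        ]
    }
    # Optional fields are left out of the payload unless provided.
    if pmid is not None:
        payload["pmid"] = pmid
    if author_name is not None:
        payload["authorName"] = author_name
    return urllib.request.Request(
        f"{base_url}/v1/sum-stats/{callback_id}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it to a running service.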