diff --git a/Contribute-Docs.md b/Contribute-Docs.md new file mode 100644 index 000000000..58e39bc8a --- /dev/null +++ b/Contribute-Docs.md @@ -0,0 +1,27 @@ +Contributions towards documentations and examples are always welcome! + +Documentation is built using [Sphinx.](https://www.sphinx-doc.org/en/master/). If you've ever seen any docs over at [readthedocs](https://readthedocs.org/) then you've likely seen some examples of sphinx out in the wild. + +## Build the Documentation + +These instructions require that you have docker installed. The best way to do that is to follow the installation instructions at [Get Docker](https://docs.docker.com/get-docker/). The upside to this is that you don't need to clobber any existing conda environments in order to build your docs. + +You don't need to have any particular understanding of docker to run these commands. We are treating the docker image as a shell. + +``` +docker build -t sphinx-sgkit -f docs/Dockerfile . +cd docs +docker run --rm -i -v "$(pwd):/docs" sphinx-sgkit make clean html +``` + +## Serve the Documentation + +Now that we've run built the docs let's view them in their native html state. + +``` +# You can use any port you'd like instead of 8080 +docker run -p 8080:80 -v "$(pwd)/_build/html:/usr/share/nginx/html:ro" nginx +``` + +Now open up localhost:8080 in your browser and you'll see the docs just as they appear on the docs website. + diff --git a/docs/Dockerfile b/docs/Dockerfile new file mode 100644 index 000000000..75879ea91 --- /dev/null +++ b/docs/Dockerfile @@ -0,0 +1,19 @@ +FROM continuumio/miniconda3 + +RUN apt-get update \ + && apt-get install --no-install-recommends -y \ + graphviz \ + imagemagick \ + make \ + git \ + && apt-get autoremove \ + && apt-get clean \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /docs +ADD requirements-dev.txt /docs/ + +RUN conda install -y -c conda-forge scikit-allel sphinx nbsphinx pip pandoc && \ + pip3 install -r requirements-dev.txt && \ + pip3 install git+https://github.com/pystatgen/sgkit@96203d471531e7e2416d4dd9b48ca11d660a1bcc + diff --git a/docs/examples.rst b/docs/examples.rst new file mode 100644 index 000000000..1315aa546 --- /dev/null +++ b/docs/examples.rst @@ -0,0 +1,12 @@ +Examples +======== + +Understanding the Xarray Genotype Call Dataset +********************************************** + +.. toctree:: + :maxdepth: 1 + + examples/understanding-genotype-call-xarray-dataset/Genotype-Call-Dataset-From-VCF + examples/understanding-genotype-call-xarray-dataset/Genotype-Call-Dataset-From-SGKit-Zarr + examples/understanding-genotype-call-xarray-dataset/Genotype-Call-Dataset-Minimal-Numpy-Example diff --git a/docs/examples/notebooks/Genotype-Call-Dataset-From-SGKit-Zarr.ipynb b/docs/examples/notebooks/Genotype-Call-Dataset-From-SGKit-Zarr.ipynb new file mode 100644 index 000000000..6e63cc5a1 --- /dev/null +++ b/docs/examples/notebooks/Genotype-Call-Dataset-From-SGKit-Zarr.ipynb @@ -0,0 +1,1153 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Load From Malaria Gen Zarr\n", + "\n", + "A central point to the SGkit API is the Genotype Call Dataset. This is the data structure that most of the other functions use. It uses [Xarray](http://xarray.pydata.org/en/stable/) underneath the hood to give a programmatic interface that allows for the backend to be several different data files.\n", + "\n", + "The Xarray itself is *sort of* a transposed VCF file.\n", + "\n", + "For this example we are going to from the preprocessed zarr to the sgkit Genotype Call XArray Dataset.\n", + "\n", + "This is only meant to demonstrate the datatypes that we feed into the Xarray dataset. For a more conceptual understanding please check out the `Genotype-Call-Dataset-From-VCF.ipynb`." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "import numpy as np\n", + "import zarr\n", + "import pandas as pd\n", + "import dask.array as da\n", + "import allel\n", + "from pprint import pprint\n", + "import matplotlib.pyplot as plt\n", + "%matplotlib inline" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Create a Dask Cluster\n", + "\n", + "This isn't that important for this example, but SGkit can use Dask under the hood for many of it's calculations. Divide and conquer your statistical genomics data!" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "60ad30bcd7044d6fb7f8fd803c140a26", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "VBox(children=(HTML(value='

KubeCluster

'), HBox(children=(HTML(value='\\n
\\n
xarray.Dataset
" + ], + "text/plain": [ + "\n", + "Dimensions: (alleles: 2, ploidy: 2, samples: 1142, variants: 10000)\n", + "Dimensions without coordinates: alleles, ploidy, samples, variants\n", + "Data variables:\n", + " variant/contig (variants) int64 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0\n", + " variant/position (variants) int32 9526 9531 9536 ... 64411 64416 64418\n", + " variant/alleles (variants, alleles) |S1 b'A' b'G' b'A' ... b'T' b'C'\n", + " sample/id (samples) chr2.vcf\n", + "# ls -lah chr2.vcf" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Grab Some Data\n", + "\n", + "We're going to grab a small subset of a VCF file from the [1000 Genomes Project.](https://www.internationalgenome.org/faq/how-do-i-get-sub-section-vcf-file/). We're only going to grab 3 calls, which is fine for our purposes.\n", + "\n", + "These calls are also already biallelic. I cheat. ;-)" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "# I couldn't run this from jupyterhub but needed an actual terminal\n", + "#! conda activate samtools \n", + "#! tabix -h ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz 2:39967768-39967800 > chr2.vcf\n", + "#! conda deactivate\n", + "# ls -lah chr2.vcf" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [], + "source": [ + "import vcf\n", + "import os" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's write up a quick, *not to be used in the real world*, parser to grab data about the variants.\n", + "\n", + "* Variant Contig Names - A unique list of all the chromosomes and contigs\n", + "* Variant Contig - an index of the variant_contig_names list.\n", + "* Variant Position - Position on the chromosome\n", + "* Variant Reference and Alternate\n", + "* Samples\n", + "* Genotype calls per sample - with missing encoded as -1\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [], + "source": [ + "vcf_reader = vcf.Reader(open('/home/jovyan/chr2.vcf', 'r'))\n", + "\n", + "# I already know these come from chr2\n", + "# but let's grab them anyways\n", + "variant_contig_names = []\n", + "\n", + "variant_chrom = []\n", + "variant_position = []\n", + "variant_alleles = []\n", + "variant_contig = []\n", + "\n", + "sample_id = []\n", + "call_genotype = []\n", + "\n", + "count = 0\n", + "\n", + "for record in vcf_reader:\n", + " \n", + " chrom = str(record.CHROM)\n", + " if chrom not in variant_contig_names:\n", + " variant_contig_names.append(chrom)\n", + " \n", + " # Grab the index of the contig\n", + " variant_contig.append(variant_contig_names.index(chrom))\n", + " \n", + " # Get the variant data\n", + " # I'm cheating and only getting the first alternate. In the real world you would filter for biallelic variants.\n", + " variant_alleles.append([str(record.REF), str(record.ALT[0])])\n", + " variant_position.append(record.POS)\n", + " \n", + " # the sample records is an object that has call data \n", + " samples = record.samples\n", + " \n", + " # Grab the sample names\n", + " if count == 0:\n", + " for sample in samples:\n", + " sample_id.append(sample.sample)\n", + " \n", + " # Grab the call data for each sample for the variant\n", + " variant_genotypes = []\n", + " for sample in samples:\n", + " # If its missing encode as -1, -1\n", + " if sample['GT'] == './.':\n", + " variant_genotypes.append([-1, -1])\n", + " else:\n", + " GT = sample['GT'].split('|')\n", + " variant_genotypes.append([int(GT[0]), int(GT[1])])\n", + " \n", + " call_genotype.append(variant_genotypes)\n", + " count = count + 1" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Convert to Numpy\n", + "\n", + "Now that we have our data, we need to prepare for our XArray dataset by converting these to Numpy arrays.\n", + "\n", + "If you're wondering how I know what these are you can check out the `sgkit.api.create_genotype_call_dataset`. The exact functions are `check_array_like` and make sure that these are numpy arrays of a particular type.\n", + "\n", + "```\n", + "check_array_like(variant_contig, kind=\"i\", ndim=1)\n", + "check_array_like(variant_position, kind=\"i\", ndim=1)\n", + "check_array_like(variant_alleles, kind=\"S\", ndim=2)\n", + "check_array_like(sample_id, kind=\"U\", ndim=1)\n", + "check_array_like(call_genotype, kind=\"i\", ndim=3)\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [], + "source": [ + "sample_id = np.array(sample_id, dtype='U')\n", + "variant_position = np.array(variant_position, dtype='i')\n", + "variant_alleles = np.array(variant_alleles, dtype='S')\n", + "variant_contig_names = np.array(variant_contig_names, dtype='S')\n", + "variant_contig = np.array(variant_contig, dtype='i')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Understanding Variant Contig and Variant Position\n", + "\n", + "The Genotype Call Xarray dataset is meant to be able to incorporate multiple chromosomes.\n", + "\n", + "Let's say we have variant calls from chrs 1 and 2, which we read into an array `['chr1','chr2']`." + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
variant_contig_indexvariant_position
001
102
211
312
\n", + "
" + ], + "text/plain": [ + " variant_contig_index variant_position\n", + "0 0 1\n", + "1 0 2\n", + "2 1 1\n", + "3 1 2" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "contigs = ['chr1', 'chr2']\n", + " \n", + "df = pd.DataFrame({\n", + " 'variant_contig_index': [0, 0, 1, 1],\n", + " 'variant_position': [1, 2, 1, 2],\n", + " })\n", + "df" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The Xarray dataset looks like the dataframe above. \n", + "\n", + "When we initialize the Xarray dataset we will give it a list of contigs (or chromosomes). We don't need to explicitly list the contig per position because we can calculate this based on the contig index.\n", + "\n", + "**Contig**: `contigs[row['variant_contig_index']]`\n", + "\n", + "**Position**: `row['variant_position']`" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
variant_contig_indexvariant_positiondescription
001Chr: chr1 Pos: 1
102Chr: chr1 Pos: 2
211Chr: chr2 Pos: 1
312Chr: chr2 Pos: 2
\n", + "
" + ], + "text/plain": [ + " variant_contig_index variant_position description\n", + "0 0 1 Chr: chr1 Pos: 1\n", + "1 0 2 Chr: chr1 Pos: 2\n", + "2 1 1 Chr: chr2 Pos: 1\n", + "3 1 2 Chr: chr2 Pos: 2" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "def return_contig(row):\n", + " return 'Chr: {chr} Pos: {pos}'.format(chr=contigs[row['variant_contig_index']], pos=row['variant_position'])\n", + "\n", + "df['description'] = df.apply(lambda row: return_contig(row), axis=1)\n", + "\n", + "df" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Genotype Calls\n", + "\n", + "If we've done our work right we our genotypes should have the shape: `[DIM_VARIANT, DIM_SAMPLE, DIM_PLOIDY]`, meaning the first axis is the number of variants, the second the number of samples, and the third the ploidy. In our case we are working with diploid alleles.\n", + "\n", + "Our genotype array has this structure:\n", + "\n", + "```python\n", + "genotypes = [\n", + "\n", + " # Outermost array should have a length = the number of variants\n", + " \n", + " # variant chr 1 position 1\n", + " [\n", + " # Per variant we should have an array length = number of samples\n", + " \n", + " # sample 1 \n", + " # Per sample we should have an array length = number of alleles\n", + " [call, call],\n", + " \n", + " # sample 2\n", + " [call, call]\n", + " ],\n", + " \n", + " # variant chr 1 position 2\n", + " [\n", + " # sample 1 \n", + " [call, call],\n", + " # sample 2\n", + " [call, call]\n", + " ],\n", + " \n", + "]\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(3, 629, 2)" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "call_genotype = np.array(call_genotype, dtype='i')\n", + "call_genotype.shape" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This is correct! We have 3 variants, 629 samples, and diploid alleles." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Convert to Genotype Call Dataset\n", + "\n", + "Finally! Let's convert this to the Genotype Call Dataset!" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([[b'T', b'A'],\n", + " [b'G', b'C'],\n", + " [b'C', b'T']], dtype='|S1')" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "variant_alleles" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [], + "source": [ + "import sgkit\n", + "\n", + "genotype_xarray_dataset = sgkit.api.create_genotype_call_dataset(\n", + " variant_contig_names = variant_contig_names,\n", + " # Since we know these are all from the same chromosome we could just calculate this on the fly as a np array of zeros\n", + " #variant_contig = np.zeros(len(variant_position)),\n", + " variant_contig = variant_contig,\n", + " variant_position = variant_position,\n", + " variant_alleles = variant_alleles,\n", + " sample_id = sample_id,\n", + " call_genotype = call_genotype,\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + "Show/Hide data repr\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "Show/Hide attributes\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
xarray.Dataset
    • alleles: 2
    • ploidy: 2
    • samples: 629
    • variants: 3
      • variant/contig
        (variants)
        int32
        0 0 0
        array([0, 0, 0], dtype=int32)
      • variant/position
        (variants)
        int32
        39967768 39967778 39967793
        array([39967768, 39967778, 39967793], dtype=int32)
      • variant/alleles
        (variants, alleles)
        |S1
        b'T' b'A' b'G' b'C' b'C' b'T'
        array([[b'T', b'A'],\n",
        +       "       [b'G', b'C'],\n",
        +       "       [b'C', b'T']], dtype='|S1')
      • sample/id
        (samples)
        <U7
        'HG00098' 'HG00100' ... 'NA20828'
        array(['HG00098', 'HG00100', 'HG00106', 'HG00112', 'HG00114', 'HG00116',\n",
        +       "       'HG00117', 'HG00118', 'HG00119', 'HG00120', 'HG00122', 'HG00123',\n",
        +       "       'HG00124', 'HG00126', 'HG00131', 'HG00141', 'HG00142', 'HG00143',\n",
        +       "       'HG00144', 'HG00145', 'HG00146', 'HG00147', 'HG00148', 'HG00149',\n",
        +       "       'HG00150', 'HG00151', 'HG00152', 'HG00153', 'HG00156', 'HG00158',\n",
        +       "       'HG00159', 'HG00160', 'HG00171', 'HG00173', 'HG00174', 'HG00176',\n",
        +       "       'HG00177', 'HG00178', 'HG00179', 'HG00180', 'HG00181', 'HG00182',\n",
        +       "       'HG00183', 'HG00185', 'HG00186', 'HG00187', 'HG00188', 'HG00189',\n",
        +       "       'HG00190', 'HG00231', 'HG00239', 'HG00242', 'HG00243', 'HG00244',\n",
        +       "       'HG00245', 'HG00247', 'HG00258', 'HG00262', 'HG00264', 'HG00265',\n",
        +       "       'HG00266', 'HG00267', 'HG00269', 'HG00270', 'HG00272', 'HG00306',\n",
        +       "       'HG00308', 'HG00311', 'HG00312', 'HG00357', 'HG00361', 'HG00366',\n",
        +       "       'HG00367', 'HG00368', 'HG00369', 'HG00372', 'HG00373', 'HG00377',\n",
        +       "       'HG00380', 'HG00403', 'HG00404', 'HG00406', 'HG00407', 'HG00445',\n",
        +       "       'HG00446', 'HG00452', 'HG00457', 'HG00553', 'HG00554', 'HG00559',\n",
        +       "       'HG00560', 'HG00565', 'HG00566', 'HG00577', 'HG00578', 'HG00592',\n",
        +       "       'HG00593', 'HG00596', 'HG00610', 'HG00611', 'HG00625', 'HG00626',\n",
        +       "       'HG00628', 'HG00629', 'HG00634', 'HG00635', 'HG00637', 'HG00638',\n",
        +       "       'HG00640', 'NA06984', 'NA06985', 'NA06986', 'NA06989', 'NA06994',\n",
        +       "       'NA07000', 'NA07037', 'NA07048', 'NA07051', 'NA07056', 'NA07346',\n",
        +       "       'NA07347', 'NA07357', 'NA10847', 'NA10851', 'NA11829', 'NA11830',\n",
        +       "       'NA11831', 'NA11832', 'NA11840', 'NA11843', 'NA11881', 'NA11892',\n",
        +       "       'NA11893', 'NA11894', 'NA11918', 'NA11919', 'NA11920', 'NA11930',\n",
        +       "       'NA11931', 'NA11932', 'NA11933', 'NA11992', 'NA11993', 'NA11994',\n",
        +       "       'NA11995', 'NA12003', 'NA12004', 'NA12005', 'NA12006', 'NA12043',\n",
        +       "       'NA12044', 'NA12045', 'NA12046', 'NA12058', 'NA12144', 'NA12154',\n",
        +       "       'NA12155', 'NA12156', 'NA12249', 'NA12272', 'NA12273', 'NA12275',\n",
        +       "       'NA12287', 'NA12340', 'NA12341', 'NA12342', 'NA12347', 'NA12348',\n",
        +       "       'NA12383', 'NA12399', 'NA12400', 'NA12413', 'NA12414', 'NA12489',\n",
        +       "       'NA12546', 'NA12716', 'NA12717', 'NA12718', 'NA12749', 'NA12750',\n",
        +       "       'NA12751', 'NA12761', 'NA12762', 'NA12763', 'NA12775', 'NA12776',\n",
        +       "       'NA12777', 'NA12778', 'NA12812', 'NA12813', 'NA12814', 'NA12815',\n",
        +       "       'NA12828', 'NA12830', 'NA12872', 'NA12873', 'NA12874', 'NA12889',\n",
        +       "       'NA12890', 'NA18486', 'NA18487', 'NA18489', 'NA18498', 'NA18499',\n",
        +       "       'NA18501', 'NA18502', 'NA18504', 'NA18505', 'NA18507', 'NA18508',\n",
        +       "       'NA18510', 'NA18511', 'NA18516', 'NA18517', 'NA18519', 'NA18520',\n",
        +       "       'NA18522', 'NA18523', 'NA18525', 'NA18526', 'NA18527', 'NA18532',\n",
        +       "       'NA18535', 'NA18537', 'NA18538', 'NA18539', 'NA18541', 'NA18542',\n",
        +       "       'NA18545', 'NA18547', 'NA18550', 'NA18552', 'NA18553', 'NA18555',\n",
        +       "       'NA18558', 'NA18560', 'NA18561', 'NA18562', 'NA18563', 'NA18564',\n",
        +       "       'NA18565', 'NA18566', 'NA18567', 'NA18570', 'NA18571', 'NA18572',\n",
        +       "       'NA18573', 'NA18574', 'NA18576', 'NA18577', 'NA18579', 'NA18582',\n",
        +       "       'NA18592', 'NA18593', 'NA18603', 'NA18605', 'NA18608', 'NA18609',\n",
        +       "       'NA18611', 'NA18612', 'NA18614', 'NA18615', 'NA18616', 'NA18617',\n",
        +       "       'NA18618', 'NA18619', 'NA18620', 'NA18621', 'NA18622', 'NA18623',\n",
        +       "       'NA18624', 'NA18625', 'NA18626', 'NA18627', 'NA18628', 'NA18630',\n",
        +       "       'NA18631', 'NA18632', 'NA18633', 'NA18634', 'NA18636', 'NA18638',\n",
        +       "       'NA18640', 'NA18642', 'NA18643', 'NA18745', 'NA18853', 'NA18856',\n",
        +       "       'NA18858', 'NA18861', 'NA18867', 'NA18868', 'NA18870', 'NA18871',\n",
        +       "       'NA18873', 'NA18874', 'NA18907', 'NA18908', 'NA18909', 'NA18910',\n",
        +       "       'NA18912', 'NA18916', 'NA18940', 'NA18941', 'NA18942', 'NA18943',\n",
        +       "       'NA18944', 'NA18945', 'NA18947', 'NA18948', 'NA18949', 'NA18950',\n",
        +       "       'NA18951', 'NA18952', 'NA18953', 'NA18955', 'NA18956', 'NA18959',\n",
        +       "       'NA18960', 'NA18961', 'NA18963', 'NA18964', 'NA18965', 'NA18967',\n",
        +       "       'NA18968', 'NA18970', 'NA18971', 'NA18972', 'NA18973', 'NA18974',\n",
        +       "       'NA18975', 'NA18976', 'NA18977', 'NA18979', 'NA18980', 'NA18981',\n",
        +       "       'NA18982', 'NA18983', 'NA18984', 'NA18985', 'NA18986', 'NA18987',\n",
        +       "       'NA18988', 'NA18989', 'NA18990', 'NA18997', 'NA18999', 'NA19000',\n",
        +       "       'NA19001', 'NA19002', 'NA19003', 'NA19004', 'NA19005', 'NA19007',\n",
        +       "       'NA19009', 'NA19010', 'NA19012', 'NA19027', 'NA19044', 'NA19054',\n",
        +       "       'NA19055', 'NA19056', 'NA19057', 'NA19058', 'NA19059', 'NA19060',\n",
        +       "       'NA19062', 'NA19063', 'NA19064', 'NA19065', 'NA19066', 'NA19067',\n",
        +       "       'NA19068', 'NA19070', 'NA19072', 'NA19074', 'NA19075', 'NA19076',\n",
        +       "       'NA19077', 'NA19078', 'NA19079', 'NA19082', 'NA19083', 'NA19084',\n",
        +       "       'NA19085', 'NA19086', 'NA19087', 'NA19088', 'NA19093', 'NA19098',\n",
        +       "       'NA19099', 'NA19102', 'NA19107', 'NA19108', 'NA19113', 'NA19114',\n",
        +       "       'NA19116', 'NA19119', 'NA19129', 'NA19130', 'NA19131', 'NA19137',\n",
        +       "       'NA19138', 'NA19141', 'NA19143', 'NA19144', 'NA19147', 'NA19152',\n",
        +       "       'NA19153', 'NA19159', 'NA19160', 'NA19171', 'NA19172', 'NA19184',\n",
        +       "       'NA19189', 'NA19190', 'NA19200', 'NA19201', 'NA19204', 'NA19206',\n",
        +       "       'NA19207', 'NA19209', 'NA19210', 'NA19213', 'NA19225', 'NA19235',\n",
        +       "       'NA19236', 'NA19247', 'NA19248', 'NA19256', 'NA19257', 'NA19311',\n",
        +       "       'NA19312', 'NA19313', 'NA19314', 'NA19332', 'NA19334', 'NA19338',\n",
        +       "       'NA19346', 'NA19347', 'NA19350', 'NA19355', 'NA19359', 'NA19360',\n",
        +       "       'NA19371', 'NA19372', 'NA19375', 'NA19376', 'NA19377', 'NA19379',\n",
        +       "       'NA19381', 'NA19382', 'NA19383', 'NA19384', 'NA19385', 'NA19390',\n",
        +       "       'NA19391', 'NA19393', 'NA19394', 'NA19395', 'NA19397', 'NA19398',\n",
        +       "       'NA19399', 'NA19401', 'NA19404', 'NA19428', 'NA19429', 'NA19434',\n",
        +       "       'NA19435', 'NA19436', 'NA19437', 'NA19438', 'NA19439', 'NA19440',\n",
        +       "       'NA19443', 'NA19444', 'NA19445', 'NA19446', 'NA19448', 'NA19449',\n",
        +       "       'NA19451', 'NA19452', 'NA19453', 'NA19455', 'NA19456', 'NA19457',\n",
        +       "       'NA19461', 'NA19462', 'NA19463', 'NA19466', 'NA19467', 'NA19469',\n",
        +       "       'NA19471', 'NA19472', 'NA19473', 'NA19474', 'NA19625', 'NA19648',\n",
        +       "       'NA19649', 'NA19651', 'NA19652', 'NA19654', 'NA19655', 'NA19658',\n",
        +       "       'NA19660', 'NA19661', 'NA19678', 'NA19684', 'NA19685', 'NA19700',\n",
        +       "       'NA19701', 'NA19703', 'NA19704', 'NA19707', 'NA19712', 'NA19713',\n",
        +       "       'NA19720', 'NA19722', 'NA19723', 'NA19725', 'NA19726', 'NA19818',\n",
        +       "       'NA19819', 'NA19834', 'NA19835', 'NA19900', 'NA19901', 'NA19904',\n",
        +       "       'NA19908', 'NA19909', 'NA19914', 'NA19916', 'NA19917', 'NA19920',\n",
        +       "       'NA19921', 'NA19982', 'NA20414', 'NA20502', 'NA20505', 'NA20508',\n",
        +       "       'NA20509', 'NA20510', 'NA20512', 'NA20515', 'NA20516', 'NA20517',\n",
        +       "       'NA20518', 'NA20519', 'NA20520', 'NA20521', 'NA20522', 'NA20524',\n",
        +       "       'NA20525', 'NA20526', 'NA20527', 'NA20528', 'NA20529', 'NA20530',\n",
        +       "       'NA20531', 'NA20532', 'NA20533', 'NA20534', 'NA20535', 'NA20536',\n",
        +       "       'NA20537', 'NA20538', 'NA20539', 'NA20540', 'NA20541', 'NA20542',\n",
        +       "       'NA20543', 'NA20544', 'NA20581', 'NA20582', 'NA20585', 'NA20586',\n",
        +       "       'NA20588', 'NA20589', 'NA20752', 'NA20753', 'NA20754', 'NA20755',\n",
        +       "       'NA20756', 'NA20757', 'NA20758', 'NA20759', 'NA20760', 'NA20761',\n",
        +       "       'NA20765', 'NA20769', 'NA20770', 'NA20771', 'NA20772', 'NA20773',\n",
        +       "       'NA20774', 'NA20775', 'NA20778', 'NA20783', 'NA20785', 'NA20786',\n",
        +       "       'NA20787', 'NA20790', 'NA20792', 'NA20795', 'NA20796', 'NA20797',\n",
        +       "       'NA20798', 'NA20799', 'NA20800', 'NA20801', 'NA20802', 'NA20803',\n",
        +       "       'NA20804', 'NA20805', 'NA20806', 'NA20807', 'NA20808', 'NA20809',\n",
        +       "       'NA20810', 'NA20811', 'NA20812', 'NA20813', 'NA20814', 'NA20815',\n",
        +       "       'NA20816', 'NA20818', 'NA20819', 'NA20826', 'NA20828'], dtype='<U7')
      • call/genotype
        (variants, samples, ploidy)
        int32
        0 0 0 0 1 1 0 1 ... 0 0 0 0 0 0 0 0
        array([[[ 0,  0],\n",
        +       "        [ 0,  0],\n",
        +       "        [ 1,  1],\n",
        +       "        ...,\n",
        +       "        [ 0,  1],\n",
        +       "        [ 1,  1],\n",
        +       "        [ 1,  0]],\n",
        +       "\n",
        +       "       [[-1, -1],\n",
        +       "        [-1, -1],\n",
        +       "        [-1, -1],\n",
        +       "        ...,\n",
        +       "        [-1, -1],\n",
        +       "        [-1, -1],\n",
        +       "        [-1, -1]],\n",
        +       "\n",
        +       "       [[ 0,  0],\n",
        +       "        [ 0,  0],\n",
        +       "        [ 0,  0],\n",
        +       "        ...,\n",
        +       "        [ 0,  0],\n",
        +       "        [ 0,  0],\n",
        +       "        [ 0,  0]]], dtype=int32)
      • call/genotype_mask
        (variants, samples, ploidy)
        bool
        False False False ... False False
        array([[[False, False],\n",
        +       "        [False, False],\n",
        +       "        [False, False],\n",
        +       "        ...,\n",
        +       "        [False, False],\n",
        +       "        [False, False],\n",
        +       "        [False, False]],\n",
        +       "\n",
        +       "       [[ True,  True],\n",
        +       "        [ True,  True],\n",
        +       "        [ True,  True],\n",
        +       "        ...,\n",
        +       "        [ True,  True],\n",
        +       "        [ True,  True],\n",
        +       "        [ True,  True]],\n",
        +       "\n",
        +       "       [[False, False],\n",
        +       "        [False, False],\n",
        +       "        [False, False],\n",
        +       "        ...,\n",
        +       "        [False, False],\n",
        +       "        [False, False],\n",
        +       "        [False, False]]])
    • contigs :
      [b'2']
    " + ], + "text/plain": [ + "\n", + "Dimensions: (alleles: 2, ploidy: 2, samples: 629, variants: 3)\n", + "Dimensions without coordinates: alleles, ploidy, samples, variants\n", + "Data variables:\n", + " variant/contig (variants) int32 0 0 0\n", + " variant/position (variants) int32 39967768 39967778 39967793\n", + " variant/alleles (variants, alleles) |S1 b'T' b'A' b'G' b'C' b'C' b'T'\n", + " sample/id (samples) \n", + "\n", + "\n", + "Show/Hide data repr\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "Show/Hide attributes\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
    xarray.Dataset
      • alleles: 2
      • ploidy: 2
      • samples: 1
      • variants: 1
        • variant/contig
          (variants)
          int32
          0
          array([0], dtype=int32)
        • variant/position
          (variants)
          int32
          1
          array([1], dtype=int32)
        • variant/alleles
          (variants, alleles)
          |S1
          b'A' b'T'
          array([[b'A', b'T']], dtype='|S1')
        • sample/id
          (samples)
          <U8
          'sample-1'
          array(['sample-1'], dtype='<U8')
        • call/genotype
          (variants, samples, ploidy)
          int32
          0 0
          array([[[0, 0]]], dtype=int32)
        • call/genotype_mask
          (variants, samples, ploidy)
          bool
          False False
          array([[[False, False]]])
      • contigs :
        ['3R']
      " + ], + "text/plain": [ + "\n", + "Dimensions: (alleles: 2, ploidy: 2, samples: 1, variants: 1)\n", + "Dimensions without coordinates: alleles, ploidy, samples, variants\n", + "Data variables:\n", + " variant/contig (variants) int32 0\n", + " variant/position (variants) int32 1\n", + " variant/alleles (variants, alleles) |S1 b'A' b'T'\n", + " sample/id (samples) `__ underneath the hood to +give a programmatic interface that allows for the backend to be several +different data files. + +The Xarray itself is *sort of* a transposed VCF file. + +For this example we are going to from the preprocessed zarr to the sgkit +Genotype Call XArray Dataset. + +This is only meant to demonstrate the datatypes that we feed into the +Xarray dataset. For a more conceptual understanding please check out the +``Genotype-Call-Dataset-From-VCF.ipynb``. + +.. code:: ipython3 + + import numpy as np + import zarr + import pandas as pd + import dask.array as da + import allel + from pprint import pprint + import matplotlib.pyplot as plt + %matplotlib inline + +Create a Dask Cluster +--------------------- + +This isn’t that important for this example, but SGkit can use Dask under +the hood for many of it’s calculations. Divide and conquer your +statistical genomics data! + +.. code:: ipython3 + + from dask_kubernetes import KubeCluster + cluster = KubeCluster(n_workers=30, silence_logs='error') + cluster + + + +.. parsed-literal:: + + VBox(children=(HTML(value='

      KubeCluster

      '), HBox(children=(HTML(value='\n
      \n
      xarray.Dataset
        • alleles: 2
        • ploidy: 2
        • samples: 1142
        • variants: 10000
          • variant/contig
            (variants)
            int64
            0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0
            array([0, 0, 0, ..., 0, 0, 0])
          • variant/position
            (variants)
            int32
            9526 9531 9536 ... 64416 64418
            array([ 9526,  9531,  9536, ..., 64411, 64416, 64418], dtype=int32)
          • variant/alleles
            (variants, alleles)
            |S1
            b'A' b'G' b'A' ... b'T' b'T' b'C'
            array([[b'A', b'G'],
            +           [b'A', b'T'],
            +           [b'T', b'C'],
            +           ...,
            +           [b'A', b'T'],
            +           [b'G', b'T'],
            +           [b'T', b'C']], dtype='|S1')
          • sample/id
            (samples)
            <U8
            'AA0040-C' ... 'AY0091-C'
            array(['AA0040-C', 'AA0041-C', 'AA0042-C', ..., 'AY0089-C', 'AY0090-C',
            +           'AY0091-C'], dtype='<U8')
          • call/genotype
            (variants, samples, ploidy)
            int8
            0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0
            array([[[0, 0],
            +            [0, 0],
            +            [0, 0],
            +            ...,
            +            [0, 0],
            +            [0, 0],
            +            [0, 0]],
            +    
            +           [[0, 0],
            +            [0, 0],
            +            [0, 0],
            +            ...,
            +            [0, 0],
            +            [0, 0],
            +            [0, 0]],
            +    
            +           [[0, 0],
            +            [0, 0],
            +            [0, 0],
            +            ...,
            +            [0, 0],
            +            [0, 0],
            +            [0, 0]],
            +    
            +           ...,
            +    
            +           [[0, 0],
            +            [0, 0],
            +            [0, 0],
            +            ...,
            +            [0, 0],
            +            [0, 0],
            +            [0, 0]],
            +    
            +           [[0, 0],
            +            [0, 0],
            +            [0, 0],
            +            ...,
            +            [0, 0],
            +            [0, 0],
            +            [0, 0]],
            +    
            +           [[0, 0],
            +            [0, 0],
            +            [0, 0],
            +            ...,
            +            [0, 0],
            +            [0, 0],
            +            [0, 0]]], dtype=int8)
          • call/genotype_mask
            (variants, samples, ploidy)
            bool
            False False False ... False False
            array([[[False, False],
            +            [False, False],
            +            [False, False],
            +            ...,
            +            [False, False],
            +            [False, False],
            +            [False, False]],
            +    
            +           [[False, False],
            +            [False, False],
            +            [False, False],
            +            ...,
            +            [False, False],
            +            [False, False],
            +            [False, False]],
            +    
            +           [[False, False],
            +            [False, False],
            +            [False, False],
            +            ...,
            +            [False, False],
            +            [False, False],
            +            [False, False]],
            +    
            +           ...,
            +    
            +           [[False, False],
            +            [False, False],
            +            [False, False],
            +            ...,
            +            [False, False],
            +            [False, False],
            +            [False, False]],
            +    
            +           [[False, False],
            +            [False, False],
            +            [False, False],
            +            ...,
            +            [False, False],
            +            [False, False],
            +            [False, False]],
            +    
            +           [[False, False],
            +            [False, False],
            +            [False, False],
            +            ...,
            +            [False, False],
            +            [False, False],
            +            [False, False]]])
        • contigs :
          ['3R']
        + + diff --git a/docs/examples/understanding-genotype-call-xarray-dataset/Genotype-Call-Dataset-From-VCF.rst b/docs/examples/understanding-genotype-call-xarray-dataset/Genotype-Call-Dataset-From-VCF.rst new file mode 100644 index 000000000..c6ff0cf45 --- /dev/null +++ b/docs/examples/understanding-genotype-call-xarray-dataset/Genotype-Call-Dataset-From-VCF.rst @@ -0,0 +1,923 @@ +Load From a VCF File Example +============================ + +A central point to the SGkit API is the Genotype Call Dataset. This is +the data structure that most of the other functions use. It uses +`Xarray `__ underneath the hood to +give a programmatic interface that allows for the backend to be several +different data files. + +The Xarray itself is *sort of* a transposed VCF file. + +For this particular example we are going to go from a VCF file to the +Genotype Call DataSet. + +**Please note that in the real world you should not read in your VCF +files like this, but instead use the functionality in sgkit to go from a +VCF to a Zarr file.** + +We are starting from the VCF file in order to give a conceptual +understanding of the data structure itself. + +.. code:: ipython3 + + import numpy as np + import zarr + import pandas as pd + import dask.array as da + import allel + from pprint import pprint + import matplotlib.pyplot as plt + %matplotlib inline + +Prep Work - Install Packages +---------------------------- + +SGKit is still under rapid development, so I’m installing based on a +commit. + +.. code:: ipython3 + + #! pip install git+https://github.com/pystatgen/sgkit@96203d471531e7e2416d4dd9b48ca11d660a1bcc + +Install PyVCF +~~~~~~~~~~~~~ + +You’ll need to install PyVCF, samtools and tabix in order to run this +example as is. + +PyVCF needs to be in the same kernel in order to use it, but tabix can +be installed anywhere. + +.. code:: ipython3 + + # Or install to your existing environment + # ! conda install -c bioconda -c conda-forge -y pyvcf samtools tabix + + + # Uncomment these to create a new conda environment and install these packages + # If you create a new environment you will have to switch your jupyterhub kernel + # ! conda create -n samtools -c bioconda -c conda-forge -y samtools pyvcf samtools tabix + # ! conda activate samtools + + # ! tabix -h ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz 2:39967768-39967768 > chr2.vcf + # ls -lah chr2.vcf + +Grab Some Data +-------------- + +We’re going to grab a small subset of a VCF file from the `1000 Genomes +Project. `__. +We’re only going to grab 3 calls, which is fine for our purposes. + +These calls are also already biallelic. I cheat. ;-) + +.. code:: ipython3 + + # I couldn't run this from jupyterhub but needed an actual terminal + #! conda activate samtools + #! tabix -h ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz 2:39967768-39967800 > chr2.vcf + #! conda deactivate + # ls -lah chr2.vcf + +.. code:: ipython3 + + import vcf + import os + +Let’s write up a quick, *not to be used in the real world*, parser to +grab data about the variants. + +- Variant Contig Names - A unique list of all the chromosomes and + contigs +- Variant Contig - an index of the variant_contig_names list. +- Variant Position - Position on the chromosome +- Variant Reference and Alternate +- Samples +- Genotype calls per sample - with missing encoded as -1 + +.. code:: ipython3 + + vcf_reader = vcf.Reader(open('/home/jovyan/chr2.vcf', 'r')) + + # I already know these come from chr2 + # but let's grab them anyways + variant_contig_names = [] + + variant_chrom = [] + variant_position = [] + variant_alleles = [] + variant_contig = [] + + sample_id = [] + call_genotype = [] + + count = 0 + + for record in vcf_reader: + + chrom = str(record.CHROM) + if chrom not in variant_contig_names: + variant_contig_names.append(chrom) + + # Grab the index of the contig + variant_contig.append(variant_contig_names.index(chrom)) + + # Get the variant data + # I'm cheating and only getting the first alternate. In the real world you would filter for biallelic variants. + variant_alleles.append([str(record.REF), str(record.ALT[0])]) + variant_position.append(record.POS) + + # the sample records is an object that has call data + samples = record.samples + + # Grab the sample names + if count == 0: + for sample in samples: + sample_id.append(sample.sample) + + # Grab the call data for each sample for the variant + variant_genotypes = [] + for sample in samples: + # If its missing encode as -1, -1 + if sample['GT'] == './.': + variant_genotypes.append([-1, -1]) + else: + GT = sample['GT'].split('|') + variant_genotypes.append([int(GT[0]), int(GT[1])]) + + call_genotype.append(variant_genotypes) + count = count + 1 + +Convert to Numpy +---------------- + +Now that we have our data, we need to prepare for our XArray dataset by +converting these to Numpy arrays. + +If you’re wondering how I know what these are you can check out the +``sgkit.api.create_genotype_call_dataset``. The exact functions are +``check_array_like`` and make sure that these are numpy arrays of a +particular type. + +:: + + check_array_like(variant_contig, kind="i", ndim=1) + check_array_like(variant_position, kind="i", ndim=1) + check_array_like(variant_alleles, kind="S", ndim=2) + check_array_like(sample_id, kind="U", ndim=1) + check_array_like(call_genotype, kind="i", ndim=3) + +.. code:: ipython3 + + sample_id = np.array(sample_id, dtype='U') + variant_position = np.array(variant_position, dtype='i') + variant_alleles = np.array(variant_alleles, dtype='S') + variant_contig_names = np.array(variant_contig_names, dtype='S') + variant_contig = np.array(variant_contig, dtype='i') + +Understanding Variant Contig and Variant Position +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The Genotype Call Xarray dataset is meant to be able to incorporate +multiple chromosomes. + +Let’s say we have variant calls from chrs 1 and 2, which we read into an +array ``['chr1','chr2']``. + +.. code:: ipython3 + + import pandas as pd + +.. code:: ipython3 + + contigs = ['chr1', 'chr2'] + + df = pd.DataFrame({ + 'variant_contig_index': [0, 0, 1, 1], + 'variant_position': [1, 2, 1, 2], + }) + df + + + + +.. raw:: html + +
        + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
        variant_contig_indexvariant_position
        001
        102
        211
        312
        +
        + + + +The Xarray dataset looks like the dataframe above. + +When we initialize the Xarray dataset we will give it a list of contigs +(or chromosomes). We don’t need to explicitly list the contig per +position because we can calculate this based on the contig index. + +**Contig**: ``contigs[row['variant_contig_index']]`` + +**Position**: ``row['variant_position']`` + +.. code:: ipython3 + + def return_contig(row): + return 'Chr: {chr} Pos: {pos}'.format(chr=contigs[row['variant_contig_index']], pos=row['variant_position']) + + df['description'] = df.apply(lambda row: return_contig(row), axis=1) + + df + + + + +.. raw:: html + +
        + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
        variant_contig_indexvariant_positiondescription
        001Chr: chr1 Pos: 1
        102Chr: chr1 Pos: 2
        211Chr: chr2 Pos: 1
        312Chr: chr2 Pos: 2
        +
        + + + +Genotype Calls +~~~~~~~~~~~~~~ + +If we’ve done our work right we our genotypes should have the shape: +``[DIM_VARIANT, DIM_SAMPLE, DIM_PLOIDY]``, meaning the first axis is the +number of variants, the second the number of samples, and the third the +ploidy. In our case we are working with diploid alleles. + +Our genotype array has this structure: + +.. code:: python + + genotypes = [ + + # Outermost array should have a length = the number of variants + + # variant chr 1 position 1 + [ + # Per variant we should have an array length = number of samples + + # sample 1 + # Per sample we should have an array length = number of alleles + [call, call], + + # sample 2 + [call, call] + ], + + # variant chr 1 position 2 + [ + # sample 1 + [call, call], + # sample 2 + [call, call] + ], + + ] + +.. code:: ipython3 + + call_genotype = np.array(call_genotype, dtype='i') + call_genotype.shape + + + + +.. parsed-literal:: + + (3, 629, 2) + + + +This is correct! We have 3 variants, 629 samples, and diploid alleles. + +Convert to Genotype Call Dataset +-------------------------------- + +Finally! Let’s convert this to the Genotype Call Dataset! + +.. code:: ipython3 + + variant_alleles + + + + +.. parsed-literal:: + + array([[b'T', b'A'], + [b'G', b'C'], + [b'C', b'T']], dtype='|S1') + + + +.. code:: ipython3 + + import sgkit + + genotype_xarray_dataset = sgkit.api.create_genotype_call_dataset( + variant_contig_names = variant_contig_names, + # Since we know these are all from the same chromosome we could just calculate this on the fly as a np array of zeros + #variant_contig = np.zeros(len(variant_position)), + variant_contig = variant_contig, + variant_position = variant_position, + variant_alleles = variant_alleles, + sample_id = sample_id, + call_genotype = call_genotype, + ) + +.. code:: ipython3 + + genotype_xarray_dataset + + + + +.. raw:: html + +
        + + + Show/Hide data repr + + + + + + Show/Hide attributes + + + + + + + +
        xarray.Dataset
          • alleles: 2
          • ploidy: 2
          • samples: 629
          • variants: 3
            • variant/contig
              (variants)
              int32
              0 0 0
              array([0, 0, 0], dtype=int32)
            • variant/position
              (variants)
              int32
              39967768 39967778 39967793
              array([39967768, 39967778, 39967793], dtype=int32)
            • variant/alleles
              (variants, alleles)
              |S1
              b'T' b'A' b'G' b'C' b'C' b'T'
              array([[b'T', b'A'],
              +           [b'G', b'C'],
              +           [b'C', b'T']], dtype='|S1')
            • sample/id
              (samples)
              <U7
              'HG00098' 'HG00100' ... 'NA20828'
              array(['HG00098', 'HG00100', 'HG00106', 'HG00112', 'HG00114', 'HG00116',
              +           'HG00117', 'HG00118', 'HG00119', 'HG00120', 'HG00122', 'HG00123',
              +           'HG00124', 'HG00126', 'HG00131', 'HG00141', 'HG00142', 'HG00143',
              +           'HG00144', 'HG00145', 'HG00146', 'HG00147', 'HG00148', 'HG00149',
              +           'HG00150', 'HG00151', 'HG00152', 'HG00153', 'HG00156', 'HG00158',
              +           'HG00159', 'HG00160', 'HG00171', 'HG00173', 'HG00174', 'HG00176',
              +           'HG00177', 'HG00178', 'HG00179', 'HG00180', 'HG00181', 'HG00182',
              +           'HG00183', 'HG00185', 'HG00186', 'HG00187', 'HG00188', 'HG00189',
              +           'HG00190', 'HG00231', 'HG00239', 'HG00242', 'HG00243', 'HG00244',
              +           'HG00245', 'HG00247', 'HG00258', 'HG00262', 'HG00264', 'HG00265',
              +           'HG00266', 'HG00267', 'HG00269', 'HG00270', 'HG00272', 'HG00306',
              +           'HG00308', 'HG00311', 'HG00312', 'HG00357', 'HG00361', 'HG00366',
              +           'HG00367', 'HG00368', 'HG00369', 'HG00372', 'HG00373', 'HG00377',
              +           'HG00380', 'HG00403', 'HG00404', 'HG00406', 'HG00407', 'HG00445',
              +           'HG00446', 'HG00452', 'HG00457', 'HG00553', 'HG00554', 'HG00559',
              +           'HG00560', 'HG00565', 'HG00566', 'HG00577', 'HG00578', 'HG00592',
              +           'HG00593', 'HG00596', 'HG00610', 'HG00611', 'HG00625', 'HG00626',
              +           'HG00628', 'HG00629', 'HG00634', 'HG00635', 'HG00637', 'HG00638',
              +           'HG00640', 'NA06984', 'NA06985', 'NA06986', 'NA06989', 'NA06994',
              +           'NA07000', 'NA07037', 'NA07048', 'NA07051', 'NA07056', 'NA07346',
              +           'NA07347', 'NA07357', 'NA10847', 'NA10851', 'NA11829', 'NA11830',
              +           'NA11831', 'NA11832', 'NA11840', 'NA11843', 'NA11881', 'NA11892',
              +           'NA11893', 'NA11894', 'NA11918', 'NA11919', 'NA11920', 'NA11930',
              +           'NA11931', 'NA11932', 'NA11933', 'NA11992', 'NA11993', 'NA11994',
              +           'NA11995', 'NA12003', 'NA12004', 'NA12005', 'NA12006', 'NA12043',
              +           'NA12044', 'NA12045', 'NA12046', 'NA12058', 'NA12144', 'NA12154',
              +           'NA12155', 'NA12156', 'NA12249', 'NA12272', 'NA12273', 'NA12275',
              +           'NA12287', 'NA12340', 'NA12341', 'NA12342', 'NA12347', 'NA12348',
              +           'NA12383', 'NA12399', 'NA12400', 'NA12413', 'NA12414', 'NA12489',
              +           'NA12546', 'NA12716', 'NA12717', 'NA12718', 'NA12749', 'NA12750',
              +           'NA12751', 'NA12761', 'NA12762', 'NA12763', 'NA12775', 'NA12776',
              +           'NA12777', 'NA12778', 'NA12812', 'NA12813', 'NA12814', 'NA12815',
              +           'NA12828', 'NA12830', 'NA12872', 'NA12873', 'NA12874', 'NA12889',
              +           'NA12890', 'NA18486', 'NA18487', 'NA18489', 'NA18498', 'NA18499',
              +           'NA18501', 'NA18502', 'NA18504', 'NA18505', 'NA18507', 'NA18508',
              +           'NA18510', 'NA18511', 'NA18516', 'NA18517', 'NA18519', 'NA18520',
              +           'NA18522', 'NA18523', 'NA18525', 'NA18526', 'NA18527', 'NA18532',
              +           'NA18535', 'NA18537', 'NA18538', 'NA18539', 'NA18541', 'NA18542',
              +           'NA18545', 'NA18547', 'NA18550', 'NA18552', 'NA18553', 'NA18555',
              +           'NA18558', 'NA18560', 'NA18561', 'NA18562', 'NA18563', 'NA18564',
              +           'NA18565', 'NA18566', 'NA18567', 'NA18570', 'NA18571', 'NA18572',
              +           'NA18573', 'NA18574', 'NA18576', 'NA18577', 'NA18579', 'NA18582',
              +           'NA18592', 'NA18593', 'NA18603', 'NA18605', 'NA18608', 'NA18609',
              +           'NA18611', 'NA18612', 'NA18614', 'NA18615', 'NA18616', 'NA18617',
              +           'NA18618', 'NA18619', 'NA18620', 'NA18621', 'NA18622', 'NA18623',
              +           'NA18624', 'NA18625', 'NA18626', 'NA18627', 'NA18628', 'NA18630',
              +           'NA18631', 'NA18632', 'NA18633', 'NA18634', 'NA18636', 'NA18638',
              +           'NA18640', 'NA18642', 'NA18643', 'NA18745', 'NA18853', 'NA18856',
              +           'NA18858', 'NA18861', 'NA18867', 'NA18868', 'NA18870', 'NA18871',
              +           'NA18873', 'NA18874', 'NA18907', 'NA18908', 'NA18909', 'NA18910',
              +           'NA18912', 'NA18916', 'NA18940', 'NA18941', 'NA18942', 'NA18943',
              +           'NA18944', 'NA18945', 'NA18947', 'NA18948', 'NA18949', 'NA18950',
              +           'NA18951', 'NA18952', 'NA18953', 'NA18955', 'NA18956', 'NA18959',
              +           'NA18960', 'NA18961', 'NA18963', 'NA18964', 'NA18965', 'NA18967',
              +           'NA18968', 'NA18970', 'NA18971', 'NA18972', 'NA18973', 'NA18974',
              +           'NA18975', 'NA18976', 'NA18977', 'NA18979', 'NA18980', 'NA18981',
              +           'NA18982', 'NA18983', 'NA18984', 'NA18985', 'NA18986', 'NA18987',
              +           'NA18988', 'NA18989', 'NA18990', 'NA18997', 'NA18999', 'NA19000',
              +           'NA19001', 'NA19002', 'NA19003', 'NA19004', 'NA19005', 'NA19007',
              +           'NA19009', 'NA19010', 'NA19012', 'NA19027', 'NA19044', 'NA19054',
              +           'NA19055', 'NA19056', 'NA19057', 'NA19058', 'NA19059', 'NA19060',
              +           'NA19062', 'NA19063', 'NA19064', 'NA19065', 'NA19066', 'NA19067',
              +           'NA19068', 'NA19070', 'NA19072', 'NA19074', 'NA19075', 'NA19076',
              +           'NA19077', 'NA19078', 'NA19079', 'NA19082', 'NA19083', 'NA19084',
              +           'NA19085', 'NA19086', 'NA19087', 'NA19088', 'NA19093', 'NA19098',
              +           'NA19099', 'NA19102', 'NA19107', 'NA19108', 'NA19113', 'NA19114',
              +           'NA19116', 'NA19119', 'NA19129', 'NA19130', 'NA19131', 'NA19137',
              +           'NA19138', 'NA19141', 'NA19143', 'NA19144', 'NA19147', 'NA19152',
              +           'NA19153', 'NA19159', 'NA19160', 'NA19171', 'NA19172', 'NA19184',
              +           'NA19189', 'NA19190', 'NA19200', 'NA19201', 'NA19204', 'NA19206',
              +           'NA19207', 'NA19209', 'NA19210', 'NA19213', 'NA19225', 'NA19235',
              +           'NA19236', 'NA19247', 'NA19248', 'NA19256', 'NA19257', 'NA19311',
              +           'NA19312', 'NA19313', 'NA19314', 'NA19332', 'NA19334', 'NA19338',
              +           'NA19346', 'NA19347', 'NA19350', 'NA19355', 'NA19359', 'NA19360',
              +           'NA19371', 'NA19372', 'NA19375', 'NA19376', 'NA19377', 'NA19379',
              +           'NA19381', 'NA19382', 'NA19383', 'NA19384', 'NA19385', 'NA19390',
              +           'NA19391', 'NA19393', 'NA19394', 'NA19395', 'NA19397', 'NA19398',
              +           'NA19399', 'NA19401', 'NA19404', 'NA19428', 'NA19429', 'NA19434',
              +           'NA19435', 'NA19436', 'NA19437', 'NA19438', 'NA19439', 'NA19440',
              +           'NA19443', 'NA19444', 'NA19445', 'NA19446', 'NA19448', 'NA19449',
              +           'NA19451', 'NA19452', 'NA19453', 'NA19455', 'NA19456', 'NA19457',
              +           'NA19461', 'NA19462', 'NA19463', 'NA19466', 'NA19467', 'NA19469',
              +           'NA19471', 'NA19472', 'NA19473', 'NA19474', 'NA19625', 'NA19648',
              +           'NA19649', 'NA19651', 'NA19652', 'NA19654', 'NA19655', 'NA19658',
              +           'NA19660', 'NA19661', 'NA19678', 'NA19684', 'NA19685', 'NA19700',
              +           'NA19701', 'NA19703', 'NA19704', 'NA19707', 'NA19712', 'NA19713',
              +           'NA19720', 'NA19722', 'NA19723', 'NA19725', 'NA19726', 'NA19818',
              +           'NA19819', 'NA19834', 'NA19835', 'NA19900', 'NA19901', 'NA19904',
              +           'NA19908', 'NA19909', 'NA19914', 'NA19916', 'NA19917', 'NA19920',
              +           'NA19921', 'NA19982', 'NA20414', 'NA20502', 'NA20505', 'NA20508',
              +           'NA20509', 'NA20510', 'NA20512', 'NA20515', 'NA20516', 'NA20517',
              +           'NA20518', 'NA20519', 'NA20520', 'NA20521', 'NA20522', 'NA20524',
              +           'NA20525', 'NA20526', 'NA20527', 'NA20528', 'NA20529', 'NA20530',
              +           'NA20531', 'NA20532', 'NA20533', 'NA20534', 'NA20535', 'NA20536',
              +           'NA20537', 'NA20538', 'NA20539', 'NA20540', 'NA20541', 'NA20542',
              +           'NA20543', 'NA20544', 'NA20581', 'NA20582', 'NA20585', 'NA20586',
              +           'NA20588', 'NA20589', 'NA20752', 'NA20753', 'NA20754', 'NA20755',
              +           'NA20756', 'NA20757', 'NA20758', 'NA20759', 'NA20760', 'NA20761',
              +           'NA20765', 'NA20769', 'NA20770', 'NA20771', 'NA20772', 'NA20773',
              +           'NA20774', 'NA20775', 'NA20778', 'NA20783', 'NA20785', 'NA20786',
              +           'NA20787', 'NA20790', 'NA20792', 'NA20795', 'NA20796', 'NA20797',
              +           'NA20798', 'NA20799', 'NA20800', 'NA20801', 'NA20802', 'NA20803',
              +           'NA20804', 'NA20805', 'NA20806', 'NA20807', 'NA20808', 'NA20809',
              +           'NA20810', 'NA20811', 'NA20812', 'NA20813', 'NA20814', 'NA20815',
              +           'NA20816', 'NA20818', 'NA20819', 'NA20826', 'NA20828'], dtype='<U7')
            • call/genotype
              (variants, samples, ploidy)
              int32
              0 0 0 0 1 1 0 1 ... 0 0 0 0 0 0 0 0
              array([[[ 0,  0],
              +            [ 0,  0],
              +            [ 1,  1],
              +            ...,
              +            [ 0,  1],
              +            [ 1,  1],
              +            [ 1,  0]],
              +    
              +           [[-1, -1],
              +            [-1, -1],
              +            [-1, -1],
              +            ...,
              +            [-1, -1],
              +            [-1, -1],
              +            [-1, -1]],
              +    
              +           [[ 0,  0],
              +            [ 0,  0],
              +            [ 0,  0],
              +            ...,
              +            [ 0,  0],
              +            [ 0,  0],
              +            [ 0,  0]]], dtype=int32)
            • call/genotype_mask
              (variants, samples, ploidy)
              bool
              False False False ... False False
              array([[[False, False],
              +            [False, False],
              +            [False, False],
              +            ...,
              +            [False, False],
              +            [False, False],
              +            [False, False]],
              +    
              +           [[ True,  True],
              +            [ True,  True],
              +            [ True,  True],
              +            ...,
              +            [ True,  True],
              +            [ True,  True],
              +            [ True,  True]],
              +    
              +           [[False, False],
              +            [False, False],
              +            [False, False],
              +            ...,
              +            [False, False],
              +            [False, False],
              +            [False, False]]])
          • contigs :
            [b'2']
          + + + +Done! +----- + +Now we have our Xarray dataset that we can use with the rest of Sgkit! diff --git a/docs/examples/understanding-genotype-call-xarray-dataset/Genotype-Call-Dataset-Minimal-Numpy-Example.rst b/docs/examples/understanding-genotype-call-xarray-dataset/Genotype-Call-Dataset-Minimal-Numpy-Example.rst new file mode 100644 index 000000000..fbccc3ae5 --- /dev/null +++ b/docs/examples/understanding-genotype-call-xarray-dataset/Genotype-Call-Dataset-Minimal-Numpy-Example.rst @@ -0,0 +1,460 @@ +Minimal Numpy Example +===================== + +A central point to the SGkit API is the Genotype Call Dataset. This is +the data structure that most of the other functions use. It uses +`Xarray `__ underneath the hood to +give a programmatic interface that allows for the backend to be several +different data files. + +The Xarray itself is *sort of* a transposed VCF file. + +For this particular example we are going to use a minimal set of numpy +arrays in order to create a small Genotype Call Dataset. + +This is only meant to demonstrate the datatypes that we feed into the +Xarray dataset. For a more conceptual understanding please check out the +``Genotype-Call-Dataset-From-VCF.ipynb``. + +.. code:: ipython3 + + import numpy as np + import zarr + import pandas as pd + import dask.array as da + import allel + from pprint import pprint + import matplotlib.pyplot as plt + %matplotlib inline + +Prep Work - Install Packages +---------------------------- + +SGKit is still under rapid development, so I’m installing based on a +commit. + +.. code:: ipython3 + + #! pip install git+https://github.com/pystatgen/sgkit@96203d471531e7e2416d4dd9b48ca11d660a1bcc + +Numpy Representations of the Variant Data +----------------------------------------- + +We need to prepare for our XArray dataset by converting these to Numpy +arrays. + +If you’re wondering how I know what these are you can check out the +``sgkit.api.create_genotype_call_dataset``. The exact functions are +``check_array_like`` and make sure that these are numpy arrays of a +particular type. + +:: + + check_array_like(variant_contig, kind="i", ndim=1) + check_array_like(variant_position, kind="i", ndim=1) + check_array_like(variant_alleles, kind="S", ndim=2) + check_array_like(sample_id, kind="U", ndim=1) + check_array_like(call_genotype, kind="i", ndim=3) + +.. code:: ipython3 + + variant_contig_names = ['3R'] + # the variant contig is the index of the chr in the variant_contig_names + # because we always prefer numbers over strings! + variant_contig = np.array([0], dtype='i') + variant_position = np.array([1], dtype='i') + variant_alleles = np.array([['A', 'T']], dtype='S') + + sample_id = np.array(['sample-1'], dtype='U') + call_genotype_phased = None + variant_id = None + +.. code:: ipython3 + + # The genotype is + # "call/genotype": ([DIM_VARIANT, DIM_SAMPLE, DIM_PLOIDY], call_genotype), + # and needs to be type 'i' + # You can also look at the GenotypeChunkedArray + call_genotype = np.array([[[0, 0]]], dtype='i') + call_genotype.shape + + + + +.. parsed-literal:: + + (1, 1, 2) + + + +This is correct! We have 1 variant, 1 sample, 1 biallelic call. + +Convert to Genotype Call Dataset +-------------------------------- + +Finally! Let’s convert this to the Genotype Call Dataset! + +.. code:: ipython3 + + import sgkit + + genotype_xarray_dataset = sgkit.api.create_genotype_call_dataset( + variant_contig_names = variant_contig_names, + variant_contig = variant_contig, + variant_position = variant_position, + variant_alleles = variant_alleles, + sample_id = sample_id, + call_genotype = call_genotype, + ) + +.. code:: ipython3 + + genotype_xarray_dataset + + + + +.. raw:: html + +
          + + + Show/Hide data repr + + + + + + Show/Hide attributes + + + + + + + +
          xarray.Dataset
            • alleles: 2
            • ploidy: 2
            • samples: 1
            • variants: 1
              • variant/contig
                (variants)
                int32
                0
                array([0], dtype=int32)
              • variant/position
                (variants)
                int32
                1
                array([1], dtype=int32)
              • variant/alleles
                (variants, alleles)
                |S1
                b'A' b'T'
                array([[b'A', b'T']], dtype='|S1')
              • sample/id
                (samples)
                <U8
                'sample-1'
                array(['sample-1'], dtype='<U8')
              • call/genotype
                (variants, samples, ploidy)
                int32
                0 0
                array([[[0, 0]]], dtype=int32)
              • call/genotype_mask
                (variants, samples, ploidy)
                bool
                False False
                array([[[False, False]]])
            • contigs :
              ['3R']
            + + + +Done! +----- + +Now we have our Xarray dataset that we can use with the rest of Sgkit! diff --git a/docs/index.rst b/docs/index.rst index 865ce268a..95e0cf26e 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -3,9 +3,11 @@ sgkit: Statistical genetics toolkit in Python .. toctree:: :maxdepth: 2 - :caption: Contents: + :caption: Contents api + examples + Indices and tables diff --git a/requirements-dev.txt b/requirements-dev.txt index 4a4131125..b5c807dca 100644 --- a/requirements-dev.txt +++ b/requirements-dev.txt @@ -5,4 +5,7 @@ pytest-cov hypothesis sphinx sphinx_rtd_theme +ipython +nbsphinx statsmodels +