use pgenlib from pypi (resolves #16)
aryarm committed Jul 21, 2022
1 parent c8a7773 commit 1155b53
Showing 5 changed files with 12 additions and 49 deletions.
7 changes: 0 additions & 7 deletions docs/api/data.rst
@@ -156,13 +156,6 @@ The :class:`GenotypesPLINK` class offers experimental support for reading and wr

The time required to load various genotype file formats.

.. warning::
Use of this class is not officially supported yet because it relies upon the as-yet-unpublished ``pgenlib`` python library. See `issue #16 <https://github.com/gymrek-lab/haptools/pull/16>`_ for current progress on this challenge. In the meantime, you must install the library manually from Github via ``pip``.

.. code-block:: bash
pip install git+https://github.com/chrchang/plink-ng.git#subdirectory=2.0/Python
The :class:`GenotypesPLINK` class inherits from the :class:`GenotypesRefAlt` class, so it has all the same methods and properties. Loading genotypes is the exact same, for example.

.. code-block:: python
19 changes: 10 additions & 9 deletions docs/formats/genotypes.rst
@@ -4,19 +4,20 @@
Genotypes
=========

Genotype files must be specified as VCF or BCF files.
.. figure:: https://drive.google.com/uc?export=view&id=1_JARKJQ0LX-DzL0XsHW1aiQgLCOJ1ZvC

.. _formats-genotypesplink:
The time required to load various genotype file formats.

There is also experimental support for `PLINK2 PGEN <https://github.com/chrchang/plink-ng/blob/master/pgen_spec/pgen_spec.pdf>`_ files in some commands. These files can be loaded much more quickly than VCFs, so we highly recommend using them if you're working with large datasets. See the documentation for the :class:`GenotypesPLINK` class in :ref:`the API docs <api-data-genotypesplink>` for more information.
VCF/BCF
-------

.. figure:: https://drive.google.com/uc?export=view&id=1_JARKJQ0LX-DzL0XsHW1aiQgLCOJ1ZvC
Genotype files must be specified as VCF or BCF files. They can be bgzip-compressed.

The time required to load various genotype file formats.
.. _formats-genotypesplink:

.. warning::
PGEN files are not officially supported yet because our code relies upon the as-yet-unpublished ``pgenlib`` python library. See `issue #16 <https://github.com/gymrek-lab/haptools/pull/16>`_ for current progress on this challenge. In the meantime, you must install the library manually from Github via ``pip``.
PLINK2 PGEN
-----------

.. code-block:: bash
There is also experimental support for `PLINK2 PGEN <https://github.com/chrchang/plink-ng/blob/master/pgen_spec/pgen_spec.pdf>`_ files in some commands. These files can be loaded much more quickly than VCFs, so we highly recommend using them if you're working with large datasets. See the documentation for the :class:`GenotypesPLINK` class in :ref:`the API docs <api-data-genotypesplink>` for more information.

pip install git+https://github.com/chrchang/plink-ng.git#subdirectory=2.0/Python
If you run out of memory when using PGEN files, consider reading variants from the file in chunks via the ``--chunk-size`` parameter.
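
The diff does not show how chunked reading is implemented, so the following is only a minimal sketch of the general idea: split the variant indices into fixed-size ranges and load one range at a time, so that only one chunk of genotypes is held in memory. The helper name ``chunk_ranges`` is hypothetical and is not part of haptools or ``pgenlib``.

```python
from __future__ import annotations

from typing import Iterator


def chunk_ranges(num_variants: int, chunk_size: int) -> Iterator[tuple[int, int]]:
    """Yield half-open (start, end) index ranges that together cover all variants.

    Hypothetical helper for illustration only; not haptools' implementation.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be a positive integer")
    for start in range(0, num_variants, chunk_size):
        # the final chunk may be shorter than chunk_size
        yield start, min(start + chunk_size, num_variants)


# 10 variants read 4 at a time -> three chunks, the last one partial
print(list(chunk_ranges(10, 4)))
```

A reader would then load only the variants in each ``(start, end)`` range before moving on to the next, trading some speed for a lower peak memory footprint.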
33 changes: 1 addition & 32 deletions haptools/data/genotypes.py
@@ -1,6 +1,5 @@
from __future__ import annotations
import re
import subprocess
from csv import reader
from pathlib import Path
from typing import Iterator
@@ -10,6 +9,7 @@

import numpy as np
import numpy.typing as npt
from pgenlib import PgenReader, PgenWriter
from cyvcf2 import VCF, Variant
from pysam import VariantFile, TabixFile

@@ -933,16 +933,6 @@ def read(
See documentation for :py:attr:`~.GenotypesRefAlt.read`
"""
super(Genotypes, self).read()
# TODO: figure out how to install this package
try:
from pgenlib import PgenReader
except ImportError:
raise ImportError(
"We cannot read PGEN files without the pgenlib library. Please install"
" pgenlib via\npip install git+https://github.com/chrchang/plink-ng.gi"
"t#subdirectory=2.0/Python"
)

sample_idxs = self.read_samples(samples)
with PgenReader(
bytes(str(self.fname), "utf8"), sample_subset=sample_idxs
@@ -1097,16 +1087,6 @@ def __iter__(
See documentation for :py:meth:`~.GenotypesPLINK._iterate`
"""
super(Genotypes, self).read()
# TODO: figure out how to install this package
try:
from pgenlib import PgenReader
except ImportError:
raise ImportError(
"We cannot read PGEN files without the pgenlib library. Please install"
" pgenlib via\npip install git+https://github.com/chrchang/plink-ng.gi"
"t#subdirectory=2.0/Python"
)

sample_idxs = self.read_samples(samples)
pgen = PgenReader(bytes(str(self.fname), "utf8"), sample_subset=sample_idxs)
# call another function to force the lines above to be run immediately
@@ -1162,17 +1142,6 @@ def write(self):
# write the psam and pvar files
self.write_samples()
self.write_variants()

# TODO: figure out how to install this package
try:
from pgenlib import PgenWriter
except ImportError:
raise ImportError(
"We cannot write PGEN files without the pgenlib library. Please "
"install pgenlib via\npip install git+https://github.com/chrchang/plin"
"k-ng.git#subdirectory=2.0/Python"
)

# transpose the data b/c pgenwriter expects things in "variant-major" order
# (ie where variants are rows instead of samples)
data = self.data.transpose((1, 0, 2))
1 change: 1 addition & 0 deletions pyproject.toml
@@ -15,6 +15,7 @@ pysam = "^0.19.1"
cyvcf2 = "^0.30.14"
brewer2mpl = "^1.4.1"
matplotlib = "^3.5.1"
Pgenlib = "^0.81.2"

# docs
# these belong in dev-dependencies, but RTD doesn't support that yet -- see
1 change: 0 additions & 1 deletion tests/test_data.py
@@ -252,7 +252,6 @@ def test_subset_genotypes(self):

class TestGenotypesPLINK:
def _get_fake_genotypes_plink(self):
pgenlib = pytest.importorskip("pgenlib")
gts_ref_alt = TestGenotypes()._get_fake_genotypes_refalt()
gts = GenotypesPLINK(gts_ref_alt.fname)
gts.data = gts_ref_alt.data