Feature idea: select_index – select that only returns indices #128

ivirshup · 2022-11-04T12:08:47Z

select returns a subset of the passed dataframe.

For use with anndata and sgkit, we'd probably want to use select to get the indices of the regions of interest, then subset the entire AnnData object or xarray.Dataset. Alternatively, we may want to subset a selection of the aligned arrays.

If we could get a method in bioframe that only returned the indices of the selected regions, we could reduce the overhead of allocating all the other columns of the passed dataframe. This would make it easier to rely on bioframe in other libraries.
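To make the motivation concrete, here is a minimal sketch, assuming a plain pandas interval table and a NumPy matrix standing in for the aligned data of an AnnData or xarray object. The helper `overlap_mask` and the variable names are hypothetical and not part of bioframe; the point is only that the selection reads three coordinate columns and never copies the rest of the table.

```python
import numpy as np
import pandas as pd

# Interval table with an extra annotation column that we do not want to copy.
var = pd.DataFrame({
    "chrom": ["chr1", "chr1", "chr2", "chr2"],
    "start": [0, 100, 0, 50],
    "end": [100, 200, 50, 150],
    "name": ["a", "b", "c", "d"],
})

# Matrix whose columns are aligned with the rows of `var`
# (a stand-in for something like AnnData.X; an assumption for illustration).
X = np.arange(12).reshape(3, 4)


def overlap_mask(df, chrom, start, end):
    """Hypothetical helper: boolean mask of rows overlapping [start, end)."""
    mask = (df["chrom"] == chrom) & (df["start"] < end) & (df["end"] > start)
    return mask.to_numpy()


mask = overlap_mask(var, "chr1", 50, 150)
X_sub = X[:, mask]  # subset the aligned array without subsetting `var` itself
```

The same mask could then be applied to any other array aligned with the table.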

Demo implementation

def select_indices(df, region, cols=None):
    """
    Return a boolean mask marking the genomic intervals in a dataframe that
    overlap a genomic region.

    Parameters
    ----------
    df : pandas.DataFrame

    region : str or tuple
        The genomic region to select from the dataframe.
        UCSC-style genomic region string, or Triple (chrom, start, end),
        where ``start`` or ``end`` may be ``None``. See :func:`.core.stringops.parse_region()`
        for more information on region formatting.

    cols : (str, str, str) or None
        The names of columns containing the chromosome, start and end of the
        genomic intervals. The default values are 'chrom', 'start', 'end'.

    Returns
    -------
    pandas.Series of bool, aligned with the rows of ``df``.
    """
    import bioframe
    from bioframe.core import checks
    from bioframe.core.specs import _get_default_colnames

    ck, sk, ek = _get_default_colnames() if cols is None else cols
    checks.is_bedframe(df, raise_errors=True, cols=[ck, sk, ek])

    chrom, start, end = bioframe.parse_region(region)
    if chrom is None:
        raise ValueError("no chromosome detected, check region input")

    # Start from all intervals on the chromosome, then apply whichever
    # bounds were given, so an open-ended region still honors the other bound.
    inds = df[ck] == chrom
    if start is not None:
        inds &= df[ek] > start
    if end is not None:
        inds &= df[sk] < end
    return inds


gfudenberg · 2022-11-04T16:58:47Z

Thanks for the suggestion!

In terms of implementation, another option would be to add keywords controlling return behavior to current select, as we do in overlap(), which has defaults:

    return_input=True,
    return_index=False

thoughts?

cc @golobor @nvictus

nvictus · 2022-11-04T17:17:05Z

Perhaps something like bioframe.iselect(df, region) would be short and convenient.

There's the additional issue of whether to return DataFrame indexes (for .loc[]) or array indexes (for .iloc[] or exogenous arrays aligned with the dataframe), and whether to provide these as separate functions or via a parameter.
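The distinction between the two kinds of index can be sketched as follows. This is an illustration using only pandas and NumPy on a toy table with a non-default index; the variable names are made up for the example and nothing here is bioframe API.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"chrom": ["chr1", "chr1", "chr2"], "start": [0, 100, 0], "end": [100, 200, 50]},
    index=["x", "y", "z"],  # non-default labels
)
aligned = np.array([10.0, 20.0, 30.0])  # exogenous array aligned with the rows

mask = (df["chrom"] == "chr1").to_numpy()

labels = df.index[mask]           # DataFrame index labels, usable with .loc[]
positions = np.flatnonzero(mask)  # integer positions, usable with .iloc[] or arrays

by_label = df.loc[labels]
by_position = df.iloc[positions]
aligned_sub = aligned[positions]  # labels could not be used here
```

With unique labels both routes select the same rows, but only the integer positions carry over to arrays that live outside the dataframe.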

ivirshup · 2022-11-07T15:30:41Z

> In terms of implementation, another option would be to add keywords

My preference is generally to use different functions for returning different types (so iselect), but API consistency is also good.

I do always find the numpy functions that can return a variety of different values based on flags confusing to work with.

> There's the additional issue of whether to return DataFrame indexes

Currently, the indexing statement used in select returns a boolean mask. I think this could be fine. I would also lean toward array indices over labels, since labels can be non-unique.

In theory, this could be a slice for sorted ranges, which would be the most efficient representation.
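The concern about non-unique labels can be illustrated with a small sketch, assuming a toy dataframe whose index contains a duplicated label (the data and names are invented for the example). Selecting by label then returns more rows than were actually selected, while integer positions stay exact.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"chrom": ["chr1", "chr2", "chr1"], "start": [0, 0, 100], "end": [100, 50, 200]},
    index=["a", "a", "b"],  # label "a" appears twice
)

mask = (df["chrom"] == "chr1").to_numpy()  # selects the first and third rows

labels = df.index[mask]           # labels "a" and "b"
positions = np.flatnonzero(mask)  # positions 0 and 2

n_by_label = len(df.loc[labels])         # "a" matches two rows, so one extra row
n_by_position = len(df.iloc[positions])  # exactly the selected rows
```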

gfudenberg · 2022-11-28T20:05:38Z

index selection function discussion 11/28/2022

namespacing:
-- would it be fine to keep this out of the base namespace, e.g. from bioframe.ops import bselect or from bioframe.iops import bselect? This relates to the 1st order proposal about refactoring the current private _intidx functions in ops. cc @ivirshup

implementation:
-- should not return pandas indices, because bioframe does not rely on them
-- slices are not good outputs because even in sorted bedframes, a range selection could skip intervals
-- most versatile type of output would be boolean numpy mask (compatible with pandas .loc, preserves shape, easy to get indices for iloc etc).
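The point about slices can be checked on a small example. The sketch below assumes a toy bedframe sorted by (chrom, start) in which one long interval spans the others; the query coordinates are arbitrary. The rows that overlap the query are not contiguous, so no single slice can represent the selection, whereas a boolean mask keeps the shape of the table and converts easily to positions.

```python
import numpy as np
import pandas as pd

# Sorted by (chrom, start); the first interval is long and spans the others.
df = pd.DataFrame({
    "chrom": ["chr1", "chr1", "chr1"],
    "start": [0, 100, 300],
    "end": [1000, 200, 400],
})

# Query chr1:250-350 using the usual half-open overlap test.
q_start, q_end = 250, 350
mask = (
    (df["chrom"] == "chr1") & (df["start"] < q_end) & (df["end"] > q_start)
).to_numpy()

positions = np.flatnonzero(mask)  # integer positions for .iloc[]
is_contiguous = bool(np.all(np.diff(positions) == 1))

selected = df.loc[mask]  # a boolean mask works directly with .loc[]
```

Here the middle interval (100-200) ends before the query starts, so the selection skips it even though the table is sorted.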

naming:
-- given this function wouldn't necessarily return pandas indices, some discussed options: select_mask, bselect, iselect

1st order
-- refactor _intidxs functions as a new module, iops or idxops
-- setdiff could also be refactored into iops
-- note that cluster and complement (from which we derive merge & subtract) would not have iops analogs because they can modify the number of elements in the array

2nd order:
-- think about what would be necessary for out-of-core etc

ivirshup · 2022-11-30T17:54:09Z

> namespacing:

I would like at least some of these to be public for use with AnnData, sgkit etc.

> most versatile type of output would be boolean numpy mask (compatible with pandas .loc, preserves shape, easy to get indices for iloc etc).

Also nice that we can invert it.

A little unfortunate that it will always be large, even if the result is very sparse. Not sure there's a great solution there other than two implementations (one mask, one indices). Could be a problem for out-of-core.
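The trade-off between the two representations can be sketched with NumPy alone. The sizes and hit positions below are arbitrary assumptions chosen for illustration: the mask costs one byte per row however few rows match, while the index array grows only with the number of hits but needs the total length to be inverted.

```python
import numpy as np

n = 1_000_000
mask = np.zeros(n, dtype=bool)
mask[[5, 42, 999_999]] = True  # a very sparse selection

indices = np.flatnonzero(mask)  # compact integer representation of the same hits

mask_bytes = mask.nbytes      # one byte per row, regardless of sparsity
index_bytes = indices.nbytes  # proportional to the number of hits

inverted = ~mask  # inverting a mask is a single operation
# Inverting indices requires knowing the full length of the table.
complement = np.setdiff1d(np.arange(n), indices)
```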

ivirshup · 2023-03-21T14:57:13Z

Would love to see a release with select_mask from #132 out. Any idea what the schedule is like for that?

nvictus · 2023-03-23T22:14:31Z

Just released in v0.4.0!

We've included not only select_mask but also select_indices and select_labels.

gfudenberg added the enhancement label Nov 7, 2022

ivirshup mentioned this issue Dec 5, 2022: Support for genomic ranges scverse/anndata#624 (Open)

nvictus mentioned this issue Dec 19, 2022: Add select_mask #132 (Merged)

nvictus closed this as completed Mar 23, 2023
