Feature idea: select_index – select that only returns indices #128

Closed
ivirshup opened this issue Nov 4, 2022 · 7 comments

@ivirshup
Contributor

ivirshup commented Nov 4, 2022

select returns a subset of the passed dataframe.

For use with anndata and sgkit, we'd probably want to use select to get the indices of the regions of interest, then subset the entire AnnData object or xarray.Dataset. Alternatively, we may want to subset only some of the aligned arrays.

If we could get a method in bioframe that only returned the indices of the selected regions, we could reduce the overhead of allocating all the other columns of the passed dataframe. This would make it easier to rely on bioframe in other libraries.

Demo implementation
def select_indices(df, region, cols=None):
    """
    Return indices of genomic intervals in a dataframe that overlap a genomic region.

    Parameters
    ----------
    df : pandas.DataFrame
        Dataframe of genomic intervals (a bedframe).

    region : str or tuple
        The genomic region to select from the dataframe.
        UCSC-style genomic region string, or Triple (chrom, start, end),
        where ``start`` or ``end`` may be ``None``. See :func:`.core.stringops.parse_region()`
        for more information on region formatting.

    cols : (str, str, str) or None
        The names of columns containing the chromosome, start and end of the
        genomic intervals. The default values are 'chrom', 'start', 'end'.
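
    Returns
    -------
    pandas.Series
        Boolean mask aligned with ``df``; True where an interval overlaps ``region``.
    """
    import bioframe
    from bioframe.core import checks
    from bioframe.core.specs import _get_default_colnames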

    ck, sk, ek = _get_default_colnames() if cols is None else cols
    checks.is_bedframe(df, raise_errors=True, cols=[ck, sk, ek])

    chrom, start, end = bioframe.parse_region(region)
    if chrom is None:
        raise ValueError("no chromosome detected, check region input")
    if (start is not None) and (end is not None):
        inds = (df[ck] == chrom) & (df[sk] < end) & (df[ek] > start)
    else:
        inds = df[ck] == chrom
    return inds
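
As a rough illustration of the use case described above, the sketch below applies a mask of this kind to an array that is aligned with the dataframe rows, without copying any dataframe columns. It uses only pandas and NumPy; `overlap_mask` is a hypothetical stand-in for the demo function (it skips region parsing and validation), and `X` is an assumed example matrix, not an AnnData or xarray object.

```python
import numpy as np
import pandas as pd


def overlap_mask(df, chrom, start, end):
    """Hypothetical helper: same overlap test as the demo, returned as a NumPy mask."""
    hits = (df["chrom"] == chrom) & (df["start"] < end) & (df["end"] > start)
    return hits.to_numpy()


intervals = pd.DataFrame(
    {
        "chrom": ["chr1", "chr1", "chr1", "chr2"],
        "start": [0, 100, 200, 0],
        "end": [100, 200, 300, 100],
    }
)

# Assumed example data: 3 observations x 4 features, where column j of X
# corresponds to row j of `intervals`.
X = np.arange(12).reshape(3, 4)

# Select features overlapping chr1:50-150 and subset the aligned axis only.
mask = overlap_mask(intervals, "chr1", 50, 150)
X_sub = X[:, mask]
```

Here only the boolean mask is allocated; the other columns of `intervals` are never copied.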

@gfudenberg
Member

gfudenberg commented Nov 4, 2022

Thanks for the suggestion!

In terms of implementation, another option would be to add keywords controlling return behavior to current select, as we do in overlap(), which has defaults:

    return_input=True,
    return_index=False

thoughts?

cc @golobor @nvictus

@nvictus
Member

nvictus commented Nov 4, 2022

Perhaps something like bioframe.iselect(df, region) would be short and convenient.

There's the additional issue of whether to return DataFrame indexes (for .loc[]) or array indexes (for .iloc[] or exogenous arrays aligned with the dataframe), and whether to provide these as separate functions or via a parameter.
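
The distinction between the two kinds of index can be sketched with plain pandas and NumPy. The dataframe, its string labels, and the `signal` array below are made-up example data; the chromosome-only match stands in for a real region query.

```python
import numpy as np
import pandas as pd

# A bedframe with a non-default (string) index.
df = pd.DataFrame(
    {"chrom": ["chr1", "chr1", "chr2"], "start": [0, 100, 0], "end": [100, 200, 100]},
    index=["a", "b", "c"],
)
# An exogenous array aligned with the rows of `df`.
signal = np.array([0.5, 1.5, 2.5])

mask = (df["chrom"] == "chr1").to_numpy()

positions = np.flatnonzero(mask)  # integer positions, usable with .iloc[] and NumPy arrays
labels = df.index[mask]           # index labels, usable with .loc[]

by_label = df.loc[labels]
by_position = df.iloc[positions]
signal_sub = signal[positions]    # labels could not be used here
```

Both selections return the same rows in this example, but only the positional form can index the aligned NumPy array.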

@ivirshup
Contributor Author

ivirshup commented Nov 7, 2022

> In terms of implementation, another option would be to add keywords

My preference is generally to use different functions for returning different types (so iselect), but API consistency is also good.

I do always find the numpy functions that can return a variety of different values based on flags confusing to work with.
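
To make the trade-off concrete, here is a toy sketch of the two API styles. The functions `select_flagged` and `select_rows` are hypothetical and match on chromosome only; the keyword names `return_input` and `return_index` and the name `iselect` are taken from the comments above, but nothing here is bioframe's actual implementation.

```python
import numpy as np
import pandas as pd


def _chrom_mask(df, chrom):
    # Simplified selection: match on chromosome only.
    return (df["chrom"] == chrom).to_numpy()


# Style 1 (hypothetical): one function whose return type depends on flags.
def select_flagged(df, chrom, return_input=True, return_index=False):
    mask = _chrom_mask(df, chrom)
    out = []
    if return_index:
        out.append(np.flatnonzero(mask))
    if return_input:
        out.append(df[mask])
    # The caller has to know the flags to know what comes back.
    return out[0] if len(out) == 1 else tuple(out)


# Style 2 (hypothetical): one function per return type.
def select_rows(df, chrom):
    return df[_chrom_mask(df, chrom)]


def iselect(df, chrom):
    return np.flatnonzero(_chrom_mask(df, chrom))


df = pd.DataFrame({"chrom": ["chr1", "chr1", "chr2"], "start": [0, 100, 0], "end": [100, 200, 100]})

rows = select_flagged(df, "chr1")                                          # DataFrame
idx = select_flagged(df, "chr1", return_input=False, return_index=True)    # ndarray
both = select_flagged(df, "chr1", return_index=True)                       # tuple
```

With separate functions, the return type can be read off the function name rather than the argument values.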

> There's the additional issue of whether to return DataFrame indexes

Currently, the indexing statement used in select returns a boolean mask. I think this could be fine. I would also lean toward array indices over labels, since labels can be non-unique.

In theory, this could be a slice for sorted ranges, which would be the most efficient representation.

@gfudenberg
Member

gfudenberg commented Nov 28, 2022

index selection function discussion 11/28/2022

namespacing:
-- would this be fine to keep out of base namespace, e.g. from bioframe.ops import bselect or from bioframe.iops import bselect? This relates to 1st order proposal about refactoring current private _intidx functions in ops. cc @ivirshup

implementation:
-- should not return pandas indices, because bioframe does not rely on them
-- slices are not good outputs because even in sorted bedframes, a range selection could skip intervals
-- most versatile type of output would be boolean numpy mask (compatible with pandas .loc, preserves shape, easy to get indices for iloc etc).
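
A small sketch of the two implementation points above, using made-up intervals and only pandas and NumPy: in a bedframe sorted by start, a long early interval can overlap the query while the intervals between it and the next hit do not, so the selection is not a contiguous slice; the boolean mask still works with `.loc` and converts easily to positions for `.iloc`.

```python
import numpy as np
import pandas as pd

# Sorted by (chrom, start). The first interval is long and reaches the query region.
df = pd.DataFrame(
    {
        "chrom": ["chr1"] * 4,
        "start": [0, 10, 20, 90],
        "end": [100, 15, 30, 120],
    }
)

# Query chr1:95-110 with the same overlap test as the demo implementation.
mask = ((df["chrom"] == "chr1") & (df["start"] < 110) & (df["end"] > 95)).to_numpy()

# The mask can be used directly with .loc, or converted to positions for .iloc.
positions = np.flatnonzero(mask)
via_loc = df.loc[mask]
via_iloc = df.iloc[positions]

# The hits are rows 0 and 3, so no single slice describes the selection.
is_contiguous = bool(np.all(np.diff(positions) == 1))
```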

naming:
-- given this function wouldn't necessarily return pandas indices, some discussed options: select_mask, bselect, iselect

1st order
-- refactor _intidxs functions as a new module, iops or idxops
-- setdiff could also be refactored into iops
-- note that cluster and complement (from which we derive merge & subtract) would not have iops analogs because they can modify the number of elements in the array

2nd order:
-- think about what would be necessary for out-of-core etc

@ivirshup
Contributor Author

ivirshup commented Nov 30, 2022

> namespacing:

I would like at least some of these to be public for use with AnnData, sgkit etc.

> most versatile type of output would be boolean numpy mask (compatible with pandas .loc, preserves shape, easy to get indices for iloc etc).

Also nice that we can invert it.

A little unfortunate that it will always be large, even if the result is very sparse. Not sure there's a great solution there other than two implementations (one mask, one indices). Could be a problem for out-of-core.
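
The size trade-off can be illustrated with NumPy alone. The array length and hit positions below are arbitrary example values: the mask costs one byte per row regardless of how many rows match, while an index array scales with the number of hits but is less convenient to invert.

```python
import numpy as np

n_rows = 1_000_000

# A very sparse selection: 3 hits out of a million rows.
mask = np.zeros(n_rows, dtype=bool)
mask[[10, 500_000, 999_999]] = True

indices = np.flatnonzero(mask)

mask_bytes = mask.nbytes        # one byte per row, independent of sparsity
index_bytes = indices.nbytes    # one integer per hit

# Inverting the mask is a single cheap operation...
complement_mask = ~mask
# ...whereas the complement of an index array needs the full range.
complement_indices = np.setdiff1d(np.arange(n_rows), indices)
```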

@ivirshup
Contributor Author

Would love to see a release with select_mask from #132 out. Any idea what the schedule is like for that?

@nvictus
Member

nvictus commented Mar 23, 2023

Just released in v0.4.0!

We've included not only select_mask but also select_indices and select_labels.

@nvictus nvictus closed this as completed Mar 23, 2023