Support for sequences #78

Open
apcamargo opened this issue Jan 20, 2025 · 4 comments

Comments

@apcamargo

Hi @mwiewior

Are there any plans to support data types beyond intervals? I came across this repository while planning to create a plugin with the same name, although my original idea was somewhat different. I think it would be really useful to include functionality for reading FASTA files into tables, enabling sequence operations within a dataframe framework while leveraging Polars' speed.

For instance, a protein FASTA file could be parsed into a two-column dataframe, where one column contains sequence names and the other contains the sequences themselves. This would open the door to various analyses, such as filtering sequences, calculating amino acid frequencies, or deriving numeric protein properties (e.g., molecular weight, isoelectric point). Similarly, for DNA sequences, useful features could include computing the reverse complement, GC content, GC/AT skew, and other sequence-based metrics.
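To make the idea concrete, here is a minimal pure-Python sketch of the two-column layout and two of the DNA metrics mentioned above. It is not polars-bio's or needletail's API: the helper names (`read_fasta_records`, `fasta_to_columns`, `reverse_complement`, `gc_content`) are made up for illustration, and the parser assumes a plain, well-formed FASTA file and handles only unambiguous A/C/G/T bases.

```python
import io
from typing import Dict, Iterable, Iterator, List, Tuple


def read_fasta_records(handle: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from an iterable of FASTA lines.

    Hypothetical helper: the name is the first whitespace-delimited token
    of the header line, and multi-line sequences are concatenated.
    """
    name = None
    chunks: List[str] = []
    for raw in handle:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            tokens = line[1:].split()
            name = tokens[0] if tokens else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def fasta_to_columns(handle: Iterable[str]) -> Dict[str, List[str]]:
    """Collect records into a column-oriented dict: one list per column."""
    names: List[str] = []
    sequences: List[str] = []
    for name, sequence in read_fasta_records(handle):
        names.append(name)
        sequences.append(sequence)
    return {"name": names, "sequence": sequences}


# Translation table for unambiguous DNA bases only (an assumption of this sketch).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence: str) -> str:
    """Complement each base, then reverse the string."""
    return sequence.translate(_COMPLEMENT)[::-1]


def gc_content(sequence: str) -> float:
    """Fraction of G and C characters; 0.0 for an empty sequence."""
    if not sequence:
        return 0.0
    upper = sequence.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)


if __name__ == "__main__":
    example = io.StringIO(">seq1 example record\nACGT\nGG\n>seq2\nAATT\n")
    columns = fasta_to_columns(example)
    for name, sequence in zip(columns["name"], columns["sequence"]):
        print(name, sequence, reverse_complement(sequence), gc_content(sequence))
```

A column-oriented dict like this can be handed to a dataframe constructor such as `polars.DataFrame(columns)`. A real plugin would presumably do the parsing in Rust and expose the metrics as expressions instead.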

@mwiewior
Collaborator

mwiewior commented Jan 20, 2025

Hey @apcamargo, in short: absolutely yes. As you can see, we've just started, but the idea is to support as many useful operations and file formats as possible. We already have some basic FASTA/FASTQ support (see https://biodatageeks.org/polars-bio/api/#polars_bio.read_fasta), but it's not yet well tested in terms of either performance or quality. I'm happy to discuss further improvements and feature requests, and I'm open to contributions if you would like to join forces.

@apcamargo
Author

Thanks for the response, @mwiewior!

I tried to use pb.io.read_fasta, but got the following error:

PanicException: called `Option::unwrap()` on a `None` value

From what I’ve seen in the code, it seems that the actual FASTA parser hasn’t been implemented yet (?)

I’ve never quite figured out how to write Polars IO plugins, but my plan was to leverage needletail, as it is the fastest parser available. I recall reading somewhere (maybe in the Polars Discord) that IO plugins were planned to be implemented at the Python level (as opposed to Rust). With that in mind, I started working on Needletail’s Python module (see onecodex/needletail#92 and onecodex/needletail#91), and I could fork it if the parser does need to be written in Python. Otherwise, I believe using Needletail’s Rust crate would be the better approach.

That said, I haven’t made much concrete progress yet. So far, I’ve only written some Rust functions to infer the alphabet of a sequence (DNA, RNA, or protein) and to compute a few protein properties (e.g., molecular weight, charge, isoelectric point). I’d be happy to share these with you if you’re interested.
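As a rough illustration of what alphabet inference can look like, here is a short Python sketch. It is not the Rust code mentioned above, whose logic is not shown in this thread. The function name `infer_alphabet`, the letter sets, and the "first matching alphabet wins" rule are all assumptions, and the rule is ambiguous by design: a short peptide made only of A, C, G, and T would be reported as DNA.

```python
# Letter sets assumed for this sketch; real tools may accept more ambiguity codes.
DNA_LETTERS = frozenset("ACGTN")
RNA_LETTERS = frozenset("ACGUN")
PROTEIN_LETTERS = frozenset("ACDEFGHIKLMNPQRSTVWYX")  # 20 standard residues plus X


def infer_alphabet(sequence: str) -> str:
    """Guess 'dna', 'rna', 'protein', or 'unknown' from the letters present.

    Heuristic only: alphabets are tried from most to least restrictive,
    so any sequence that fits DNA is called DNA even if it could also be
    read as RNA or protein.
    """
    # Gap and stop characters are ignored rather than treated as evidence.
    letters = set(sequence.upper()) - {"-", "*"}
    if not letters:
        return "unknown"
    if letters <= DNA_LETTERS:
        return "dna"
    if letters <= RNA_LETTERS:
        return "rna"
    if letters <= PROTEIN_LETTERS:
        return "protein"
    return "unknown"


if __name__ == "__main__":
    for example in ("ACGTTGCA", "acguugca", "MKVLAT", "ACGT123", ""):
        print(repr(example), "->", infer_alphabet(example))
```

A more careful version could weigh letter frequencies instead of set membership, which would help with short or low-complexity sequences.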

By the way, I'm curious about DataFusion's role in polars-bio. I noticed that it’s quite heavy to install and was wondering if it could be included as an extra.

@mwiewior
Collaborator

mwiewior commented Jan 21, 2025

Just to quickly comment on the size:

  1. Yes, it's heavy. There are two reasons, plus a caveat about extras:
    a) PyPI packages with Rust extensions are generally large, since they need to contain the compiled Rust code along with its dependencies. See for instance https://pypi.org/project/polars/#files or plugins such as https://pypi.org/project/polars-deltalake/#files
    b) We rely on a forked Polars 1.17.1, because I made a patch adding anonymous scan (we needed that to be able to add streaming/out-of-core processing). I will try to open a PR to the official repo this week. There is also a mismatch between pyo3 0.22 and 0.23. Once we sort this out, we should be able to cut the size by a factor of 2.
    c) I'm not sure whether there is anything similar to the extras mechanism for these joint Python-Rust packages. Rust has its 'feature' mechanism, but I'm not sure how, or whether at all, it can be used in this case without forcing end users to compile anything. That is something I would like to avoid, even at the cost of a larger size.

Regarding FASTA: please open an issue for that, ideally with a test sample. I added this functionality based on https://github.com/wheretrue/exon, but with only very basic unit tests (https://github.com/biodatageeks/polars-bio/blob/master/tests/test_io.py), so chances are high that something is wrong with it.

@apcamargo
Author

When I mentioned the size, I was specifically referring to DataFusion. But if Maturin doesn't allow extras and DataFusion is required for some functionality, I guess there's no way around it.

Thank you for your answers!
