Support for sequences #78

apcamargo · 2025-01-20T03:17:28Z

Are there any plans to support data types beyond intervals? I came across this repository while planning to create a plugin with the same name, although my original idea was somewhat different. I think it would be really useful to include functionality for reading FASTA files into tables, enabling sequence operations within a dataframe framework while leveraging Polars' speed.

For instance, a protein FASTA file could be parsed into a two-column dataframe, where one column contains sequence names and the other contains the sequences themselves. This would open the door to various analyses, such as filtering sequences, calculating amino acid frequencies, or deriving numeric protein properties (e.g., molecular weight, isoelectric point). Similarly, for DNA sequences, useful features could include computing the reverse complement, GC content, GC/AT skew, and other sequence-based metrics.
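
To make the two-column idea concrete, here is a minimal sketch in plain Python. It is not polars-bio's API: `parse_fasta`, `gc_content`, and `reverse_complement` are hypothetical helper names, and the parser assumes a well-formed FASTA text stream with no quality lines.

```python
from io import StringIO

# Translation table for complementing unambiguous DNA bases (hypothetical helper data).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def parse_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA-formatted text stream."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line)  # sequence may span several lines
    if name is not None:
        yield name, "".join(chunks)


def gc_content(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    upper = seq.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)


def reverse_complement(seq):
    """Reverse complement of a DNA sequence made of A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]


# Toy input; real data would come from an opened file instead of StringIO.
fasta_text = ">seq1\nACGT\nGGCC\n>seq2\nATAT\n"
records = list(parse_fasta(StringIO(fasta_text)))

# Column-oriented layout: one list per column, as in a two-column dataframe.
table = {
    "name": [name for name, _ in records],
    "sequence": [seq for _, seq in records],
}
table["gc_content"] = [gc_content(seq) for seq in table["sequence"]]
table["reverse_complement"] = [reverse_complement(seq) for seq in table["sequence"]]
```

A dict of equal-length lists like `table` can be passed directly to `pl.DataFrame(...)`, after which the per-sequence metrics become ordinary columns to filter or aggregate on.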

mwiewior · 2025-01-20T06:50:26Z

Hey @apcamargo - in short, absolutely yes. As you can see, we've just started, but the idea is to support as many useful operations and file formats as possible. We already have some basic FASTA/FASTQ support - see: https://biodatageeks.org/polars-bio/api/#polars_bio.read_fasta - but it's not yet well tested in terms of either performance or quality. I'm happy to discuss any further improvements and feature requests, and I'm open to contributions if you would like to join forces.

apcamargo · 2025-01-21T00:28:46Z

Thanks for the response, @mwiewior!

I tried to use pb.io.read_fasta, but got the following error:

PanicException: called `Option::unwrap()` on a `None` value

From what I’ve seen in the code, it seems that the actual FASTA parser hasn’t been implemented yet (?)

I’ve never quite figured out how to write Polars IO plugins, but my plan was to leverage needletail, as it is the fastest parser available. I recall reading somewhere (maybe in the Polars Discord) that IO plugins were planned to be implemented at the Python level (as opposed to Rust). With that in mind, I started working on Needletail’s Python module (see onecodex/needletail#92 and onecodex/needletail#91), and I could fork it if the parser does need to be written in Python. Otherwise, I believe using Needletail’s Rust crate would be the better approach.

That said, I haven’t made much concrete progress yet. So far, I’ve only written some Rust functions to infer the alphabet of a sequence (DNA, RNA, or protein) and to compute a few protein properties (e.g., molecular weight, charge, isoelectric point). I’d be happy to share these with you if you’re interested.
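
As an illustration of what alphabet inference can look like, here is a minimal Python sketch. It is not the Rust code mentioned above: `infer_alphabet` and the character sets are assumptions, and the precedence (DNA, then RNA, then protein) is one possible convention.

```python
# Assumed character sets; ambiguity codes beyond N and X are deliberately left out.
DNA_LETTERS = set("ACGTN")
RNA_LETTERS = set("ACGUN")
PROTEIN_LETTERS = set("ACDEFGHIKLMNPQRSTVWYX*")


def infer_alphabet(seq):
    """Guess whether a sequence is 'dna', 'rna', 'protein', or 'unknown'.

    The check is a subset test against each alphabet, tried in order, so a
    sequence made only of A/C/G is reported as DNA even though it is also a
    valid RNA or protein string.
    """
    letters = set(seq.upper())
    if not letters:
        return "unknown"
    if letters <= DNA_LETTERS:
        return "dna"
    if letters <= RNA_LETTERS:
        return "rna"
    if letters <= PROTEIN_LETTERS:
        return "protein"
    return "unknown"
```

Short sequences are inherently ambiguous under this approach (a peptide such as `ACGT` would be classified as DNA), so a real implementation would likely need a length threshold or a frequency-based heuristic.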

By the way, I'm curious about DataFusion's role in polars-bio. I noticed that it’s quite heavy to install and was wondering if it could be included as an extra.

mwiewior · 2025-01-21T06:06:46Z

Just to quickly comment on the size:

Yes, it's heavy, and the reason is two-fold (a and b); c addresses the question about extras:
a) Rust extensions to PyPI packages generally are heavy - they need to contain compiled Rust code along with its dependencies - see for instance https://pypi.org/project/polars/#files or plugins such as https://pypi.org/project/polars-deltalake/#files
b) We rely on a forked polars 1.17.1, as I made a patch adding anonymous scan (we needed that to be able to add streaming/out-of-core processing) - I will try to open a PR to the official repo this week. There is also some mismatch between pyo3 0.22 and 0.23 - once we sort this out we should be able to cut the size by a factor of 2.
c) I'm not sure whether there is anything similar to an extras mechanism for these joint Python-Rust packages. Rust has its 'feature' mechanism, but I'm not sure how, or whether at all, it can be used in this case without forcing end users to compile anything - this I would like to avoid even at the cost of a larger size.

Regarding FASTA - please open an issue for that, ideally with a test sample. I added this functionality based on https://github.com/wheretrue/exon but with only very basic unit tests (https://github.com/biodatageeks/polars-bio/blob/master/tests/test_io.py), so chances are high that something is wrong with it.

apcamargo · 2025-01-26T22:56:32Z

When I mentioned the size, I was specifically referring to DataFusion. But if Maturin doesn't allow extras and DataFusion is required for some functionality, I guess there's no way around it.

Thank you for your answers!
