Support for sequences #78
Comments
Hey @apcamargo - in short, absolutely yes. As you can see, we've just started, but the idea is to support as many useful operations and file formats as possible. We already have some basic FASTA/FASTQ support (see https://biodatageeks.org/polars-bio/api/#polars_bio.read_fasta), but it's not yet well tested in terms of either performance or quality. I'm happy to discuss further improvements and feature requests, and I'm open to contributions if you would like to join forces.
Thanks for the response, @mwiewior! I tried to use
From what I’ve seen in the code, it seems that the actual FASTA parser hasn’t been implemented yet (?). I’ve never quite figured out how to write Polars IO plugins, but my plan was to leverage
That said, I haven’t made much concrete progress yet. So far, I’ve only written some Rust functions to infer the alphabet of a sequence (DNA, RNA, or protein) and to compute a few protein properties (e.g., molecular weight, charge, isoelectric point). I’d be happy to share these with you if you’re interested.
By the way, I'm curious what DataFusion's role is in polars-bio. I noticed that it’s quite heavy to install and was wondering if it could be included as an extra.
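To make the alphabet-inference idea concrete, here is a minimal sketch in plain Python. It is not the Rust code mentioned above, which was not shared in this thread; the function name `infer_alphabet`, the letter sets, and the rule that a sequence valid as both DNA and RNA is reported as DNA are all assumptions made for illustration.

```python
# Hypothetical letter sets; real tools usually also accept IUPAC ambiguity codes.
DNA_LETTERS = set("ACGTN")
RNA_LETTERS = set("ACGUN")
PROTEIN_LETTERS = set("ACDEFGHIKLMNPQRSTVWY")


def infer_alphabet(seq):
    """Guess whether a sequence is DNA, RNA, or protein from its letters.

    Assumption: nucleotide alphabets are checked first, so a sequence made
    only of A/C/G/N is reported as DNA even though it is also valid RNA
    and valid protein.
    """
    letters = set(seq.upper())
    if not letters:
        return "unknown"
    if letters <= DNA_LETTERS:
        return "DNA"
    if letters <= RNA_LETTERS:
        return "RNA"
    if letters <= PROTEIN_LETTERS:
        return "protein"
    return "unknown"
```

A set-based check like this is cheap, but it cannot tell a short peptide such as "GATC" from DNA, so a real implementation would probably need a length or frequency threshold as well.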
Just to quickly comment on the size:
Regarding FASTA - please open an issue for that, ideally with a test sample. I added this functionality based on https://github.com/wheretrue/exon, but with only very basic unit tests (https://github.com/biodatageeks/polars-bio/blob/master/tests/test_io.py), so chances are high that something is wrong with it.
When I mentioned the size, I was specifically referring to DataFusion. But if Maturin doesn't allow extras and DataFusion is required for some functionality, I guess there's no way around it. Thank you for your answers!
Hi @mwiewior
Are there any plans to support data types beyond intervals? I came across this repository while planning to create a plugin with the same name, although my original idea was somewhat different. I think it would be really useful to include functionality for reading FASTA files into tables, enabling sequence operations within a dataframe framework while leveraging Polars' speed.
For instance, a protein FASTA file could be parsed into a two-column dataframe, where one column contains sequence names and the other contains the sequences themselves. This would open the door to various analyses, such as filtering sequences, calculating amino acid frequencies, or deriving numeric protein properties (e.g., molecular weight, isoelectric point). Similarly, for DNA sequences, useful features could include computing the reverse complement, GC content, GC/AT skew, and other sequence-based metrics.
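As a rough illustration of the two-column layout and the DNA metrics described above, here is a minimal standard-library Python sketch. It is not polars-bio's implementation or API; `parse_fasta`, `reverse_complement`, and `gc_content` are hypothetical helper names, and the parser assumes well-formed FASTA text where the sequence name is the first whitespace-delimited token of the header.

```python
from io import StringIO


def parse_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA-formatted text stream."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            # Assumption: the name is the first token of the header line.
            name, chunks = (header[0] if header else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


# Translation table for unambiguous DNA bases only (an assumption).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def gc_content(seq):
    """Return the fraction of G and C bases (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    upper = seq.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)


fasta_text = ">seq1 example\nACGT\nGGCC\n>seq2\nATAT\n"
records = list(parse_fasta(StringIO(fasta_text)))

# Column-oriented layout: one list of names, one list of sequences.
columns = {
    "name": [name for name, _ in records],
    "sequence": [seq for _, seq in records],
}
```

A column dictionary shaped like `columns` is the kind of input a dataframe constructor typically accepts, so the per-sequence functions could then be applied to the `sequence` column.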