Identify common (sample) metadata values, and replace them with unique integers #337
This kind of encoding will be greatly beneficial for taxonomy, since a lot of nodes will share level 1/2/... values. It might be better to reverse the mapping, so converting from the number back to the value would be easier. Also, do you think it would be worthwhile to have some condition such as: if > 20% of values in a metadata field share a value, then encode the field using the above suggested method? The reason being that, for example, the majority (or all?) of features have a unique barcode-sequence in the above example, so encoding them would add additional memory, since we would still need to store all of the sequences plus an additional number to represent each sequence.
Agreed that this'll help a lot. I think we could go even a bit further and just store the encodings in an array, similar to how we handled ..., which can be accessed the same way.
This is a good question - there's probably a fancy way of determining if it's "worth" encoding a given value or not (probably involves looking at the value's length, and how many times it occurs, versus the added memory / space to encode it). As a first pass, though, I think it makes sense to just encode all values that occur, say, more than once (or maybe more than twice).
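For illustration only, here is a rough sketch (not from the Empress codebase; the names and the cost model are made up) of what that kind of "is it worth encoding?" check could look like, comparing the bytes saved by replacing every occurrence of a string with an integer against the cost of storing the string once in the mapping:

```python
def worth_encoding(value, occurrences, int_repr_len=2):
    # Cost of leaving the string inline everywhere it appears...
    current_cost = len(value) * occurrences
    # ...versus storing it once in the mapping plus a small integer per
    # occurrence (int_repr_len is a rough guess at the integer's on-disk size).
    encoded_cost = len(value) + int_repr_len * occurrences
    return encoded_cost < current_cost


# The simpler first-pass rule mentioned above: encode anything seen more than once.
def worth_encoding_simple(occurrences):
    return occurrences > 1
```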
I'm not confident about this, but I think it makes sense to allow "duplicates" across metadata fields. A reason being that if, say, a metadata file has like 5 different Year fields, then it makes sense to lump those all together. I think this is going to be pretty useful :)
I like this solution.
That's a good point. It would make a lot more sense to encode, for example, "gut" the same across all metadata fields.
In that case, maybe as a first pass, we should just encode all values regardless of their occurrence.
The more general solution to this issue would be to support data compression in Python and JS. For example, the JSON string representing the mapping file is zipped and the byte stream is encoded in base64 for JavaScript to read, decompress, and load from JSON. It seems like there's a JS library for handling zipped data:
https://gildas-lormeau.github.io/zip.js/core-api.html

Zipping the EMP mapping file we've been using for testing makes the data go from 27 MB to 2.3 MB.
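A minimal sketch of what the Python side of this could look like, assuming zlib + base64 rather than a literal .zip archive (this is not the actual Empress code; the JS side would need a matching inflate library such as pako, or zip.js if a real zip archive is used):

```python
import base64
import json
import zlib


def compress_metadata(metadata_dict):
    # Serialize to JSON, compress, then base64-encode so the bytes can be
    # embedded in the HTML as a plain string.
    raw = json.dumps(metadata_dict).encode("utf-8")
    return base64.b64encode(zlib.compress(raw, 9)).decode("ascii")


def decompress_metadata(b64_string):
    # Mirrors what the JS side would do: base64-decode, inflate, parse JSON.
    raw = zlib.decompress(base64.b64decode(b64_string))
    return json.loads(raw.decode("utf-8"))
```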
@ElDeveloper that is probably the better solution, because we wouldn't have to refactor the JS. All we would need to do is implement the compression/decompression.
I took a bit of time and put together an early version of the Python compression code for this. Here's what the code produces on the moving pictures sample metadata, with every non-unique value compressed (compare with the stuff above on the same dataset):
Even on this small dataset, the space saving is pretty clear -- I think this method may have some merit besides (or in addition to) using zip data compression; with that, as far as I can tell, we'd still need to uncompress the original data -- which would involve loading a lot of redundant strings. Here, even the uncompressed data takes up less space, since it's mostly numbers. (Also, while implementing zip.js might take some careful thinking and refactoring, fumbling my way through this solution has gone pretty quickly...) As @kwcantrell mentioned, this approach should be applicable to feature metadata as well (and would likely be even more useful there, since there are gonna be lots of shared taxonomy values).
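A hypothetical sketch of that encoding strategy (the real code lives in the linked PR and almost certainly differs in its details): every value that occurs more than once is replaced by an index into a shared list of unique values, so each repeated string is stored only once.

```python
from collections import Counter


def encode_metadata(rows):
    """rows: list of lists of string metadata values (one list per sample)."""
    counts = Counter(v for row in rows for v in row)
    unique_values = []    # index -> original string
    value_to_index = {}   # original string -> index
    encoded_rows = []
    for row in rows:
        encoded_row = []
        for v in row:
            if counts[v] > 1:                    # only encode non-unique values
                if v not in value_to_index:
                    value_to_index[v] = len(unique_values)
                    unique_values.append(v)
                encoded_row.append(value_to_index[v])
            else:
                encoded_row.append(v)            # unique values stay as strings
        encoded_rows.append(encoded_row)
    return unique_values, encoded_rows
```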
Work on biocore#355, which I think makes sense to bundle with biocore#337.
JS and tests haven't been updated, but this seems to be working properly. Already seems to be saving a lot of space.
Partway through biocore#337.
This works now! Well, kinda. The tests are still broken, and the JS code still doesn't uncompress the sample metadata. But you can at least generate QZVs now!
Just gotta fix, you know, the other 12 failing ones :P
@kwcantrell brought this up in this morning's meeting. Currently sample metadata is stored as follows in the HTML (this is from the moving pictures dataset, formatting modified for ease of reading):
It would be useful to identify common string values in the metadata, map these to unique integers, and then replace these values in the metadata. Then, when looking up sample metadata info in the BIOM table or something, numeric values would be replaced with their original string value. (This'd work because all metadata is stored as strings in Empress right now.)
An example of what this might look like, by just replacing ten nonunique values I arbitrarily picked:
The metadata looks a lot smaller (and I haven't even replaced nonunique stuff in the month/day/etc. fields). For massive datasets with lots of metadata (e.g. the EMP) this could be really useful.
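To make the idea concrete, here is a purely hypothetical before/after (made-up sample IDs, values, and layout -- not the actual moving pictures metadata or Empress's internal format):

```python
# Before: every sample repeats the same strings.
before = {
    "sample-1": {"body-site": "gut",    "subject": "subject-1", "year": "2008"},
    "sample-2": {"body-site": "gut",    "subject": "subject-2", "year": "2009"},
    "sample-3": {"body-site": "tongue", "subject": "subject-1", "year": "2009"},
}

# After: common strings live once in a list, and per-sample values become
# small integer indices into that list.
unique_values = ["gut", "subject-1", "2008", "subject-2", "2009", "tongue"]
after = {
    "sample-1": {"body-site": 0, "subject": 1, "year": 2},
    "sample-2": {"body-site": 0, "subject": 3, "year": 4},
    "sample-3": {"body-site": 5, "subject": 1, "year": 4},
}

# Looking a value back up is just an index into unique_values:
assert unique_values[after["sample-1"]["body-site"]] == "gut"
```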