Identify common (sample) metadata values, and replace them with unique integers #337

Open
fedarko opened this issue Aug 18, 2020 · 6 comments
fedarko commented Aug 18, 2020

@kwcantrell brought this up in this morning's meeting. Currently sample metadata is stored as follows in the HTML (this is from the moving pictures dataset, formatting modified for ease of reading):

["barcode-sequence", "body-site", "year", "month", "day", "subject", "reported-antibiotic-usage", "days-since-experiment-start"],
[["AGTGCGATGCGT", "gut", "2009.0", "3.0", "17.0", "subject-1", "No", "140.0"],
 ["ATGGCAGCTCTA", "gut", "2008.0", "10.0", "28.0", "subject-2", "Yes", "0.0"],
 ["CTGAGATACGCG", "gut", "2009.0", "1.0", "20.0", "subject-2", "No", "84.0"],
 ["CCGACTGAGATG", "gut", "2009.0", "3.0", "17.0", "subject-2", "No", "140.0"],
 ["CCTCTCGTGATC", "gut", "2009.0", "4.0", "14.0", "subject-2", "No", "168.0"],
 ["ACACACTATGGC", "gut", "2009.0", "1.0", "20.0", "subject-1", "No", "84.0"],
 ["ACTACGTGTGGT", "gut", "2009.0", "2.0", "17.0", "subject-1", "No", "112.0"],
 ["AGCTGACTAGTC", "gut", "2008.0", "10.0", "28.0", "subject-1", "Yes", "0.0"],
 ["ACGATGCGACCA", "left palm", "2009.0", "1.0", "20.0", "subject-1", "No", "84.0"],
 ["AGCTATCCACGA", "left palm", "2009.0", "2.0", "17.0", "subject-1", "No", "112.0"],
 ["ATGCAGCTCAGT", "left palm", "2009.0", "3.0", "17.0", "subject-1", "No", "140.0"],
 ["CACGTGACATGT", "left palm", "2009.0", "4.0", "14.0", "subject-1", "No", "168.0"],
 ["CATATCGCAGTT", "left palm", "2008.0", "10.0", "28.0", "subject-2", "Yes", "0.0"],
 ["CGTGCATTATCA", "left palm", "2009.0", "1.0", "20.0", "subject-2", "No", "84.0"],
 ["CTAACGCAGTCA", "left palm", "2009.0", "3.0", "17.0", "subject-2", "No", "140.0"],
 ["CTCAATGACTCA", "left palm", "2009.0", "4.0", "14.0", "subject-2", "No", "168.0"],
 ["ACAGTTGCGCGA", "right palm", "2008.0", "10.0", "28.0", "subject-1", "Yes", "0.0"],
 ["CACGACAGGCTA", "right palm", "2009.0", "1.0", "20.0", "subject-1", "No", "84.0"],
 ["AGTGTCACGGTG", "right palm", "2009.0", "2.0", "17.0", "subject-1", "No", "112.0"],
 ["CAAGTGAGAGAG", "right palm", "2009.0", "3.0", "17.0", "subject-1", "No", "140.0"],
 ["CATCGTATCAAC", "right palm", "2009.0", "4.0", "14.0", "subject-1", "No", "168.0"],
 ["ATCGATCTGTGG", "right palm", "2008.0", "10.0", "28.0", "subject-2", "Yes", "0.0"],
 ["GCGTTACACACA", "right palm", "2009.0", "3.0", "17.0", "subject-2", "No", "140.0"],
 ["GAACTGTATCTC", "right palm", "2009.0", "4.0", "14.0", "subject-2", "No", "168.0"],
 ["CTCGTGGAGTAG", "right palm", "2009.0", "1.0", "20.0", "subject-2", "No", "84.0"],
 ["CAGTGTCAGGAC", "tongue", "2008.0", "10.0", "28.0", "subject-1", "Yes", "0.0"],
 ["ATCTTAGACTGC", "tongue", "2009.0", "1.0", "20.0", "subject-1", "No", "84.0"],
 ["CAGACATTGCGT", "tongue", "2009.0", "2.0", "17.0", "subject-1", "No", "112.0"],
 ["CGATGCACCAGA", "tongue", "2009.0", "3.0", "17.0", "subject-1", "No", "140.0"],
 ["CTAGAGACTCTT", "tongue", "2009.0", "4.0", "14.0", "subject-1", "No", "168.0"],
 ["CTGGACTCATAG", "tongue", "2008.0", "10.0", "28.0", "subject-2", "Yes", "0.0"],
 ["GAGGCTCATCAT", "tongue", "2009.0", "1.0", "20.0", "subject-2", "No", "84.0"],
 ["GATACGTCCTGA", "tongue", "2009.0", "3.0", "17.0", "subject-2", "No", "140.0"],
 ["GATTAGCACTCT", "tongue", "2009.0", "4.0", "14.0", "subject-2", "No", "168.0"]]

It would be useful to identify common string values in the metadata, map these to unique integers, and then replace these values in the metadata. Then, when looking up sample metadata info (e.g. from the BIOM table), integer values would be replaced with their original string values. (This'd work because all metadata is stored as strings in Empress right now.)

An example of what this might look like, by just replacing ten nonunique values I arbitrarily picked:

{"gut": 0, "left palm": 1, "right palm": 2, "tongue": 3, "2008.0": 4, "2009.0": 5, "subject-1": 6, "subject-2": 7, "No": 8, "Yes": 9},
["barcode-sequence", "body-site", "year", "month", "day", "subject", "reported-antibiotic-usage", "days-since-experiment-start"],
[["AGTGCGATGCGT", 0, 5, "3.0", "17.0", 6, 8, "140.0"],
 ["ATGGCAGCTCTA", 0, 4, "10.0", "28.0", 7, 9, "0.0"],
 ["CTGAGATACGCG", 0, 5, "1.0", "20.0", 7, 8, "84.0"],
 ["CCGACTGAGATG", 0, 5, "3.0", "17.0", 7, 8, "140.0"],
 ["CCTCTCGTGATC", 0, 5, "4.0", "14.0", 7, 8, "168.0"],
 ["ACACACTATGGC", 0, 5, "1.0", "20.0", 6, 8, "84.0"],
 ["ACTACGTGTGGT", 0, 5, "2.0", "17.0", 6, 8, "112.0"],
 ["AGCTGACTAGTC", 0, 4, "10.0", "28.0", 6, 9, "0.0"],
 ["ACGATGCGACCA", 1, 5, "1.0", "20.0", 6, 8, "84.0"],
 ["AGCTATCCACGA", 1, 5, "2.0", "17.0", 6, 8, "112.0"],
 ["ATGCAGCTCAGT", 1, 5, "3.0", "17.0", 6, 8, "140.0"],
 ["CACGTGACATGT", 1, 5, "4.0", "14.0", 6, 8, "168.0"],
 ["CATATCGCAGTT", 1, 4, "10.0", "28.0", 7, 9, "0.0"],
 ["CGTGCATTATCA", 1, 5, "1.0", "20.0", 7, 8, "84.0"],
 ["CTAACGCAGTCA", 1, 5, "3.0", "17.0", 7, 8, "140.0"],
 ["CTCAATGACTCA", 1, 5, "4.0", "14.0", 7, 8, "168.0"],
 ["ACAGTTGCGCGA", 2, 4, "10.0", "28.0", 6, 9, "0.0"],
 ["CACGACAGGCTA", 2, 5, "1.0", "20.0", 6, 8, "84.0"],
 ["AGTGTCACGGTG", 2, 5, "2.0", "17.0", 6, 8, "112.0"],
 ["CAAGTGAGAGAG", 2, 5, "3.0", "17.0", 6, 8, "140.0"],
 ["CATCGTATCAAC", 2, 5, "4.0", "14.0", 6, 8, "168.0"],
 ["ATCGATCTGTGG", 2, 4, "10.0", "28.0", 7, 9, "0.0"],
 ["GCGTTACACACA", 2, 5, "3.0", "17.0", 7, 8, "140.0"],
 ["GAACTGTATCTC", 2, 5, "4.0", "14.0", 7, 8, "168.0"],
 ["CTCGTGGAGTAG", 2, 5, "1.0", "20.0", 7, 8, "84.0"],
 ["CAGTGTCAGGAC", 3, 4, "10.0", "28.0", 6, 9, "0.0"],
 ["ATCTTAGACTGC", 3, 5, "1.0", "20.0", 6, 8, "84.0"],
 ["CAGACATTGCGT", 3, 5, "2.0", "17.0", 6, 8, "112.0"],
 ["CGATGCACCAGA", 3, 5, "3.0", "17.0", 6, 8, "140.0"],
 ["CTAGAGACTCTT", 3, 5, "4.0", "14.0", 6, 8, "168.0"],
 ["CTGGACTCATAG", 3, 4, "10.0", "28.0", 7, 9, "0.0"],
 ["GAGGCTCATCAT", 3, 5, "1.0", "20.0", 7, 8, "84.0"],
 ["GATACGTCCTGA", 3, 5, "3.0", "17.0", 7, 8, "140.0"],
 ["GATTAGCACTCT", 3, 5, "4.0", "14.0", 7, 8, "168.0"]]

The metadata looks a lot smaller (and I haven't even replaced nonunique stuff in the month/day/etc. fields). For massive datasets with lots of metadata (e.g. the EMP) this could be really useful.
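A minimal sketch of this idea (the function and variable names here are hypothetical, not the actual Empress code): count every value across all fields, assign a unique integer to each value that occurs more than once, and substitute the integers into the rows:

```python
from collections import Counter

def encode_nonunique(rows):
    # Count every value across all fields of all rows
    counts = Counter(v for row in rows for v in row)
    # Assign a unique integer to each value occurring more than once
    mapping = {}
    for value, count in counts.items():
        if count > 1:
            mapping[value] = len(mapping)
    # Replace non-unique values with their integer; leave unique
    # values (e.g. barcode sequences) as strings
    encoded = [[mapping.get(v, v) for v in row] for row in rows]
    return mapping, encoded

# Toy subset of the moving pictures metadata above
rows = [
    ["AGTGCGATGCGT", "gut", "2009.0", "subject-1", "No"],
    ["ATGGCAGCTCTA", "gut", "2008.0", "subject-2", "Yes"],
    ["CTGAGATACGCG", "gut", "2009.0", "subject-2", "No"],
]
mapping, encoded = encode_nonunique(rows)
# "gut" gets an integer code; the unique barcodes stay as strings
```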

kwcantrell commented Aug 19, 2020

This kind of encoding will be greatly beneficial for taxonomy since a lot of nodes will share level 1/2/... values.

It might be better to reverse
{"gut": 0, "left palm": 1, "right palm": 2, "tongue": 3, "2008.0": 4, "2009.0": 5, "subject-1": 6, "subject-2": 7, "No": 8, "Yes": 9}

and make it
{0: "gut", 1: "left palm", 2: "right palm", 3: "tongue", 4: "2008.0", 5: "2009.0", 6: "subject-1", 7: "subject-2", 8: "No", 9: "Yes"}

That way, converting from a number back to its value would be easier.

Also, do you think it would be worthwhile to have some condition such as if > 20% of values in a metadata field share a value then encode the field using the above suggested method? The reason being, for example, the majority (or all?) features have a unique barcode-sequence in the above example so encoding them will add additional memory since we will still need to store all the sequence + add an additional number to represent all the sequence.

fedarko commented Aug 19, 2020

Agreed that this'll help a lot. I think we could go even a bit further and just store the encodings in an array, similar to how we handled treeData:

["gut", "left palm", "right palm", "tongue", "2008.0", "2009.0", "subject-1", "subject-2", "No", "Yes"]

... which can be accessed the same way (0 -> "gut", etc.). As an added bonus, if we set the encodings so that the first element is the most common value, the second is the next most common value, and so on, then it'll be really easy to interpret this line in the HTML and see which strings are most common throughout the metadata.
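The frequency-ordered array idea could be sketched like this (hypothetical helper name, not the actual implementation; assumes every value gets encoded):

```python
from collections import Counter

def encode_as_array(rows):
    counts = Counter(v for row in rows for v in row)
    # most_common() sorts by descending count, so values[0] is the
    # most frequent string anywhere in the metadata
    values = [v for v, _ in counts.most_common()]
    index = {v: i for i, v in enumerate(values)}
    encoded = [[index[v] for v in row] for row in rows]
    return values, encoded

values, encoded = encode_as_array(
    [["gut", "No"], ["gut", "Yes"], ["gut", "No"]]
)
# Decoding an integer i back to its string is just values[i]
```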

> Also, do you think it would be worthwhile to have some condition such as if > 20% of values in a metadata field share a value then encode the field using the above suggested method? The reason being, for example, the majority (or all?) features have a unique barcode-sequence in the above example so encoding them will add additional memory since we will still need to store all the sequence + add an additional number to represent all the sequence.

This is a good question - there's probably a fancy way of determining if it's "worth" encoding a given value or not (probably involves looking at the value's length, and how many times it occurs, versus the added memory / space to encode it). As a first pass, though, I think it makes sense to just encode all values that occur, say, more than once (or maybe more than twice).
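One rough way to formalize "worth it" (a back-of-envelope assumption about serialized size, not a measured model): compare the characters saved by replacing each occurrence with a short integer against the cost of still storing the value once in the encoding structure:

```python
def worth_encoding(value, occurrences, index_digits=2):
    # Each replaced occurrence drops the quoted string (len(value)
    # plus 2 quote characters) in favor of a small integer literal
    saved = occurrences * (len(value) + 2 - index_digits)
    # But the value itself still has to be stored once, in the
    # encoding array/dict
    overhead = len(value) + 2
    return saved > overhead

# "left palm" appearing 8 times is clearly worth encoding; a unique
# 12-character barcode appearing once is not.
```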

> in a metadata field

I'm not confident about this, but I think it makes sense to allow "duplicates" across metadata fields. A reason being that if, say, a metadata file has like 5 different Year fields, then it makes sense to lump those all together.

I think this is going to be pretty useful :)

kwcantrell commented Aug 19, 2020

> Agreed that this'll help a lot. I think we could go even a bit further and just store the encodings in an array, similar to how we handled treeData:
>
> ["gut", "left palm", "right palm", "tongue", "2008.0", "2009.0", "subject-1", "subject-2", "No", "Yes"]

I like this solution.

> I'm not confident about this, but I think it makes sense to allow "duplicates" across metadata fields. A reason being that if, say, a metadata file has like 5 different Year fields, then it makes sense to lump those all together.

That's a good point. It would make a lot more sense to encode, for example, "gut" the same across all metadata fields.

> This is a good question - there's probably a fancy way of determining if it's "worth" encoding a given value or not (probably involves looking at the value's length, and how many times it occurs, versus the added memory / space to encode it). As a first pass, though, I think it makes sense to just encode all values that occur, say, more than once (or maybe more than twice).

In that case, maybe as a first pass, we should just encode all values regardless of their occurrence.

ElDeveloper commented Aug 19, 2020 via email

kwcantrell commented:
@ElDeveloper that is probably the better solution, because we wouldn't have to refactor the JS. All we would need to do is implement the compression/decompression.

fedarko added a commit to fedarko/empress that referenced this issue Aug 30, 2020
fedarko commented Aug 30, 2020

I took a bit of time and put together an early version of the python compression code for this. Here's what the code produces on the moving pictures sample metadata, with every non-unique value compressed (compare with the stuff above on the same dataset):

["barcode-sequence", "body-site", "year", "month", "day", "subject", "reported-antibiotic-usage",
 "days-since-experiment-start"],
["2009.0", "No", "subject-1", "subject-2", "17.0", "right palm", "tongue", "gut", "3.0", "140.0",
 "1.0", "20.0", "84.0", "left palm", "2008.0", "10.0", "28.0", "Yes", "0.0", "4.0", "14.0",
 "168.0", "2.0", "112.0"],
[["AGTGCGATGCGT", 7, 0, 8, 4, 2, 1, 9],
 ["ATGGCAGCTCTA", 7, 14, 15, 16, 3, 17, 18],
 ["CTGAGATACGCG", 7, 0, 10, 11, 3, 1, 12],
 ["CCGACTGAGATG", 7, 0, 8, 4, 3, 1, 9],
 ["CCTCTCGTGATC", 7, 0, 19, 20, 3, 1, 21],
 ["ACACACTATGGC", 7, 0, 10, 11, 2, 1, 12],
 ["ACTACGTGTGGT", 7, 0, 22, 4, 2, 1, 23],
 ["AGCTGACTAGTC", 7, 14, 15, 16, 2, 17, 18],
 ["ACGATGCGACCA", 13, 0, 10, 11, 2, 1, 12],
 ["AGCTATCCACGA", 13, 0, 22, 4, 2, 1, 23],
 ["ATGCAGCTCAGT", 13, 0, 8, 4, 2, 1, 9],
 ["CACGTGACATGT", 13, 0, 19, 20, 2, 1, 21],
 ["CATATCGCAGTT", 13, 14, 15, 16, 3, 17, 18],
 ["CGTGCATTATCA", 13, 0, 10, 11, 3, 1, 12],
 ["CTAACGCAGTCA", 13, 0, 8, 4, 3, 1, 9],
 ["CTCAATGACTCA", 13, 0, 19, 20, 3, 1, 21],
 ["ACAGTTGCGCGA", 5, 14, 15, 16, 2, 17, 18],
 ["CACGACAGGCTA", 5, 0, 10, 11, 2, 1, 12],
 ["AGTGTCACGGTG", 5, 0, 22, 4, 2, 1, 23],
 ["CAAGTGAGAGAG", 5, 0, 8, 4, 2, 1, 9],
 ["CATCGTATCAAC", 5, 0, 19, 20, 2, 1, 21],
 ["ATCGATCTGTGG", 5, 14, 15, 16, 3, 17, 18],
 ["GCGTTACACACA", 5, 0, 8, 4, 3, 1, 9],
 ["GAACTGTATCTC", 5, 0, 19, 20, 3, 1, 21],
 ["CTCGTGGAGTAG", 5, 0, 10, 11, 3, 1, 12],
 ["CAGTGTCAGGAC", 6, 14, 15, 16, 2, 17, 18],
 ["ATCTTAGACTGC", 6, 0, 10, 11, 2, 1, 12],
 ["CAGACATTGCGT", 6, 0, 22, 4, 2, 1, 23],
 ["CGATGCACCAGA", 6, 0, 8, 4, 2, 1, 9],
 ["CTAGAGACTCTT", 6, 0, 19, 20, 2, 1, 21],
 ["CTGGACTCATAG", 6, 14, 15, 16, 3, 17, 18],
 ["GAGGCTCATCAT", 6, 0, 10, 11, 3, 1, 12],
 ["GATACGTCCTGA", 6, 0, 8, 4, 3, 1, 9],
 ["GATTAGCACTCT", 6, 0, 19, 20, 3, 1, 21]]

Even on this small dataset, the space saving is pretty clear -- ls -ahlt puts the sample metadata info above at 1.8K and the old sample metadata info at 2.8K. For the EMP empress.html file (without feature metadata since I don't have it), this takes it down from 119 MB to 108 MB.

I think this method may have some merit besides (or in addition to) using zip data compression; with that, as far as I can tell, we'd still need to uncompress the original data -- which would involve loading a lot of redundant strings. Here, even the uncompressed data takes up less space, since it's mostly numbers. (Also, while implementing zip.js might take some careful thinking and refactoring, fumbling my way through this solution has gone pretty quickly ...)
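For completeness, the decompression side is just an index lookup; here's a Python sketch (the real decompression would live in the JS, and this helper name is hypothetical):

```python
def decompress(values, encoded_rows):
    # Integers index into the values array; strings were left
    # uncompressed and pass through unchanged
    return [
        [values[v] if isinstance(v, int) else v for v in row]
        for row in encoded_rows
    ]

row = decompress(["gut", "No"], [["AGTGCGATGCGT", 0, 1, "3.0"]])[0]
```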

As @kwcantrell mentioned, this approach should be applicable to feature metadata as well (and would likely be even more useful there, since there are gonna be lots of "k__Bacteria"s and so on).

fedarko added a commit to fedarko/empress that referenced this issue Aug 30, 2020
work on biocore#355, which i think makes sense to bundle with biocore#337
fedarko added a commit to fedarko/empress that referenced this issue Aug 31, 2020
JS and tests haven't been updated, but this seems to be working
properly. already seems to be saving a lot of space.
fedarko added a commit to fedarko/empress that referenced this issue Aug 31, 2020
fedarko added a commit to fedarko/empress that referenced this issue Aug 31, 2020
fedarko added a commit to fedarko/empress that referenced this issue Oct 16, 2020
This works now! Well, kinda. The tests are still broken, and the JS
code still doesn't uncompress the sample metadata. But you can at
least generate QZVs now!
fedarko added a commit to fedarko/empress that referenced this issue Oct 16, 2020
Just gotta fix, you know, the other 12 failing ones :P
fedarko self-assigned this Oct 29, 2020