Identify common (sample) metadata values, and replace them with unique integers #337

Open
fedarko opened this issue Aug 18, 2020 · 6 comments
fedarko commented Aug 18, 2020

@kwcantrell brought this up in this morning's meeting. Currently sample metadata is stored as follows in the HTML (this is from the moving pictures dataset, formatting modified for ease of reading):

["barcode-sequence", "body-site", "year", "month", "day", "subject", "reported-antibiotic-usage", "days-since-experiment-start"],
[["AGTGCGATGCGT", "gut", "2009.0", "3.0", "17.0", "subject-1", "No", "140.0"],
 ["ATGGCAGCTCTA", "gut", "2008.0", "10.0", "28.0", "subject-2", "Yes", "0.0"],
 ["CTGAGATACGCG", "gut", "2009.0", "1.0", "20.0", "subject-2", "No", "84.0"],
 ["CCGACTGAGATG", "gut", "2009.0", "3.0", "17.0", "subject-2", "No", "140.0"],
 ["CCTCTCGTGATC", "gut", "2009.0", "4.0", "14.0", "subject-2", "No", "168.0"],
 ["ACACACTATGGC", "gut", "2009.0", "1.0", "20.0", "subject-1", "No", "84.0"],
 ["ACTACGTGTGGT", "gut", "2009.0", "2.0", "17.0", "subject-1", "No", "112.0"],
 ["AGCTGACTAGTC", "gut", "2008.0", "10.0", "28.0", "subject-1", "Yes", "0.0"],
 ["ACGATGCGACCA", "left palm", "2009.0", "1.0", "20.0", "subject-1", "No", "84.0"],
 ["AGCTATCCACGA", "left palm", "2009.0", "2.0", "17.0", "subject-1", "No", "112.0"],
 ["ATGCAGCTCAGT", "left palm", "2009.0", "3.0", "17.0", "subject-1", "No", "140.0"],
 ["CACGTGACATGT", "left palm", "2009.0", "4.0", "14.0", "subject-1", "No", "168.0"],
 ["CATATCGCAGTT", "left palm", "2008.0", "10.0", "28.0", "subject-2", "Yes", "0.0"],
 ["CGTGCATTATCA", "left palm", "2009.0", "1.0", "20.0", "subject-2", "No", "84.0"],
 ["CTAACGCAGTCA", "left palm", "2009.0", "3.0", "17.0", "subject-2", "No", "140.0"],
 ["CTCAATGACTCA", "left palm", "2009.0", "4.0", "14.0", "subject-2", "No", "168.0"],
 ["ACAGTTGCGCGA", "right palm", "2008.0", "10.0", "28.0", "subject-1", "Yes", "0.0"],
 ["CACGACAGGCTA", "right palm", "2009.0", "1.0", "20.0", "subject-1", "No", "84.0"],
 ["AGTGTCACGGTG", "right palm", "2009.0", "2.0", "17.0", "subject-1", "No", "112.0"],
 ["CAAGTGAGAGAG", "right palm", "2009.0", "3.0", "17.0", "subject-1", "No", "140.0"],
 ["CATCGTATCAAC", "right palm", "2009.0", "4.0", "14.0", "subject-1", "No", "168.0"],
 ["ATCGATCTGTGG", "right palm", "2008.0", "10.0", "28.0", "subject-2", "Yes", "0.0"],
 ["GCGTTACACACA", "right palm", "2009.0", "3.0", "17.0", "subject-2", "No", "140.0"],
 ["GAACTGTATCTC", "right palm", "2009.0", "4.0", "14.0", "subject-2", "No", "168.0"],
 ["CTCGTGGAGTAG", "right palm", "2009.0", "1.0", "20.0", "subject-2", "No", "84.0"],
 ["CAGTGTCAGGAC", "tongue", "2008.0", "10.0", "28.0", "subject-1", "Yes", "0.0"],
 ["ATCTTAGACTGC", "tongue", "2009.0", "1.0", "20.0", "subject-1", "No", "84.0"],
 ["CAGACATTGCGT", "tongue", "2009.0", "2.0", "17.0", "subject-1", "No", "112.0"],
 ["CGATGCACCAGA", "tongue", "2009.0", "3.0", "17.0", "subject-1", "No", "140.0"],
 ["CTAGAGACTCTT", "tongue", "2009.0", "4.0", "14.0", "subject-1", "No", "168.0"],
 ["CTGGACTCATAG", "tongue", "2008.0", "10.0", "28.0", "subject-2", "Yes", "0.0"],
 ["GAGGCTCATCAT", "tongue", "2009.0", "1.0", "20.0", "subject-2", "No", "84.0"],
 ["GATACGTCCTGA", "tongue", "2009.0", "3.0", "17.0", "subject-2", "No", "140.0"],
 ["GATTAGCACTCT", "tongue", "2009.0", "4.0", "14.0", "subject-2", "No", "168.0"]]

It would be useful to identify common string values in the metadata, map these to unique integers, and then replace these values in the metadata. Then, when looking up sample metadata info (e.g. from the BIOM table), integer values would be replaced with their original string values. (This'd work because all metadata is stored as strings in Empress right now.)

An example of what this might look like, by just replacing ten nonunique values I arbitrarily picked:

{"gut": 0, "left palm": 1, "right palm": 2, "tongue": 3, "2008.0": 4, "2009.0": 5, "subject-1": 6, "subject-2": 7, "No": 8, "Yes": 9},
["barcode-sequence", "body-site", "year", "month", "day", "subject", "reported-antibiotic-usage", "days-since-experiment-start"],
[["AGTGCGATGCGT", 0, 5, "3.0", "17.0", 6, 8, "140.0"],
 ["ATGGCAGCTCTA", 0, 4, "10.0", "28.0", 7, 9, "0.0"],
 ["CTGAGATACGCG", 0, 5, "1.0", "20.0", 7, 8, "84.0"],
 ["CCGACTGAGATG", 0, 5, "3.0", "17.0", 7, 8, "140.0"],
 ["CCTCTCGTGATC", 0, 5, "4.0", "14.0", 7, 8, "168.0"],
 ["ACACACTATGGC", 0, 5, "1.0", "20.0", 6, 8, "84.0"],
 ["ACTACGTGTGGT", 0, 5, "2.0", "17.0", 6, 8, "112.0"],
 ["AGCTGACTAGTC", 0, 4, "10.0", "28.0", 6, 9, "0.0"],
 ["ACGATGCGACCA", 1, 5, "1.0", "20.0", 6, 8, "84.0"],
 ["AGCTATCCACGA", 1, 5, "2.0", "17.0", 6, 8, "112.0"],
 ["ATGCAGCTCAGT", 1, 5, "3.0", "17.0", 6, 8, "140.0"],
 ["CACGTGACATGT", 1, 5, "4.0", "14.0", 6, 8, "168.0"],
 ["CATATCGCAGTT", 1, 4, "10.0", "28.0", 7, 9, "0.0"],
 ["CGTGCATTATCA", 1, 5, "1.0", "20.0", 7, 8, "84.0"],
 ["CTAACGCAGTCA", 1, 5, "3.0", "17.0", 7, 8, "140.0"],
 ["CTCAATGACTCA", 1, 5, "4.0", "14.0", 7, 8, "168.0"],
 ["ACAGTTGCGCGA", 2, 4, "10.0", "28.0", 6, 9, "0.0"],
 ["CACGACAGGCTA", 2, 5, "1.0", "20.0", 6, 8, "84.0"],
 ["AGTGTCACGGTG", 2, 5, "2.0", "17.0", 6, 8, "112.0"],
 ["CAAGTGAGAGAG", 2, 5, "3.0", "17.0", 6, 8, "140.0"],
 ["CATCGTATCAAC", 2, 5, "4.0", "14.0", 6, 8, "168.0"],
 ["ATCGATCTGTGG", 2, 4, "10.0", "28.0", 7, 9, "0.0"],
 ["GCGTTACACACA", 2, 5, "3.0", "17.0", 7, 8, "140.0"],
 ["GAACTGTATCTC", 2, 5, "4.0", "14.0", 7, 8, "168.0"],
 ["CTCGTGGAGTAG", 2, 5, "1.0", "20.0", 7, 8, "84.0"],
 ["CAGTGTCAGGAC", 3, 4, "10.0", "28.0", 6, 9, "0.0"],
 ["ATCTTAGACTGC", 3, 5, "1.0", "20.0", 6, 8, "84.0"],
 ["CAGACATTGCGT", 3, 5, "2.0", "17.0", 6, 8, "112.0"],
 ["CGATGCACCAGA", 3, 5, "3.0", "17.0", 6, 8, "140.0"],
 ["CTAGAGACTCTT", 3, 5, "4.0", "14.0", 6, 8, "168.0"],
 ["CTGGACTCATAG", 3, 4, "10.0", "28.0", 7, 9, "0.0"],
 ["GAGGCTCATCAT", 3, 5, "1.0", "20.0", 7, 8, "84.0"],
 ["GATACGTCCTGA", 3, 5, "3.0", "17.0", 7, 8, "140.0"],
 ["GATTAGCACTCT", 3, 5, "4.0", "14.0", 7, 8, "168.0"]]

The metadata looks a lot smaller (and I haven't even replaced nonunique stuff in the month/day/etc. fields). For massive datasets with lots of metadata (e.g. the EMP) this could be really useful.
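A minimal sketch of this idea (the function and variable names here are hypothetical, not the actual Empress code): count every value across all fields, assign a unique integer to each value that occurs more than once, and substitute the integers into the rows:

```python
from collections import Counter

def encode_nonunique(rows):
    # Count every value across all fields of all rows
    counts = Counter(v for row in rows for v in row)
    # Assign a unique integer to each value occurring more than once
    mapping = {}
    for value, count in counts.items():
        if count > 1:
            mapping[value] = len(mapping)
    # Replace non-unique values with their integer; leave unique
    # values (e.g. barcode sequences) as strings
    encoded = [[mapping.get(v, v) for v in row] for row in rows]
    return mapping, encoded

# Toy subset of the moving pictures metadata above
rows = [
    ["AGTGCGATGCGT", "gut", "2009.0", "subject-1", "No"],
    ["ATGGCAGCTCTA", "gut", "2008.0", "subject-2", "Yes"],
    ["CTGAGATACGCG", "gut", "2009.0", "subject-2", "No"],
]
mapping, encoded = encode_nonunique(rows)
# "gut" gets an integer code; the unique barcodes stay as strings
```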

kwcantrell commented Aug 19, 2020

This kind of encoding will be greatly beneficial for taxonomy since a lot of nodes will share level 1/2/... values.

It might be better to reverse
{"gut": 0, "left palm": 1, "right palm": 2, "tongue": 3, "2008.0": 4, "2009.0": 5, "subject-1": 6, "subject-2": 7, "No": 8, "Yes": 9}

and make it
{0: "gut", 1: "left palm", 2: "right palm", 3: "tongue", 4: "2008.0", 5: "2009.0", 6: "subject-1", 7: "subject-2", 8: "No", 9: "Yes"}

That way, converting from a number back to its value would be easier.

Also, do you think it would be worthwhile to have some condition such as if > 20% of values in a metadata field share a value then encode the field using the above suggested method? The reason being, for example, the majority (or all?) features have a unique barcode-sequence in the above example so encoding them will add additional memory since we will still need to store all the sequence + add an additional number to represent all the sequence.

fedarko commented Aug 19, 2020

Agreed that this'll help a lot. I think we could go even a bit further and just store the encodings in an array, similar to how we handled treeData:

["gut", "left palm", "right palm", "tongue", "2008.0", "2009.0", "subject-1", "subject-2", "No", "Yes"]

... which can be accessed the same way (0 -> "gut", etc.). As an added bonus, if we set the encodings so that the first element is the most common value, the second is the next most common value, and so on, then it'll be really easy to interpret this line in the HTML and see which strings are most common throughout the metadata.
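The frequency-ordered array idea could be sketched like this (hypothetical helper name, not the actual implementation; assumes every value gets encoded):

```python
from collections import Counter

def encode_as_array(rows):
    counts = Counter(v for row in rows for v in row)
    # most_common() sorts by descending count, so values[0] is the
    # most frequent string anywhere in the metadata
    values = [v for v, _ in counts.most_common()]
    index = {v: i for i, v in enumerate(values)}
    encoded = [[index[v] for v in row] for row in rows]
    return values, encoded

values, encoded = encode_as_array(
    [["gut", "No"], ["gut", "Yes"], ["gut", "No"]]
)
# Decoding an integer i back to its string is just values[i]
```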

> Also, do you think it would be worthwhile to have some condition such as if > 20% of values in a metadata field share a value then encode the field using the above suggested method? The reason being, for example, the majority (or all?) features have a unique barcode-sequence in the above example so encoding them will add additional memory since we will still need to store all the sequence + add an additional number to represent all the sequence.

This is a good question - there's probably a fancy way of determining if it's "worth" encoding a given value or not (probably involves looking at the value's length, and how many times it occurs, versus the added memory / space to encode it). As a first pass, though, I think it makes sense to just encode all values that occur, say, more than once (or maybe more than twice).
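One rough way to formalize "worth it" (a back-of-envelope assumption about serialized size, not a measured model): compare the characters saved by replacing each occurrence with a short integer against the cost of still storing the value once in the encoding structure:

```python
def worth_encoding(value, occurrences, index_digits=2):
    # Each replaced occurrence drops the quoted string (len(value)
    # plus 2 quote characters) in favor of a small integer literal
    saved = occurrences * (len(value) + 2 - index_digits)
    # But the value itself still has to be stored once, in the
    # encoding array/dict
    overhead = len(value) + 2
    return saved > overhead

# "left palm" appearing 8 times is clearly worth encoding; a unique
# 12-character barcode appearing once is not.
```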

> in a metadata field

I'm not confident about this, but I think it makes sense to allow "duplicates" across metadata fields. A reason being that if, say, a metadata file has like 5 different Year fields, then it makes sense to lump those all together.

I think this is going to be pretty useful :)

kwcantrell commented Aug 19, 2020

> Agreed that this'll help a lot. I think we could go even a bit further and just store the encodings in an array, similar to how we handled treeData:
>
> ["gut", "left palm", "right palm", "tongue", "2008.0", "2009.0", "subject-1", "subject-2", "No", "Yes"]

I like this solution.

> I'm not confident about this, but I think it makes sense to allow "duplicates" across metadata fields. A reason being that if, say, a metadata file has like 5 different Year fields, then it makes sense to lump those all together.

That's a good point. It would make a lot more sense to encode, for example, "gut" the same across all metadata fields.

> This is a good question - there's probably a fancy way of determining if it's "worth" encoding a given value or not (probably involves looking at the value's length, and how many times it occurs, versus the added memory / space to encode it). As a first pass, though, I think it makes sense to just encode all values that occur, say, more than once (or maybe more than twice).

In that case, maybe as a first pass, we should just encode all values regardless of their occurrence.

ElDeveloper commented Aug 19, 2020 via email

kwcantrell commented:
@ElDeveloper that is probably the better solution, because we wouldn't have to refactor the JS. All we would need to do is implement the compression/decompression.

fedarko added a commit to fedarko/empress that referenced this issue Aug 30, 2020
fedarko commented Aug 30, 2020

I took a bit of time and put together an early version of the python compression code for this. Here's what the code produces on the moving pictures sample metadata, with every non-unique value compressed (compare with the stuff above on the same dataset):

["barcode-sequence", "body-site", "year", "month", "day", "subject", "reported-antibiotic-usage",
 "days-since-experiment-start"],
["2009.0", "No", "subject-1", "subject-2", "17.0", "right palm", "tongue", "gut", "3.0", "140.0",
 "1.0", "20.0", "84.0", "left palm", "2008.0", "10.0", "28.0", "Yes", "0.0", "4.0", "14.0",
 "168.0", "2.0", "112.0"],
[["AGTGCGATGCGT", 7, 0, 8, 4, 2, 1, 9],
 ["ATGGCAGCTCTA", 7, 14, 15, 16, 3, 17, 18],
 ["CTGAGATACGCG", 7, 0, 10, 11, 3, 1, 12],
 ["CCGACTGAGATG", 7, 0, 8, 4, 3, 1, 9],
 ["CCTCTCGTGATC", 7, 0, 19, 20, 3, 1, 21],
 ["ACACACTATGGC", 7, 0, 10, 11, 2, 1, 12],
 ["ACTACGTGTGGT", 7, 0, 22, 4, 2, 1, 23],
 ["AGCTGACTAGTC", 7, 14, 15, 16, 2, 17, 18],
 ["ACGATGCGACCA", 13, 0, 10, 11, 2, 1, 12],
 ["AGCTATCCACGA", 13, 0, 22, 4, 2, 1, 23],
 ["ATGCAGCTCAGT", 13, 0, 8, 4, 2, 1, 9],
 ["CACGTGACATGT", 13, 0, 19, 20, 2, 1, 21],
 ["CATATCGCAGTT", 13, 14, 15, 16, 3, 17, 18],
 ["CGTGCATTATCA", 13, 0, 10, 11, 3, 1, 12],
 ["CTAACGCAGTCA", 13, 0, 8, 4, 3, 1, 9],
 ["CTCAATGACTCA", 13, 0, 19, 20, 3, 1, 21],
 ["ACAGTTGCGCGA", 5, 14, 15, 16, 2, 17, 18],
 ["CACGACAGGCTA", 5, 0, 10, 11, 2, 1, 12],
 ["AGTGTCACGGTG", 5, 0, 22, 4, 2, 1, 23],
 ["CAAGTGAGAGAG", 5, 0, 8, 4, 2, 1, 9],
 ["CATCGTATCAAC", 5, 0, 19, 20, 2, 1, 21],
 ["ATCGATCTGTGG", 5, 14, 15, 16, 3, 17, 18],
 ["GCGTTACACACA", 5, 0, 8, 4, 3, 1, 9],
 ["GAACTGTATCTC", 5, 0, 19, 20, 3, 1, 21],
 ["CTCGTGGAGTAG", 5, 0, 10, 11, 3, 1, 12],
 ["CAGTGTCAGGAC", 6, 14, 15, 16, 2, 17, 18],
 ["ATCTTAGACTGC", 6, 0, 10, 11, 2, 1, 12],
 ["CAGACATTGCGT", 6, 0, 22, 4, 2, 1, 23],
 ["CGATGCACCAGA", 6, 0, 8, 4, 2, 1, 9],
 ["CTAGAGACTCTT", 6, 0, 19, 20, 2, 1, 21],
 ["CTGGACTCATAG", 6, 14, 15, 16, 3, 17, 18],
 ["GAGGCTCATCAT", 6, 0, 10, 11, 3, 1, 12],
 ["GATACGTCCTGA", 6, 0, 8, 4, 3, 1, 9],
 ["GATTAGCACTCT", 6, 0, 19, 20, 3, 1, 21]]

Even on this small dataset, the space saving is pretty clear -- ls -ahlt puts the sample metadata info above at 1.8K and the old sample metadata info at 2.8K. For the EMP empress.html file (without feature metadata since I don't have it), this takes it down from 119 MB to 108 MB.

I think this method may have some merit besides (or in addition to) using zip data compression; with that, as far as I can tell, we'd still need to uncompress the original data -- which would involve loading a lot of redundant strings. Here, even the uncompressed data takes up less space, since it's mostly numbers. (Also, while implementing zip.js might take some careful thinking and refactoring, fumbling my way through this solution has gone pretty quickly ...)
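For completeness, the decompression side is just an index lookup; here's a Python sketch (the real decompression would live in the JS, and this helper name is hypothetical):

```python
def decompress(values, encoded_rows):
    # Integers index into the values array; strings were left
    # uncompressed and pass through unchanged
    return [
        [values[v] if isinstance(v, int) else v for v in row]
        for row in encoded_rows
    ]

row = decompress(["gut", "No"], [["AGTGCGATGCGT", 0, 1, "3.0"]])[0]
```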

As @kwcantrell mentioned, this approach should be applicable to feature metadata as well (and would likely be even more useful there, since there are gonna be lots of "k__Bacteria"s and so on).

fedarko added a commit to fedarko/empress that referenced this issue Aug 30, 2020
work on biocore#355, which i think makes sense to bundle with biocore#337
fedarko added a commit to fedarko/empress that referenced this issue Aug 31, 2020
JS and tests haven't been updated, but this seems to be working
properly. already seems to be saving a lot of space.
fedarko added a commit to fedarko/empress that referenced this issue Aug 31, 2020
fedarko added a commit to fedarko/empress that referenced this issue Aug 31, 2020
fedarko added a commit to fedarko/empress that referenced this issue Oct 16, 2020
This works now! Well, kinda. The tests are still broken, and the JS
code still doesn't uncompress the sample metadata. But you can at
least generate QZVs now!
fedarko added a commit to fedarko/empress that referenced this issue Oct 16, 2020
Just gotta fix, you know, the other 12 failing ones :P
fedarko self-assigned this Oct 29, 2020