Option to prune instead of GC when StorageMax is reached #5515

Open
rotemdan opened this issue Sep 24, 2018 · 8 comments
Labels
kind/enhancement A net-new feature or improvement to an existing feature

Comments

@rotemdan

Currently, the only available approach to dealing with overgrowth of the repository is basically to delete everything that's not pinned.

If we borrow traditional notions from the HTTP world, where pinned = local content and unpinned = cached proxied content, that approach might be considered extreme and inappropriate.

It would seem more natural to have a setting to only delete the least recently used unpinned content.

I understand that the IPFS datastore may be significantly more complex and fragmented than something like, say, the NGINX cache or (on the client side) the Firefox/Chrome cache store; however, unconditionally erasing everything is a very crude approach compared to how those systems operate.
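
To make the suggestion concrete, here is a minimal sketch of what that could look like (the types and names are made up for illustration, not the actual go-ipfs repo API): when usage crosses StorageMax, evict only the least-recently-used unpinned blocks until usage drops below a low-water mark, instead of deleting every unpinned block.

```go
package prune

import "sort"

// BlockInfo is a hypothetical record the repo would keep per block;
// it is not the actual go-ipfs blockstore API.
type BlockInfo struct {
	CID        string
	Size       int64
	Pinned     bool
	LastAccess int64 // Unix time of the last read
}

// PruneLRU picks the least-recently-used unpinned blocks for deletion until
// total usage falls below lowWater, rather than deleting every unpinned block.
func PruneLRU(blocks []BlockInfo, usage, lowWater int64) []string {
	// Only unpinned blocks are eviction candidates; pinned content is never touched.
	var candidates []BlockInfo
	for _, b := range blocks {
		if !b.Pinned {
			candidates = append(candidates, b)
		}
	}
	// Oldest access first.
	sort.Slice(candidates, func(i, j int) bool {
		return candidates[i].LastAccess < candidates[j].LastAccess
	})

	var victims []string
	for _, b := range candidates {
		if usage <= lowWater {
			break
		}
		victims = append(victims, b.CID)
		usage -= b.Size
	}
	return victims
}
```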

In relation to #3679, I would not consider this a duplicate, since I think #3679 still perpetuates the misguided idea that cached proxied content = garbage (why?).

@Stebalien added the kind/enhancement label Sep 24, 2018
@Stebalien
Member

Ideally, we'd keep some form of frecency metric and prune things we don't use often. However, tracking this becomes really tricky. This will take quite a bit of design work.
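
Purely as an illustration of the idea (not a committed design; the half-life is an assumed tuning knob), a frecency score can be as simple as an access counter that decays exponentially with wall-clock time, so both how often and how recently a block was used feed into one number:

```go
package frecency

import (
	"math"
	"time"
)

// halfLife is an assumed tuning parameter: after this much idle time a
// block's score halves.
const halfLife = 7 * 24 * time.Hour

// Score is a per-block frecency value: an access count that decays
// exponentially as time passes without accesses.
type Score struct {
	value   float64
	updated time.Time
}

// decayTo folds the time elapsed since the last update into the value.
func (s *Score) decayTo(now time.Time) {
	if !s.updated.IsZero() {
		elapsed := now.Sub(s.updated)
		s.value *= math.Exp2(-float64(elapsed) / float64(halfLife))
	}
	s.updated = now
}

// Touch records one access at time now.
func (s *Score) Touch(now time.Time) {
	s.decayTo(now)
	s.value++
}

// Value returns the current score; the lowest-scoring unpinned blocks
// would be pruned first.
func (s *Score) Value(now time.Time) float64 {
	s.decayTo(now)
	return s.value
}
```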

@Kubuxu
Member

Kubuxu commented Sep 26, 2018

My (very old) note on frequency-based, probabilistic GC: ipfs/notes#130

@rotemdan
Author

rotemdan commented Sep 26, 2018

I think that a frequency-only metric might be sub-optimal (though better than nothing, I guess), because recently fetched data would most likely have a very low frequency count, and yet it should definitely not be deleted.

Usually a combination of recency-frequency is used.

The NGINX cache uses only recency (the file's last access time), I believe. Firefox uses both recency and frequency (the metric is quite complicated and documented here, though I'm not sure if it's used only in the awesomebar or influences cache purging as well).

I think that having true cache management could have a dramatic positive influence on the network:

  • Clients would retain and share data they value the most.
  • Gateways would automatically retain popular content, thus highly requested data would be available possibly indefinitely on IPFS.
  • CDN clusters for particular sets of content could be easily built over IPFS itself (combined with gateways as HTTP portals, as I described in #5513, "Option to restrict gateway to only serve locally available / cluster content"), where the cache system would naturally replicate popular content across the cluster nodes (based on demand through the gateways).

So the entire network could better "self-organize" itself just by having a more effective caching metric.

Edit: of course, the gateways might use a higher layer of caching (HTTP-based), so neither metric (recency/frequency) would really apply very effectively. I would assume there would be a way around that, say, "artificially" notifying the IPFS daemon whenever a piece of data has been requested, so it would update its frequency/recency metric.
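
Something like the following is what I have in mind (entirely hypothetical names; the daemon exposes no such hook today): the HTTP-level cache serves the response itself but still reports the hit, so the block-level recency/frequency metric keeps seeing the demand.

```go
package gateway

// AccessRecorder is a hypothetical hook the daemon could expose so that
// higher-level caches (e.g. an HTTP cache in front of a gateway) can report
// that content was served from their own cache.
type AccessRecorder interface {
	RecordAccess(cid string)
}

// serveFromHTTPCache sketches a gateway handler path that answers from the
// HTTP cache but still notifies the daemon, keeping the block-level
// frequency/recency metric up to date.
func serveFromHTTPCache(rec AccessRecorder, cid string, cachedBody []byte) []byte {
	rec.RecordAccess(cid) // bump the metric even though no block was read from the repo
	return cachedBody
}
```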

@dbaarda

dbaarda commented Sep 26, 2018

There are only two hard things in computer science: cache invalidation, naming things, and off-by-one errors. There is a good summary of cache invalidation algorithms here:

http://u.cs.biu.ac.il/~wiseman/2os/2os/os2.pdf

Note that it doesn't mention that LRFU is probably the best-performing algorithm (in terms of hit rate), and it favors ARC because of its O(1) implementation. However, for large caches where the miss cost is very high (i.e., a network fetch), spending a bit more on your cache algorithm for better hit rates is worth it. Memory cache locality probably also means O(log N) binheaps are not that much worse than O(1) dlists anyway, unless N gets really huge.

The LRFU authors also didn't seem to understand exactly what they were inventing, and they missed some simple things that would make it perform even better (additional metadata for entries not in the cache, like ARC keeps) and make the implementation more efficient (no atime per entry and a simpler decay calculation). They also didn't realize that their metric is simply a decaying count of the number of accesses over the past T (the time constant) cache lookups, just that it was a "combination of recency and frequency". I've put some thoughts on this here:

http://minkirri.apana.org.au/wiki/DecayingLFUCacheExpiry

And I've started doing some simulation/performance testing on GitHub here:

https://github.com/dbaarda/DLFUCache

But I haven't yet generated the graphs and written up a summary of the results. I also want to throw in an implementation of ARC for comparison.

I'd be interested in having a crack at doing a proper implementation of this in Go, but I don't have a lot of time to dedicate to it.
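
A minimal sketch of the core metric as I understand it (placeholder names; the periodic rescaling detail is glossed over): one float per entry plus a global increment that grows by a factor of (1 + 1/T) per lookup, which is equivalent to exponentially decaying every count with time constant T lookups, but only ever touches the entry actually accessed.

```go
package dlfu

// DLFU keeps an exponentially decaying access count per entry. Rather than
// decaying every count on each lookup (O(N)), the increment added on an
// access grows by (1 + 1/T) per lookup; dividing a stored count by the
// current increment gives its decayed value, with time constant T lookups.
type DLFU struct {
	counts    map[string]float64
	increment float64
	timeConst float64 // T, measured in cache lookups
}

func New(timeConst float64) *DLFU {
	return &DLFU{
		counts:    make(map[string]float64),
		increment: 1.0,
		timeConst: timeConst,
	}
}

// Access records one lookup of key.
func (d *DLFU) Access(key string) {
	d.counts[key] += d.increment
	d.increment *= 1 + 1/d.timeConst
	// A real implementation would periodically rescale all counts and the
	// increment together so the increment doesn't overflow.
}

// Score returns the decayed access count; the lowest scores are the best
// eviction/GC candidates.
func (d *DLFU) Score(key string) float64 {
	return d.counts[key] / d.increment
}
```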

@dbaarda

dbaarda commented Sep 27, 2018

Just had a look at ipfs/notes#130.

Interesting. It's a way to keep even less metadata per entry by combining entries into a bloom filter. It also uses an access count per bloom-filter bucket, which could be improved by making it an exponentially decaying count so that it also takes recency into account. I see that in that thread you proposed exactly that, and even the idea of increasing the increment instead of decaying the counts (though exponentially growing the increment means you effectively get an exponentially decaying count, which is easier to reason about).

There is a risk of some unpopular entries "getting a free ride" by falling into the same bloom-filter bucket as a popular entry, but picking a different nonce per GC cycle means they should get flushed out eventually.
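
A rough sketch of that combination (the layout and the 0.5 decay factor are my own assumptions, not what ipfs/notes#130 specifies): counts live in a small array of buckets indexed by a hash of the CID mixed with the cycle's nonce, and the previous cycle's counts are kept but down-weighted, so a free rider loses its popular bucket-mate when the nonce rotates.

```go
package gcsketch

import "hash/fnv"

// gen is one generation of a one-hash counting filter, built with its own nonce.
type gen struct {
	buckets []float64
	nonce   uint64
}

func (g *gen) index(cid string) int {
	h := fnv.New64a()
	h.Write([]byte(cid))
	return int((h.Sum64() ^ g.nonce) % uint64(len(g.buckets)))
}

// Sketch approximates per-block access counts across GC cycles. Each cycle
// gets a fresh nonce, so two CIDs sharing a bucket in one cycle are unlikely
// to share one in the next; the previous cycle's counts are down-weighted,
// which folds a rough notion of recency into the score.
type Sketch struct {
	cur, prev *gen
	size      int
}

func NewSketch(size int, nonce uint64) *Sketch {
	return &Sketch{
		cur:  &gen{buckets: make([]float64, size), nonce: nonce},
		prev: &gen{buckets: make([]float64, size)},
		size: size,
	}
}

// Access bumps the current-cycle bucket for cid.
func (s *Sketch) Access(cid string) { s.cur.buckets[s.cur.index(cid)]++ }

// Score is the current count plus a decayed contribution from the last cycle.
func (s *Sketch) Score(cid string) float64 {
	return s.cur.buckets[s.cur.index(cid)] + 0.5*s.prev.buckets[s.prev.index(cid)]
}

// NextCycle retires the current generation and starts a new one with a new nonce.
func (s *Sketch) NextCycle(nonce uint64) {
	s.prev = s.cur
	s.cur = &gen{buckets: make([]float64, s.size), nonce: nonce}
}
```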

What kind of magnitudes are we talking about for the number of blocks/entries in the cache? Is the bloom filter to reduce the metadata size even necessary?

@Stebalien
Member

Some users will have a billion or more blocks (although I'm not sure if we support this yet). My laptop has ~400 thousand blocks at the moment, and I'm not storing any large datasets.

@Kubuxu
Member

Kubuxu commented Sep 27, 2018

As a rough estimate, 10 TB of data would be about 50 million objects in the blockstore.

@Stebalien
Member

50 million objects assuming 256KiB chunks. In reality, large datasets often have a bunch of tiny files, unfortunately.
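
For reference, the rough arithmetic behind those figures: 10 TB ≈ 10^13 bytes, and 256 KiB = 262,144 bytes, so 10^13 / 262,144 ≈ 38 million leaf blocks; adding intermediate DAG nodes brings the total into the 40–50 million range, and smaller average block sizes (lots of tiny files) push it up proportionally.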
