-
-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Option to prune instead of GC when StorageMax is reached #5515
Comments
Ideally, we'd keep some form of frecency metric and prune things we don't use often. However, tracking this becomes really tricky. This will take quite a bit of design work. |
My (very old) note on frequency based, probabilistic GC: ipfs/notes#130 |
I think that a frequency-only metric might be sub-optimal (but better than nothing I guess), because recently-fetched data would most likely have a very low frequency count, and yet, it should definitely not be deleted. Usually a combination of recency-frequency is used. NGINX cache uses only recency (file's last access time), I believe. Firefox uses both recency and frequency (the metric is quite complicated and documented here, though I'm not sure if its used only in the awsomebar or influences cache-purging as well). I think that having true cache management could have a dramatic positive influence on the network:
So the entire network could better "self-organize" itself just by having a more effective caching metric. Edit: of course the gateways might use a higher layer of caching (http based) so either metric (recency/frequency) wouldn't really apply very effectively. I would assume there would be a way around that - say - "artificially" notifying the IPFS daemon whenever a piece of data has been requested so it would update its frequency/recency metric. |
There are only two hard things in computer science; cache invalidation, naming things, and off-by-one errors. There is a good summary of cache invalidation algorithms here; http://u.cs.biu.ac.il/~wiseman/2os/2os/os2.pdf Note that it doesn't mention that LRFU is probably the best performing (in terms of hitrate) algorithm, and favors ARC because of it's O(1) implementation. However, for large caches where the miss cost is very high (ie, network fetch) spending a bit more on your cache algorithm for better hitrates is worth it. Memory cache locality probably also means O(lnN) binheaps are not that much worse than O(1) dlists anyway unless N gets really huge. The LRFU authors also didn't seem to understand exactly what they were inventing, and they missed some simple things to make it perform even better (additional metadata for entries not in the cache like ARC), and make the implementation more efficient (no atime per entry and simpler decay calculation). They also didn't realize their metric was simply a decaying count of the number of accesses in the past T timeconstant cache lookups, just that it was a "combination of recency-frequency". I've put some thoughts on this here; http://minkirri.apana.org.au/wiki/DecayingLFUCacheExpiry And I've started doing some simulation/performance testing on github here; https://github.com/dbaarda/DLFUCache But I haven't yet generated the graphs and written up a summary of the results. I also want to throw in an implementation of ARC for comparison. I'd be interested in having a crack at doing a proper implementation of this in go, but I don't have a lot of time to dedicate to it. |
Just had a look at : ipfs/notes#130 Interesting. Its a way to keep even less metadata per entry by combining them into a bloom filter. It also uses an access count per bloom-filter-entry that could be improved by making it an exponentially decaying count so that it also takes into account recency. I see in that thread you proposed exactly that, and even the idea of increasing the increment instead of decaying the counts (though exponentially growing the increment means you get an exponentially decaying count which is easier to rationalize about). There is a risk of some unpopular entries "getting a free ride" by falling into the same bloom-filter bucket with a popular entry, but picking a different nonce per GC cycle means they should get flushed out eventually. What kind of magnitudes for the number of blocks/entries in the cache are we talking? Is the bloom filter to reduce the metadata size even necessary? |
Some users will have a billion or more blocks (although I'm not sure if we support this yet). My laptop has ~400 thousand at the moment and I'm not storing any large datasets. |
As a rough estimate 10TB of data would be about 50 million objects in blockstore. |
50 million objects assuming 256KiB chunks. In reality, large datasets often have a bunch of tiny files, unfortunately. |
Currently, the only available approach to deal with overgrowth of the repository is to basically delete everything that's not pinned.
If we equate with traditional notions from the HTTP world where basically
pinned = local
andunpinned = cached proxied content
that approach might be considered extreme and inappropriate.It would seem more natural to have a setting to only delete the least recently used unpinned content.
I understand that the IPFS datastore may be significantly more complex and fragmented than something like, say, the NGINX or, (on the client side) Firefox/Chrome cache store, however, unconditionally erasing everything is a very crude approach compared to how these systems operate.
In relation to #3679, I would not consider this a duplicate since I think #3679 still perpetuates the misguided idea that
cached proxied content = garbage
(why?).The text was updated successfully, but these errors were encountered: