Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IPFS Garbage Collection v2 #8

Open
wants to merge 4 commits into
base: main
Choose a base branch
from
Open

Conversation

gammazero
Copy link

No description provided.

Copy link
Contributor

@momack2 momack2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Let’s do an estimate on additional incurred pinning cost for a large datastore like pinata

proposals/new-ipfs-gc.md Show resolved Hide resolved
- Ability for GC to remove content identified by CID
- New `--cid` option: `ipfs repo gc --cid=<cid>`

Removing pins will be slower because it requires walking DAG to decrement reference counts. Adding items to MFS and removing items from MFS will be slower due to needing to update reference counts. Hopefully any increase in execution time will be insignificant.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like we should use piñata as an example here and estimate what the impact would be (since they add pins all the time, but don’t GC as often now)

Copy link

@jnthnvctr jnthnvctr Feb 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I caught up with Matt to try and grab some numbers here:
https://www.notion.so/Pinata-5157fa6d2a4741a593bb8ff32bfdb08e

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@momack2 I will add some rough estimates by timing some things that I think will be similar in time to the new GC, and comparing with the old GC.

@jnthnvctr - I added a link to this in the doc, and brought some of the content into the doc. Thank you!

proposals/new-ipfs-gc.md Show resolved Hide resolved
@BigLep
Copy link
Contributor

BigLep commented Feb 23, 2021

Associating a solution proposal: https://hackmd.io/bgU7ti6_THK5gAGcrAhoEw?both

BigLep
BigLep previously approved these changes Feb 24, 2021
Copy link
Contributor

@BigLep BigLep left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any metrics around this so we can show the before and after effect? I assume for each GC run that we'd want to emit:

  1. How long it took
  2. How many objects we traversed
  3. How many were gc'd
    (I assume there is really good insight from languages that do auto GC that we could draw from for this like the JVM)

If we need to implement some metrics, should we maybe implement it first to capture the current state on PL gateways and then after we do the improvements be able to show the impact.


## Purpose and Impact

### Background and Intent
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think more anecdotes/justification from the conversation in https://protocollabs.slack.com/archives/G01KZD3FETY/p1613779249428300 can be added. The transitive impact that this has on NFTs is important as we don't want to hamper that usecase.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This link did not work for me, but I added it to related discussions below

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This link did not work for me, but I added it to the "Related discussions" section anyway

@BigLep BigLep self-requested a review February 24, 2021 18:18
@BigLep BigLep dismissed their stale review February 24, 2021 18:18

I accidentally approved

Copy link
Contributor

@BigLep BigLep left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for putting this together!

proposals/new-ipfs-gc.md Show resolved Hide resolved
proposals/new-ipfs-gc.md Outdated Show resolved Hide resolved
- Maintaining reference counts for blocks will make it quick to determine if a block can be removed, because it is not necessary to load all pinned CID into memory and search this.
- Specific content can be removed by performing GC on the blocks associated with a IPLD DAG identified by CID.

### What does done look like?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Related to other comments about metrics, should done also include metrics about GC time and the time it takes to pin/unpin now given the reference counting?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it makes more sense to have time metrics for pinning, because the pinning process may also need to fetch content from the network as it is being pinned.

I don't think that metrics for unpin become any more important, as this should be equivalent to walking a dag whose.

proposals/new-ipfs-gc.md Outdated Show resolved Hide resolved
Comment on lines 81 to 83
- Removing pins will be slower because it requires walking DAG to decrement reference counts.
- Adding items to MFS and removing items from MFS will be slower due to needing to update reference counts.
- Keeping reference counts will consume for storage space
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will we have metrics for all of these? I think it would be good to add the instrumentation before the improvements so we can compare before/after.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that eventually there should be metrics for this, but probably not for the initila implementation. Reference counts will be stored in a specific location and a user should be able to see how much space is used on disk in this location.

The project could fail, or fail to be deliverable in the proposed time if reference counting incurs unforseen complexity in its implementation or changes required to other subsystems. The risk of the will be mitigated by early review of reference counting design and prototyping.

### Alternatives
There does not currently appear to be alternate ways to solve these issues. There may be tools that are better suited to implementing portions of GC, such as a specific key-value store that is better suited to managing reference counts
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know we'll have folks from the Stewards team take a look at the design but are there others at PL or in the PL community that have expertise in "GC in practice"? Is there a channel we can post to see? For example, if there is anyone in our midst who worked on tuning GC at Sun/Oracle, they likely have valuable high-level guidance on doing efficient GC in practice.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have reached out to some people already, and names can be discovered by looking at previous work here: ipfs/kubo#7752

I think reposting the design on the ipfs slack channel, and inviting people to early reviews of PRs will be good.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do share the design doc in our internal #engineering channel.

@warpfork
Copy link

I wrote up some thoughts on other approaches to GC (n.b., first draft warning): https://hackmd.io/0T8z0ex8RWmI8BmFNPSafQ

The draft idea in that document is a GC approach, but not a reference counting one.

I'm fairly dubious of reference counting approaches for a disk-backed GC. Reference counting requires a strong degree of consistency, and a great deal of mutexing, which makes it both relatively fragile and high cost at runtime. Metrics evaluated on a full implementation would of course shed more light than my bets, but I'm just not optimistic on that whole angle of approach.

@warpfork
Copy link

I like that this issue also touches on the user story of wanting to actively remove specific content. However, I wonder if that's directly connected to GC. I think we could provide direct remedies for that user story without involving GC at all.

(Although on second thought, active targeted removal would actually be connecting and somewhat more difficult in a reference counting world -- because active removal would have to trigger walks to make sure orphaned subgraphs aren't left behind. Orphaned subgraphs with incorrectly undecremented refcounts which would become effectively unremediable in a refcount world.)

@gammazero
Copy link
Author

@warpfork I read your thought on an approach to GC, linked it in the pitch doc, and suggested we at least look at a prototype for comparison.

I generally agree with your feelings about reference count resource management, but an advantage with the reference counting is that it should work especially well for Pinata's use case: The only need GC for removing content that has become unpinned. The reference-counted implementation allows GC to walk the DAG for only the unpinned content and remove the associated blocks without loading all the other pinned content.

@RubenKelevra
Copy link

This is very interesting, as it's touching some points I noted a while back here.

But additionally to "just" count references my note includes tracking of accesses and access time in a compressed combined metric for blocks stored which are not pinned.

Throwing all unpinned blocks away makes sense for a provider of storage, but not on end-user clients, gateways etc. which can anticipate to a degree on the past which blocks might be useful for the future.

I haven't had time yet to work on it, sadly, because of pandemic reasons - but that's still the plan.

@gammazero
Copy link
Author

@RubenKelevra I agree it would be very useful to be able to be more discriminating about what unpinned content gets removed. That would provide cache management behavior. There are a number of complexities with efficiently maintaining and accessing LRU/LFU data, and the referencing counting approach is intended to be a hopefully simpler solution to address a specific need. That is, a way for GC to determine if a block is pinned or in MFS without having to load and search a set of all pinned blocks.

The immediate goal is to provide faster bulk GC, and more importantly, provide very fast removal of specific content (remove blocks from unpinned DAG). The key feature of ref-counting is that it gives O(blocks-in-dag) performance without having to load other content into memory. The ref-counting also does not incur any overhead for most data access, since it is only used during pinning/unpinning, MFS add/remove, and GC.

More sophisticated data eviction/retention strategies are not precluded by referencing-counting and should be explored.

@RubenKelevra
Copy link

@RubenKelevra I agree it would be very useful to be able to be more discriminating about what unpinned content gets removed. That would provide cache management behavior. There are a number of complexities with efficiently maintaining and accessing LRU/LFU data, and the referencing counting approach is intended to be a hopefully simpler solution to address a specific need. That is, a way for GC to determine if a block is pinned or in MFS without having to load and search a set of all pinned blocks.

I'm just referencing my note, since I planned to do both in one approach - since it's basically the same problem with:

  • the needs to store an intention log of pinning/unpinning operations
  • storing a database in memory with a value for each block
  • writing the database to disk in larger chunks
  • background analysis of pinning/unpinning operations
  • time based locking to avoid to slow down any operation by blocking it

The advantage of combining both cache retention strategy and reference counting is

  • lower total complexity
  • dead locks/data loss gets less likely
  • more code can be shared without refactoring
  • much lower memory consumption

The immediate goal is to provide faster bulk GC, and more importantly, provide very fast removal of specific content (remove blocks from unpinned DAG). The key feature of ref-counting is that it gives O(blocks-in-dag) performance without having to load other content into memory. The ref-counting also does not incur any overhead for most data access, since it is only used during pinning/unpinning, MFS add/remove, and GC.

I don't think bulk-GC is the way to go, but instead just have a tight limitation of the amount of stuff stored in the unpinned area, like "use only 1% of the free storage for unpinned stuff"

This allows for easier locking since we can do some assumptions which helps us to avoid having to lock the whole datastore - roughly:

  • If a block gets added we put it on a list with a timestamp.
  • If a pinning operation starts, we put it on a list of active operations with a timestamp.
  • If a block in this list has an timestamp which is older than the oldest running pinning operation -> move to cache area since it can be safely discarded.

If a pinning operation calls for a block in the cache area, just add a reference and put it in the "pinned" list - but there was always zero guarantee that a block stays in the datastore when it's not pinned. Those assumptions are safe and avoid any locks on the datastore.

So the cleanup would run continuously as background task flagging the oldest and less accessed blocks for removal, not as bulk-cleanup.

@olizilla
Copy link
Contributor

olizilla commented Mar 2, 2021

On behalf of the ipfs.io gateways I implore this endevour to consider an LRU eviction strategy as the default for the IPFS repo. Framing this work as a "cache retention strategy" as @RubenKelevra points to, rather than a "garbage collection improvement". Unpinned blocks in IPFS are not unreachable objects in the traditional GC sense of things, and are much closer to "what files should my CDN keep hot". We have limited bandwidth so should avoid scope creep, but that is also why it's even more important to get the framing of this project right. If we make GC less painful it will be a while before we get time to focus on making it what we need.

A relevant paper came up from the first ResNetLab event that is worth a read Caching the Internet: A View from a Global Multi-tenant CDN

In this paper, we explore the efficacy of popular caching policies in a large-scale, global, multi-tenant CDN. We examine the client behaviors observed in a network of over 125 high-capacity Points of Presence (PoPs). Using production data from the Edgecast CDN, we show that for such a large-scale and diverse use case, simpler caching policies dominate. We find that LRU offers the best compromise between hit-rate and disk I/O, providing 60% fewer writes than FIFO, while maintaining high hit-rates. We further observe that at disk sizes used in a large-scale CDN, LRU performs on par with complex polices like S4LRU.
https://link.springer.com/chapter/10.1007%2F978-3-030-15986-3_5

@RubenKelevra
Copy link

RubenKelevra commented Mar 2, 2021

@olizilla thanks for the paper, will check it out.

My idea is to balance access numbers with time the object stayed in the cache. In short a number will be increased on each access and decreased after a fixed duration.

This way we avoid cache trashing if many objects get accessed for a short while, but afterwards the access pattern changes back.

It's also important to have a rough idea how large a block is, so we know without accessing the datastore how far we've come to meet our target on clearing space.

Additionally it makes sense to measure the amount of bytes which gets added to the datastore and send this information to a process which can delete old cache objects from the datastore in the same rate. This allows us to maintain a constant fill level on the datastore with minimal memory usage.

The datastructure in my note would squeeze this all in a 32 bit integer as additional metadata for each checksum.

While there's probably quite some overhead to put some trees structures for finding the lowest number in the cache section and a specific hash in the reference counter section the total memory consumption should be pretty minimal.

@RubenKelevra
Copy link

RubenKelevra commented Mar 2, 2021

Well, I don't know I don't see why LRU should be better.

The most similar algorithms to my note would be ARC (which is used in ZFS, but is patented) and CAR.

Here's a paper comparing them to LRU:

https://www.usenix.org/conference/fast-04/car-clock-adaptive-replacement

The File: /ipfs/Qmaksm3FYJupAzkZK6EPpaxjyc6zocJvzufyo6AFSMkRBm

If the cache is too small for the workload, LRU is basically useless (cache trashing occurs), while ARC/CAR outperforms it with 3-4 times more hits.

@gammazero
Copy link
Author

@olizilla @RubenKelevra I want to be clear that, yes, we do want a cache management approach to garbage collection eventually. It should work in harmony with reference counting. Reference counting is solving the specific problem of needing to know if blocks can be removed, without having to load and search other pinned/mfs content. It addresses a situation that is severely impacting pinning service providers.

Gateways, and many general users, have a definite need for cache management. In summary, that means removing blocks that have a lower calculated probability of reuse than others that are competing for storage space (regardless of algorithms used LRU, LFU, ARC, 2-queue, etc.). I am treating that (which blocks to delete first/incrementally) as a separate independent problem from the problem that ref-counting addresses (efficiently determine if a block can be removed).

A cache management strategy should be a separate proposal/project. If it does happen to share some part of implementation with referencing counting, that is an implementation detail and does not mean these are the same problem. There are existing designs for a cache management approach to GC, and these may be useful/informative for creating that solution.

@RubenKelevra
Copy link

@gammazero I agree that they are somewhat separate from the purpose.

But there's some advantage of having the GC holding two lists: Pinned (or saved in the MFS) and unpinned (not saved in the MFS) blocks. This allows for much quicker removal from the datastore since it can just randomly drop blocks from the "unpinned" list until the list is empty or the storage requirements are met again. If you add a size approximation of each block to the CID you could probably even drop blocks without having to request any additional information from the database and just issue removal requests for the CIDs.

While the CIDs in the refcount data structure needs to be accessible via the CID, so a b-tree is probably required, the CIDs in the unpinned list just need an efficient data structure to save memory. This way you would just drop the oldest entries added, like FIFO.

@mikeal mikeal added confidence:low Confidence rating is 5 or below. ease:low Ease rating is 5 or below. impact:low Impact rating is 5 or below. labels Mar 23, 2021
@mikeal mikeal self-requested a review March 23, 2021 21:08
@mikeal
Copy link
Contributor

mikeal commented Mar 23, 2021

would like to see a review approval from @Stebalien on this one since he’s talked a bit about this already.

@mikeal mikeal requested a review from Stebalien March 23, 2021 21:10
Copy link
Member

@Stebalien Stebalien left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In terms of a "we should work on this proposal", this is fine. But can we link to the GC system we discussed?

cc @gammazero

@BigLep BigLep added the Steward Priority Stewards priority project due to enabling us to move faster and/or safer. label Apr 19, 2021
@momack2
Copy link
Contributor

momack2 commented Apr 21, 2021

I don't see this obviously linked here - so bumping the updated design doc on GC implementation based on @gammazero's great research and prototyping: https://hackmd.io/bgU7ti6_THK5gAGcrAhoEw

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
confidence:low Confidence rating is 5 or below. ease:low Ease rating is 5 or below. impact:low Impact rating is 5 or below. Steward Priority Stewards priority project due to enabling us to move faster and/or safer.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants