Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add storageID to tier_fs #8549

Open
wants to merge 11 commits into
base: master
Choose a base branch
from
Open

Conversation

nadavsteindler
Copy link
Contributor

Closes #8546

@nadavsteindler nadavsteindler self-assigned this Jan 26, 2025
@nadavsteindler nadavsteindler added exclude-changelog PR description should not be included in next release changelog tech-debt and removed tech-debt labels Jan 26, 2025
Copy link

E2E Test Results - DynamoDB Local - Local Block Adapter

13 passed

Copy link

E2E Test Results - Quickstart

11 passed

@nadavsteindler nadavsteindler marked this pull request as draft January 27, 2025 08:08
@nadavsteindler nadavsteindler marked this pull request as ready for review January 27, 2025 08:23
Copy link
Contributor

@itaigilo itaigilo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice stuff, thanks for this.

I think that some places are are still missing storageID,
Plus the ns to namespace renaming.

if err != nil {
return nil, err
}
return NewValueIterator(it), nil
}

func (c *committedManager) WriteRange(ctx context.Context, ns graveler.StorageNamespace, it graveler.ValueIterator) (*graveler.RangeInfo, error) {
writer, err := c.RangeManager.GetWriter(ctx, Namespace(ns), nil)
func (c *committedManager) WriteRange(ctx context.Context, storageID graveler.StorageID, ns graveler.StorageNamespace, it graveler.ValueIterator) (*graveler.RangeInfo, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please rename ns to namespace?
So at least in the context of this manager, the term StorageID will always be attached to Namespace.

(Here, and in every function that have ns.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hear what you're saying, but I don't think we should try to go into graveler and try to make all the names conform to some coding convention- that's not the goal of this PR

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not the goal, but a side-effect. Renaming it to namespace will help us maintain the code in its new form (that's including storageID).
It's a simple search-and-replace in this file that will improve its readability and align with our general code style of this repo.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is like tabs/spaces. You end up with a commit war over code-style nonsense that's a huge waste of everyone's time.

ns is an acceptable variable name for Go: https://www.reddit.com/r/golang/comments/3aporh/why_so_many_gophers_use_single_letter_variables/

I prefer a little more descriptive names, but I'm not going to go through our whole codebase and start correcting other people's naming

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So at least rename storageID to sid to keep the same convention.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can both be right!

  1. We can decide that "ns" is a "familiar variable name". Reddit is not authoritative; the Google Go Style Guide is, and this is the section on abbreviated naming. Single letters are encouraged in receivers, and abbreviations are cool for a few other cases. I am not convinced that "ns" fits these, but I have no strong feelings.
  2. We should definitely not do this renaming in this PR, it is already quite large. It is perfectly acceptable to decide and then do this in a separate PR, either before or after this one. But not as part of this one.

@@ -228,6 +228,7 @@ type mergeContext struct {
srcIt Iterator
baseIt Iterator
strategy graveler.MergeStrategy
storageID graveler.StorageID
ns graveler.StorageNamespace
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this also be renamed from ns to namespace, to align terminology?

@@ -14,6 +14,9 @@ type ID string
// Namespace is namespace for ID ranges
type Namespace string

// StorageID is id for object storage
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please place StorageID above Namespace (to be consistent).


// Exists returns true if a MetaRange matching ID exists in namespace ns.
Exists(ctx context.Context, ns StorageNamespace, id MetaRangeID) (bool, error)
Exists(ctx context.Context, storageID StorageID, ns StorageNamespace, id MetaRangeID) (bool, error)

// WriteMetaRangeByIterator flushes the iterator to a new MetaRange and returns the created ID.
WriteMetaRangeByIterator(ctx context.Context, ns StorageNamespace, it ValueIterator, metadata Metadata) (*MetaRangeID, error)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't WriteMetaRangeByIterator also get StorageID?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch


// Commit is the act of taking an existing metaRange (snapshot) and applying a set of changes to it.
// A change is either an entity to write/overwrite, or a tombstone to mark a deletion
// it returns a new MetaRangeID that is expected to be immediately addressable
Commit(ctx context.Context, ns StorageNamespace, baseMetaRangeID MetaRangeID, changes ValueIterator, allowEmpty bool, opts ...SetOptionsFunc) (MetaRangeID, DiffSummary, error)
Commit(ctx context.Context, storageID StorageID, ns StorageNamespace, baseMetaRangeID MetaRangeID, changes ValueIterator, allowEmpty bool, opts ...SetOptionsFunc) (MetaRangeID, DiffSummary, error)

// GetMetaRange returns information where metarangeID is stored.
GetMetaRange(ctx context.Context, ns StorageNamespace, metaRangeID MetaRangeID) (MetaRangeAddress, error)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't GetMetaRange and GetRange also get StorageID?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is a funny one. It takes namespace, but if you follow the call tree down it actually doesn't use it anywhere
image

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please check with @arielshaqed or someone with more context if these params can be removed from these functions?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you're right! Please remove it, but in a separate PR. Otherwise it is very hard to navigate among so many changes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both GetMetaRange and GetRange end up in the GetRemoteURI() implementation, in tier_fs.go.

So if I get it right, FS.GetRemoteURI (in pyramid.go) gets namespace as a parameter, but it isn't used.

The way I see it, we have two options:
Either validate that namespace is indeed not needed in the FS interface and remove it,
Or assume that namespace was added to this interface so that all of its methods will get namespace as a parameter, and in this case storageID should also be added, to align.

pkg/pyramid/tier_fs.go Outdated Show resolved Hide resolved
pkg/pyramid/pyramid.go Outdated Show resolved Hide resolved
pkg/pyramid/pyramid.go Outdated Show resolved Hide resolved
pkg/pyramid/pyramid.go Outdated Show resolved Hide resolved
pkg/pyramid/tier_fs.go Outdated Show resolved Hide resolved
@nadavsteindler nadavsteindler force-pushed the origin/fix/tierfs-storageid2 branch from 0a12d54 to 3836080 Compare January 29, 2025 15:29
@nadavsteindler
Copy link
Contributor Author

me places are are still missing storageID,
Plus the ns to namespace renaming.

OK I added to the other places.
re. renaming- I don't think this is the place to start going through graveler and making general coding style changes. That doesn't advance out task of implementing multi-storage and it seems to me like a waste of time, especially given that we clearly don't have a consensus in our organization about how things should be named. There is a strong tradition in Go for really short variable names, but that clearly isn't a consensus either...

@nadavsteindler nadavsteindler enabled auto-merge (squash) January 30, 2025 05:47
Copy link
Contributor

@arielshaqed arielshaqed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Barging in again as my username was uttered.

First of all thanks for handling a change that manages to infiltrate everything! Even if we reduce explicitness somewhat as offered in some comments, it still obviously changes many components.

As a consequence, though this PR is really large and makes changes all over the code. This is obviously not bad! But I have some suggestions for how we might manage the scope here.

More smaller changes

I would really like it broken into smaller PRs or into more logical commits. For instance, there's a chain of removing the unneeded StorageNamespace argument from GetRemoteURI. This change seems very good - but it does not belong on this PR. Having to review it together with StorageIDs is much harder. (This request is not blocking, but I believe that it will speed time-to-merge of the sum of all PRs.)

Tests

On the dev side, I am unable to find anywhere that tests a non-empty storage ID. This might be worrying? (This request is blocking: we should test non-empty storage IDs, or state on the description of this PR why the existing tests are sufficient, or keep the issue open until we do have a PR that tests it.)

Scope

In some cases it might clarify the code to keep StorageID on some object, rather than drill it through the objects. For instance, AFAIU a MetaRangeManager and its committed.rangeManager refer to objects on a single StorageID; if we keep that StorageID on the object we might be able to avoid having to pass it through so many layers. Everything in the same commit should be on the same storage adapter, amirite? So the CommittedManager might just keep maps StorageID -> stuff, and then never pass all those StorageID arguments.

As a reminder, at some point in the design process we talked about keeping StorageID on the context of each request, and then retrieving it only in the adapter. We decided on the more explicit approach, but perhaps we can dial back on this explicitness? (@itaigilo and/or @N-o-Z can say).

(The only part of this request which is blocking is to understand whether StorageIDs need to be quite this explicit. Explaining why they should be explicit will unblock this.)

@@ -889,7 +889,7 @@ func createPrepareUncommittedTestScenario(t *testing.T, repositoryID string, num
sort.Slice(diffs[i], func(ii, jj int) bool {
return bytes.Compare(diffs[i][ii].Key, diffs[i][jj].Key) < 0
})
test.CommittedManager.EXPECT().Diff(gomock.Any(), repository.StorageNamespace, gomock.Any(), branches[i].CompactedBaseMetaRangeID).MinTimes(1).Return(gUtils.NewDiffIter(diffs[i]), nil)
test.CommittedManager.EXPECT().Diff(gomock.Any(), gomock.Any(), repository.StorageNamespace, gomock.Any(), branches[i].CompactedBaseMetaRangeID).MinTimes(1).Return(gUtils.NewDiffIter(diffs[i]), nil)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we expect the correct storageID here and in other places?


rangeManager := mock.NewMockRangeManager(ctrl)

writer.EXPECT().Abort().AnyTimes()
metaRangeId := graveler.MetaRangeID("import")
writer.EXPECT().Close(gomock.Any()).Return(&metaRangeId, nil).AnyTimes()
committedManager := committed.NewCommittedManager(metaRangeManager, rangeManager, params)
_, err := committedManager.Import(ctx, "ns", destMetaRangeID, sourceMetaRangeID, tst.prefixes)
_, err := committedManager.Import(ctx, "", "ns", destMetaRangeID, sourceMetaRangeID, tst.prefixes)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may indicate a TODO: Import will need to support multiple storages, may as well mark that we'll test it here (or elsewhere; depends on how you plan to do that).

if err != nil {
return nil, err
}
return NewValueIterator(it), nil
}

func (c *committedManager) WriteRange(ctx context.Context, ns graveler.StorageNamespace, it graveler.ValueIterator) (*graveler.RangeInfo, error) {
writer, err := c.RangeManager.GetWriter(ctx, Namespace(ns), nil)
func (c *committedManager) WriteRange(ctx context.Context, storageID graveler.StorageID, ns graveler.StorageNamespace, it graveler.ValueIterator) (*graveler.RangeInfo, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can both be right!

  1. We can decide that "ns" is a "familiar variable name". Reddit is not authoritative; the Google Go Style Guide is, and this is the section on abbreviated naming. Single letters are encouraged in receivers, and abbreviations are cool for a few other cases. I am not convinced that "ns" fits these, but I have no strong feelings.
  2. We should definitely not do this renaming in this PR, it is already quite large. It is perfectly acceptable to decide and then do this in a separate PR, either before or after this one. But not as part of this one.

@@ -43,19 +43,19 @@ func NewMetaRangeManager(params Params, metaManager, rangeManager RangeManager)
}, nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes here and in range_manager.go seem odd. AFAICT, a metarange from a storage backend will manage ranges from that same backend. So why not hold storage ID on the metarange and range managers? Everything on them is either externally owned (like TierFS) or reference-counted (like the pebble read cache, which we manage as "cache Unrefer"). So committedManager might hold a map from StorageID to their metarange and range managers. This would allow us to limit the spread of StorageIDs throughout the code.


// Commit is the act of taking an existing metaRange (snapshot) and applying a set of changes to it.
// A change is either an entity to write/overwrite, or a tombstone to mark a deletion
// it returns a new MetaRangeID that is expected to be immediately addressable
Commit(ctx context.Context, ns StorageNamespace, baseMetaRangeID MetaRangeID, changes ValueIterator, allowEmpty bool, opts ...SetOptionsFunc) (MetaRangeID, DiffSummary, error)
Commit(ctx context.Context, storageID StorageID, ns StorageNamespace, baseMetaRangeID MetaRangeID, changes ValueIterator, allowEmpty bool, opts ...SetOptionsFunc) (MetaRangeID, DiffSummary, error)

// GetMetaRange returns information where metarangeID is stored.
GetMetaRange(ctx context.Context, ns StorageNamespace, metaRangeID MetaRangeID) (MetaRangeAddress, error)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you're right! Please remove it, but in a separate PR. Otherwise it is very hard to navigate among so many changes.

@itaigilo
Copy link
Contributor

Well, just saying that my last batch of comments were added before seeing Ariel's review 😄

And regarding this important comment:

...at some point in the design process we talked about keeping StorageID on the context of each request, and then retrieving it only in the adapter. We decided on the more explicit approach, but perhaps we can dial back on this explicitness? (@itaigilo and/or @N-o-Z can say).

I'm still good with the explicit approach - mainly since being implicit makes it much harder to trace bugs, and easier to forget when adding new flows (that storageID is relevant for).
But if it helps to make the code much cleaner, it's worth considering (even if we put it on the context only for the catalog/graveler etc.).
I think it's best to have a short f2f meeting about it and decide.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
exclude-changelog PR description should not be included in next release changelog
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add StorageID to objPointer of TierFS
3 participants