-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Spike] Model versions parsing #7204
Changes from all commits
9b9d02d
9b2f8fe
9b5acff
810cec1
96b2a6f
cc3b6af
3d12e45
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -145,19 +145,22 @@ def perform_lookup(self, unique_id: UniqueID, manifest: "Manifest") -> SourceDef | |
class RefableLookup(dbtClassMixin): | ||
# model, seed, snapshot | ||
_lookup_types: ClassVar[set] = set(NodeType.refable()) | ||
_versioned_types: ClassVar[set] = set(NodeType.versioned()) | ||
|
||
# refables are actually unique, so the Dict[PackageName, UniqueID] will | ||
# only ever have exactly one value, but doing 3 dict lookups instead of 1 | ||
# is not a big deal at all and retains consistency | ||
def __init__(self, manifest: "Manifest"): | ||
self.storage: Dict[str, Dict[PackageName, UniqueID]] = {} | ||
self.storage: Dict[str, Dict[PackageName, List[UniqueID]]] = {} | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
For example, a manifest with the following nodes: "model.jaffle_shop.dim_unversioned": {
"name": "dim_unversioned",
"version": null,
"is_latest_version": null
},
"model.jaffle_shop.dim_customers.v1": {
"name": "dim_customers",
"version": "1",
"is_latest_version": false
},
"model.jaffle_shop.dim_customers.v2": {
"name": "dim_customers",
"version": "2",
"is_latest_version": true
} should find unique_ids for the following refs via
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I considered a few options for the
{
"dim_unversioned": {
"jaffle_shop": ["model.jaffle_shop.dim_unversioned"]
}, "dim_customers": {
"jaffle_shop": ["model.jaffle_shop.dim_customers.v2", "model.jaffle_shop.dim_customers.v1"]
}
}
{
"dim_unversioned": {
"jaffle_shop": {
"??": "model.jaffle_shop.dim_unversioned"
},
},
"dim_customers": {
"jaffle_shop": {
"2": "model.jaffle_shop.dim_customers.v2",
"1": "model.jaffle_shop.dim_customers.v1"
}
}
}
{
"dim_unversioned": {
"jaffle_shop": model.jaffle_shop.dim_unversioned"
},
"dim_customers.v1": {
"jaffle_shop": "model.jaffle_shop.dim_customers.v1"
},
"dim_customers.v2": {
"jaffle_shop": "model.jaffle_shop.dim_customers.v2"
}
} Open to feedback and suggestions here! There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. For the third option, we could store an additional key:value pair for each versioned model that maps the model name to its latest unique id. For example, {
"dim_unversioned": {
"jaffle_shop": model.jaffle_shop.dim_unversioned"
},
"dim_customers.v1": {
"jaffle_shop": "model.jaffle_shop.dim_customers.v1"
},
"dim_customers.v2": {
"jaffle_shop": "model.jaffle_shop.dim_customers.v2"
},
"dim_customers": {
"jaffle_shop": "model.jaffle_shop.dim_customers.v2"
}
} This has the benefits of:
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I assume that the ref_lookup code is going to have to handle external references eventually too, right? Not yet sure how that would impact these design choices. I am a bit wary of having "latest" only implied by the order of versions in the first option. That feels a bit fragile to me. It doesn't feel like checking the versions would be that much overhead. And it's always going to be numerical order, so the highest integer is always latest, right? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Same here. I'm leaning to the structure proposed above for those reasons.
This is something I haven't put in much polish on for the spike yet but - but we'll be accepting strings for the version identifier, as well as a model-level If latest_version is not provided, dbt will compute a default latest version by attempting to order the version identifiers numerically if possible (casting to int/float), and fallback to alphanumeric ordering. This is to support a variety of versioning conventions out of the box, even if our recommendation is to use major versioning only. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, I think the latest proposed structure is appealing. |
||
self.populate(manifest) | ||
|
||
def get_unique_id(self, key, package: Optional[PackageName]): | ||
return find_unique_id_for_package(self.storage, key, package) | ||
def get_unique_id(self, key, package: Optional[PackageName], version: Optional[str]): | ||
return self.find_unique_id_for_package_and_version(key, package, version) | ||
|
||
def find(self, key, package: Optional[PackageName], manifest: "Manifest"): | ||
unique_id = self.get_unique_id(key, package) | ||
def find( | ||
self, key, package: Optional[PackageName], version: Optional[str], manifest: "Manifest" | ||
): | ||
unique_id = self.get_unique_id(key, package, version) | ||
if unique_id is not None: | ||
return self.perform_lookup(unique_id, manifest) | ||
return None | ||
|
@@ -166,7 +169,14 @@ def add_node(self, node: ManifestNode): | |
if node.resource_type in self._lookup_types: | ||
if node.name not in self.storage: | ||
self.storage[node.name] = {} | ||
self.storage[node.name][node.package_name] = node.unique_id | ||
if node.package_name not in self.storage[node.name]: | ||
self.storage[node.name][node.package_name] = [] | ||
|
||
if node.resource_type in self._versioned_types and node.is_latest_version: | ||
# keep latest version at the front of unique id list | ||
self.storage[node.name][node.package_name].insert(0, node.unique_id) | ||
else: | ||
self.storage[node.name][node.package_name].append(node.unique_id) | ||
|
||
def populate(self, manifest): | ||
for node in manifest.nodes.values(): | ||
|
@@ -179,6 +189,37 @@ def perform_lookup(self, unique_id: UniqueID, manifest) -> ManifestNode: | |
) | ||
return manifest.nodes[unique_id] | ||
|
||
def find_unique_id_for_package_and_version( | ||
self, key: str, package: Optional[PackageName], version: Optional[str] | ||
): | ||
if key not in self.storage: | ||
return None | ||
|
||
pkg_dct: Mapping[PackageName, List[UniqueID]] = self.storage[key] | ||
|
||
unique_ids: Optional[List[UniqueID]] = None | ||
|
||
if package is None: | ||
if not pkg_dct: | ||
return None | ||
else: | ||
unique_ids = next(iter(pkg_dct.values())) | ||
elif package in pkg_dct: | ||
unique_ids = pkg_dct[package] | ||
else: | ||
return None | ||
|
||
if len(unique_ids) >= 1: | ||
if version: | ||
for unique_id in unique_ids: | ||
if unique_id.endswith(f"v{version}"): | ||
return unique_id | ||
return None | ||
else: | ||
return unique_ids[0] | ||
else: | ||
return None | ||
|
||
|
||
class MetricLookup(dbtClassMixin): | ||
def __init__(self, manifest: "Manifest"): | ||
|
@@ -899,6 +940,7 @@ def resolve_ref( | |
self, | ||
target_model_name: str, | ||
target_model_package: Optional[str], | ||
target_model_version: Optional[str], | ||
current_project: str, | ||
node_package: str, | ||
) -> MaybeNonSource: | ||
|
@@ -908,7 +950,7 @@ def resolve_ref( | |
|
||
candidates = _search_packages(current_project, node_package, target_model_package) | ||
for pkg in candidates: | ||
node = self.ref_lookup.find(target_model_name, pkg, self) | ||
node = self.ref_lookup.find(target_model_name, pkg, target_model_version, self) | ||
|
||
if node is not None and node.config.enabled: | ||
return node | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Opted to repack the
version
arg as a formatted string to keep the structure ofmodel.refs
asList[List[str]]
(result of this method is appended to model.refs here.Leads to some not-so-fun string parsing when refs are processed.
Could also add it as a dictionary, which would hold all kwargs (if we ever add more) going forward. This would be a breaking change in the manifest (model.refs becomes
List[List[Union[str, Dict[str,str]]]]
, but would be more straightforward (somewhat - will need to do some type checking) for consumers to parse when processing refs.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My first reaction is to go with the dictionary. I think consumers are going to have to either do the same not-so-fun string parsing or checking for a dictionary anyway. I'm not sure how often external consumers would be processing the refs anyway; I think mostly they rely on us providing the references in the depends_on. @jtcohen6 might have a better idea about that than me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's go with the dictionary. IMO:
depends_on.nodes
, rather thanrefs
, in almost all casesThis may require some changes in some of the weirder packages out there, but in a way that should make their lives easier, ultimately
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm good to go with a dictionary as well!
What I'd originally thought up in terms of representation in the manifest.json for
model.refs
was aList[List[Union[str, Dict[str,str]]]]
. For example,But what might be best is actually just a
List[Dict[str, str]]
. For example,That would enable us to simplify how we parse
ref
here as well.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're pretty fast & loose with the
package
/project
distinction. I think this is calledpackage_name
in the manifest currently:And... it will continue to mean either/both things, since we'll support resolving
ref
erences to a model in anotherpackage
(status quo) and anotherproject
(via cross-projectref
)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I vote for List[Dict[str, str]] too. It's always a bit messy to have two different types. We're not allowing versioned sources, right? Sources currently List[List[str]] but always have two strings in the list, whereas refs only have one. Eventually it might be nice to switch all of those things to dictionaries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right - agree it'd be nice to switch them all to dictionaries. I can make an issue for sources, perhaps we can tackle it sa follow-on tech debt work.