Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Spike] Model versions parsing #7204

Closed
wants to merge 7 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
57 changes: 41 additions & 16 deletions core/dbt/context/providers.py
Original file line number Diff line number Diff line change
Expand Up @@ -212,16 +212,19 @@ def __call__(self, *args: str) -> Union[str, RelationProxy, MetricReference]:

class BaseRefResolver(BaseResolver):
@abc.abstractmethod
def resolve(self, name: str, package: Optional[str] = None) -> RelationProxy:
def resolve(
self, name: str, package: Optional[str] = None, version: Optional[str] = None
) -> RelationProxy:
...

def _repack_args(self, name: str, package: Optional[str]) -> List[str]:
if package is None:
return [name]
else:
return [package, name]
def _repack_args(self, name: str, package: Optional[str], version: Optional[str]) -> List[str]:
repacked_args = [name] if package is None else [package, name]
if version:
repacked_args.append(f"version:{version}")
Copy link
Contributor Author

@MichelleArk MichelleArk Mar 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opted to repack the version arg as a formatted string to keep the structure of model.refs as List[List[str]] (result of this method is appended to model.refs here.

Leads to some not-so-fun string parsing when refs are processed.

Could also add it as a dictionary, which would hold all kwargs (if we ever add more) going forward. This would be a breaking change in the manifest (model.refs becomes List[List[Union[str, Dict[str,str]]]], but would be more straightforward (somewhat - will need to do some type checking) for consumers to parse when processing refs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My first reaction is to go with the dictionary. I think consumers are going to have to either do the same not-so-fun string parsing or checking for a dictionary anyway. I'm not sure how often external consumers would be processing the refs anyway; I think mostly they rely on us providing the references in the depends_on. @jtcohen6 might have a better idea about that than me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's go with the dictionary. IMO:

  • consumers of the manifest should use depends_on.nodes, rather than refs, in almost all cases
  • the list-of-lists always felt a bit janky

This may require some changes in some of the weirder packages out there, but in a way that should make their lives easier, ultimately

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm good to go with a dictionary as well!

What I'd originally thought up in terms of representation in the manifest.json for model.refs was a List[List[Union[str, Dict[str,str]]]]. For example,

"refs": ["package_name", "model_name", {"version":"3", ...}]

But what might be best is actually just a List[Dict[str, str]]. For example,

"refs": {
"package": "package_name",  # what should we actually call this key..?
"model": "model_name",
"version": "version_name"
}

That would enable us to simplify how we parse ref here as well.

Copy link
Contributor

@jtcohen6 jtcohen6 Mar 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what should we actually call this key..?

We're pretty fast & loose with the package/project distinction. I think this is called package_name in the manifest currently:

"model.test.dim_customers": {
    "resource_type": "model",
    "name": "dim_customers",
    "package_name": "test",
    ...
}

And... it will continue to mean either/both things, since we'll support resolving references to a model in another package (status quo) and another project (via cross-project ref)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I vote for List[Dict[str, str]] too. It's always a bit messy to have two different types. We're not allowing versioned sources, right? Sources currently List[List[str]] but always have two strings in the list, whereas refs only have one. Eventually it might be nice to switch all of those things to dictionaries.

Copy link
Contributor Author

@MichelleArk MichelleArk Mar 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're not allowing versioned sources, right?

right - agree it'd be nice to switch them all to dictionaries. I can make an issue for sources, perhaps we can tackle it sa follow-on tech debt work.


def validate_args(self, name: str, package: Optional[str]):
return repacked_args

def validate_args(self, name: str, package: Optional[str], version: Optional[str]):
if not isinstance(name, str):
raise CompilationError(
f"The name argument to ref() must be a string, got {type(name)}"
Expand All @@ -232,18 +235,26 @@ def validate_args(self, name: str, package: Optional[str]):
f"The package argument to ref() must be a string or None, got {type(package)}"
)

def __call__(self, *args: str) -> RelationProxy:
if version is not None and not isinstance(version, str):
raise CompilationError(
f"The version argument to ref() must be a string or None, got {type(version)}"
)

def __call__(self, *args: str, **kwargs) -> RelationProxy:
name: str
package: Optional[str] = None
version: Optional[str] = None

if len(args) == 1:
name = args[0]
elif len(args) == 2:
package, name = args
else:
raise RefArgsError(node=self.model, args=args)
self.validate_args(name, package)
return self.resolve(name, package)

version = kwargs.get("version")
self.validate_args(name, package, version)
return self.resolve(name, package, version)


class BaseSourceResolver(BaseResolver):
Expand Down Expand Up @@ -448,8 +459,10 @@ def __getattr__(self, name):

# `ref` implementations
class ParseRefResolver(BaseRefResolver):
def resolve(self, name: str, package: Optional[str] = None) -> RelationProxy:
self.model.refs.append(self._repack_args(name, package))
def resolve(
self, name: str, package: Optional[str] = None, version: Optional[str] = None
) -> RelationProxy:
self.model.refs.append(self._repack_args(name, package, version))

return self.Relation.create_from(self.config, self.model)

Expand All @@ -458,23 +471,30 @@ def resolve(self, name: str, package: Optional[str] = None) -> RelationProxy:


class RuntimeRefResolver(BaseRefResolver):
def resolve(self, target_name: str, target_package: Optional[str] = None) -> RelationProxy:
def resolve(
self,
target_name: str,
target_package: Optional[str] = None,
target_version: Optional[str] = None,
) -> RelationProxy:
target_model = self.manifest.resolve_ref(
target_name,
target_package,
target_version,
self.current_project,
self.model.package_name,
)

if target_model is None or isinstance(target_model, Disabled):
# TODO: add version to error
raise TargetNotFoundError(
node=self.model,
target_name=target_name,
target_kind="node",
target_package=target_package,
disabled=isinstance(target_model, Disabled),
)
self.validate(target_model, target_name, target_package)
self.validate(target_model, target_name, target_package, target_version)
return self.create_relation(target_model, target_name)

def create_relation(self, target_model: ManifestNode, name: str) -> RelationProxy:
Expand All @@ -485,10 +505,14 @@ def create_relation(self, target_model: ManifestNode, name: str) -> RelationProx
return self.Relation.create_from(self.config, target_model)

def validate(
self, resolved: ManifestNode, target_name: str, target_package: Optional[str]
self,
resolved: ManifestNode,
target_name: str,
target_package: Optional[str],
target_version: Optional[str],
) -> None:
if resolved.unique_id not in self.model.depends_on.nodes:
args = self._repack_args(target_name, target_package)
args = self._repack_args(target_name, target_package, target_version)
raise RefBadContextError(node=self.model, args=args)


Expand All @@ -498,6 +522,7 @@ def validate(
resolved: ManifestNode,
target_name: str,
target_package: Optional[str],
target_version: Optional[str],
) -> None:
pass

Expand Down
56 changes: 49 additions & 7 deletions core/dbt/contracts/graph/manifest.py
Original file line number Diff line number Diff line change
Expand Up @@ -145,19 +145,22 @@ def perform_lookup(self, unique_id: UniqueID, manifest: "Manifest") -> SourceDef
class RefableLookup(dbtClassMixin):
# model, seed, snapshot
_lookup_types: ClassVar[set] = set(NodeType.refable())
_versioned_types: ClassVar[set] = set(NodeType.versioned())

# refables are actually unique, so the Dict[PackageName, UniqueID] will
# only ever have exactly one value, but doing 3 dict lookups instead of 1
# is not a big deal at all and retains consistency
def __init__(self, manifest: "Manifest"):
self.storage: Dict[str, Dict[PackageName, UniqueID]] = {}
self.storage: Dict[str, Dict[PackageName, List[UniqueID]]] = {}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ref lookups now need to handle 3 access patterns (package handling remains unchanged):

  1. Lookup versioned model by name and version => lookup specified version
  2. Lookup versioned model by name without version => get the latest version
  3. Lookup unversioned model by name only => get only 'version'

For example, a manifest with the following nodes:

"model.jaffle_shop.dim_unversioned": {
  "name": "dim_unversioned",
  "version": null, 
  "is_latest_version": null
},
"model.jaffle_shop.dim_customers.v1": {
  "name": "dim_customers",
  "version": "1", 
  "is_latest_version": false
},
"model.jaffle_shop.dim_customers.v2": {
  "name": "dim_customers",
  "version": "2", 
  "is_latest_version": true
}

should find unique_ids for the following refs via RefableLookup.get_unique_id:

ref("dim_unversioned") => "model.jaffle_shop.dim_unversioned"
ref("dim_customers", version="1") => "model.jaffle_shop.dim_customers.v1"
ref("dim_customers") => "model.jaffle_shop.dim_customers.v2"

Copy link
Contributor Author

@MichelleArk MichelleArk Mar 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I considered a few options for the storage structure here:

  • Dict[str, Dict[PackageName, List[UniqueID]]]
    • "latest" unique_id stored at the front of list
    • first element of UniqueID list always returned, unless a version is specified - in which case iterating through the list is necessary.
    • implemented in this draft
{
"dim_unversioned": {
    "jaffle_shop": ["model.jaffle_shop.dim_unversioned"]
  }, "dim_customers": {
    "jaffle_shop": ["model.jaffle_shop.dim_customers.v2", "model.jaffle_shop.dim_customers.v1"]
  }
}
  • Dict[str, Dict[PackageName, Dict[VersionName, UniqueID]]]
    • doesn't handle unversioned or "latest" versioned lookups models nicely
{ 
"dim_unversioned": {
  "jaffle_shop": {
    "??": "model.jaffle_shop.dim_unversioned"
  },
},
"dim_customers": {
  "jaffle_shop": {
      "2": "model.jaffle_shop.dim_customers.v2", 
      "1": "model.jaffle_shop.dim_customers.v1"
    }
  }
}
  • Dict[str, Dict[PackageName, UniqueID]]
    • doesn't handle "latest" versioned lookups models nicely
{ 
"dim_unversioned": {
  "jaffle_shop": model.jaffle_shop.dim_unversioned"
},
"dim_customers.v1": {
  "jaffle_shop": "model.jaffle_shop.dim_customers.v1"
},
"dim_customers.v2": {
  "jaffle_shop": "model.jaffle_shop.dim_customers.v2"
}
}

Open to feedback and suggestions here!

Copy link
Contributor Author

@MichelleArk MichelleArk Mar 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the third option, we could store an additional key:value pair for each versioned model that maps the model name to its latest unique id. For example,

{ 
"dim_unversioned": {
  "jaffle_shop": model.jaffle_shop.dim_unversioned"
},
"dim_customers.v1": {
  "jaffle_shop": "model.jaffle_shop.dim_customers.v1"
},
"dim_customers.v2": {
  "jaffle_shop": "model.jaffle_shop.dim_customers.v2"
},
"dim_customers": {
  "jaffle_shop": "model.jaffle_shop.dim_customers.v2"
}
}

This has the benefits of:

  • keeping the storage data structure + lookup logic consistent across different lookup implementations
  • mapping closely with the access patterns => easier to reason about + constant-time lookup for all access patterns

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume that the ref_lookup code is going to have to handle external references eventually too, right? Not yet sure how that would impact these design choices.

I am a bit wary of having "latest" only implied by the order of versions in the first option. That feels a bit fragile to me. It doesn't feel like checking the versions would be that much overhead. And it's always going to be numerical order, so the highest integer is always latest, right?

Copy link
Contributor Author

@MichelleArk MichelleArk Mar 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a bit wary of having "latest" only implied by the order of versions in the first option. That feels a bit fragile to me.

Same here. I'm leaning to the structure proposed above for those reasons.

And it's always going to be numerical order, so the highest integer is always latest, right?

This is something I haven't put in much polish on for the spike yet but - but we'll be accepting strings for the version identifier, as well as a model-level latest_version. If the latest_version is provided - then the latest is that value, regardless of what the highest version is (to support the concept of prereleases).

If latest_version is not provided, dbt will compute a default latest version by attempting to order the version identifiers numerically if possible (casting to int/float), and fallback to alphanumeric ordering. This is to support a variety of versioning conventions out of the box, even if our recommendation is to use major versioning only.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think the latest proposed structure is appealing.

self.populate(manifest)

def get_unique_id(self, key, package: Optional[PackageName]):
return find_unique_id_for_package(self.storage, key, package)
def get_unique_id(self, key, package: Optional[PackageName], version: Optional[str]):
return self.find_unique_id_for_package_and_version(key, package, version)

def find(self, key, package: Optional[PackageName], manifest: "Manifest"):
unique_id = self.get_unique_id(key, package)
def find(
self, key, package: Optional[PackageName], version: Optional[str], manifest: "Manifest"
):
unique_id = self.get_unique_id(key, package, version)
if unique_id is not None:
return self.perform_lookup(unique_id, manifest)
return None
Expand All @@ -166,7 +169,14 @@ def add_node(self, node: ManifestNode):
if node.resource_type in self._lookup_types:
if node.name not in self.storage:
self.storage[node.name] = {}
self.storage[node.name][node.package_name] = node.unique_id
if node.package_name not in self.storage[node.name]:
self.storage[node.name][node.package_name] = []

if node.resource_type in self._versioned_types and node.is_latest_version:
# keep latest version at the front of unique id list
self.storage[node.name][node.package_name].insert(0, node.unique_id)
else:
self.storage[node.name][node.package_name].append(node.unique_id)

def populate(self, manifest):
for node in manifest.nodes.values():
Expand All @@ -179,6 +189,37 @@ def perform_lookup(self, unique_id: UniqueID, manifest) -> ManifestNode:
)
return manifest.nodes[unique_id]

def find_unique_id_for_package_and_version(
self, key: str, package: Optional[PackageName], version: Optional[str]
):
if key not in self.storage:
return None

pkg_dct: Mapping[PackageName, List[UniqueID]] = self.storage[key]

unique_ids: Optional[List[UniqueID]] = None

if package is None:
if not pkg_dct:
return None
else:
unique_ids = next(iter(pkg_dct.values()))
elif package in pkg_dct:
unique_ids = pkg_dct[package]
else:
return None

if len(unique_ids) >= 1:
if version:
for unique_id in unique_ids:
if unique_id.endswith(f"v{version}"):
return unique_id
return None
else:
return unique_ids[0]
else:
return None


class MetricLookup(dbtClassMixin):
def __init__(self, manifest: "Manifest"):
Expand Down Expand Up @@ -899,6 +940,7 @@ def resolve_ref(
self,
target_model_name: str,
target_model_package: Optional[str],
target_model_version: Optional[str],
current_project: str,
node_package: str,
) -> MaybeNonSource:
Expand All @@ -908,7 +950,7 @@ def resolve_ref(

candidates = _search_packages(current_project, node_package, target_model_package)
for pkg in candidates:
node = self.ref_lookup.find(target_model_name, pkg, self)
node = self.ref_lookup.find(target_model_name, pkg, target_model_version, self)

if node is not None and node.config.enabled:
return node
Expand Down
9 changes: 9 additions & 0 deletions core/dbt/contracts/graph/nodes.py
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,7 @@

from dbt.dataclass_schema import dbtClassMixin, ExtensibleDbtClassMixin

# from dbt.helper_types import IncludeExclude
from dbt.clients.system import write_file
from dbt.contracts.files import FileHash
from dbt.contracts.graph.unparsed import (
Expand Down Expand Up @@ -392,6 +393,10 @@ def patch(self, patch: "ParsedNodePatch"):
self.created_at = time.time()
self.description = patch.description
self.columns = patch.columns
# TODO - these are model specific, so is access
self.version = patch.version
self.is_latest_version = patch.is_latest_version
self.name = patch.name
# This might not be the ideal place to validate the "access" field,
# but at this point we have the information we need to properly
# validate and we don't before this.
Expand Down Expand Up @@ -513,6 +518,8 @@ class HookNode(CompiledNode):
class ModelNode(CompiledNode):
resource_type: NodeType = field(metadata={"restrict": [NodeType.Model]})
access: AccessType = AccessType.Protected
version: Optional[str] = None
is_latest_version: Optional[bool] = None


# TODO: rm?
Expand Down Expand Up @@ -1190,6 +1197,8 @@ class ParsedPatch(HasYamlMetadata, Replaceable):
class ParsedNodePatch(ParsedPatch):
columns: Dict[str, ColumnInfo]
access: Optional[str]
version: Optional[str]
is_latest_version: Optional[bool]


@dataclass
Expand Down
28 changes: 28 additions & 0 deletions core/dbt/contracts/graph/unparsed.py
Original file line number Diff line number Diff line change
Expand Up @@ -142,15 +142,43 @@ class HasConfig:
config: Dict[str, Any] = field(default_factory=dict)


@dataclass
class UnparsedVersion(HasConfig, HasColumnProps):
defined_in: Optional[str] = None
columns: Sequence[Union[dbt.helper_types.IncludeExclude, UnparsedColumn]] = field(
default_factory=list
)
# TODO: all other model configs? Not sure HasColumnProps is the right thing to use here

@property
def include_exclude(self) -> dbt.helper_types.IncludeExclude:
return self._include_exclude

def __post_init__(self):
has_include_exclude = False
self._include_exclude = dbt.helper_types.IncludeExclude(include=[])
for column in self.columns:
if isinstance(column, dbt.helper_types.IncludeExclude):
if not has_include_exclude:
self._include_exclude = column
has_include_exclude = True
else:
raise ParsingError("version can have at most one include/exclude element")


@dataclass
class UnparsedAnalysisUpdate(HasConfig, HasColumnDocs, HasColumnProps, HasYamlMetadata):
access: Optional[str] = None
latest_version: Optional[str] = None
versions: Sequence[UnparsedVersion] = field(default_factory=list)


@dataclass
class UnparsedNodeUpdate(HasConfig, HasColumnTests, HasColumnAndTestProps, HasYamlMetadata):
quote_columns: Optional[bool] = None
access: Optional[str] = None
latest_version: Optional[str] = None
versions: Sequence[UnparsedVersion] = field(default_factory=list)


@dataclass
Expand Down
1 change: 0 additions & 1 deletion core/dbt/graph/selector_methods.py
Original file line number Diff line number Diff line change
Expand Up @@ -101,7 +101,6 @@ def __init__(
def parsed_nodes(
self, included_nodes: Set[UniqueId]
) -> Iterator[Tuple[UniqueId, ManifestNode]]:

for key, node in self.manifest.nodes.items():
unique_id = UniqueId(key)
if unique_id not in included_nodes:
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -19,13 +19,17 @@

{% macro default__generate_alias_name(custom_alias_name=none, node=none) -%}

{%- if custom_alias_name is none -%}
{%- if custom_alias_name -%}

{{ node.name }}
{{ custom_alias_name | trim }}

{%- elif node.version -%}

{{ return(node.name ~ "_v" ~ node.version) }}

{%- else -%}

{{ custom_alias_name | trim }}
{{ node.name }}

{%- endif -%}

Expand Down
4 changes: 4 additions & 0 deletions core/dbt/node_types.py
Original file line number Diff line number Diff line change
Expand Up @@ -56,6 +56,10 @@ def refable(cls) -> List["NodeType"]:
cls.Snapshot,
]

@classmethod
def versioned(cls) -> List["NodeType"]:
return [cls.Model]

@classmethod
def documentable(cls) -> List["NodeType"]:
return [
Expand Down
Loading