-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Python: Add Schema class #4318
Python: Add Schema class #4318
Conversation
python/src/iceberg/table/schema.py
Outdated
|
||
return field_id | ||
|
||
def get_field(self, field_id: int) -> NestedField: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These should use maps that are constructed by indexing with a SchemaVisitor. All of the methods on schema are intended to return a field by full name or by ID from anywhere in the schema. That's why schema is not just a regular struct. Structs will return fields by name or ID, but are limited to just that struct.
That's also why we use names like find_field
rather than get_field
(in addition, we avoid get
in methods names generally).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the explanation, I saw that in the legacy implementation but now I understand the purpose. I'll update this today.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rdblue I updated the PR to use the visitor pattern and added the IndexById
and IndexByName
schema visitors. There are four more schema visitor classes but I think those should be added in follow-up PRs. Let me know what you think:
- GetProjectedIds
- PruneColumns
- AssignFreshIds
- CheckCompatibility
python/src/iceberg/table/schema.py
Outdated
return field | ||
raise ValueError(f"Cannot get field, ID does not exist: {field_id}") | ||
|
||
def find_field_type(self, field_id: int) -> IcebergType: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be re-named to def find_field_type_by_id(...)
to be more explicit?
This seems a little ambiguous
table_schema.find_field_type(1) # StringType()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, updated 👍
Rebased to add the changes from the docstest PR that was merged in. |
@rdblue thanks for the review! I brought over the the schema and visitors from the partition spec PR which also addresses the other comments. |
python/src/iceberg/table/schema.py
Outdated
raise ValueError("Cannot find field: {name_or_id}") | ||
return matched_fields[0] | ||
|
||
def find_field(self, name_or_id: Union[str, int], case_sensitive: bool = True) -> NestedField: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you know when we can use PEP 604?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's been back-ported the earliest version we support but not enabled by default. We'd have to add from __future__ import annotations
to the top of every file where we use it. I do like the newer syntax, let me know if the visual cost of the import feels worth it and I'll go ahead and update these.
Added an Javaimport org.apache.iceberg.Schema;
import org.apache.iceberg.types.Types;
import org.apache.iceberg.types.TypeUtil;
Schema schema = new Schema(
Types.NestedField.required(1, "foo", Types.StringType.get()),
Types.NestedField.optional(2, "bar", Types.IntegerType.get()),
Types.NestedField.required(3, "baz", Types.BooleanType.get()),
Types.NestedField.required(4, "qux", Types.ListType.ofOptional(5, Types.StringType.get())),
Types.NestedField.required(6, "quux", Types.MapType.ofOptional(7, 8, Types.StringType.get(), Types.MapType.ofOptional(9, 10, Types.StringType.get(), Types.IntegerType.get())))
);
Map<String, Integer> index = TypeUtil.indexByName(schema.asStruct());
System.out.println(index); output:
Pythonfrom iceberg.types import (
BooleanType,
IntegerType,
ListType,
MapType,
NestedField,
StringType,
StructType,
)
from iceberg.table.schema import Schema, index_by_name
schema = Schema(
NestedField(field_id=1, name="foo", field_type=StringType(), is_optional=False),
NestedField(field_id=2, name="bar", field_type=IntegerType(), is_optional=True),
NestedField(field_id=3, name="baz", field_type=BooleanType(), is_optional=False),
NestedField(field_id=4, name="qux", field_type=ListType(element_id=5, element_type=StringType(), element_is_optional=True), is_optional=True),
NestedField(field_id=6, name="quux", field_type=MapType(key_id=7, key_type=StringType(), value_id=8, value_type=MapType(key_id=9, key_type=StringType(), value_id=10, value_type=IntegerType(), value_is_optional=True), value_is_optional=True), is_optional=True)
)
index = index_by_name(schema)
print(index) output:
|
Also, here's a comparison to Javaimport org.apache.iceberg.Schema;
import org.apache.iceberg.types.Types;
import org.apache.iceberg.types.TypeUtil;
Schema schema = new Schema(
Types.NestedField.required(1, "foo", Types.StringType.get()),
Types.NestedField.optional(2, "bar", Types.IntegerType.get()),
Types.NestedField.required(3, "baz", Types.BooleanType.get()),
Types.NestedField.required(4, "qux", Types.ListType.ofOptional(5, Types.StringType.get())),
Types.NestedField.required(6, "quux", Types.MapType.ofOptional(7, 8, Types.StringType.get(), Types.MapType.ofOptional(9, 10, Types.StringType.get(), Types.IntegerType.get())))
);
Map<Integer, Types.NestedField> index = TypeUtil.indexById(schema.asStruct());
System.out.println(index); output:
Pythonfrom iceberg.types import (
BooleanType,
IntegerType,
ListType,
MapType,
NestedField,
StringType,
StructType,
)
from iceberg.table.schema import Schema, index_by_id
schema = Schema(
NestedField(field_id=1, name="foo", field_type=StringType(), is_optional=False),
NestedField(field_id=2, name="bar", field_type=IntegerType(), is_optional=True),
NestedField(field_id=3, name="baz", field_type=BooleanType(), is_optional=False),
NestedField(field_id=4, name="qux", field_type=ListType(element_id=5, element_type=StringType(), element_is_optional=True), is_optional=True),
NestedField(field_id=6, name="quux", field_type=MapType(key_id=7, key_type=StringType(), value_id=8, value_type=MapType(key_id=9, key_type=StringType(), value_id=10, value_type=IntegerType(), value_is_optional=True), value_is_optional=True), is_optional=True)
)
index = index_by_id(schema)
print(index) output:
|
Co-authored-by: nssalian <[email protected]>
Rebased and also updated it to call
|
python/src/iceberg/schema.py
Outdated
str: The column name | ||
""" | ||
column = self._lazy_id_to_field().get(column_id) | ||
return None if column is None else column.name # type: ignore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This actually needs to return the full name, not the field name. In Java, this uses the same index visitor as the name to ID, but it produces the byId
map.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, other than the find_column_name
method doesn't return the full column name.
@samredai, if you want, you can throw |
This adds the Schema class by carving it out of PR #3228 by @nssalian and building upon it.
A
Schema
object is created by providing a list ofNestedField
instances, a schema ID, and optionally a map of aliases to field IDs. Theget_field_id(...)
method allows you to retrieve a field ID by providing the field name or an alias, and theget_field(...)
method allows you to retrieve the field object by providing the field ID. There's also aget_type
method that lets you retrieve the type of field by providing the field ID.output