-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support all iceberg types in python library #3234
Conversation
python/src/iceberg/types.py
Outdated
|
||
@unique | ||
class TypeID(Enum): | ||
BOOLEAN = {"java_class": "Boolean.class", "python_class": bool, "id": 1} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need the Java class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is used in Java library for partition_spec
's lazyJavaClasses
or object cast in Accessors.
Seems no need here as python will not cast data to a Java object.
python/src/iceberg/types.py
Outdated
def as_list_type(self): | ||
raise ValueError("Not a list type: " + self) | ||
|
||
def asMapType(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this name wasn't updated to snake_case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, that is from the legacy library and will clean up those.
python/src/iceberg/types.py
Outdated
class Type(object): | ||
length: int | ||
scale: int | ||
precision: int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do these need to be here or on the specific types, fixed and decimal?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree and they should not be there for all types.
python/src/iceberg/types.py
Outdated
return type(self) == type(other) | ||
|
||
def __ne__(self, other): | ||
return not self.__eq__(other) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't the default implementation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems not completely same.
Based on https://docs.python.org/3/reference/datamodel.html#object.__ne__
By default, object implements __eq__() by using is, returning NotImplemented in the case of a false comparison: True if x is y else NotImplemented. For __ne__(), by default it delegates to __eq__() and inverts the result unless it is NotImplemented.
default will throw NotImplemented
if false.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure? This sounds like what you're doing here:
For ne(), by default it delegates to eq() and inverts the result unless it is NotImplemented.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 I think it's the same. "returns NotImplemented
when false" is describing the __eq__
behavior but once you implement that you get __ne__
for free.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I meant is that the default one in python3 is slightly different from this one in the python legacy module.
This implementation in python_legacy only works if type(self)
implements __eq__
.
The default one in python 3 is better and can handle other cases. IIUC, it works like
def __ne__(self, other):
result = self.__eq__(other)
if result is NotImplemented:
return NotImplemented
return not result
So we should remove __ne__
here if we want to support !=
.
python/src/iceberg/types.py
Outdated
return False | ||
|
||
def as_primitive_type(self): | ||
raise ValueError("Not a primitive type: " + self) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these as_
and is_
methods are valuable if we have type annotations. Is the plan to add them later?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, I didn't add it and might add it later if needed.
Usually, as_type
is used to do the real data transform in python and not sure a Java as_type
kind of method (just return the concrete class type) is helpful in python.
python/src/iceberg/types.py
Outdated
self._precision = precision | ||
self._scale = scale | ||
|
||
def precision(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should precision and scale also have @property
decorators? Any plans to add return type annotations to all properties (e.g def precision(self) -> int:
)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for @property
and annotations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, SG.
python/src/iceberg/types.py
Outdated
|
||
class FixedType(Type): | ||
def __init__(self, length: int): | ||
super().__init__(f"fixed[{length}]", f"FixedType[{length}]", is_primitive=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The repr string should be something that you can paste into Python to re-create the object, so this should produce FixedType({length})
. That is, it should use parens instead of square brackets.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the comment and updated accordingly.
return self._is_primitive | ||
|
||
|
||
class FixedType(Type): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this needs a @property
method for length
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
def type(self): | ||
return self._type | ||
|
||
def __repr__(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The implementation here should be the one for __str__
. The __repr__
implementation should produce a string that is the Python representation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what do you think of having the named arguments in the repr?
return (f"NestedField(is_optional={self._is_optional}, field_id={self._id}, "
f"name={repr(self._name)}, field_type={repr(self._type)}, doc={repr(self._doc)})")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm all for it!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SG, adding them in #3350
super().__init__(f"map<{key_field.type}, {value_field.type}>", | ||
f"MapType<{key_field.type}, {value_field.type}>") | ||
self._key_field = key_field | ||
self._value_field = value_field |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Accessor methods for key and value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, updated.
def __init__(self, key_field: NestedField, value_field: NestedField): | ||
super().__init__(f"map<{key_field.type}, {value_field.type}>", | ||
f"MapType({repr(key_field)}, {repr(value_field)})") | ||
self._key_field = key_field |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
small nit: Can this just argument just be key
and then this line can be self._key = key
? Same for value_field
and element_field
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think key, value, and element are probably better names for the constructor args.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SG, updating the argument names in #3350
Should we add the auto formatting stuff in this PR to prevent having to do a huge auto format later? (It's also fine to do in a follow up PR). I was thinking something like: [testenv:format]
description = reformat all source code
basepython = python3
deps =
black
isort
flake8
skip_install = true
commands =
isort --recursive --project iceberg --profile black setup.py src tests
black setup.py src tests
flake8 setup.py src tests
[testenv:linters]
basepython = python3
skip_install = true
deps =
.
{[testenv:isort]deps}
{[testenv:black]deps}
{[testenv:flake8]deps}
{[testenv:bandit]deps}
{[testenv:mypy]deps}
commands =
{[testenv:isort]deps}
{[testenv:black]deps}
{[testenv:flake8]commands}
{[testenv:bandit]commands}
{[testenv:mypy]commands}
[testenv:isort]
basepython = python3
skip_install = true
deps =
isort
commands =
isort --recursive --project iceberg --profile black --check-only setup.py src tests
[testenv:black]
basepython = python3
skip_install = true
deps =
black
commands =
black --check --diff src setup.py tests |
@samredai, if we can format the code here, let's do that. But I wouldn't say let's update the config for formatting, since that's a separate change. |
|
||
|
||
class NestedField(object): | ||
def __init__(self, is_optional: bool, field_id: int, name: str, field_type: Type, doc=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should doc have a type annotation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated it with str annotation in #3350
[BooleanType, IntegerType, LongType, FloatType, DoubleType, DateType, TimeType, | ||
TimestampType, TimestamptzType, StringType, UUIDType, BinaryType]) | ||
def test_repr_primitive_types(input_type): | ||
assert input_type == eval(repr(input_type)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
|
||
@pytest.mark.parametrize("input_type", | ||
[BooleanType, IntegerType, LongType, FloatType, DoubleType, DateType, TimeType, | ||
TimestampType, TimestamptzType, StringType, UUIDType, BinaryType]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we add struct, map, and list cases, too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those are tested in individual tests later with additional asserts in addition to eval of repr.
Nice work, @jun-he! There are only nits left, so I'm going to commit this to keep it moving. Thanks! |
For #3216
Note: I collapse type.py and types.py in the python_legacy to the new library, which is similar to existing Java implementation. Please take a look at it. We can use it as an example to discuss how to move towards for pythonic refactoring.