Support all iceberg types in python library #3234

jun-he · 2021-10-06T15:31:33Z

For #3216
Note: I collapsed type.py and types.py from python_legacy into the new library, which is similar to the existing Java implementation. Please take a look. We can use it as an example to discuss how to move towards a more pythonic refactoring.

rdblue · 2021-10-06T15:48:17Z

python/src/iceberg/types.py

+
+@unique
+class TypeID(Enum):
+    BOOLEAN = {"java_class": "Boolean.class", "python_class": bool, "id": 1}


Do we need the Java class?

That is used in the Java library for partition_spec's lazyJavaClasses and for object casts in Accessors.
It doesn't seem necessary here, since Python will not cast data to a Java object.
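To illustrate the point above, here is a minimal sketch of the enum with the Java class entry dropped. The `INTEGER` member, its id, and the `python_class` helper property are assumptions added for illustration; they are not taken from the PR.

```python
from enum import Enum, unique


@unique
class TypeID(Enum):
    # Only the Python class and a numeric id are kept; no "java_class" entry.
    BOOLEAN = {"python_class": bool, "id": 1}
    # Hypothetical second member, shown only to make the sketch less trivial.
    INTEGER = {"python_class": int, "id": 2}

    @property
    def python_class(self) -> type:
        """Return the Python class recorded for this type id."""
        return self.value["python_class"]


boolean_class = TypeID.BOOLEAN.python_class
```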

rdblue · 2021-10-06T15:49:02Z

python/src/iceberg/types.py

+    def as_list_type(self):
+        raise ValueError("Not a list type: " + self)
+
+    def asMapType(self):


Looks like this name wasn't updated to snake_case.

Yep, that is from the legacy library; I will clean those up.

rdblue · 2021-10-06T15:49:51Z

python/src/iceberg/types.py

+class Type(object):
+    length: int
+    scale: int
+    precision: int


Do these need to be here or on the specific types, fixed and decimal?

Agreed, they should not be there for all types.

rdblue · 2021-10-06T15:51:56Z

python/src/iceberg/types.py

+        return type(self) == type(other)
+
+    def __ne__(self, other):
+        return not self.__eq__(other)


This isn't the default implementation?

It doesn't seem to be exactly the same.
Based on https://docs.python.org/3/reference/datamodel.html#object.__ne__

By default, object implements __eq__() by using is, returning NotImplemented in the case of a false comparison: True if x is y else NotImplemented. For __ne__(), by default it delegates to __eq__() and inverts the result unless it is NotImplemented.

The default will return NotImplemented if the comparison is false.

Are you sure? This sounds like what you're doing here:

For __ne__(), by default it delegates to __eq__() and inverts the result unless it is NotImplemented.

+1 I think it's the same. "returns NotImplemented when false" is describing the __eq__ behavior but once you implement that you get __ne__ for free.

What I meant is that the default one in Python 3 is slightly different from this one, which comes from the python_legacy module.
The python_legacy implementation only works if type(self) implements __eq__.
The default one in Python 3 is better and can handle other cases. IIUC, it works like

    def __ne__(self, other):
        result = self.__eq__(other)
        if result is NotImplemented:
            return NotImplemented
        return not result

So we should remove __ne__ here if we want to support !=.
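The behavior discussed in this thread can be checked with a small sketch. The class names `WithEq` and `Plain` are made up for illustration and are not part of the PR; the point is only that defining `__eq__` is enough for `!=` to work in Python 3.

```python
class WithEq:
    """Defines only __eq__; Python 3 derives != from it."""

    def __eq__(self, other):
        if not isinstance(other, WithEq):
            # Signal "don't know" so Python can try the other operand.
            return NotImplemented
        return True


class Plain:
    """Defines neither method; falls back to identity comparison."""


a, b = WithEq(), WithEq()
ne_same_type = a != b          # default __ne__ inverts __eq__
ne_other_type = a != Plain()   # both sides return NotImplemented, so identity decides
```

When both operands return NotImplemented, Python falls back to comparing identities, so `!=` between unrelated objects still yields a plain boolean rather than an error.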

rdblue · 2021-10-06T15:53:38Z

python/src/iceberg/types.py

+        return False
+
+    def as_primitive_type(self):
+        raise ValueError("Not a primitive type: " + self)


I think these as_ and is_ methods are valuable if we have type annotations. Is the plan to add them later?

I haven't added them yet and might add them later if needed.
In Python, as_type usually performs a real data transformation, so I'm not sure a Java-style as_type method (one that just returns the concrete class type) is helpful here.
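As a rough sketch of what annotated `is_`/`as_` methods could look like (not the PR's final code), the return annotation lets a type checker treat the result as the concrete class. The sketch also formats the error message with an f-string, on the assumption that concatenating a `str` with the object itself, as in the hunk above, would raise a `TypeError` before the intended `ValueError`.

```python
class Type:
    def is_list_type(self) -> bool:
        return False

    def as_list_type(self) -> "ListType":
        # f-string formatting avoids str + object concatenation.
        raise ValueError(f"Not a list type: {self!r}")


class ListType(Type):
    def is_list_type(self) -> bool:
        return True

    def as_list_type(self) -> "ListType":
        # No data conversion happens; this only narrows the static type.
        return self


list_type = ListType()
narrowed = list_type.as_list_type()
```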

pdames · 2021-10-15T07:27:24Z

python/src/iceberg/types.py

+        self._precision = precision
+        self._scale = scale
+
+    def precision(self):


Should precision and scale also have @property decorators? Any plans to add return type annotations to all properties (e.g., def precision(self) -> int:)?

+1 for @property and annotations.
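A minimal sketch of the suggestion, assuming a standalone `DecimalType` without the PR's `Type` base class: `precision` and `scale` become read-only properties with `int` return annotations.

```python
class DecimalType:
    def __init__(self, precision: int, scale: int):
        self._precision = precision
        self._scale = scale

    @property
    def precision(self) -> int:
        """Total number of digits."""
        return self._precision

    @property
    def scale(self) -> int:
        """Number of digits to the right of the decimal point."""
        return self._scale


decimal_type = DecimalType(9, 2)
```

Callers then read `decimal_type.precision` as an attribute, and because no setter is defined the value cannot be reassigned by accident.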

rdblue · 2021-10-15T15:04:18Z

python/src/iceberg/types.py

+
+class FixedType(Type):
+    def __init__(self, length: int):
+        super().__init__(f"fixed[{length}]", f"FixedType[{length}]", is_primitive=True)


The repr string should be something that you can paste into Python to re-create the object, so this should produce FixedType({length}). That is, it should use parens instead of square brackets.

Thanks for the comment; updated accordingly.
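The convention described above can be sketched as follows. This is a simplified, standalone `FixedType` (no base class, and `__eq__` is an assumption added so the round trip can be compared), not the code that was merged.

```python
class FixedType:
    def __init__(self, length: int):
        self._length = length

    @property
    def length(self) -> int:
        return self._length

    def __str__(self) -> str:
        # Human-readable type string.
        return f"fixed[{self._length}]"

    def __repr__(self) -> str:
        # Valid Python that re-creates the object when pasted or eval'd.
        return f"FixedType({self._length})"

    def __eq__(self, other):
        if not isinstance(other, FixedType):
            return NotImplemented
        return self._length == other._length


fixed = FixedType(16)
round_tripped = eval(repr(fixed))
```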

rdblue · 2021-10-15T15:05:39Z

python/src/iceberg/types.py

+        return self._is_primitive
+
+
+class FixedType(Type):


I think this needs a @property method for length.

rdblue · 2021-10-15T15:09:02Z

python/src/iceberg/types.py

+    def type(self):
+        return self._type
+
+    def __repr__(self):


The implementation here should be the one for __str__. The __repr__ implementation should produce a string that is the Python representation.

Yep, updated.

What do you think of having the named arguments in the repr?

    return (f"NestedField(is_optional={self._is_optional}, field_id={self._id}, "
            f"name={repr(self._name)}, field_type={repr(self._type)}, doc={repr(self._doc)})")

I'm all for it!

SG, adding them in #3350
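For reference, a self-contained sketch of a keyword-argument repr that still round-trips through `eval`. `StringType` here is a stand-in written for this example, and the `__eq__` methods are assumptions added so the result can be compared; neither is copied from the PR.

```python
from typing import Optional


class StringType:
    """Stand-in primitive type used only for this example."""

    def __repr__(self) -> str:
        return "StringType()"

    def __eq__(self, other):
        return isinstance(other, StringType)


class NestedField:
    def __init__(self, is_optional: bool, field_id: int, name: str,
                 field_type: object, doc: Optional[str] = None):
        self._is_optional = is_optional
        self._id = field_id
        self._name = name
        self._type = field_type
        self._doc = doc

    def __repr__(self) -> str:
        # Keyword names match the constructor, so the output is valid Python.
        return (f"NestedField(is_optional={self._is_optional}, field_id={self._id}, "
                f"name={self._name!r}, field_type={self._type!r}, doc={self._doc!r})")

    def __eq__(self, other):
        if not isinstance(other, NestedField):
            return NotImplemented
        return vars(self) == vars(other)


field = NestedField(False, 1, "id", StringType())
field_copy = eval(repr(field))
```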

rdblue · 2021-10-15T15:16:17Z

python/src/iceberg/types.py

+        super().__init__(f"map<{key_field.type}, {value_field.type}>",
+                         f"MapType<{key_field.type}, {value_field.type}>")
+        self._key_field = key_field
+        self._value_field = value_field


Accessor methods for key and value?

Yep, updated.

jun-he · 2021-10-21T05:33:20Z

@rdblue @pdames @samredai I updated the PR and added tests. Can you take another look? Thanks.

samredai · 2021-10-21T21:29:57Z

python/src/iceberg/types.py

+    def __init__(self, key_field: NestedField, value_field: NestedField):
+        super().__init__(f"map<{key_field.type}, {value_field.type}>",
+                         f"MapType({repr(key_field)}, {repr(value_field)})")
+        self._key_field = key_field


Small nit: can this argument just be key, so that this line becomes self._key = key? Same for value_field and element_field?

Yeah, I think key, value, and element are probably better names for the constructor args.

SG, updating the argument names in #3350
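A rough sketch of the renamed constructor arguments together with the accessors requested earlier. Plain strings stand in for the `NestedField` objects, and the keyword-style repr is an assumption for illustration, not the merged implementation.

```python
class MapType:
    def __init__(self, key, value):
        # Shorter argument names map directly onto the private attributes.
        self._key = key
        self._value = value

    @property
    def key(self):
        return self._key

    @property
    def value(self):
        return self._value

    def __repr__(self) -> str:
        return f"MapType(key={self._key!r}, value={self._value!r})"


# Strings are placeholders for the key and value fields.
map_type = MapType(key="k", value="v")
```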

samredai · 2021-10-21T21:33:33Z

Should we add the auto formatting stuff in this PR to prevent having to do a huge auto format later? (It's also fine to do in a follow up PR). I was thinking something like:

[testenv:format]
description = reformat all source code
basepython = python3
deps =
    black
    isort
    flake8
skip_install = true
commands =
    isort --recursive --project iceberg --profile black setup.py src tests
    black setup.py src tests
    flake8 setup.py src tests

[testenv:linters]
basepython = python3
skip_install = true
deps =
    .
    {[testenv:isort]deps}
    {[testenv:black]deps}
    {[testenv:flake8]deps}
    {[testenv:bandit]deps}
    {[testenv:mypy]deps}
commands =
    {[testenv:isort]commands}
    {[testenv:black]commands}
    {[testenv:flake8]commands}
    {[testenv:bandit]commands}
    {[testenv:mypy]commands}

[testenv:isort]
basepython = python3
skip_install = true
deps =
    isort
commands =
    isort --recursive --project iceberg --profile black --check-only setup.py src tests

[testenv:black]
basepython = python3
skip_install = true
deps =
    black
commands =
    black --check --diff src setup.py tests

rdblue · 2021-10-21T22:31:16Z

@samredai, if we can format the code here, let's do that. But I wouldn't update the formatting config in this PR, since that's a separate change.

rdblue · 2021-10-21T22:32:16Z

python/src/iceberg/types.py

+
+
+class NestedField(object):
+    def __init__(self, is_optional: bool, field_id: int, name: str, field_type: Type, doc=None):


Should doc have a type annotation?

Updated it with a str annotation in #3350.

rdblue · 2021-10-21T22:36:12Z

python/tests/test_types.py

+                         [BooleanType, IntegerType, LongType, FloatType, DoubleType, DateType, TimeType,
+                          TimestampType, TimestamptzType, StringType, UUIDType, BinaryType])
+def test_repr_primitive_types(input_type):
+    assert input_type == eval(repr(input_type))


rdblue · 2021-10-21T22:36:42Z

python/tests/test_types.py

+
+@pytest.mark.parametrize("input_type",
+                         [BooleanType, IntegerType, LongType, FloatType, DoubleType, DateType, TimeType,
+                          TimestampType, TimestamptzType, StringType, UUIDType, BinaryType])


Could we add struct, map, and list cases, too?

Those are covered by individual tests later in the file, which make additional assertions on top of the eval of repr.

rdblue · 2021-10-21T22:38:38Z

Nice work, @jun-he! There are only nits left, so I'm going to commit this to keep it moving. Thanks!

jun-he · 2021-10-22T06:36:28Z

@samredai that's a great idea to have the formatting. We have an issue (#3282) for that. I agree that we should have it ASAP.

Support all iceberg types in python library

2b165a1

github-actions bot added the python label Oct 6, 2021

rdblue reviewed Oct 6, 2021

rewrite types to make it shorter and more pythonic

2ce2078

pdames reviewed Oct 15, 2021

rdblue reviewed Oct 15, 2021

rdblue reviewed Oct 15, 2021

rdblue reviewed Oct 15, 2021

address the comments and add unit tests

064c6ec

jun-he marked this pull request as ready for review October 21, 2021 04:40

jun-he requested a review from rdblue October 21, 2021 05:33

samredai reviewed Oct 21, 2021

rdblue reviewed Oct 21, 2021

rdblue approved these changes Oct 21, 2021

rdblue merged commit 9f47e15 into apache:master Oct 21, 2021

nssalian mentioned this pull request Oct 22, 2021

[Python] support partition spec in iceberg python library #3228

Closed

jun-he added a commit to jun-he/incubator-iceberg that referenced this pull request Oct 22, 2021

Improve types classes based on the comment in the PR apache#3234.

0a74569

jun-he mentioned this pull request Oct 22, 2021

[Python] Improve types classes based on the comment in the PR #3234. #3350

Merged

jun-he deleted the jun/add-types branch October 22, 2021 06:58

rdblue pushed a commit that referenced this pull request Oct 22, 2021

Python: Minor changes to types classes, follow up to #3234 (#3350)

84b7d05


