Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support all iceberg types in python library #3234

Merged
merged 3 commits into from
Oct 21, 2021
Merged

Conversation

jun-he
Copy link
Collaborator

@jun-he jun-he commented Oct 6, 2021

For #3216
Note: I collapse type.py and types.py in the python_legacy to the new library, which is similar to existing Java implementation. Please take a look at it. We can use it as an example to discuss how to move towards for pythonic refactoring.

@github-actions github-actions bot added the python label Oct 6, 2021

@unique
class TypeID(Enum):
BOOLEAN = {"java_class": "Boolean.class", "python_class": bool, "id": 1}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need the Java class?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is used in Java library for partition_spec's lazyJavaClasses or object cast in Accessors.
Seems no need here as python will not cast data to a Java object.

def as_list_type(self):
raise ValueError("Not a list type: " + self)

def asMapType(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this name wasn't updated to snake_case.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, that is from the legacy library and will clean up those.

class Type(object):
length: int
scale: int
precision: int
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do these need to be here or on the specific types, fixed and decimal?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree and they should not be there for all types.

return type(self) == type(other)

def __ne__(self, other):
return not self.__eq__(other)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't the default implementation?

Copy link
Collaborator Author

@jun-he jun-he Oct 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems not completely same.
Based on https://docs.python.org/3/reference/datamodel.html#object.__ne__

By default, object implements __eq__() by using is, returning NotImplemented in the case of a false comparison: True if x is y else NotImplemented. For __ne__(), by default it delegates to __eq__() and inverts the result unless it is NotImplemented. 

default will throw NotImplemented if false.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure? This sounds like what you're doing here:

For ne(), by default it delegates to eq() and inverts the result unless it is NotImplemented.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 I think it's the same. "returns NotImplemented when false" is describing the __eq__ behavior but once you implement that you get __ne__ for free.

Copy link
Collaborator Author

@jun-he jun-he Oct 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I meant is that the default one in python3 is slightly different from this one in the python legacy module.
This implementation in python_legacy only works if type(self) implements __eq__.
The default one in python 3 is better and can handle other cases. IIUC, it works like

def __ne__(self, other):
    result = self.__eq__(other)
    if result is NotImplemented:
        return NotImplemented
    return not result

So we should remove __ne__ here if we want to support !=.

return False

def as_primitive_type(self):
raise ValueError("Not a primitive type: " + self)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these as_ and is_ methods are valuable if we have type annotations. Is the plan to add them later?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, I didn't add it and might add it later if needed.
Usually, as_type is used to do the real data transform in python and not sure a Java as_type kind of method (just return the concrete class type) is helpful in python.

self._precision = precision
self._scale = scale

def precision(self):
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should precision and scale also have @property decorators? Any plans to add return type annotations to all properties (e.g def precision(self) -> int:)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for @property and annotations.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, SG.


class FixedType(Type):
def __init__(self, length: int):
super().__init__(f"fixed[{length}]", f"FixedType[{length}]", is_primitive=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The repr string should be something that you can paste into Python to re-create the object, so this should produce FixedType({length}). That is, it should use parens instead of square brackets.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comment and updated accordingly.

return self._is_primitive


class FixedType(Type):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this needs a @property method for length.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

def type(self):
return self._type

def __repr__(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation here should be the one for __str__. The __repr__ implementation should produce a string that is the Python representation.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, updated.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do you think of having the named arguments in the repr?

        return (f"NestedField(is_optional={self._is_optional}, field_id={self._id}, "
                f"name={repr(self._name)}, field_type={repr(self._type)}, doc={repr(self._doc)})")

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm all for it!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SG, adding them in #3350

super().__init__(f"map<{key_field.type}, {value_field.type}>",
f"MapType<{key_field.type}, {value_field.type}>")
self._key_field = key_field
self._value_field = value_field
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Accessor methods for key and value?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, updated.

@jun-he jun-he marked this pull request as ready for review October 21, 2021 04:40
@jun-he
Copy link
Collaborator Author

jun-he commented Oct 21, 2021

@rdblue @pdames @samredai I updated the PR and added tests. Can you take another look? Thanks.

@jun-he jun-he requested a review from rdblue October 21, 2021 05:33
def __init__(self, key_field: NestedField, value_field: NestedField):
super().__init__(f"map<{key_field.type}, {value_field.type}>",
f"MapType({repr(key_field)}, {repr(value_field)})")
self._key_field = key_field
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small nit: Can this just argument just be key and then this line can be self._key = key? Same for value_field and element_field?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think key, value, and element are probably better names for the constructor args.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SG, updating the argument names in #3350

@samredai
Copy link
Contributor

Should we add the auto formatting stuff in this PR to prevent having to do a huge auto format later? (It's also fine to do in a follow up PR). I was thinking something like:

[testenv:format]
description = reformat all source code
basepython = python3
deps =
    black
    isort
    flake8
skip_install = true
commands =
    isort --recursive --project iceberg --profile black setup.py src tests
    black setup.py src tests
    flake8 setup.py src tests

[testenv:linters]
basepython = python3
skip_install = true
deps =
    .
    {[testenv:isort]deps}
    {[testenv:black]deps}
    {[testenv:flake8]deps}
    {[testenv:bandit]deps}
    {[testenv:mypy]deps}
commands =
    {[testenv:isort]deps}
    {[testenv:black]deps}
    {[testenv:flake8]commands}
    {[testenv:bandit]commands}
    {[testenv:mypy]commands}

[testenv:isort]
basepython = python3
skip_install = true
deps =
    isort
commands =
    isort --recursive --project iceberg --profile black --check-only setup.py src tests

[testenv:black]
basepython = python3
skip_install = true
deps =
    black
commands =
    black --check --diff src setup.py tests

@rdblue
Copy link
Contributor

rdblue commented Oct 21, 2021

@samredai, if we can format the code here, let's do that. But I wouldn't say let's update the config for formatting, since that's a separate change.



class NestedField(object):
def __init__(self, is_optional: bool, field_id: int, name: str, field_type: Type, doc=None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should doc have a type annotation?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated it with str annotation in #3350

[BooleanType, IntegerType, LongType, FloatType, DoubleType, DateType, TimeType,
TimestampType, TimestamptzType, StringType, UUIDType, BinaryType])
def test_repr_primitive_types(input_type):
assert input_type == eval(repr(input_type))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!


@pytest.mark.parametrize("input_type",
[BooleanType, IntegerType, LongType, FloatType, DoubleType, DateType, TimeType,
TimestampType, TimestamptzType, StringType, UUIDType, BinaryType])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add struct, map, and list cases, too?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those are tested in individual tests later with additional asserts in addition to eval of repr.

@rdblue rdblue merged commit 9f47e15 into apache:master Oct 21, 2021
@rdblue
Copy link
Contributor

rdblue commented Oct 21, 2021

Nice work, @jun-he! There are only nits left, so I'm going to commit this to keep it moving. Thanks!

@jun-he
Copy link
Collaborator Author

jun-he commented Oct 22, 2021

@samredai that's a great idea to have the formatting. We have an issue (#3282) for that. I agree that we should have it ASAP.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants