Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: Deprecate ContentFile#path API and add location API which returns String #11092

Merged

Conversation

amogh-jahagirdar
Copy link
Contributor

ContentFile#path returns CharSequence instead of String. The original intention of this API returning CharSequence was to directly surface Avro UTF8 without any conversion, however for distributed cases the CharSequence needs to be converted into String since CharSequence itself isn't serializable. Furthermore, in terms of naming we tend to use location in other places like manifestListLocation or metadataLocation, also in FileIO. This change performs the following:

  1. Deprecate path on ContentFile since callers need to convert the CharSequence to String for the majority of cases. This would be removed from the interface in 2.0
  2. Add a String location() API to ContentFile which logically returns the same filePath but it will return String instead. An initial default implementation of location has been provided.

Over time up until 2.0 we can remove references to path and use location. After 2.0 implementations of ContentFile would need to implement location.

Copy link
Contributor

@flyrain flyrain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be nice to keep the name, but overloading could not apply to the return type.

CharSequence path();

/** Return the fully qualified path to the file, suitable for constructing a Hadoop Path */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder whether we still want to keep the Hadoop Path reference here as it quite evolved over the years and we do not necessarily depend on Hadoop anymore.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I was largely concerned if we didn't preserve this property of being suitable for a hadoop path at the API level, once path is deprecated, it may break the users that do rely on having this. But if we're saying that we don't really need to concern ourselves on maintaining this, we may not need to include it. cc @rdblue thoughts on this?

Copy link
Contributor Author

@amogh-jahagirdar amogh-jahagirdar Sep 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed with @rdblue @danielcweeks and here are some key points on why we shouldn't document anything related to hadoop:

  1. Hadoop pathing doesn't work with storage systems like GCS
  2. Locations are a super set of what can be represented in a hadoop path. Even today path API can return paths which simply cannot be represented in a hadoop Path.
  3. The hadoop path normalization logic simply conflicts with absolute paths

So updated to remove this part of the comment. I'll leave the existing path API comment as is since it's being deprecated anyways.

Copy link
Contributor

@aokolnychyi aokolnychyi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I support this change. Internally, we always use String objects for serializability and most places that consume have explicit conversion to String already.

@amogh-jahagirdar amogh-jahagirdar force-pushed the deprecate-contentfile-path branch from 95b75b5 to 963d321 Compare September 20, 2024 22:16
@amogh-jahagirdar amogh-jahagirdar force-pushed the deprecate-contentfile-path branch from 963d321 to ea7d2ec Compare September 20, 2024 23:54
@amogh-jahagirdar
Copy link
Contributor Author

THanks for reviewing @flyrain @nastra @aokolnychyi , I'll go ahead and merge

@amogh-jahagirdar amogh-jahagirdar merged commit 257bad2 into apache:main Sep 23, 2024
50 checks passed
zachdisc pushed a commit to zachdisc/iceberg that referenced this pull request Dec 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants