[Kernel] Add examples for Delta Kernel API usage #1926

Merged (4 commits) on Aug 2, 2023

Conversation

vkorukanti
Collaborator

@vkorukanti vkorukanti commented Jul 21, 2023

Description

Adds an example project that shows how to read a Delta table using the Kernel APIs. The sample programs can also be used as command-line tools to read a Delta table.

Single threaded reader

java io.delta.kernel.examples.SingleThreadedTableReader \
    --table=file:<repo-dir>/connectors/golden-tables/src/main/resources/golden/data-reader-primitives \
    --columns=as_int,as_long \
    --limit=5

              as_int|             as_long
                null|                null
                   0|                   0
                   1|                   1
                   2|                   2
                   3|                   3

Multi-threaded reader (simulating a distributed execution environment)

java io.delta.kernel.examples.MultiThreadedTableReader \
    --table=file:<repo-dir>/connectors/golden-tables/src/main/resources/golden/data-reader-primitives \
    --columns=as_int,as_long \
    --limit=20 \
    --parallelism=5

              as_int|             as_long
                null|                null
                   0|                   0
                   1|                   1
                   2|                   2
                   3|                   3

How was this patch tested?

Manual testing

Usage: java io.delta.kernel.examples.SingleThreadedTableReader [-c <arg>] [-l <arg>] -t <arg>
    -c,--columns <arg>   Comma separated list of columns to read from the table. Ex. --columns=id,name,address
    -l,--limit <arg>     Maximum number of rows to read from the table (default 20).
    -t,--table <arg>     Fully qualified table path

Usage: java io.delta.kernel.examples.MultiThreadedTableReader [-c <arg>] [-l <arg>] [-p <arg>] -t <arg>
    -c,--columns <arg>       Comma separated list of columns to read from the table. Ex. --columns=id,name,address
    -l,--limit <arg>         Maximum number of rows to read from the table (default 20).
    -p,--parallelism <arg>   Number of parallel readers to use (default 3).
    -t,--table <arg>         Fully qualified table path

@vkorukanti vkorukanti force-pushed the kernel-examples branch 2 times, most recently from d59897b to 1d0c173 Compare July 21, 2023 17:27
@vkorukanti vkorukanti changed the title from "[Kernel][WIP] Add examples for Delta Kernel API usage" to "[Kernel] Add examples for Delta Kernel API usage" Jul 28, 2023

<groupId>org.example</groupId>
<artifactId>table-reader</artifactId>
<version>3.0.0-SNAPSHOT</version>
Contributor

Should we be hardcoding this? Also, why does this need to be the same version as the actual Delta artifacts that we will publish? Unless there is a need to be consistent, this is only going to cause more headaches to keep it updated.

Collaborator Author

Updated the version to 0.1-SNAPSHOT. Let me know if there is a better version name.

@@ -0,0 +1,49 @@
##
Contributor

license.

## tables located in the <repo-root>/connectors/golden-tables/src/main/resources/golden
## directory.
##
## Make sure to run this script from <repo-root> in order for the relative
Contributor

why not make this script be runnable from anywhere?

Contributor

this sort of lack of flexibility causes future pain.

Contributor

Also, that's why writing such scripts in Python gives a lot more flexibility and an entire library of tools to do this sort of stuff. There are a lot of Python scripts in this repo (e.g., run-integration-tests.py) that you can steal from.

Collaborator Author

Rewrote this in python (by copying the code from delta integration test).
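
One way to address the run-from-anywhere concern above is to resolve paths relative to the script file instead of the current working directory. The sketch below is illustrative only and is not the script added in this PR: the helper names `find_repo_root` and `golden_table_path` are hypothetical, and it assumes the repo root is the nearest ancestor directory that contains `connectors/`.

```python
from pathlib import Path


def find_repo_root(start: Path, marker: str = "connectors") -> Path:
    """Walk upward from `start` until a directory containing `marker` is found.

    Assumption: the repo root is the nearest ancestor that has a
    `connectors/` subdirectory. Adjust `marker` for a different layout.
    """
    for candidate in [start, *start.parents]:
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"no ancestor of {start} contains {marker!r}")


def golden_table_path(script_file: str, table_name: str) -> Path:
    """Build an absolute golden-table path that does not depend on the cwd."""
    # Anchor on the script's own location, not on os.getcwd().
    root = find_repo_root(Path(script_file).resolve().parent)
    return (root / "connectors" / "golden-tables" / "src" / "main"
            / "resources" / "golden" / table_name)
```

Inside a script this would be called as `golden_table_path(__file__, "data-reader-primitives")`, so the result is the same no matter which directory the script is launched from.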

-Dstaging.repo.url=${EXTRA_MAVEN_REPO:-"___"} \
-Ddelta-kernel.version=${STANDALONE_VERSION:-"3.0.0-SNAPSHOT"} \
-Dexec.args="${test}"
done
Contributor

@tdas tdas Aug 1, 2023

Have you tested that this script fails (exits nonzero) when compilation or a test in the examples fails?
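
For the exit-code question above, here is a minimal sketch of how a Python runner could propagate failures. It is not the script in this PR: `run_all` is a hypothetical helper, and the commands passed to it stand in for the real Maven invocations.

```python
import subprocess
import sys
from typing import List, Sequence


def run_all(commands: Sequence[List[str]]) -> int:
    """Run each command in order and stop at the first failure.

    Returns 0 if every command succeeded, otherwise the exit code of the
    first command that failed.
    """
    for cmd in commands:
        result = subprocess.run(cmd)
        if result.returncode != 0:
            # Report which command failed so the log is easy to scan.
            print(f"FAILED ({result.returncode}): {' '.join(cmd)}", file=sys.stderr)
            return result.returncode
    return 0
```

A runner would end with `sys.exit(run_all(commands))`, so a compilation or test failure in any example makes the whole script exit nonzero.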

StructType readSchema,
Snapshot snapshot,
int maxRowCount) throws IOException
{
Contributor

we are mixing styles all over the place.. even within this function. can we stick to the following since we are all familiar with it (consistent with all the scala code).

try {
...
}

Collaborator Author

This follows our Java checkstyle guide and is how the Kernel Java code is written. Let me know if this needs to be changed.

import io.delta.kernel.data.DefaultJsonRow;
import io.delta.kernel.data.Row;
import io.delta.kernel.internal.types.TableSchemaSerDe;
import io.delta.kernel.types.ArrayType;
Contributor

nit: can't this be just types.*

Collaborator Author

Again, this is following the Java checkstyle. Also, using wildcards in imports is not recommended.

* @return
* @throws Exception
*/
private void readSnapshot(
Contributor

readData

snapshot is ambiguous. snapshot metadata or snapshot data?

Collaborator Author

ok, changed to readData

@tdas
Contributor

tdas commented Aug 2, 2023

approved, but I left some comments.
