Merge branch 'main' into fix/expose_writer_rust_to_python_v2
ion-elgreco authored Nov 29, 2023
2 parents b4f9695 + 733b5ff commit 50e7257
Showing 6 changed files with 412 additions and 4 deletions.
75 changes: 75 additions & 0 deletions CONTRIBUTING.md
@@ -11,3 +11,78 @@ If you want to contribute something more substantial, see our "Projects seeking
## Claiming an issue

If you want to claim an issue to work on, you can write the word `take` as a comment in it and you will be automatically assigned.

## Quick start

- Install Rust, e.g. as described [here](https://doc.rust-lang.org/cargo/getting-started/installation.html)
- Have a compatible Python version installed (check `python/pyproject.toml` for current requirement)
- Create a Python virtual environment (required for development builds), e.g. as described [here](https://packaging.python.org/en/latest/guides/installing-using-pip-and-virtual-environments/)
- Build the project for development (this requires an active virtual environment and will also install `deltalake` in that virtual environment)
```
cd python
make develop
```

- Run some Python code, e.g. to run a specific test
```
python -m pytest tests/test_writer.py -s -k "test_with_deltalake_schema"
```

- Run some Rust code, e.g. run an example
```
cd crates/deltalake
cargo run --example basic_operations
```
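
Because `make develop` requires an active virtual environment, it can help to confirm one is active before building. The following is a minimal sketch, assuming a standard `venv` or recent `virtualenv` setup; the helper name `in_virtualenv` is hypothetical and not part of this project.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter is running inside a virtual environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print("virtual environment active:", in_virtualenv())
```

If this prints `False`, activate the virtual environment first and then rerun `make develop`.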

## Run the docs locally
*This serves your local contents of docs via a web browser, which is handy for checking what they look like if you are making changes to docs or docstrings*
```
(cd python; make develop)
pip install -r docs/requirements.txt
mkdocs serve
```

## To make a pull request (PR)
- Make sure all the following steps run/pass locally before submitting a PR
```
cargo fmt -- --check
cd python
make check-rust
make check-python
make develop
make unit-test
make build-docs
```
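
If you run these checks often, they can be wrapped in a small script. This is only a sketch, assuming it is started from the repository root and that `cargo` and `make` are on your `PATH`; the names `CHECKS` and `run_checks` are hypothetical and not part of the project.

```python
import subprocess
from pathlib import Path

# (working directory relative to the repository root, command) pairs,
# in the same order as the checklist above.
CHECKS = [
    (".", ["cargo", "fmt", "--", "--check"]),
    ("python", ["make", "check-rust"]),
    ("python", ["make", "check-python"]),
    ("python", ["make", "develop"]),
    ("python", ["make", "unit-test"]),
    ("python", ["make", "build-docs"]),
]


def run_checks(repo_root: str = ".") -> None:
    """Run each check in order and stop at the first failure."""
    for workdir, command in CHECKS:
        print("running:", " ".join(command), "in", workdir)
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run(command, cwd=Path(repo_root) / workdir, check=True)
```

Calling `run_checks()` stops at the first failing step, so the last command printed is the one to fix.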

## Developing in VSCode

*These are just some basic steps/components to get you started; there are many other very useful extensions for VSCode*

- For a better Rust development experience, install the [Rust extension](https://marketplace.visualstudio.com/items?itemName=1YiB.rust-bundle)
- For debugging Rust code, install [CodeLLDB](https://marketplace.visualstudio.com/items?itemName=vadimcn.vscode-lldb). If you allow it, the extension will even create debug launch configurations for the project, which is an easy way to get started. Just set a breakpoint and run the relevant configuration.
- For debugging from Python into Rust, follow this procedure:
1. Add this to `.vscode/launch.json`
```
{
"type": "lldb",
"request": "attach",
"name": "LLDB Attach to Python",
"program": "${command:python.interpreterPath}",
"pid": "${command:pickMyProcess}",
"args": [],
"stopOnEntry": false,
"environment": [],
"externalConsole": true,
"MIMode": "lldb",
"cwd": "${workspaceFolder}"
}
```
2. Add a `breakpoint()` statement somewhere in your Python code (main function or at any point in Python code you know will be executed when you run it)
3. Add a breakpoint in Rust code in VSCode editor where you want to drop into the debugger
4. Run the relevant Python function in your terminal; execution should drop into the Python debugger, showing the `PDB` prompt
5. Run the following in that prompt to get the Python process ID: `import os; os.getpid()`
6. Run the `LLDB Attach to Python` configuration from the `Run and Debug` panel of VSCode. This will prompt you for a process ID to attach to; enter the Python process ID obtained earlier (it will also be in the dropdown, but that dropdown will list many process IDs)
7. LLDB may take a couple of seconds to attach to the process
8. When the debugger is attached to the process (you will notice the debugger panels fill with extra info), enter `c`+Enter in the `PDB` prompt in your terminal - execution should continue until the breakpoint in Rust code is hit. From this point on it's a standard debugging process.
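
As an optional shortcut for steps 2 and 5, the script being debugged can print its own process ID just before pausing. This is a minimal sketch; the helper name `announce_pid` is hypothetical, and the `breakpoint()` call is left commented out so the snippet runs without blocking.

```python
import os


def announce_pid() -> int:
    """Print and return the current process ID for the LLDB attach prompt."""
    pid = os.getpid()
    print(f"Attach LLDB to Python process ID: {pid}")
    return pid


pid = announce_pid()
# breakpoint()  # uncomment to pause in PDB here while LLDB attaches
```

With the `breakpoint()` line enabled, the printed ID is the value to enter when VSCode prompts for a process to attach to.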


36 changes: 32 additions & 4 deletions docs/index.md
@@ -1,7 +1,35 @@
# Python deltalake package
# The deltalake package

This is the documentation for the native Python implementation of Delta Lake. It is based on the delta-rs Rust library and requires no Spark or JVM dependencies. For the PySpark implementation, see [delta-spark](https://docs.delta.io/latest/api/python/index.html) instead.
This is the documentation for the native Rust/Python implementation of Delta Lake. It is based on the delta-rs Rust library and requires no Spark or JVM dependencies. For the PySpark implementation, see [delta-spark](https://docs.delta.io/latest/api/python/index.html) instead.

This module provides the capability to read, write, and manage [Delta Lake](https://delta.io/) tables from Python without Spark or Java. It uses [Apache Arrow](https://arrow.apache.org/) under the hood, so is compatible with other Arrow-native or integrated libraries such as [Pandas](https://pandas.pydata.org/), [DuckDB](https://duckdb.org/), and [Polars](https://www.pola.rs/).
This module provides the capability to read, write, and manage [Delta Lake](https://delta.io/) tables with Python or Rust without Spark or Java. It uses [Apache Arrow](https://arrow.apache.org/) under the hood, so is compatible with other Arrow-native or integrated libraries such as [pandas](https://pandas.pydata.org/), [DuckDB](https://duckdb.org/), and [Polars](https://www.pola.rs/).

Note: This module is under active development and some features are experimental. It is not yet as feature-complete as the PySpark implementation of Delta Lake. If you encounter a bug, please let us know in our [GitHub repo](https://github.com/delta-io/delta-rs/issues).
## Important terminology

* "Rust deltalake" refers to the Rust API of delta-rs (no Spark dependency)
* "Python deltalake" refers to the Python API of delta-rs (no Spark dependency)
* "Delta Spark" refers to the Scala implementation of the Delta Lake transaction log protocol. This depends on Spark and Java.

## Why implement the Delta Lake transaction log protocol in Rust and Scala?

Delta Spark depends on Java and Spark, which is fine for many use cases, but not all Delta Lake users want to depend on these libraries. delta-rs allows using Delta Lake in Rust or other native projects, where running a JVM is often not an option.

Python deltalake lets you query Delta tables without depending on Java/Scala.

Suppose you want to query a Delta table with pandas on your local machine. Python deltalake makes it easy to query the table with a simple `pip install` command - no need to install Java.
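
As a rough illustration, reading a table into pandas can look like the sketch below. It assumes `deltalake` and `pandas` are installed (for example via `pip install deltalake pandas`); the wrapper name `read_delta_as_pandas` is hypothetical, while `DeltaTable` and `to_pandas()` come from the Python deltalake package.

```python
def read_delta_as_pandas(table_path: str):
    """Load a Delta table into a pandas DataFrame."""
    # Imported lazily so the function can be defined even where
    # deltalake is not installed yet.
    from deltalake import DeltaTable

    # DeltaTable opens the table at the given path or URI;
    # to_pandas() reads its current version into a DataFrame.
    return DeltaTable(table_path).to_pandas()
```

For example, `df = read_delta_as_pandas("path/to/delta-table")` would load the table, where the path is a placeholder for a Delta table on your machine.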

## Contributing

The Delta Lake community welcomes contributions from all developers, regardless of experience or programming background.

You can write Rust code, Python code, or documentation, submit bug reports, or give talks to the community. We welcome all of these contributions.

Feel free to [join our Slack](https://go.delta.io/slack) and message us in the #delta-rs channel any time!

We value kind communication and building a productive, friendly environment for maximum collaboration and fun.

## Project history

Check out this video by Denny Lee & QP Hou to learn about the genesis of the delta-rs project:

<iframe width="560" height="315" src="https://www.youtube.com/embed/ZQdEdifcBh8?si=ytGW7FB-kwl6VqsV" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>