feat: Updated pysparkler/README.md
sweep-ai[bot] authored Jan 10, 2024
1 parent 8686e5d commit 8b2da07
Showing 1 changed file with 116 additions and 1 deletion.
117 changes: 116 additions & 1 deletion pysparkler/README.md
@@ -7,6 +7,121 @@
PySparkler is a tool that upgrades your PySpark scripts to the latest Spark version. It is a command-line tool that takes a
PySpark script as input and outputs a script compatible with the latest Spark version. It is written in Python and uses the
[LibCST](https://github.com/Instagram/LibCST) module to parse the input script and generate the output script.

## Features

The tool provides support for the following features:
- Upgrade PySpark Python script
- Upgrade PySpark Jupyter Notebook
- Upgrade SQL
- Dry-run Mode
- Verbose Mode
- Customize code transformers using YAML config

### Upgrade PySpark Python script

The tool can upgrade a PySpark Python script. It takes the path to the script as input and upgrades it in place:
```bash
pysparkler upgrade --input-file /path/to/script.py
```

If you want to write the upgraded script to a different file, use the `--output-file` flag:
```bash
pysparkler upgrade --input-file /path/to/script.py --output-file /path/to/output.py
```

### Upgrade PySpark Jupyter Notebook

The tool can upgrade a PySpark Jupyter Notebook to the latest Spark version. It takes the path to the notebook as input
and upgrades it in place:
```bash
pysparkler upgrade --input-file /path/to/notebook.ipynb
```

As with Python scripts, if you want to write the upgraded notebook to a different file, use the `--output-file`
flag:
```bash
pysparkler upgrade --input-file /path/to/notebook.ipynb --output-file /path/to/output.ipynb
```

To change the kernel name in the output Jupyter notebook, use the `--output-kernel` flag:
```bash
pysparkler upgrade --input-file /path/to/notebook.ipynb --output-kernel spark33-python3
```
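
For context, a notebook file is JSON, and in the nbformat schema the kernel name is stored under
`metadata.kernelspec.name`. The following is a minimal sketch of what changing that field looks like, using only the
Python standard library. The helper name `set_kernel_name` is hypothetical, and this is an illustration rather than
PySparkler's own implementation:

```python
import json


def set_kernel_name(notebook_json: str, kernel_name: str) -> str:
    """Return notebook JSON with metadata.kernelspec.name set to kernel_name."""
    notebook = json.loads(notebook_json)
    # Create the nested dictionaries if the notebook does not have them yet.
    kernelspec = notebook.setdefault("metadata", {}).setdefault("kernelspec", {})
    kernelspec["name"] = kernel_name
    return json.dumps(notebook, indent=1)


# A tiny stand-in for the contents of a .ipynb file.
example = json.dumps({
    "cells": [],
    "metadata": {"kernelspec": {"name": "python3", "display_name": "Python 3"}},
    "nbformat": 4,
    "nbformat_minor": 5,
})

updated = json.loads(set_kernel_name(example, "spark33-python3"))
print(updated["metadata"]["kernelspec"]["name"])
```

Only the `name` field is touched in this sketch; other kernel metadata such as `display_name` is left as it was.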

### Upgrade SQL

When PySparkler encounters a SQL statement in the input script, it attempts to upgrade it. However, formatted-string
SQL statements that contain complex expressions cannot always be upgraded automatically. In such cases the tool
leaves code hints telling users to upgrade the SQL themselves.

To support this, it exposes an `upgrade-sql` command so that users can upgrade the SQL manually. The steps are:

1. De-template the SQL.
1. Upgrade the de-templated SQL using `pysparkler upgrade-sql`. See below for details.
1. Re-template the upgraded SQL.
1. Replace the old SQL with the upgraded SQL in the input script.

To perform step 2, you can either echo the SQL statement and pipe it to the tool:
```bash
echo "SELECT * FROM table" | pysparkler upgrade-sql
```

or use the `cat` command to pipe a SQL file to the tool:
```bash
cat /path/to/sql.sql | pysparkler upgrade-sql
```
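
Steps 1 and 3 of the list above can be sketched in Python as follows. This is one possible approach and not part of
PySparkler: the helper names `de_template` and `re_template` and the `__param_N__` token scheme are assumptions made
for illustration. It handles only simple `{name}` placeholders in the style of `str.format`, and it assumes the
upgrade step leaves the tokens unchanged:

```python
import re
from typing import Dict, Tuple

# Matches simple str.format-style placeholders such as {table} or {run_date}.
PLACEHOLDER = re.compile(r"\{(\w+)\}")


def de_template(sql: str) -> Tuple[str, Dict[str, str]]:
    """Replace each {name} placeholder with a plain identifier-like token."""
    mapping: Dict[str, str] = {}

    def substitute(match):
        # Each occurrence gets its own token, mapped back to the original text.
        token = f"__param_{len(mapping)}__"
        mapping[token] = match.group(0)
        return token

    return PLACEHOLDER.sub(substitute, sql), mapping


def re_template(sql: str, mapping: Dict[str, str]) -> str:
    """Put the original placeholders back into the (upgraded) SQL."""
    for token, placeholder in mapping.items():
        sql = sql.replace(token, placeholder)
    return sql


template = "SELECT * FROM {table} WHERE dt = '{run_date}'"
plain_sql, mapping = de_template(template)
print(plain_sql)

# Step 2 happens here: pipe plain_sql to `pysparkler upgrade-sql`.
upgraded_sql = plain_sql  # stand-in for the tool's output
print(re_template(upgraded_sql, mapping))
```

SQL templates that embed arbitrary Python expressions need manual de-templating; a regular expression like the one
above will not cover them.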

### Dry-Run Mode

For both of the upgrade options above, use the `--dry-run` flag to run in dry-run mode. This does not write the
upgraded script but prints a unified diff of the input and output scripts so you can inspect the changes:
```bash
pysparkler upgrade --input-file /path/to/script.py --dry-run
```
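
To give a sense of the output format, the sketch below produces a unified diff with Python's standard `difflib`
module. The before and after snippets are made-up examples rather than actual PySparkler output, and the tool's own
diff may be formatted differently:

```python
import difflib

# Hypothetical input script and a hypothetical upgraded version of it.
original = "df = spark.read.csv('data.csv')\nrows = df.collect()\n"
upgraded = "df = spark.read.csv('data.csv')\nrows = df.collect()  # hypothetical code hint\n"

diff_lines = list(
    difflib.unified_diff(
        original.splitlines(keepends=True),
        upgraded.splitlines(keepends=True),
        fromfile="input",
        tofile="output",
    )
)

# Lines starting with "-" come from the input, lines starting with "+" from the output.
print("".join(diff_lines))
```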

### Verbose Mode

For both of the upgrade options above, use the `--verbose` flag to run in verbose mode. This prints the tool's
input variables, the input file content, the output content, and a unified diff of the input and output content:
```bash
pysparkler --verbose upgrade --input-file /path/to/script.py
```

### Customize code transformers using YAML config

The tool uses a YAML config file to customize the code transformers. The config file can be passed using the
`--config-yaml` flag:
```bash
pysparkler --config-yaml /path/to/config.yaml upgrade --input-file /path/to/script.py
```

The config file is a YAML file with the following structure:
```yaml
pysparkler:
  dry_run: false # Whether to run in dry-run mode
  PY24-30-001: # The code transformer ID
    comment: A new comment # The overridden code hint comment to be used by the code transformer
  PY24-30-002:
    enabled: false # Disable the code transformer
```
[![PyPI version](https://badge.fury.io/py/pysparkler.svg)](https://badge.fury.io/py/pysparkler)
[![License: Apache-2.0](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
## Installation
We recommend installing PySparkler from PyPI using [pipx](https://pypa.github.io/pipx) which allows us to install and
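@@ -311,4 +426,4 @@
The main advantage of using a Transformer is that it allows for more fine-grained control over the transformation
process. Transformer classes can be defined to apply specific transformations to specific parts of the codebase, and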
multiple Transformer classes can be combined to form a chain of transformations. This can be useful when dealing with
complex codebases where different parts of the code require different transformations.

More on this can be found in the [LibCST tutorial](https://libcst.readthedocs.io/en/latest/tutorial.html#Build-Visitor-or-Transformer).
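
The idea of chaining transformers can be illustrated without LibCST. The sketch below uses the standard library's
`ast.NodeTransformer` as a stand-in; LibCST's transformer API differs and, unlike `ast`, preserves comments and
formatting. The `RenameCall` class and the renames it performs are hypothetical examples, not PySparkler
transformers:

```python
import ast


class RenameCall(ast.NodeTransformer):
    """Rename calls to one bare function name to another."""

    def __init__(self, old_name: str, new_name: str) -> None:
        self.old_name = old_name
        self.new_name = new_name

    def visit_Call(self, node: ast.Call) -> ast.AST:
        self.generic_visit(node)  # visit nested calls first
        if isinstance(node.func, ast.Name) and node.func.id == self.old_name:
            node.func.id = self.new_name
        return node


def apply_chain(source: str, transformers) -> str:
    """Run each transformer over the tree in order and return the new source."""
    tree = ast.parse(source)
    for transformer in transformers:
        tree = transformer.visit(tree)
    return ast.unparse(tree)  # requires Python 3.9 or later


# Each transformer handles one narrow change; the chain applies them in sequence.
chain = [RenameCall("old_func", "new_func"), RenameCall("legacy_sum", "total")]
result = apply_chain("x = old_func(legacy_sum(1, 2))", chain)
print(result)
```

Keeping each transformer focused on a single change makes it straightforward to enable, disable, or reorder
individual steps in the chain.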
