
[GJ-13] Refine docs on Velox backend (#102)
* refine doc on Velox backend

* revert pom change
rui-mo authored Apr 8, 2022
1 parent 67d420f commit d7cd5f9
Showing 9 changed files with 74 additions and 70 deletions.
22 changes: 11 additions & 11 deletions README.md
Original file line number Diff line number Diff line change
@@ -54,29 +54,29 @@ UDF support. We need to create the interface which uses columnar batch as input.

Ideally, if all native libraries can return Arrow record batches, we can share many features in Spark's JVM. Spark already has an Apache Arrow dependency, so we can make the Arrow format Spark's basic columnar format. The problem is that a native library may not be 100% compatible with the Arrow format; then a transform between its native format and Arrow is needed, and usually it is not cheap.

# How to use OAP: Gazelle-Jni
# How to use OAP: Gluten

### Build the Environment

There are two ways to build the env for compiling OAP: Gazelle-Jni
1. Building by Conda Environment
2. Building by Yourself
There are two ways to build the env for compiling OAP: Gluten
1. Build by Conda Environment
2. Build by Yourself

- ### Building by Conda (Recommended)
- ### Build by Conda (Recommended)

If you already have a working Hadoop Spark Cluster, we provide a Conda package which will automatically install dependencies needed by OAP; you can refer to [OAP-Installation-Guide](./docs/OAP-Installation-Guide.md) for more information.

- ### Building by yourself
- ### Build by yourself

If you prefer to build from the source code on your hand, please follow the steps in [Installation Guide](./docs/GazelleJniInstallation.md) to set up your environment.
If you prefer to build from the source code yourself, please follow the steps in [Installation Guide](./docs/GlutenInstallation.md) to set up your environment.

### Compile and use Gazelle Jni
### Compile and use Gluten

Once your env being successfully deployed, please refer to [Gazelle Jni Usage](./docs/GazelleJniUsage.md) to compile and use Gazelle Jni in Spark.
Once your env is successfully deployed, please refer to [Gluten Usage](./docs/GlutenUsage.md) to compile and use Gluten in Spark.

### Notes for Building Gazelle-Jni with Velox
### Build Gluten with Velox backend

After Gazelle-Jni being successfully deployed in your environment, if you would like to build Gazelle-Jni with **Velox** computing, please checkout to branch [velox_dev](https://github.com/oap-project/gazelle-jni/tree/velox_dev) and follow the steps in [Build with Velox](./docs/Velox.md) to install the needed libraries, compile Velox and try out the TPC-H Q6 test.
After Gluten is successfully deployed in your environment, if you would like to build Gluten with **Velox** computing, please follow the steps in [Build with Velox](./docs/Velox.md) to install the needed libraries, compile Velox and try out the TPC-H Q6 and Q1 tests.

# Contact

2 changes: 1 addition & 1 deletion cpp/src/CMakeLists.txt
@@ -155,7 +155,7 @@ macro(build_arrow STATIC_ARROW)
ExternalProject_Add(arrow_ep
GIT_REPOSITORY https://github.com/oap-project/arrow.git
SOURCE_DIR ${ARROW_SOURCE_DIR}
GIT_TAG arrow-4.0.0-oap
GIT_TAG arrow-7.0.0-oap
BUILD_IN_SOURCE 1
INSTALL_DIR ${ARROW_PREFIX}
INSTALL_COMMAND make install
6 changes: 3 additions & 3 deletions docs/ArrowInstallation.md
@@ -30,8 +30,8 @@ Please make sure your cmake version meets the prerequisite.
# Arrow
``` shell
git clone https://github.com/oap-project/arrow.git
cd arrow && git checkout arrow-4.0.0-oap
mkdir -p arrow/cpp/release-build
cd arrow && git checkout arrow-7.0.0-oap
mkdir -p cpp/release-build
cd cpp/release-build
cmake -DARROW_DEPENDENCY_SOURCE=BUNDLED -DARROW_GANDIVA_JAVA=ON -DARROW_GANDIVA=ON -DARROW_PARQUET=ON -DARROW_CSV=ON -DARROW_HDFS=ON -DARROW_BOOST_USE_SHARED=ON -DARROW_JNI=ON -DARROW_DATASET=ON -DARROW_WITH_PROTOBUF=ON -DARROW_WITH_SNAPPY=ON -DARROW_WITH_LZ4=ON -DARROW_FILESYSTEM=ON -DARROW_JSON=ON ..
make -j
@@ -51,4 +51,4 @@ mvn test -pl adapter/parquet -P arrow-jni
mvn test -pl gandiva -P arrow-jni
```

After arrow installed in the specific directory, please make sure to set up -Dbuild_arrow=OFF -Darrow_root=/path/to/arrow when building Gazelle Plugin.
After Arrow is installed in the specific directory, please make sure to set -Dbuild_arrow=OFF -Darrow_root=/path/to/arrow when building Gluten.
6 changes: 3 additions & 3 deletions docs/GazelleJniInstallation.md → docs/GlutenInstallation.md
@@ -1,4 +1,4 @@
# How to Build OAP: Gazelle-Jni
# How to Build OAP: Gluten

## Prerequisite

@@ -12,13 +12,13 @@ Please make sure you have already installed the software in your system.
5. Maven 3.6.3 or higher version
6. Hadoop 2.7.5 or higher version
7. Spark 3.1.1 or higher version
8. Intel Optimized Arrow 4.0.0
8. Intel Optimized Arrow 7.0.0

### GCC installation

#### Installing GCC 7.0 or higher version

Please notes for better performance support, GCC 7.0 is a minimal requirement with Intel Microarchitecture such as SKYLAKE, CASCADELAKE, ICELAKE.
Please note for better performance support, GCC 7.0 is a minimal requirement with Intel Microarchitecture such as SKYLAKE, CASCADELAKE, ICELAKE.
https://gcc.gnu.org/install/index.html

Follow the above website to download gcc.
45 changes: 23 additions & 22 deletions docs/GazelleJniUsage.md → docs/GlutenUsage.md
@@ -1,28 +1,27 @@
### Gazelle Jni Usage
### Gluten Usage

#### How to compile Gazelle Jni jar
#### How to compile Gluten jar

``` shell
git clone -b master https://github.com/oap-project/gazelle-jni.git
cd gazelle-jni
git clone -b master https://github.com/oap-project/gluten.git
cd gluten
```

If compiling on the master branch:
When compiling Gluten, a backend should be enabled for execution.
For example, add below options to enable Velox backend:

```shell script
git checkout master
mvn clean package -P full-scala-compiler -DskipTests -Dcheckstyle.skip -Dbuild_cpp=ON
-Dbuild_velox=ON -Dvelox_home=${VELOX_HOME}
```

If compiling with Velox:
The full compiling command would be like:

```shell script
git checkout velox_dev
mvn clean package -P full-scala-compiler -DskipTests -Dcheckstyle.skip -Dbuild_cpp=ON -Dvelox_home=${VELOX_HOME}
mvn clean package -P full-scala-compiler -DskipTests -Dcheckstyle.skip -Dbuild_cpp=ON -Dbuild_velox=ON -Dvelox_home=${VELOX_HOME}
```

If Arrow has already been installed successfully in your environment, and there is no change to Arrow, you can
add below config to disable Arrow compilation each time when compiling Gazelle-Jni:
add below config to disable Arrow compilation each time when compiling Gluten:

```shell script
-Dbuild_arrow=OFF -Darrow_root=${ARROW_LIB_PATH}
@@ -32,26 +31,28 @@ Depending on the environment, there are some parameters that can be set via -D

| Parameters | Description | Default Value |
| ---------- | ----------- | ------------- |
| build_cpp | Enable or Disable building CPP library | False |
| cpp_tests | Enable or Disable CPP Tests | False |
| build_arrow | Build Arrow from Source | True |
| build_cpp | Enable or Disable building CPP library | OFF |
| cpp_tests | Enable or Disable CPP Tests | OFF |
| build_arrow | Build Arrow from Source | ON |
| arrow_root | When build_arrow is set to False, arrow_root will be used to find the location of your existing arrow library. | /usr/local |
| build_protobuf | Build Protobuf from Source. If set to False, default library path will be used to find protobuf library. | True |
| velox_home (only valid on velox_dev branch) | When building Gazelle-Jni with Velox, the location of Velox should be set. | /root/velox |
| build_protobuf | Build Protobuf from Source. If set to False, default library path will be used to find protobuf library. |ON |
| build_velox | Enable or Disable building Velox as a backend. | OFF |
| velox_home (only valid when build_velox is ON) | When building Gluten with Velox, the location of Velox should be set. | /root/velox |

When build_arrow set to True, the build_arrow.sh will be launched and compile a custom arrow library from [OAP Arrow](https://github.com/oap-project/arrow/tree/arrow-4.0.0-oap)
When build_arrow is set to True, build_arrow.sh will be launched to compile a custom Arrow library from [OAP Arrow](https://github.com/oap-project/arrow/tree/arrow-7.0.0-oap)
If you wish to change any parameters for Arrow, you can change them in the [build_arrow.sh](../tools/build_arrow.sh) script.
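
Putting these options together, a full rebuild that reuses a previously installed Arrow could look like the sketch below. The ARROW_LIB_PATH and VELOX_HOME values are assumed example locations (matching the defaults above), not fixed paths:

```shell script
# Sketch: assemble the Gluten build command with a pre-built Arrow.
# /usr/local and /root/velox are assumptions; point them at your own installs.
ARROW_LIB_PATH=/usr/local
VELOX_HOME=/root/velox
BUILD_CMD="mvn clean package -P full-scala-compiler -DskipTests -Dcheckstyle.skip \
-Dbuild_cpp=ON -Dbuild_velox=ON -Dvelox_home=${VELOX_HOME} \
-Dbuild_arrow=OFF -Darrow_root=${ARROW_LIB_PATH}"
echo "${BUILD_CMD}"
```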

#### How to submit Spark Job

##### Key Spark Configurations when using Gazelle Jni
##### Key Spark Configurations when using Gluten

```shell script
spark.plugins com.intel.oap.GazellePlugin
spark.oap.sql.columnar.backend.lib ${BACKEND}
spark.sql.sources.useV1SourceList avro
spark.memory.offHeap.size 20g
spark.driver.extraClassPath ${GAZELLE_JNI_HOME}/jvm/target/gazelle-jni-jvm-<version>-snapshot-jar-with-dependencies.jar
spark.executor.extraClassPath ${GAZELLE_JNI_HOME}/jvm/target/gazelle-jni-jvm-<version>-snapshot-jar-with-dependencies.jar
spark.driver.extraClassPath ${GLUTEN_HOME}/jvm/target/gluten-jvm-<version>-snapshot-jar-with-dependencies.jar
spark.executor.extraClassPath ${GLUTEN_HOME}/jvm/target/gluten-jvm-<version>-snapshot-jar-with-dependencies.jar
```

Below is an example of the script to submit Spark SQL query.
@@ -69,5 +70,5 @@ time{spark.sql("${QUERY}").show}
Submit the above script from spark-shell to trigger a Spark Job with certain configurations.

```shell script
cat query.scala | spark-shell --name query --master yarn --deploy-mode client --conf spark.plugins=com.intel.oap.GazellePlugin --conf spark.driver.extraClassPath=${gazelle_jvm_jar} --conf spark.executor.extraClassPath=${gazelle_jvm_jar} --conf spark.memory.offHeap.size=20g --conf spark.sql.sources.useV1SourceList=avro --num-executors 6 --executor-cores 6 --driver-memory 20g --executor-memory 25g --conf spark.executor.memoryOverhead=5g --conf spark.driver.maxResultSize=32g
```
cat query.scala | spark-shell --name query --master yarn --deploy-mode client --conf spark.plugins=com.intel.oap.GazellePlugin --conf spark.oap.sql.columnar.backend.lib=${BACKEND} --conf spark.driver.extraClassPath=${gluten_jvm_jar} --conf spark.executor.extraClassPath=${gluten_jvm_jar} --conf spark.memory.offHeap.size=20g --conf spark.sql.sources.useV1SourceList=avro --num-executors 6 --executor-cores 6 --driver-memory 20g --executor-memory 25g --conf spark.executor.memoryOverhead=5g --conf spark.driver.maxResultSize=32g
```
2 changes: 1 addition & 1 deletion docs/OAP-Installation-Guide.md
@@ -33,7 +33,7 @@ $ conda create -n oapenv -c conda-forge -c intel -y oap=1.2.0

The dependencies below are required by OAP and all of them are included in the OAP Conda package; they will be automatically installed in your cluster when you Conda-install OAP. Ensure you have activated the environment you created in the previous steps.

- [Arrow](https://github.com/oap-project/arrow/tree/v4.0.0-oap-1.2.0)
- [Arrow](https://github.com/oap-project/arrow/tree/arrow-7.0.0-oap)
- [Plasma](http://arrow.apache.org/blog/2017/08/08/plasma-in-memory-object-store/)
- [Memkind](https://github.com/memkind/memkind/tree/v1.10.1)
- [Vmemcache](https://github.com/pmem/vmemcache.git)
59 changes: 31 additions & 28 deletions docs/Velox.md
@@ -1,12 +1,15 @@
## Velox

Currently, Gluten requires Velox to be pre-compiled.
In general, please refer to [Velox Installation](https://github.com/facebookincubator/velox/blob/main/scripts/setup-ubuntu.sh) to install all the dependencies and compile Velox.
In addition to that, there are several points worth attention when compiling Gazelle-Jni with Velox.
Gluten depends on this [Velox branch](https://github.com/rui-mo/velox/tree/velox_for_gazelle_jni).
The changes to Velox are planned to be upstreamed in the future.
In addition to that, there are several points worth noting when compiling Gluten with Velox.

Firstly, please note that all the Gazelle-Jni required libraries should be compiled as **position independent code**.
Firstly, please note that all the Gluten required libraries should be compiled as **position independent code**.
That means, for static libraries, "-fPIC" option should be added in their compiling processes.

Currently, Gazelle-Jni with Velox depends on below libraries:
Currently, Gluten with Velox depends on below libraries:

Required static libraries are:

@@ -21,9 +24,9 @@ Required shared libraries are:
- gtest
- snappy

Gazelle-Jni will try to find above libraries from system lib paths.
Gluten will try to find above libraries from system lib paths.
If they are not installed there, please copy them to system lib paths,
or change the paths about where to find them specified in [CMakeLists.txt](https://github.com/oap-project/gazelle-jni/blob/velox_dev/cpp/src/CMakeLists.txt).
or change the finding paths specified in [CMakeLists.txt](https://github.com/oap-project/gluten/blob/master/cpp/velox/CMakeLists.txt).

```shell script
set(SYSTEM_LIB_PATH "/usr/lib" CACHE PATH "System Lib dir")
@@ -34,19 +37,24 @@ set(SYSTEM_LOCAL_LIB64_PATH "/usr/local/lib64" CACHE PATH "System Local Lib64 dir")

Secondly, when compiling Velox, please note that Velox generated static libraries should also be compiled as position independent code.
Also, some OBJECT settings in CMakeLists are removed in order to acquire the static libraries.
For these two changes, please refer to this commit [Velox Compiling](https://github.com/rui-mo/velox/commit/ce1dee8f776bc3afa36cd3fc033161fc062cbe98).
These two changes have already been covered in the Velox branch Gluten depends on.
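
When building your own Velox checkout (or any CMake-based dependency) for Gluten, the position-independent-code requirement can usually be satisfied with a single CMake cache variable rather than hand-editing compiler flags. A minimal sketch, assuming a standard out-of-source CMake build directory:

```shell script
# Sketch: CMAKE_POSITION_INDEPENDENT_CODE=ON makes CMake add -fPIC to the
# objects that go into static archives, which is what Gluten requires.
CONFIGURE_CMD="cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_POSITION_INDEPENDENT_CODE=ON .."
echo "${CONFIGURE_CMD}"   # run from a build/ directory inside the project
```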

Currently, we depends on this Velox commmit: **8d3e951 (Jan 18 2022)**
After Velox being successfully compiled, please refer to [GlutenUsage](GlutenUsage.md) and
use below command to compile Gluten with Velox backend.

### An example for Velox computing in Spark based on Gazelle-Jni
```shell script
mvn clean package -P full-scala-compiler -DskipTests -Dcheckstyle.skip -Dbuild_cpp=ON -Dbuild_velox=ON -Dvelox_home=${VELOX_HOME}
```

### An example for offloading Spark's computing to Velox with Gluten

TPC-H Q6 is supported in Gazelle-Jni base on Velox computing. Current support still has several limitations:
TPC-H Q1 and Q6 are supported in Gluten using Velox as the backend. Current support has the following limitations:

- Only Double type is supported.
- Only first stage of TPC-H Q6 (which occupies the most time in this query) is supported.
- Date and Long types in Velox's TableScan are not fully ready, so the related columns were converted into Double type.
- Metrics are missing.

#### Test TPC-H Q6 on Gazelle-Jni with Velox computing
#### Test TPC-H Q1 and Q6 on Gluten with Velox backend

##### Data preparation

@@ -77,6 +85,11 @@ The modified TPC-H Q6 query is:
select sum(l_extendedprice * l_discount) as revenue from lineitem where l_shipdate_new >= 8766 and l_shipdate_new < 9131 and l_discount between .06 - 0.01 and .06 + 0.01 and l_quantity < 24
```
The modified TPC-H Q1 query is:
```shell script
select l_returnflag, l_linestatus, sum(l_quantity) as sum_qty, sum(l_extendedprice) as sum_base_price, sum(l_extendedprice * (1 - l_discount)) as sum_disc_price, sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge, avg(l_quantity) as avg_qty, avg(l_extendedprice) as avg_price, avg(l_discount) as avg_disc, count(*) as count_order from lineitem where l_shipdate_new <= 10471 group by l_returnflag, l_linestatus order by l_returnflag, l_linestatus
```
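The integer literals in the two queries above are dates expressed as days since the Unix epoch, since l_shipdate was converted into a numeric l_shipdate_new column (see the limitations above). They can be sanity-checked with a small helper (assumes GNU date):

```shell script
# Sketch: convert a calendar date to days since 1970-01-01 (assumes GNU date).
days_since_epoch() { echo $(( $(date -u -d "$1" +%s) / 86400 )); }
days_since_epoch 1994-01-01   # 8766  -> Q6 lower bound on l_shipdate_new
days_since_epoch 1995-01-01   # 9131  -> Q6 upper bound
days_since_epoch 1998-09-02   # 10471 -> Q1 bound (1998-12-01 minus 90 days)
```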
The script below shows how to read the ORC data and submit the modified TPC-H Q6 query.
cat tpch_q6.scala
@@ -90,7 +103,7 @@ time{spark.sql("select sum(l_extendedprice * l_discount) as revenue from lineite
Submit test script from spark-shell.
```shell script
cat tpch_q6.scala | spark-shell --name tpch_velox_q6 --master yarn --deploy-mode client --conf spark.plugins=com.intel.oap.GazellePlugin --conf spark.driver.extraClassPath=${gazelle_jvm_jar} --conf spark.executor.extraClassPath=${gazelle_jvm_jar} --conf spark.memory.offHeap.size=20g --conf spark.sql.sources.useV1SourceList=avro --num-executors 6 --executor-cores 6 --driver-memory 20g --executor-memory 25g --conf spark.executor.memoryOverhead=5g --conf spark.driver.maxResultSize=32g
cat tpch_q6.scala | spark-shell --name tpch_velox_q6 --master yarn --deploy-mode client --conf spark.plugins=com.intel.oap.GazellePlugin --conf spark.oap.sql.columnar.backend.lib=velox --conf spark.driver.extraClassPath=${gluten_jvm_jar} --conf spark.executor.extraClassPath=${gluten_jvm_jar} --conf spark.memory.offHeap.size=20g --conf spark.sql.sources.useV1SourceList=avro --num-executors 6 --executor-cores 6 --driver-memory 20g --executor-memory 25g --conf spark.executor.memoryOverhead=5g --conf spark.driver.maxResultSize=32g
```
##### Result
@@ -99,20 +112,10 @@ cat tpch_q6.scala | spark-shell --name tpch_velox_q6 --master yarn --deploy-mode
##### Performance
Below table shows the TPC-H Q6 Performance in a multiple-thread test (--num-executors 6 --executor-cores 6) for Velox and vanilla Spark.
Below table shows the TPC-H Q1 and Q6 Performance in a multiple-thread test (--num-executors 6 --executor-cores 6) for Velox and vanilla Spark.
Both Parquet and ORC datasets are sf1024.
| TPC-H Q6 Performance | Velox (ORC) | Vanilla Spark (Parquet) | Vanilla Spark (ORC) |
| ---------- | ----------- | ------------- | ------------- |
| Time(s) | 13.6 | 21.6 | 34.9 |
| Query Performance (s) | Velox (ORC) | Vanilla Spark (Parquet) | Vanilla Spark (ORC) |
|---------------- | ----------- | ------------- | ------------- |
| TPC-H Q6 | 13.6 | 21.6 | 34.9 |
| TPC-H Q1 | 26.1 | 76.7 | 84.9 |
Binary file modified docs/image/TPC-H_Q6_DAG.png
2 changes: 1 addition & 1 deletion tools/build_arrow.sh
@@ -62,7 +62,7 @@ echo "ARROW_SOURCE_DIR=${ARROW_SOURCE_DIR}"
echo "ARROW_INSTALL_DIR=${ARROW_INSTALL_DIR}"
mkdir -p $ARROW_SOURCE_DIR
mkdir -p $ARROW_INSTALL_DIR
git clone https://github.com/oap-project/arrow.git --branch arrow-4.0.0-oap $ARROW_SOURCE_DIR
git clone https://github.com/oap-project/arrow.git --branch arrow-7.0.0-oap $ARROW_SOURCE_DIR
pushd $ARROW_SOURCE_DIR

cmake ./cpp \
