
[GJ-13] Refine docs on Velox backend (#102)
* refine doc on Velox backend

* revert pom change
rui-mo authored Apr 8, 2022
1 parent 67d420f commit d7cd5f9
Showing 9 changed files with 74 additions and 70 deletions.
22 changes: 11 additions & 11 deletions README.md
Original file line number Diff line number Diff line change
@@ -54,29 +54,29 @@ UDF support. We need to create the interface which uses columnar batch as input.

Ideally, if all native libraries can return Arrow record batches, we can share many features in Spark's JVM. Spark already has an Apache Arrow dependency, so we can make the Arrow format Spark's basic columnar format. The problem is that a native library may not be 100% compatible with the Arrow format; then a transform between its native format and Arrow is needed, and usually it is not cheap.

# How to use OAP: Gazelle-Jni
# How to use OAP: Gluten

### Build the Environment

There are two ways to build the env for compiling OAP: Gazelle-Jni
1. Building by Conda Environment
2. Building by Yourself
There are two ways to build the env for compiling OAP: Gluten
1. Build by Conda Environment
2. Build by Yourself

- ### Building by Conda (Recommended)
- ### Build by Conda (Recommended)

If you already have a working Hadoop Spark Cluster, we provide a Conda package which will automatically install dependencies needed by OAP; you can refer to [OAP-Installation-Guide](./docs/OAP-Installation-Guide.md) for more information.

- ### Building by yourself
- ### Build by yourself

If you prefer to build from the source code on your hand, please follow the steps in [Installation Guide](./docs/GazelleJniInstallation.md) to set up your environment.
If you prefer to build from the source code yourself, please follow the steps in [Installation Guide](./docs/GlutenInstallation.md) to set up your environment.

### Compile and use Gazelle Jni
### Compile and use Gluten

Once your env being successfully deployed, please refer to [Gazelle Jni Usage](./docs/GazelleJniUsage.md) to compile and use Gazelle Jni in Spark.
Once your env is successfully deployed, please refer to [Gluten Usage](./docs/GlutenUsage.md) to compile and use Gluten in Spark.

### Notes for Building Gazelle-Jni with Velox
### Build Gluten with Velox backend

After Gazelle-Jni being successfully deployed in your environment, if you would like to build Gazelle-Jni with **Velox** computing, please checkout to branch [velox_dev](https://github.com/oap-project/gazelle-jni/tree/velox_dev) and follow the steps in [Build with Velox](./docs/Velox.md) to install the needed libraries, compile Velox and try out the TPC-H Q6 test.
After Gluten is successfully deployed in your environment, if you would like to build Gluten with **Velox** computing, please follow the steps in [Build with Velox](./docs/Velox.md) to install the needed libraries, compile Velox and try out the TPC-H Q6 and Q1 tests.

# Contact

2 changes: 1 addition & 1 deletion cpp/src/CMakeLists.txt
@@ -155,7 +155,7 @@ macro(build_arrow STATIC_ARROW)
ExternalProject_Add(arrow_ep
GIT_REPOSITORY https://github.com/oap-project/arrow.git
SOURCE_DIR ${ARROW_SOURCE_DIR}
GIT_TAG arrow-4.0.0-oap
GIT_TAG arrow-7.0.0-oap
BUILD_IN_SOURCE 1
INSTALL_DIR ${ARROW_PREFIX}
INSTALL_COMMAND make install
6 changes: 3 additions & 3 deletions docs/ArrowInstallation.md
@@ -30,8 +30,8 @@ Please make sure your cmake version meets the prerequisite.
# Arrow
``` shell
git clone https://github.com/oap-project/arrow.git
cd arrow && git checkout arrow-4.0.0-oap
mkdir -p arrow/cpp/release-build
cd arrow && git checkout arrow-7.0.0-oap
mkdir -p cpp/release-build
cd cpp/release-build
cmake -DARROW_DEPENDENCY_SOURCE=BUNDLED -DARROW_GANDIVA_JAVA=ON -DARROW_GANDIVA=ON -DARROW_PARQUET=ON -DARROW_CSV=ON -DARROW_HDFS=ON -DARROW_BOOST_USE_SHARED=ON -DARROW_JNI=ON -DARROW_DATASET=ON -DARROW_WITH_PROTOBUF=ON -DARROW_WITH_SNAPPY=ON -DARROW_WITH_LZ4=ON -DARROW_FILESYSTEM=ON -DARROW_JSON=ON ..
make -j
@@ -51,4 +51,4 @@ mvn test -pl adapter/parquet -P arrow-jni
mvn test -pl gandiva -P arrow-jni
```

After arrow installed in the specific directory, please make sure to set up -Dbuild_arrow=OFF -Darrow_root=/path/to/arrow when building Gazelle Plugin.
After Arrow is installed in the specific directory, please make sure to set -Dbuild_arrow=OFF -Darrow_root=/path/to/arrow when building Gluten.
6 changes: 3 additions & 3 deletions docs/GazelleJniInstallation.md → docs/GlutenInstallation.md
@@ -1,4 +1,4 @@
# How to Build OAP: Gazelle-Jni
# How to Build OAP: Gluten

## Prerequisite

@@ -12,13 +12,13 @@ Please make sure you have already installed the software in your system.
5. Maven 3.6.3 or higher version
6. Hadoop 2.7.5 or higher version
7. Spark 3.1.1 or higher version
8. Intel Optimized Arrow 4.0.0
8. Intel Optimized Arrow 7.0.0

### GCC installation

#### Installing GCC 7.0 or higher version

Please notes for better performance support, GCC 7.0 is a minimal requirement with Intel Microarchitecture such as SKYLAKE, CASCADELAKE, ICELAKE.
Please note for better performance support, GCC 7.0 is a minimal requirement with Intel Microarchitecture such as SKYLAKE, CASCADELAKE, ICELAKE.
https://gcc.gnu.org/install/index.html

Follow the above website to download gcc.
45 changes: 23 additions & 22 deletions docs/GazelleJniUsage.md → docs/GlutenUsage.md
@@ -1,28 +1,27 @@
### Gazelle Jni Usage
### Gluten Usage

#### How to compile Gazelle Jni jar
#### How to compile Gluten jar

``` shell
git clone -b master https://github.com/oap-project/gazelle-jni.git
cd gazelle-jni
git clone -b master https://github.com/oap-project/gluten.git
cd gluten
```

If compiling on the master branch:
When compiling Gluten, a backend should be enabled for execution.
For example, add below options to enable Velox backend:

```shell script
git checkout master
mvn clean package -P full-scala-compiler -DskipTests -Dcheckstyle.skip -Dbuild_cpp=ON
-Dbuild_velox=ON -Dvelox_home=${VELOX_HOME}
```

If compiling with Velox:
The full compiling command would be like:

```shell script
git checkout velox_dev
mvn clean package -P full-scala-compiler -DskipTests -Dcheckstyle.skip -Dbuild_cpp=ON -Dvelox_home=${VELOX_HOME}
mvn clean package -P full-scala-compiler -DskipTests -Dcheckstyle.skip -Dbuild_cpp=ON -Dbuild_velox=ON -Dvelox_home=${VELOX_HOME}
```

If Arrow has already been installed successfully in your environment, and there is no change to Arrow, you can
add below config to disable Arrow compilation each time when compiling Gazelle-Jni:
add below config to disable Arrow compilation each time when compiling Gluten:

```shell script
-Dbuild_arrow=OFF -Darrow_root=${ARROW_LIB_PATH}
@@ -32,26 +31,28 @@ Depending on the environment, there are some parameters that can be set via -D

| Parameters | Description | Default Value |
| ---------- | ----------- | ------------- |
| build_cpp | Enable or Disable building CPP library | False |
| cpp_tests | Enable or Disable CPP Tests | False |
| build_arrow | Build Arrow from Source | True |
| build_cpp | Enable or Disable building CPP library | OFF |
| cpp_tests | Enable or Disable CPP Tests | OFF |
| build_arrow | Build Arrow from Source | ON |
| arrow_root | When build_arrow is set to False, arrow_root will be used to find the location of your existing arrow library. | /usr/local |
| build_protobuf | Build Protobuf from Source. If set to False, default library path will be used to find protobuf library. | True |
| velox_home (only valid on velox_dev branch) | When building Gazelle-Jni with Velox, the location of Velox should be set. | /root/velox |
| build_protobuf | Build Protobuf from Source. If set to False, default library path will be used to find protobuf library. |ON |
| build_velox | Enable or Disable building Velox as a backend. | OFF |
| velox_home (only valid when build_velox is ON) | When building Gluten with Velox, the location of Velox should be set. | /root/velox |

When build_arrow set to True, the build_arrow.sh will be launched and compile a custom arrow library from [OAP Arrow](https://github.com/oap-project/arrow/tree/arrow-4.0.0-oap)
When build_arrow is set to True, build_arrow.sh will be launched to compile a custom Arrow library from [OAP Arrow](https://github.com/oap-project/arrow/tree/arrow-7.0.0-oap)
If you wish to change any parameters for Arrow, you can change them in the [build_arrow.sh](../tools/build_arrow.sh) script.
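
Putting these options together, a full rebuild that reuses a previously installed Arrow could look like the sketch below. The ARROW_LIB_PATH and VELOX_HOME values are assumed example locations (matching the defaults above), not fixed paths:

```shell script
# Sketch: assemble the Gluten build command with a pre-built Arrow.
# /usr/local and /root/velox are assumptions; point them at your own installs.
ARROW_LIB_PATH=/usr/local
VELOX_HOME=/root/velox
BUILD_CMD="mvn clean package -P full-scala-compiler -DskipTests -Dcheckstyle.skip \
-Dbuild_cpp=ON -Dbuild_velox=ON -Dvelox_home=${VELOX_HOME} \
-Dbuild_arrow=OFF -Darrow_root=${ARROW_LIB_PATH}"
echo "${BUILD_CMD}"
```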

#### How to submit Spark Job

##### Key Spark Configurations when using Gazelle Jni
##### Key Spark Configurations when using Gluten

```shell script
spark.plugins com.intel.oap.GazellePlugin
spark.oap.sql.columnar.backend.lib ${BACKEND}
spark.sql.sources.useV1SourceList avro
spark.memory.offHeap.size 20g
spark.driver.extraClassPath ${GAZELLE_JNI_HOME}/jvm/target/gazelle-jni-jvm-<version>-snapshot-jar-with-dependencies.jar
spark.executor.extraClassPath ${GAZELLE_JNI_HOME}/jvm/target/gazelle-jni-jvm-<version>-snapshot-jar-with-dependencies.jar
spark.driver.extraClassPath ${GLUTEN_HOME}/jvm/target/gluten-jvm-<version>-snapshot-jar-with-dependencies.jar
spark.executor.extraClassPath ${GLUTEN_HOME}/jvm/target/gluten-jvm-<version>-snapshot-jar-with-dependencies.jar
```

Below is an example of the script to submit Spark SQL query.
@@ -69,5 +70,5 @@ time{spark.sql("${QUERY}").show}
Submit the above script from spark-shell to trigger a Spark Job with certain configurations.

```shell script
cat query.scala | spark-shell --name query --master yarn --deploy-mode client --conf spark.plugins=com.intel.oap.GazellePlugin --conf spark.driver.extraClassPath=${gazelle_jvm_jar} --conf spark.executor.extraClassPath=${gazelle_jvm_jar} --conf spark.memory.offHeap.size=20g --conf spark.sql.sources.useV1SourceList=avro --num-executors 6 --executor-cores 6 --driver-memory 20g --executor-memory 25g --conf spark.executor.memoryOverhead=5g --conf spark.driver.maxResultSize=32g
```
cat query.scala | spark-shell --name query --master yarn --deploy-mode client --conf spark.plugins=com.intel.oap.GazellePlugin --conf spark.oap.sql.columnar.backend.lib=${BACKEND} --conf spark.driver.extraClassPath=${gluten_jvm_jar} --conf spark.executor.extraClassPath=${gluten_jvm_jar} --conf spark.memory.offHeap.size=20g --conf spark.sql.sources.useV1SourceList=avro --num-executors 6 --executor-cores 6 --driver-memory 20g --executor-memory 25g --conf spark.executor.memoryOverhead=5g --conf spark.driver.maxResultSize=32g
```
2 changes: 1 addition & 1 deletion docs/OAP-Installation-Guide.md
@@ -33,7 +33,7 @@ $ conda create -n oapenv -c conda-forge -c intel -y oap=1.2.0

The dependencies below are required by OAP and all of them are included in the OAP Conda package; they will be automatically installed in your cluster when you Conda-install OAP. Ensure you have activated the environment you created in the previous steps.

- [Arrow](https://github.com/oap-project/arrow/tree/v4.0.0-oap-1.2.0)
- [Arrow](https://github.com/oap-project/arrow/tree/arrow-7.0.0-oap)
- [Plasma](http://arrow.apache.org/blog/2017/08/08/plasma-in-memory-object-store/)
- [Memkind](https://github.com/memkind/memkind/tree/v1.10.1)
- [Vmemcache](https://github.com/pmem/vmemcache.git)
59 changes: 31 additions & 28 deletions docs/Velox.md
@@ -1,12 +1,15 @@
## Velox

Currently, Gluten requires Velox to be pre-compiled.
In general, please refer to [Velox Installation](https://github.com/facebookincubator/velox/blob/main/scripts/setup-ubuntu.sh) to install all the dependencies and compile Velox.
In addition to that, there are several points worth attention when compiling Gazelle-Jni with Velox.
Gluten depends on this [Velox branch](https://github.com/rui-mo/velox/tree/velox_for_gazelle_jni).
The changes to Velox are planned to be upstreamed in the future.
In addition to that, there are several points worth noting when compiling Gluten with Velox.

Firstly, please note that all the Gazelle-Jni required libraries should be compiled as **position independent code**.
Firstly, please note that all the Gluten required libraries should be compiled as **position independent code**.
That means, for static libraries, "-fPIC" option should be added in their compiling processes.

Currently, Gazelle-Jni with Velox depends on below libraries:
Currently, Gluten with Velox depends on below libraries:

Required static libraries are:

@@ -21,9 +24,9 @@ Required shared libraries are:
- gtest
- snappy

Gazelle-Jni will try to find above libraries from system lib paths.
Gluten will try to find above libraries from system lib paths.
If they are not installed there, please copy them to system lib paths,
or change the paths about where to find them specified in [CMakeLists.txt](https://github.com/oap-project/gazelle-jni/blob/velox_dev/cpp/src/CMakeLists.txt).
or change the finding paths specified in [CMakeLists.txt](https://github.com/oap-project/gluten/blob/master/cpp/velox/CMakeLists.txt).

```shell script
set(SYSTEM_LIB_PATH "/usr/lib" CACHE PATH "System Lib dir")
@@ -34,19 +37,24 @@ set(SYSTEM_LOCAL_LIB64_PATH "/usr/local/lib64" CACHE PATH "System Local Lib64 dir")

Secondly, when compiling Velox, please note that Velox generated static libraries should also be compiled as position independent code.
Also, some OBJECT settings in CMakeLists are removed in order to acquire the static libraries.
For these two changes, please refer to this commit [Velox Compiling](https://github.com/rui-mo/velox/commit/ce1dee8f776bc3afa36cd3fc033161fc062cbe98).
These two changes have already been covered in the Velox branch Gluten depends on.
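
When building your own Velox checkout (or any CMake-based dependency) for Gluten, the position-independent-code requirement can usually be satisfied with a single CMake cache variable rather than hand-editing compiler flags. A minimal sketch, assuming a standard out-of-source CMake build directory:

```shell script
# Sketch: CMAKE_POSITION_INDEPENDENT_CODE=ON makes CMake add -fPIC to the
# objects that go into static archives, which is what Gluten requires.
CONFIGURE_CMD="cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_POSITION_INDEPENDENT_CODE=ON .."
echo "${CONFIGURE_CMD}"   # run from a build/ directory inside the project
```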

Currently, we depends on this Velox commmit: **8d3e951 (Jan 18 2022)**
After Velox being successfully compiled, please refer to [GlutenUsage](GlutenUsage.md) and
use below command to compile Gluten with Velox backend.

### An example for Velox computing in Spark based on Gazelle-Jni
```shell script
mvn clean package -P full-scala-compiler -DskipTests -Dcheckstyle.skip -Dbuild_cpp=ON -Dbuild_velox=ON -Dvelox_home=${VELOX_HOME}
```

### An example for offloading Spark's computing to Velox with Gluten

TPC-H Q6 is supported in Gazelle-Jni base on Velox computing. Current support still has several limitations:
TPC-H Q1 and Q6 are supported in Gluten using Velox as the backend. Current support has the following limitations:

- Only Double type is supported.
- Only first stage of TPC-H Q6 (which occupies the most time in this query) is supported.
- Date and Long types in Velox's TableScan are not fully ready, so the related columns were converted into Double type.
- Metrics are missing.

#### Test TPC-H Q6 on Gazelle-Jni with Velox computing
#### Test TPC-H Q1 and Q6 on Gluten with Velox backend

##### Data preparation

@@ -77,6 +85,11 @@ The modified TPC-H Q6 query is:
select sum(l_extendedprice * l_discount) as revenue from lineitem where l_shipdate_new >= 8766 and l_shipdate_new < 9131 and l_discount between .06 - 0.01 and .06 + 0.01 and l_quantity < 24
```
The modified TPC-H Q1 query is:
```shell script
select l_returnflag, l_linestatus, sum(l_quantity) as sum_qty, sum(l_extendedprice) as sum_base_price, sum(l_extendedprice * (1 - l_discount)) as sum_disc_price, sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge, avg(l_quantity) as avg_qty, avg(l_extendedprice) as avg_price, avg(l_discount) as avg_disc, count(*) as count_order from lineitem where l_shipdate_new <= 10471 group by l_returnflag, l_linestatus order by l_returnflag, l_linestatus
```
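The integer literals in the two queries above are dates expressed as days since the Unix epoch, since l_shipdate was converted into a numeric l_shipdate_new column (see the limitations above). They can be sanity-checked with a small helper (assumes GNU date):

```shell script
# Sketch: convert a calendar date to days since 1970-01-01 (assumes GNU date).
days_since_epoch() { echo $(( $(date -u -d "$1" +%s) / 86400 )); }
days_since_epoch 1994-01-01   # 8766  -> Q6 lower bound on l_shipdate_new
days_since_epoch 1995-01-01   # 9131  -> Q6 upper bound
days_since_epoch 1998-09-02   # 10471 -> Q1 bound (1998-12-01 minus 90 days)
```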
The script below shows how to read the ORC data and submit the modified TPC-H Q6 query.
cat tpch_q6.scala
@@ -90,7 +103,7 @@ time{spark.sql("select sum(l_extendedprice * l_discount) as revenue from lineite
Submit test script from spark-shell.
```shell script
cat tpch_q6.scala | spark-shell --name tpch_velox_q6 --master yarn --deploy-mode client --conf spark.plugins=com.intel.oap.GazellePlugin --conf spark.driver.extraClassPath=${gazelle_jvm_jar} --conf spark.executor.extraClassPath=${gazelle_jvm_jar} --conf spark.memory.offHeap.size=20g --conf spark.sql.sources.useV1SourceList=avro --num-executors 6 --executor-cores 6 --driver-memory 20g --executor-memory 25g --conf spark.executor.memoryOverhead=5g --conf spark.driver.maxResultSize=32g
cat tpch_q6.scala | spark-shell --name tpch_velox_q6 --master yarn --deploy-mode client --conf spark.plugins=com.intel.oap.GazellePlugin --conf spark.oap.sql.columnar.backend.lib=velox --conf spark.driver.extraClassPath=${gluten_jvm_jar} --conf spark.executor.extraClassPath=${gluten_jvm_jar} --conf spark.memory.offHeap.size=20g --conf spark.sql.sources.useV1SourceList=avro --num-executors 6 --executor-cores 6 --driver-memory 20g --executor-memory 25g --conf spark.executor.memoryOverhead=5g --conf spark.driver.maxResultSize=32g
```
##### Result
@@ -99,20 +112,10 @@ cat tpch_q6.scala | spark-shell --name tpch_velox_q6 --master yarn --deploy-mode
##### Performance
Below table shows the TPC-H Q6 Performance in a multiple-thread test (--num-executors 6 --executor-cores 6) for Velox and vanilla Spark.
Below table shows the TPC-H Q1 and Q6 Performance in a multiple-thread test (--num-executors 6 --executor-cores 6) for Velox and vanilla Spark.
Both Parquet and ORC datasets are sf1024.
| TPC-H Q6 Performance | Velox (ORC) | Vanilla Spark (Parquet) | Vanilla Spark (ORC) |
| ---------- | ----------- | ------------- | ------------- |
| Time(s) | 13.6 | 21.6 | 34.9 |
| Query Performance (s) | Velox (ORC) | Vanilla Spark (Parquet) | Vanilla Spark (ORC) |
|---------------- | ----------- | ------------- | ------------- |
| TPC-H Q6 | 13.6 | 21.6 | 34.9 |
| TPC-H Q1 | 26.1 | 76.7 | 84.9 |
Binary file modified docs/image/TPC-H_Q6_DAG.png
2 changes: 1 addition & 1 deletion tools/build_arrow.sh
@@ -62,7 +62,7 @@ echo "ARROW_SOURCE_DIR=${ARROW_SOURCE_DIR}"
echo "ARROW_INSTALL_DIR=${ARROW_INSTALL_DIR}"
mkdir -p $ARROW_SOURCE_DIR
mkdir -p $ARROW_INSTALL_DIR
git clone https://github.com/oap-project/arrow.git --branch arrow-4.0.0-oap $ARROW_SOURCE_DIR
git clone https://github.com/oap-project/arrow.git --branch arrow-7.0.0-oap $ARROW_SOURCE_DIR
pushd $ARROW_SOURCE_DIR

cmake ./cpp \
