Beyond the jvm updates #123

Merged (37 commits, Apr 1, 2024)

Commits
4e93385
Update git ignore.
holdenk Mar 17, 2024
b0c5480
Start adding some shims for running accelerators.
holdenk Mar 17, 2024
8c547e5
Try and setup velox
holdenk Nov 23, 2023
a14aa7b
Update accel stuff
holdenk Mar 18, 2024
8c90b72
Get Gluten + Spark3.4 to party (note: this fails because of Gluten se…
holdenk Mar 18, 2024
9e607f7
Fix comet resolution
holdenk Mar 20, 2024
0ac92c7
Multiple extensions (Iceberg and Comet)
holdenk Mar 20, 2024
1fe4897
Turn on Comet shuffle
holdenk Mar 20, 2024
f4fe039
Style fixes
holdenk Mar 20, 2024
d9d4a39
Use version 3.4.2 and also use setup rust action for speed
holdenk Mar 20, 2024
c18210c
Seperate out setup comet so we can debug faster.
holdenk Mar 20, 2024
9f79085
Setup jdk versions for happiness.
holdenk Mar 20, 2024
f6e565b
Change caching to make sense
holdenk Mar 20, 2024
a11ec96
Work around the classloader issue we found.
holdenk Mar 20, 2024
9f57e26
shellcheck fix.
holdenk Mar 20, 2024
44e41b0
Hmm why no version.
holdenk Mar 20, 2024
4182d89
Fix version pass in for setup
holdenk Mar 20, 2024
9057522
Fix comet setup
holdenk Mar 20, 2024
b7fc797
Try and fix gluten build
holdenk Mar 20, 2024
dbff266
Style fix and statically link
holdenk Mar 20, 2024
a9a01d3
vcpkg
holdenk Mar 21, 2024
5ad5c3b
Try and fix vcpkg
holdenk Mar 21, 2024
07229d6
meh vcpkg is kind of a pain, lets skip it.
holdenk Mar 25, 2024
f716ef7
Huzzah --driver-class-path does the trick.
holdenk Mar 25, 2024
8354fe7
Make setup_gluten_deps better formated for book inclusion
holdenk Mar 25, 2024
769ffeb
Tag the gluten setup
holdenk Mar 25, 2024
65f4a98
Disable gluten SQL
holdenk Mar 25, 2024
a0fa2ba
Tag comet example for inclusion.
holdenk Mar 25, 2024
a6bd178
Add Python UDF/UDAF examples.
holdenk Mar 25, 2024
50f804c
Style fixes
holdenk Mar 25, 2024
e377e9e
Move SparkSession builder up
holdenk Mar 26, 2024
1b8c55c
style fix
holdenk Mar 26, 2024
362e17a
Fix typing import + pd.DF
holdenk Mar 26, 2024
4cd8622
Style fix
holdenk Apr 1, 2024
4d23e5e
Use axel if present
holdenk Apr 1, 2024
4d3373b
Add mypy to tox.ini so we don't depend on it being in the system setup.
holdenk Apr 1, 2024
2419a8c
Fix axel command.
holdenk Apr 1, 2024
Files changed
78 changes: 77 additions & 1 deletion .github/workflows/ci.yml
@@ -64,6 +64,76 @@ jobs:
- name: Run sql examples
run:
./run_sql_examples.sh
# run-gluten-sql-examples:
# runs-on: ubuntu-latest
# steps:
# - name: Checkout
# uses: actions/checkout@v2
# - name: Cache Spark and friends
# uses: actions/cache@v3
# with:
# path: |
# spark*.tgz
# iceberg*.jar
# key: spark-artifacts
# - name: Setup JDK
# uses: actions/setup-java@v3
# with:
# distribution: temurin
# java-version: 17
# - name: Cache Maven packages
# uses: actions/cache@v2
# with:
# path: ~/.m2
# key: ${{ runner.os }}-m2-gluten
# - name: Cache Data
# uses: actions/cache@v3
# with:
# path: |
# data/fetched/*
# key: data-fetched
# - name: Run gluten
# run:
# cd accelerators; ./gluten_spark_34_ex.sh
run-comet-sql-examples:
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v2
- name: Cache Spark and friends
uses: actions/cache@v3
with:
path: |
spark*.tgz
iceberg*.jar
key: spark-artifacts
- name: Cache Data
uses: actions/cache@v3
with:
path: |
data/fetched/*
key: data-fetched
- name: Cache Maven packages
uses: actions/cache@v2
with:
path: ~/.m2
key: ${{ runner.os }}-m2-comet
- name: Setup Rust
uses: actions-rs/toolchain@v1
with:
toolchain: stable
override: true
- name: Setup JDK
uses: actions/setup-java@v3
with:
distribution: temurin
java-version: 17
- name: Setup comet
run:
cd accelerators; SPARK_MAJOR=3.4 ./setup_comet.sh
- name: Run comet
run:
cd accelerators; ./comet_ex.sh
run-target-examples:
runs-on: ubuntu-latest
steps:
@@ -76,6 +146,12 @@ jobs:
spark*.tgz
iceberg*.jar
key: spark-artifacts
- name: Cache Accel
uses: actions/cache@v3
with:
path: |
accelerators/*.jar
key: accelerators-artifacts
- name: Cache Data
uses: actions/cache@v3
with:
@@ -114,7 +190,7 @@
- name: Shellcheck
run: |
sudo apt-get install -y shellcheck
shellcheck $(find -name "*.sh")
shellcheck -e SC2317,SC1091,SC2034,SC2164 $(find -name "*.sh")
- name: Setup JDK
uses: actions/setup-java@v3
with:
10 changes: 10 additions & 0 deletions .gitignore
@@ -85,3 +85,13 @@ spark_expectations_sample_rules.json
# more python
pyspark_venv.tar.gz
pyspark_venv/

# accel stuff
accelerators/*.jar
accelerators/arrow-datafusion-comet
# ignore gluten
gluten
gluten*.jar
spark-3*hadoop*/
spark-3*hadoop*.tgz
accelerators/incubator-gluten
20 changes: 20 additions & 0 deletions accelerators/comet_env_setup.sh
@@ -0,0 +1,20 @@
#!/bin/bash

SPARK_EXTRA="
--jars ${COMET_JAR} \
--driver-class-path ${COMET_JAR} \
--conf spark.comet.enabled=true \
--conf spark.comet.exec.enabled=true \
--conf spark.comet.exec.all.enabled=true \
--conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
--conf spark.comet.exec.shuffle.enabled=true \
--conf spark.comet.columnar.shuffle.enabled=true"
# Instead of --conf spark.sql.extensions=org.apache.comet.CometSparkSessionExtensions we set
# EXTRA_EXTENSIONS so it can be appended to iceberg
if [ -z "$EXTRA_EXTENSIONS" ]; then
EXTRA_EXTENSIONS="org.apache.comet.CometSparkSessionExtensions"
else
EXTRA_EXTENSIONS="org.apache.comet.CometSparkSessionExtensions,$EXTRA_EXTENSIONS"
fi
export EXTRA_EXTENSIONS
export SPARK_EXTRA
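
These exports are consumed by the repo's driver scripts rather than used here; that consumption is not part of this diff, so the following is only a minimal sketch, assuming the flags get combined with the Iceberg session extension roughly like this:

# Sketch of a hypothetical consumer of comet_env_setup.sh; the spark-sql invocation
# and the Iceberg extension class are assumptions, only the Comet settings come from above.
source comet_env_setup.sh
# SPARK_EXTRA is left unquoted so it word-splits into individual flags.
# shellcheck disable=SC2086
"${SPARK_HOME}/bin/spark-sql" --master local[5] \
  --conf "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,${EXTRA_EXTENSIONS}" \
  ${SPARK_EXTRA} \
  -e "SELECT 1"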
16 changes: 16 additions & 0 deletions accelerators/comet_ex.sh
@@ -0,0 +1,16 @@
#!/bin/bash
set -ex

# If you change this update the workflow version too.
SPARK_MAJOR=${SPARK_MAJOR:-3.4}
SPARK_VERSION=3.4.2
export SPARK_MAJOR
export SPARK_VERSION

source setup_comet.sh
pushd ..
source ./env_setup.sh
popd
source comet_env_setup.sh
pushd ..
USE_COMET="true" ./run_sql_examples.sh
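
For reference, running this locally mirrors the run-comet-sql-examples CI job above; SPARK_MAJOR already defaults to 3.4 inside the script, so overriding it is optional:

# Local run, matching what .github/workflows/ci.yml does.
cd accelerators
SPARK_MAJOR=3.4 ./comet_ex.sh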
5 changes: 5 additions & 0 deletions accelerators/gluten_config.properties
@@ -0,0 +1,5 @@
spark.plugins=io.glutenproject.GlutenPlugin
spark.memory.offHeap.enabled=true
spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager
# This static allocation is one of the hardest parts of using Gluten
spark.memory.offHeap.size=20g
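
Since the off-heap size has to be fixed before the JVM starts, one way to avoid hard-coding 20g is to compute it at launch time. This is only a sketch under an assumed heuristic (half of physical RAM); the property names are the same ones used in the file above:

# Sizing sketch: derive spark.memory.offHeap.size from the machine instead of hard-coding it.
TOTAL_MB=$(awk '/MemTotal/ {print int($2/1024)}' /proc/meminfo)
OFFHEAP_MB=$((TOTAL_MB / 2))
"${SPARK_HOME}/bin/spark-sql" --master local[5] \
  --conf spark.plugins=io.glutenproject.GlutenPlugin \
  --conf spark.memory.offHeap.enabled=true \
  --conf "spark.memory.offHeap.size=${OFFHEAP_MB}m" \
  --conf spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager \
  -e "SELECT 1"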
31 changes: 31 additions & 0 deletions accelerators/gluten_env_setup.sh
@@ -0,0 +1,31 @@
#!/bin/bash

# Check whether the Gluten JAR and the Gluten UDF native library are present
GLUTEN_NATIVE_LIB_NAME=libhigh-performance-spark-gluten-0.so
NATIVE_LIB_DIR=$(pwd)/../native/src/
NATIVE_LIB_PATH="${NATIVE_LIB_DIR}${GLUTEN_NATIVE_LIB_NAME}"
GLUTEN_HOME=incubator-gluten
source /etc/lsb-release
if [ -n "$GLUTEN_JAR_PATH" ]; then
GLUTEN_EXISTS="true"
GLUTEN_SPARK_EXTRA="--conf spark.plugins=io.glutenproject.GlutenPlugin \
--conf spark.memory.offHeap.enabled=true \
--conf spark.memory.offHeap.size=5g \
--conf spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager \
--jars ${GLUTEN_JAR_PATH}"
fi
if [ -f "${NATIVE_LIB_PATH}" ]; then
if [ "$GLUTEN_EXISTS" == "true" ]; then
GLUTEN_UDF_EXISTS="true"
GLUTEN_SPARK_EXTRA="$GLUTEN_SPARK_EXTRA \
--conf spark.jars=${GLUTEN_JAR_PATH} \
--conf spark.gluten.loadLibFromJar=true \
--files ${NATIVE_LIB_PATH} \
--conf spark.gluten.sql.columnar.backend.velox.udfLibraryPaths=${GLUTEN_NATIVE_LIB_NAME}"
fi
fi
SPARK_EXTRA="$GLUTEN_SPARK_EXTRA"

export SPARK_EXTRA
export GLUTEN_UDF_EXISTS
export GLUTEN_EXISTS
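
The calling script is not part of this diff, so here is just a hedged sketch of how a caller might use these exports:

# Assumed caller: source the setup, then pass the accumulated flags through.
source gluten_env_setup.sh
if [ "$GLUTEN_EXISTS" == "true" ]; then
  # SPARK_EXTRA is intentionally unquoted so it word-splits into separate arguments.
  # shellcheck disable=SC2086
  "${SPARK_HOME}/bin/spark-sql" --master local[5] ${SPARK_EXTRA} -e "SELECT 1"
fi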
22 changes: 22 additions & 0 deletions accelerators/gluten_spark_34_ex.sh
@@ -0,0 +1,22 @@
#!/bin/bash

set -ex

SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
cd "${SCRIPT_DIR}"
source "${SCRIPT_DIR}/setup_gluten_spark34.sh"

export SPARK_HOME
PATH="$(pwd)/${SPARK_DIR}/bin:$PATH"
export PATH
"${SPARK_HOME}/bin/spark-sql" --master local[5] \
--conf spark.plugins=io.glutenproject.GlutenPlugin \
--conf spark.memory.offHeap.enabled=true \
--conf spark.memory.offHeap.size=5g \
--jars "${GLUTEN_JAR}" \
--conf spark.eventLog.enabled=true \
-e "SELECT 1"

source gluten_env_setup.sh
cd ..
./run_sql_examples.sh || echo "Expected to fail"
9 changes: 9 additions & 0 deletions accelerators/install_rust_if_needed.sh
@@ -0,0 +1,9 @@
#!/bin/bash
if [ -f "$HOME/.cargo/env" ]; then
source "$HOME/.cargo/env"
fi

if ! command -v cargo &> /dev/null; then
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
source "$HOME/.cargo/env"
fi
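
This helper is meant to be sourced rather than executed, so the rustup environment persists in the caller (setup_comet.sh below does exactly that). A minimal usage check, as a sketch:

source install_rust_if_needed.sh
cargo --version  # should resolve now, either from an existing install or from rustup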
3 changes: 3 additions & 0 deletions accelerators/run_gluten.sh
@@ -0,0 +1,3 @@
#!/bin/bash

"${SPARK_HOME}/bin/spark-shell" --master local --jars "${ACCEL_JARS}/gluten-velox-bundle-spark${SPARK_MAJOR_VERSION}_2.12-1.1.1.jar" --spark-properties=gluten_config.properties
27 changes: 27 additions & 0 deletions accelerators/setup_comet.sh
@@ -0,0 +1,27 @@
#!/bin/bash

set -ex
source install_rust_if_needed.sh

if [ -z "${SPARK_MAJOR}" ]; then
echo "Need a spark major version specified."
exit 1
else
echo "Building comet for Spark ${SPARK_MAJOR}"
fi

#tag::build[]
# If we don't have arrow-datafusion-comet checked out, clone it
if [ ! -d arrow-datafusion-comet ]; then
git clone https://github.com/apache/arrow-datafusion-comet.git
fi

# Build JAR if not present
if [ -z "$(ls arrow-datafusion-comet/spark/target/comet-spark-spark*.jar)" ]; then
cd arrow-datafusion-comet
make clean release PROFILES="-Pspark-${SPARK_MAJOR}"
cd ..
fi
COMET_JAR="$(pwd)/$(ls arrow-datafusion-comet/spark/target/comet-spark-spark*SNAPSHOT.jar)"
export COMET_JAR
#end::build[]
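
After the build, one way to sanity-check the jar is to ask for a physical plan and look for Comet operators. The settings below mirror comet_env_setup.sh; treating a case-insensitive "comet" match in the EXPLAIN output as success is an assumption about how Comet names its operators:

# Smoke-test sketch, not part of the build script.
"${SPARK_HOME}/bin/spark-sql" \
  --jars "${COMET_JAR}" \
  --driver-class-path "${COMET_JAR}" \
  --conf spark.sql.extensions=org.apache.comet.CometSparkSessionExtensions \
  --conf spark.comet.enabled=true \
  --conf spark.comet.exec.enabled=true \
  -e "EXPLAIN SELECT sum(id) FROM range(1000)" | grep -i comet \
  || echo "No Comet operators found in the plan"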
14 changes: 14 additions & 0 deletions accelerators/setup_gluten_deps.sh
@@ -0,0 +1,14 @@
#!/bin/bash
set -ex

sudo apt-get update
#tag::gluten_deps[]
sudo apt-get install -y locales wget tar tzdata git ccache cmake ninja-build build-essential \
llvm-dev clang libiberty-dev libdwarf-dev libre2-dev libz-dev libssl-dev libboost-all-dev \
libcurl4-openssl-dev maven rapidjson-dev libdouble-conversion-dev libgflags-dev \
libsodium-dev libsnappy-dev nasm
sudo apt install -y libunwind-dev
sudo apt-get install -y libgoogle-glog-dev
sudo apt-get -y install docker-compose
sudo apt-get install -y libre2-9 || sudo apt-get install -y libre2-10
#end::gluten_deps[]
23 changes: 23 additions & 0 deletions accelerators/setup_gluten_from_src.sh
@@ -0,0 +1,23 @@
#!/bin/bash
set -ex

# Setup deps
source setup_gluten_deps.sh

# Try gluten w/clickhouse
#if [ ! -d gluten ]; then
# git clone https://github.com/oap-project/gluten.git
# cd gluten
# bash ./ep/build-clickhouse/src/build_clickhouse.sh
#fi

# Build gluten
if [ ! -d gluten ]; then
# We need Spark 3.5 w/scala212
git clone [email protected]:holdenk/gluten.git || git clone https://github.com/holdenk/gluten.git
cd gluten
git checkout add-spark35-scala213-hack
./dev/builddeps-veloxbe.sh
mvn clean package -Pbackends-velox -Pspark-3.5 -DskipTests
cd ..
fi
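
Once the Maven build finishes, the bundle jar should land under package/target (the same location setup_gluten_spark34.sh globs for); the exact artifact name below is an assumption:

# Locate the freshly built bundle; adjust the glob if the artifact name differs.
ls gluten/package/target/gluten-*.jar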
56 changes: 56 additions & 0 deletions accelerators/setup_gluten_spark34.sh
@@ -0,0 +1,56 @@
#!/bin/bash

mkdir -p /tmp/spark-events
SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
ACCEL_JARS=${SCRIPT_DIR}
SPARK_MAJOR_VERSION=3.4
SCALA_VERSION=${SCALA_VERSION:-"2.12"}

set -ex

# Note: this does not work on Ubuntu 23, only on 22
# You might get something like:
# # C [libgluten.so+0x30c753] gluten::Runtime::registerFactory(std::string const&, std::function<gluten::Runtime* (std::unordered_map<std::string, std::string, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, std::string> > > const&)>)+0x23


SPARK_VERSION=3.4.2
SPARK_MAJOR=3.4
HADOOP_VERSION=3
SPARK_DIR="spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}"
SPARK_FILE="${SPARK_DIR}.tgz"

export SPARK_MAJOR
export SPARK_VERSION

source setup_gluten_deps.sh

cd ..
source /etc/lsb-release
# Pre-baked only
if [ "$DISTRIB_RELEASE" == "20.04" ]; then
source ./env_setup.sh
cd "${SCRIPT_DIR}"

GLUTEN_JAR="gluten-velox-bundle-spark${SPARK_MAJOR_VERSION}_${SCALA_VERSION}-1.1.0.jar"
GLUTEN_JAR_PATH="${SCRIPT_DIR}/gluten-velox-bundle-spark${SPARK_MAJOR_VERSION}_${SCALA_VERSION}-1.1.0.jar"

if [ ! -f "${GLUTEN_JAR_PATH}" ]; then
wget "https://github.com/oap-project/gluten/releases/download/v1.1.0/${GLUTEN_JAR}" || unset GLUTEN_JAR_PATH
fi

fi
# Rather than using if/else, we fall through to building from source if the wget fails because the major version is not supported.
if [ -z "$GLUTEN_JAR_PATH" ]; then
#tag::build_gluten[]
if [ ! -d incubator-gluten ]; then
git clone https://github.com/apache/incubator-gluten.git
fi
cd incubator-gluten
sudo ./dev/builddeps-veloxbe.sh --enable_s3=ON
mvn clean package -Pbackends-velox -Pspark-3.4 -DskipTests
GLUTEN_JAR_PATH="$(pwd)/package/target/gluten-package-*-SNAPSHOT-${SPARK_MAJOR_VERSION}.jar"
#end::build_gluten[]
fi

export GLUTEN_JAR_PATH
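
Because GLUTEN_JAR_PATH can end up unset (download and build both failed) or as an unexpanded glob (the build branch), a caller may want to verify it resolves to a real file before launching anything; a small sketch:

# Sanity check only; GLUTEN_JAR_PATH is deliberately left unquoted so a glob can expand.
# shellcheck disable=SC2086
if [ -n "${GLUTEN_JAR_PATH}" ] && ls ${GLUTEN_JAR_PATH} > /dev/null 2>&1; then
  echo "Using Gluten bundle: $(ls ${GLUTEN_JAR_PATH})"
else
  echo "No usable Gluten bundle found" >&2
fi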

2 changes: 2 additions & 0 deletions c
@@ -0,0 +1,2 @@
bloop
