
Kylin on Parquet Development Document

Source code

git clone https://github.com/Kyligence/kylin-on-parquet-v2.git
# Compile 
mvn clean install -DskipTests
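
If you only changed one module, Maven can rebuild just that module together with the local modules it depends on. A minimal sketch, assuming the standard Maven reactor layout of this repository:

# Rebuild only the spark engine module and the local modules it depends on
mvn clean install -DskipTests -pl kylin-spark-project/kylin-spark-engine -am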

Modules

  • kylin-spark-project
    All Parquet-related code
    • kylin-spark-engine
      cube build engine
    • kylin-spark-common
      common utils
    • kylin-spark-metadata
      Parquet metadata
    • kylin-spark-query
      query engine
    • kylin-spark-test
      integration test cases
  • parquet-assembly
    packages the job jar

Environment

  • Download Spark (the community version is not supported for now)

    # spark version is spark-2.4.1-os-kylin-r3
    wget https://download-resource.s3.cn-north-1.amazonaws.com.cn/osspark/spark-2.4.1-os-kylin-r3.tgz
  • If you submit Spark jobs through a VPN, you may need to set the following property in ${SPARK_HOME}/conf/spark-env.sh

    # or set it as a system environment variable
    SPARK_LOCAL_IP=${VPN_LOCAL_IP}
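
After downloading, a minimal sketch for unpacking Spark and pointing SPARK_HOME at it; the extracted directory name is an assumption based on the tarball name:

tar -zxvf spark-2.4.1-os-kylin-r3.tgz
# Point SPARK_HOME at the extracted directory; adjust the path to your machine
export SPARK_HOME=$(pwd)/spark-2.4.1-os-kylin-r3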

How to package && deploy

cd ${KYLIN_SOURCE_CODE}
# For HDP2.x
./build/script/package.sh

# For CDH5.7
./build/script/package.sh -P cdh5.7
# Once finished, the package will be available in the directory ${KYLIN_SOURCE_CODE}/dist/

# If running on HDP, you need to uncomment the following properties in kylin.properties
kylin.engine.spark-conf.spark.driver.extraJavaOptions=-Dhdp.version=current
kylin.engine.spark-conf.spark.yarn.am.extraJavaOptions=-Dhdp.version=current
kylin.engine.spark-conf.spark.executor.extraJavaOptions=-Dhdp.version=current
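
Once packaging finishes, a minimal deployment sketch; the exact tarball name under dist/ depends on the build, so the wildcard and the version placeholder below are assumptions:

# Unpack the binary package and start Kylin
tar -zxvf dist/apache-kylin-*.tar.gz -C /usr/local/
export KYLIN_HOME=/usr/local/apache-kylin-<version>    # adjust to the extracted directory name
$KYLIN_HOME/bin/kylin.sh start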

Configuration

# Most properties are still supported
# As the query engine runs on Spark, Spark configuration is exposed for queries; configure it by adding properties like the following:
kylin.query.spark-conf.spark.executor.cores=5
# The cube build engine only supports Spark; its Spark configuration is set as follows:
kylin.engine.spark-conf.spark.executor.cores=5
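
A short illustrative kylin.properties snippet combining the two prefixes above; the extra keys and values here (memory, instances) are example assumptions, not tuning recommendations:

# Query engine Spark settings (prefix: kylin.query.spark-conf.)
kylin.query.spark-conf.spark.executor.cores=5
kylin.query.spark-conf.spark.executor.memory=4G
# Cube build engine Spark settings (prefix: kylin.engine.spark-conf.)
kylin.engine.spark-conf.spark.executor.cores=5
kylin.engine.spark-conf.spark.executor.instances=4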

Debug

There are two ways to debug locally without connecting to a sandbox:

  • UT. All test cases support debugging with local metadata, e.g. the cube build UT SparkCubingJobTest at ${KYLIN_SOURCE_ROOT}/kylin-spark-project/kylin-spark-engine/src/test/java/org/apache/kylin/engine/spark/job/SparkCubingJobTest.java (a command-line sketch for running it follows this section)

    • testBuildJob()
      builds a cube and checks the Parquet files

    • testBuildTwoSegmentsAndMerge()
      builds two segments, merges them, and checks the Parquet files

  • Debug with Tomcat without a Hadoop environment (note that in this mode only the local CSV data source can be used; Hive tables are not available)

    1. Clone and compile the Kylin source code; suppose the path of the source code is $KYLIN_SOURCE_DIR
    git clone https://github.com/Kyligence/kylin-on-parquet-v2.git
    # Compile
    mvn clean install -DskipTests
    2. Copy WEB-INF under server/src/main/webapp/WEB-INF to webapp/app/WEB-INF
    cd $KYLIN_SOURCE_DIR
    cp -r server/src/main/webapp/WEB-INF webapp/app/WEB-INF
    3. Install the dependencies of the web app. Please confirm npm is installed on your machine; if not, refer to https://www.npmjs.com/get-npm
    cd $KYLIN_SOURCE_DIR/webapp
    npm install -g bower
    bower --allow-root install
    4. Open the Kylin project with your IDE (IntelliJ IDEA)

    5. Open the config file for local debug at "$KYLIN_SOURCE_DIR/examples/test_case_data/sandbox/kylin.properties" and configure the items below:

    • Set kylin.metadata.url to a path of your local metadata, or use the Kylin local test metadata in "$KYLIN_SOURCE_DIR/examples/test_case_data/parquet_test"
    • Set kylin.env.zookeeper-is-local=true
    • Set kylin.storage.url to a path on your local machine, e.g. kylin.storage.url=/tmp/kylin; you should create the folder first
    • Set kylin.env.hdfs-working-dir to a path on your local machine with the "file://" prefix, e.g. kylin.env.hdfs-working-dir=file:///tmp/kylin_data
    • Set kylin.engine.spark-conf.spark.master to local mode: kylin.engine.spark-conf.spark.master=local
    • Set kylin.engine.spark-conf.spark.eventLog.dir to a path on your local machine for the Spark event log, e.g. kylin.engine.spark-conf.spark.eventLog.dir=/tmp/spark-history; you should create the folder first

    6. Open "Run->Debug Configurations", set the main class to org.apache.kylin.rest.DebugTomcat, set "VM options" to -Dspark.local=true, set "Working directory" to $MODULE_WORKING_DIR$, and toggle the option "Include dependencies with 'Provided' scope". Then press the Debug button.

    7. If all goes well, a Kylin instance is started on your local machine; log in with user name "ADMIN" and the default password "KYLIN"

    8. Create a project

    9. Load a CSV data source by pressing "Data Source->Load CSV File as Table" on the "Model" page and set the schema for your table, then press "Submit" to save

    10. Design your model and cube on the "Model" page; please refer to http://kylin.apache.org/docs/tutorial/create_cube.html

    11. Build the cube with some time range

    12. Monitor the cubing job

    13. After the cube is built, it is stored as Parquet files

    14. Query the cube data on the "Insight" page
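
As referenced in the UT bullet above, the cube build UT can also be run from a shell before attaching a debugger. A minimal sketch, assuming the module path shown earlier and standard Maven Surefire behavior:

cd ${KYLIN_SOURCE_ROOT}/kylin-spark-project/kylin-spark-engine
# Run only the SparkCubingJobTest class via Surefire's -Dtest filter
mvn test -Dtest=SparkCubingJobTest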