Version: 1.0.4
API Scaladoc: SparkHelper
This library provides a set of low-level, basic methods for data processing with Scala Spark. It is composed of 5 modules:
- HdfsHelper: Wrapper around the Apache Hadoop FileSystem API (org.apache.hadoop.fs.FileSystem) for file manipulations on HDFS.
- SparkHelper: HDFS file manipulations through the Spark API.
- DateHelper: Wrapper around joda-time for date manipulations.
- FieldChecker: Validation of stringified fields.
- Monitor: Custom Spark monitoring/logging and KPI validation.
The goal is to remove as much as possible of the highly used and highly duplicated low-level code from Spark job code, and to replace it with fully tested methods whose names are self-explanatory and readable.
The full list of methods is available at HdfsHelper.
Contains basic file-related methods, mostly based on the Apache Hadoop FileSystem API (org.apache.hadoop.fs.FileSystem).
A few examples:
import com.spark_helper.HdfsHelper
// A bunch of methods wrapping the FileSystem API, such as:
HdfsHelper.fileExists("my/hdfs/file/path.txt")
assert(HdfsHelper.listFileNamesInFolder("my/folder/path") == List("file_name_1.txt", "file_name_2.csv"))
assert(HdfsHelper.getFileModificationDate("my/hdfs/file/path.txt") == "20170306")
assert(HdfsHelper.getNbrOfDaysSinceFileWasLastModified("my/hdfs/file/path.txt") == 3)
// Some XML helpers for Hadoop as well:
HdfsHelper.isHdfsXmlCompliantWithXsd("my/hdfs/file/path.xml", getClass.getResource("/some_xml.xsd"))
The full list of methods is available at SparkHelper.
Contains basic file/RDD-related methods based on the Spark APIs.
A few examples:
import com.spark_helper.SparkHelper
// Same as RDD.saveAsTextFile, but the result is stored in a single file:
SparkHelper.saveAsSingleTextFile(myOutputRDD, "/my/output/file/path.txt")
// Same as SparkContext.textFile, but instead of reading one record per line,
// it reads records spread over several lines:
SparkHelper.textFileWithDelimiter("/my/input/folder/path", sparkContext, "---\n")
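// For illustration (hypothetical file content), if the input file contained:
//   record 1, line 1
//   record 1, line 2
//   ---
//   record 2, line 1
// the resulting RDD[String] would hold one element per "---"-delimited record
// instead of one element per line.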
The full list of methods is available at DateHelper.
Wrapper around joda-time for date manipulations.
A few examples:
import com.spark_helper.DateHelper
assert(DateHelper.daysBetween("20161230", "20170101") == List("20161230", "20161231", "20170101"))
assert(DateHelper.today() == "20170310") // If today's "20170310"
assert(DateHelper.yesterday() == "20170309") // If today's "20170310"
assert(DateHelper.reformatDate("20170327", "yyyyMMdd", "yyMMdd") == "170327")
assert(DateHelper.now("HH:mm") == "10:24")
The full list of methods is available at FieldChecker.
Validation (before casting) of stringified fields:
A few examples:
import com.spark_helper.FieldChecker
assert(FieldChecker.isInteger("15"))
assert(!FieldChecker.isInteger("1.5"))
assert(FieldChecker.isInteger("-1"))
assert(FieldChecker.isStrictlyPositiveInteger("123"))
assert(!FieldChecker.isYyyyMMddDate("20170333"))
assert(FieldChecker.isCurrencyCode("USD"))
The full list of methods is available at Monitor.
It's a simple logger/report that you update during your job. It contains a report (a simple string) that you can append to, and a success boolean that can be updated to reflect the status of your job. At the end of your job, you can store the report in HDFS.
Have a look at the scaladoc for a cool example.
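For illustration only, here is a minimal sketch of that workflow. The import path, constructor and method names below (updateReportWithSuccess, updateReportWithError, saveReport) are assumptions made for this example; refer to the Monitor scaladoc for the actual API:
import com.spark_helper.monitoring.Monitor // assumed import path

// Assumed constructor and method names, for illustration only:
val monitor = new Monitor("My Spark Job")

// Update the report (and its success boolean) as the job progresses:
if (mainProcessingWentFine) // hypothetical boolean produced by your job
  monitor.updateReportWithSuccess("Main processing")
else
  monitor.updateReportWithError("Main processing")

// At the end of the job, store the report on HDFS:
monitor.saveReport("my/hdfs/report/folder")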
With sbt, just add this one line to your build.sbt:
libraryDependencies += "spark_helper" % "spark_helper" % "1.0.4" from "https://github.com/xavierguihot/spark_helper/releases/download/v1.0.4/spark_helper-1.0.4.jar"
Or, to build the project yourself, with sbt:
sbt assembly
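The assembled jar can then be provided to your Spark jobs, for instance with spark-submit (the jar name and paths below are only illustrative and depend on your project layout):
spark-submit --jars /path/to/spark_helper-1.0.4.jar --class com.my.company.MyJob my_job.jar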