Spark Plugin

This plugin enables Assertainty integration with Apache Spark, using the Kotlin Spark API. It is parameterized in org.apache.spark.sql.Dataset and org.apache.spark.sql.Column.

Gradle

testImplementation("io.github.peterattardo.assertainty:spark-plugin:0.2.0")
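For context, the coordinate above belongs in the test dependencies of a Gradle Kotlin DSL build. A minimal sketch of such a block, assuming Spark itself is not already provided elsewhere on the test classpath (the Spark artifact and versions below are illustrative, not prescribed by the plugin):

dependencies {
    testImplementation("io.github.peterattardo.assertainty:spark-plugin:0.2.0")
    // Illustrative: bring in Spark for local tests if it is not already available transitively.
    testImplementation("org.apache.spark:spark-sql_2.12:3.3.0")
}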

Usage

val ds = // create a dataset (see the sketch after these examples for one way)
ds.assert {
    +Column("someColumn") // grouping column
    +"someOtherColumn" // the Spark DSL adds this convenience function to the core DSL to specify grouping columns by String.
    
    // because the plugin is parameterized in org.apache.spark.sql.Column, assertions can take full advantage of that class's methods
    always(functions.length(Column("someIdColumn")) eq 15)
}

// Logically identical to assert, but under the hood each assertion runs as its own call to RelationalGroupedDataset#agg()
ds.assertSeparateQueries {
    +"someColumn"
    +"someOtherColumn"
    
    minSum(Column("revenue"), 100_000) // we're making good money, eh?
    minCount(100) // averaging $1000/sale is impressive
}
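The ds placeholder above can be any Dataset. As one way to obtain it, here is a minimal sketch using a plain local SparkSession; the app name, master, and input path are illustrative:

import org.apache.spark.sql.Dataset
import org.apache.spark.sql.Row
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
    .appName("assertainty-example") // illustrative
    .master("local[*]")
    .getOrCreate()

val ds: Dataset<Row> = spark.read().parquet("path/to/sales") // hypothetical input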

Note

Like the other plugins, the Spark plugin defaults to generating a single combined query. Because duplicate columns between assertions are likely (count() in particular), all columns are aliased during the query-building process. To avoid that behavior, use assertSeparateQueries, which gives each assertion its own aggregation call, at the cost of more passes over the data and slower execution.
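To make the trade-off concrete, here is roughly what the two strategies look like when written directly against the Spark API. This is a conceptual sketch, not the plugin's actual query builder, and the alias names are made up:

// Combined: one pass over the data, every aggregate in a single agg() call,
// with aliases keeping duplicate columns (such as two counts) distinct.
val combined = ds.groupBy(Column("someColumn"), Column("someOtherColumn"))
    .agg(
        functions.sum(Column("revenue")).alias("assertion_0_sum"),
        functions.count(functions.lit(1)).alias("assertion_1_count")
    )

// Separate: one agg() call per assertion, so the data is scanned once per assertion.
val sums = ds.groupBy(Column("someColumn"), Column("someOtherColumn"))
    .agg(functions.sum(Column("revenue")))
val counts = ds.groupBy(Column("someColumn"), Column("someOtherColumn"))
    .agg(functions.count(functions.lit(1)))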