[SPARK-13019][Docs] Replace example code in mllib-statistics.md using include_example #11108
Closed
24 commits
49b7012
[SPARK-13019] replace for summary statistics, scala code
keypointt 83592bc
[SPARK-13019] test out on/off, for import part
keypointt 069341b
[SPARK-13019] create separate example files, but cannot compile yet
keypointt 2058b16
[SPARK-13019] move new files into mllib folder
keypointt b328542
[SPARK-13019] remote python init files
keypointt 12fda2b
[SPARK-13019] comment broken code to pass compile process
keypointt 2abfaa9
[SPARK-13019] remove code block tag
keypointt 157da53
[SPARK-13019] make commented code explicit in html content
keypointt 323304f
[SPARK-13019] Stratified Sampling working
keypointt 3692d30
[SPARK-13019] hypothesis testing working
keypointt 89c3d2e
[SPARK-13019] Hypothesis Testing Kolmogorov Smirnov Test Example is w…
keypointt 4dbbc6d
[SPARK-13019] remove empty lines
keypointt f024fc3
[SPARK-13019] random data generation example working
keypointt 6f949cd
[SPARK-13019] Kernel Density Estimation Example is working
keypointt a4dd0fb
[SPARK-13019] code style check
keypointt 3a11802
[SPARK-13019] fix python style
keypointt 0df3e65
[SPARK-13019] remove setMaster, change java to 2-indent
keypointt d817d0b
[SPARK-13019] more java style fix
keypointt f945222
[SPARK-13019] mainly re-organize java import
keypointt aec10ca
[SPARK-13019] re-organize python import
keypointt e2737ee
[SPARK-13019] code review improvement
keypointt 3329394
[SPARK-13019] sorry, forget to delete python file
keypointt acf7096
[SPARK-13019] removing '-'s
keypointt a4eb28d
[SPARK-13019] use asList() for concise code
keypointt
70 changes: 70 additions & 0 deletions
examples/src/main/java/org/apache/spark/examples/mllib/JavaCorrelationsExample.java
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,70 @@ | ||
/* | ||
* Licensed to the Apache Software Foundation (ASF) under one or more | ||
* contributor license agreements. See the NOTICE file distributed with | ||
* this work for additional information regarding copyright ownership. | ||
* The ASF licenses this file to You under the Apache License, Version 2.0 | ||
* (the "License"); you may not use this file except in compliance with | ||
* the License. You may obtain a copy of the License at | ||
* | ||
* http://www.apache.org/licenses/LICENSE-2.0 | ||
* | ||
* Unless required by applicable law or agreed to in writing, software | ||
* distributed under the License is distributed on an "AS IS" BASIS, | ||
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
* See the License for the specific language governing permissions and | ||
* limitations under the License. | ||
*/ | ||
|
||
package org.apache.spark.examples.mllib; | ||
|
||
import org.apache.spark.SparkConf; | ||
import org.apache.spark.api.java.JavaSparkContext; | ||
// $example on$ | ||
import java.util.Arrays; | ||
|
||
import org.apache.spark.api.java.JavaDoubleRDD; | ||
import org.apache.spark.api.java.JavaRDD; | ||
import org.apache.spark.mllib.linalg.Matrix; | ||
import org.apache.spark.mllib.linalg.Vector; | ||
import org.apache.spark.mllib.linalg.Vectors; | ||
import org.apache.spark.mllib.stat.Statistics; | ||
// $example off$ | ||
|
||
public class JavaCorrelationsExample { | ||
public static void main(String[] args) { | ||
|
||
SparkConf conf = new SparkConf().setAppName("JavaCorrelationsExample"); | ||
JavaSparkContext jsc = new JavaSparkContext(conf); | ||
|
||
// $example on$ | ||
JavaDoubleRDD seriesX = jsc.parallelizeDoubles( | ||
Arrays.asList(1.0, 2.0, 3.0, 3.0, 5.0)); // a series | ||
|
||
// must have the same number of partitions and cardinality as seriesX | ||
JavaDoubleRDD seriesY = jsc.parallelizeDoubles( | ||
Arrays.asList(11.0, 22.0, 33.0, 33.0, 555.0)); | ||
|
||
// compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. | ||
// If a method is not specified, Pearson's method will be used by default. | ||
Double correlation = Statistics.corr(seriesX.srdd(), seriesY.srdd(), "pearson"); | ||
System.out.println("Correlation is: " + correlation); | ||
|
||
// note that each Vector is a row and not a column | ||
JavaRDD<Vector> data = jsc.parallelize( | ||
Arrays.asList( | ||
Vectors.dense(1.0, 10.0, 100.0), | ||
Vectors.dense(2.0, 20.0, 200.0), | ||
Vectors.dense(5.0, 33.0, 366.0) | ||
) | ||
); | ||
|
||
// calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method. | ||
// If a method is not specified, Pearson's method will be used by default. | ||
Matrix correlMatrix = Statistics.corr(data.rdd(), "pearson"); | ||
System.out.println(correlMatrix.toString()); | ||
// $example off$ | ||
|
||
jsc.stop(); | ||
} | ||
} | ||
|
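For readers without a Spark cluster at hand, the Pearson coefficient that `Statistics.corr(seriesX, seriesY, "pearson")` reports can be checked with a few lines of plain Java. This is a minimal sketch, not Spark's implementation: the class name `PearsonSketch` and the method `pearson` are hypothetical, and the data is copied from the example above.

```java
// Hypothetical stand-alone helper; not part of Spark or of this PR.
class PearsonSketch {

  /** Pearson correlation: covariance divided by the product of standard deviations. */
  static double pearson(double[] x, double[] y) {
    if (x.length != y.length || x.length == 0) {
      throw new IllegalArgumentException("series must be non-empty and of equal length");
    }
    int n = x.length;
    double meanX = 0.0;
    double meanY = 0.0;
    for (int i = 0; i < n; i++) {
      meanX += x[i];
      meanY += y[i];
    }
    meanX /= n;
    meanY /= n;

    // Accumulate the centered cross product and the centered squares.
    double cov = 0.0;
    double varX = 0.0;
    double varY = 0.0;
    for (int i = 0; i < n; i++) {
      double dx = x[i] - meanX;
      double dy = y[i] - meanY;
      cov += dx * dy;
      varX += dx * dx;
      varY += dy * dy;
    }
    // The 1/n factors cancel between numerator and denominator.
    return cov / Math.sqrt(varX * varY);
  }

  public static void main(String[] args) {
    double[] seriesX = {1.0, 2.0, 3.0, 3.0, 5.0};
    double[] seriesY = {11.0, 22.0, 33.0, 33.0, 555.0};
    System.out.println("Correlation is: " + pearson(seriesX, seriesY));
  }
}
```

For the two series in the example this evaluates to roughly 0.85; the outlier 555.0 keeps the value well below 1 even though the first four points are perfectly linear.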
84 changes: 84 additions & 0 deletions
examples/src/main/java/org/apache/spark/examples/mllib/JavaHypothesisTestingExample.java
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,84 @@ | ||
/* | ||
* Licensed to the Apache Software Foundation (ASF) under one or more | ||
* contributor license agreements. See the NOTICE file distributed with | ||
* this work for additional information regarding copyright ownership. | ||
* The ASF licenses this file to You under the Apache License, Version 2.0 | ||
* (the "License"); you may not use this file except in compliance with | ||
* the License. You may obtain a copy of the License at | ||
* | ||
* http://www.apache.org/licenses/LICENSE-2.0 | ||
* | ||
* Unless required by applicable law or agreed to in writing, software | ||
* distributed under the License is distributed on an "AS IS" BASIS, | ||
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
* See the License for the specific language governing permissions and | ||
* limitations under the License. | ||
*/ | ||
|
||
package org.apache.spark.examples.mllib; | ||
|
||
import org.apache.spark.SparkConf; | ||
import org.apache.spark.api.java.JavaSparkContext; | ||
|
||
// $example on$ | ||
import java.util.Arrays; | ||
|
||
import org.apache.spark.api.java.JavaRDD; | ||
import org.apache.spark.mllib.linalg.Matrices; | ||
import org.apache.spark.mllib.linalg.Matrix; | ||
import org.apache.spark.mllib.linalg.Vector; | ||
import org.apache.spark.mllib.linalg.Vectors; | ||
import org.apache.spark.mllib.regression.LabeledPoint; | ||
import org.apache.spark.mllib.stat.Statistics; | ||
import org.apache.spark.mllib.stat.test.ChiSqTestResult; | ||
// $example off$ | ||
|
||
public class JavaHypothesisTestingExample { | ||
public static void main(String[] args) { | ||
|
||
SparkConf conf = new SparkConf().setAppName("JavaHypothesisTestingExample"); | ||
JavaSparkContext jsc = new JavaSparkContext(conf); | ||
|
||
// $example on$ | ||
// a vector composed of the frequencies of events | ||
Vector vec = Vectors.dense(0.1, 0.15, 0.2, 0.3, 0.25); | ||
|
||
// compute the goodness of fit. If a second vector to test against is not supplied | ||
// as a parameter, the test runs against a uniform distribution. | ||
ChiSqTestResult goodnessOfFitTestResult = Statistics.chiSqTest(vec); | ||
// summary of the test including the p-value, degrees of freedom, test statistic, | ||
// the method used, and the null hypothesis. | ||
System.out.println(goodnessOfFitTestResult + "\n"); | ||
|
||
// Create a contingency matrix ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0)) | ||
Matrix mat = Matrices.dense(3, 2, new double[]{1.0, 3.0, 5.0, 2.0, 4.0, 6.0}); | ||
|
||
// conduct Pearson's independence test on the input contingency matrix | ||
ChiSqTestResult independenceTestResult = Statistics.chiSqTest(mat); | ||
// summary of the test including the p-value, degrees of freedom... | ||
System.out.println(independenceTestResult + "\n"); | ||
|
||
// an RDD of labeled points | ||
JavaRDD<LabeledPoint> obs = jsc.parallelize( | ||
Arrays.asList( | ||
new LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0)), | ||
new LabeledPoint(1.0, Vectors.dense(1.0, 2.0, 0.0)), | ||
new LabeledPoint(-1.0, Vectors.dense(-1.0, 0.0, -0.5)) | ||
) | ||
); | ||
|
||
// The contingency table is constructed from the raw (feature, label) pairs and used to conduct | ||
// the independence test. Returns an array containing the ChiSqTestResult for every feature | ||
// against the label. | ||
ChiSqTestResult[] featureTestResults = Statistics.chiSqTest(obs.rdd()); | ||
int i = 1; | ||
for (ChiSqTestResult result : featureTestResults) { | ||
System.out.println("Column " + i + ":"); | ||
System.out.println(result + "\n"); // summary of the test | ||
i++; | ||
} | ||
// $example off$ | ||
|
||
jsc.stop(); | ||
} | ||
} |
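The two chi-squared statistics used in the file above can be reproduced by hand. The following is a minimal sketch in plain Java, assuming the textbook Pearson formula (sum of (observed - expected)^2 / expected); the class `ChiSquaredSketch` and its method names are hypothetical, and p-values are left out because they need the chi-squared distribution function.

```java
// Hypothetical stand-alone helper; not part of Spark or of this PR.
class ChiSquaredSketch {

  /** Goodness-of-fit statistic against a uniform expectation with the same total. */
  static double goodnessOfFitUniform(double[] observed) {
    double total = 0.0;
    for (double o : observed) {
      total += o;
    }
    double expected = total / observed.length;
    double stat = 0.0;
    for (double o : observed) {
      double diff = o - expected;
      stat += diff * diff / expected;
    }
    return stat;
  }

  /** Independence statistic for a contingency table given as rows of counts. */
  static double independence(double[][] table) {
    int rows = table.length;
    int cols = table[0].length;
    double[] rowSums = new double[rows];
    double[] colSums = new double[cols];
    double total = 0.0;
    for (int i = 0; i < rows; i++) {
      for (int j = 0; j < cols; j++) {
        rowSums[i] += table[i][j];
        colSums[j] += table[i][j];
        total += table[i][j];
      }
    }
    double stat = 0.0;
    for (int i = 0; i < rows; i++) {
      for (int j = 0; j < cols; j++) {
        // Expected count under independence of the row and column variables.
        double expected = rowSums[i] * colSums[j] / total;
        double diff = table[i][j] - expected;
        stat += diff * diff / expected;
      }
    }
    return stat;
  }

  public static void main(String[] args) {
    double[] vec = {0.1, 0.15, 0.2, 0.3, 0.25};
    System.out.println("goodness of fit: statistic = " + goodnessOfFitUniform(vec)
        + ", degrees of freedom = " + (vec.length - 1));

    double[][] mat = {{1.0, 2.0}, {3.0, 4.0}, {5.0, 6.0}};
    System.out.println("independence: statistic = " + independence(mat)
        + ", degrees of freedom = " + (mat.length - 1) * (mat[0].length - 1));
  }
}
```

For the vector in the example the goodness-of-fit statistic is about 0.125 with 4 degrees of freedom, and for the 3 x 2 contingency matrix the independence statistic is about 0.1414 (exactly 14/99) with 2 degrees of freedom.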
49 changes: 49 additions & 0 deletions
...va/org/apache/spark/examples/mllib/JavaHypothesisTestingKolmogorovSmirnovTestExample.java
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,49 @@ | ||
/* | ||
* Licensed to the Apache Software Foundation (ASF) under one or more | ||
* contributor license agreements. See the NOTICE file distributed with | ||
* this work for additional information regarding copyright ownership. | ||
* The ASF licenses this file to You under the Apache License, Version 2.0 | ||
* (the "License"); you may not use this file except in compliance with | ||
* the License. You may obtain a copy of the License at | ||
* | ||
* http://www.apache.org/licenses/LICENSE-2.0 | ||
* | ||
* Unless required by applicable law or agreed to in writing, software | ||
* distributed under the License is distributed on an "AS IS" BASIS, | ||
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
* See the License for the specific language governing permissions and | ||
* limitations under the License. | ||
*/ | ||
|
||
package org.apache.spark.examples.mllib; | ||
|
||
import org.apache.spark.SparkConf; | ||
import org.apache.spark.api.java.JavaSparkContext; | ||
// $example on$ | ||
import java.util.Arrays; | ||
|
||
import org.apache.spark.api.java.JavaDoubleRDD; | ||
import org.apache.spark.mllib.stat.Statistics; | ||
import org.apache.spark.mllib.stat.test.KolmogorovSmirnovTestResult; | ||
// $example off$ | ||
|
||
public class JavaHypothesisTestingKolmogorovSmirnovTestExample { | ||
public static void main(String[] args) { | ||
|
||
SparkConf conf = | ||
new SparkConf().setAppName("JavaHypothesisTestingKolmogorovSmirnovTestExample"); | ||
JavaSparkContext jsc = new JavaSparkContext(conf); | ||
|
||
// $example on$ | ||
JavaDoubleRDD data = jsc.parallelizeDoubles(Arrays.asList(0.1, 0.15, 0.2, 0.3, 0.25)); | ||
KolmogorovSmirnovTestResult testResult = | ||
Statistics.kolmogorovSmirnovTest(data, "norm", 0.0, 1.0); | ||
// summary of the test including the p-value, test statistic, and null hypothesis | ||
// if our p-value indicates significance, we can reject the null hypothesis | ||
System.out.println(testResult); | ||
// $example off$ | ||
|
||
jsc.stop(); | ||
} | ||
} | ||
|
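The statistic behind `kolmogorovSmirnovTest(data, "norm", 0.0, 1.0)` is the largest gap between the empirical CDF of the sample and the normal CDF. A minimal plain-Java sketch of that statistic follows. It is not Spark's code: the class `KolmogorovSmirnovSketch` is hypothetical, the normal CDF uses the Abramowitz and Stegun 7.1.26 approximation of erf (absolute error around 1e-7) because the Java standard library has no erf, and the p-value is not computed.

```java
import java.util.Arrays;

// Hypothetical stand-alone helper; not part of Spark or of this PR.
class KolmogorovSmirnovSketch {

  /** Abramowitz-Stegun 7.1.26 approximation of the error function. */
  static double erf(double x) {
    double sign = x < 0 ? -1.0 : 1.0;
    double ax = Math.abs(x);
    double t = 1.0 / (1.0 + 0.3275911 * ax);
    double poly = ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t
        - 0.284496736) * t + 0.254829592) * t;
    return sign * (1.0 - poly * Math.exp(-ax * ax));
  }

  /** CDF of a normal distribution with the given mean and standard deviation. */
  static double normalCdf(double x, double mean, double stdDev) {
    return 0.5 * (1.0 + erf((x - mean) / (stdDev * Math.sqrt(2.0))));
  }

  /** One-sample, two-sided KS statistic against N(mean, stdDev^2). */
  static double ksStatistic(double[] sample, double mean, double stdDev) {
    double[] sorted = sample.clone();
    Arrays.sort(sorted);
    int n = sorted.length;
    double d = 0.0;
    for (int i = 0; i < n; i++) {
      double cdf = normalCdf(sorted[i], mean, stdDev);
      // The empirical CDF jumps from i/n to (i+1)/n at sorted[i]; check both sides.
      double above = (i + 1.0) / n - cdf;
      double below = cdf - (double) i / n;
      d = Math.max(d, Math.max(above, below));
    }
    return d;
  }

  public static void main(String[] args) {
    double[] data = {0.1, 0.15, 0.2, 0.3, 0.25};
    System.out.println("KS statistic = " + ksStatistic(data, 0.0, 1.0));
  }
}
```

For the five values in the example the largest gap occurs at the smallest point, 0.1, where the normal CDF is already about 0.54 while the empirical CDF is still 0, so the statistic is about 0.54.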
53 changes: 53 additions & 0 deletions
...les/src/main/java/org/apache/spark/examples/mllib/JavaKernelDensityEstimationExample.java
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,53 @@ | ||
/* | ||
* Licensed to the Apache Software Foundation (ASF) under one or more | ||
* contributor license agreements. See the NOTICE file distributed with | ||
* this work for additional information regarding copyright ownership. | ||
* The ASF licenses this file to You under the Apache License, Version 2.0 | ||
* (the "License"); you may not use this file except in compliance with | ||
* the License. You may obtain a copy of the License at | ||
* | ||
* http://www.apache.org/licenses/LICENSE-2.0 | ||
* | ||
* Unless required by applicable law or agreed to in writing, software | ||
* distributed under the License is distributed on an "AS IS" BASIS, | ||
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
* See the License for the specific language governing permissions and | ||
* limitations under the License. | ||
*/ | ||
|
||
package org.apache.spark.examples.mllib; | ||
|
||
import org.apache.spark.SparkConf; | ||
import org.apache.spark.api.java.JavaSparkContext; | ||
// $example on$ | ||
import java.util.Arrays; | ||
|
||
import org.apache.spark.api.java.JavaRDD; | ||
import org.apache.spark.mllib.stat.KernelDensity; | ||
// $example off$ | ||
|
||
public class JavaKernelDensityEstimationExample { | ||
public static void main(String[] args) { | ||
|
||
SparkConf conf = new SparkConf().setAppName("JavaKernelDensityEstimationExample"); | ||
JavaSparkContext jsc = new JavaSparkContext(conf); | ||
|
||
// $example on$ | ||
// an RDD of sample data | ||
JavaRDD<Double> data = jsc.parallelize( | ||
Arrays.asList(1.0, 1.0, 1.0, 2.0, 3.0, 4.0, 5.0, 5.0, 6.0, 7.0, 8.0, 9.0, 9.0)); | ||
|
||
// Construct the density estimator with the sample data | ||
// and a standard deviation for the Gaussian kernels | ||
KernelDensity kd = new KernelDensity().setSample(data).setBandwidth(3.0); | ||
|
||
// Find density estimates for the given values | ||
double[] densities = kd.estimate(new double[]{-1.0, 2.0, 5.0}); | ||
|
||
System.out.println(Arrays.toString(densities)); | ||
// $example off$ | ||
|
||
jsc.stop(); | ||
} | ||
} | ||
|
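The comment in the file above describes the bandwidth as the standard deviation of the Gaussian kernels. Under that reading, the estimate at a point is the average of normal densities centered on each sample value. The plain-Java sketch below illustrates this; the class `KernelDensitySketch` is hypothetical and is not Spark's `KernelDensity`.

```java
import java.util.Arrays;

// Hypothetical stand-alone helper; not part of Spark or of this PR.
class KernelDensitySketch {

  /** Gaussian kernel density estimate of the sample, evaluated at each point. */
  static double[] estimate(double[] sample, double bandwidth, double[] points) {
    // Normalizing constant: 1 / (n * h * sqrt(2 * pi)).
    double norm = 1.0 / (sample.length * bandwidth * Math.sqrt(2.0 * Math.PI));
    double[] densities = new double[points.length];
    for (int j = 0; j < points.length; j++) {
      double sum = 0.0;
      for (double s : sample) {
        double z = (points[j] - s) / bandwidth;
        sum += Math.exp(-0.5 * z * z);
      }
      densities[j] = sum * norm;
    }
    return densities;
  }

  public static void main(String[] args) {
    double[] data = {1.0, 1.0, 1.0, 2.0, 3.0, 4.0, 5.0, 5.0, 6.0, 7.0, 8.0, 9.0, 9.0};
    double[] densities = estimate(data, 3.0, new double[]{-1.0, 2.0, 5.0});
    System.out.println(Arrays.toString(densities));
  }
}
```

A larger bandwidth smooths the estimate; with 3.0 the density at 5.0, near the middle of the sample, comes out at roughly 0.09 and is higher than the density at -1.0, which lies outside the sample range.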
75 changes: 75 additions & 0 deletions
examples/src/main/java/org/apache/spark/examples/mllib/JavaStratifiedSamplingExample.java
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,75 @@ | ||
/* | ||
* Licensed to the Apache Software Foundation (ASF) under one or more | ||
* contributor license agreements. See the NOTICE file distributed with | ||
* this work for additional information regarding copyright ownership. | ||
* The ASF licenses this file to You under the Apache License, Version 2.0 | ||
* (the "License"); you may not use this file except in compliance with | ||
* the License. You may obtain a copy of the License at | ||
* | ||
* http://www.apache.org/licenses/LICENSE-2.0 | ||
* | ||
* Unless required by applicable law or agreed to in writing, software | ||
* distributed under the License is distributed on an "AS IS" BASIS, | ||
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
* See the License for the specific language governing permissions and | ||
* limitations under the License. | ||
*/ | ||
|
||
package org.apache.spark.examples.mllib; | ||
|
||
import com.google.common.collect.ImmutableMap; | ||
import org.apache.spark.SparkConf; | ||
import org.apache.spark.api.java.JavaSparkContext; | ||
|
||
// $example on$ | ||
import java.util.*; | ||
|
||
import scala.Tuple2; | ||
|
||
import org.apache.spark.api.java.JavaPairRDD; | ||
import org.apache.spark.api.java.function.VoidFunction; | ||
// $example off$ | ||
|
||
public class JavaStratifiedSamplingExample { | ||
public static void main(String[] args) { | ||
|
||
SparkConf conf = new SparkConf().setAppName("JavaStratifiedSamplingExample"); | ||
JavaSparkContext jsc = new JavaSparkContext(conf); | ||
|
||
// $example on$ | ||
List<Tuple2<Integer, Character>> list = new ArrayList<Tuple2<Integer, Character>>( | ||
Arrays.<Tuple2<Integer, Character>>asList( | ||
new Tuple2(1, 'a'), | ||
new Tuple2(1, 'b'), | ||
new Tuple2(2, 'c'), | ||
new Tuple2(2, 'd'), | ||
new Tuple2(2, 'e'), | ||
new Tuple2(3, 'f') | ||
) | ||
); | ||
|
||
JavaPairRDD<Integer, Character> data = jsc.parallelizePairs(list); | ||
|
||
// specify the exact fraction desired from each key Map<K, Object> | ||
ImmutableMap<Integer, Object> fractions = | ||
ImmutableMap.of(1, (Object) 0.1, 2, (Object) 0.6, 3, (Object) 0.3); | ||
|
||
// Get an approximate sample from each stratum | ||
JavaPairRDD<Integer, Character> approxSample = data.sampleByKey(false, fractions); | ||
// Get an exact sample from each stratum | ||
Review comment: The comment is wrong. |
||
JavaPairRDD<Integer, Character> exactSample = data.sampleByKeyExact(false, fractions); | ||
// $example off$ | ||
|
||
System.out.println("approxSample size is " + approxSample.collect().size()); | ||
for (Tuple2<Integer, Character> t : approxSample.collect()) { | ||
System.out.println(t._1() + " " + t._2()); | ||
} | ||
|
||
Review comment: add a println here |
||
System.out.println("exactSample size is " + exactSample.collect().size()); | ||
for (Tuple2<Integer, Character> t : exactSample.collect()) { | ||
System.out.println(t._1() + " " + t._2()); | ||
} | ||
|
||
jsc.stop(); | ||
} | ||
} |
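The difference between the approximate and the exact sample can be illustrated without Spark. The sketch below rests on two assumptions: the approximate variant keeps each record independently with the probability given for its key, and the exact variant keeps ceil(fraction * stratum size) records per key. The class `StratifiedSamplingSketch`, its methods, and the treatment of missing keys as fraction 0 are all hypothetical and do not reflect how Spark implements `sampleByKey` or `sampleByKeyExact`.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical stand-alone helper; not part of Spark or of this PR.
class StratifiedSamplingSketch {

  /** Keeps each record independently with the probability assigned to its key. */
  static <K, V> List<Map.Entry<K, V>> sampleByKeyApprox(
      List<Map.Entry<K, V>> data, Map<K, Double> fractions, Random rng) {
    List<Map.Entry<K, V>> out = new ArrayList<>();
    for (Map.Entry<K, V> e : data) {
      Double f = fractions.get(e.getKey());
      // Assumption of this sketch: a key without a fraction is never sampled.
      if (f != null && rng.nextDouble() < f) {
        out.add(e);
      }
    }
    return out;
  }

  /** Keeps exactly ceil(fraction * stratum size) records for every key. */
  static <K, V> List<Map.Entry<K, V>> sampleByKeyExact(
      List<Map.Entry<K, V>> data, Map<K, Double> fractions, Random rng) {
    // Group the records into strata, preserving first-seen key order.
    Map<K, List<Map.Entry<K, V>>> strata = new LinkedHashMap<>();
    for (Map.Entry<K, V> e : data) {
      strata.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e);
    }
    List<Map.Entry<K, V>> out = new ArrayList<>();
    for (Map.Entry<K, List<Map.Entry<K, V>>> stratum : strata.entrySet()) {
      Double f = fractions.get(stratum.getKey());
      if (f == null) {
        continue;
      }
      List<Map.Entry<K, V>> members = stratum.getValue();
      Collections.shuffle(members, rng);
      int take = Math.min(members.size(), (int) Math.ceil(f * members.size()));
      out.addAll(members.subList(0, take));
    }
    return out;
  }

  /** The six (key, value) pairs used in the example above. */
  static List<Map.Entry<Integer, Character>> demoData() {
    List<Map.Entry<Integer, Character>> data = new ArrayList<>();
    data.add(new SimpleEntry<>(1, 'a'));
    data.add(new SimpleEntry<>(1, 'b'));
    data.add(new SimpleEntry<>(2, 'c'));
    data.add(new SimpleEntry<>(2, 'd'));
    data.add(new SimpleEntry<>(2, 'e'));
    data.add(new SimpleEntry<>(3, 'f'));
    return data;
  }

  /** The per-key fractions used in the example above. */
  static Map<Integer, Double> demoFractions() {
    Map<Integer, Double> fractions = new HashMap<>();
    fractions.put(1, 0.1);
    fractions.put(2, 0.6);
    fractions.put(3, 0.3);
    return fractions;
  }

  public static void main(String[] args) {
    Random rng = new Random(42L);
    System.out.println("approx: " + sampleByKeyApprox(demoData(), demoFractions(), rng));
    System.out.println("exact:  " + sampleByKeyExact(demoData(), demoFractions(), rng));
  }
}
```

Under these assumptions the exact variant always returns 1 + 2 + 1 = 4 records for the example data, while the size of the approximate sample varies from run to run and may well be empty for such a small input.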
56 changes: 56 additions & 0 deletions
examples/src/main/java/org/apache/spark/examples/mllib/JavaSummaryStatisticsExample.java
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,56 @@ | ||
/* | ||
* Licensed to the Apache Software Foundation (ASF) under one or more | ||
* contributor license agreements. See the NOTICE file distributed with | ||
* this work for additional information regarding copyright ownership. | ||
* The ASF licenses this file to You under the Apache License, Version 2.0 | ||
* (the "License"); you may not use this file except in compliance with | ||
* the License. You may obtain a copy of the License at | ||
* | ||
* http://www.apache.org/licenses/LICENSE-2.0 | ||
* | ||
* Unless required by applicable law or agreed to in writing, software | ||
* distributed under the License is distributed on an "AS IS" BASIS, | ||
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
* See the License for the specific language governing permissions and | ||
* limitations under the License. | ||
*/ | ||
|
||
package org.apache.spark.examples.mllib; | ||
|
||
import org.apache.spark.SparkConf; | ||
import org.apache.spark.api.java.JavaSparkContext; | ||
// $example on$ | ||
import java.util.Arrays; | ||
|
||
import org.apache.spark.api.java.JavaRDD; | ||
import org.apache.spark.mllib.linalg.Vector; | ||
import org.apache.spark.mllib.linalg.Vectors; | ||
import org.apache.spark.mllib.stat.MultivariateStatisticalSummary; | ||
import org.apache.spark.mllib.stat.Statistics; | ||
// $example off$ | ||
|
||
public class JavaSummaryStatisticsExample { | ||
public static void main(String[] args) { | ||
|
||
SparkConf conf = new SparkConf().setAppName("JavaSummaryStatisticsExample"); | ||
JavaSparkContext jsc = new JavaSparkContext(conf); | ||
|
||
// $example on$ | ||
JavaRDD<Vector> mat = jsc.parallelize( | ||
Arrays.asList( | ||
Vectors.dense(1.0, 10.0, 100.0), | ||
Vectors.dense(2.0, 20.0, 200.0), | ||
Vectors.dense(3.0, 30.0, 300.0) | ||
) | ||
); // an RDD of Vectors | ||
|
||
// Compute column summary statistics. | ||
MultivariateStatisticalSummary summary = Statistics.colStats(mat.rdd()); | ||
System.out.println(summary.mean()); // a dense vector containing the mean value for each column | ||
System.out.println(summary.variance()); // column-wise variance | ||
System.out.println(summary.numNonzeros()); // number of nonzeros in each column | ||
// $example off$ | ||
|
||
jsc.stop(); | ||
} | ||
} |
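The three column statistics printed in the file above are easy to verify by hand. The sketch below computes them in plain Java over a small row-major array. The class `ColumnSummarySketch` is hypothetical, and the variance is assumed to be the unbiased sample variance (divisor n - 1).

```java
import java.util.Arrays;

// Hypothetical stand-alone helper; not part of Spark or of this PR.
class ColumnSummarySketch {

  /** Column-wise mean of a row-major matrix. */
  static double[] mean(double[][] rows) {
    double[] mean = new double[rows[0].length];
    for (double[] row : rows) {
      for (int j = 0; j < mean.length; j++) {
        mean[j] += row[j];
      }
    }
    for (int j = 0; j < mean.length; j++) {
      mean[j] /= rows.length;
    }
    return mean;
  }

  /** Column-wise unbiased sample variance (divisor n - 1); zeros if fewer than two rows. */
  static double[] variance(double[][] rows) {
    double[] variance = new double[rows[0].length];
    if (rows.length < 2) {
      return variance;
    }
    double[] mean = mean(rows);
    for (double[] row : rows) {
      for (int j = 0; j < variance.length; j++) {
        double diff = row[j] - mean[j];
        variance[j] += diff * diff;
      }
    }
    for (int j = 0; j < variance.length; j++) {
      variance[j] /= rows.length - 1;
    }
    return variance;
  }

  /** Number of nonzero entries in each column. */
  static long[] numNonzeros(double[][] rows) {
    long[] counts = new long[rows[0].length];
    for (double[] row : rows) {
      for (int j = 0; j < counts.length; j++) {
        if (row[j] != 0.0) {
          counts[j]++;
        }
      }
    }
    return counts;
  }

  public static void main(String[] args) {
    double[][] mat = {
        {1.0, 10.0, 100.0},
        {2.0, 20.0, 200.0},
        {3.0, 30.0, 300.0}
    };
    System.out.println(Arrays.toString(mean(mat)));
    System.out.println(Arrays.toString(variance(mat)));
    System.out.println(Arrays.toString(numNonzeros(mat)));
  }
}
```

For the three vectors in the example the column means are 2, 20 and 200, the variances are 1, 100 and 10000, and every column has three nonzero entries.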
Review comment: Move the println line inside $example.