
show() method fails on the dataset returned by transform() after deserializing a bundle #481

Closed
GowthamGoud opened this issue Feb 6, 2019 · 3 comments

@GowthamGoud

Hi, I executed the code below with PySpark in a Jupyter notebook.

```python
from mleap import pyspark  # activates MLeap serialization support
from mleap.pyspark.spark_support import SimpleSparkSerializer
from pyspark.ml import Pipeline, PipelineModel
from pyspark.mllib.regression import LabeledPoint
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import HashingTF, Tokenizer, StringIndexer, IndexToString
from pyspark.sql import SparkSession, Row
from pyspark.sql.types import *

schema = StructType([
    StructField("category", IntegerType(), True),
    StructField("text", StringType(), True)])
spark = SparkSession.builder.master("local").enableHiveSupport().getOrCreate()
textFile = spark.read.csv(
    "/home/opentext/bda/home/bin/notebook/Sell1.csv", header=True, mode="DROPMALFORMED", schema=schema
)
textFile.show()
textFile.write.save("/home/opentext/bda/home/bin/notebook/Sell.parquet", format="parquet")
schemaSell = spark.read.load("/home/opentext/bda/home/bin/notebook/Sell.parquet")
train_data, test_data = schemaSell.randomSplit([0.8, 0.2])
categoryIndexer = StringIndexer(inputCol="category", outputCol="label")
labels = categoryIndexer.fit(train_data).labels

tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features", numFeatures=10000)
nb = NaiveBayes(smoothing=1.0, modelType="multinomial")
pipeline = Pipeline(stages=[categoryIndexer, tokenizer, hashingTF, nb])
model = pipeline.fit(train_data)
pr = model.transform(schemaSell)
pr.show()  # no problem with this show()

model.serializeToBundle("jar:file:///home/opentext/bda/home/bin/notebook/modelnb.zip", pr)
transformer = PipelineModel.deserializeFromBundle("jar:file:///home/opentext/bda/home/bin/notebook/modelnb.zip")
ds = transformer.transform(test_data)
ds.show()
```

Calling show() on the transformed dataset (ds.show()) throws the exception below:

```
Name: org.apache.toree.interpreter.broker.BrokerException
Message: Py4JJavaError: An error occurred while calling o982.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 56.0 failed 1 times, most recent failure: Lost task 0.0 in stage 56.0 (TID 56, localhost, executor driver): org.apache.spark.SparkException: Failed to execute user defined function($anonfun$1: (vector) => vector)

at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.NoSuchElementException: Failed to find a default value for modelType
at org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:652)
at org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:652)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.ml.param.Params$class.getOrDefault(params.scala:651)
at org.apache.spark.ml.PipelineStage.getOrDefault(Pipeline.scala:42)
at org.apache.spark.ml.param.Params$class.$(params.scala:656)
at org.apache.spark.ml.PipelineStage.$(Pipeline.scala:42)
at org.apache.spark.ml.classification.NaiveBayesModel.predictRaw(NaiveBayes.scala:317)
at org.apache.spark.ml.classification.NaiveBayesModel.predictRaw(NaiveBayes.scala:252)
at org.apache.spark.ml.classification.ProbabilisticClassificationModel$$anonfun$1.apply(ProbabilisticClassifier.scala:117)
at org.apache.spark.ml.classification.ProbabilisticClassificationModel$$anonfun$1.apply(ProbabilisticClassifier.scala:116)
... 16 more
```
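The `Caused by` line points at Spark's param resolution: `Params.getOrDefault` looks a param up first in the explicitly-set map, then in the defaults map, and throws when it is found in neither. The following is a minimal pure-Python sketch of that lookup (deliberately simplified, not Spark's actual code) showing why a param dropped during deserialization produces exactly this error:

```python
# Simplified sketch of Spark ML's param lookup. A param value is resolved
# from the explicitly-set map first, then from the defaults map; if it is
# in neither, a "Failed to find a default value" error is raised.
class ParamError(Exception):
    pass

class Params:
    def __init__(self):
        self.param_map = {}    # explicitly set params
        self.default_map = {}  # library-provided defaults

    def get_or_default(self, name):
        if name in self.param_map:
            return self.param_map[name]
        if name in self.default_map:
            return self.default_map[name]
        raise ParamError(f"Failed to find a default value for {name}")

# Before serialization: modelType was set explicitly, so the lookup succeeds.
nb = Params()
nb.param_map["modelType"] = "multinomial"
print(nb.get_or_default("modelType"))  # multinomial

# After a lossy round trip the explicit value is gone, and NaiveBayes
# defines no default for modelType, so the lookup fails.
restored = Params()
try:
    restored.get_or_default("modelType")
except ParamError as e:
    print(e)  # Failed to find a default value for modelType
```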

@GowthamGoud
Author

It throws `Caused by: java.util.NoSuchElementException: Failed to find a default value for modelType`
even though I passed modelType="multinomial" when initializing the estimator:
NaiveBayes(smoothing=1.0, modelType="multinomial")

@ancasarb ancasarb self-assigned this Feb 7, 2019
@ancasarb
Member

ancasarb commented Feb 7, 2019

Hi @GowthamGoud, thanks for reporting this bug! We have a few instances where we've missed some params when re-loading the transformers back into Spark; I've raised #483 to fix the NaiveBayes model and a couple of others.
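One way to guard against this class of regression is to diff a stage's params before and after a bundle round trip. The sketch below uses plain dicts and a deliberately lossy, hypothetical serializer pair (a stand-in for the buggy bundle writer/reader, not real MLeap calls) to show the shape of such a check:

```python
# Hypothetical stand-ins for a bundle round trip: the "serializer" here
# drops modelType on purpose, mimicking the bug in this issue.
def serialize(params: dict) -> dict:
    return {k: v for k, v in params.items() if k != "modelType"}

def deserialize(blob: dict) -> dict:
    return dict(blob)

def missing_params(before: dict, after: dict) -> set:
    """Params that were set before the round trip but lost during it."""
    return set(before) - set(after)

original = {"smoothing": 1.0, "modelType": "multinomial"}
restored = deserialize(serialize(original))
print(missing_params(original, restored))  # {'modelType'}
```

A real regression test would run the same diff over the params of every stage of a `PipelineModel` after `serializeToBundle`/`deserializeFromBundle` and assert the set is empty.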

@ancasarb ancasarb added the bug label Mar 3, 2019
@ancasarb
Member

ancasarb commented Mar 3, 2019

Closing this, PR #483 with the fix for this issue has been merged.
