Interactive Learner does not work with the Cloud based Spark Systems like Databricks and Azure where we do not have control on the console #79

Closed
sonalgoyal opened this issue Dec 16, 2021 · 13 comments

@sonalgoyal

[screenshot attached]

@sonalgoyal

To get to the interactive learner, I had to change the config: DBFS locations had to be added to the data/models attributes. Keeping this here for the record and for documentation when we are able to wrap this up. Note that the schema is passed as a JSON string, so the quotes inside it have to be escaped.

"data" : [{
"name":"test",
"format":"csv",
"props": {
"location": "dbfs:/FileStore/test.csv",
"delimiter": ",",
"header":false
},
"schema":
"{"type" : "struct",
"fields" : [
{"name":"id", "type":"string", "nullable":false},
{"name":"fname", "type":"string", "nullable":true},
{"name":"lname","type":"string","nullable":true} ,
{"name":"stNo", "type":"string", "nullable":true},
{"name":"add1", "type":"string", "nullable":true},
{"name":"add2","type":"string","nullable":true} ,
{"name":"city", "type":"string", "nullable":true},
{"name":"state", "type":"string", "nullable":true},
{"name":"dob","type":"string","nullable":true} ,
{"name":"ssn","type":"string","nullable":true}
]
}"
}],
"labelDataSampleSize" : 0.5,
"numPartitions":4,
"modelId": 100,

@sonalgoyal

One option here is to mark the training data through a notebook.

Read the pairs which are still unmarked:

from pyspark.sql.functions import lit

dfUn = spark.sql("select * from parquet.`models/100/trainingData/unmarked` unmarked WHERE z_cluster NOT IN (select z_cluster from parquet.`models/100/trainingData/marked`)")
dfUn.show()
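
Equivalently, a sketch of the same filter using the DataFrame API instead of path-based SQL (same locations assumed):

unmarked = spark.read.parquet("models/100/trainingData/unmarked")
marked = spark.read.parquet("models/100/trainingData/marked")
# keep only the clusters that have not been labelled yet
dfUn = unmarked.join(marked.select("z_cluster"), on="z_cluster", how="left_anti")
dfUn.show()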

Define the matched clusters:

matches = [['1638895363175:31'], ['1638895363175:3']]
columns = ['z_clusterMatched']
dfMatched = spark.createDataFrame(matches, columns)

Mark the matched rows:

dfM = dfUn.join(dfMatched, dfUn.z_cluster == dfMatched.z_clusterMatched, "inner")
dfM = dfM.withColumn("z_isMatch1", lit(1)).drop("z_clusterMatched").drop("z_isMatch").withColumnRenamed("z_isMatch1", "z_isMatch")
dfM.show()
dfM.write.parquet("models/100/trainingData/marked")

Repeat for the non-matches and the can-not-say pairs.

Need to verify this again!

@sonalgoyal

On Databricks:

First time:

from pyspark.sql.functions import lit

unmarked = spark.read.parquet("/models/100/trainingData/unmarked")
# unmarked.show(500)

matchPairZClusters = [['1639647243412:21'], ['1639647243412:25'], ['1639647243412:29'], ['1639647243412:31'], ['1639647243412:33'], ['1639647243412:9']]
columns=['z_clusterMatched']
matchPairZClustersDF = spark.createDataFrame(matchPairZClusters, columns)

matchDF = unmarked.join(matchPairZClustersDF, unmarked.z_cluster == matchPairZClustersDF.z_clusterMatched, "inner")
matchDF = matchDF.withColumn("z_isMatch1", lit(1)).drop("z_clusterMatched").drop("z_isMatch").withColumnRenamed("z_isMatch1","z_isMatch")
matchDF.show(500)

matchDF.write.parquet("/models/100/trainingData/marked")

@sonalgoyal

Next time

from pyspark.sql.functions import lit

unmarked = spark.read.parquet("/models/100/trainingData/unmarked")
#unmarked.show(500)
marked = spark.read.parquet("/models/100/trainingData/marked")
unmarked.createOrReplaceTempView("unmarked")
marked.createOrReplaceTempView("marked")
unmarkedFinal = spark.sql("select * from unmarked where z_cluster NOT IN (select z_cluster from marked)")
unmarkedFinal.show(500)

matchPairZClusters = [['1639990278797:0'], ['1639990278797:17'], ['1639990278797:23'], ['1639990278797:3'], ['1639990278797:33'], ['1639990278797:37']]
columns=['z_clusterMatched']
matchPairZClustersDF = spark.createDataFrame(matchPairZClusters, columns)

# join against the still-unmarked pairs so already-marked clusters are not re-labelled
matchDF = unmarkedFinal.join(matchPairZClustersDF, unmarkedFinal.z_cluster == matchPairZClustersDF.z_clusterMatched, "inner")
matchDF = matchDF.withColumn("z_isMatch1", lit(1)).drop("z_clusterMatched").drop("z_isMatch").withColumnRenamed("z_isMatch1","z_isMatch")
matchDF.show(500)

matchDF.write.mode("append").parquet("/models/100/trainingData/marked")

@sonalgoyal

The same has to be done for 0 and 2 - the non-matches and the can-not-say pairs; a sketch follows below.
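
A minimal sketch of that step, following the same pattern as the notebook code above (the cluster ids below are hypothetical placeholders):

from pyspark.sql.functions import lit

# Hypothetical z_clusters judged to be non-matches; collect these the same
# way as the match list, by inspecting the unmarkedFinal.show() output.
nonMatchPairZClusters = [['1639990278797:5'], ['1639990278797:11']]
nonMatchDF = spark.createDataFrame(nonMatchPairZClusters, ['z_clusterNonMatched'])

noMatchDF = unmarkedFinal.join(nonMatchDF, unmarkedFinal.z_cluster == nonMatchDF.z_clusterNonMatched, "inner")
# 0 marks a non-match; use lit(2) in the same way for the can-not-say pairs
noMatchDF = noMatchDF.withColumn("z_isMatch1", lit(0)).drop("z_clusterNonMatched").drop("z_isMatch").withColumnRenamed("z_isMatch1", "z_isMatch")
noMatchDF.write.mode("append").parquet("/models/100/trainingData/marked")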

sonalgoyal self-assigned this Dec 20, 2021
@sonalgoyal

Attaching a couple of files through which I have done notebook-based labelling to build the training data on Databricks.

The findTrainingData phase writes to the zinggDir/modelId/trainingData/unmarked folder, and the label phase reads from this location and writes to zinggDir/modelId/trainingData/marked. In all cases, pairs share the same z_cluster, and the z_isMatch flag denotes whether they are matches: -1 means unmarked, 0 stands for not a match, 1 for a match, and 2 for can not say. findTrainingData writes all pairs with z_isMatch set to -1; the label phase updates the flag for each pair and saves the output.
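
As a quick way to check labelling progress (a sketch, assuming modelId 100 directly under the root as in the snippets above):

# counts per label: 1 = match, 0 = not a match, 2 = can not say
marked = spark.read.parquet("/models/100/trainingData/marked")
marked.groupBy("z_isMatch").count().show()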

The notebooks attached do the same thing. They take in the z_clusters of the match, non-match and can-not-say records and write them to the marked folder.

Here are the files in the attached zip.

  1. configdb.json - This is the config file I used.
  2. test120k.csv - This is the febrl data with 120k records, so it needs a few rounds of findTrainingData and label. The label phase was done within Databricks through the notebooks.
  3. learnerFirstTime (html and py files) - This was run the first time, when there were no marked records.
  4. learnerNext - This was run 2 times to build the training data.

After this, Zingg was run in trainMatch mode. I verified and the results looked ok.
databricksLabelling.zip

@lsbilbro commented Dec 20, 2021

Sonal, these steps and instructions are excellent. Thanks for putting them together.

I tried them on my local workspace - using configs (full config attached):

  • test dataset: febrl120k/test.csv
  • "labelDataSampleSize" : 0.1
  • "numPartitions":2000
  • cluster workers: 6 DS13_v2 nodes (48 cores, 336 GB mem)

I ran the findTrainingData step through 4 iterations, ending up with 82 labeled pairs (164 records) in my marked parquet table.

When I ran the trainMatch step, I ended up with an NPE, but I think the error is that the training data cannot be found. I'll attach the stderr and log here, but I am suspicious of this in the log, which occurs just before everything shuts down:

21/12/20 20:31:36 INFO ZinggBase: Start reading internal configurations and functions
21/12/20 20:31:36 INFO ZinggBase: Finished reading internal configurations and functions
21/12/20 20:31:36 WARN SparkContext: Spark is not running in local mode, therefore the checkpoint directory must not be on the local filesystem. Directory '/tmp/checkpoint' appears to be on the local filesystem.
21/12/20 20:31:36 INFO Trainer: Reading inputs for training phase ...
21/12/20 20:31:36 INFO Trainer: Initializing learning similarity rules
21/12/20 20:31:36 WARN PipeUtil: Reading input PARQUET
21/12/20 20:31:37 INFO Version: Elasticsearch Hadoop v7.2.0 [cfafacd904]
21/12/20 20:31:37 WARN DSUtil: No preexisting marked training samples
21/12/20 20:31:37 WARN DSUtil: No configured training samples
21/12/20 20:31:37 WARN DSUtil: No training data found

One other downside to mention... despite running findTrainingData 4 times, I only ended up with 3 positive labels (and 79 negative labels)... but I suspect this is more related to the test data. If you have suggestions to help find more positive cases, let me know!

zinggconfig.txt

zingg_log4j-active.txt

zingg_stderr.txt

@sonalgoyal

You are correct; the error is that no training data has been found. Let me check what could be happening.

@sonalgoyal commented Dec 21, 2021

By any chance do you have the logs of the findTrainingData phase - the last one that you ran? Also, what are the locations of the marked and unmarked folders in the notebooks?

@sonalgoyal

I have a question on the zinggDir setting in the config you have sent, @lsbilbro. As this is a DBFS location, should we add a root (/) to it, i.e. "zinggDir": "/Bilbro/zingg" instead of "zinggDir": "Bilbro/zingg"?

@sonalgoyal

Seems like I attached the wrong config file; the correct config which I used is attached. It has the root location in zinggDir. You will need to update your config to reflect the location of the model, and run findTrainingData and label till you get at least 12-15 matches.
config120.json.txt

@lsbilbro

Yep, I can confirm that using an absolute path instead of a relative path - i.e. prepending the zinggDir value with a "/" - does resolve this error.

And after ensuring I had enough positive labels, I am getting full results, which look really good! 😄

@sonalgoyal

Updated the documentation to refer to this issue for now. Closing.
