Interactive Learner does not work with the Cloud based Spark Systems like Databricks and Azure where we do not have control on the console #79

Closed
sonalgoyal opened this issue Dec 16, 2021 · 13 comments

@sonalgoyal

[screenshot attached]

@sonalgoyal

To get to the interactive learner, I had to change the config: DBFS locations had to be added to the data/models attributes. Keeping this here for the record and for documentation when we are able to wrap this up. Note that the schema is passed as a JSON string, so the quotes inside it have to be escaped.

"data" : [{
"name":"test",
"format":"csv",
"props": {
"location": "dbfs:/FileStore/test.csv",
"delimiter": ",",
"header":false
},
"schema":
"{"type" : "struct",
"fields" : [
{"name":"id", "type":"string", "nullable":false},
{"name":"fname", "type":"string", "nullable":true},
{"name":"lname","type":"string","nullable":true} ,
{"name":"stNo", "type":"string", "nullable":true},
{"name":"add1", "type":"string", "nullable":true},
{"name":"add2","type":"string","nullable":true} ,
{"name":"city", "type":"string", "nullable":true},
{"name":"state", "type":"string", "nullable":true},
{"name":"dob","type":"string","nullable":true} ,
{"name":"ssn","type":"string","nullable":true}
]
}"
}],
"labelDataSampleSize" : 0.5,
"numPartitions":4,
"modelId": 100,

@sonalgoyal

One option here is to mark the training data through a notebook.

Read the pairs which are still unmarked:

from pyspark.sql.functions import lit

dfUn = spark.sql("select * from parquet.`models/100/trainingData/unmarked` unmarked WHERE z_cluster NOT IN (select z_cluster from parquet.`models/100/trainingData/marked`)")
dfUn.show()
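
Equivalently, a sketch of the same filter using the DataFrame API instead of path-based SQL (same locations assumed):

unmarked = spark.read.parquet("models/100/trainingData/unmarked")
marked = spark.read.parquet("models/100/trainingData/marked")
# keep only the clusters that have not been labelled yet
dfUn = unmarked.join(marked.select("z_cluster"), on="z_cluster", how="left_anti")
dfUn.show()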

Define the matched clusters:

matches = [['1638895363175:31'], ['1638895363175:3']]
columns = ['z_clusterMatched']
dfMatched = spark.createDataFrame(matches, columns)

Mark the matched rows:

dfM = dfUn.join(dfMatched, dfUn.z_cluster == dfMatched.z_clusterMatched, "inner")
dfM = dfM.withColumn("z_isMatch1", lit(1)).drop("z_clusterMatched").drop("z_isMatch").withColumnRenamed("z_isMatch1", "z_isMatch")
dfM.show()
dfM.write.parquet("models/100/trainingData/marked")

Repeat for the non-matches and the can-not-say pairs.

Need to verify this again!

@sonalgoyal

On Databricks:

First time:

from pyspark.sql.functions import lit

unmarked = spark.read.parquet("/models/100/trainingData/unmarked")
# unmarked.show(500)

matchPairZClusters = [['1639647243412:21'], ['1639647243412:25'], ['1639647243412:29'], ['1639647243412:31'], ['1639647243412:33'], ['1639647243412:9']]
columns=['z_clusterMatched']
matchPairZClustersDF = spark.createDataFrame(matchPairZClusters, columns)

matchDF = unmarked.join(matchPairZClustersDF, unmarked.z_cluster == matchPairZClustersDF.z_clusterMatched, "inner")
matchDF = matchDF.withColumn("z_isMatch1", lit(1)).drop("z_clusterMatched").drop("z_isMatch").withColumnRenamed("z_isMatch1","z_isMatch")
matchDF.show(500)

matchDF.write.parquet("/models/100/trainingData/marked")

@sonalgoyal

Next time

from pyspark.sql.functions import lit

unmarked = spark.read.parquet("/models/100/trainingData/unmarked")
#unmarked.show(500)
marked = spark.read.parquet("/models/100/trainingData/marked")
unmarked.createOrReplaceTempView("unmarked")
marked.createOrReplaceTempView("marked")
unmarkedFinal = spark.sql("select * from unmarked where z_cluster NOT IN (select z_cluster from marked)")
unmarkedFinal.show(500)

matchPairZClusters = [['1639990278797:0'], ['1639990278797:17'], ['1639990278797:23'], ['1639990278797:3'], ['1639990278797:33'], ['1639990278797:37']]
columns=['z_clusterMatched']
matchPairZClustersDF = spark.createDataFrame(matchPairZClusters, columns)

# join against the still-unmarked pairs so already-marked clusters are not re-labelled
matchDF = unmarkedFinal.join(matchPairZClustersDF, unmarkedFinal.z_cluster == matchPairZClustersDF.z_clusterMatched, "inner")
matchDF = matchDF.withColumn("z_isMatch1", lit(1)).drop("z_clusterMatched").drop("z_isMatch").withColumnRenamed("z_isMatch1","z_isMatch")
matchDF.show(500)

matchDF.write.mode("append").parquet("/models/100/trainingData/marked")

@sonalgoyal

The same has to be done for 0 and 2 - the non-matches and the can-not-say pairs; a sketch follows below.
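
A minimal sketch of that step, following the same pattern as the notebook code above (the cluster ids below are hypothetical placeholders):

from pyspark.sql.functions import lit

# Hypothetical z_clusters judged to be non-matches; collect these the same
# way as the match list, by inspecting the unmarkedFinal.show() output.
nonMatchPairZClusters = [['1639990278797:5'], ['1639990278797:11']]
nonMatchDF = spark.createDataFrame(nonMatchPairZClusters, ['z_clusterNonMatched'])

noMatchDF = unmarkedFinal.join(nonMatchDF, unmarkedFinal.z_cluster == nonMatchDF.z_clusterNonMatched, "inner")
# 0 marks a non-match; use lit(2) in the same way for the can-not-say pairs
noMatchDF = noMatchDF.withColumn("z_isMatch1", lit(0)).drop("z_clusterNonMatched").drop("z_isMatch").withColumnRenamed("z_isMatch1", "z_isMatch")
noMatchDF.write.mode("append").parquet("/models/100/trainingData/marked")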

sonalgoyal self-assigned this Dec 20, 2021
@sonalgoyal

Attaching a couple of files through which I have done notebook-based labelling to build the training data on Databricks.

The findTrainingData phase writes to the zinggDir/modelId/trainingData/unmarked folder, and the label phase reads from this location and writes to zinggDir/modelId/trainingData/marked. In all cases, pairs share the same z_cluster, and the z_isMatch flag denotes whether they are matches: -1 means unmarked, 0 stands for not a match, 1 for a match, and 2 for can not say. findTrainingData writes all pairs with z_isMatch set to -1; the label phase updates the flag for each pair and saves the output.
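
As a quick way to check labelling progress (a sketch, assuming modelId 100 directly under the root as in the snippets above):

# counts per label: 1 = match, 0 = not a match, 2 = can not say
marked = spark.read.parquet("/models/100/trainingData/marked")
marked.groupBy("z_isMatch").count().show()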

The notebooks attached do the same thing. They take in the z_clusters of the match, non-match and can-not-say records and write them to the marked folder.

Here are the files in the attached zip.

  1. configdb.json - This is the config file I used.
  2. test120k.csv - This is the febrl data with 120k records, so it needs a few rounds of findTrainingData and label. The label phase was done within Databricks through the notebooks.
  3. learnerFirstTime (html and py files) - This was run the first time, when there were no marked records.
  4. learnerNext - This was run 2 times to build the training data.

After this, Zingg was run in trainMatch mode. I verified and the results looked ok.
databricksLabelling.zip

@lsbilbro commented Dec 20, 2021

Sonal, these steps and instructions are excellent. Thanks for putting them together.

I tried them on my local workspace - using configs (full config attached):

  • test dataset: febrl120k/test.csv
  • "labelDataSampleSize" : 0.1
  • "numPartitions":2000
  • cluster workers: 6 DS13_v2 nodes (48 cores, 336 GB mem)

I ran the findTrainingData step through 4 iterations, ending up with 82 labeled pairs (164 records) in my marked parquet table.

When I ran the trainMatch step, I ended up with an NPE, but I think the error is that the training data cannot be found. I'll attach the stderr and log here, but I am suspicious of this in the log, which occurs just before everything shuts down:

21/12/20 20:31:36 INFO ZinggBase: Start reading internal configurations and functions
21/12/20 20:31:36 INFO ZinggBase: Finished reading internal configurations and functions
21/12/20 20:31:36 WARN SparkContext: Spark is not running in local mode, therefore the checkpoint directory must not be on the local filesystem. Directory '/tmp/checkpoint' appears to be on the local filesystem.
21/12/20 20:31:36 INFO Trainer: Reading inputs for training phase ...
21/12/20 20:31:36 INFO Trainer: Initializing learning similarity rules
21/12/20 20:31:36 WARN PipeUtil: Reading input PARQUET
21/12/20 20:31:37 INFO Version: Elasticsearch Hadoop v7.2.0 [cfafacd904]
21/12/20 20:31:37 WARN DSUtil: No preexisting marked training samples
21/12/20 20:31:37 WARN DSUtil: No configured training samples
21/12/20 20:31:37 WARN DSUtil: No training data found

One other downside to mention... despite running findTrainingData 4 times, I only ended up with 3 positive labels (and 79 negative labels)... but I suspect this is more related to the test data. If you have suggestions to help find more positive cases, let me know!

zinggconfig.txt

zingg_log4j-active.txt

zingg_stderr.txt

@sonalgoyal

You are correct; the error is that no training data has been found. Let me check what could be happening.

@sonalgoyal commented Dec 21, 2021

By any chance do you have the logs of the findTrainingData phase - the last one that you ran? Also, what are the locations of the marked and unmarked folders in the notebooks?

@sonalgoyal

I have a question on the zinggDir setting in the config you have sent, @lsbilbro. As this is a DBFS location, should we add a root (/) to it, i.e. "zinggDir": "/Bilbro/zingg" instead of "zinggDir": "Bilbro/zingg"?

@sonalgoyal

Seems like I attached the wrong config file; the correct config which I used is attached. It has the root location in zinggDir. You will need to update your config to reflect the location of the model, and run findTrainingData and label till you get at least 12-15 matches.
config120.json.txt

@lsbilbro

Yep, I can confirm that using an absolute path instead of a relative path - i.e. prepending the zinggDir value with a "/" - does resolve this error.

And after ensuring I had enough positive labels, I am getting full results, which look really good! 😄

@sonalgoyal

Updated the documentation to refer to this issue for now. Closing.
