Interactive Learner does not work with cloud-based Spark systems like Databricks and Azure where we do not have control over the console #79
To get to the interactive learner, I had to change the config: dbfs locations had to be added to the data/models attributes. Keeping this here for the record and for documentation when we are able to wrap this up. "data" : [{
One option here is to mark the training data through a notebook.

```python
# read the ones which are unmarked
dfUn = spark.sql("select * from parquet.
# define the matched clusters
matches = [['1638895363175:31'], ['1638895363175:3']]
# mark the matched rows
dfM = df.join(dfMatched, df.z_cluster == dfMatched.z_clusterMatched, "inner")
```

Repeat for non matches and can't says. Need to verify this again!
On Databricks:
```python
matchPairZClusters = [['1639647243412:21'], ['1639647243412:25'], ['1639647243412:29'], ['1639647243412:31'], ['1639647243412:33'], ['1639647243412:9']]
matchDF = unmarked.join(matchPairZClustersDF, unmarked.z_cluster == matchPairZClustersDF.z_clusterMatched, "inner")
matchDF.write.parquet("models/100/trainingData/marked")
```
Next time:

```python
from pyspark.sql.functions import lit

unmarked = spark.read.parquet("/models/100/trainingData/unmarked")
matchPairZClusters = [['1639990278797:0'], ['1639990278797:17'], ['1639990278797:23'], ['1639990278797:3'], ['1639990278797:33'], ['1639990278797:37']]
matchDF = unmarked.join(matchPairZClustersDF, unmarked.z_cluster == matchPairZClustersDF.z_clusterMatched, "inner")
matchDF.write.mode("append").parquet("models/100/trainingData/marked")
```
The same has to be done for 0 and 2, i.e. the non matches and not sures.
Attaching a couple of files through which I have done notebook based labelling to build the training data on Databricks. The findTrainingData phase writes to the zinggDir/modelId/trainingData/unmarked folder, and the label phase reads from this location and writes to the zinggDir/modelId/trainingData/marked location. In all cases, pairs share the same z_cluster, and the z_isMatch flag denotes whether they are a match: -1 means unmarked, 0 stands for not a match, 1 for a match, and 2 for cannot say. findTrainingData writes all pairs with z_isMatch as -1. The label phase updates the z_isMatch flag for the pairs and saves the output. The attached notebooks do the same thing: they take in the z_clusters of the matches, non matches and cannot-say records and write them to the marked folder. Here are the files in the attached zip.
After this, Zingg was run in trainMatch mode. I verified, and the results looked ok.
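For notebook use, the labelling codes described above can be captured in a small helper (a sketch; the helper name is mine, but the numeric values come straight from the description):

```python
# z_isMatch codes: -1 unmarked, 0 not a match, 1 match, 2 cannot say
Z_IS_MATCH = {"unmarked": -1, "non_match": 0, "match": 1, "cant_say": 2}

def z_is_match(label: str) -> int:
    """Return the z_isMatch code for a human-readable label."""
    return Z_IS_MATCH[label]
```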
Sonal, these steps and instructions are excellent. Thanks for putting them together. I tried them on my local workspace, using this config (full config attached):
I ran the findTrainingData step through 4 iterations, ending up with 82 labeled pairs (164 records) in my marked folder. When I ran the trainMatch step, I ended up with an NPE, but I think the error is that the training data cannot be found. I'll attach the stderr and log here, but I am suspicious of this in the log, which occurs just before everything shuts down:
One other downside to mention: despite running findTrainingData 4 times, I only ended up with 3 positive labels (and 79 negative labels), but I suspect this is more related to the test data. If you have suggestions to help find more positive cases, let me know!
You are correct, the error is that no training data has been found. Let me check what could be happening.
By any chance do you have the logs of the findTrainingData phase, the last one that you ran? Also, what are the locations of the marked and unmarked folders in the notebooks?
I have a question on the zinggDir setting in the config you have sent, @lsbilbro. As this is a dbfs location, should we add a root (/) to it, i.e. "zinggDir": "/Bilbro/zingg" instead of "zinggDir": "Bilbro/zingg"?
Seems like I attached the wrong config file. The correct config which I used is attached; it has the root location in zinggDir. You will need to update your config to reflect the location of the model, and run findTrainingData and label till you get at least 12-15 matches.
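For reference, the corrected fragment looks like this (only the zinggDir key shown; the rest of the config is in the attachment):

```json
{
  "zinggDir": "/Bilbro/zingg"
}
```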
Yep, I can confirm that using an absolute path instead of a relative path, i.e. prepending the root /, fixed the issue. And after ensuring I had enough positive labels, I am getting full results, which look really good! 😄
Updated documentation to refer to this issue for now. Closing |