✨Add toolbar dropdown to do remote run options #164
Conversation
Awesome work, looks like a big PR! I've tried the two tests you mentioned, and it looks like it's running well for both. I'll try this distribution on our Spark cluster soon.
From my local setup, I went ahead and experimented with the command feature; it looks like we can do some nice stuff like echo-ing commands.
There are a few things I noticed:
- The window appears to have a double resize. I think the inner one is enough, as adjusting the size of the outer one will misadjust the inner one.
- I don't think we need the "Also, you can go to Kraftboard to check the benchmarks" string for the default distribution. We can add this message to the config message instead.
Those 3 for now; I'll add more comments when I've tested it out more. Thanks!
I've tested it on our server with a cluster, and things are working perfectly.
One thing I noted is the difference in how the last line of the config is treated. Previously I needed to supply a double \\
as below
--conf spark.driver.maxResultSize=10G \\
to perform a spark-submit.
Now I need to omit it, otherwise it returns
sparkTrain.py not found.
because a gap is generated. I think the new behaviour is better, so great job. 😄
Thanks for the review. Solved the 3 issues. Good job noticing the dropdown icon; it was not as easy a fix as I thought. Yeah, I forgot to mention that I add a space between the command and the file's path, so we don't have to add |
Alright, based on the feedback we've gotten, I've modified config.ini to support Local and Cluster modes.
Local works out of the box, but for Cluster mode I had to do a bit more work. For documentation purposes, these are the errors users might get and how to resolve them.
1. Module not found error
ModuleNotFoundError: No module named '...'
You need to package xai_components + the venv into a zip file, then add these Spark configs:
--py-files env_spark.zip \
--archives env_spark.zip \
To make the zipping process easier, I've added SparkPackageVenv.xircuits, which does just that.
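For reference, here is a minimal sketch of the packaging step (the venv path and Python version are assumptions; the bundled SparkPackageVenv.xircuits automates this). Both xai_components and the installed packages should sit at the zip root, so that e.g. numpy/ ends up as a top-level entry in env_spark.zip:
zip -r env_spark.zip xai_components
cd venv/lib/python3.9/site-packages && zip -r ../../../../env_spark.zip . && cd -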
2. Incorrect Python version on the cluster
File "/home/hadoop/nm-local-dir/usercache/fahreza/appcache/application_1655102329321_0303/container_1655102329321_0303_01_000001/env_spark.zip/numpy/version.py", line 1
from __future__ import annotations
^
SyntaxError: future feature annotations is not defined
If the packages require a different Python runtime than the default one, users need to specify the Python version. In this example, the CentOS machine I was using defaults to Python 3.6, while the packages expect a higher version, i.e. 3.9. To set the Python runtime, they'll need these configs:
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON='/usr/local/bin/python3.9' \
--conf spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON='/usr/local/bin/python3.9' \
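Putting both fixes together, a cluster submission then looks roughly like this (the master, deploy mode, and script name are assumptions based on the YARN paths and the sparkTrain.py mentioned in this thread; adjust to your cluster):
spark-submit \
--master yarn \
--deploy-mode cluster \
--py-files env_spark.zip \
--archives env_spark.zip \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON='/usr/local/bin/python3.9' \
--conf spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON='/usr/local/bin/python3.9' \
sparkTrain.py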
3. File does not exist
pyspark.sql.utils.AnalysisException: Path does not exist: hdfs://servername:9000/user/fahreza/datasets/wind.csv
The cluster does not have the file used in the workflow, so the file needs to be uploaded to HDFS:
hdfs dfs -mkdir datasets
hdfs dfs -put datasets/wind.csv datasets/wind.csv
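To confirm the upload before re-submitting:
hdfs dfs -ls datasets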
If everything is working, they'll see output like this in the Hadoop stdout log:
Matplotlib created a temporary config/cache directory at /tmp/matplotlib-dlralr49 because the default path (/home/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
Executing: xSparkSession
/home/hadoop/nm-local-dir/usercache/fahreza/appcache/application_1655102329321_0305/container_1655102329321_0305_01_000001/pyspark.zip/pyspark/context.py:264: RuntimeWarning: Failed to add file [file:///home/fahreza/Github/xircuits-spark-config/env_spark.zip] specified in 'spark.submit.pyFiles' to Python path:
/data/disk3/hadoop/nm-local-dir/usercache/fahreza/filecache/36
/data/disk3/hadoop/nm-local-dir/usercache/fahreza/appcache/application_1655102329321_0305/spark-d581118d-7889-4ae4-881e-740358604fa1/userFiles-40a340a3-57fe-441f-8597-3fae5ac6a412
/data/disk3/hadoop/nm-local-dir/usercache/fahreza/filecache/34/__spark_libs__580593447246665266.zip/spark-core_2.12-3.1.3.jar
/home/hadoop/nm-local-dir/usercache/fahreza/appcache/application_1655102329321_0305/container_1655102329321_0305_01_000001/pyspark.zip
/home/hadoop/nm-local-dir/usercache/fahreza/appcache/application_1655102329321_0305/container_1655102329321_0305_01_000001/py4j-0.10.9-src.zip
/home/hadoop/nm-local-dir/usercache/fahreza/appcache/application_1655102329321_0305/container_1655102329321_0305_01_000001/env_spark.zip
/usr/local/lib/python39.zip
/usr/local/lib/python3.9
/usr/local/lib/python3.9/lib-dynload
/usr/local/lib/python3.9/site-packages
/home/fahreza/Github/xircuits-spark-config/xai_components
warnings.warn(
Executing: SparkReadFile
+------+-----------+
| Year| Wind|
+------+-----------+
|1980.0| 0.0|
|1981.0| 0.0|
|1982.0| 0.0|
|1983.0|0.029667962|
|1984.0|0.050490252|
|1985.0|0.072761883|
|1986.0| 0.14918872|
|1987.0|0.205541414|
|1988.0|0.342871014|
|1989.0| 2.597943|
|1990.0| 3.5356|
|1991.0| 4.096951|
|1992.0| 4.611373|
|1993.0| 5.55795|
|1994.0| 7.284414|
|1995.0| 7.935523|
|1996.0| 9.288649|
|1997.0| 12.134585|
|1998.0| 16.108642|
|1999.0| 21.24186|
+------+-----------+
only showing top 20 rows
Executing: SparkVisualize
Finish Executing
Description
This enables adding and executing a remote custom run (e.g. Spark Submit) with its own configuration, using the subprocess module. Initially it was only for spark-submit, but it is now intended for more general usage, where other execution types besides spark-submit can be added.
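As an illustration only (the flag and script name below are taken from the discussion above, not from the actual configuration file): the configured command string is joined with the compiled script's path and handed to subprocess, so a run can boil down to something like
spark-submit --conf spark.driver.maxResultSize=10G sparkTrain.py
or, for a quick sanity check, even a plain echo of the script path.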
It uses the config.ini file to get the configuration data. There are 3 separate sections to fill.
Note: every time config.ini is updated, Xircuits only detects the change after the run type on the toolbar is changed.
Pull Request Type
Type of Change
Tests
- config.ini will have the spark-submit configuration by default.
- Remote Run with config.ini, changing the run type from the toolbar.
Tested on?
Notes
Thinking of using something like a .json file instead of config.ini. IMO, .json is more user-friendly for the data structure.
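As a purely hypothetical illustration (these keys are invented, not the real config.ini schema), the same run type could be expressed in both formats:
config.ini style:
[SPARK SUBMIT]
command = spark-submit --master yarn
.json style:
{"spark submit": {"command": "spark-submit --master yarn"}}
JSON would make nesting run types and their options more natural, at the cost of stricter syntax.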