Typically, Spark runs on YARN, which is inconvenient when we need finer control of executor placement (for example, running on a single machine with a specific number of executors, each with an exact configuration). Standalone mode better suits this use case.
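To illustrate why this matters, here is a minimal sketch (not part of spark-kit; the master URL `spark://master-host:7077` and `app.jar` are placeholders) of pinning the executor layout against a standalone master. In standalone mode the executor count follows from `spark.cores.max / spark.executor.cores`, so the settings below yield four 2-core executors:

```bash
# Hedged sketch, not part of spark-kit: master URL and app.jar are placeholders.
# In standalone mode, executor count = spark.cores.max / spark.executor.cores,
# so this requests four 2-core executors with 4g of memory each.
spark-submit \
  --master spark://master-host:7077 \
  --conf spark.cores.max=8 \
  --conf spark.executor.cores=2 \
  --conf spark.executor.memory=4g \
  app.jar
```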
- In order to use spark-kit:

  ```bash
  git clone https://github.com/stevenybw/spark-kit
  cd spark-kit
  source manage-standalone.sh
  ```

- Get the official Spark release:

  ```bash
  wget https://www.apache.org/dyn/closer.lua/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
  ```

- Check the environment and follow the directions:

  ```bash
  check_environment
  ```

- Adjust the parameters in `manage-standalone.sh` (a hypothetical example follows this list).

- Establish a Spark standalone cluster with all the nodes in `${SLAVES_HOSTLIST}`:

  ```bash
  reset_environment $DIST
  ```

- Establish a Spark standalone cluster with a single node (the first node in `${SLAVES_HOSTLIST}`):

  ```bash
  reset_environment $LOCAL
  ```

- Establish a Spark standalone cluster with a single node (the current node running the script):

  ```bash
  reset_environment_locally $LOCAL
  ```

- Check the Spark standalone resource manager master:

  ```bash
  show_master_webui
  ```

- Show the command to launch a Spark shell (its argument must match how you set up the environment; distributed is assumed here):

  ```bash
  show_spark_shell_command $DIST
  ```

- Or launch a Spark shell directly:

  ```bash
  enter_spark_shell $DIST
  ```

- See the session web UI of the Spark job at port 4040 (a smoke-test sketch follows this list).
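The authoritative parameter list lives in `manage-standalone.sh` itself; the only name referenced above is `${SLAVES_HOSTLIST}`. A purely hypothetical adjustment might look like:

```bash
# Hypothetical values; consult manage-standalone.sh for the real parameter
# names and format. ${SLAVES_HOSTLIST} is the set of worker nodes used by
# reset_environment: $DIST uses all of them, $LOCAL only the first.
SLAVES_HOSTLIST="node01 node02 node03 node04"
```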
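As a smoke test (a hedged sketch: `spark-shell` is assumed to be on the PATH, and `spark://master-host:7077` stands in for whatever `show_spark_shell_command $DIST` prints), a trivial job can be piped into the shell:

```bash
# Hedged smoke test: pipe a one-line Scala job into spark-shell and let it run
# on the standalone cluster. Expect "res0: Long = 1000" in the output.
echo 'sc.parallelize(1 to 1000).count()' | spark-shell --master spark://master-host:7077
```

While an interactive shell started via `enter_spark_shell $DIST` is alive, the session web UI stays up at `http://<driver-node>:4040`.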