
[Spark] spark-submit and Its Options

•8• 2024. 3. 24. 02:06

What is spark-submit?

spark-submit is the script used to deploy a Spark application to a cluster.

It provides a number of flags that control the resources the application uses.

The executable lives in the bin directory of the Spark installation.

 

spark-submit options

# Java/Scala application
./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]

# Python application
./bin/spark-submit \
  --master <master-url> \
  --py-files file1.py,file2.py,file3.zip \
  my_python_file.py
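To make the ordering in the template concrete, here is a minimal sketch that assembles a spark-submit command line as an argument list. The helper `build_spark_submit` and its parameter names are hypothetical and exist only for illustration; the flags themselves are the ones from the template above.

```python
import shlex


def build_spark_submit(app, master, deploy_mode="client", main_class=None,
                       conf=None, py_files=None, app_args=None):
    """Assemble a spark-submit argument list (hypothetical helper).

    Options come first, then the application file, then the application's
    own arguments.
    """
    cmd = ["./bin/spark-submit", "--master", master, "--deploy-mode", deploy_mode]
    if main_class:
        # --class only applies to Java/Scala applications.
        cmd += ["--class", main_class]
    for key, value in (conf or {}).items():
        # Each property gets its own --conf flag.
        cmd += ["--conf", f"{key}={value}"]
    if py_files:
        # --py-files takes one comma-separated value.
        cmd += ["--py-files", ",".join(py_files)]
    cmd.append(app)                 # <application-jar> or the Python file
    cmd += list(app_args or [])     # everything after it goes to the application
    return cmd


cmd = build_spark_submit(
    "my_python_file.py",
    master="yarn",
    deploy_mode="cluster",
    conf={"spark.executor.memory": "2g"},
    py_files=["file1.py", "file2.py", "file3.zip"],
)
print(shlex.join(cmd))
```

Building the command as a list (rather than one string) avoids quoting problems if it is later passed to `subprocess.run`.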

The flags used in the spark-submit template above are described one by one below.

 

master flag (cluster manager)

Sets which cluster manager the Spark application uses to obtain its resources.

| master value | Description |
| --- | --- |
| local[*] | Runs in local mode with as many cores as the machine has. |
| local[N] | Runs in local mode with N cores. |
| local | Runs in local mode with a single core. |
| yarn | Connects to a YARN cluster. `HADOOP_CONF_DIR` or `YARN_CONF_DIR` must point to the Hadoop configuration directory on the machine that runs spark-submit. |
| spark://host:port | Connects to a Spark standalone cluster at the given host and port. |
| mesos://host:port | Connects to a Mesos cluster at the given host and port. |
| k8s://host:port | Runs on Kubernetes. |
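As a rough illustration of how the master values above map to a cluster manager, the sketch below classifies a `--master` string. `describe_master` is a hypothetical helper, not part of Spark, and it only covers the forms listed in the table; `os.cpu_count()` stands in for "as many cores as the machine has".

```python
import os
import re


def describe_master(master):
    """Return (cluster_manager, local_cores) for a --master value.

    Hypothetical helper covering only the forms listed in the table above.
    local_cores is None for anything that is not local mode.
    """
    if master == "local":
        return ("local", 1)                      # single core
    match = re.fullmatch(r"local\[(\*|\d+)\]", master)
    if match:
        n = match.group(1)
        # local[*] uses every core of the machine, local[N] exactly N.
        return ("local", os.cpu_count() if n == "*" else int(n))
    if master == "yarn":
        return ("yarn", None)
    prefixes = (("spark://", "standalone"),
                ("mesos://", "mesos"),
                ("k8s://", "kubernetes"))
    for prefix, name in prefixes:
        if master.startswith(prefix):
            return (name, None)
    raise ValueError(f"unrecognized master: {master!r}")


for value in ("local", "local[4]", "yarn", "spark://host:7077"):
    print(value, "->", describe_master(value))
```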

 

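deploy mode (how the application is deployed)

Sets the physical location of the resources the application is given when it runs, most importantly where the driver runs.

A Spark application can run in the three modes below. Note that the `--deploy-mode` flag itself only accepts client (the default) and cluster; local mode is selected through `--master local[...]`.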

| Deploy mode | Description |
| --- | --- |
| local mode | Used when the Spark application runs on a single machine. |
| client mode | The Spark driver runs outside the cluster (on a gateway machine or edge node). The Spark executors run in the cluster. |
| cluster mode | Both the Spark driver and the executors run on cluster worker nodes. |

The biggest difference between client and cluster mode is where the driver runs.

client mode

In client mode the application master only requests resources from YARN (the ResourceManager). Because the driver runs on the client machine, the resources in use and the logs can be checked right away, so this mode is mainly used for interactive debugging during development.

Reference: https://docs.cloudera.com/runtime/7.2.17/running-spark-applications/topics/spark-yarn-deployment-modes.html

 

cluster mode

The driver runs inside the cluster, and the compiled jar, Python script, and similar files are sent to the cluster manager.

Because the driver runs inside the application master, its logs are harder to check directly. This deployment mode is mainly used in production.

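The placement rules for the driver and executors can be summed up in a few lines. This is only a sketch of the rules described above; `process_placement` and the location labels it returns are made up for illustration.

```python
def process_placement(master, deploy_mode="client"):
    """Return where the driver and executors run (hypothetical helper)."""
    if master == "local" or master.startswith("local["):
        # Local mode: everything runs on the machine that submitted the job.
        return {"driver": "submitting machine", "executors": "submitting machine"}
    if deploy_mode == "client":
        # Driver stays with the client; only executors go to the cluster.
        return {"driver": "submitting machine", "executors": "cluster"}
    if deploy_mode == "cluster":
        # Driver is launched inside the cluster as well.
        return {"driver": "cluster", "executors": "cluster"}
    raise ValueError(f"unknown deploy mode: {deploy_mode!r}")


print(process_placement("yarn", "client"))
print(process_placement("yarn", "cluster"))
```

A practical consequence: in client mode the submitting machine has to stay up for the whole run, while in cluster mode it can disconnect once the application has been accepted.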

 

 

Application Properties Config

Below are various application properties. They are passed in the form `--conf <key1>=<value1> --conf <key2>=<value2>`.

Only the frequently used properties are listed here; for the others, see the link below.

Most of them can also be set in the same way through the `SparkConf` class.

Reference: https://spark.apache.org/docs/latest/configuration.html#application-properties

| Property name | Default | Description |
| --- | --- | --- |
| spark.driver.cores | 1 | Number of cores the driver uses. Only effective in cluster mode. |
| spark.driver.memory | 1g | Amount of memory the driver uses. In client mode this cannot be set through SparkConf inside the application, because the driver JVM has already started by the time the SparkContext is initialized. Pass it to spark-submit (`--driver-memory`) or set it in spark-defaults.conf instead. |
| spark.executor.memory | 1g | Amount of memory per executor. |
| spark.driver.memoryOverhead | driverMemory * 0.1 (minimum 384m) | Amount of off-heap memory allocated per driver in cluster mode. Supported only on YARN and Kubernetes. |
| spark.executor.cores | 1 | Number of cores per executor. The default of 1 applies on YARN; in standalone mode an executor takes all available cores on the worker by default. |
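As a worked example of the `spark.driver.memoryOverhead` default, the sketch below adds the overhead to the driver heap to estimate the memory requested for the driver in cluster mode. The helper names are hypothetical, only the `m` and `g` suffixes are handled, and the actual container size may be rounded up further by the cluster manager.

```python
def parse_mib(size):
    """Convert a size such as '512m' or '1g' to MiB (m and g suffixes only)."""
    units = {"m": 1, "g": 1024}
    return int(size[:-1]) * units[size[-1].lower()]


def driver_request_mib(driver_memory="1g", overhead_factor=0.10, min_overhead_mib=384):
    """Estimate heap + overhead for the driver, using the table's default rule."""
    heap = parse_mib(driver_memory)
    # Overhead is 10% of the heap, but never less than 384 MiB.
    overhead = max(int(heap * overhead_factor), min_overhead_mib)
    return heap + overhead


print(driver_request_mib("1g"))   # 1024 + 384 (the minimum applies) = 1408
print(driver_request_mib("8g"))   # 8192 + 819 (10% of the heap)     = 9011
```

For small drivers the 384 MiB floor dominates, so a 1g driver asks for noticeably more than 1 GiB in total.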

 

[Note] Precedence of Spark configuration

There are several ways to apply a Spark config, and they take precedence in the following order.

  • 1st: values set on the SparkConf class (using `set()`)
  • 2nd: flags passed when running spark-submit/spark-shell
  • 3rd: values in conf/spark-defaults.conf
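The precedence order above behaves like a layered lookup. The sketch below models it with plain dictionaries and `collections.ChainMap`; it does not call Spark at all, and `effective_conf` and the sample values are made up for illustration.

```python
from collections import ChainMap


def effective_conf(spark_conf_set, submit_flags, defaults_file):
    """Merge the three config sources (hypothetical helper).

    ChainMap searches its maps from left to right, so the first source
    that defines a key wins.
    """
    return dict(ChainMap(spark_conf_set, submit_flags, defaults_file))


defaults_file = {"spark.executor.memory": "1g", "spark.executor.cores": "1"}
submit_flags = {"spark.executor.memory": "2g", "spark.driver.cores": "2"}
spark_conf_set = {"spark.executor.memory": "4g"}

resolved = effective_conf(spark_conf_set, submit_flags, defaults_file)
print(resolved["spark.executor.memory"])  # 4g, set in SparkConf
print(resolved["spark.driver.cores"])     # 2, from the spark-submit flag
print(resolved["spark.executor.cores"])   # 1, from spark-defaults.conf
```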

 

Other options

| Flag | Description |
| --- | --- |
| --class | Specifies the class that contains main (when deploying a Java or Scala program). |
| --jars | Comma-separated list of JAR files that must be on the application's classpath. Any external JAR the application uses has to be listed with this option. |
| --packages | Comma-separated list of Maven coordinates of the JARs to include on the driver/executor classpath. The search order is: local Maven repository -> Maven Central -> remote repositories. |
| --py-files | Comma-separated list of .zip, .egg, and .py files to put on the PYTHONPATH. In client deploy mode the path must point to a local file; in cluster deploy mode it must be either a local file (present on every executor) or a URL visible from inside the cluster. |
| --repositories | Comma-separated list of remote repositories to search for the Maven coordinates given in --packages. |
| --driver-class-path | Classpath entries for the driver. JARs added with --jars are included in the classpath automatically. |

Reference: https://docs.cloudera.com/runtime/7.2.15/running-spark-applications/topics/spark-submit-options.html
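Since `--packages` expects comma-separated Maven coordinates of the form `groupId:artifactId:version`, a quick format check before submitting can catch typos early. The sketch below is a hypothetical helper, and the coordinates in the example are placeholders rather than real artifacts.

```python
def parse_packages(value):
    """Split a --packages value into (groupId, artifactId, version) tuples."""
    coordinates = []
    for item in value.split(","):
        parts = item.strip().split(":")
        # Each coordinate needs exactly three non-empty parts.
        if len(parts) != 3 or not all(parts):
            raise ValueError(f"expected groupId:artifactId:version, got {item!r}")
        coordinates.append(tuple(parts))
    return coordinates


# Placeholder coordinates, for illustration only.
packages = "com.example:lib-a:1.0.0,com.example:lib-b:2.3.1"
for group_id, artifact_id, version in parse_packages(packages):
    print(group_id, artifact_id, version)
```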
