๐Ÿฅ


๋ฐ์ดํ„ฐ/Spark

[Spark] Accessing S3 from Spark (ACCESS_KEY, SECRET_KEY)

2023. 12. 19. 18:14

Setting the Hadoop configuration seems to differ a bit between Spark 2.4.4 and earlier and Spark 2.4.5 and later.

Spark version 2.4.4 and earlier

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("myapp") \
    .config("some.config", "some.value") \
    .getOrCreate()

# For regions that require Signature Version 4, set the property below
spark.sparkContext.setSystemProperty("com.amazonaws.services.s3.enableV4", "true")

# S3 connection settings (mybucket, my_s3_url, my_access_key, my_secret_key are defined elsewhere)
spark.sparkContext._jsc.hadoopConfiguration().set(f"fs.s3a.bucket.{mybucket}.endpoint", my_s3_url)
spark.sparkContext._jsc.hadoopConfiguration().set(f"fs.s3a.bucket.{mybucket}.access.key", my_access_key)
spark.sparkContext._jsc.hadoopConfiguration().set(f"fs.s3a.bucket.{mybucket}.secret.key", my_secret_key)

In 2.4.4 and earlier, accessing the Hadoop configuration from Spark has to go through the `_jsc` object (the Java Spark context).

The reason the keys are given in the form fs.s3a.bucket.<bucket-name>. ... is to allow access to multiple buckets that use different keys.

(Reference: https://hadoop.apache.org/docs/r2.8.0/hadoop-aws/tools/hadoop-aws/index.html#Configurations_different_S3_buckets)
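To illustrate that per-bucket pattern, the key names can be generated with a small helper (the helper name and the buckets dict below are illustrative, not part of any Spark or Hadoop API):

```python
def per_bucket_s3a_conf(buckets):
    """Build fs.s3a.bucket.<name>.* entries for buckets with separate credentials."""
    conf = {}
    for name, info in buckets.items():
        conf[f"fs.s3a.bucket.{name}.endpoint"] = info["endpoint"]
        conf[f"fs.s3a.bucket.{name}.access.key"] = info["access_key"]
        conf[f"fs.s3a.bucket.{name}.secret.key"] = info["secret_key"]
    return conf

# Each entry would then be applied with hadoopConfiguration().set(key, value)
entries = per_bucket_s3a_conf({
    "bucket-a": {"endpoint": "https://s3.example.com",
                 "access_key": "KEY_A", "secret_key": "SECRET_A"},
    "bucket-b": {"endpoint": "https://s3.example.com",
                 "access_key": "KEY_B", "secret_key": "SECRET_B"},
})
```

Because each bucket gets its own key namespace, the credentials never collide even on one shared Hadoop configuration.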

 

 

 

Spark version 2.4.5 and later

from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf()
conf.set('spark.hadoop.fs.s3a.endpoint', my_s3_url)
conf.set('spark.hadoop.fs.s3a.access.key', my_access_key)
conf.set('spark.hadoop.fs.s3a.secret.key', my_secret_key)

# pass the SparkConf to the builder so the settings take effect
spark = SparkSession.builder.appName("myapp") \
    .config(conf=conf) \
    .getOrCreate()

From 2.4.5 on, you can set the Hadoop configuration with plain `SparkConf.set()` instead of going through `_jsc`.

(Reference: https://spark.apache.org/docs/latest/configuration.html#custom-hadoophive-configuration)
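The same spark.hadoop.* properties can also be supplied at submit time instead of in code; a sketch (the endpoint URL, key values, and script name are placeholders):

```
spark-submit \
  --conf spark.hadoop.fs.s3a.endpoint=https://s3.example.com \
  --conf spark.hadoop.fs.s3a.access.key=MY_ACCESS_KEY \
  --conf spark.hadoop.fs.s3a.secret.key=MY_SECRET_KEY \
  myapp.py
```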

 

 

[Reference] How to obtain the ACCESS_KEY and SECRET_KEY

Since key values can't be hard-coded into a script, some searching suggests either using the boto3 library or putting them into the Spark config.

Using the boto3 library

The order in which a boto client resolves credentials:

(Reference: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html#configuring-credentials)
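For instance, the environment-variable step of that lookup chain can be mimicked with stdlib code alone (a sketch of the idea, not boto3's actual implementation; the function name is made up):

```python
import os

def env_credentials(environ=os.environ):
    """Return credentials from AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY, or None."""
    access_key = environ.get("AWS_ACCESS_KEY_ID")
    secret_key = environ.get("AWS_SECRET_ACCESS_KEY")
    if access_key and secret_key:
        return {"access_key": access_key, "secret_key": secret_key}
    # boto3 would fall through to the next provider (shared credentials file, etc.)
    return None

creds = env_credentials({"AWS_ACCESS_KEY_ID": "AKIAEXAMPLE",
                         "AWS_SECRET_ACCESS_KEY": "example-secret"})
```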

If your credentials are managed by one of those methods, you can use the boto library as follows.

import boto3

# profile_name can be omitted when using the default profile
boto_session = boto3.Session(profile_name='my-profile')
credentials = boto_session.get_credentials()
access_key = credentials.access_key
secret_key = credentials.secret_key

์ด ๋ฐฉ๋ฒ•์€ cluster deploy mode ์—์„œ๋Š” ์‚ฌ์šฉํ•˜์ง€ ๋ชปํ•  ๊ฒƒ ๊ฐ™๋‹ค.

spark-defaults.conf ์— ์„ค์ •

1ํšŒ์„ฑ ์Šคํฌ๋ฆฝํŠธ์— ๋งค๋ฒˆ boto3์„ ์ด์šฉํ•ด access key์™€ secret key๋ฅผ ๋„ฃ๋Š”๊ฒŒ 4์ค„์ด์ง€๋งŒ ๊ท€์ฐฎ์„ ์ˆ˜ ์žˆ๋‹ค.

๊ทธ๋Ÿด ๋•Œ๋Š” spark ์„ค์น˜ ๊ฒฝ๋กœ์— ์žˆ๋Š” spark-defaults.conf ์„ค์ •ํŒŒ์ผ์„ ์ˆ˜์ •ํ•˜๋ฉด ๋  ๋“ฏ ํ•˜๋‹ค.

ํŒŒ์ผ์„ ์—ด์–ด๋ณด๋ฉด "spark." ์œผ๋กœ ์‹œ์ž‘ํ•˜๋Š” ์—ฌ๋Ÿฌ๊ฐ€์ง€ ์„ค์ •๊ฐ’๋“ค์ด ์žˆ๋Š”๋ฐ ์œ„์—์„œ ์ŠคํŒŒํฌ ์ฝ”๋“œ ๋‚ด์—์„œ ์„ค์ •ํ–ˆ๋˜ spark.hadoop.fs.s3a.access.key ๊ฐ™์€ ๊ฐ’๋“ค์„ ์—ฌ๊ธฐ์— ์ง‘์–ด๋„ฃ์–ด์ฃผ๋ฉด ๋œ๋‹ค. 

 

 

* Additional notes!

When a PySpark job is submitted and the SparkContext is initialized, the SparkContext initializes a Java gateway via Py4J (connecting to the JVM).

`sc._jvm`: the gateway to the JVM

`sc._jsc`: a proxy to that JVM's Spark context

Using the `_jsc` member is generally said to be discouraged.

Additionally, a UDF defined in Python cannot run on JVM objects alone, so a Python subprocess is launched -> the data-transfer overhead creates a bottleneck.

It's best to write code using only the DataFrame API as much as possible.

Organize this part further later.