๐Ÿฅ

[Spark] TaskMemoryManager: Failed to allocate a page (2097152 bytes), try again. ์˜ค๋ฅ˜ ๋ฐœ์ƒ ์‹œ ํ•ด๊ฒฐ ๋ฐฉ๋ฒ• ๋ณธ๋ฌธ

๋ฐ์ดํ„ฐ/Spark

[Spark] TaskMemoryManager: Failed to allocate a page (2097152 bytes), try again. ์˜ค๋ฅ˜ ๋ฐœ์ƒ ์‹œ ํ•ด๊ฒฐ ๋ฐฉ๋ฒ•

•8• 2023. 5. 22. 22:13

๋ฎจ์ œ์ƒํ™ฉ

dataframe ๋‘ ๊ฐœ๋ฅผ ์กฐ์ธํ•˜๋ ค๋Š”๋ฐ ์•„๋ž˜์™€ ๊ฐ™์€ ์›Œ๋‹์ด ์ฃผ๋ฅด๋ฅต ๋ฐœ์ƒํ•˜๋”๋‹ˆ ์˜ค๋ฅ˜๋ฅผ ์ถœ๋ ฅํ•˜๊ณ  ์–ดํ”Œ๋ฆฌ์ผ€์ด์…˜์ด ์ข…๋ฃŒ๋๋‹ค.

...
23/05/22 04:39:35 WARN TaskMemoryManager: Failed to allocate a page (2097152 bytes), try again.
23/05/22 04:39:35 WARN TaskMemoryManager: Failed to allocate a page (2097152 bytes), try again.
23/05/22 04:39:35 WARN TaskMemoryManager: Failed to allocate a page (2097152 bytes), try again.
23/05/22 04:39:36 WARN TaskMemoryManager: Failed to allocate a page (2097152 bytes), try again.
23/05/22 04:39:36 WARN TaskMemoryManager: Failed to allocate a page (2097152 bytes), try again.
23/05/22 04:39:36 WARN TaskMemoryManager: Failed to allocate a page (2097152 bytes), try again.
23/05/22 04:39:36 WARN TaskMemoryManager: Failed to allocate a page (2097152 bytes), try again.
23/05/22 04:39:37 WARN TaskMemoryManager: Failed to allocate a page (2097152 bytes), try again.
23/05/22 04:39:37 WARN TaskMemoryManager: Failed to allocate a page (2097152 bytes), try again.
23/05/22 04:39:38 WARN TaskMemoryManager: Failed to allocate a page (2097152 bytes), try again.
23/05/22 04:39:38 WARN TaskMemoryManager: Failed to allocate a page (2097152 bytes), try again.
23/05/22 04:39:38 WARN TaskMemoryManager: Failed to allocate a page (2097152 bytes), try again.
23/05/22 04:39:39 WARN TaskMemoryManager: Failed to allocate a page (2097152 bytes), try again.
23/05/22 04:39:39 WARN TaskMemoryManager: Failed to allocate a page (2097152 bytes), try again.
23/05/22 04:39:39 WARN TaskMemoryManager: Failed to allocate a page (2097152 bytes), try again.
23/05/22 04:39:39 WARN TaskMemoryManager: Failed to allocate a page (2097152 bytes), try again.
23/05/22 04:39:40 WARN TaskMemoryManager: Failed to allocate a page (2097152 bytes), try again.
23/05/22 04:39:40 WARN TaskMemoryManager: Failed to allocate a page (2097152 bytes), try again.
23/05/22 04:39:40 WARN TaskMemoryManager: Failed to allocate a page (2097152 bytes), try again.
23/05/22 04:39:41 WARN TaskMemoryManager: Failed to allocate a page (2097152 bytes), try again.
23/05/22 04:39:41 WARN TaskMemoryManager: Failed to allocate a page (2097152 bytes), try again.
23/05/22 04:39:41 WARN TaskMemoryManager: Failed to allocate a page (2097152 bytes), try again.
23/05/22 04:39:41 WARN TaskMemoryManager: Failed to allocate a page (2097152 bytes), try again.
...

 

์ด ์—๋Ÿฌ๋Š” TaskMemoryManager๊ฐ€ ํŽ˜์ด์ง€๋ฅผ ํ• ๋‹นํ•  ๋•Œ ๋ฐœ์ƒํ•˜๋Š” ๊ฒƒ ๊ฐ™๋‹ค. 

(์ฐธ๊ณ :

 https://github.com/apache/spark/blob/d3f12df6e09ee47dbd7c9e08c9962430a2601941/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java#L334)

resourceManager์—๊ฒŒ ์ด์ „์— ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น์„ ํ—ˆ๋ฝ๋ฐ›์€ ํ›„์— `tungstenMemoryAllocator`๋ฅผ ์ด์šฉํ•ด ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ํ• ๋‹น๋ฐ›๊ณ  `pageNumber`๋ฅผ ์—…๋ฐ์ดํŠธ ํ•œ๋‹ค. ์ด๋•Œ OutOfMemory๊ฐ€ ๋ฐœ์ƒํ•˜๋ฉด ํ•ด๋‹น ์—๋Ÿฌ๋ฅผ ๋ฐœ์ƒ์‹œํ‚จ๋‹ค.

 

๋ฌธ์ œ์˜ ์ƒํ™ฉ์—์„œ๋Š” ํ•œ ์ชฝ์˜ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„ ์‚ฌ์ด์ฆˆ๊ฐ€ ์ž‘์•˜๋‹ค. (skewed data)

์‹ค์ œ๋กœ  dag๋ฅผ ์‚ดํŽด๋ณด๋‹ˆ broadcastHashJoin์„ ์‚ฌ์šฉํ•˜๊ณ  ์žˆ์—ˆ๋‹ค.

๋ฐ์ดํ„ฐ์‚ฌ์ด์ฆˆ๊ฐ€ ์ถฉ๋ถ„ํžˆ ์ž‘์€ ๊ฒƒ ๊ฐ™์€๋ฐ ์™œ ์—ฌ๊ธฐ์„œ ์˜ค๋ฅ˜๊ฐ€ ๋‚˜๋Š”์ง€๋Š” ๋ชจ๋ฅด๊ฒ ๋‹ค ใ… 

 

 

ํ•ด๊ฒฐ๋ฐฉ๋ฒ•1. ์ž‘์€ ์ชฝ์˜ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„ ํฌ๊ธฐ๋ฅผ ์ž„์˜๋กœ ํ‚ค์šฐ๊ธฐ

small_data = small_data.union(small_data).union(small_data).union(small_data).union(small_data).union(small_data).union(small_data).union(small_data)
big_data.join(small_data.distinct(), [conditions], how='left')

์•ผ๋งค๊ธด ํ•œ๋ฐ ์ž‘์€ ๋ฐ์ดํ„ฐ๋ฅผ union์„ ํ†ตํ•ด ํฌ๊ธฐ๋ฅผ ํ‚ค์›Œ์ค€ ๋’ค, join ์‹œ distinct()๋ฅผ ํ†ตํ•ด ์ค‘๋ณต์ œ๊ฑฐํ•˜๊ณ  ์‚ฌ์šฉํ•˜๋ฉด ์˜ค๋ฅ˜ ์—†์ด ์ •์ƒ์ ์œผ๋กœ ์‹คํ–‰๋๋‹ค. 

์ด ๊ฒฝ์šฐ์—๋Š” broadcast hash join์„ ์‚ฌ์šฉํ•˜์ง€ ์•Š์•˜๋‹ค.

 

ํ•ด๊ฒฐ๋ฐฉ๋ฒ•2. executor ๋ฉ”๋ชจ๋ฆฌ ํฌ๊ธฐ ๋Š˜๋ฆฌ๊ธฐ

spark.executor.memory

 

ํ•ด๊ฒฐ๋ฐฉ๋ฒ•3. broadcastjoin ๋น„ํ™œ์„ฑํ™”

์•„๋ž˜์˜ config ๊ฐ’์„ -1๋กœ ์„ค์ •ํ•˜๋ฉด ๋น„ํ™œ์„ฑํ™”๋˜์–ด ๋‹ค๋ฅธ ์กฐ์ธ ๋ฐฉ๋ฒ•์„ ์‚ฌ์šฉํ–ˆ๋‹ค.

spark.sql.autoBroadcastJoinThreshold

๊ทธ๋Ÿฐ๋ฐ ์ด์ƒํ•œ๊ฑด ์ด๊ฒŒ ๋  ๋•Œ๋„ ์žˆ๊ณ  ์•ˆ๋ ๋•Œ๋„ ์žˆ์–ด์„œ ์กฐ๊ฑด์€ ํ™•์ธํ•ด๋ด์•ผํ•  ๊ฒƒ ๊ฐ™๋‹ค.

 

 

์ถ”๊ฐ€: Project Tungsten

TaskMemoryManager.java์—์„œ ๋ณผ ์ˆ˜ ์žˆ๋‹ค์‹œํ”ผ ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ จ ํ•จ์ˆ˜๋‚˜ ๋ณ€์ˆ˜๋ช…์— `tungsten`์ด๋ผ๋Š” ๋‹จ์–ด๊ฐ€ ๋งŽ์ด ๋“ฑ์žฅํ•ด์„œ ๊ถ๊ธˆํ•ด์„œ ์ฐพ์•„๋ดค๋‹ค.

Tungsten์ด๋ผ๋Š” ํ”„๋กœ์ ํŠธ๊ฐ€ ์žˆ๋Š”๋ฐ Spark 2๋ถ€ํ„ฐ ๋„์ž…๋œ ์—”์ง„์ด๋‹ค.

๋””์ŠคํฌI/O๋‚˜ ๋„คํŠธ์›Œํฌ ์ชฝ์„ ๊ฐœ์„ ํ•˜๋ฉด์„œ ์„ฑ๋Šฅ ํ–ฅ์ƒ์„ ๋„๋ชจํ•˜๋Š” ๋‹ค๋ฅธ ํ”„๋กœ์ ํŠธ์™€๋Š” ๋‹ฌ๋ฆฌ ์ŠคํŒŒํฌ์˜ CPU์™€ ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ์— ๊ธฐ์—ฌํ•˜๊ณ  ์ด ๋ถ€๋ถ„์—์„œ ๋งŽ์€ ์„ฑ๋Šฅ ํ–ฅ์ƒ์„ ์ฑ…์ž„์ง€๊ณ  ์žˆ๋‹ค๊ณ  ํ•œ๋‹ค.

(sun.misc.Unsafe ์‚ฌ์šฉ) Tungsten์€ JVM์˜ heap memory๋ฅผ ์‚ฌ์šฉํ•˜์ง€ ์•Š๊ณ  native ์˜์—ญ ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์ด๋Ÿฌํ•œ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ–ˆ๋‹ค๊ณ  ํ•œ๋‹ค. 

์ฐธ๊ณ ๋กœ sparkSQL ์‚ฌ์šฉ ์‹œ ์ตœ์ ํ™”์—๋„ ์‚ฌ์šฉํ•œ๋‹ค: Catalyst optimizer๋กœ ๋…ผ๋ฆฌ์  ์ฟผ๋ฆฌ ๊ณ„ํš์—์„œ ๋ฌผ๋ฆฌ์  ์ฟผ๋ฆฌ ๊ณ„ํš์„ ์ƒ์„ฑํ•œ ํ›„์— Tungsten์˜ Codegen ๊ธฐ๋Šฅ์„ ์‚ฌ์šฉํ•˜์—ฌ ์ตœ์ ํ™”๋œ ์ฝ”๋“œ๋ฅผ ์ƒ์„ฑํ•œ๋‹ค๊ณ  ํ•œ๋‹ค.

 

Tungsten์˜ ์—ญํ• ์€ ํฌ๊ฒŒ ์•„๋ž˜์™€ ๊ฐ™๋‹ค.

1. Off-Heap Memory Management: ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ๊ด€๋ฆฌ

2. Cache Locality: ์บ์‹œ๋ฅผ ์ž˜ ๋ฐฐ์น˜..(for high cache hit rate)

3. CodeGen: Whole-Stage Code Generation

 

๊ด€๋ จ ์†์„ฑ์œผ๋กœ๋Š” `spark.sql.tungsten.enabled` ๊ฐ€ ์žˆ๋‹ค.

 

(์ฐธ๊ณ : https://www.linkedin.com/pulse/catalyst-tungsten-apache-sparks-speeding-engine-deepak-rajak)