๐Ÿฅ

[Spark] Repartition๊ณผ Coalesce ๋ณธ๋ฌธ

์นดํ…Œ๊ณ ๋ฆฌ ์—†์Œ

[Spark] Repartition and Coalesce

2024. 4. 1. 15:21

When you run computations in Spark, you can end up with far too many partitions, or too few. In those situations you can set the number of partitions with repartition and coalesce.

Both `repartition()` and `coalesce()` are functions that set the number of partitions.

Both adjust the partition count, but they differ in that `coalesce()` can only reduce the number of partitions.


1. Execution example

repartition๊ณผ coalesce์˜ ๊ฐ€์žฅ ํฐ ์ฐจ์ด์ ์€ shuffle ์ˆ˜ํ–‰ ์—ฌ๋ถ€์— ์žˆ๋‹ค.


When `coalesce` is used to reduce the number of partitions, you can see in the Spark UI that the coalesce runs in stage 2.

```python
# df1_path / df2_path / some_conditions are placeholders for the real reads and join keys
df1 = spark.read.parquet(df1_path)   # load df1
df2 = spark.read.parquet(df2_path)   # load df2
df3 = df1.join(df2, some_conditions, 'inner')
df3.coalesce(5).write.parquet('my_path')
```


๊ทธ๋Ÿฐ๋ฐ ์ž์„ธํžˆ ๋ณด๋ฉด default ํŒŒํ‹ฐ์…˜ ์ˆ˜๋Š” 200์ผํ…๋ฐ task๊ฐ€ 200์ด ์žˆ๋Š” stage๋Š” ๋ณด์ด์ง€ ์•Š๋Š”๋‹ค.

์ด๋ถ€๋ถ„์€ "5"๋ผ๋Š” ํŒŒํ‹ฐ์…˜ ๊ฐ’์ด ์ด์ „์˜ ๋ถ€๋ชจ rdd ์— overwrite ๋˜์—ˆ๊ณ  stage2์—์„œ 5๊ฐœ์˜ task๋กœ ์ˆ˜ํ–‰๋˜์—ˆ๋‹ค.


`repartition` ์‚ฌ์šฉ์˜ ๊ฒฝ์šฐ์—๋Š” `coalesce`์™€๋Š” ๋‹ฌ๋ฆฌ 4๊ฐœ์˜ stage๊ฐ€ ์ˆ˜ํ–‰๋˜์—ˆ๋‹ค.

```python
# same placeholder setup as above, writing with repartition instead of coalesce
df1 = spark.read.parquet(df1_path)   # load df1
df2 = spark.read.parquet(df2_path)   # load df2
df3 = df1.join(df2, some_conditions, 'inner')
df3.repartition(5).write.parquet('my_path')
```

stage 5์—์„œ Join์ด ํŒŒํ‹ฐ์…˜ ๊ฐœ์ˆ˜๋งŒํผ 200๊ฐœ์˜ ๋ณ‘๋ ฌ task๋กœ ์‹คํ–‰๋˜์—ˆ๊ณ , ์ด๋ถ€๋ถ„์—์„œ shuffle์ด ๋ฐœ์ƒํ•œ ๊ฒƒ์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค.


2. How they work

repartition์˜ ๊ฒฝ์šฐ์—๋Š” full shuffle์ด ์ผ์–ด๋‚˜๋ฉด์„œ ํŒŒํ‹ฐ์…˜ ์ˆ˜๊ฐ€ ์กฐ์ •์ด ๋˜๊ณ ,

coalesce๋Š” ์…”ํ”Œ์„ ์‚ฌ์šฉํ•˜์ง€ ์•Š๊ณ  ๋‹จ์ˆœ์ด ํŒŒํ‹ฐ์…˜์„ combineํ•˜๋Š” ๋ฐฉ์‹์œผ๋กœ ํŒŒํ‹ฐ์…˜ ์กฐ์ •์ด ์ˆ˜ํ–‰๋˜๋Š” ๊ฒƒ์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค.

Source: https://dongza.tistory.com/18
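The same behavior is easy to observe at the RDD level (a minimal local sketch, assuming a `SparkSession` named `spark`): `coalesce` merges existing partitions where they already sit, while `repartition` reshuffles every record.

```python
rdd = spark.sparkContext.parallelize(range(12), 6)  # 6 input partitions of 2 elements

# coalesce(3): existing partitions are merged, no shuffle is triggered
print(rdd.coalesce(3).glom().collect())
# e.g. [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]

# repartition(3): a full shuffle redistributes all elements across 3 partitions
print(rdd.repartition(3).glom().collect())
```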


repartition์€ execution time์€ ์ฆ๊ฐ€ํ•˜์ง€๋งŒ shuffle ๋•๋ถ„์— ํŒŒํ‹ฐ์…˜๋งˆ๋‹ค ๊ณ ๋ฅธ ๋ฐ์ดํ„ฐ ๋ถ„ํฌ๋ฅผ ๊ฐ€์งˆ ์ˆ˜ ์žˆ๋‹ค.

๋ฐ˜๋Œ€๋กœ coalesce๋Š” ๋ถˆ๊ท ํ˜•ํ•œ ๋ถ„ํฌ๋ฅผ ๊ฐ€์ง„ ํŒŒํ‹ฐ์…˜์œผ๋กœ ์žฌ์กฐ์ •๋  ์ˆ˜ ์žˆ๋‹ค.
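A quick way to compare the resulting distributions (a sketch; the range and the filter are made up for illustration) is to count rows per partition with `spark_partition_id()`:

```python
from pyspark.sql.functions import spark_partition_id

def rows_per_partition(df):
    """Count how many rows landed in each partition."""
    return (df.withColumn("pid", spark_partition_id())
              .groupBy("pid").count()
              .orderBy("pid"))

# a filter that empties most of the 20 input partitions, leaving skewed sizes
filtered = spark.range(0, 100_000, numPartitions=20).where("id < 30000")

rows_per_partition(filtered.coalesce(4)).show()     # merged partitions, uneven counts
rows_per_partition(filtered.repartition(4)).show()  # shuffled, roughly even counts
```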


For reference, if you look at the repartition function itself, you can see that it simply calls coalesce with shuffle set to `true`.

(Reference: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L479)

```scala
  /**
   * Return a new RDD that has exactly numPartitions partitions.
   *
   * Can increase or decrease the level of parallelism in this RDD. Internally, this uses
   * a shuffle to redistribute data.
   *
   * If you are decreasing the number of partitions in this RDD, consider using `coalesce`,
   * which can avoid performing a shuffle.
   */
  def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
    coalesce(numPartitions, shuffle = true)
  }
```
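The same relationship shows up in PySpark's RDD API (a small sketch): passing `shuffle=True` to `coalesce` gives the repartition behavior, including the ability to increase the partition count.

```python
rdd = spark.sparkContext.parallelize(range(1000), 4)

# without a shuffle, coalesce cannot go above the current partition count
print(rdd.coalesce(8).getNumPartitions())                # 4
# with shuffle=True it behaves like repartition and can also increase it
print(rdd.coalesce(8, shuffle=True).getNumPartitions())  # 8
print(rdd.repartition(8).getNumPartitions())             # 8
```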


https://www.youtube.com/watch?v=ijD5zuEV8U8

https://blog.51cto.com/u_14009243/5975125