πŸ₯

[Spark] SQL Hint λ³Έλ¬Έ

데이터/Spark

[Spark] SQL Hint

•8• 2024. 4. 1. 20:38

SQL Hint

μ°Έκ³ : https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-hints.html

문법

/*+ hint [ , ... ] */

 

νŒŒν‹°μ…”λ‹ 힌트

νŒŒν‹°μ…”λ‹ μ „λž΅μ„ μ œμ•ˆν•  수 μžˆλ‹€. 

COALESCE

`coalesce` dataset API와 λ™μΌν•˜λ‹€. λ§€κ°œλ³€μˆ˜λŠ” νŒŒν‹°μ…˜ 개수이고, μ§€μ •λœ νŒŒν‹°μ…˜ 수둜 쀄일 수 μžˆλ‹€.

SELECT /*+ COALESCE(3) */ * FROM t;

 

REPARTITION

`repartition` dataset API 와 λ™μΌν•˜λ‹€.νŒŒν‹°μ…˜ 개수, 컬럼λͺ…을 λ§€κ°œλ³€μˆ˜λ‘œ μ‚¬μš©ν•  수 있고 μ§€μ •λœ 수의 νŒŒν‹°μ…˜μœΌλ‘œ λ‹€μ‹œ λΆ„ν• ν•˜λŠ”λ°μ— μ‚¬μš©λœλ‹€.

SELECT /*+ REPARTITION(3) */ * FROM t;

SELECT /*+ REPARTITION(c) */ * FROM t;

SELECT /*+ REPARTITION(3, c) */ * FROM t;

 

REPARTITION_BY_RANGE

μ§€μ •λœ 수의 νŒŒν‹°μ…˜μœΌλ‘œ λΆ„ν• ν•˜λŠ”λ°μ— μ‚¬μš©λœλ‹€. `repartitionByRange` api와 λ™μΌν•˜λ‹€.

SELECT /*+ REPARTITION_BY_RANGE(c) */ * FROM t;

SELECT /*+ REPARTITION_BY_RANGE(3, c) */ * FROM t;

 

REBALANCE

SQL Hintμ—μ„œλ§Œ μ‚¬μš©ν•  수 μžˆλ‹€. 쿼리 κ²°κ³Ό Output νŒŒν‹°μ…˜μ˜ κ· ν˜•μ„ μž¬μ‘°μ •ν•˜λŠ”λ°μ— μ‚¬μš©λœλ‹€. λͺ¨λ“  νŒŒν‹°μ…˜μ΄ μ μ ˆν•œ 크기가 λ˜λ„λ‘ ν•œλ‹€. 컬럼λͺ…을 λ§€κ°œλ³€μˆ˜λ‘œ μ‚¬μš©ν•œλ‹€λ©΄ μ„€μ •ν•œ μ»¬λŸΌμ„ κΈ°μ€€μœΌλ‘œ 쿼리 κ²°κ³Όλ₯Ό μ΅œμ„ μ˜ λ…Έλ ₯으둜 λΆ„ν• ν•œλ‹€. skewed data라면 REBALANCEλ₯Ό μ‚¬μš©ν•  수 μžˆλ‹€. AQEκ°€ ν™œμ„±ν™”λ˜μ§€ μ•Šμ•˜λ‹€λ©΄ λ¬΄μ‹œλ˜λŠ” νžŒνŠΈμ΄λ‹€.

SELECT /*+ REBALANCE */ * FROM t;

SELECT /*+ REBALANCE(3) */ * FROM t;

SELECT /*+ REBALANCE(c) */ * FROM t;

SELECT /*+ REBALANCE(3, c) */ * FROM t;

 

μ—¬λŸ¬ 힌트λ₯Ό μ œμ‹œν•œλ‹€λ©΄ κ°€μž₯ μ™Όμͺ½μ— μžˆλŠ” Hintλ₯Ό μ„ νƒν•œλ‹€. (μ•„λž˜μ˜ 경우 REPARTITION)

-- multiple partitioning hints
EXPLAIN EXTENDED SELECT /*+ REPARTITION(100), COALESCE(500), REPARTITION_BY_RANGE(3, c) */ * FROM t;
== Parsed Logical Plan ==
'UnresolvedHint REPARTITION, [100]
+- 'UnresolvedHint COALESCE, [500]
   +- 'UnresolvedHint REPARTITION_BY_RANGE, [3, 'c]
      +- 'Project [*]
         +- 'UnresolvedRelation [t]

== Analyzed Logical Plan ==
name: string, c: int
Repartition 100, true
+- Repartition 500, false
   +- RepartitionByExpression [c#30 ASC NULLS FIRST], 3
      +- Project [name#29, c#30]
         +- SubqueryAlias spark_catalog.default.t
            +- Relation[name#29,c#30] parquet

== Optimized Logical Plan ==
Repartition 100, true
+- Relation[name#29,c#30] parquet

== Physical Plan ==
Exchange RoundRobinPartitioning(100), false, [id=#121]
+- *(1) ColumnarToRow
   +- FileScan parquet default.t[name#29,c#30] Batched: true, DataFilters: [], Format: Parquet,
      Location: CatalogFileIndex[file:/spark/spark-warehouse/t], PartitionFilters: [],
      PushedFilters: [], ReadSchema: struct<name:string>

 

 

쑰인 힌트

쑰인 μ „λž΅μ„ μ œμ•ˆν•  수 μžˆλ‹€.

Spark3.0 μ΄μ „μ—λŠ” broadcast join hint만 μ§€μ›λ˜μ—ˆμœΌλ‚˜ 3.0λΆ€ν„°λŠ” `MERGE`, `SHUFFLE_HASH`, `SHUFFLE_REPLICATE_NL` 이 μΆ”κ°€λ˜μ—ˆλ‹€.

 

νŠΉμ • μ „λž΅μ„ μ§€μ›ν•˜μ§€ μ•ŠλŠ” 쑰인 νƒ€μž…μ΄ μžˆμ„ 수 있기 λ•Œλ¬Έμ— 힌트λ₯Ό μ œκ³΅ν•œλ‹€κ³  무쑰건 제곡된 μ „λž΅μ„ μ‚¬μš©ν•œλ‹€κ³  보μž₯받을 수 μ—†λ‹€.

 

BROADCAST

λΈŒλ‘œλ“œμΌ€μŠ€νŠΈ 쑰인을 μ œμ•ˆν•œλ‹€. 쑰인 μœ„μΉ˜ 어디에 μžˆλ“ , 크기가 더 μž‘μ€ μͺ½μ΄ λΈŒλ‘œλ“œμΊμŠ€νŠΈ λœλ‹€.

-- Join Hints for broadcast join
SELECT /*+ BROADCAST(t1) */ * FROM t1 INNER JOIN t2 ON t1.key = t2.key;
SELECT /*+ BROADCASTJOIN (t1) */ * FROM t1 left JOIN t2 ON t1.key = t2.key;
SELECT /*+ MAPJOIN(t2) */ * FROM t1 right JOIN t2 ON t1.key = t2.key;

MERGE

shuffle sort merge join을 μ œμ•ˆν•œλ‹€. 

-- Join Hints for shuffle sort merge join
SELECT /*+ SHUFFLE_MERGE(t1) */ * FROM t1 INNER JOIN t2 ON t1.key = t2.key;
SELECT /*+ MERGEJOIN(t2) */ * FROM t1 INNER JOIN t2 ON t1.key = t2.key;
SELECT /*+ MERGE(t1) */ * FROM t1 INNER JOIN t2 ON t1.key = t2.key;

SHUFFLE_HASH

shuffle hash join을 μ‚¬μš©ν•˜λ„λ‘ μ œμ•ˆν•œλ‹€. λ§Œμ•½ shuffle hash hintλ₯Ό μ œκ³΅λ°›λŠ”λ‹€λ©΄ 더 μž‘μ€ μͺ½μ€ build side둜 μ§€μ •ν•œλ‹€.

-- Join Hints for shuffle hash join
SELECT /*+ SHUFFLE_HASH(t1) */ * FROM t1 INNER JOIN t2 ON t1.key = t2.key;

SHUFFLE_REPLICATE_NL

shuffle-and-replicate nested loop join을 μ‚¬μš©ν•˜λ„λ‘ μ œμ•ˆν•œλ‹€.

-- Join Hints for shuffle-and-replicate nested loop join
SELECT /*+ SHUFFLE_REPLICATE_NL(t1) */ * FROM t1 INNER JOIN t2 ON t1.key = t2.key;

 

 

μ•„λž˜μ™€ 같이 쑰인의 μ–‘μͺ½μ— μ„œλ‘œ λ‹€λ₯Έ νžŒνŠΈκ°€ μ§€μ •λœλ‹€λ©΄ μ•„λž˜μ˜ μš°μ„ μˆœμœ„λ‘œ 힌트λ₯Ό μ„€μ •ν•œλ‹€.

`BROADCAST` > `MERGE` > `SHUFFLE_HASH` > SHUFFLE_REPLICATE_NL`

SELECT /*+ BROADCAST(t1), MERGE(t1, t2) */ * FROM t1 INNER JOIN t2 ON t1.key = t2.key;

μœ„μ˜ μ˜ˆμ‹œμ—μ„œλŠ” BROADCAST 쑰인을 μ‚¬μš©ν•˜κ³  μ•„λž˜μ™€ 같이 μ›Œλ‹ 둜그λ₯Ό λ°œμƒμ‹œν‚¨λ‹€.

org.apache.spark.sql.catalyst.analysis.HintErrorLogger: Hint (strategy=merge) is overridden by another hint and will not take effect.