๐Ÿฅ

[Spark] RDD vs Dataframe ๋ณธ๋ฌธ

๋ฐ์ดํ„ฐ/Spark

[Spark] RDD vs Dataframe

•8• 2024. 3. 24. 02:18

์ŠคํŒŒํฌ์—์„œ ๋ฐ์ดํ„ฐ ๊ตฌ์กฐ ์ข…๋ฅ˜์—๋Š” ์„ธ ๊ฐœ๊ฐ€ ์žˆ๋‹ค.

  • RDD: spark 1.0 ๋„์ž…
  • Dataframe: spark 1.3 ๋„์ž…
  • Dataset: spark 1.6 ๋„์ž…

์ฐธ๊ณ ๋กœ spark 2.0๋ถ€ํ„ฐ dataframe api๋Š” datasets๊ณผ ํ†ตํ•ฉ๋˜์—ˆ๋‹ค. (์Šค์นผ๋ผ์™€ ์ž๋ฐ”์—์„œ)

` Unifying DataFrames and Datasets in Scala/Java: Starting in Spark 2.0, DataFrame is just a type alias for Dataset of Row. `

์ถœ์ฒ˜: https://www.databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html

 

1. RDD๋ž€?

Reillient Distributed Data๋กœ ํ’€์–ด์“ฐ์—ฌ์ง€๋Š” RDD๋Š” ํšŒ๋ณต๋ ฅ ์žˆ๋Š”/๋ถˆ๋ณ€์˜ ๋ถ„์‚ฐ ๋ฐ์ดํ„ฐ ์ •๋„๋กœ ํ•ด์„๋  ์ˆ˜ ์žˆ๋‹ค.

์ŠคํŒŒํฌ์˜ ๊ฐ€์žฅ ๊ธฐ๋ณธ์ ์ธ ๋ฐ์ดํ„ฐ ๋‹จ์œ„์ด๋‹ค.

 

๋ถˆ๋ณ€์˜ ํŠน์„ฑ

RDD๋Š” ์ด๋ฆ„(Reillient)์—์„œ๋„ ์•Œ ์ˆ˜ ์žˆ๋“ฏ์ด ๋ถˆ๋ณ€(Read only)์˜ ํŠน์ง•์„ ๊ฐ€์ง„๋‹ค.

Spark Lineage (์ถœ์ฒ˜: https://sparkbyexamples.com/spark/what-is-lineage-graph-in-spark/)

์˜ˆ๋ฅผ๋“ค์–ด ์–ด๋– ํ•œ RDD ์—ฐ์‚ฐ์ด ์œ„์™€ ๊ฐ™์€ DAG(=lineage)์˜ ์ˆœ์„œ๋กœ ์ˆ˜ํ–‰๋œ๋‹ค๊ณ  ๊ฐ€์ •ํ•˜๋ฉด,

RDD๋Š” read only ์ด๊ธฐ ๋•Œ๋ฌธ์— ํ•œ ๋‹จ๊ณ„์˜ ์—ฐ์‚ฐ์„ ๊ฑฐ์น  ๋•Œ๋งˆ๋‹ค ๊ธฐ์กด source RDD๋ฅผ modifyํ•˜๋Š” ๊ฒƒ์ด ์•„๋‹Œ ์ƒˆ๋กœ์šด RDD๋ฅผ ์ƒ์„ฑํ•œ๋‹ค. 

์ด์™€ ๊ฐ™์€ ํŠน์„ฑ๋•๋ถ„์— ์ค‘๊ฐ„ ๋‹จ๊ณ„์—์„œ ๋…ธ๋“œ ๊ฐ„ ์ปค๋„ฅ์…˜ ๋“ฑ์˜ ์ด์Šˆ๋กœ RDD๊ฐ€ ์†Œ์‹ค๋˜์—ˆ๋”๋ผ๋„, DAG๋ฅผ ํ†ตํ•ด ๋‹ค์‹œ ๋ณต๊ธฐํ•˜๊ณ  ๋ณต๊ตฌ๋  ์ˆ˜ ์žˆ๋‹ค.(=fault tolerant)

 

Lazy Evaluation (๋Š๊ธ‹ํ•œ ์—ฐ์‚ฐ)

RDD์˜ ์—ฐ์‚ฐ์—๋Š” ํฌ๊ฒŒ ๋‘ ๊ฐ€์ง€์˜ ์ข…๋ฅ˜๊ฐ€ ์žˆ๋Š”๋ฐ, Transformation ๊ณผ Action์ด๋‹ค.

Transformation๊ณผ Action ์—ฐ์‚ฐ์˜ ์ข…๋ฅ˜๋Š” ์•„๋ž˜์˜ ๋ฌธ์„œ์— ์ž˜ ๋‚˜์—ด๋˜์–ด ์žˆ๋‹ค.

https://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations

 

Transformation

์ƒˆ๋กœ์šด RDD๋ฅผ ๋ฆฌํ„ดํ•œ๋‹ค. ๋‹จ์ˆœํžˆ ์–ด๋– ํ•œ ์—ฐ์‚ฐ์ธ์ง€๋ฅผ ๊ธฐ๋กํ•  ๋ฟ์ด๊ณ , ์‹ค์ œ๋กœ๋Š” ์•„๋ฌด๋Ÿฐ ์—ฐ์‚ฐ๋„ ์ˆ˜ํ–‰๋˜์ง€ ์•Š๋Š”๋‹ค.

Action

์‹ค์ œ ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ๋ฅผ ์ˆ˜ํ–‰ํ•˜๋Š” ์—ฐ์‚ฐ์ž์ด๋‹ค. ๊ธฐ๋ก๋œ ๋ชจ๋“  Transformation ์ž‘์—…์„ ์ˆ˜ํ–‰ํ•œ๋‹ค. ์‹ค์ œ ๋ฐ์ดํ„ฐ๋ฅผ ๋ฆฌํ„ดํ•ด์ค€๋‹ค.

 

 

Transformation ์—ฐ์‚ฐ์ด ์•„๋ฌด๋ฆฌ ๋งŽ์ด ์žˆ์–ด๋„ ์‹ค์ œ Action ์—ฐ์‚ฐ์ž๊ฐ€ ์žˆ๊ธฐ ์ „๊นŒ์ง€๋Š” ์•„๋ฌด๋Ÿฐ ์—ฐ์‚ฐ๋„ ์ˆ˜ํ–‰๋˜์ง€ ์•Š๋Š”๋ฐ, ์ด๋Ÿฌํ•œ ๋™์ž‘ ๋ฐฉ์‹์„ Lazy evaluation์œผ๋กœ ๋ถ€๋ฅด๊ณ , RDD ๋™์ž‘์˜ ํ•ต์‹ฌ์ด ๋œ๋‹ค.

์œ„์˜ Spark Lineage๋ฅผ ์˜ˆ๋กœ ๋“ค๋ฉด Filter/Aggregate ์—์„œ๋Š” ์‹ค์ œ ์–ด๋– ํ•œ ์—ฐ์‚ฐ๋„ ์ˆ˜ํ–‰๋˜์ง€ ์•Š๊ณ  ์ผ์ข…์˜ ๋ฉ”ํƒ€๋ฐ์ดํ„ฐ๋งŒ ๊ธฐ๋ก๋˜๋‹ค๊ฐ€, ๋งˆ์ง€๋ง‰ Output์„ ๋งŒ๋“ค๊ธฐ ์œ„ํ•œ Action ์—ฐ์‚ฐ์ž๊ฐ€ ํ˜ธ์ถœ๋  ๋•Œ ์‹ค์ œ ์—ฐ์‚ฐ์ด ์ˆ˜ํ–‰๋˜๋Š” ๊ฒƒ์ด๋‹ค.

 

2. Dataframe & Datasets

Dataframe์€ extended RDD๋กœ, RDD์˜ ์ƒ์œ„ ๊ฐœ๋…์— ์†ํ•œ๋‹ค.

RDD๋Š” non-structured data ํ˜•ํƒœ๋ฅผ ๋‹ค๋ฃจ์ง€๋งŒ, dataframe์€ ํ…Œ์ด๋ธ”๊ณผ ๊ฐ™์€ ํ˜•ํƒœ์˜ ์Šคํ‚ค๋งˆ๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ๋Š” structured data ๊ตฌ์กฐ์ด๋‹ค

 

Dataset์€ extension of Dataframe์œผ๋กœ DataFrame = row ํƒ€์ž…์˜ Dataset ( Dataset[Row] ) ์ด๋‹ค. 

python์„ ์‚ฌ์šฉํ•ด์„œ dataset์˜ ๊ฐœ๋…์ด ์ข€ ์–ด๋ ค์› ๋Š”๋ฐ, Dataframe์—์„œ Dataset์œผ๋กœ ๋ณ€ํ™˜ํ•˜๋Š” ๋ฐฉ์‹์„ ์‚ดํŽด๋ณด๋‹ˆ ๋Š๋‚Œ์ด ์™”๋‹ค.

case class Person(name: String, age: Int) // ๋ฐ์ดํ„ฐ ํƒ€์ž… ์ •์˜
val Persondf = spark.read.... // ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„ ๋กœ๋“œ
val PersonDS = df.as[Person] // Dataset์œผ๋กœ ๋ณ€ํ™˜

Dataframe์„ ์ƒ์„ฑ์‹œ์—๋Š” Person ๊ฐ์ฒด๋ฅผ Rowํƒ€์ž…์„ ์ง๋ ฌํ™”ํ•œ ๋ฐ”์ด๋„ˆ๋ฆฌ ๊ตฌ์กฐ๋กœ ๋ณ€ํ™˜ํ•œ๋‹ค.

Dataset ์ƒ์„ฑ์‹œ์—๋Š” ์ธ์ฝ”๋”(encoder)๊ฐ€ ๋Ÿฐํƒ€์ž„ ์‹œ์— Person ๊ฐ์ฒด๋ฅผ ๋ฐ”์ด๋„ˆ๋ฆฌ ๊ตฌ์กฐ๋กœ ์ƒ์„ฑํ•œ๋‹ค. DatasetAPI ์‚ฌ์šฉ ์‹œ์— ์ŠคํŒŒํฌ๋Š” Dataset์— ์ ‘๊ทผํ•  ๋•Œ Row ํฌ๋งท์ด ์•„๋‹Œ ์‚ฌ์šฉ์ž ์ •์˜ ๋ฐ์ดํ„ฐํƒ€์ž…์„ ๋ฐ˜ํ™˜ํ•œ๋‹ค.

  • ์žฅ์ : ์‚ฌ์šฉ์ž์—๊ฒŒ ์œ ์—ฐ์„ฑ ์ œ๊ณต
  • ๋‹จ์ : ์‚ฌ์šฉ์ž ์ •์˜ ๋ฐ์ดํ„ฐ ํƒ€์ž…์œผ๋กœ์˜ ๋ณ€ํ™˜์ž‘์—…์ด ์žˆ๊ธฐ ๋•Œ๋ฌธ์— Dataframe ๋Œ€๋น„ ๋Š๋ฆฐ ์„ฑ๋Šฅ

Dataset์€ JVM์„ ์ด์šฉํ•˜๋Š” ์Šค์นผ๋ผ์™€ ์ž๋ฐ”์—์„œ๋งŒ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋‹ค.

datasets์˜ type-safety

์œ„ ํ‘œ์—์„œ ๋ณผ ์ˆ˜ ์žˆ๋‹ค์‹œํ”ผ, Dataset์€ JVM ํƒ€์ž…์˜ ๊ฐ์ฒด๋กœ ๋ฐ์ดํ„ฐ๊ฐ€ ํ‘œํ˜„๋˜๊ธฐ ๋•Œ๋ฌธ์— type-safety๋ฅผ ์ œ๊ณตํ•˜๊ณ , ์ปดํŒŒ์ผ ์‹œ์ ์—์„œ ํƒ€์ž…์„ ๊ฒ€์‚ฌํ•  ์ˆ˜ ์žˆ์–ด analysis error๋ฅผ ๋ฐœ๊ฒฌํ•  ์ˆ˜ ์žˆ๋‹ค.

 

 

3. ์‚ฌ์šฉ ์‹œ๊ธฐ

์ผ๋ฐ˜์ ์œผ๋กœ ๊ฐ API ๋ณ„๋กœ ์žฅ๋‹จ์ ์ด ๋ช…ํ™•ํ•ด์„œ ์‚ฌ์šฉ๋˜์–ด์•ผ ํ•˜๋Š” ์ผ€์ด์Šค๋ฅผ ์•„๋ž˜์™€ ๊ฐ™์ด ์ •๋ฆฌํ•ด๋ณผ ์ˆ˜ ์žˆ์„ ๊ฒƒ ๊ฐ™๋‹ค.

RDD

Low-Level API๋ฅผ ์‚ฌ์šฉํ•ด์•ผ ํ•  ๋•Œ

์›์ฒœ๋ฐ์ดํ„ฐ๊ฐ€ unstructured ํ˜•ํƒœ์ผ ๋•Œ

์Šคํ‚ค๋งˆ๋ฅผ ์‹ ๊ฒฝ์“ฐ๊ณ  ์‹ถ์ง€ ์•Š์„ ๋•Œ

์ตœ์ ํ™”๋ฅผ ์‹ ๊ฒฝ์“ฐ์ง€ ์•Š์„ ๋•Œ

๋ฐ์ดํ„ฐ๋ฅผ ํ•จ์ˆ˜ํ˜• ํ”„๋กœ๊ทธ๋ž˜๋ฐ์œผ๋กœ ์กฐ์ž‘ํ•˜๊ณ  ์‹ถ์„ ๋•Œ (udf ๋“ฑ)

 

Dataframe

๋ฐ์ดํ„ฐ ์ถ”์ƒํ™”๊ฐ€ ํ•„์š”ํ•  ๋•Œ, ๋„๋ฉ”์ธ๋ณ„ API๋ฅผ ์‚ฌ์šฉํ•˜๊ณ  ์‹ถ์„ ๋•Œ

๋‹จ์ˆœํ™”๋œ API๋ฅผ ์‚ฌ์šฉํ•˜๊ณ  ์‹ถ์„ ๋•Œ, ๊ฐ„๊ฒฐํ•œ ์ฝ”๋“œ๋ฅผ ์ž‘์„ฑํ•˜๊ณ  ์‹ถ์„ ๋•Œ

 

Dataset

dataframe์ด ํ•„์š”ํ•œ ์ƒํ™ฉ์—์„œ ์ถ”๊ฐ€๋กœ ํƒ€์ž… ์•ˆ์ •์„ฑ์ด ํ•„์š”ํ•  ๋•Œ

dataframe ๋Œ€๋น„ ๋Š๋ฆฐ ์„ฑ๋Šฅ์„ ๊ฐ์ˆ˜ํ•  ์ˆ˜ ์žˆ์„ ๋•Œ

 

์ฐธ๊ณ : https://www.databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html

 

 

[์ถ”๊ฐ€] Spark RDD/Dataframe API Category

์œ„์—์„œ๋„ ์„ค๋ช…ํ–ˆ๋‹ค์‹œํ”ผ, ์ŠคํŒŒํฌ์˜ API๋Š” ํฌ๊ฒŒ transformations์™€ actions๋กœ ๋‚˜๋‰œ๋‹ค.

1. Transformation

๋ฐ์ดํ„ฐ๋ฅผ ๋ณ€ํ™˜ํ•  ๋•Œ ์‚ฌ์šฉํ•˜๋Š” API. action ์ „๊นŒ์ง€๋Š” ๋ชจ๋“  transformation์€ ์‹คํ–‰๋˜์ง€ ์•Š๋Š”๋‹ค.

1-1. narrow dependency

๋ฐ์ดํ„ฐ๊ฐ€ ์—ฌ๋Ÿฌ๊ฐœ์˜ ํŒŒํ‹ฐ์…˜์œผ๋กœ ๋‚˜๋‰˜์–ด์ ธ์„œ ๋ณ‘๋ ฌ์ ์œผ๋กœ ์ˆ˜ํ–‰๋  ์ˆ˜ ์žˆ๋Š” API.

๋ถ„์‚ฐ์ฒ˜๋ฆฌ๊ฐ€ ๊ฐ€๋Šฅํ•˜๊ธฐ ๋•Œ๋ฌธ์— ์ƒ๋Œ€์ ์œผ๋กœ ๋‚ฎ์€ ๋น„์šฉ์˜ transformation์ด๋‹ค.

RDD์—๋Š” `map()`, `mapPartition()`, `flatMap()`, `filter()`, `union()` ์ด ์žˆ๊ณ ,

Dataframe์—๋Š” `select()`, `filter()`, `withColumn()`, `drop()`, `where()` ๋“ฑ์ด ์žˆ๋‹ค.

 

1-2. wide dependency

๋ชจ๋“  ๋ฐ์ดํ„ฐ๋ฅผ ๋ชจ์•„์„œ ์ฒ˜๋ฆฌํ•ด์•ผ ํ•˜๋Š” API.

ํ•œ๊บผ๋ฒˆ์— ๋ชจ์•„์„œ ์ฒ˜๋ฆฌ๋˜๊ธฐ ๋•Œ๋ฌธ์— narrow ์— ๋น„ํ•ด์„œ๋Š” ๋†’์€ ๋น„์šฉ์˜ transformation์ด๋‹ค.

RDD์—๋Š” `groupByKey()`, `aggregateByKey()`, `sortByKey()`, `reduceByKey()`, `aggregate()`, `join()`, `repartitions()` ๋“ฑ์ด ์žˆ๊ณ ,

Dataframe์—๋Š” `groupBy()`, `join()`, `agg()`, `cube()`, `rollup()`, `repartitions()` ๋“ฑ์ด ์žˆ๋‹ค.

 

2. Action

์‹ค์ œ๋กœ job์ด ์‹œ์ž‘๋˜๋Š” ์—ฐ์‚ฐ์ด๋‹ค. action ๋‹จ๊ณ„์—์„œ ์‹ค์ œ๋กœ ๋ชจ๋“  ์—ฐ์‚ฐ์ด ์ˆ˜ํ–‰๋˜๊ธฐ ๋•Œ๋ฌธ์— staging์ด๋ผ๊ณ  ๋ณผ ์ˆ˜ ์žˆ๋‹ค. 

์ŠคํŒŒํฌ๋Š” staging ๋‹จ์œ„๋กœ logical plan์„ ์ž‘์„ฑํ•˜๊ธฐ ๋•Œ๋ฌธ์— action์ด ์–ด๋–ค ์ˆœ๊ฐ„์— ์ˆ˜ํ–‰๋˜๋ƒ์— ๋”ฐ๋ผ optimization ๋ฐฉ์‹์ด ๋‹ฌ๋ผ์งˆ ์ˆ˜ ์žˆ๋‹ค.

RDD์—๋Š” `collect()`, `count()`, `first()`, `take()`, `reduce()`, `saveAsTextFile()` ๋“ฑ์ด ์žˆ๊ณ ,

Dataframe์—๋Š” `show()`, `head()`, `count()`, `collect()` ๋“ฑ์ด ์žˆ๋‹ค.