๐Ÿฅ

Apache Flink ๋ณธ๋ฌธ

๋ฐ์ดํ„ฐ/ํ•˜๋‘ก

Apache Flink

•8• 2022. 2. 11. 13:19
  1. Apache Flink
    • streaming dataflow ์—”์ง„์„ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•œ ์ŠคํŠธ๋ฆฌ๋ฐ&๋ฐฐ์น˜ ํ”Œ๋žซํผ
    • streaming mode: native ๋ฐฉ์‹์ด ์•„๋‹ˆ๊ธฐ ๋•Œ๋ฌธ์— ์ŠคํŠธ๋ฆผ ์ฒ˜๋ฆฌ์— ๋Œ€ํ•ด spark ๋ณด๋‹ค low latency๋ฅผ ๋ณด์ธ๋‹ค
      * spark๋Š” micro-batch ๋ฐฉ์‹์„ ํ†ตํ•ด ์ŠคํŠธ๋ฆผ ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ
    • exactly-once ๋ณด์žฅ

 

  • low-level building block
    stateful and timely stream processing์„ ์ œ๊ณต
    process function์„ ํ†ตํ•ด datastream API์— ์ž„๋ฒ ๋””๋“œ ๋˜์–ด ์žˆ์Œ ← ์ด์ชฝ์€ ๋งŽ์€ ์œ ์ €์—๊ฒŒ๋Š” ํ•„์š”๊ฐ€ ์—†์„ ๊ฒƒ์ด๊ณ , core API๋ฅผ ๋งŽ์ด ์“ธ ๋“ฏํ•˜๋‹ค.
  • core API
    datastream API (bounded/unbounded stream) / dataset API (bounded data set)
    datastream์€ ์ŠคํŠธ๋ฆผ ๋ฐ์ดํ„ฐ์ผ ๋•Œ ์‚ฌ์šฉํ•˜๊ณ  dataset์€ ๋ฐฐ์น˜์—์„œ ์‚ฌ์šฉ๋˜๋Š” ๊ฒƒ ๊ฐ™๋‹ค.
    ๋‘˜ ๋‹ค immutable ํ•œ ์†์„ฑ (๋ณ€๊ฒฝ๋˜์ง€ ์•Š์Œ)
    transformation, join, aggregation, window, state์™€ ๊ฐ™์€ ๋ฐ์ดํ„ฐ ํ”„๋กœ์„ธ์‹ฑ์„ ์œ„ํ•œ ๊ธฐ๋Šฅ๋“ค์„ ์ œ๊ณตํ•œ๋‹ค.
  • table API
    stream์„ ํ‘œํ˜„ํ•˜๋Š” ํ…Œ์ด๋ธ”์„ ๋™์ ์œผ๋กœ ๋ณ€๊ฒฝํ•  ์ˆ˜ ์žˆ๋Š” ํ™•์žฅ๋œ ๊ด€๊ณ„ํ˜• ๋ชจ๋ธ.
    ํ…Œ์ด๋ธ”์—๋Š” ์Šคํ‚ค๋งˆ๊ฐ€ ์—ฐ๊ฒฐ๋˜์–ด ์žˆ์œผ๋ฉฐ API๋Š” select, project, join, group by, aggregate ๋“ฑ์˜ ์—ฐ์‚ฐ์ž ์ œ๊ณต
  • SQL
    ๊ฐ€์žฅ ๋†’์€ ๋‹จ๊ณ„์˜ ์ถ”์ƒํ™”์ด๋ฉฐ, Table API์™€ ์œ ์‚ฌํ•˜์ง€๋งŒ sql ์ฟผ๋ฆฌ๋กœ ํ‘œํ˜„ํ•œ๋‹ค.
    table API์— ์ •์˜๋œ ํ…Œ์ด๋ธ”์—์„œ ์ด sql ์ฟผ๋ฆฌ๊ฐ€ ์‹คํ–‰๋  ์ˆ˜ ์žˆ๋‹ค.

 

Stream dataflow

  input stream → source

  operator → transformation

  output stream → sink                    ๋กœ ํ‘œํ˜„

(์นดํ”„์นด, ํ‚ค๋„ค์‹œ์Šค ๋“ฑ ๋ฉ”์‹œ์ง€ ํ๋‚˜ ๋ถ„์‚ฐ๋กœ๊ทธ์—์„œ real-time data๋ฅผ ๊ฐ€์ ธ์˜ฌ ์ˆ˜ ์žˆ์Œ. ๊ทธ ๊ฒฝ์šฐ flink๊ฐ€ consumer)

→ source๋กœ ์ŠคํŠธ๋ฆผ ๋ฐ์ดํ„ฐ๋ฅผ ๋ฐ›์•„ ์—ฌ๋Ÿฌ transformation์œผ๋กœ ๋ฐ์ดํ„ฐ๋ฅผ ๊ฐ€๊ณตํ•˜๊ณ  sink๋กœ ๋ฐ์ดํ„ฐ๋ฅผ ์ฒ˜๋ฆฌ(์ €์žฅ)ํ•˜๋Š” streaming dataflow

* lazy evaluation

spark์™€ ๋งˆ์ฐฌ๊ฐ€์ง€๋กœ sink ๋ฉ”์†Œ๋“œ๊ฐ€ ์‹คํ–‰๋˜๊ธฐ ์ „๊นŒ์ง€๋Š” ์‹ค์ œ ์—ฐ์‚ฐ์ด ์ˆ˜ํ–‰๋˜์ง€ ์•Š๊ณ  sink๊ฐ€ ์‹คํ–‰๋˜๋Š” ์ˆœ๊ฐ„ ์ง€์ •๋œ transformation๋“ค์ด ์‹คํ–‰๋จ

 

Parallel dataflows

ํ•˜๋‚˜ ์ด์ƒ์˜ stream partition → stream

ํ•˜๋‚˜ ์ด์ƒ์˜ operator subtasks → operation

operator substasks๋Š” ๋…๋ฆฝ์ ์ด๋ฉฐ, ๋‹ค๋ฅธ ์“ฐ๋ ˆ๋“œ์—์„œ ์‹คํ–‰๋  ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์— ์—ฌ๋Ÿฌ ์“ฐ๋ ˆ๋“œ์—์„œ ๋ณ‘๋ ฌ์ ์œผ๋กœ ๋ถ„์‚ฐ ์ฒ˜๋ฆฌ๋ฅผ ํ•  ์ˆ˜ ์žˆ๋‹ค.

์œ„ ๊ทธ๋ฆผ์ฒ˜๋Ÿผ parallelism์„ ์„ค์ •ํ•  ์ˆ˜ ์žˆ์Œ → ๋ณ‘๋ ฌ๋กœ ์‹คํ–‰๋˜๋Š” tasks๋“ค์€ ๊ฐ๊ฐ์˜ ์“ฐ๋ ˆ๋“œ์—์„œ ์‹คํ–‰๋˜์–ด ์ฒ˜๋ฆฌ๋œ๋‹ค.

1:1 stream: source์™€ map()์—ฐ์‚ฐ์ž ์‚ฌ์ด์ฒ˜๋Ÿผ element์˜ ํŒŒํ‹ฐ์…˜๊ณผ ์ˆœ์„œ ์œ ์ง€

→ map()[1]์€ source[1]์— ์˜ํ•ด ์ƒ์„ฑ๋œ ๊ฒƒ๊ณผ ๋™์ผํ•œ ์ˆœ์„œ ๋™์ผํ•œ ์š”์†Œ๋กœ ๋ณด๊ฒŒ ๋จ

redistributing stream: ๋‘๋ฒˆ์งธ ํŒŒํ‹ฐ์…˜์ฒ˜๋Ÿผ ์ „ ๋‹จ๊ณ„์˜ ๋ชจ๋“  subtasks๋กœ๋ถ€ํ„ฐ ๋ฐ์ดํ„ฐ๋ฅผ ๋ฐ›์•„์™€์„œ transformation. ex) keyBy(), broadcast(), rebalance() ์ฒ˜๋Ÿผ ๋‹ค์‹œ ๋ถ„ํ• 

 

Stateful stream processing

flink์˜ operation์€ stateful ํ•  ์ˆ˜ ์žˆ๋‹ค (ํŠน์ • operation์ด statefulํ•  ์ˆ˜ ์žˆ์Œ) → ์ƒํƒœ ์ €์žฅ์ด ๊ฐ€๋Šฅ → ํ•œ ์ด๋ฒคํŠธ๊ฐ€ ์ฒ˜๋ฆฌ๋  ๋•Œ ์ด์ „์— ๋ฐœ์ƒํ•œ ๋ชจ๋“  ์ด๋ฒคํŠธ์˜ ๋ˆ„์ ํšจ๊ณผ์— ๋”ฐ๋ผ ๋‹ค๋ฅธ ๊ฒฐ๊ณผ๋ฅผ ๋‚ผ ์ˆ˜ ์žˆ๋‹ค.

stateful operator์˜ parallel instances๋Š” sharded key-vallue๋กœ ์ €์žฅ๋˜์–ด ์žˆ์Œ

์œ„ ๊ทธ๋ฆผ์—์„œ ์„ธ ๋ฒˆ์งธ operation์ด statefulํ•˜๋‹ค๊ณ  ํ•œ๋‹ค๋ฉด, key-value ํ˜•ํƒœ๋กœ ์ €์žฅ๋˜๊ธฐ ๋•Œ๋ฌธ์— ์ „ ๋‹จ๊ณ„์˜ subtasks๋“ค๊ณผ fully network shuffle์ด ๋ฐœ์ƒํ•œ๋‹ค.

 

Timely stream processing

์ŠคํŠธ๋ฆผ ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ๋ฅผ ์œ„ํ•ด์„œ๋Š” ๋ฐ์ดํ„ฐ๊ฐ€ ์ „๋‹ฌ๋˜๋Š” ์ˆœ์„œ๋ณด๋‹ค ์ด๋ฒคํŠธ๊ฐ€ ๋ฐœ์ƒํ•œ ์ˆœ์„œ์— ์ง‘์ค‘ํ•ด์•ผ ํ•จ

ex) ํŠน์ • ์ด๋ฒคํŠธ ํŒจํ„ด์„ ๊ฒ€์ƒ‰ํ•  ๋•Œ (์ง€๊ธˆ๊นŒ์ง€ ๋ฐœ์ƒํ•œ ์ด๋ฒคํŠธ ์‹œํ€€์Šค์˜ ์ €์žฅ์ด ํ•„์š”) ๋“ฑ

→ flink์—์„œ๋Š” ๋ฐ์ดํ„ฐ๋ฅผ ์ฒ˜๋ฆฌํ•˜๋Š” ์‹œ์Šคํ…œ์˜ ํด๋ฝ์ด ์•„๋‹ˆ๋ผ, ๋ฐ์ดํ„ฐ ์ŠคํŠธ๋ฆผ์— ๊ธฐ๋ก๋œ ์ด๋ฒคํŠธ timestamp๋ฅผ ์ด์šฉ

 

Distributed execution

flink์—๋Š” ๋‘ ๊ฐ€์ง€ ํ”„๋กœ์„ธ์Šค๊ฐ€ ์žˆ์Œ

  • master process (job manager)
    task์˜ ์Šค์ผ€์ฅด๋ง, ์ฒดํฌํฌ์ธํŠธ, ๋ฆฌ์ปค๋ฒ„๋ฆฌ ๋‹ด๋‹น
    worker๋“ค์„ ๊ด€๋ฆฌ
  • worker process (task manager)
    task๋ฅผ ์‹ค์ œ๋กœ ์‹คํ–‰

standalone, resource framework(yarn, mesos etc) ํ™˜๊ฒฝ์—์„œ ์‹คํ–‰ ๊ฐ€๋Šฅ

TaskManager

JVM ํ”„๋กœ์„ธ์Šค ๋‹จ์œ„๋กœ ๋™์ž‘

ํ•œ worker์—๋Š” ์—ฌ๋Ÿฌ๊ฐœ์˜ Task Slot์ด ์žˆ์œผ๋ฉฐ, Task Slot์—๋Š” ์—ฌ๋Ÿฌ ๊ฐœ์˜ ์“ฐ๋ ˆ๋“œ๊ฐ€ ์žˆ์œผ๋ฉฐ, ๊ฐ ์“ฐ๋ ˆ๋“œ๋กœ subtasks๊ฐ€ ์ˆ˜ํ–‰๋œ๋‹ค.

Time slot๋ผ๋ฆฌ๋Š” worker ๋‚ด์˜ ๋ฆฌ์†Œ์Šค๋ฅผ ๋‚˜๋ˆ„์–ด์„œ ๊ด€๋ฆฌ → slot๋ณ„๋กœ ๊ฐœ๋ณ„์ ์ธ ๋ฆฌ์†Œ์Šค๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ task๋ฅผ ์‹คํ–‰

→ slot์˜ ๊ฐœ์ˆ˜๋Š” cpu core ๊ฐœ์ˆ˜๋กœ ์ง€์ •ํ•˜๋Š” ๊ฒƒ์ด ์ข‹๋‹ค.

ํ•œ ์“ฐ๋ ˆ๋“œ์— ํ•œ Task๊ฐ€ ์‹คํ–‰๋˜์ง€๋งŒ source/map์ฒ˜๋Ÿผ operator chain์ด ๊ฐ€๋Šฅํ•œ ๊ฒฝ์šฐ ํ•œ ์“ฐ๋ ˆ๋“œ์—์„œ ์‹คํ–‰

 

Window

์ŠคํŠธ๋ฆผ ๋ฐ์ดํ„ฐ๋Š” ์‹œ์ž‘๊ณผ ๋์ด ์กด์žฌํ•˜์ง€ ์•Š๋Š” unbounded data → ์ง‘๊ณ„์™€ ๊ฐ™์€ ์—ฐ์‚ฐ์„ ์ˆ˜ํ–‰ํ•˜๊ธฐ ์œ„ํ•ด ์‹œ์ž‘๊ณผ ๋์„ ์ •ํ•ด์ฃผ์–ด์•ผ ํ•จ

window: ์‹œ์ž‘๊ณผ ๋์„ ์ผ์ •ํ•œ ๋ฃฐ์— ๋”ฐ๋ผ์„œ ์ •ํ•ด์ง„ ์—ฐ์‚ฐ์„ ์ˆ˜ํ–‰ํ•˜๊ธฐ ์œ„ํ•ด ์‚ฌ์šฉ

๋นจ๊ฐ„ ๋„ค๋ชจ ํ•œ ์นธ์ด ํ•œ window

Tumbling Window

๊ณ ์ •๋œ ์‹œ๊ฐ„ ๋‹จ์œ„๋กœ ๋ฐ์ดํ„ฐ์˜ ์ค‘๋ณต ์ฒ˜๋ฆฌ ์—†์ด window๋ฅผ ์„ค์ •ํ•˜๋Š” ๋ฐฉ์‹

window size๋ฅผ 5์ดˆ๋กœ ์„ค์ •ํ–ˆ๋‹ค๋ฉด 5์ดˆ ๊ฐ„๊ฒฉ์œผ๋กœ window๊ฐ€ ๋ถ„๋ฆฌ๋จ

Sliding Window

N์‹œ๊ฐ„ ํฌ๊ธฐ์˜ window๋ฅผ M์‹œ๊ฐ„ ์”ฉ slide ํ•ด์„œ window๋ฅผ ์ •ํ•˜๋Š” ๋ฐฉ์‹.

๋ฐ์ดํ„ฐ์˜ ์ค‘๋ณต ์ฒ˜๋ฆฌ ํ—ˆ์šฉ

M์‹œ๊ฐ„ ๋‹จ์œ„์˜ ๋ฐ์ดํ„ฐ๊ฐ€ ์ค‘๋ณต์ฒ˜๋ฆฌ๋˜๊ณ  ์žˆ๋‹ค.

Session Window

session gap์„ ์„ค์ •

์ผ์ • ๊ธฐ๊ฐ„๋™์•ˆ ๋ฐ˜์‘์ด ์—†๋Š” ๊ฒฝ์šฐ ์„ธ์„  ์‹œ์ž‘๋ถ€ํ„ฐ ๋ฐ˜์‘์ด ์—†๋Š” ์‹œ๊ฐ„๊ฐ€์ง€์˜ ๋ฐ์ดํ„ฐ๋ฅผ ํ•˜๋‚˜์˜ window size๋กœ ์ฒ˜๋ฆฌ

→ ๋ฐ˜์‘์ด ์—†๋Š” ์‹œ๊ฐ„ >= session gap์ด ๋˜๋ฉด ๋‹ค๋ฅธ window๋กœ ๋ถ„ํ• 

session gap์„ 5์ดˆ๋กœ ์„ค์ •: 5์ดˆ๋™์•ˆ ๋ฐ์ดํ„ฐ๊ฐ€ ๋“ค์–ด์˜ค์ง€ ์•Š์œผ๋ฉด ์„œ๋กœ ๋‹ค๋ฅธ window๋กœ ์ชผ๊ฐฌ

ํ•˜๋‚˜์˜ window์—์„œ ์ฒ˜๋ฆฌ๋˜๋Š” element์˜ ๊ฐœ์ˆ˜๊ฐ€ ๋งค์šฐ๋งค์šฐ ๋‹ฌ๋ผ์งˆ ์ˆ˜ ์žˆ๋‹ค.

 

Fault Tolerance

sparks๋Š” lineage๋ฅผ ์ด์šฉํ•ด ์ „์ฒด ๊ณผ์ •์„ ๋‹ค์‹œ ๊ณ„์‚ฐํ•จ์œผ๋กœ์จ ๋‚ด๊ณ ์žฅ์„ฑ ๋ณด์žฅ

flink๋Š” checkpoint๋ฅผ ์„ค์ •ํ•˜๊ณ  checkpoint ์ดํ›„๋กœ datasource์—์„œ replayํ•˜๋Š” ๋ฐฉ์‹์œผ๋กœ ๋‚ด๊ณ ์žฅ์„ฑ ๋ณด์žฅ

์ฒ˜๋ฆฌ๋˜๋Š” stream ์ค‘๊ฐ„์— checkpoint barrier๋ฅผ ๋ผ์›Œ๋„ฃ์Œ (job manager๊ฐ€ ์š”์ฒญ)

sink์— barrier๊ฐ€ ๋„์ฐฉํ•˜๋ฉด barrier ์ด์ „์˜ ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•ด ์—ฐ์‚ฐ์ด ์™„๋ฃŒ๋œ ๊ฒƒ์‘๋กœ ๊ฐ„์ฃผ,  source๋กœ ํ•ด๋‹น barrier์— ๋Œ€ํ•ด ack๋ฅผ ๋ณด๋ƒ„ ( source๋Š” ack ๋ฐ›์€ barrier ์ด์ „์˜ ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•ด์„œ ์‚ญ์ œ )

๋งŒ์•ฝ ์ŠคํŠธ๋ฆผ์ด ์ฒ˜๋ฆฌ๋˜๋Š” ์ค‘๊ฐ„์— fault๊ฐ€ ๋ฐœ์ƒํ•˜๋ฉด ack๋ฅผ ๋ฐ›์€ barrier ์ดํ›„์˜ ๋ฐ์ดํ„ฐ๋ถ€ํ„ฐ ๋‹ค์‹œ ์ฒ˜๋ฆฌ 

→ ๋ชจ๋“  ๋ ˆ์ฝ”๋“œ๋งˆ๋‹ค replayํ•˜์ง€ ์•Š๊ธฐ ๋•Œ๋ฌธ์— ๋น ๋ฅธ ์„ฑ๋Šฅ

→ exactly-once ๋ณด์žฅ