๐Ÿฅ

[Spark] Spark Structured Streaming - Fault Tolerance ๋ณธ๋ฌธ

๋ฐ์ดํ„ฐ/Spark

[Spark] Spark Structured Streaming - Fault Tolerance

2024. 4. 4. 00:25

 

Background

Because real-time stream processing runs around the clock on a continuous input stream, failures can occur for a variety of reasons.

  • The input stream cannot be processed by the code as written (malformed data)
  • System/cluster failures

Definition of Fault Tolerance in Spark Streaming

์ŠคํŒŒํฌ์˜ ๋ชฉํ‘œ๋Š” end-to-end Exactly Once ๋ณด์žฅ์ด๋‹ค.

 

(Note)

  1. At Most Once: each record is processed once or not at all.
  2. At Least Once: each record is processed one or more times. This is stronger than At Most Once because it guarantees no data is lost, but duplicates are possible.
  3. Exactly Once: each record is processed exactly once. No data is lost and nothing is processed more than once. This is the strongest guarantee.

 

์•„๋ž˜์™€ ๊ฐ™์€ ์กฐ๊ฑด์—์„œ๋Š” spark structured streamng์€ ์–ด๋–ค ์กฐ๊ฑด์—์„œ๋„ End-to-End Exactly Once๋ฅผ ๋ณด์ •ํ•  ์ˆ˜ ์žˆ๋‹ค.

replayable sources

Every streaming source is assumed to have offsets for tracking the read position within the stream.

streaming ์—”์ง„์€ Checkpoint์™€ WAL(write-ahead logs)๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ฐ ํŠธ๋ฆฌ๊ทธ์—์„œ ์ฒ˜๋ฆฌ๋˜๋Š” ๋ฐ์ดํ„ฐ์˜ offset ๋ฒ”์œ„๋ฅผ ๊ธฐ๋กํ•œ๋‹ค.

 

idempotent sinks

Streaming sinks are designed to be idempotent so that reprocessing can be handled safely.
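As an illustrative sketch of the idea (not code from the docs): foreachBatch passes a batch_id that stays the same when a micro-batch is replayed, so writes keyed by it overwrite instead of duplicating. All paths here are hypothetical.

def write_batch(batch_df, batch_id):
    # batch_id is deterministic across replays of the same micro-batch,
    # so a replay overwrites the same directory instead of appending duplicates
    batch_df.write.mode("overwrite").parquet(f"/output/batch_id={batch_id}")

aggDF \
    .writeStream \
    .outputMode("update") \
    .foreachBatch(write_batch) \
    .option("checkpointLocation", "/chk/idempotent") \
    .start()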

 

This is well summarized in the official documentation.

 

Checkpoint

์•„๋ž˜์™€ ๊ฐ™์ด checkpoint ์œ„์น˜๋ฅผ ์ง€์ •ํ•ด ์ค„ ์ˆ˜ ์žˆ๋‹ค.

aggDF \
    .writeStream \
    .outputMode("complete") \
    .option("checkpointLocation", "path/to/some/dir") \
    .format("memory") \
    .start()
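If the query is later restarted with the same checkpointLocation, it resumes from the last recorded offsets instead of starting over. Note that, per the Spark documentation, only limited changes to the query (for example, some sink options) are allowed between restarts from the same checkpoint.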

 

Checkpoints are used to handle the following two failure scenarios:

  • Driver process failures
  • Stateful transformation failures: transformations in which processing each micro-batch depends on previous batches of data. If the state is not persisted, the entire chain of dependent state may have to be recomputed.

์ฒดํฌํฌ์ธํŠธ๋Š” 4๊ฐ€์ง€ ์œ ํ˜•์˜ ๋ฐ์ดํ„ฐ๋ฅผ ์ €์žฅํ•œ๋‹ค.

  • sources
  • offsets
  • commits
  • metadata

Except for metadata, these files are written separately for each micro-batch.

(Figure: contents of the commits directory for a streaming query that has been triggered twice)
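The directory layout looks roughly like this (a sketch for a file-source query that has completed two micro-batches; the state directory appears only for stateful queries):

<checkpointLocation>/
├── metadata
├── sources/0/0
├── offsets/0
├── offsets/1
├── commits/0
├── commits/1
└── state/...   (stateful queries only)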

 

sources

์ŠคํŠธ๋ฆฌ๋ฐ ์ฟผ๋ฆฌ์— ์‚ฌ์šฉ๋˜๋Š” ๋‹ค์–‘ํ•œ ์†Œ์Šค์— ๋Œ€ํ•œ ์ •๋ณด๊ฐ€ ํฌํ•จ๋˜์–ด ์žˆ๋‹ค.

 

v1
{"path":"file:///home/jovyan/work/streaming_sample/impression_click.json","timestamp":1712156292816,"batchId":0}
{"path":"file:///home/jovyan/work/streaming_sample/impression_click_leftouter.json","timestamp":1712156292879,"batchId":0}
{"path":"file:///home/jovyan/work/streaming_sample/login_event_sample.json","timestamp":1712156292936,"batchId":0}
{"path":"file:///home/jovyan/work/streaming_sample/price_sample.json","timestamp":1712156292988,"batchId":0}
{"path":"file:///home/jovyan/work/streaming_sample/sample.json","timestamp":1712156293055,"batchId":0}
{"path":"file:///home/jovyan/work/streaming_sample/time_window_sample.json","timestamp":1712156293111,"batchId":0}

offsets

Stores the offsets describing the data to be processed in a given micro-batch execution. This file is created before the batch is physically executed.

Example:

v1
{"batchWatermarkMs":0,"batchTimestampMs":1713249204941,"conf":{"spark.sql.streaming.stateStore.providerClass":"org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider","spark.sql.streaming.join.stateFormatVersion":"2","spark.sql.streaming.stateStore.compression.codec":"lz4","spark.sql.streaming.stateStore.rocksdb.formatVersion":"5","spark.sql.streaming.statefulOperator.useStrictDistribution":"true","spark.sql.streaming.flatMapGroupsWithState.stateFormatVersion":"2","spark.sql.streaming.multipleWatermarkPolicy":"min","spark.sql.streaming.aggregation.stateFormatVersion":"2","spark.sql.shuffle.partitions":"200"}}
{"logOffset":0}

commits

Contains the offsets that were fully processed, separated per batch. Each file is a kind of marker that also records the watermark information used by the micro-batch.

Because offsets are written before a batch is physically executed and commits are written after a batch has been successfully processed, when a failure occurs and the job is resubmitted with the same checkpoint location, Spark knows exactly which batch to resume from.
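A minimal sketch (not Spark's actual implementation) of that resume decision, reading file names from a local checkpoint directory and assuming at least one batch has been planned:

import os

def batch_to_run(checkpoint_dir):
    # batch ids appear as plain numeric file names under offsets/ and commits/
    offsets = {int(f) for f in os.listdir(os.path.join(checkpoint_dir, "offsets")) if f.isdigit()}
    commits = {int(f) for f in os.listdir(os.path.join(checkpoint_dir, "commits")) if f.isdigit()}
    pending = offsets - commits
    # a planned-but-uncommitted batch is re-run with the same offsets;
    # otherwise the next batch id starts fresh
    return min(pending) if pending else max(offsets) + 1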

Example:

v1
{"nextBatchWatermarkMs":0}

metadata

์ŠคํŠธ๋ฆฌ๋ฐ ์ฟผ๋ฆฌ์˜ ID ๊ฐ’์ด ์ €์žฅ๋˜์–ด ์žˆ๋‹ค. ์ด ๊ฐ’์€ ์–ดํ”Œ๋ฆฌ์ผ€์ด์…˜ ์‹คํ–‰ ์ค‘ ๋ฐ”๋€Œ์ง€ ์•Š๋Š”๋‹ค.

Example:

{"id":"a4f94e97-8c2e-48a1-a2cc-e7aa4c230364"}

 

https://www.waitingforcode.com/apache-spark-structured-streaming/checkpoint-storage-structured-streaming/read

  1. The data to process is recorded in the offset log before the query executes.
  2. For queries that depend on past batches, the state store is loaded.
  3. The query is executed.
  4. After the query finishes, the results are committed to the state store.
  5. The watermark information is recorded.
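A small end-to-end sketch that exercises all five steps (the schema, paths, and window sizes are illustrative assumptions):

from pyspark.sql import SparkSession
from pyspark.sql.functions import window

spark = SparkSession.builder.getOrCreate()

# hypothetical event schema and input directory
events = spark.readStream \
    .schema("ts timestamp, user string") \
    .json("/data/events")

counts = events \
    .withWatermark("ts", "10 minutes") \
    .groupBy(window("ts", "5 minutes"), "user") \
    .count()

counts.writeStream \
    .outputMode("update") \
    .option("checkpointLocation", "/chk/events") \
    .format("console") \
    .start()

Each trigger writes the planned offsets first (step 1), loads and commits the windowed counts in the state store (steps 2-4), and records the new watermark in the commit log (step 5).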

 

Write-ahead Logs

(๊ณผ๊ฑฐ์—) Driver ํ”„๋กœ์„ธ์Šค ์˜ค๋ฅ˜ ๋ณต๊ตฌ ์‹œ์— ์‚ฌ์šฉ๋˜์—ˆ๋‹ค.

structured streaming ์€ WAL ์—์„œ ๋ฐ์ดํ„ฐ๋ฅผ ๋ณต์‚ฌํ•˜๊ณ  ์บ์‹œํ•ด์„œ ์‚ฌ์šฉํ–ˆ๋‹ค. (state)

However, with the change below, it now appears to have been absorbed into the checkpoint mechanism.

  • Changes due to "Retrieve Less"
    With the introduction of "retrieve less", the way Structured Streaming operates was improved: instead of copying the entire data into the WAL, only the offsets are stored.

 

 

์•„๋ž˜ ์ฃผ์„์—์„œ๋„ offsets ๊ด€๋ จ ๋ณ€์ˆ˜์—์„œ write-ahead-log๋ฅผ ์–ธ๊ธ‰ํ•˜๊ณ  ์žˆ๋Š” ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค.

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala#L232C7-L232C17

/**
   * A write-ahead-log that records the offsets that are present in each batch. In order to ensure
   * that a given batch will always consist of the same data, we write to this log *before* any
   * processing is done.  Thus, the Nth record in this log indicated data that is currently being
   * processed and the N-1th entry indicates which offsets have been durably committed to the sink.
   */
  val offsetLog = new OffsetSeqLog(sparkSession, checkpointFile("offsets"))

  /**
   * A log that records the batch ids that have completed. This is used to check if a batch was
   * fully processed, and its output was committed to the sink, hence no need to process it again.
   * This is used (for instance) during restart, to help identify which batch to run next.
   */
  val commitLog = new CommitLog(sparkSession, checkpointFile("commits"))

 

References

https://spark.apache.org/docs/2.4.8/streaming-programming-guide.html#fault-tolerance-semantics

https://www.waitingforcode.com/apache-spark-structured-streaming/checkpoint-storage-structured-streaming/read

https://community.databricks.com/t5/data-engineering/wal-for-structured-streaming/td-p/63727