[Spark] Spark Structured Streaming Overview

What is Spark Streaming

https://www.databricks.com/kr/glossary/what-is-spark-streaming

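Spark Streaming is an extension of the core Spark API that supports distributed stream processing.

There are two streaming engines, listed below. Spark Streaming is the RDD-based engine; it was the main streaming API through the 2.x releases and is now a legacy project that no longer receives updates.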

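Types of Spark Streaming

Spark Streaming: RDD-based micro-batch execution
Spark Structured Streaming: DataFrame-based micro-batch execution; with the introduction of a low-latency (continuous) processing mode, near-real-time processing became possible.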

Spark Structured Streaming

Programming model

https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html

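In Structured Streaming, all data arriving on the input stream during each trigger interval is appended to the `input table`.
You then define a query on the input table as if it were a static table, in order to compute the result table that will be written to the output sink.
Spark automatically converts this batch-like query into a streaming execution plan (= `incrementalization`).
It works out what state must be maintained to update the result each time records arrive.
Finally, each time the specified trigger fires, Spark checks for new data and updates the result.

Rather than re-running a batch job over the whole input, Structured Streaming treats the stream as a table that is continuously appended to and processes only the new data, which makes low-latency (near-real-time) processing possible.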

https://www.databricks.com/kr/glossary/what-is-structured-streaming
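The idea of incrementalization can be sketched in plain Python. This is not Spark code: `IncrementalAverage` and `on_trigger` are hypothetical names used only for illustration, and the query modeled here is an assumed `avg(value) GROUP BY key`. The point is that the engine keeps only the state it needs (a sum and a count per key) and folds each trigger's new rows into it, instead of recomputing over the whole input table.

```python
class IncrementalAverage:
    """Toy model of an incrementalized `avg(value) GROUP BY key` query."""

    def __init__(self):
        # State kept between triggers: per-key (sum, count), not the raw rows.
        self.state = {}

    def on_trigger(self, new_rows):
        """Fold only the rows that arrived since the last trigger into the state."""
        for key, value in new_rows:
            total, count = self.state.get(key, (0.0, 0))
            self.state[key] = (total + value, count + 1)
        # The result table after this trigger, derived from the state.
        return {key: total / count for key, (total, count) in self.state.items()}


query = IncrementalAverage()
first = query.on_trigger([("seoul", 10.0), ("seoul", 20.0)])
second = query.on_trigger([("seoul", 30.0), ("busan", 5.0)])
```

Note that the state (sum and count) is not the same thing as the result (the average); deciding what to keep is exactly the part Spark works out for you.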

3 Output Modes

complete mode

Outputs the entire state of the result table.

Taking the word count example, suppose the input stream arrives as follows:

hello		# first input stream
hello world # second input stream
hi world	# third input stream

Every batch outputs all of the previous results again, together with the new ones.

Because the entire result table has to be retained, this mode needs more resources.

It is suited to stateful data in which any row may keep changing.
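As a rough illustration, the plain-Python sketch below models what complete mode would emit for the three inputs above. It is not Spark code, and `complete_mode_outputs` is a hypothetical helper name.

```python
from collections import Counter


def complete_mode_outputs(batches):
    """Return what complete mode would emit after each micro-batch."""
    counts = Counter()
    outputs = []
    for lines in batches:
        for line in lines:
            counts.update(line.split())
        # Complete mode: emit the whole result table every time.
        outputs.append(dict(counts))
    return outputs


batches = [["hello"], ["hello world"], ["hi world"]]
outputs = complete_mode_outputs(batches)
```

The third output still contains `hello`, even though the third input did not mention it.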

update mode

Outputs only the records that changed since the previous output.

If the query contains no aggregation, it is equivalent to append mode.

hello		# first input stream
hello world # second input stream
hi world	# third input stream

Each trigger shows only the updated results.
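The same three inputs can be run through a plain-Python model of update mode. Again this is only a sketch, not Spark code, and `update_mode_outputs` is a hypothetical name.

```python
from collections import Counter


def update_mode_outputs(batches):
    """Return what update mode would emit after each micro-batch."""
    counts = Counter()
    outputs = []
    for lines in batches:
        changed = set()
        for line in lines:
            for word in line.split():
                counts[word] += 1
                changed.add(word)
        # Update mode: emit only the rows whose value changed in this batch.
        outputs.append({word: counts[word] for word in changed})
    return outputs


batches = [["hello"], ["hello world"], ["hi world"]]
outputs = update_mode_outputs(batches)
```

Here the third output omits `hello`, because its count did not change in that batch.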

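append mode

Only rows newly appended to the result table are written to the sink, at the trigger the user specified.

If the query contains no aggregation, it is equivalent to update mode.

If the query contains an aggregation, a watermark must be used. (Related notes: see the separate post.)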
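Why append mode needs a watermark for aggregations can be seen in a simplified plain-Python model. Append mode may emit a row only once, so a windowed count can be written only when it can no longer change, and the watermark is what decides that. The sketch assumes tumbling windows over integer event times and the hypothetical parameters `window` and `delay`; real Spark has additional rules about when the watermark advances, so treat this as an approximation.

```python
from collections import defaultdict


def append_mode_outputs(batches, window=10, delay=5):
    """Emit each tumbling-window count once, after the watermark passes its end."""
    counts = defaultdict(int)  # window start -> count (open windows only)
    max_event_time = 0
    watermark = 0
    outputs = []
    for event_times in batches:
        for t in event_times:
            start = (t // window) * window
            if start + window <= watermark:
                continue  # too late: this window was already finalized
            counts[start] += 1
            max_event_time = max(max_event_time, t)
        # The watermark trails the largest event time seen so far.
        watermark = max(watermark, max_event_time - delay)
        closed = sorted(s for s in counts if s + window <= watermark)
        # Finalized windows are emitted exactly once and their state is dropped.
        outputs.append([(s, s + window, counts.pop(s)) for s in closed])
    return outputs


outputs = append_mode_outputs([[1, 4, 8], [12, 16], [3, 27]])
```

In this run the event with time 3 in the last batch is dropped, because its window was already emitted.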

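Input Sources

The following built-in input sources are available.

See the documentation for the detailed options of each source.

File Source

Reads files written to a directory as a data stream. Files are processed in order of file modification time.

Supported file formats: text, csv, json, orc, parquet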

Kafka Source

Reads data from Kafka. Compatible with Kafka broker versions 0.10.0 and higher.

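Socket Source (for testing)

Reads UTF8 text data from a socket connection. It does not provide end-to-end fault-tolerance guarantees.

Rate Source (for testing)

Generates data at the specified number of rows per second. Each output row contains a value and a timestamp (the time the message was dispatched).

Rate Per Micro-Batch source (for testing)

Generates data at the specified number of rows per micro-batch.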

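5 Trigger Types

unspecified (default behavior)

If no trigger option is set on the stream write, Spark tries to process the next set of records as soon as the current micro-batch completes. In micro-batch mode, incoming records are grouped into small batches and processed periodically.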

https://medium.com/@sdjemails/spark-trigger-options-cd90e3cf6166

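# without any trigger
# (assumes wordCounts is a streaming aggregation, hence outputMode "complete")
query = wordCounts \
        .writeStream \
        .outputMode("complete") \
        .format("console") \
        .option("checkpointLocation", "some_path") \
        .start()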

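One-time micro-batch (deprecated)

Processes the data once and then stops the stream. Once the stream is created, all pending records are processed in a single micro-batch, after which the stream terminates.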

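# trigger once
# (assumes wordCounts is a streaming aggregation, hence outputMode "complete")
query = wordCounts \
        .writeStream \
        .outputMode("complete") \
        .format("console") \
        .trigger(once=True) \
        .option("checkpointLocation", "some_path") \
        .start()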

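This is typically used when the cluster is not kept running all the time but is brought up and down periodically.

In that case, the query processes everything that is pending while the cluster is up and then terminates.

Available-now micro-batch

Similar to the one-time micro-batch trigger: the query processes all available data and then stops on its own.

The difference is that it processes the data in multiple micro-batches based on the source options (`maxFilesPerTrigger`, etc.), which improves query scalability.

It guarantees that all data available at the time of execution is processed before termination, regardless of how many batches were left over from the previous run.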

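# available-now
# (assumes wordCounts is a streaming aggregation, hence outputMode "complete")
query = wordCounts \
        .writeStream \
        .outputMode("complete") \
        .format("console") \
        .trigger(availableNow=True) \
        .option("checkpointLocation", "some_path") \
        .start()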
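The difference between the one-time and available-now triggers comes down to how the backlog is split into micro-batches. The plain-Python sketch below models that split for a file source; `plan_batches` is a hypothetical helper, not a Spark API, and passing `None` stands in for the one-time trigger, which handles the backlog as a single batch.

```python
def plan_batches(pending_files, max_files_per_trigger=None):
    """Split the backlog into micro-batches; None models the one-time trigger."""
    if max_files_per_trigger is None:
        # One-time trigger: the whole backlog goes into a single micro-batch.
        return [list(pending_files)] if pending_files else []
    # Available-now: respect the per-trigger limit, so several smaller batches.
    return [
        pending_files[i:i + max_files_per_trigger]
        for i in range(0, len(pending_files), max_files_per_trigger)
    ]


backlog = ["f1.json", "f2.json", "f3.json", "f4.json", "f5.json"]
once_plan = plan_batches(backlog)
available_now_plan = plan_batches(backlog, max_files_per_trigger=2)
```

Both plans cover the same five files; the available-now plan simply bounds the size of each batch, which is what keeps a large backlog from overwhelming a single micro-batch.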

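Fixed interval micro-batch

The query runs in micro-batch mode, with micro-batches kicked off at the interval specified by the user.

If the previous micro-batch completes within the interval, the engine waits until the interval is over before starting the next one.
If the previous micro-batch does not complete within the interval, the next micro-batch starts as soon as the previous one finishes.
If no new data is available, no micro-batch is started.

This is the most widely used and recommended option.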

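# processing time
# (assumes wordCounts is a streaming aggregation, hence outputMode "complete")
query = wordCounts \
        .writeStream \
        .outputMode("complete") \
        .format("console") \
        .trigger(processingTime='2 seconds') \
        .option("checkpointLocation", "some_path") \
        .start()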
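The first two scheduling rules of the fixed-interval trigger can be modeled in a few lines of plain Python. `batch_start_times` is a hypothetical helper, time is measured in seconds from the start of the query, and the sketch assumes intervals are aligned to multiples of `interval` and that new data is always available (so the "no new data" rule is not modeled).

```python
def batch_start_times(durations, interval):
    """Start time of each micro-batch, given how long each one takes to run."""
    starts = []
    now = 0.0
    for duration in durations:
        starts.append(now)
        finished = now + duration
        # Next interval boundary strictly after this batch's start.
        next_boundary = (now // interval + 1) * interval
        # Finished early: wait for the boundary. Overran: start immediately.
        now = max(finished, next_boundary)
    return starts


# Batches taking 1 s, 3 s and 1 s with a 2-second interval.
starts = batch_start_times([1, 3, 1], interval=2)
```

The second batch overruns its interval, so the third one starts the moment it finishes rather than waiting for the next boundary.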

Continuous with fixed checkpoint interval (experimental)

Introduced as experimental in Spark 2.3.

With the continuous option, records are not processed in micro-batches. Instead, a long-running task is created per write stream and processes records as quickly as possible.

`Exactly Once` is not supported; only `At Least Once` is supported.

The driver creates a long-running task.
Input records are processed.
Processed records are stored in the target.
The task's progress is tracked asynchronously as offsets.
The offsets are committed to the WAL (Write Ahead Log).
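Because offsets are committed asynchronously, a restart can replay records that were already written, which is why only at-least-once is guaranteed. The plain-Python sketch below models that under simple assumptions: `run_with_restart`, `commit_every` and `crash_after` are hypothetical names, and the task crashes exactly once.

```python
def run_with_restart(records, commit_every, crash_after):
    """Process records, committing the offset every `commit_every` records.

    The task crashes once after `crash_after` records, then restarts from the
    last committed offset and runs to the end.
    """
    sink = []
    committed = 0
    # First attempt: records reach the sink before their offset is committed.
    for offset in range(crash_after):
        sink.append(records[offset])
        if (offset + 1) % commit_every == 0:
            committed = offset + 1
    # Restart: replay from the last committed offset.
    for offset in range(committed, len(records)):
        sink.append(records[offset])
    return sink


sink = run_with_restart(["a", "b", "c", "d", "e"], commit_every=2, crash_after=3)
```

No record is lost, but `c` reaches the sink twice because it was written after the last commit and before the crash.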

Spark Structured Streaming example

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("StructuredStreamingSum") \
    .config("spark.streaming.stopGracefullyOnShutdown", "true") \
    .config("spark.sql.streaming.schemaInference", "true") \
    .getOrCreate()

# Create a DataFrame representing the stream of JSON files in the streaming_sample directory.
# maxFilesPerTrigger is a file source option, so it is set on the reader, not on the session.
df = spark \
    .readStream \
    .format("json") \
    .option("path", "streaming_sample") \
    .option("maxFilesPerTrigger", 1) \
    .load()

df1 = df.select("city")

query = df1 \
            .writeStream \
            .format("json") \
            .option("path", "streaming_output") \
            .option("checkpointLocation", "checkpoint") \
            .outputMode("append") \
            .trigger(processingTime='5 seconds') \
            .start()

query.awaitTermination()

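This example reads JSON files and writes them to the output path.

The output files accumulate under the path configured as the output path.

Inside the output path there is also a `_spark_metadata` subdirectory. A log file is written there for every batch, recording information about the output files: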

v1
{"path":"file:///home/jovyan/work/streaming_output/part-00000-a2d3bcce-2148-46c1-b188-6dc20bb722f9-c000.json","size":15,"isDir":false,"modificationTime":1713250145443,"blockReplication":1,"blockSize":33554432,"action":"add"}

_spark_metadata

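When the output sink is a file sink that writes to a file system such as HDFS or S3, Structured Streaming creates a `_spark_metadata` directory. It has the following properties.

It cannot be shared between jobs; there is exactly one directory per job.
`_spark_metadata` prevents two or more Spark Structured Streaming queries from writing to the same location.
It is created inside the output path.
It is how Spark guarantees exactly-once semantics for the file sink.

If this directory is deleted, an exception like the following is raised: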

`java.lang.IllegalStateException: /home/jovyan/work/streaming_output/_spark_metadata/0 doesn't exist when compacting batch <batchNumber>`
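To inspect which files a batch committed, the batch log shown earlier can be parsed with the standard library alone. This is a minimal sketch that assumes the layout seen in the sample (a version line followed by one JSON object per line); `parse_sink_log` is a hypothetical helper, not part of Spark.

```python
import json


def parse_sink_log(text):
    """Parse one `_spark_metadata` batch file: a version line, then JSON entries."""
    lines = [line for line in text.splitlines() if line.strip()]
    version = lines[0]
    entries = [json.loads(line) for line in lines[1:]]
    return version, entries


sample = """v1
{"path":"file:///home/jovyan/work/streaming_output/part-00000-a2d3bcce-2148-46c1-b188-6dc20bb722f9-c000.json","size":15,"isDir":false,"modificationTime":1713250145443,"blockReplication":1,"blockSize":33554432,"action":"add"}
"""

version, entries = parse_sink_log(sample)
# Paths of the files this batch added to the output directory.
added_files = [entry["path"] for entry in entries if entry["action"] == "add"]
```

Reading the log this way is safe; modifying or deleting it is what leads to the exception above.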

Sinks for two or more queries

file_writer = concat_df \
                .writeStream \
                .queryName("transformed json") \
                .format("json") \
                .outputMode("append") \
                .option("path", "transformed") \
                .option("checkpointLocation", "chk/json") \
                .start()

kafka_writer = output_df \
                .writeStream \
                .queryName("transformed kafka") \
                .format("kafka") \
                .option("kafka.bootstrap.servers", "kafka:9092") \
                .option("checkpointLocation", "chk/kafka") \
                .option("topic", "transformed") \
                .outputMode("append") \
                .start()


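# Both queries are already running once .start() has returned above,
# so the driver only has to block. Calling .start() again on them would fail.

# case 1. block on one specific query
kafka_writer.awaitTermination()

# case 2. use the built-in function (returns when any active query terminates);
# `spark` is the SparkSession
spark.streams.awaitAnyTermination()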

When a single application has two or more output sinks:

Each query must use a different checkpoint location.
To keep both queries running, use either case 1 or case 2. Prefer the built-in function where possible.

References

https://vanducng.dev/2020/10/10/Getting-started-with-spark-structure-streaming/#Anatomy-of-structured-streaming

https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html

https://medium.com/@sdjemails/spark-trigger-options-cd90e3cf6166

https://dev.to/sukumaar/what-is-sparkmetadata-directory-in-spark-structured-streaming--3i42