DataFusion Streaming Demo (2024-07-03)

At Denormalized, we've been working on adding support for streaming usecases to Apache DataFusion. This gist provides instructions for testing the current state.

Prerequisites

Docker + docker compose
Rust/Cargo installed

Configure Kafka

Clone our docker compose files for running kafka. If you already have a different kafka cluster running, you can skip this step.

git clone [email protected]:probably-nothing-labs/kafka-monitoring-stack-docker-compose.git
cd kafka-monitoring-stack-docker-compose
docker compose -f denormalized-benchmark-cluster.yml up

This will spin up a 3 node kafka cluster in docker along with an instance of kafka-ui that can be viewed at http://localhost:8080/

Generate some sample data to the kafka cluster

We wrote a small rust tool that will send fake traffic to the locally run rust program.

git clone [email protected]:probably-nothing-labs/benchmarking.git
cd benchmarking
cargo run -- -d 60 -a 1000

This will start a simulation for 60s and will create two topics: driver-imu-data and trips which should have around ~58k and ~500 messages accordingly. There are several other knobs that can be tuned to change the amount of traffic which can be viewed with cargo run -- --help. There are also several other knobs that are not exposes but can be changed in the src/main.rs file

Run a Streaming Datafusion job

To do this, you'll need to clone down our fork of the Datafusion project here: https://github.com/probably-nothing-labs/arrow-datafusion. Note the code is in active development so a working commit hash is specificed.

git clone [email protected]:probably-nothing-labs/arrow-datafusion.git
cd arrow-datafusion

# This commit hash should 
git checkout 4d042a421895238d26df94ab3c10b9520c4f5fa2

# Start an example streaming aggregation
cargo run --example kafka_rideshare

# Start an example stream join
cargo run --example streaming_join

Once everything is setup and one of the two streaming jobs is running, it is recommend to re-run the kafka data generation tool so that live data is produced. This is because watermark tracking of streaming data makes it difficult to properly aggregate older data that lives in the kafka topic.

The code for both examples lives in the examples/ folder: kafka_rideshare and stream_join

In the kafka_rideshare example, there are several other sinks you can try uncommenting including one that will sink the results back to a kafka topic

This project is still being actively developed. We hope to have checkpointing/restoration finished soon and we'll be opening a discussion in the Datafusion project to begin upstreaming these changes.

If you have any questions, use-cases, or want to get involved please reach out to us on twitter, we'd love to hear from you! @IAmMattGreen and @asc89

Enjoy!

emgeee/datafusion_streaming_instructions.md

DataFusion Streaming Demo (2024-07-03)

Prerequisites

Configure Kafka

Generate some sample data to the kafka cluster

Run a Streaming Datafusion job