Spark’s Structured Streaming offers a powerful platform to process high-volume data streams with low latency.
In Azure, we use it to analyze data coming from Event Hubs and Kafka, for instance.
As projects mature and data processing becomes more complex, unit tests become useful to prevent regressions. This requires mocking the inputs and outputs of the
Spark queries to isolate them from the network and remove the need for an external Spark cluster. This post goes through the various pieces needed to make that work.
The full code is available on GitHub.
It is written in Java, using IntelliJ as the IDE, Maven as the package manager, and JUnit as the test framework.
Let’s test a simple stream enrichment query. It takes a stream of events as input
and adds human-friendly names to the events by joining with a reference table.
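To make the query concrete, here is a minimal sketch of what such an enrichment could look like. The class name and the column names (deviceId, temperature, deviceName) are assumptions made up for illustration, not taken from the actual project, and an inner join is only one possible choice.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public final class EnrichmentQuery {
    private EnrichmentQuery() {}

    /**
     * Joins the events with the reference table on a shared key column.
     * Hypothetical schemas:
     *   events:    deviceId INT, temperature DOUBLE
     *   reference: deviceId INT, deviceName STRING
     * The join is an inner equi-join, so events without a matching
     * reference row are dropped.
     */
    public static Dataset<Row> enrich(Dataset<Row> events, Dataset<Row> reference) {
        return events.join(reference, "deviceId");
    }
}
```

Keeping the query in a function that only takes DataFrames and returns a DataFrame is what makes it testable: the test can pass in mock inputs, and production code can pass in the real streams.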
Besides running on remote clusters, Spark also supports running on the local machine. Let’s use that to create a Spark session in the tests.
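A local session could be created along these lines. This is a sketch: the class name, the application name, the thread count, and the two extra settings are assumptions chosen to keep tests light, not necessarily what the project uses.

```java
import org.apache.spark.sql.SparkSession;

public final class LocalSpark {
    private LocalSpark() {}

    /** Creates (or reuses) a Spark session that runs inside the test JVM. */
    public static SparkSession create() {
        return SparkSession.builder()
                // "local[2]" runs Spark in-process with two worker threads.
                .master("local[2]")
                .appName("unit-tests")
                // Assumed tweaks: skip the web UI and keep shuffles small.
                .config("spark.ui.enabled", "false")
                .config("spark.sql.shuffle.partitions", "1")
                .getOrCreate();
    }
}
```

Since getOrCreate() returns an existing session when there is one, the same session can be shared across the tests of a class and stopped once at the end.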
The unit test follows the standard Arrange/Act/Assert pattern, which here requires creating two mock inputs (one streaming table, one reference static table) and one mock output.
MemorySink comes in handy to mock the output, collecting all the output data
into an Output table in memory that is then queried to obtain a List<Row>.
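The output side can be sketched as follows, assuming the result of the query is a streaming DataFrame. The helper class and method names are hypothetical; the "memory" format, queryName, and processAllAvailable are part of the Structured Streaming API.

```java
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public final class MemoryOutput {
    private MemoryOutput() {}

    /**
     * Runs the streaming query against the memory sink, waits until all
     * currently available input has been processed, and returns the rows
     * collected in the in-memory "Output" table.
     */
    public static List<Row> collect(SparkSession spark, Dataset<Row> streamingResult)
            throws Exception {
        StreamingQuery query = streamingResult.writeStream()
                .format("memory")      // memory sink, intended for testing
                .queryName("Output")   // name of the in-memory table
                .outputMode("append")
                .start();
        try {
            // Blocks until the source has no more unprocessed data.
            query.processAllAvailable();
            return spark.sql("SELECT * FROM Output").collectAsList();
        } finally {
            query.stop();
        }
    }
}
```

The method declares throws Exception because, depending on the Spark version, start() and stop() may declare checked exceptions when called from Java.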
The input static table can be mocked in a couple of different ways. The first one is to read the data from an external CSV file.
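Reading the reference table from a CSV file might look like the sketch below. The class name, the column names, and the presence of a header row are assumptions; passing the schema as a DDL string requires Spark 2.3 or later.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public final class StaticCsvInput {
    private StaticCsvInput() {}

    /** Reads a static reference table from a CSV file with a header row. */
    public static Dataset<Row> read(SparkSession spark, String csvPath) {
        return spark.read()
                .option("header", "true")
                // Hypothetical schema; an explicit schema avoids type inference.
                .schema("deviceId INT, deviceName STRING")
                .csv(csvPath);
    }
}
```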
The second one is to use an in-memory List<Row> whose values are part of the test code itself, and to wrap this list in a DataFrame (a.k.a. Dataset<Row>).
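A sketch of the in-memory variant, with made-up class name, columns, and sample values:

```java
import java.util.Arrays;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public final class StaticMemoryInput {
    private StaticMemoryInput() {}

    /** Builds a static reference table from rows defined in the test itself. */
    public static Dataset<Row> create(SparkSession spark) {
        // Hypothetical schema: a key column and a human-friendly name.
        StructType schema = new StructType()
                .add("deviceId", DataTypes.IntegerType)
                .add("deviceName", DataTypes.StringType);

        // Illustrative sample values, in the same order as the schema fields.
        List<Row> rows = Arrays.asList(
                RowFactory.create(1, "thermostat"),
                RowFactory.create(2, "doorbell"));

        return spark.createDataFrame(rows, schema);
    }
}
```

Compared with the CSV file, this keeps the test data next to the assertions, at the cost of spelling out the schema in code.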
The input streaming table can be mocked in the same two ways. The first option is to read the data from a folder containing CSV files.
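The folder-based option can be sketched like this. The class name and the columns are assumptions; note that streaming file sources need an explicit schema by default, unlike batch reads, which can infer one.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public final class StreamingCsvInput {
    private StreamingCsvInput() {}

    /**
     * Creates a streaming DataFrame over a folder of CSV files.
     * Each file found in the folder is treated as newly arrived data.
     */
    public static Dataset<Row> read(SparkSession spark, String folder) {
        return spark.readStream()
                .option("header", "true")
                // Hypothetical event schema.
                .schema("deviceId INT, temperature DOUBLE")
                .csv(folder);
    }
}
```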
The second one is to use an in-memory List<String> that is wrapped in a MemoryStream
and converted to a DataFrame. Ideally this would be a List<Row>, but setting up an Encoder did not look trivial, so the code relies instead on a list of CSV strings that are parsed and cast
using a SQL SELECT statement.
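The parse-and-cast step could look like the sketch below. It assumes the DataFrame of strings exposes its single column as value (which is what Spark's String encoder produces) and that each line holds a hypothetical "deviceId,temperature" pair; the class, view, and column names are made up. Constructing the MemoryStream itself is left out because its Scala constructor is awkward to call from Java and may differ between Spark versions.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public final class CsvLineParser {
    private CsvLineParser() {}

    /**
     * Turns a DataFrame with one string column named "value" (one CSV line
     * per row) into typed columns using a SQL SELECT statement.
     * Works the same for static and streaming DataFrames.
     */
    public static Dataset<Row> parse(Dataset<Row> lines) {
        // Expose the raw lines to SQL under a temporary view name.
        lines.createOrReplaceTempView("RawLines");
        return lines.sparkSession().sql(
                "SELECT CAST(split(value, ',')[0] AS INT) AS deviceId, "
              + "CAST(split(value, ',')[1] AS DOUBLE) AS temperature "
              + "FROM RawLines");
    }
}
```

A plain split on commas is enough for simple test data, but it does not handle quoted fields or embedded commas.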
After that, it is just a matter of running the test and getting it to green!