This post is aimed at those who are new to stream processing, want to build a firm understanding of its fundamentals, or are completely non-technical.
The example below, while contrived, provides real-world definitions for stateless, stateful, and parallel stream processing.
It’s 1979, and life as you know it is great. A new job. A lovely home. Plus, a new addition to the family on the way.
You’re a data analyst who works at SportsAgg, a company that aggregates historical sports data from all over the country.
After collecting the data, SportsAgg provides that data to clients all over the globe as a service.
Their service is consumed via several methods: electronic mail (email), a call-in phone service, and raven (shoutout HP).
SportsAgg receives files from all over the world via a new technology, email.
Your job is to receive messages one-by-one, print them out, and store them in an array of filing cabinets.
The above is an example of a stream. A stream is an unbounded flow of data; in this case, messages containing sports data.
One of the most critical aspects of a stream of information is how the data is consumed. There are multiple methods of consuming data, e.g. one-by-one, in batches of N size, etc.
How you receive messages can drastically change your approach to tackling a task. In this example, you receive messages (data) one-by-one, which means each message is viewed in isolation.
In the sections below, we will discuss the differences between stateless and stateful stream processing.
You start your first day and are doing great! You are printing. You are filing. You are pretty much dominating the data analyst space.
Every email you receive gets printed and stored in a filing cabinet in the order you receive it.
Life is great.
The above is an example of stateless stream processing. You receive a message, process it, and do something with it (in this case, file it). Nothing special is happening here, and you don’t have to maintain any “record” or “state” of previous messages.
This allows you to handle individual messages in isolation and significantly reduces operational overhead and complexity.
When developing data pipelines, this is the ideal scenario because you can increase parallelization without having to introduce complexity to your pipeline.
When paired with a technology such as Apache Kafka or Apache Pulsar, this is a powerful paradigm. Your consumers can scale independently and operate in isolation without an intermediate “state store”.
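A minimal sketch of this stateless pattern in Python (the function and variable names here are illustrative, not part of any real consumer API):

```python
# Stateless processing: each message is handled in isolation,
# with no memory of previous messages and no state store.

def print_message(message: str) -> str:
    # Stand-in for "print out the email".
    return f"PRINTED: {message}"

def file_message(cabinet: list, printed: str) -> None:
    # Stand-in for "store it in the filing cabinet", in arrival order.
    cabinet.append(printed)

def process(cabinet: list, message: str) -> None:
    # The handler only ever sees the current message.
    file_message(cabinet, print_message(message))

cabinet = []
for email in ["player_a/team_x: 10 pts", "player_b/team_y: 7 pts"]:
    process(cabinet, email)

print(cabinet)  # messages filed in the order they arrived
```

Because `process` needs nothing but the current message, you could run many copies of it side by side without coordination, which is exactly why stateless pipelines parallelize so easily.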
Your boss comes to you after a few weeks and gives you a new requirement. From now on, you can only store one file per player/team combination. You also have to aggregate all previously filed data per player/team combination so you have a prior record of the data.
After cleaning and aggregating all the old files, you begin your work.
Here is your new process:
Print out file
Walk to appropriate filing cabinet
Find the player/team folder
If a previous file exists, aggregate its data with the new file
Store the new file
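The steps above can be sketched as follows (a toy model, assuming “aggregation” means a simple sum of scores; all names are illustrative):

```python
# Stateful processing: the cabinet is modeled as a dict keyed by
# (player, team), so there is exactly one "file" per combination.

cabinet: dict[tuple[str, str], int] = {}

def process_email(player: str, team: str, score: int) -> None:
    key = (player, team)          # find the player/team folder
    if key in cabinet:            # a previous file exists
        score += cabinet[key]     # aggregate with the prior data
    cabinet[key] = score          # store the single, updated file

process_email("player_a", "team_x", 10)
process_email("player_a", "team_x", 7)
print(cabinet[("player_a", "team_x")])  # 17
```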
After doing this for a while, you realize you can optimize your workflow.
You decide you can save time if you don’t have to make a trip to the filing cabinet every time you receive an email.
Instead, you decide to keep a ledger of each email you receive and only process the latest email for each player/team combination once a day.
At the end of the day, you print out the files and then file them accordingly.
Your new process looks like this:
Check the ledger to see if you have an entry for that player/team combination
If you do, aggregate the data and overwrite the previous entry
If not, save that email into the ledger
At the end of the day, print and store the latest emails from your ledger.
This is an example of stateful stream processing. You still receive and process messages one-by-one, but it is stateful because you have to know whether you have already received anything for a given player/team combination.
Your ledger, also referred to as a “state store”, is something that holds a record of what you have received and when.
In today’s world of modern computing this is usually handled by an in-memory data store for fast lookups; however, a state store can also be an external dependency such as a traditional database or read-optimized cache.
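The ledger workflow above can be sketched with an in-memory dict standing in for the state store (a hedged sketch; the names and the sum-based aggregation are illustrative assumptions):

```python
# The ledger (state store) absorbs every email as it arrives;
# filing only happens once, at the end of the day.

ledger: dict[tuple[str, str], int] = {}
cabinet: dict[tuple[str, str], int] = {}

def receive_email(player: str, team: str, score: int) -> None:
    key = (player, team)
    if key in ledger:
        ledger[key] += score   # aggregate and overwrite the entry
    else:
        ledger[key] = score    # first entry for this combination

def end_of_day() -> None:
    # One trip to the filing cabinet per key instead of one per email.
    for key, total in ledger.items():
        cabinet[key] = cabinet.get(key, 0) + total
    ledger.clear()

receive_email("player_a", "team_x", 10)
receive_email("player_a", "team_x", 7)
end_of_day()
print(cabinet[("player_a", "team_x")])  # 17
```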
The act of reacting to the last event per day and key (in this example, the player/team combination) is known as the “falling-edge”.
The term “falling-edge” comes from circuit design and refers to reacting to the last event within a given window.
If we were to act on the first event per key instead, we would refer to that as the “rising-edge”. This distinction plays a large role when designing distributed messaging systems and defining windowing operations.
Think of a “window” as the time slot a repairman gives you when setting up an appointment: “We will be there between 1 and 3 PM.” No specific time, but you expect them to arrive within that timeframe.
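The two triggering styles can be sketched side by side (a toy illustration; real windowing engines are far more involved):

```python
# Given all the events that fell inside one window, the rising edge
# reacts to the first event per key, the falling edge to the last.

def rising_edge(events: list) -> dict:
    seen = {}
    for key, value in events:
        if key not in seen:   # only the first event per key counts
            seen[key] = value
    return seen

def falling_edge(events: list) -> dict:
    latest = {}
    for key, value in events:
        latest[key] = value   # later events overwrite earlier ones
    return latest

window = [("player_a/team_x", 10), ("player_a/team_x", 7)]
print(rising_edge(window))   # {'player_a/team_x': 10}
print(falling_edge(window))  # {'player_a/team_x': 7}
```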
You have been doing great work, but in order to keep up with client demand, SportsAgg needs to hire more Data Analysts.
Your boss decides to hire two more data analysts.
In order to keep the same workflow, you have to devise a way to maintain your ledger across the new team members.
To handle this, you and your coworkers devise a plan to break up the player/team combinations into 3 distinct groups.
Employee A → Handles group 1
Employee B → Handles group 2
Employee C → Handles group 3
This allows each of you to maintain your own individual ledger without having to synchronize it across the entire team.
Your workflow stays the same, but you have drastically increased your throughput.
The above is an example of parallel stream processing. Parallel stream processing allows applications to segment responsibilities for processing messages across multiple instances (threads or processes) of an application.
In order to do this, you need some sort of routing capability: a system for making sure that messages from group 1 (in the example above) make it to Employee A.
Streaming platforms such as Apache Kafka and Apache Pulsar handle this by splitting the storage of messages into partitions. Think of a partition as an ordered log of messages.
Those partitions are then assigned to consumers via a consumer group (quite literally a group of consumers).
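A toy sketch of key-based routing in that spirit (this is not Kafka’s or Pulsar’s actual partitioner, just the general hash-modulo idea; the names are illustrative):

```python
import hashlib

# Three partitions, one per data analyst in the story.
NUM_PARTITIONS = 3
consumers = ["Employee A", "Employee B", "Employee C"]

def partition_for(key: str) -> int:
    # hashlib is used instead of Python's built-in hash() so the
    # assignment is stable across runs and processes.
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest, "big") % NUM_PARTITIONS

key = "player_a/team_x"
owner = consumers[partition_for(key)]
# The same key always routes to the same partition, and therefore
# to the same consumer's ledger:
assert partition_for(key) == partition_for(key)
```

Routing by key is what makes the per-employee ledgers safe: every message for a given player/team combination always lands with the same employee.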
This is a deliberately contrived example to help explain stream processing. There is a lot of minutiae in stream processing that this article leaves out, and that is on purpose.
The further we move away from the core abstractions and principles of a process or technology, the more convoluted and difficult it becomes to get started and to explain.
If you are unfamiliar with stream processing, I hope this gave you a brief explanation of the ideas behind stream processing and the differences between stateless and stateful streams.
If you have used stream processing before, I hope this article affirms what you knew already and gives you a platform to discuss streams with colleagues who are:
New to stream processing
Non-technical team members (e.g. BA, QA, Directors, Executives, etc.)
Check out the next section if you are interested in furthering your reading on stream processing.