This post originally appeared on VoltDB.com in December, 2014.
Evolving and Simplifying the Lambda
Architecture
Introduction
The Lambda Architecture defines a robust
framework for ingesting streams of Fast Data while providing efficient
real-time and historical analytics. In Lambda, immutable data flows in one
direction: into the system. The architecture’s main goal is to execute
OLAP-type processing faster - in essence,
reduce
columnar analytics from every couple of seconds to 100ms or so, without
actually enabling interesting new applications like real time application of
user segments/scoring. As Lambda was conceived, it wasn’t designed to transact
and make per-event decisions on the Fast Data, nor to be responsive to the
events coming in, as they arrive.
What is
Lambda?
The Lambda Architecture is a new Big Data
architecture designed to ingest, process and query both fresh and historical
(batch) data in a single data architecture. In his book “Big Data - Principles and best practices of scalable
realtime data systems”, Nathan Marz introduces the Lambda Architecture and
states that:
“The Lambda Architecture.. provides a general purpose approach to
implementing an arbitrary function on an arbitrary dataset and having the
function return its results with low latency.”
Nathan further defines the system as having
both a batch (historical) layer, as well as a speed layer:
“The Lambda Architecture solves the problem of computing arbitrary
functions
on arbitrary data in realtime by decomposing the problem into three
layers: the
batch layer, the serving layer, and the speed layer. “
The batch layer is usually a “data lake”
system like Hadoop, though it could also be an OLAP data warehouse like Vertica
or Netezza. This historical archive is used to hold all of the data ever
collected. The batch layer supports batch query; batch processing is used to
generate analytics, either predefined or ad hoc.
The speed layer is defined as a combination of
queuing, streaming and operational data stores. In the Lambda Architecture, the
speed layer is similar to the batch layer in that it computes similar analytics
- except that it computes those analytics in real-time on only the most recent
data. The analytics the batch layer calculates, for example, may be based on
data one hour old. It is the speed layer’s responsibility to calculate
real-time analytics based on fast moving data - data that is zero to one hour
old.
As you can see, if you combine the analytics
produced by the batch layer as well as the speed layer, you have a complete
view of the analytics across all data, fresh and historical. The third layer of Lambda, the serving layer,
is responsible for serving up results, combined from both the speed and batch
layer.
To summarize, Lambda defines a Big Data
architecture that allows arbitrary queries and computations on both fast moving
data as well as historical data.
“Typical”
Lambda Applications
The Lambda Architecture is an emerging
paradigm in Big Data computing. As such, new Lambda-based applications are
emerging seemingly weekly. However, one of the more common use cases of
Lambda-based applications is log ingestion and accompanying analytics. “Logs”
in this context could be general log collection, website clickstream logging,
VPN access logs, or the popular Twitter tweet stream collection.
Log messages often are created at a high
velocity. They are immutable and usually are time-tagged or time ordered. This
is the "fast data" that is captured and harvested - it is this data
that is ingested by both Lambda’s speed layer and batch layer, usually in
parallel, by way of message queues and streaming systems (like Kafka and
Storm). The ingestion of each log message does not require a response to the
entity that delivered the data - it is a one-way data pipeline.
A log message’s final resting place is the
data lake, where batch metrics are [re]computed. The fast layer computes similar results for
the most recent "window", staying ahead of the Hadoop/batch
layer.
Analytics at the speed and batch layer can be
predefined or ad hoc. Should new analytics be desired, Lambda suggests that you
can re-run the entire data set, from the data lake or from the original log
files, to recompute the new metrics. For example, analytics for website click
logs could be counting page hits and page popularity. For tweet streams they
could be computing trending topics.
VoltDB
and the Lambda Architecture
VoltDB, a clustered, in-memory, relational
database. It supports fast ingest of
data, real-time ad hoc analytics and rapid export of data to downstream systems
like Hadoop and OLAP offerings. It fits
squarely and solidly into the Lambda Architecture’s speed layer. Like popular
streaming systems, VoltDB is horizontally scalable, highly available, and fault
tolerant — all while sustaining transactional ingestion speeds of hundreds of
thousands to millions of events per second. In the standard Lambda
Architecture, the inclusion of VoltDB greatly simplifies the speed layer by
replacing both the streaming and the operational data store portions of the
speed layer.
In the outlined Lambda Architecture, a queuing
system like Kafka would feed both VoltDB and Hadoop, or VoltDB directly, which
would then in turn immediately export the event to the data lake.
Future-proofing
Lambda
As defined today, the Lambda Architecture is
very focused on fast data collection and read-only queries on both fast and
historical data. In Lambda, data is immutable - it never changes. Data comes
into the system in streams and metrics, both historical and real time, and is
calculated and maintained. External systems make use of the Lambda-based
environment to query the computed analytics. These analytics are then used to
alert, should metric thresholds be crossed, or harvested, for example in the
case of Twitter trending topics.
When considering improvements to the Lambda
Architecture, what if you could react, per event, to the incoming data stream?
In essence, you’d have the ability to act on the incoming feed, in addition to
performing real-time analytics.
Here at VoltDB we have a lot of experience
building streaming applications for Fast Data, another term for the
Lambda-defined “speed layer”. Most of our customers are building Fast Data
applications, providing us with unique insight into the Lambda speed layer. These systems ingest events from log files,
the Internet of Things (IoT), user clickstreams, online game play, and
financial applications. While some of these applications passively ingest
events and provide real-time analytics and alerting on the data streams (in typical
Lambda style), many of these applications have begun interacting with the
stream, adding per-event decisioning and transactions in addition to real-time
analytics.
Additionally, another characteristic of these
systems is that the speed layer analytics can differ from the batch layer
analytics. Often the data lake is used to mine intelligence via exploratory
queries. This intelligence, when
identified, is then fed to the speed layer as input to the per-event decisions.
Scott Jarr describes this fully interactive Lambda-like application evolution
here, http://voltdb.com/blog/youd-better-do-fast-data-right/
in his Fast Data blog.
In this diagram you can see the additions to
the architecture:
1.
Data arrives at a high rate and is
ingested. It is immediately exported to
the Batch Layer, the Data Lake.
2.
Historical intelligence can be
mined from the Data Lake and the aggregate “intelligence” can be delivered to
the Speed Layer for per-event real-time decisioning (for instance, to determine
which ad to display for a segmented/categorized web browser/user).
3.
Fast Data is either passively
ingested, or a response can be computed by the new decisioning layer, using
both real-time data as well as historical “mined” intelligence.
For a working code example of the simplified
speed layer, reference the
Conclusion
The Lambda Architecture is a powerful Big Data
analytics framework that serves queries from both fast and historical
data. However, the architecture emerged
from a need to execute OLAP-type processing faster, without considering a new class of
applications that require per-event decisioning: applications like real time
application of user segments/scoring, fraud detection, denial of service
attacks, policy and billing, etc. In its current form, Lambda is
limited: immutable data flows in one direction, into the system, for analytics
harvesting.
Adding VoltDB, a linearly scalable in-memory
relational database, into the Lambda Architecture greatly simplifies the speed
layer by reducing the number of components needed.
Lambda’s shortcoming is the inability to build
responsive, event-oriented applications.
In addition to simplifying the architecture,
VoltDB provides future-proof functionality to Lambda, specifically, the ability
to execute transactions and per-event decisions on Fast Data as it arrives.
Rather than a one-way streaming system feeding events into the speed layer,
adding an ingestion engine like VoltDB provides developers with the ability to
place applications in front of the event stream to capture value the moment the
event arrives, rather than capturing value at some point after the event
arrived on an aggregate-basis.
VoltDB improves the Lambda architecture
by:
●
Reducing the number of moving
pieces, the products and components, needed.
Specifically, major components of the speed layer can be replaced by a
single component, VoltDB. Further,
VoltDB can be used as a data store for the serving layer.
●
Enables the ability for the
application to make per-event decisioning and transactional behavior, without
re-implementing the architecture once deployed.
●
Providing the traditional
relational database interaction model, with ad hoc SQL capabilities, on Fast
Data. Applications can use standard SQL
providing agility to their query needs without requiring complex programming
logic.
●
Providing access to standard
analytics tooling, such as Tableau, MicroStrategy, and Actuate BIRT, on top of
Fast Data.