This post originally appeared on VoltDB.com in January 2015. It has been lightly updated here.
Three Fast Data Application Patterns
The focus of many developers and architects in the past few years has been on Big Data, specifically mining historical intelligence from the Data Lake (usually a Hadoop stack containing terabytes to petabytes of data). Now, product architects are asking how they can use this business intelligence for competitive advantage. Application developers have come to see the value of acting in real time on streams of fast data; by applying OLAP reporting wisdom, they can realize the benefits of both fast data and Big Data. As a result, a new set of application patterns has emerged, designed to capture value from fast-moving streaming data before it reaches Hadoop.
At VoltDB we call this new breed of applications “fast data” applications. Their goal is not just to push data into Hadoop as quickly as possible, but to capture real-time value from the data the moment it arrives.
Because traditional databases historically haven’t been fast enough, developers have been forced to go to great effort to build fast data applications: complex multi-tier systems involving a handful of tools and, typically, a dozen or more servers. However, a new class of database technology, especially NewSQL offerings, has changed this equation.
If you have a relational database that is fast enough, highly available, and able to scale horizontally, building fast data applications becomes less esoteric and much more manageable. Three new application patterns have emerged as the dataflows necessary to implement real-time applications. These patterns, enabled by new, fast database technology, are:
1. Real-time Analytics
2. Real-time Decision Engine
3. Fast Data Pipeline
Let’s take a look at the characteristics of each of these fast data application patterns and at how a NewSQL database like VoltDB can simplify building them.
Real-time Analytics
This application pattern processes streaming data from one or many sources and performs real-time analytic computations on that fast data. Today this pattern is often combined with Big Data analytics, producing analytics on both fast data and Big Data.
In this pattern, the value captured from the fast data stream is primarily real-time analytics: the streaming engine tracks counts and metrics derived from each message, and applications tap these stored results to display dashboard state and, possibly, to raise real-time alerts.
Important features in this application pattern
include:
- Pre-built connectors are required to easily feed the streams of data into the database.
- The database needs to compute real-time analytics by pre-computing materialized views on a per-message basis.
- Standard reporting tools such as Tableau or MicroStrategy can use SQL to query real-time state and compute ad hoc analytics.
- The database needs the ability to perform analytics and aggregations over time windows of data, such as by the second, minute, hour, day, etc.
- The database needs the ability to discard, or age out, data after it is processed. VoltDB accomplishes this with “capped tables”: a table constraint limits the number of rows, and when that limit is exceeded, delete statements execute automatically to remove older rows. (A sketch combining a materialized view with a capped table follows this list.)
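As a rough illustration of this pattern, the following Java fragment uses VoltDB’s client API to poll a per-minute materialized view for a dashboard. The schema sketched in the comment, and names such as events and events_per_minute, are hypothetical examples for this post, not VoltDB samples.

```java
// Dashboard-side sketch for the real-time analytics pattern.
// It assumes a (hypothetical) schema along these lines:
//   CREATE TABLE events (device_id VARCHAR(64), ts TIMESTAMP NOT NULL)
//     LIMIT PARTITION ROWS 1000000
//     EXECUTE (DELETE FROM events ORDER BY ts LIMIT 1000);   -- capped table
//   CREATE VIEW events_per_minute (minute_ts, event_count)   -- materialized view
//     AS SELECT TRUNCATE(MINUTE, ts), COUNT(*) FROM events
//        GROUP BY TRUNCATE(MINUTE, ts);
import org.voltdb.VoltTable;
import org.voltdb.client.Client;
import org.voltdb.client.ClientFactory;
import org.voltdb.client.ClientResponse;

public class DashboardPoller {
    public static void main(String[] args) throws Exception {
        Client client = ClientFactory.createClient();
        client.createConnection("localhost");

        // VoltDB maintains the view incrementally as each message arrives,
        // so this ad hoc query reads pre-computed per-minute aggregates.
        ClientResponse resp = client.callProcedure("@AdHoc",
                "SELECT minute_ts, event_count FROM events_per_minute " +
                "ORDER BY minute_ts DESC LIMIT 60;");
        VoltTable rows = resp.getResults()[0];
        while (rows.advanceRow()) {
            System.out.printf("%s -> %d events%n",
                    rows.getTimestampAsTimestamp("minute_ts"),
                    rows.getLong("event_count"));
        }
        client.close();
    }
}
```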
Real-time Decision Engine
This application pattern processes inbound requests from many clients, perhaps tens of thousands simultaneously, and returns a low-latency response or decision to the client. This is a classic OLTP application pattern, but one running at scale against high-velocity incoming data.
Scaling to support high-velocity, per-event transactions enables applications that evaluate campaign, policy, authorization, and other business logic to respond in real time, in milliseconds. In this pattern, the business value is providing “smart,” or calculated, responses to high-velocity requests. Applications that make use of this model today include digital ad-tech campaign balance processing and ad choice (based on precomputed user segmentation or other mined heuristics), smart electrical grids, and telecom billing and policy decisioning. In all cases, the database processes incoming requests at exceptionally high rates; each incoming request runs a transaction to calculate a decision and return a response to the calling application.
For example, an incoming telecom request (a new Call Data Record) may need to answer, “Does this user have enough balance to process this call?” A digital ad platform may ask, “Which of my ads should I serve to this mobile device, based on available campaign balance?”
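A minimal sketch of such a balance check, written as a VoltDB Java stored procedure, might look like the following. The table and column names (subscribers, balance) are hypothetical, and the decision logic is deliberately simplified; this is not a shipped VoltDB sample.

```java
// Decision-engine sketch: authorize a call only if the subscriber's
// balance covers it, deducting the cost in the same ACID transaction.
// For single-partition speed, the subscribers table (and this procedure)
// would be partitioned on subscriber_id in the schema.
import org.voltdb.SQLStmt;
import org.voltdb.VoltProcedure;
import org.voltdb.VoltTable;

public class AuthorizeCall extends VoltProcedure {
    public final SQLStmt getBalance = new SQLStmt(
            "SELECT balance FROM subscribers WHERE subscriber_id = ?;");
    public final SQLStmt deduct = new SQLStmt(
            "UPDATE subscribers SET balance = balance - ? " +
            "WHERE subscriber_id = ?;");

    // Returns 1 if the call is authorized (and the cost deducted), 0 if not.
    public long run(long subscriberId, long callCostCents) {
        voltQueueSQL(getBalance, subscriberId);
        VoltTable result = voltExecuteSQL()[0];
        if (result.getRowCount() == 0) {
            return 0; // unknown subscriber: decline
        }
        result.advanceRow();
        if (result.getLong("balance") < callCostCents) {
            return 0; // insufficient balance: decline
        }
        voltQueueSQL(deduct, callCostCents, subscriberId);
        voltExecuteSQL(true); // final batch; the transaction commits on return
        return 1;
    }
}
```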
Important features in this application pattern
include:
- ACID transactions. In this pattern, the decision engine (the database) updates state in a consistent and durable manner. The state is often a balance of some kind (usually monetary), and the responses (decisions) are based on data that must be correct, thus consistent.
- Low, predictable response latency. These applications need a response in real time; often the budget for database processing is in the single-digit-millisecond range, 99.999% of the time.
- Durable data. In this pattern, the database is often a system of record, at least for a time window. Should the system go down, data must be durable and recoverable.
- The ability to use standard tooling, i.e., SQL, to query and compute real-time state. Real-time dashboards capturing the state of the system (balances, transactions per second), as well as the ability to query state ad hoc, are important to this application pattern.
For a real-time decisioning (high-velocity transactions) code example demonstrating an ‘American Idol’-like voting system with per-vote validation, see
https://github.com/VoltDB/voltdb/tree/master/examples/voter.
Fast Data Pipeline
This application pattern processes streaming
data from one or many sources and performs real-time ETL (Extract, Transform,
Load) on the data, delivering the result to a historical archive. In a
streaming data pipeline, incoming data may be sessionized, enriched, validated,
de-duped, aggregated, counted, discarded, cleansed, etc. by the database before
being delivered to the Data Lake.
Important features in this application pattern
include:
- Pre-built connectors to feed the streams of data into the pipeline. VoltDB includes connectors to import Kafka streams and relational data, and it also supports HadoopOutputFormat results from Hive and Pig.
- The ability to process data (aggregate, de-dupe, or transform it) as part of the ETL workflow. VoltDB transactions, implemented as Java stored procedures, allow developers to transactionally execute SQL, combined with Java business logic, to process each message individually or in aggregate; a procedure sketch follows this list.
- Pre-built export connectors to stream data downstream to the historical archive as fast as it arrives. VoltDB includes export connectors to stream data to Kafka, RabbitMQ, Hadoop, Vertica, Netezza, or any relational data store via JDBC.
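A minimal sketch of one such ETL step as a Java stored procedure appears below: it de-duplicates and lightly cleanses each event before inserting it into a table assumed to be declared for export in the schema. The names (seen_ids, archive_out, IngestEvent) are hypothetical.

```java
// Pipeline sketch: drop duplicate events, normalize the device id, and
// forward the cleansed record to archive_out, a (hypothetical) table the
// schema declares for export so VoltDB streams it to the downstream target.
import org.voltdb.SQLStmt;
import org.voltdb.VoltProcedure;

public class IngestEvent extends VoltProcedure {
    public final SQLStmt checkSeen = new SQLStmt(
            "SELECT event_id FROM seen_ids WHERE event_id = ?;");
    public final SQLStmt markSeen = new SQLStmt(
            "INSERT INTO seen_ids (event_id) VALUES (?);");
    public final SQLStmt emit = new SQLStmt(
            "INSERT INTO archive_out (event_id, device_id, ts, payload) " +
            "VALUES (?, ?, ?, ?);");

    // Returns 1 if the event was forwarded, 0 if dropped as a duplicate.
    public long run(long eventId, String deviceId, long tsMicros, String payload) {
        voltQueueSQL(checkSeen, eventId);
        if (voltExecuteSQL()[0].getRowCount() > 0) {
            return 0; // duplicate: discard
        }
        voltQueueSQL(markSeen, eventId);
        voltQueueSQL(emit, eventId, deviceId.toLowerCase(), tsMicros, payload);
        voltExecuteSQL(true);
        return 1;
    }
}
```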
Applications making use of this pattern often process continuous streams of data that must be validated, transformed, and archived in some manner. One example is processing device IDs (usually in the form of cookies): the pipeline computes segmentation output intelligence, providing correlation data for advanced decisioning applications, often in the digital ad-tech arena.
For a fast data pipeline code example
demonstrating click stream processing based on user segmentation, using VoltDB
as the ingestion and processing engine, see https://github.com/VoltDB/app-fastdata.
VoltDB and the Fast Data Pipeline
The fast data processing layer must have the
following properties across all use cases:
- High ingestion (write) rate. The pipeline must ingest data at historically challenging transaction rates; hundreds of thousands to millions of transactions per second are not uncommon. (A client-side ingestion sketch follows this list.)
- Real-time analysis of incoming data. Real-time analytics enable users to derive seasonal patterns, statistical summaries, scoring models, recommendation models, rankings, leaderboards, and other artifacts for use in user-facing applications.
- Real-time decisions. Enabling real-time transactions against new incoming data makes it possible to respond to users, or to drive downstream processes, based on the output of the analysis activity and the context and content of the present event.
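As a rough illustration of the ingestion side, the sketch below uses the VoltDB Java client’s asynchronous API, which keeps many transactions in flight at once rather than serializing round trips. IngestEvent is the hypothetical procedure sketched in the pipeline section above.

```java
// Ingestion sketch: fire-and-callback loading with the asynchronous client.
import org.voltdb.client.Client;
import org.voltdb.client.ClientFactory;
import org.voltdb.client.ClientResponse;

public class FirehoseLoader {
    public static void main(String[] args) throws Exception {
        Client client = ClientFactory.createClient();
        client.createConnection("localhost");

        for (long eventId = 0; eventId < 1_000_000; eventId++) {
            // The async overload returns as soon as the call is queued; the
            // callback fires when the transaction completes, so many calls
            // overlap instead of waiting on each round trip.
            client.callProcedure(
                    response -> {
                        if (response.getStatus() != ClientResponse.SUCCESS) {
                            System.err.println(response.getStatusString());
                        }
                    },
                    "IngestEvent",
                    eventId,
                    "device-" + (eventId % 1000),
                    System.currentTimeMillis() * 1000, // TIMESTAMP as microseconds
                    "{}");
        }
        client.drain(); // block until all outstanding calls complete
        client.close();
    }
}
```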
In addition, the three patterns require a
system architected to deliver:
- High availability. The pipeline processing engine, in this case VoltDB, can survive the loss of a machine or (most) network failures, whether planned for routine maintenance or caused by environmental issues or errors, and continue to operate correctly.
- Elastic, horizontal scalability. As throughput increases, more machines can be added to the pipeline processing engine to accommodate the additional traffic without interrupting the running system.
VoltDB is an in-memory relational database that is fully durable and maintains strict ACID properties. It is highly available and scales out elastically on commodity hardware. VoltDB includes a set of pre-built integrations, both streaming importers and exporters, all designed to help you ingest streaming data, process it within VoltDB, and export the processed data seamlessly to a historical data warehouse.
Whether you view your problem as OLTP with
real-time analytics or as stream processing, VoltDB is the only system that
combines ingestion, speed, robustness and strong consistency to make developing
and supporting apps easier than ever. Everything we build enables fast data
apps. We’ve helped our customers develop and deploy hundreds of them.