This post originally appeared on InfoWorld on December 1, 2017.

Are you caught up with the real-time (r)evolution?

Real-time analytics have been around since well before computers. Here's where the evolution of real time is headed in 2018

People often think of “real time” as a new concept, but the need for real-time notification and decisioning has been around forever. Gartner defines real-time analytics as “the discipline that applies logic and mathematics to data to provide insights for making better decisions quickly.” This definition has always held true, it is just the meaning and measurement of real time that has evolved. As we approach the end of 2017, it’s a good time to look at the evolution of real-time decision making and ask yourself—are you caught up?

Real time B.C. (before computers)

Before there were computers, people were finding ways to use real-time information. For instance, in 500 BC, Philippides ran 25 miles from Marathon to Athens, Greece, to alert the magistrates of the victory at the Battle at Marathon, and in 200 BC, smoke signals were used on the Great Wall of China to warn those further down the wall of enemy size and attack direction.

Fast-forward to 1775 when Paul Revere developed a real-time alerting system using lanterns. We all know the famous phrase “one if by land, two if by sea,” and after initiating this real-time alert, he rode on horseback to deliver the information to towns around Boston along with 40 other horseback riders to spread the word throughout eastern Massachusetts.

Real-time A.D. (after digital)

With the invention of computers, we upgraded from “human speed” to “machine speed.” For example, telephone call routing was initially executed by human switchboard operators, but by 1983, computers automated this process, enabling real-time voice communication for almost everyone.

People also began interacting with real-time systems, typically through an intermediary. The stock market is a good example—if you wanted to buy or sell securities, you had to make a phone call to your stockbroker who would execute the trade on your behalf using the real-time stock exchange trading software.

From an organizational standpoint, computers enabled companies to gather transactions into a database and produce monthly, weekly or even daily reports. At this early stage, nightly batch processing of data was generally viewed as “real-time data.”

Today in real time

Every day most of us interact with real-time systems and execute real-time decisions, possibly without even realizing it. For instance, going back to our stock example, trades can be executed in seconds by the consumer without a broker.

Take credit card fraud as another example. What was previously fraud detection has evolved into fraud prevention thanks to advancements in real-time technology, and preventing the unauthorized transaction from being processed can save corporations millions of dollars.

Smartphones have helped advance the capabilities of real-time processing thanks to the personalized data they can ingest. Tracking the location of consumers in real time allows companies to deliver personalized offers based on what they are close to. Mobile traffic applications like Waze provide real-time incident alerts and traffic updates to redirect drivers to the fastest route.

Tomorrow in real time

We are witnessing the commoditization of real-time information, alerts, and decisions, and these capabilities will proliferate across all application domains, including online gaming, casino gaming, financial compliance and risk, fraud prevention, digital ad technology, telecom policy and billing, and more.

As we look ahead to 2018, we’ll see real-time advancements in the following areas:

Self-driving cars will process data from countless sensors and ultimately get passengers to their destinations safely and efficiently, saving thousands of lives a year in the process.
Real-time threat detection and remedy for data security breaches and DDoS attacks will alert consumers as they occur.
Wearable health sensors (they already exist in your smartwatch) will not only monitor health statistics, but will also combine that telemetry with personal medical history to know when and how to notify doctors if necessary.
Augmented reality will become part of our everyday lives. When traveling in a foreign country, AR on cell phones and tablets will provide “live” translations of street signs and shops and integrate reviews, recommendations and advertisements on whatever signposts or buildings you scan.

The value of real time is the ability to make decisions as soon as possible based on the most readily available information. Information that used to take days to receive can now be ingested and analyzed in fractions of a second, and 2018 will only cut that time down even more, empowering companies to leverage real-time information to increase corporate profit, protect resources and even save lives.

This post originally appeared on VoltDB.com in December, 2014.

Evolving and Simplifying the Lambda Architecture

Introduction

The Lambda Architecture defines a robust framework for ingesting streams of Fast Data while providing efficient real-time and historical analytics. In Lambda, immutable data flows in one direction: into the system. The architecture’s main goal is to execute OLAP-type processing faster - in essence, reduce columnar analytics from every couple of seconds to 100ms or so, without actually enabling interesting new applications like real time application of user segments/scoring. As Lambda was conceived, it wasn’t designed to transact and make per-event decisions on the Fast Data, nor to be responsive to the events coming in, as they arrive.

What is Lambda?

The Lambda Architecture is a new Big Data architecture designed to ingest, process and query both fresh and historical (batch) data in a single data architecture. In his book “Big Data - Principles and best practices of scalable realtime data systems”, Nathan Marz introduces the Lambda Architecture and states that:

“The Lambda Architecture.. provides a general purpose approach to implementing an arbitrary function on an arbitrary dataset and having the function return its results with low latency.”

Nathan further defines the system as having both a batch (historical) layer, as well as a speed layer:

“The Lambda Architecture solves the problem of computing arbitrary functions

on arbitrary data in realtime by decomposing the problem into three layers: the

batch layer, the serving layer, and the speed layer. “

The batch layer is usually a “data lake” system like Hadoop, though it could also be an OLAP data warehouse like Vertica or Netezza. This historical archive is used to hold all of the data ever collected. The batch layer supports batch query; batch processing is used to generate analytics, either predefined or ad hoc.

The speed layer is defined as a combination of queuing, streaming and operational data stores. In the Lambda Architecture, the speed layer is similar to the batch layer in that it computes similar analytics - except that it computes those analytics in real-time on only the most recent data. The analytics the batch layer calculates, for example, may be based on data one hour old. It is the speed layer’s responsibility to calculate real-time analytics based on fast moving data - data that is zero to one hour old.

As you can see, if you combine the analytics produced by the batch layer as well as the speed layer, you have a complete view of the analytics across all data, fresh and historical. The third layer of Lambda, the serving layer, is responsible for serving up results, combined from both the speed and batch layer.

To summarize, Lambda defines a Big Data architecture that allows arbitrary queries and computations on both fast moving data as well as historical data.

“Typical” Lambda Applications

The Lambda Architecture is an emerging paradigm in Big Data computing. As such, new Lambda-based applications are emerging seemingly weekly. However, one of the more common use cases of Lambda-based applications is log ingestion and accompanying analytics. “Logs” in this context could be general log collection, website clickstream logging, VPN access logs, or the popular Twitter tweet stream collection.

Log messages often are created at a high velocity. They are immutable and usually are time-tagged or time ordered. This is the "fast data" that is captured and harvested - it is this data that is ingested by both Lambda’s speed layer and batch layer, usually in parallel, by way of message queues and streaming systems (like Kafka and Storm). The ingestion of each log message does not require a response to the entity that delivered the data - it is a one-way data pipeline.

A log message’s final resting place is the data lake, where batch metrics are [re]computed. The fast layer computes similar results for the most recent "window", staying ahead of the Hadoop/batch layer.

Analytics at the speed and batch layer can be predefined or ad hoc. Should new analytics be desired, Lambda suggests that you can re-run the entire data set, from the data lake or from the original log files, to recompute the new metrics. For example, analytics for website click logs could be counting page hits and page popularity. For tweet streams they could be computing trending topics.

VoltDB and the Lambda Architecture

VoltDB, a clustered, in-memory, relational database. It supports fast ingest of data, real-time ad hoc analytics and rapid export of data to downstream systems like Hadoop and OLAP offerings. It fits squarely and solidly into the Lambda Architecture’s speed layer. Like popular streaming systems, VoltDB is horizontally scalable, highly available, and fault tolerant — all while sustaining transactional ingestion speeds of hundreds of thousands to millions of events per second. In the standard Lambda Architecture, the inclusion of VoltDB greatly simplifies the speed layer by replacing both the streaming and the operational data store portions of the speed layer.

In the outlined Lambda Architecture, a queuing system like Kafka would feed both VoltDB and Hadoop, or VoltDB directly, which would then in turn immediately export the event to the data lake.

Future-proofing Lambda

As defined today, the Lambda Architecture is very focused on fast data collection and read-only queries on both fast and historical data. In Lambda, data is immutable - it never changes. Data comes into the system in streams and metrics, both historical and real time, and is calculated and maintained. External systems make use of the Lambda-based environment to query the computed analytics. These analytics are then used to alert, should metric thresholds be crossed, or harvested, for example in the case of Twitter trending topics.

When considering improvements to the Lambda Architecture, what if you could react, per event, to the incoming data stream? In essence, you’d have the ability to act on the incoming feed, in addition to performing real-time analytics.

Here at VoltDB we have a lot of experience building streaming applications for Fast Data, another term for the Lambda-defined “speed layer”. Most of our customers are building Fast Data applications, providing us with unique insight into the Lambda speed layer. These systems ingest events from log files, the Internet of Things (IoT), user clickstreams, online game play, and financial applications. While some of these applications passively ingest events and provide real-time analytics and alerting on the data streams (in typical Lambda style), many of these applications have begun interacting with the stream, adding per-event decisioning and transactions in addition to real-time analytics.

Additionally, another characteristic of these systems is that the speed layer analytics can differ from the batch layer analytics. Often the data lake is used to mine intelligence via exploratory queries. This intelligence, when identified, is then fed to the speed layer as input to the per-event decisions. Scott Jarr describes this fully interactive Lambda-like application evolution here, http://voltdb.com/blog/youd-better-do-fast-data-right/ in his Fast Data blog.

In this diagram you can see the additions to the architecture:

1. Data arrives at a high rate and is ingested. It is immediately exported to the Batch Layer, the Data Lake.

2. Historical intelligence can be mined from the Data Lake and the aggregate “intelligence” can be delivered to the Speed Layer for per-event real-time decisioning (for instance, to determine which ad to display for a segmented/categorized web browser/user).

3. Fast Data is either passively ingested, or a response can be computed by the new decisioning layer, using both real-time data as well as historical “mined” intelligence.

For a working code example of the simplified speed layer, reference the

VoltDB “fast data” application posted here: http://voltdb.github.io/app-fastdata.

Conclusion

The Lambda Architecture is a powerful Big Data analytics framework that serves queries from both fast and historical data. However, the architecture emerged from a need to execute OLAP-type processing faster, without considering a new class of applications that require per-event decisioning: applications like real time application of user segments/scoring, fraud detection, denial of service attacks, policy and billing, etc. In its current form, Lambda is limited: immutable data flows in one direction, into the system, for analytics harvesting.

Adding VoltDB, a linearly scalable in-memory relational database, into the Lambda Architecture greatly simplifies the speed layer by reducing the number of components needed.

Lambda’s shortcoming is the inability to build responsive, event-oriented applications.

In addition to simplifying the architecture, VoltDB provides future-proof functionality to Lambda, specifically, the ability to execute transactions and per-event decisions on Fast Data as it arrives. Rather than a one-way streaming system feeding events into the speed layer, adding an ingestion engine like VoltDB provides developers with the ability to place applications in front of the event stream to capture value the moment the event arrives, rather than capturing value at some point after the event arrived on an aggregate-basis.

VoltDB improves the Lambda architecture by:

● Reducing the number of moving pieces, the products and components, needed. Specifically, major components of the speed layer can be replaced by a single component, VoltDB. Further, VoltDB can be used as a data store for the serving layer.

● Enables the ability for the application to make per-event decisioning and transactional behavior, without re-implementing the architecture once deployed.

● Providing the traditional relational database interaction model, with ad hoc SQL capabilities, on Fast Data. Applications can use standard SQL providing agility to their query needs without requiring complex programming logic.

● Providing access to standard analytics tooling, such as Tableau, MicroStrategy, and Actuate BIRT, on top of Fast Data.

Agile Making Progress

Monday, December 18, 2017

Are you caught up with the real-time (r)evolution?

This post originally appeared on InfoWorld on December 1, 2017.

Are you caught up with the real-time (r)evolution?

Real-time analytics have been around since well before computers. Here's where the evolution of real time is headed in 2018

Wednesday, August 2, 2017

Simplifying the Lambda Architecture