
The world beyond batch: Streaming 101

A high-level tour of modern data-processing concepts.

By Tyler Akidau, August 5, 2015

Three women wading in a stream gathering leeches (source: Wellcome Library, London).

This approach breaks down even more when you try to use a batch engine to process unbounded data into more sophisticated windowing strategies, like sessions. Sessions are typically defined as periods of activity (e.g., for a specific user) terminated by a gap of inactivity. When calculating sessions using a typical batch engine, you often end up with sessions that are split across batches, as indicated by the red marks in the diagram below. The number of splits can be reduced by increasing batch sizes, but at the cost of increased latency. Another option is to add additional logic to stitch up sessions from previous runs, but at the cost of further complexity.


Figure 4: Unbounded data processing into sessions via ad hoc fixed windows with a classic batch engine. An unbounded data set is collected up front into finite, fixed-size windows of bounded data that are then subdivided into dynamic session windows via successive runs of a classic batch engine. Image: Tyler Akidau.

Either way, using a classic batch engine to calculate sessions is less than ideal. A nicer way would be to build up sessions in a streaming manner, which we’ll look at later on.

Unbounded data — streaming

Contrary to the ad hoc nature of most batch-based unbounded data processing approaches, streaming systems are built for unbounded data. As I noted earlier, for many real-world, distributed input sources, you not only find yourself dealing with unbounded data, but also data that are:

  • Highly unordered with respect to event times, meaning you need some sort of time-based shuffle in your pipeline if you want to analyze the data in the context in which they occurred.
  • Of varying event time skew, meaning you can’t just assume you’ll always see most of the data for a given event time X within some constant epsilon of time Y.

There are a handful of approaches one can take when dealing with data that have these characteristics. I generally categorize these approaches into four groups:

  • Time-agnostic
  • Approximation
  • Windowing by processing time
  • Windowing by event time

We’ll now spend a little bit of time looking at each of these approaches.

Time-agnostic

Time-agnostic processing is used in cases where time is essentially irrelevant — i.e., all relevant logic is data driven. Since everything about such use cases is dictated by the arrival of more data, there’s really nothing special a streaming engine has to support other than basic data delivery. As a result, essentially all streaming systems in existence support time-agnostic use cases out of the box (modulo system-to-system variances in consistency guarantees, for those of you that care about correctness). Batch systems are also well suited for time-agnostic processing of unbounded data sources, by simply chopping the unbounded source into an arbitrary sequence of bounded data sets and processing those data sets independently. We’ll look at a couple of concrete examples in this section, but given the straightforwardness of handling time-agnostic processing, won’t spend much more time on it beyond that.

Filtering

A very basic form of time-agnostic processing is filtering. Imagine you’re processing Web traffic logs, and you want to filter out all traffic that didn’t originate from a specific domain. You would look at each record as it arrived, see if it belonged to the domain of interest, and drop it if not. Since this sort of thing depends only on a single element at any time, the fact that the data source is unbounded, unordered, and of varying event time skew is irrelevant.
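Such a filter can be sketched as a simple Python generator; the record shape (a dict with a "domain" field) is an assumption for illustration, not something the original specifies:

```python
def filter_by_domain(records, domain):
    """Yield only records originating from the given domain.

    Each element is examined independently as it arrives, so the
    order and boundedness of the input stream are irrelevant.
    """
    for record in records:
        if record.get("domain") == domain:
            yield record

# Works identically on a finite list or an endless stream.
logs = [
    {"domain": "example.com", "path": "/a"},
    {"domain": "other.org", "path": "/b"},
    {"domain": "example.com", "path": "/c"},
]
kept = list(filter_by_domain(logs, "example.com"))
```

Because the logic never consults a timestamp, the same generator serves both a batch and a streaming deployment unchanged.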


Figure 5: Filtering unbounded data. A collection of data (flowing left to right) of varying types is filtered into a homogeneous collection containing a single type. Image: Tyler Akidau.

Inner-joins

Another time-agnostic example is an inner-join (or hash-join). When joining two unbounded data sources, if you only care about the results of a join once elements from both sources have arrived, there’s no temporal element to the logic. Upon seeing a value from one source, you can simply buffer it up in persistent state; you only need to emit the joined record once the second value from the other source arrives. (In truth, you’d likely want some sort of garbage collection policy for unemitted partial joins, which would likely be time based. But for a use case with little or no uncompleted joins, such a thing might not be an issue.)
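A minimal sketch of such a buffered join, assuming events arrive as (source, key, value) tuples from an interleaved stream; in a real system the buffers would live in persistent state and need the garbage-collection policy mentioned above:

```python
from collections import defaultdict

def stream_inner_join(events):
    """events: iterable of (source, key, value) with source "left"/"right".

    Buffers each value by join key and emits a joined record whenever a
    matching value from the other source has already been seen.
    """
    left = defaultdict(list)   # key -> buffered values from the left source
    right = defaultdict(list)  # key -> buffered values from the right source
    for source, key, value in events:
        if source == "left":
            left[key].append(value)
            for r in right[key]:       # join against everything already
                yield (key, value, r)  # buffered on the other side
        else:
            right[key].append(value)
            for l in left[key]:
                yield (key, l, value)

events = [("left", "user1", "click"),
          ("right", "user2", "buy"),
          ("right", "user1", "buy")]
joined = list(stream_inner_join(events))
```

Note that nothing here inspects a timestamp: results are emitted purely as a function of data arrival, which is what makes the inner-join time-agnostic.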


Figure 6: Performing an inner join on unbounded data. Joins are produced when matching elements from both sources are observed. Image: Tyler Akidau.

Switching semantics to some sort of outer join introduces the data completeness problem we’ve talked about: once you’ve seen one side of the join, how do you know whether the other side is ever going to arrive or not? Truth be told, you don’t, so you have to introduce some notion of a timeout, which introduces an element of time. That element of time is essentially a form of windowing, which we’ll look at more closely in a moment.

Approximation algorithms


Figure 7: Computing approximations on unbounded data. Data are run through a complex algorithm, yielding output data that look more or less like the desired result on the other side. Image: Tyler Akidau.

The second major category of approaches is approximation algorithms, such as approximate Top-N, streaming K-means, etc. They take an unbounded source of input and provide output data that, if you squint at them, look more or less like what you were hoping to get. The upside of approximation algorithms is that, by design, they are low overhead and designed for unbounded data. The downsides are that a limited set of them exist, the algorithms themselves are often complicated (which makes it difficult to conjure up new ones), and their approximate nature limits their utility.
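As one concrete example (not from the original text), the classic Misra-Gries heavy-hitters algorithm, which underlies many approximate Top-N implementations, can be sketched as follows. It keeps at most k-1 counters regardless of input size, so memory stays constant over unbounded input at the cost of under-counting each item by at most n/k after n elements:

```python
def misra_gries(stream, k):
    """Track candidate heavy hitters (items with frequency > n/k)."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # Counter table is full: decrement every counter and
            # evict any that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

stream = ["a", "b", "a", "c", "a", "b", "a"]
candidates = misra_gries(stream, k=3)
```

The returned counts are estimates, not exact frequencies — precisely the "squint at them" trade-off described above.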

It’s worth noting: these algorithms typically do have some element of time in their design (e.g., some sort of built-in decay). And since they process elements as they arrive, that element of time is usually processing-time based. This is particularly important for algorithms that provide some sort of provable error bounds on their approximations. If those error bounds are predicated on data arriving in order, they mean essentially nothing when you feed the algorithm unordered data with varying event-time skew. Something to keep in mind.

Approximation algorithms themselves are a fascinating subject, but as they are essentially another example of time-agnostic processing (modulo the temporal features of the algorithms themselves), they’re quite straightforward to use, and thus not worth further attention given our current focus.

Windowing

The remaining two approaches for unbounded data processing are both variations of windowing. Before diving into the differences between them, I should make it clear exactly what I mean by windowing, since I’ve only touched on it briefly. Windowing is simply the notion of taking a data source (either unbounded or bounded), and chopping it up along temporal boundaries into finite chunks for processing. The following diagram shows three different windowing patterns:
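The chopping operation itself can be sketched for the simplest pattern, fixed-size windows, by truncating each element’s timestamp to a window boundary; whether that timestamp is a processing time or an event time is exactly the distinction the two remaining approaches draw:

```python
from collections import defaultdict

def fixed_windows(events, size_secs):
    """Assign (timestamp_secs, value) pairs to fixed-size windows.

    Each window is identified by its start time, computed by
    truncating the element's timestamp down to a size_secs boundary.
    """
    windows = defaultdict(list)
    for ts, value in events:
        window_start = ts - (ts % size_secs)
        windows[window_start].append(value)
    return dict(windows)

events = [(0, "a"), (59, "b"), (61, "c"), (130, "d")]
wins = fixed_windows(events, size_secs=60)
```

Sliding windows generalize this by assigning each element to several overlapping windows, and session windows make the boundaries data-dependent rather than fixed.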


