
The world beyond batch: Streaming 101

A high-level tour of modern data-processing concepts.

By Tyler Akidau, August 5, 2015

[Header image: Three women wading in a stream gathering leeches (source: Wellcome Library, London).]


Figure 1: Example time domain mapping. The X-axis represents event time completeness in the system, i.e. the time X in event time up to which all data with event times less than X have been observed. The Y-axis represents the progress of processing time, i.e. normal clock time as observed by the data processing system as it executes. Image: Tyler Akidau.

The black dashed line with a slope of one represents the ideal, where processing time and event time are exactly equal; the red line represents reality. In this example, the system lags a bit at the beginning of processing time, veers closer toward the ideal in the middle, then lags again a bit toward the end. The horizontal distance between the ideal and the red line is the skew between processing time and event time. That skew is essentially the latency introduced by the processing pipeline.
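The skew can be made concrete with a few hypothetical timestamp pairs (epoch seconds; all numbers here are invented for illustration, not taken from the figure):

```python
# Skew between processing time and event time for a few hypothetical
# events. The skew is the latency the pipeline adds on top of the
# ideal line, where the two times would be exactly equal.
events = [
    {"event_time": 100, "processing_time": 108},  # lags at the start
    {"event_time": 200, "processing_time": 203},  # closer to the ideal
    {"event_time": 300, "processing_time": 311},  # lags again at the end
]

for e in events:
    skew = e["processing_time"] - e["event_time"]
    print(f"event_time={e['event_time']} skew={skew}s")
```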


Since the mapping between event time and processing time is not static, this means you cannot analyze your data solely within the context of when they are observed in your pipeline if you care about their event times (i.e., when the events actually occurred). Unfortunately, this is the way most existing systems designed for unbounded data operate. To cope with the infinite nature of unbounded data sets, these systems typically provide some notion of windowing the incoming data. We’ll discuss windowing in great depth below, but it essentially means chopping up a data set into finite pieces along temporal boundaries.
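That chopping along temporal boundaries can be sketched as a fixed-window assignment (the one-minute window size and the records are invented for the example):

```python
from collections import defaultdict

WINDOW_SIZE = 60  # seconds; one-minute fixed windows (arbitrary choice)

def window_start(event_time, size=WINDOW_SIZE):
    """Align an event-time timestamp to the start of its fixed window."""
    return event_time - (event_time % size)

# Hypothetical (event_time, value) pairs from an unbounded source.
records = [(12, "a"), (75, "b"), (61, "c"), (130, "d")]

windows = defaultdict(list)
for ts, value in records:
    windows[window_start(ts)].append(value)

print(dict(windows))  # {0: ['a'], 60: ['b', 'c'], 120: ['d']}
```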

If you care about correctness and are interested in analyzing your data in the context of their event times, you cannot define those temporal boundaries using processing time (i.e., processing-time windowing), as most existing systems do; with no consistent correlation between processing time and event time, some of your event-time data are going to end up in the wrong processing-time windows (due to the inherent lag in distributed systems, the online/offline nature of many types of input sources, etc.), throwing correctness out the window, as it were. We’ll look at this problem in more detail in a number of examples below, as well as in the next post.
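To see the misassignment concretely, consider a single delayed event (timestamps and window size are hypothetical):

```python
WINDOW = 60  # one-minute windows, in seconds (arbitrary for the example)

def bucket(ts):
    """Align a timestamp to the start of its window."""
    return ts - ts % WINDOW

# A hypothetical event that occurred at t=55 but, thanks to network
# lag, was only observed by the pipeline at t=70.
event_time, processing_time = 55, 70

# Windowing by event time puts it in the window it belongs to...
print(bucket(event_time))       # 0
# ...while windowing by processing time files it one window too late.
print(bucket(processing_time))  # 60
```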

Unfortunately, the picture isn’t exactly rosy when windowing by event time, either. In the context of unbounded data, disorder and variable skew induce a completeness problem for event time windows: lacking a predictable mapping between processing time and event time, how can you determine when you’ve observed all the data for a given event time X? For many real-world data sources, you simply can’t. The vast majority of data processing systems in use today rely on some notion of completeness, which puts them at a severe disadvantage when applied to unbounded data sets.
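One common heuristic answer, sketched here as an assumption for illustration rather than anything the text above prescribes, is to bound how late data may arrive and treat everything older than that bound as complete:

```python
# Heuristic completeness estimate: suppose no event arrives more than
# MAX_DELAY behind the newest event time seen so far. This is a guess,
# not a guarantee -- data later than the bound still break it.
MAX_DELAY = 30  # seconds; an assumed bound on lateness

max_event_time_seen = 0

def observe(event_time):
    """Track event-time progress; return the time X up to which we
    *guess* all data with smaller event times have been observed."""
    global max_event_time_seen
    max_event_time_seen = max(max_event_time_seen, event_time)
    return max_event_time_seen - MAX_DELAY

for ts in (100, 95, 130, 128):
    print(observe(ts))  # 70, 70, 100, 100
```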

I propose that instead of attempting to groom unbounded data into finite batches of information that eventually become complete,we should be designing tools that allow us to live in the world of uncertainty imposed by these complex data sets. New data will arrive,old data may be retracted or updated,and any system we build should be able to cope with these facts on its own,with notions of completeness being a convenient optimization rather than a semantic necessity.

Before diving into how we’ve tried to build such a system with the Dataflow Model used in Cloud Dataflow, let’s finish up one more useful piece of background: common data processing patterns.

Data processing patterns

At this point, we have enough background established that we can start looking at the core types of usage patterns common across bounded and unbounded data processing today. We’ll look at both types of processing, and where relevant, within the context of the two main types of engines we care about (batch and streaming, where in this context I’m essentially lumping micro-batch in with streaming, since the differences between the two aren’t terribly important at this level).

Bounded data

Processing bounded data is quite straightforward, and likely familiar to everyone. In the diagram below, we start out on the left with a data set full of entropy. We run it through some data processing engine (typically batch, though a well-designed streaming engine would work just as well), such as MapReduce, and on the right end up with a new structured data set with greater inherent value:
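As a toy instance of this bounded-data pattern (the input and the word-count step are made up for illustration, not taken from the text):

```python
# A finite, unordered input is run through one processing step (here,
# a word count) to produce a structured result, in the spirit of the
# left-to-right flow of Figure 2.
from collections import Counter

bounded_input = ["a b", "b c", "a a"]  # finite data set, full of entropy

counts = Counter(word for line in bounded_input for word in line.split())
print(dict(counts))  # {'a': 3, 'b': 2, 'c': 1}
```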


Figure 2: Bounded data processing with a classic batch engine. A finite pool of unstructured data on the left is run through a data processing engine, resulting in corresponding structured data on the right. Image: Tyler Akidau.

Though there are, of course, infinite variations on what you can actually calculate as part of this scheme, the overall model is quite simple. Much more interesting is the task of processing an unbounded data set. Let’s now look at the various ways unbounded data are typically processed, starting with the approaches used with traditional batch engines, and then ending up with the approaches one can take with a system designed for unbounded data, such as most streaming or micro-batch engines.

Unbounded data — batch

Batch engines, though not explicitly designed with unbounded data in mind, have been used to process unbounded data sets since batch systems were first conceived. As one might expect, such approaches revolve around slicing up the unbounded data into a collection of bounded data sets appropriate for batch processing.

Fixed windows

The most common way to process an unbounded data set using repeated runs of a batch engine is by windowing the input data into fixed-sized windows, then processing each of those windows as a separate, bounded data source. Particularly for input sources like logs, where events can be written into directory and file hierarchies whose names encode the window they correspond to, this sort of thing appears quite straightforward at first blush, since you’ve essentially performed the time-based shuffle to get data into the appropriate event-time windows ahead of time.
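The directory-naming idea can be sketched as follows (hourly windows and the `logs/` path layout are assumptions chosen for the example):

```python
import time

def log_path(event_time, size=3600):
    """Map an event-time timestamp (epoch seconds) to an hourly,
    window-named log directory; the name itself encodes the window."""
    start = event_time - event_time % size  # align to the hour
    return time.strftime("logs/%Y/%m/%d/%H/", time.gmtime(start))

print(log_path(0))     # logs/1970/01/01/00/
print(log_path(3725))  # logs/1970/01/01/01/ -- same window as any t in [3600, 7200)
```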

In reality, most systems still have a completeness problem to deal with: what if some of your events are delayed en route to the logs due to a network partition? What if your events are collected globally and must be transferred to a common location before processing? What if your events come from mobile devices? This means some sort of mitigation may be necessary (e.g., delaying processing until you’re sure all events have been collected, or re-processing the entire batch for a given window whenever data arrive late).
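A minimal sketch of the second mitigation, re-running the whole batch for a window whenever late data arrive (the names and the sum aggregate are hypothetical):

```python
# Keep each window's raw records around so a late event simply
# triggers another batch run over that window.
from collections import defaultdict

WINDOW = 60                # seconds per window (arbitrary)
raw = defaultdict(list)    # window start -> raw records seen so far
results = {}               # window start -> latest (re)computed aggregate

def ingest(event_time, value):
    w = event_time - event_time % WINDOW
    raw[w].append(value)
    results[w] = sum(raw[w])  # re-run the "batch job" for that window

ingest(10, 5)
ingest(20, 7)
print(results[0])  # 12
ingest(15, 3)      # a late arrival for window 0
print(results[0])  # 15 -- window 0 was re-processed
```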


Figure 3: Unbounded data processing via ad hoc fixed windows with a classic batch engine. An unbounded data set is collected up front into finite, fixed-size windows of bounded data that are then processed via successive runs of a classic batch engine. Image: Tyler Akidau.

Sessions
