
The world beyond batch: Streaming 101

A high-level tour of modern data-processing concepts.

By Tyler Akidau, August 5, 2015

Header image: Three women wading in a stream gathering leeches (source: Wellcome Library, London).

Editor's note: This is the first…

Figure 8: Example windowing strategies. Each example is shown for three different keys, highlighting the difference between aligned windows (which apply across all the data) and unaligned windows (which apply across a subset of the data). Image: Tyler Akidau.

  • Fixed windows: Fixed windows slice up time into segments with a fixed-size temporal length. Typically (as in Figure 8), the segments for fixed windows are applied uniformly across the entire data set, which is an example of aligned windows. In some cases, it's desirable to phase-shift the windows for different subsets of the data (e.g., per key) to spread window-completion load more evenly over time, which instead is an example of unaligned windows, since they vary across the data.
  • Sliding windows: A generalization of fixed windows, sliding windows are defined by a fixed length and a fixed period. If the period is less than the length, the windows overlap. If the period equals the length, you have fixed windows. And if the period is greater than the length, you have a weird sort of sampling window that looks only at subsets of the data over time. As with fixed windows, sliding windows are typically aligned, though they may be unaligned as a performance optimization in certain use cases. Note that the sliding windows in Figure 8 are drawn as they are to give a sense of sliding motion; in reality, all five windows would apply across the entire data set.
  • Sessions: An example of dynamic windows, sessions are composed of sequences of events terminated by a gap of inactivity greater than some timeout. Sessions are commonly used for analyzing user behavior over time, grouping together a series of temporally related events (e.g., a sequence of videos viewed in one sitting). Sessions are interesting because their lengths cannot be defined a priori; they depend upon the actual data involved. They're also the canonical example of unaligned windows, since sessions are practically never identical across different subsets of data (e.g., different users).
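To make the three strategies concrete, here is a minimal Python sketch (my own illustration, not code from the article) that assigns timestamps to fixed, sliding, and session windows. Windows are half-open [start, end) intervals, and all names are hypothetical:

```python
def fixed_window(ts, size):
    """The single [start, end) fixed window containing timestamp ts."""
    start = ts - (ts % size)
    return (start, start + size)

def sliding_windows(ts, size, period):
    """Every [start, end) sliding window containing ts. With period equal
    to size this degenerates to the single fixed window; with period less
    than size, windows overlap and ts falls into size/period of them."""
    windows = []
    start = ts - (ts % period)      # latest aligned start containing ts
    while start > ts - size:
        windows.append((start, start + size))
        start -= period
    return windows

def session_windows(timestamps, gap):
    """Group timestamps into sessions: a new session begins whenever the
    gap since the previous event exceeds the inactivity timeout."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)     # within the timeout: extend session
        else:
            sessions.append([ts])       # inactivity gap: start a new session
    return sessions
```

Note the structural difference the text describes: fixed and sliding windows can be computed from a single timestamp, while session boundaries are data-driven and only known once the surrounding events are.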

The two domains of time discussed — processing time and event time — are essentially the two we care about[2]. Windowing makes sense in both domains, so we'll look at each in detail and see how they differ. Since processing time windowing is vastly more common in existing systems, I'll start there.

Windowing by processing time


Figure 9: Windowing into fixed windows by processing time. Data are collected into windows based on the order they arrive in the pipeline. Image: Tyler Akidau.

When windowing by processing time, the system essentially buffers up incoming data into windows until some amount of processing time has passed. For example, in the case of five-minute fixed windows, the system would buffer up data for five minutes of processing time, after which it would treat all the data it had observed in those five minutes as a window and send them downstream for processing.
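A minimal sketch of that buffering logic (hypothetical names throughout; the clock is injectable so window closing is deterministic in tests):

```python
import time

class ProcessingTimeWindower:
    """Buffer records until `size` seconds of processing time elapse,
    then emit the buffer downstream as one window."""

    def __init__(self, size, emit, clock=time.monotonic):
        self.size = size
        self.emit = emit                  # downstream callback
        self.clock = clock                # injectable for testing
        self.buffer = []
        self.window_start = clock()

    def on_record(self, record):
        now = self.clock()
        if now - self.window_start >= self.size:
            self.emit(self.buffer)        # window closes: ship it downstream
            self.buffer = []
            self.window_start = now
        self.buffer.append(record)
```

Note that event times play no role here: records are grouped purely by when they reach the pipeline, which is exactly what makes this scheme simple and what causes the problems discussed below.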

There are a few nice properties of processing time windowing:

  • It's simple. The implementation is extremely straightforward, since you never worry about shuffling data within time. You just buffer things up as they arrive and send them downstream when the window closes.
  • Judging window completeness is straightforward. Since the system has perfect knowledge of whether or not all inputs for a window have been seen, it can make perfect decisions about whether a given window is complete. This means there is no need to deal with "late" data in any way when windowing by processing time.
  • If you want to infer information about the source as it is observed, processing time windowing is exactly what you want. Many monitoring scenarios fall into this category. Imagine tracking the number of requests per second sent to a global-scale Web service. Calculating a rate of these requests for the purpose of detecting outages is a perfect use of processing time windowing.

Good points aside, there is one very big downside to processing time windowing: if the data in question have event times associated with them, those data must arrive in event time order if the processing time windows are to reflect the reality of when those events actually happened. Unfortunately, event-time ordered data are uncommon in many real-world, distributed input sources.

As a simple example, imagine any mobile app that gathers usage statistics for later processing. In cases where a given mobile device goes offline for any amount of time (brief loss of connectivity, airplane mode while flying across the country, etc.), the data recorded during that period won't be uploaded until the device comes online again. That means data might arrive with an event time skew of minutes, hours, days, weeks, or more. It's essentially impossible to draw any sort of useful inferences from such a data set when windowed by processing time.

As another example, many distributed input sources may seem to provide event-time ordered (or very nearly so) data when the overall system is healthy. Unfortunately, the fact that event-time skew is low for the input source when healthy does not mean it will always stay that way. Consider a global service that processes data collected on multiple continents. If network issues across a bandwidth-constrained transcontinental line (which, sadly, are surprisingly common) further decrease bandwidth and/or increase latency, suddenly a portion of your input data may start arriving with much greater skew than before. If you are windowing those data by processing time, your windows are no longer representative of the data that actually occurred within them; instead, they represent the windows of time as the events arrived at the processing pipeline, which is some arbitrary mix of old and current data.
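The failure mode is easy to reproduce. In this sketch (my own illustration, with hypothetical names and made-up data), each record carries both an arrival time and an event time; grouping by arrival time files a delayed record in a window it does not belong to:

```python
def window_by_arrival(arrivals, size):
    """Group records into fixed windows keyed by arrival (processing)
    time; event times are carried along but ignored for placement."""
    windows = {}
    for arrival_time, event_time, value in arrivals:
        start = arrival_time - (arrival_time % size)
        windows.setdefault((start, start + size), []).append(value)
    return windows

# 'late' happened at event time 2 but arrived at time 11 (say, after a
# connectivity blip), so 10-second processing-time windows group it with
# an event that actually happened around time 12.
arrivals = [(1, 1, 'a'), (11, 2, 'late'), (12, 12, 'b')]
```

Here `window_by_arrival(arrivals, 10)` places 'late' in the (10, 20) window even though its event time falls in (0, 10) — the arbitrary mix of old and current data described above.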

What we really want in both of those cases is to window data by their event times in a way that is robust to the order of arrival of events. What we really want is event time windowing.

Windowing by event time

Event time windowing is what you use when you need to observe a data source in finite chunks that reflect the times at which those events actually happened. It's the gold standard of windowing. Sadly, most data processing systems in use today lack native support for it (though any system with a decent consistency model, like Hadoop or Spark Streaming, could act as a reasonable substrate for building such a windowing system).

This diagram shows an example of windowing an unbounded source into one-hour fixed windows:


Figure 10: Windowing into fixed windows by event time. Data are collected into windows based on the times they occurred. The white arrows call out example data that arrived in processing time windows that differed from the event time windows to which they belonged. Image: Tyler Akidau.

The solid white lines in the diagram call out two particular data of interest. Those two data both arrived in processing time windows that did not match the event time windows to which they belonged. As such, if these data had been windowed into processing time windows for a use case that cared about event times, the calculated results would have been incorrect. As one would expect, event time correctness is one nice thing about using event time windows.
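The event time counterpart can be sketched as follows (again my own illustration, with hypothetical names): placement depends only on each record's event timestamp, so arrival order is irrelevant and late data land in their correct windows:

```python
from collections import defaultdict

def window_by_event_time(records, size):
    """Group (event_time, value) records into fixed event-time windows.
    A record always lands in the window its event time belongs to,
    no matter how late it arrives."""
    windows = defaultdict(list)
    for event_time, value in records:
        start = event_time - (event_time % size)
        windows[(start, start + size)].append(value)
    return dict(windows)

# Records arrive badly out of order, yet each lands in the window
# matching when it actually happened.
arrived_out_of_order = [(12, 'a'), (3, 'b'), (7, 'c'), (14, 'd')]
```

The trade-off is completeness: unlike processing time windows, the system cannot know on its own when such a window has seen all of its data, since records may arrive arbitrarily late.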
