Another nice thing about event time windowing over an unbounded data source is that you can create dynamically sized windows, such as sessions, without the arbitrary splits observed when generating sessions over fixed windows (as we saw previously in the sessions example from the “Unbounded data — batch” section):
Figure 11: Windowing into session windows by event time. Data are collected into session windows capturing bursts of activity based on the times that the corresponding events occurred. The white arrows again call out the temporal shuffle necessary to put the data into their correct event-time locations. Image: Tyler Akidau.

Of course, powerful semantics rarely come for free, and event time windows are no exception. Event time windows have two notable drawbacks, both stemming from the fact that windows must often live longer (in processing time) than the actual length of the window itself: buffering, since longer window lifetimes mean more data must be kept around, and completeness, since we often have no good way of knowing when we’ve seen all the data for a given window.
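As a concrete illustration (not from the original text), the gap-based merging that Figure 11 depicts can be sketched in a few lines of Python; the events, gap size, and function name here are illustrative:

```python
from typing import List, Tuple

def assign_sessions(events: List[Tuple[str, int]], gap: int) -> List[Tuple[int, int]]:
    """Merge event timestamps into session windows.

    A session is a burst of activity: events whose event times are
    separated by less than `gap` belong to the same session.
    """
    if not events:
        return []
    # Sort by event time -- the "temporal shuffle" that puts data arriving
    # out of order back into their correct event-time locations.
    times = sorted(t for _, t in events)
    sessions = []
    start = end = times[0]
    for t in times[1:]:
        if t - end < gap:
            end = t          # within the gap: extend the current session
        else:
            sessions.append((start, end))
            start = end = t  # gap exceeded: start a new session
    sessions.append((start, end))
    return sessions

# Events may arrive in any processing-time order; sessions are keyed
# purely to when the events occurred.
events = [("score", 13), ("score", 1), ("score", 2), ("score", 25), ("score", 12)]
print(assign_sessions(events, gap=5))  # [(1, 2), (12, 13), (25, 25)]
```

Note that because sessions are defined by event time, a late-arriving event can extend (or even merge) sessions, which is exactly why such windows must outlive their own event-time extent.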
Conclusion

Whew! That was a lot of information. To those of you who have made it this far: you are to be commended! At this point we are roughly halfway through the material I want to cover, so it’s probably reasonable to step back, recap what I’ve covered so far, and let things settle a bit before diving into Part 2. The upside of all this is that Part 1 is the boring post; Part 2 is where the fun really begins.

Recap

To summarize, in this post I’ve clarified terminology (in particular, narrowing “streaming” to describe execution engines rather than the unbounded data they often process), argued that well-designed streaming systems are a strict superset of batch systems, examined the differences between event time and processing time, and looked at the common approaches for processing bounded and unbounded data with both batch and streaming engines.
Next time

This post provides the context necessary for the concrete examples I’ll be exploring in Part 2. That post will walk through the core concepts of the Dataflow Model as applied to real use cases: what results are calculated, where in event time they are calculated, when in processing time they are materialized, and how earlier results relate to later refinements.
Should be a good time. See you then!

[1] One which I propose is not an inherent limitation of streaming systems, but simply a consequence of design choices made in most streaming systems thus far. The efficiency delta between batch and streaming is largely the result of the increased bundling and more efficient shuffle transports found in batch systems. Modern batch systems go to great lengths to implement sophisticated optimizations that allow for remarkable levels of throughput using surprisingly modest compute resources. There’s no reason the types of clever insights that make batch systems the efficiency heavyweights they are today couldn’t be incorporated into a system designed for unbounded data, providing users flexible choice between what we typically consider to be high-latency, higher-efficiency “batch” processing and low-latency, lower-efficiency “streaming” processing. This is effectively what we’ve done with Cloud Dataflow by providing both batch and streaming runners under the same unified model. In our case, we use separate runners because we happen to have two independently designed systems optimized for their specific use cases. Long-term, from an engineering perspective, I’d love to see us merge the two into a single system that incorporates the best parts of both, while still maintaining the flexibility of choosing an appropriate efficiency level. But that’s not what we have today. And honestly, thanks to the unified Dataflow Model, it’s not even strictly necessary, so it may well never happen.

[2] If you poke around enough in the academic literature or SQL-based streaming systems, you’ll also come across a third windowing time domain: tuple-based windowing (i.e., windows whose sizes are counted in numbers of elements). However, tuple-based windowing is essentially a form of processing-time windowing where elements are assigned monotonically increasing timestamps as they arrive at the system.
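That equivalence can be sketched in a few lines of Python (the function name and window size here are illustrative): each element receives a monotonically increasing arrival “timestamp,” and fixed windows are cut over that timestamp.

```python
from typing import Iterable, List

def tuple_windows(stream: Iterable, size: int) -> List[list]:
    """Tuple-based windowing recast as processing-time windowing:
    each element's arrival index serves as a monotonically increasing
    timestamp, and fixed windows of `size` are cut over that index."""
    windows: dict = {}
    for arrival_ts, element in enumerate(stream):
        windows.setdefault(arrival_ts // size, []).append(element)
    return [windows[k] for k in sorted(windows)]

# Windows of three elements each, regardless of any event-time notion.
print(tuple_windows("abcdefg", 3))  # [['a', 'b', 'c'], ['d', 'e', 'f'], ['g']]
```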
As such, we won’t discuss tuple-based windowing in detail here (though we will see an example of it in Part 2).

Article image: Three women wading in a stream gathering leeches (source: Wellcome Library, London).
Tyler Akidau


