Another nice thing about event time windowing over an unbounded data source is that you can create dynamically sized windows, such as sessions, without the arbitrary splits observed when generating sessions over fixed windows (as we saw previously in the sessions example from the “Unbounded data — batch” section):
Figure 11: Windowing into session windows by event time. Data are collected into session windows capturing bursts of activity based on the times that the corresponding events occurred. The white arrows again call out the temporal shuffle necessary to put the data into their correct event-time locations. Image: Tyler Akidau.

Of course, powerful semantics rarely come for free, and event time windows are no exception. Event time windows have two notable drawbacks, both stemming from the fact that windows must often live longer (in processing time) than the actual length of the window itself: buffering, since longer window lifetimes mean more data must be kept around, and completeness, since we often have no good way of knowing when we’ve seen all the data for a given window.
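As a concrete illustration (not from the original text), the gap-based merging that Figure 11 depicts can be sketched in a few lines of Python; the events, gap size, and function name here are illustrative:

```python
from typing import List, Tuple

def assign_sessions(events: List[Tuple[str, int]], gap: int) -> List[Tuple[int, int]]:
    """Merge event timestamps into session windows.

    A session is a burst of activity: events whose event times are
    separated by less than `gap` belong to the same session.
    """
    if not events:
        return []
    # Sort by event time -- the "temporal shuffle" that puts data arriving
    # out of order back into their correct event-time locations.
    times = sorted(t for _, t in events)
    sessions = []
    start = end = times[0]
    for t in times[1:]:
        if t - end < gap:
            end = t          # within the gap: extend the current session
        else:
            sessions.append((start, end))
            start = end = t  # gap exceeded: start a new session
    sessions.append((start, end))
    return sessions

# Events may arrive in any processing-time order; sessions are keyed
# purely to when the events occurred.
events = [("score", 13), ("score", 1), ("score", 2), ("score", 25), ("score", 12)]
print(assign_sessions(events, gap=5))  # [(1, 2), (12, 13), (25, 25)]
```

Note that because sessions are defined by event time, a late-arriving event can extend (or even merge) sessions, which is exactly why such windows must outlive their own event-time extent.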
Conclusion

Whew! That was a lot of information. To those of you who have made it this far: you are to be commended! At this point we are roughly halfway through the material I want to cover, so it’s probably reasonable to step back, recap what I’ve covered so far, and let things settle a bit before diving into Part 2. The upside of all this is that Part 1 is the boring post; Part 2 is where the fun really begins.

Recap

To summarize, in this post I’ve clarified terminology (in particular, narrowing “streaming” to describe execution engines rather than the unbounded data they often process), argued that well-designed streaming systems are a strict superset of batch systems, examined the differences between event time and processing time, and looked at the common approaches for processing bounded and unbounded data with both batch and streaming engines.
Next time

This post provides the context necessary for the concrete examples I’ll be exploring in Part 2. That post will walk through the core concepts of the Dataflow Model as applied to real use cases: what results are calculated, where in event time they are calculated, when in processing time they are materialized, and how earlier results relate to later refinements.
Should be a good time. See you then!

[1] One which I propose is not an inherent limitation of streaming systems, but simply a consequence of design choices made in most streaming systems thus far. The efficiency delta between batch and streaming is largely the result of the increased bundling and more efficient shuffle transports found in batch systems. Modern batch systems go to great lengths to implement sophisticated optimizations that allow for remarkable levels of throughput using surprisingly modest compute resources. There’s no reason the types of clever insights that make batch systems the efficiency heavyweights they are today couldn’t be incorporated into a system designed for unbounded data, providing users flexible choice between what we typically consider to be high-latency, higher-efficiency “batch” processing and low-latency, lower-efficiency “streaming” processing. This is effectively what we’ve done with Cloud Dataflow by providing both batch and streaming runners under the same unified model. In our case, we use separate runners because we happen to have two independently designed systems optimized for their specific use cases. Long-term, from an engineering perspective, I’d love to see us merge the two into a single system that incorporates the best parts of both, while still maintaining the flexibility of choosing an appropriate efficiency level. But that’s not what we have today. And honestly, thanks to the unified Dataflow Model, it’s not even strictly necessary, so it may well never happen.

[2] If you poke around enough in the academic literature or SQL-based streaming systems, you’ll also come across a third windowing time domain: tuple-based windowing (i.e., windows whose sizes are counted in numbers of elements). However, tuple-based windowing is essentially a form of processing-time windowing where elements are assigned monotonically increasing timestamps as they arrive at the system.
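That equivalence can be sketched in a few lines of Python (the function name and window size here are illustrative): each element receives a monotonically increasing arrival “timestamp,” and fixed windows are cut over that timestamp.

```python
from typing import Iterable, List

def tuple_windows(stream: Iterable, size: int) -> List[list]:
    """Tuple-based windowing recast as processing-time windowing:
    each element's arrival index serves as a monotonically increasing
    timestamp, and fixed windows of `size` are cut over that index."""
    windows: dict = {}
    for arrival_ts, element in enumerate(stream):
        windows.setdefault(arrival_ts // size, []).append(element)
    return [windows[k] for k in sorted(windows)]

# Windows of three elements each, regardless of any event-time notion.
print(tuple_windows("abcdefg", 3))  # [['a', 'b', 'c'], ['d', 'e', 'f'], ['g']]
```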
As such, we won’t discuss tuple-based windowing in detail here (though we will see an example of it in Part 2).

Article image: Three women wading in a stream gathering leeches (source: Wellcome Library, London).
Tyler Akidau


