
The World Beyond Batch: Streaming 101

A high-level tour of modern data-processing concepts.

By Tyler Akidau, August 5, 2015

[Header image: three women wading in a stream gathering leeches (source: Wellcome Library, London).]

Editor's note: This is the first…

Next up, let’s talk a bit about what streaming systems can and can’t do, with an emphasis on can; one of the biggest things I want to get across in these posts is just how capable a well-designed streaming system can be. Streaming systems have long been relegated to a somewhat niche market of providing low-latency, inaccurate/speculative results, often in conjunction with a more capable batch system to provide eventually correct results, i.e., the Lambda Architecture.

For those of you not already familiar with the Lambda Architecture, the basic idea is that you run a streaming system alongside a batch system, both performing essentially the same calculation. The streaming system gives you low-latency, inaccurate results (either because of the use of an approximation algorithm, or because the streaming system itself does not provide correctness), and some time later a batch system rolls along and provides you with correct output. Originally proposed by Twitter’s Nathan Marz (creator of Storm), it ended up being quite successful because it was, in fact, a fantastic idea for the time; streaming engines were a bit of a letdown in the correctness department, and batch engines were as inherently unwieldy as you’d expect, so Lambda gave you a way to have your proverbial cake and eat it, too. Unfortunately, maintaining a Lambda system is a hassle: you need to build, provision, and maintain two independent versions of your pipeline, and then also somehow merge the results from the two pipelines at the end.

As someone who has spent years working on a strongly-consistent streaming engine, I also found the entire principle of the Lambda Architecture a bit unsavory. Unsurprisingly, I was a huge fan of Jay Kreps’ Questioning the Lambda Architecture post when it came out. Here was one of the first highly visible statements against the necessity of dual-mode execution; delightful. Kreps addressed the issue of repeatability in the context of using a replayable system like Kafka as the streaming interconnect, and went so far as to propose the Kappa Architecture, which basically means running a single pipeline using a well-designed system that’s appropriately built for the job at hand. I’m not convinced that notion itself requires a name, but I fully support the idea in principle.


Quite honestly, I’d take things a step further. I would argue that well-designed streaming systems actually provide a strict superset of batch functionality. Modulo perhaps an efficiency delta[1], there should be no need for batch systems as they exist today. And kudos to the Flink folks for taking this idea to heart and building a system that’s all-streaming-all-the-time under the covers, even in “batch” mode; I love it.

The corollary of all this is that broad maturation of streaming systems combined with robust frameworks for unbounded data processing will, in time, allow the relegation of the Lambda Architecture to the antiquity of big data history where it belongs. I believe the time has come to make this a reality. Because to do so, i.e., to beat batch at its own game, you really only need two things:

  1. Correctness — This gets you parity with batch.

    At the core, correctness boils down to consistent storage. Streaming systems need a method for checkpointing persistent state over time (something Kreps has talked about in his Why local state is a fundamental primitive in stream processing post), and it must be well-designed enough to remain consistent in light of machine failures. When Spark Streaming first appeared in the public big data scene a few years ago, it was a beacon of consistency in an otherwise dark streaming world. Thankfully, things have improved somewhat since then, but it is remarkable how many streaming systems still try to get by without strong consistency; I seriously cannot believe that at-most-once processing is still a thing, but it is.

    To reiterate, because this point is important: strong consistency is required for exactly-once processing, which is required for correctness, which is a requirement for any system that’s going to have a chance at meeting or exceeding the capabilities of batch systems. Unless you just truly don’t care about your results, I implore you to shun any streaming system that doesn’t provide strongly consistent state. Batch systems don’t require you to verify ahead of time whether they are capable of producing correct answers; don’t waste your time on streaming systems that can’t meet that same bar.

    If you’re curious to learn more about what it takes to get strong consistency in a streaming system, I recommend you check out the MillWheel and Spark Streaming papers. Both papers spend a significant amount of time discussing consistency. Given the amount of quality information on this topic in the literature and elsewhere, I won’t be covering it any further in these posts.

  2. Tools for reasoning about time — This gets you beyond batch.

    Good tools for reasoning about time are essential for dealing with unbounded, unordered data of varying event-time skew. An increasing number of modern data sets exhibit these characteristics, and existing batch systems (as well as most streaming systems) lack the necessary tools to cope with the difficulties they impose. I will spend the remainder of this post, and the bulk of the next post, explaining and focusing on this point.

    To begin with, we’ll get a basic understanding of the important concept of time domains, after which we’ll take a deeper look at what I mean by unbounded, unordered data of varying event-time skew. We’ll then spend the rest of this post looking at common approaches to bounded and unbounded data processing, using both batch and streaming systems.
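To make the consistency point in item 1 concrete, here is a minimal sketch of my own (a toy illustration, not an API from MillWheel or Spark Streaming): persist the state atomically together with the input position, so that after a failure, recovery replays input without double-counting, giving effectively exactly-once updates to state.

```python
# Toy checkpointed stream state: state + input offset are persisted together,
# atomically, so replay after a crash never double-counts an event.
import json
import os
import tempfile


class CheckpointedCounter:
    def __init__(self, path):
        self.path = path
        self.offset = 0    # next input position to process
        self.counts = {}   # persistent per-key state
        if os.path.exists(path):  # recover the last consistent snapshot
            with open(path) as f:
                snap = json.load(f)
            self.offset, self.counts = snap["offset"], snap["counts"]

    def process(self, records):
        # Skip records already reflected in the checkpointed state, so a
        # replay after failure does not double-count them.
        for pos, key in enumerate(records):
            if pos < self.offset:
                continue
            self.counts[key] = self.counts.get(key, 0) + 1
        self.offset = max(self.offset, len(records))
        self._checkpoint()

    def _checkpoint(self):
        # Write state and offset atomically: a crash leaves either the old
        # or the new snapshot on disk, never a torn mix of the two.
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(self.path) or ".")
        with os.fdopen(fd, "w") as f:
            json.dump({"offset": self.offset, "counts": self.counts}, f)
        os.replace(tmp, self.path)
```

The key design point is that state and position move forward together in one atomic step; checkpointing one without the other is exactly how at-most-once and at-least-once systems lose or duplicate updates.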

Event time vs. processing time


To speak cogently about unbounded data processing requires a clear understanding of the domains of time involved. Within any data processing system, there are typically two domains of time we care about:

  • Event time, which is the time at which events actually occurred.
  • Processing time, which is the time at which events are observed in the system.
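As a toy illustration of the two domains (my own sketch, not code from this post): each event carries an event time stamped by its source, while processing time is simply whatever the system clock reads when the pipeline observes the event.

```python
# Event time travels with the record; processing time is read off the clock
# at the moment of observation. Their difference is the skew.
import time
from dataclasses import dataclass


@dataclass
class Event:
    key: str
    event_time: float  # when the event actually occurred, stamped by the source


def observe(event: Event) -> float:
    """Return the event-time vs. processing-time skew at observation."""
    processing_time = time.time()  # when the system sees the event
    return processing_time - event.event_time


# An event generated an hour ago (say, on a phone that was offline) arrives late:
late = Event("page_view", event_time=time.time() - 3600)
skew = observe(late)  # roughly 3600 seconds of skew
```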

Not all use cases care about event times (and if yours doesn’t, hooray! — your life is easier), but many do. Examples include characterizing user behavior over time, most billing applications, and many types of anomaly detection, to name a few.

In an ideal world, event time and processing time would always be equal, with events being processed immediately as they occur. Reality is not so kind, however, and the skew between event time and processing time is not only non-zero, but often a highly variable function of the characteristics of the underlying input sources, execution engine, and hardware. Things that can affect the level of skew include:

  • Shared resource limitations, such as network congestion, network partitions, or shared CPU in a non-dedicated environment.
  • Software causes, such as distributed system logic, contention, etc.
  • Features of the data themselves, including key distribution, variance in throughput, or variance in disorder (e.g., a plane full of people taking their phones out of airplane mode after having used them offline for the entire flight).

As a result, if you plot the progress of event time and processing time in any real-world system, you typically end up with something that looks a bit like the red line in Figure 1.
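A small simulation (illustrative only; the delay range and seed are my own choices) of why that line wanders rather than tracking the ideal: each event’s processing time is its event time plus a variable delay, so the skew changes from event to event, and events may even be observed out of event-time order.

```python
# Simulate variable skew: events occur at times 0..9, but each is delayed by
# a different amount before the system observes it.
import random

random.seed(7)  # deterministic for the example

# (event_time, processing_time) pairs with a variable per-event delay
events = [(t, t + random.uniform(0.1, 5.0)) for t in range(10)]

# The system sees events in processing-time order, which may no longer
# match the order in which they actually occurred:
observed = sorted(events, key=lambda pair: pair[1])

# Per-event skew; in the ideal world this would be a constant zero.
skews = [p - e for e, p in observed]
```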
