Wednesday, 27 January 2016

Introduction to Netezza


Introduction
Success in any enterprise depends on having the best available information in time to make sound decisions. Anything less wastes opportunities, costs time and resources and can even put the organization at risk. But finding crucial information to guide the best possible actions can mean analyzing billions of data points and petabytes of data, whether to predict an outcome, identify a trend or chart the best course through a sea of ambiguity. Companies that can get this type of intelligence on demand are able to react faster and make better decisions than their competitors.

Continuing innovations in analytics can provide companies with an intelligence windfall that benefits all areas of the business. But when people need critical information urgently, the platform that delivers it should be the last thing on their minds. It should be as simple, reliable and immediate as a light switch, able to handle almost incomprehensible data volumes and workloads without complexity getting in the way. It must also be built for longevity, with a technology foundation able to sustain performance as more users run increasingly complex workloads and data volumes continue to grow relentlessly, while offering the lowest total cost of ownership.
Extreme performance with appliance simplicity
The IBM® Netezza® data warehouse appliance transforms the data warehouse and analytics landscape with a platform built to deliver extreme, industry-leading price-performance with appliance simplicity for years to come. It’s a new frontier in advanced analytics, with the ability to carry out monumental processing challenges with blazing speed, without barriers or compromises. For users and their organizations, it means the best intelligence to all who need it – even as demands escalate from all directions.
The IBM Netezza data warehouse appliance with analytics has a revolutionary design based on principles that have allowed IBM to provide the best price-performance in the market. As a purpose-built appliance for high speed analytics, its power comes not from the most powerful and expensive components but from how the right components are assembled and work together to maximize performance. Massively parallel processing (MPP) combines multi-core CPUs with IBM’s unique FPGA Accelerated Streaming Technology (FAST) engines to deliver performance that much more expensive systems cannot match or even approach. And as an easy-to-use appliance, the system delivers its phenomenal results out of the box, with no indexing or tuning required. Appliance simplicity extends to application development, enabling rapid innovation and the ability to bring high performance analytics to the widest range of users and processes.
This paper introduces IBM’s Asymmetric Massively Parallel Processing (AMPP) architecture, and describes how the system orchestrates queries and analytics to achieve its unprecedented speed. We’ll see how IBM Netezza data warehouse appliance software and hardware come together to extract the maximum utilization from every critical component, and how a system optimized for tens of thousands of users querying huge data volumes really works. It’s a unique data warehouse and analytics platform with unparalleled price-performance, ready for today’s needs and tomorrow’s challenges.
Architectural principles
The IBM Netezza data warehouse appliance integrates database, processing and storage in a compact system optimized for analytical processing and designed for flexible growth. The system architecture is based on the following core tenets that have been a hallmark of IBM’s price-performance leadership in the industry:
Processing close to the data source
IBM Netezza data warehouse appliance architecture is based on a fundamental computer science principle: when operating on large data sets, do not move data unless you absolutely have to. The IBM architecture fully exploits this principle by utilizing commodity components called Field Programmable Gate Arrays (or FPGAs) to filter out extraneous data as early in the data stream as possible, as fast as data can be streamed off the disk. This process of data elimination close to the data source removes I/O bottlenecks and frees up downstream components such as the CPU, memory and network from processing superfluous data, thus having a significant multiplier effect on system performance.
Balanced, massively parallel architecture
The IBM Netezza data warehouse appliance architecture combines the best elements of symmetric multiprocessing (SMP) and MPP to create a purpose-built appliance for running blazing fast analytics on petabytes of data. Every component of the architecture, including the processor, FPGA, memory and network, is carefully selected and optimized to service data as fast as the physics of the disk allows, while minimizing cost and power consumption. The IBM Netezza data warehouse appliance software orchestrates these components to operate concurrently on the data stream in a pipeline fashion, thus maximizing utilization and extracting the utmost throughput from each MPP node. In addition to raw performance, this balanced architecture delivers linear scalability to more than a thousand processing streams executing in parallel, while offering a very economical total cost of ownership.
Platform for advanced analytics
The principles of MPP and data processing close to the source are equally applicable to advanced analytics on large data sets. IBM Netezza data warehouse appliances allow complex non-SQL algorithms to be easily embedded in the processing elements of their MPP streams without the typical intricacies of parallel or grid programming. The ability to run analytics of any complexity “on stream” against huge data volumes eliminates the delays and costs of moving data to separate hardware. It also accelerates performance by orders of magnitude, making IBM Netezza data warehouse appliances the ideal platform for the convergence of data warehousing and advanced analytics.
Appliance simplicity
The architecture of IBM Netezza data warehouse appliances automates and streamlines day-to-day operations, shielding users from the underlying complexity of the platform. Simplicity rules whenever there is a design tradeoff with any other aspect of the appliance. Unlike other solutions, it just runs – handling demanding queries and mixed workloads with blistering speed, without the tuning required by other systems. Even normally time-consuming tasks such as installation, upgrades, and ensuring high-availability and business continuity are vastly simplified, saving precious time and resources.
Accelerated innovation and performance improvements
One of the key goals of the IBM Netezza data warehouse appliance architecture is to deliver price-performance improvements and innovative functionality faster than competing technologies over the long run. While the use of open, blade-based components allows IBM Netezza data warehouse appliances to incorporate technology enhancements very quickly, the turbocharger effect of the FPGA, a balanced hardware configuration, and tightly coupled intelligent software combine to deliver overall performance gains far greater than those of individual elements. In fact, IBM Netezza data warehouse appliances have delivered more than 4X performance improvement every two years (double that of Moore’s Law1) since their introduction, far outpacing other well-established vendors.2
Flexible configurations and extreme scalability
IBM Netezza data warehouse appliances scale modularly from a few hundred gigabytes to tens of petabytes of queryable user data. The system architecture is highly adaptable to serve the needs of different segments of the data warehouse and analytics market. The use of open blade-based components allows the disk-processor-memory ratio to be easily modified in configurations that cater to performance- or storage-centric requirements. The same architecture also supports memory-based systems that provide extremely fast, real-time analytics for mission-critical applications.
The following pages examine how IBM puts these principles into practice.
System building blocks
A major part of IBM Netezza data warehouse appliance’s performance advantage comes from its unique AMPP architecture, which combines an SMP front-end with a shared-nothing MPP back-end for query processing. Each component of the architecture is carefully chosen and integrated to yield a balanced overall system. Every processing element operates on multiple data streams, filtering out extraneous data as early as possible. More than a thousand of these customized MPP streams work together to “divide and conquer” the workload.
AMPP Architecture

Let’s examine the key building blocks of the appliance:
IBM hosts: The SMP hosts are high-performance IBM servers running Linux that are set up in an active-passive configuration for high-availability. The active host presents a standardized interface to external tools and applications. It compiles SQL queries into executable code segments called snippets, creates optimized query plans and distributes the snippets to the MPP nodes for execution.
Snippet Blades (S-Blades): S-Blades are intelligent processing nodes that make up the turbocharged MPP engine of the appliance. Each S-Blade is an independent server that contains powerful multi-core CPUs, multi-engine FPGAs and gigabytes of RAM, all balanced and working concurrently to deliver peak performance. The CPU cores are designed with ample headroom to run complex algorithms against large data volumes for advanced analytics applications.
Disk enclosures: The disk enclosures contain high-density, high-performance IBM storage disks that are RAID protected. Each disk contains a slice of the data in a database table. The disk enclosures are connected to the S-Blades via high-speed interconnects that allow all the disks in the appliance to simultaneously stream data to the S-Blades at the maximum rate possible.
Network fabric: All system components are connected via a high-speed network fabric. The appliance runs a customized IP-based protocol that fully utilizes the total cross-sectional bandwidth of the fabric and eliminates congestion even under sustained, bursty network traffic. The network is optimized to scale to more than a thousand nodes, while allowing each node to initiate large data transfers to every other node simultaneously.
Note: All system components are redundant. While the hosts are active-passive, all other components in the appliance are hot-swappable. User data is fully mirrored, enabling better than 99.99 percent availability.
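
The division of labor described above (a host that hands the same snippet to every node, and nodes that each work on their own slice of a table) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the appliance's actual implementation: the hash-based placement of rows, the slice count, the thread pool and every function name here (slice_for, distribute, run_snippet, host_query) are hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor
from zlib import crc32

NUM_SLICES = 4  # hypothetical; a real appliance has far more data slices


def slice_for(key: str, num_slices: int = NUM_SLICES) -> int:
    """Pick a data slice for a row. Hashing a key is an assumption of this sketch."""
    return crc32(key.encode("utf-8")) % num_slices


def distribute(rows, key_field, num_slices=NUM_SLICES):
    """Spread table rows across slices so each 'disk' holds part of the table."""
    slices = [[] for _ in range(num_slices)]
    for row in rows:
        slices[slice_for(str(row[key_field]), num_slices)].append(row)
    return slices


def run_snippet(data_slice, min_amount):
    """The 'snippet' each node runs locally: filter, then partially aggregate."""
    matching = [r["amount"] for r in data_slice if r["amount"] >= min_amount]
    return sum(matching), len(matching)


def host_query(slices, min_amount):
    """The 'host': send the same snippet to every slice, then merge partial results."""
    with ThreadPoolExecutor(max_workers=len(slices)) as pool:
        partials = list(pool.map(lambda s: run_snippet(s, min_amount), slices))
    total = sum(p[0] for p in partials)
    count = sum(p[1] for p in partials)
    return total, count


rows = [{"customer": f"c{i}", "amount": i * 10} for i in range(1, 11)]
slices = distribute(rows, "customer")
total, count = host_query(slices, min_amount=50)
print(total, count)
```

Only the small partial results travel back to the host; the rows themselves never leave the slice that stores them, which is the shared-nothing property the text describes.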

Where extreme performance happens – inside an S-Blade
A Snippet Processor (one of many): Commodity components and IBM Netezza data warehouse appliance software combine to extract the utmost throughput from each MPP node. A dedicated high-speed interconnect from the storage array allows data to be delivered to memory as quickly as it can stream off the disk. Compressed data is cached in memory using a smart algorithm, which ensures that the most commonly accessed data is served right out of memory instead of requiring a disk access. FAST Engines running in parallel inside the FPGAs uncompress and filter out 95-98 percent of table data at physics speed, keeping only the data that is relevant to answer the query. The remaining data in the stream is processed concurrently by CPU cores, also running in parallel. The process is repeated on more than a thousand of these parallel Snippet Processors running in an IBM Netezza data warehouse appliance. The result is performance that exceeds much more expensive systems by orders of magnitude.
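
To put the 95-98 percent figure in perspective, the short calculation below shows how much data would be left for the CPU cores after the FPGA stage. The 1,000 GB table size and the function name gb_reaching_cpu are hypothetical, chosen only to make the arithmetic concrete.

```python
def gb_reaching_cpu(table_gb: int, filtered_percent: int) -> float:
    """Data left for the CPU cores after the FPGA stage drops filtered_percent of it."""
    return table_gb * (100 - filtered_percent) / 100


TABLE_GB = 1000  # hypothetical table size, in gigabytes
for pct in (95, 98):
    remaining = gb_reaching_cpu(TABLE_GB, pct)
    print(f"{pct}% filtered: {remaining:.0f} GB reach the CPU cores")
```

Under these assumed numbers, the CPU, memory and network handle between 20 and 50 GB instead of the full 1,000 GB.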
IBM’s massively parallel processing architecture

Turbocharging the S-Blades: The power of IBM FAST engines
The FPGA is a critical enabler of the price-performance advantages of IBM Netezza data warehouse appliances. Each FPGA contains embedded engines that perform filtering and transformation functions on the data stream. These FAST engines are dynamically reconfigurable, allowing them to be modified or extended through software. They are customized for every snippet through parameters provided during query execution and act on the data stream delivered by a Direct Memory Access (DMA) module at extremely high speed.
FAST engines include:
The Compress engine, an IBM Netezza data warehouse appliance innovation that boosts system performance by a factor of 4-8X.3 The engine uncompresses data at wire speed, instantly transforming each block on disk into 4-8 blocks in memory. The result is a significant speedup of the slowest component in any data warehouse – the disk.
The Project and Restrict engines, which further enhance performance by filtering out columns and rows respectively, based on the parameters in the SELECT and WHERE clauses of an SQL query.
The Visibility engine, which plays a critical role in maintaining ACID (Atomicity, Consistency, Isolation and Durability) compliance at streaming speeds. It filters out rows that should not be “seen” by a query, e.g. rows belonging to a transaction that is not yet committed.
The IBM FAST engines also provide an extensible framework for innovative new functions to be added in the future through enhancements to IBM Netezza data warehouse appliance software. These new functions promise to boost system performance, security and reliability even further.
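
The roles of the engines listed above can be mimicked in software as a chain of streaming stages. The sketch below is a rough Python analogy, not how the FPGA logic is actually built: zlib and JSON stand in for the on-disk block format, a set of committed transaction IDs stands in for the real visibility rules, and all function names are hypothetical.

```python
import json
import zlib


def compress_block(rows):
    """Stand-in for a compressed on-disk block (zlib + JSON are assumptions)."""
    return zlib.compress(json.dumps(rows).encode("utf-8"))


def uncompress_engine(blocks):
    """Expand each block back into rows as it streams past."""
    for block in blocks:
        yield from json.loads(zlib.decompress(block).decode("utf-8"))


def visibility_engine(rows, committed_txns):
    """Drop rows written by transactions that are not yet committed."""
    return (r for r in rows if r["txn"] in committed_txns)


def restrict_engine(rows, predicate):
    """Drop rows that fail the WHERE-clause predicate."""
    return (r for r in rows if predicate(r))


def project_engine(rows, columns):
    """Keep only the columns named in the SELECT list."""
    return ({c: r[c] for c in columns} for r in rows)


blocks = [
    compress_block([
        {"id": 1, "region": "EU", "amount": 120, "txn": 7},
        {"id": 2, "region": "US", "amount": 80, "txn": 7},
    ]),
    compress_block([
        {"id": 3, "region": "EU", "amount": 300, "txn": 9},  # txn 9 is uncommitted
        {"id": 4, "region": "EU", "amount": 40, "txn": 8},
    ]),
]

# Roughly: SELECT id, amount FROM t WHERE region = 'EU'
stream = uncompress_engine(blocks)
stream = visibility_engine(stream, committed_txns={7, 8})
stream = restrict_engine(stream, lambda r: r["region"] == "EU")
result = list(project_engine(stream, ["id", "amount"]))
print(result)
```

Because every stage is a generator, rows flow through one at a time and discarded rows never accumulate in memory, loosely mirroring the idea of filtering the stream before it reaches the CPU.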


The IBM FAST engine

Orchestrating queries on the IBM Netezza data warehouse appliance
IBM Netezza data warehouse appliance hardware components and intelligent system software are closely intertwined. The software is designed to fully exploit the hardware capabilities of the appliance and incorporates numerous innovations to offer exponential performance gains, whether for simple inquiries, complex ad-hoc queries or deep analytics. In this section, we’ll examine the intelligence built into the system every step of the way.

Software Architecture



IBM Netezza data warehouse appliance software components include:
A sophisticated parallel optimizer that transforms queries to run more efficiently and ensures that each component in every processing node is fully utilized
An intelligent scheduler that keeps the system running at its peak throughput, regardless of workload
Turbocharged Snippet Processors that efficiently execute multiple queries and complex analytics functions concurrently
A smart network that makes moving large amounts of data through the IBM Netezza data warehouse appliance a breeze

Make an optimized query plan…
When a user submits a query, the host compiles it and creates a query execution plan optimized for IBM’s AMPP architecture. The intelligence of the optimizer is one of the system’s greatest strengths. The optimizer makes use of all the MPP nodes in the system to gather detailed, up-to-date statistics on every database table referenced in a query. A majority of these metrics are captured during query execution with very low overhead, yielding just-in-time statistics that are individualized per query. The appliance nature of an IBM Netezza data warehouse, with integrated components able to communicate with each other, allows the cost-based optimizer to more accurately measure disk, processing and network costs associated with an operation. By relying on accurate data rather than heuristics alone, the optimizer is able to generate query plans that utilize all components with extreme efficiency.

Optimization intelligence: calculating join order
One example of optimization intelligence is the ability to determine the best join order in a complex join. For example, when joining multiple small tables to a large fact table, the optimizer can choose to broadcast the small tables in their entirety to each S-Blade, while keeping the large table distributed across all Snippet Processors. The approach minimizes data movement while taking advantage of the AMPP architecture to parallelize the join.
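
The broadcast strategy can be illustrated with a small Python sketch. It is a simplified model under stated assumptions rather than the optimizer's real plan: the table contents, column names and the broadcast_join helper are all hypothetical, and the per-slice loop stands in for work that would run in parallel on separate nodes.

```python
def broadcast_join(fact_slices, dim_rows, fact_key, dim_key):
    """Join a distributed fact table to a small table copied to every slice."""
    results = []
    for fact_slice in fact_slices:
        # "Broadcast": each slice builds its own full copy of the small table.
        lookup = {d[dim_key]: d for d in dim_rows}
        # The fact rows are joined where they already live; none are moved.
        joined = [
            {**f, **lookup[f[fact_key]]}
            for f in fact_slice
            if f[fact_key] in lookup
        ]
        results.append(joined)
    return results


# Hypothetical large fact table, already spread over two slices.
fact_slices = [
    [{"sale_id": 1, "store_id": 10, "amount": 5},
     {"sale_id": 2, "store_id": 20, "amount": 7}],
    [{"sale_id": 3, "store_id": 10, "amount": 9},
     {"sale_id": 4, "store_id": 99, "amount": 1}],  # no matching store
]
# Hypothetical small table, cheap to copy everywhere.
stores = [{"store_id": 10, "city": "Oslo"}, {"store_id": 20, "city": "Lima"}]

per_slice = broadcast_join(fact_slices, stores, "store_id", "store_id")
joined = [row for part in per_slice for row in part]
rows_moved = len(stores) * len(fact_slices)  # only small-table rows travel
print(joined, rows_moved)
```

In this toy case four small-table rows are copied in total, while the fact rows stay put. The alternative of redistributing the fact table would move rows in proportion to its size, which is why broadcasting tends to win when one side of the join is small.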

