Success in any enterprise depends on having
the best available information in time to make sound decisions. Anything less
wastes opportunities, costs time and resources and can even put the
organization at risk. But finding crucial information to guide the best
possible actions can mean analyzing billions of data points and petabytes of
data, whether to predict an outcome, identify a trend or chart the best course
through a sea of ambiguity. Companies that can get this type of intelligence on
demand are able to react faster and make better decisions than their
competitors.
Continuing innovations in analytics can provide companies with an intelligence windfall that benefits all areas of the business. But when people need critical information urgently, the platform that delivers it should be the last thing on their minds. It should be as simple, reliable and immediate as a light switch, able to handle almost incomprehensible data volumes and workloads without complexity getting in the way. It must also be built for longevity, with a technology foundation able to sustain performance as more users run increasingly complex workloads and data volumes continue to grow relentlessly, while offering the lowest total cost of ownership.
Extreme performance with appliance simplicity
The IBM® Netezza® data warehouse appliance transforms the data
warehouse and analytics landscape with a platform built to deliver extreme,
industry-leading price-performance with appliance simplicity for years to come.
It’s a new frontier in advanced analytics, with the ability to take on
monumental processing challenges with blazing speed, without barriers or
compromises. For users and their organizations, it means the best intelligence
for all who need it – even as demands escalate from all directions.
The IBM Netezza data warehouse appliance with
analytics has a revolutionary design based on principles that have allowed IBM
to provide the best price-performance in the market. As a purpose-built
appliance for high speed analytics, its power comes not from the most powerful
and expensive components but from how the right components are assembled and
work together to maximize performance. Massively parallel processing (MPP)
combines multi-core CPUs with IBM’s unique FPGA Accelerated Streaming
Technology (FAST™) engines to deliver performance that much more expensive systems
cannot match or even approach. And as an easy-to-use appliance, the system
delivers its phenomenal results out of the box, with no indexing or tuning
required. Appliance simplicity extends to application development, enabling
rapid innovation and the ability to bring high performance analytics to the
widest range of users and processes.
This paper introduces IBM’s Asymmetric
Massively Parallel Processing™ (AMPP™) architecture, and describes how the system
orchestrates queries and analytics to achieve its unprecedented speed. We’ll
see how IBM Netezza data warehouse appliance software and hardware come
together to extract the maximum utilization from every critical component, and
how a system optimized for tens of thousands of users querying huge data
volumes really works. It’s a unique data warehouse and analytics platform with
unparalleled price-performance, ready for today’s needs and tomorrow’s
challenges.
Architectural principles
The IBM Netezza data warehouse appliance
integrates database, processing and storage in a compact system optimized for
analytical processing and designed for flexible growth. The system architecture
is based on the following core tenets that have been a hallmark of IBM’s
price-performance leadership in the industry:
Processing close to the data source
IBM Netezza data warehouse appliance
architecture is based on a fundamental computer science principle: when
operating on large data sets, do not move data unless you absolutely have to.
The IBM architecture fully exploits this principle by utilizing commodity
components called Field Programmable Gate Arrays (or FPGAs) to filter out
extraneous data as early in the data stream as possible, as fast as data can be
streamed off the disk. This process of data elimination close to the data
source removes I/O bottlenecks and frees up downstream components such as the
CPU, memory and network from processing superfluous data, thus having a
significant multiplier effect on system performance.
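The effect of filtering near the source can be illustrated with a small sketch. The Python below is not appliance code: the toy table, the `read_disk` and `filter_at_source` helpers and the counters are hypothetical names, used only to count how many rows would move downstream when the predicate is applied as the data streams off storage.

```python
from typing import Callable, Dict, Iterable, Iterator

Row = Dict[str, object]


def read_disk(rows: Iterable[Row], counter: Dict[str, int]) -> Iterator[Row]:
    """Stream rows off 'disk', counting every row that leaves storage."""
    for row in rows:
        counter["read"] += 1
        yield row


def filter_at_source(
    rows: Iterable[Row],
    predicate: Callable[[Row], bool],
    counter: Dict[str, int],
) -> Iterator[Row]:
    """Drop non-matching rows before they reach CPU, memory or network."""
    for row in rows:
        if predicate(row):
            counter["moved_downstream"] += 1
            yield row


# Toy table: 1 row in 20 belongs to the region the query asks for.
table = [{"id": i, "region": "EU" if i % 20 == 0 else "US"} for i in range(1000)]
counter = {"read": 0, "moved_downstream": 0}

matches = list(
    filter_at_source(read_disk(table, counter), lambda r: r["region"] == "EU", counter)
)
print(counter)
```

In this toy run only 50 of the 1,000 rows read leave the storage layer; the other 950 never consume CPU, memory or network capacity.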
Balanced, massively parallel architecture
The IBM Netezza data warehouse appliance
architecture combines the best elements of Symmetric multi-processing (SMP) and
MPP to create a purpose-built appliance for running blazing fast analytics on
petabytes of data. Every component of the architecture, including the
processor, FPGA, memory and network, is carefully selected and optimized to
service data as fast as the physics of the disk allows, while minimizing cost
and power consumption. The IBM Netezza data warehouse appliance software
orchestrates these components to operate concurrently on the data stream in a
pipeline fashion, thus maximizing utilization and extracting the utmost
throughput from each MPP node. In addition to raw performance, this balanced
architecture delivers linear scalability to more than a thousand processing
streams executing in parallel, while offering a very economical total cost of
ownership.
Platform for advanced analytics
The principles of MPP and data processing
close to the source are equally applicable to advanced analytics on large data
sets. IBM Netezza data warehouse appliances allow complex non-SQL algorithms to
be easily embedded in the processing elements of its MPP streams without the
typical intricacies of parallel or grid programming. The ability to run
analytics of any complexity “on stream” against huge data volumes eliminates
the delays and costs of moving data to separate hardware. It also accelerates
performance by orders of magnitude, making IBM Netezza data warehouse
appliances the ideal platform for the convergence of data warehousing and
advanced analytics.
Appliance simplicity
The architecture of IBM Netezza data warehouse appliances
automates and streamlines day-to-day operations, shielding users from the
underlying complexity of the platform. Simplicity rules whenever there is a
design tradeoff with any other aspect of the appliance. Unlike other solutions,
it just runs – handling demanding queries and mixed workloads with blistering
speed, without the tuning required by other systems. Even normally
time-consuming tasks such as installation, upgrades, and ensuring
high-availability and business continuity are vastly simplified, saving
precious time and resources.
Accelerated innovation and performance improvements
One of the key goals of the IBM Netezza data
warehouse appliance architecture is to deliver price-performance improvements
and innovative functionality faster than competing technologies over the long
run. While the use of open, blade-based components allows IBM Netezza data
warehouse appliances to incorporate technology enhancements very quickly, the
turbocharger effect of the FPGA, a balanced hardware configuration, and tightly
coupled intelligent software combine to deliver overall performance gains far
greater than those of individual elements. In fact, IBM Netezza data warehouse
appliances have delivered more than 4X performance improvement every two years
(double that of Moore’s Law¹) since their introduction, far outpacing other
well-established vendors.²
Flexible configurations and extreme scalability
IBM Netezza data warehouse appliances scale
modularly from a few hundred gigabytes to tens of petabytes of queryable user
data. The system architecture is highly adaptable to serve the needs of
different segments of the data warehouse and analytics market. The use of open
blade-based components allows the disk-processor-memory ratio to be easily
modified in configurations that cater to performance- or storage-centric
requirements. The same architecture also supports memory-based systems that provide
extremely fast, real-time analytics for mission-critical applications.
The following pages examine how IBM puts
these principles into practice.
System building blocks
A major part of IBM Netezza data warehouse
appliance’s performance advantage comes from its unique AMPP architecture,
which combines an SMP front-end with a shared-nothing MPP back-end for query
processing. Each component of the architecture is carefully chosen and
integrated to yield a balanced overall system. Every processing element operates
on multiple data streams, filtering out extraneous data as early as possible.
More than a thousand of these customized MPP streams work together to “divide
and conquer” the workload.
AMPP Architecture
Let’s examine the key building blocks of the
appliance:
• IBM hosts: The SMP hosts are
high-performance IBM servers running Linux that are set up in an active-passive
configuration for high-availability. The active host presents a standardized
interface to external tools and applications. It compiles SQL queries into
executable code segments called snippets, creates optimized query plans and
distributes the snippets to the MPP nodes for execution.
• Snippet Blades (S-Blades): S-Blades are intelligent processing nodes that make up the
turbocharged MPP engine of the appliance. Each S-Blade is an independent server
that contains powerful multi-core CPUs, multi-engine FPGAs and gigabytes of
RAM, all balanced and working concurrently to deliver peak performance. The CPU
cores are designed with ample headroom to run complex algorithms against large
data volumes for advanced analytics applications.
• Disk enclosures: The disk enclosures
contain high-density, high-performance IBM storage disks that are RAID
protected. Each disk contains a slice of the data in a database table. The disk
enclosures are connected to the S-Blades via high-speed interconnects that
allow all the disks in the appliance to simultaneously stream data to the
S-Blades at the maximum rate possible.
• Network fabric: All
system components are connected via a high-speed network fabric. The appliance runs a
customized IP-based protocol that fully utilizes the total cross-sectional
bandwidth of the fabric and eliminates congestion even under sustained, bursty
network traffic. The network is optimized to scale to more than a thousand
nodes, while allowing each node to initiate large data transfers to every other
node simultaneously.
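The idea that each disk holds a slice of every table can be sketched in a few lines. The text above does not say how rows are assigned to slices, so the hash-on-key rule, the `assign_slice` and `distribute` helpers and the slice count below are assumptions made purely for illustration.

```python
import zlib
from collections import defaultdict
from typing import Dict, List

NUM_SLICES = 8  # hypothetical; a real appliance has far more data slices


def assign_slice(key: str, num_slices: int = NUM_SLICES) -> int:
    """Map a distribution key to a slice with a stable hash (assumed scheme)."""
    return zlib.crc32(key.encode("utf-8")) % num_slices


def distribute(rows: List[dict], key_column: str) -> Dict[int, List[dict]]:
    """Split a table into per-disk slices based on one column."""
    slices: Dict[int, List[dict]] = defaultdict(list)
    for row in rows:
        slices[assign_slice(str(row[key_column]))].append(row)
    return slices


orders = [{"order_id": i, "amount": i * 1.5} for i in range(10_000)]
slices = distribute(orders, "order_id")
sizes = {slice_id: len(rows) for slice_id, rows in sorted(slices.items())}
print(sizes)
```

Because every slice holds part of the table, a scan can run on all disks at once, which is what allows the disks to stream data to the S-Blades simultaneously.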
Note: All system components
are redundant. While the hosts are active-passive, all other components in the
appliance are hot-swappable. User data is fully mirrored, enabling better than
99.99 percent availability.
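To put the availability figure in concrete terms, the short calculation below converts a percentage into the downtime it permits per year. It is plain arithmetic rather than a measurement of any system; `max_downtime_minutes` is a hypothetical helper name.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def max_downtime_minutes(availability_percent: float) -> float:
    """Minutes per year a system may be down at a given availability level."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


downtime = max_downtime_minutes(99.99)
print(f"{downtime:.2f} minutes per year")
```

At 99.99 percent, the allowance works out to roughly 52.6 minutes of downtime per year, so "better than" that figure means less than an hour annually.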
Where extreme performance happens – inside an S-Blade
A Snippet Processor (one of many): Commodity
components and IBM Netezza data warehouse appliance software combine to extract
the utmost throughput from each MPP node. A dedicated high-speed interconnect
from the storage array allows data to be delivered to memory as quickly as it
can stream off the disk. Compressed data is cached in memory using a smart
algorithm, which ensures that the most commonly accessed data is served right
out of memory instead of requiring a disk access. FAST Engines running in
parallel inside the FPGAs uncompress and filter out 95-98 percent of table data
at physics speed, keeping only the data that is relevant to answer the query.
The remaining data in the stream is processed concurrently by CPU cores, also
running in parallel. The process is repeated on more than a thousand of these
parallel Snippet Processors running in an IBM Netezza data warehouse appliance.
The result is performance that exceeds much more expensive systems by orders of
magnitude.
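A back-of-the-envelope calculation shows what the 95-98 percent filtering figure implies for the CPU cores. The 1 TB table size and the count of exactly 1,000 Snippet Processors are assumptions chosen for round numbers; `bytes_reaching_cpu` is a hypothetical helper.

```python
TB = 10**12  # decimal terabyte, in bytes
GB = 10**9


def bytes_reaching_cpu(table_bytes: float, filtered_fraction: float) -> float:
    """Bytes left in the stream after the FPGA has discarded a given fraction."""
    if not 0.0 <= filtered_fraction <= 1.0:
        raise ValueError("filtered_fraction must be between 0 and 1")
    return table_bytes * (1.0 - filtered_fraction)


table_bytes = 1 * TB          # assumed table size
snippet_processors = 1000     # assumption: "more than a thousand", rounded down

best_case = bytes_reaching_cpu(table_bytes, 0.98)
worst_case = bytes_reaching_cpu(table_bytes, 0.95)
scan_per_processor = table_bytes / snippet_processors

print(f"reaches CPUs: {best_case / GB:.0f}-{worst_case / GB:.0f} GB in total")
print(f"scanned per Snippet Processor: {scan_per_processor / GB:.0f} GB")
```

Under these assumptions, only about 20 to 50 GB of the 1 TB table reaches the CPU cores, and each Snippet Processor scans about 1 GB of it.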
IBM’s massively parallel processing architecture
Turbocharging the S-Blades: The power of IBM FAST engines
The FPGA is a critical enabler of the
price-performance advantages of IBM Netezza data warehouse appliances. Each
FPGA contains embedded engines that perform filtering and transformation
functions on the data stream. These FAST engines are dynamically reconfigurable,
allowing them to be modified or extended through software. They are customized
for every snippet through parameters provided during query execution and act on
the data stream delivered by a Direct Memory Access (DMA) module at extremely
high speed.
FAST engines include:
• The Compress engine, an IBM Netezza data warehouse appliance
innovation that boosts system performance by a factor of 4-8X.³ The engine uncompresses
data at wire speed, instantly transforming each block on disk into 4-8 blocks in
memory. The result is a significant speedup of the slowest component in any
data warehouse – the disk.
• The Project and Restrict engines, which further
enhance performance by filtering out columns and rows respectively, based on
the parameters in the SELECT and WHERE clauses of an SQL query.
• The Visibility engine, which plays a critical role in maintaining ACID
(Atomicity, Consistency, Isolation and Durability) compliance at streaming
speeds. It filters out rows that should not be “seen” by a query, e.g. rows
belonging to a transaction that is not yet committed.
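The four engines listed above can be pictured as stages of one streaming pipeline. The sketch below is a software stand-in for what the text describes as FPGA hardware. Several details are assumptions for illustration: zlib as the compression format, JSON rows, the `txn` column used for visibility, and the order in which the stages run.

```python
import json
import zlib
from typing import Callable, Dict, Iterable, Iterator, List, Set

Row = Dict[str, object]


def compress_block(rows: List[Row]) -> bytes:
    """Build one compressed 'disk block' of rows (test-data helper)."""
    return zlib.compress(json.dumps(rows).encode("utf-8"))


def compress_engine(blocks: Iterable[bytes]) -> Iterator[Row]:
    """Uncompress each block as it streams in and emit its rows."""
    for block in blocks:
        yield from json.loads(zlib.decompress(block).decode("utf-8"))


def visibility_engine(rows: Iterable[Row], committed: Set[int]) -> Iterator[Row]:
    """Hide rows written by transactions that have not committed."""
    return (row for row in rows if row["txn"] in committed)


def restrict_engine(rows: Iterable[Row], where: Callable[[Row], bool]) -> Iterator[Row]:
    """Keep only rows that satisfy the WHERE predicate."""
    return (row for row in rows if where(row))


def project_engine(rows: Iterable[Row], columns: List[str]) -> Iterator[Row]:
    """Keep only the columns named in the SELECT list."""
    return ({col: row[col] for col in columns} for row in rows)


rows = [
    {"id": 1, "region": "EU", "amount": 10, "txn": 100},
    {"id": 2, "region": "US", "amount": 20, "txn": 100},
    {"id": 3, "region": "EU", "amount": 30, "txn": 101},  # txn 101 is uncommitted
    {"id": 4, "region": "EU", "amount": 40, "txn": 100},
]
blocks = [compress_block(rows[:2]), compress_block(rows[2:])]

# Equivalent of: SELECT id, amount FROM t WHERE region = 'EU'
stream = compress_engine(blocks)
stream = visibility_engine(stream, committed={100})
stream = restrict_engine(stream, lambda r: r["region"] == "EU")
result = list(project_engine(stream, ["id", "amount"]))
print(result)
```

Each stage is a generator, so a row flows through all four stages without the full table ever being held in memory. Projection runs last here only because the earlier stages still need the `region` and `txn` columns.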
The IBM FAST engines also provide an
extensible framework for innovative new functions to be added in the future
through enhancements to IBM Netezza data warehouse appliance software. These
new functions promise to boost system performance, security and reliability
even further.
The IBM FAST engine
Orchestrating queries on the IBM Netezza data warehouse appliance
IBM Netezza data warehouse appliance hardware components and intelligent system
software are closely intertwined. The software is designed to fully exploit the
hardware capabilities of the appliance and incorporates numerous innovations to
offer exponential performance gains, whether for simple inquiries, complex
ad-hoc queries or deep analytics. In this section, we’ll examine the
intelligence built into the system every step of the way.
Software Architecture
IBM Netezza data warehouse appliance software
components include:
• A sophisticated parallel optimizer that transforms queries to run more
efficiently and ensures that each component in every processing node is fully
utilized
• An intelligent scheduler that keeps the system running at its peak
throughput, regardless of workload
• Turbocharged Snippet Processors that efficiently execute multiple queries and
complex analytics functions concurrently
• A smart network that makes moving large amounts of data through the IBM
Netezza data warehouse appliance a breeze
Make an optimized query plan…
When a user submits a query, the host compiles it and creates a query execution plan
optimized for IBM’s AMPP architecture. The intelligence of the IBM optimizer
is one of the system’s greatest strengths. The optimizer makes use of all
the MPP nodes in the system to gather detailed, up-to-date statistics on every
database table referenced in a query. A majority of these metrics are captured
during query execution with very low overhead, yielding just-in-time statistics
that are individualized per query. The appliance nature of an IBM Netezza data
warehouse, with integrated components able to communicate with each other,
allows the cost-based optimizer to more accurately measure disk, processing
and network costs associated with an operation. By relying on accurate data
rather than heuristics alone, the optimizer is able to generate query plans
that utilize all components with extreme efficiency.
Optimization intelligence: calculating join order
One example of optimization
intelligence is the ability to determine the best join order in a complex join.
For example, when joining multiple small tables to a large fact table, the
optimizer can choose to broadcast the small tables in their entirety to each
S-Blade, while keeping the large table distributed across all Snippet
Processors. The approach minimizes data movement while taking advantage of the
AMPP architecture to parallelize the join.
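The broadcast strategy described above can be sketched as follows. This is a minimal illustration, not the optimizer's actual logic: the node count, the table contents, the `broadcast_join` helper and the row-count comparison are all hypothetical, and the loop over slices stands in for nodes that would work in parallel.

```python
from typing import List

NUM_NODES = 4  # hypothetical number of processing nodes


def broadcast_join(fact_slices: List[List[dict]], dim: List[dict], key: str) -> List[dict]:
    """Inner-join a distributed fact table to a small table copied to every node."""
    joined = []
    for local_fact in fact_slices:  # one iteration per node
        lookup = {d[key]: d for d in dim}  # node-local copy of the broadcast table
        for fact_row in local_fact:
            dim_row = lookup.get(fact_row[key])
            if dim_row is not None:
                joined.append({**fact_row, **dim_row})
    return joined


# Small dimension table and a larger fact table that stays where it is.
dim = [{"store_id": s, "city": c} for s, c in [(1, "Boston"), (2, "Austin"), (3, "Denver")]]
fact = [{"sale_id": i, "store_id": i % 4 + 1, "amount": 10 * i} for i in range(1000)]
fact_slices = [fact[n::NUM_NODES] for n in range(NUM_NODES)]

result = broadcast_join(fact_slices, dim, "store_id")

rows_moved_broadcast = len(dim) * NUM_NODES  # small table sent to every node
rows_moved_redistribute = len(fact)          # upper bound if the fact table were reshuffled
print(len(result), rows_moved_broadcast, rows_moved_redistribute)
```

In this toy case the broadcast moves 12 rows across the network, whereas reshuffling the fact table could move up to 1,000. A cost-based optimizer would weigh this kind of difference when choosing a plan.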