Teradata Database Indexes
An index is a physical mechanism used to store and access the rows
of a table. Indexes on tables in a relational database function much like
indexes in books, they speed up information retrieval.
In general, Teradata Database uses indexes to:
• Distribute
data rows.
• Locate
data rows.
• Improve
performance.
Indexed access is usually more efficient than searching all rows
of a table.
• Ensure
uniqueness of the index values.
Only one row of a table can have a particular value in the column
or columns defined as a unique index.
Teradata Database supports the following types of indexes:
• Primary • Partitioned
Primary
• Secondary • Join
• Hash • Special
indexes for referential integrity
These indexes are discussed in the following sections.
Primary Indexes
You can create a table with a Unique Primary Index (UPI), a
Non-Unique Primary Index (NUPI), or No Primary Index (NoPI).
IF you create a table
with ...
|
THEN ...
|
a UPI
|
the PI is a column, or
columns, that has no duplicate values.
|
a NUPI
|
the PI is a column, or
columns, that may have duplicate values.
|
NoPI
|
there is no PI column and
rows are not hashed based on any column values.
|
Primary Indexes and Data Distribution
Unique Primary Indexes (UPIs) guarantee uniform distribution of
table rows.
Nonunique Primary Indexes (NUPIs) can cause skewed data. While not
a guarantor of uniform row distribution, the degree of uniqueness of the index
will determine the degree of uniformity of the distribution. Because all rows
with the same PI value are distributed to the same AMP, columns with a small
number of distinct values that are repeated frequently do not make good PI
candidates.
The most efficient access methods to get data in a table is through
the PI. For this reason, choosing a PI should take the following design goal
into consideration: choosing a PI that gives good distribution of data across
the AMPs must be balanced against choosing a PI that reflects the most common
usage pattern of the table.
If you do not explicitly specify a primary index when a table is created,
Teradata Database uses the first column as the nonunique primary index by
default.
User tables can also be created explicitly without a primary index.
These NoPI tables are used for special purposes. NoPI tables are typically used
as staging tables to allow faster data loading. NoPI tables must be used for
tables that will use Teradata Columnar partitioning.
Primary Key
A Primary Key (PK), a term that comes from data modeling, defines
a column, or columns, that uniquely identify a row in a table. Because it is
used for identification, a PK cannot be null. There must be something in that
column, or columns, that uniquely identify it.
Moreover, PK values should not be changed. Historical information,
as well as relationships with others tables, may be lost if a PK is changed or
re-used.
A PK is a logical relational database concept. It may or may not
be the best column, or columns, to choose as a PI for a table.
Foreign Key
A Foreign Key (FK) identifies table relationships. They model the relationship
between data values across tables. Relational databases, like Teradata Database,
permit data values to associate across more than one table.
Thus each FK a table may have must exist somewhere as a PK. That
is, there must be referential integrity between FKs and PKs.
Relationships Between Primary Indexes and Primary Keys
The
following table describes some of the relationships between PKs and PIs.
Primary Key
|
Primary Index
|
Identifies a row
uniquely.
|
Distributes rows.
|
Does not imply access
path.
|
Defines most common
access path.
|
Must be unique.
|
May be unique or
nonunique.
|
May not be null.
|
May be null.
|
Causes a Unique Primary
Index (UPI) or Unique Secondary Index (USI) to be created.
|
N/A
|
Constraint used to ensure
referential integrity.
|
Physical access
mechanism.
|
Required by Teradata
Database only if referential integrity checks are to be performed.
|
Defined for most
production tables. Some staging tables may not have a primary index (NoPI
table).
|
•
If Teradata Database performs
referential integrity checks, then the column limit is 64.
•
If Teradata Database performs no
referential integrity checks, then there is no arbitrary column limit.
|
64-column limit.
|
Values should not be
changed if you want to maintain data integrity and preserve historical
relations among tables.
|
Values can be changed.
|
The
columns chosen for the UPI of a table are frequently the same columns
identified as the PK during the data modeling process, but no hard-and-fast
rule makes this so. In fact, physical database design considerations often lead
to a choice of columns other than those of the primary key for the PI of a
table.
Secondary Indexes
Secondary Indexes (SIs) allow access to information in a table by
alternate paths, and can improve performance by avoiding full table scans.
Although SIs add to table overhead, in terms of disk space and
maintenance, you can drop and recreate SIs as needed.
SIs:
• Do
not affect the distribution of rows across AMPs.
• Can
be unique or nonunique.
• Can
be created for complex data types, such as geospatial data.
• Are
used by the Optimizer when the indexes can improve query performance.
• Can be useful for NoPI tables.
Comparison
of Primary and Secondary IndexesThe following table provides a brief comparison of
PI and SI features.
Feature
|
Primary
|
Secondary
|
Can be unique or
nonunique
|
Both
|
Both
|
Affects row distribution
|
Yes
|
No
|
Create and drop dynamically
|
No
|
Yes
|
Improves access
|
Yes
|
Yes
|
Create using multiple
data types
|
Yes
|
Yes
|
Requires separate
physical structure
|
No
|
Yes, a subtable
|
Requires extra processing
overhead
|
No
|
Yes
|
Join Indexes
A Join Index (JI) is an indexing structure containing columns from
one or more base tables.
Some queries can be satisfied by examining only the JI when all
referenced columns are stored in the index. Such queries are said to be covered by the JI. Other queries may use the JI to qualify a few rows,
then refer to the base tables to obtain requested columns that are not stored in
the JI. Such queries are said to be partially-covered by the index.
Because Teradata Database supports multitable, partially-covering
JIs, all types of JIs, except the aggregate JI, can be joined to their base
tables to retrieve columns that are referenced by a query but are not stored in
the JI. Aggregate JIs can be defined for commonly-used aggregation queries.
Much like SIs, JIs impose additional processing on insert and
delete operations and update operations which change the value of columns
stored in the JI. The performance trade-off considerations are similar to those
for SIs.
Single-table Join Indexes
Join indexes are similar to base tables in that they support a primary
index, which can be used for direct access to one or a few rows.
A single-table JI is a an index structure that contains rows from
only a single-table. This type of structure has been found to be very useful by
Teradata Database users because it provides an alternative approach (primary
index) to directly accessing data.
Multitable Join Indexes
When queries frequently request a particular join, it may be beneficial
to predefine the join with a multitable JI. The Optimizer can use the
predefined join instead of performing the same join repetitively.
Aggregate Join Indexes
Aggregate operations calculate a single value from individual column
values in several rows of a table. Sums and averages calculated from sets of
column values are examples of aggregate operations. If the same aggregate
operation is frequently performed on the same columns, an aggregate JI can
provide improved query performance because Teradata Database can use the aggregate
index to satisfy queries, rather than repeating the aggregate calculations for
every query. You can define aggregate JIs on one or more tables.
Sparse Join Indexes
Indexes include a subset of the columns of one or more tables.
Typically, indexes include the column values from all rows in the table. Sparse
join indexes further limit the index to include only a subset of the table
rows, in addition to a subset of table columns. Sparse JIs can include one or
more tables, and can also be aggregate JIs.
If queries frequently are limited to a subset of the rows of a table,
for example, rows with a specific value in one column, a sparse JI that includes
only those rows can improve performance by providing a more limited data set to
be queried.
Comparison of Index Types
Teradata Database does not require or allow users to explicitly
dictate how indexes should be used for a particular query. The Optimizer costs
all of the reasonable alternatives and selects the one that is estimated to be
the least expensive.
The object of any query plan is to return accurate results as
quickly as possible. Therefore, the Optimizer uses an index or indexes only if
the index speeds up query processing. In some cases, the Optimizer processes
the query without using any index.
Optimizer index selection for a query plan:
• Can
have a direct impact on overall Teradata Database performance.
• Is
not always a straightforward process.
• Is
based partly on usage expectations.
The following table assumes execution of a simple SELECT statement
and explains the strengths and weaknesses of some of the various indexing
methods.
This access method…
|
Has
the following strengths…
|
And
the following possible drawbacks…
|
Unique
Primary Index (UPI)
|
• is the most efficient access method when the SQL
statement contains the PI value
• involves one AMP and one row
• requires no spool file (for a simple SELECT)
• can obtain the most granular locks
|
none,
in the context of a SELECT statement specifying a PI value. However, a poorly
chosen PI can cause poor overall performance in a large workload.
|
Nonunique
Primary Index (NUPI)
|
• provides efficient access when the SQL statement
contains the PI value
• involves one AMP
• can obtain granular locks
• may not require a spool file as long as the number of
rows returned is small
|
• may slow down INSERTs for a SET table with no USIs.
• may decrease the efficiency of SELECTs containing the
PI value when some values are repeated in many rows.
|
Unique
Secondary Index (USI)
|
• provides efficient access when the SQL statement
contains the USI values, and you do not specify PI values
• involves two AMPs and one row
• requires no spool file (for a simple SELECT)
|
requires
additional overhead for INSERT, UPDATE, MERGE, and DELETE statements.
|
Nonunique
Secondary Index (NUSI)
|
• provides efficient access when the number of rows per
value in the table is relatively small
• involves all AMPS and probably multiple rows
• provides access using information that may be more
readily available than a UPI value, such as employee last name, compared to
an employee number
• may require a spool file
|
• requires additional overhead for INSERT, UPDATE, MERGE,
and DELETE statements.
• will not be used by the Optimizer if the number of data
blocks accessed is a significant percentage of the data blocks in the table
because the Optimizer will determine that a full table scan is cheaper.
|
Full
table scan
|
• accesses each row only once
• provides access using any arbitrary set of column
conditions
|
• examines every row.
• usually requires a spool file possibly as large as the
base table.
|
Multitable
join index (JI)
|
• can eliminate the need to perform certain joins and
aggregates repetitively
• may be able to satisfy a query without referencing the
base tables
• can have a different PI from that of the base table
• can replace an NUSI or a USI
|
• requires additional overhead for INSERT, UPDATE, MERGE,
and DELETE statements for any of the base tables that contribute to the
multitable JI.
• usually is not suitable for data in tables subjected to
a large number of daily INSERT, UPDATE, MERGE, and DELETE statements.
• imposes some restrictions on operations performed on
the base table.
|
Single-table
join index (JI)
or
hash
index
|
• can isolate frequently used columns (or their
aggregates for JIs only) from those that are seldom used
• can reduce number of physical I/Os when only commonly
used columns are referenced
• can have a different PI from that of the base table
|
• requires additional overhead for INSERT, UPDATE, MERGE,
and DELETE statements.
• imposes some restrictions on operations performed on
the base table.
|
Sparse
join index (JI)
|
• can be stored in less space than an ordinary JI
• reduces the additional overhead associated with INSERT,
UPDATE, MERGE, and DELETE statements to the base table when compared with an
ordinary JI
• can exclude common values that occur in many rows to
help ensure that the Optimizer chooses to use the JI to access less common
values
|
• requires additional overhead for INSERT, UPDATE, MERGE,
and DELETE statements to the base table.
• imposes some restrictions on operations performed on
the base table.
|
Partitioned Tables
Partitioning stores related groups of data in physical proximity
to improve the performance of queries that are likely to require that data. To
satisfy a query, Teradata Database can employ an optimization called “partition
elimination” to limit data searches to only those partitions containing data
relevant to the query.
A row-partitioned table assigns rows to a particular partition
within an AMP based on a userdefined partitioning expression that defines how
the rows should be grouped for storage (horizontal partitioning). The
partitioning expression is defined when a table is created or altered.
The rows of a row-partitioned table are assigned to an appropriate
partition based on the value of the partitioning expression. Tables without primary
indexes can also be partitioned by column (vertical partitioning). Whereas row
partitioning allows sets of rows to be stored in separate partitions based on a
partitioning expression, column partitioning allows sets of columns
(including just a single column) to be stored in separate partitions. Like row
partitioning, column partitioning can improve performance for some types of
queries by allowing for partition elimination, whereby only the column data
relevant to a particular query is searched during the processing of that query.
Multilevel Partitioned Tables
A table or join index may be column partitioned, row partitioned,
or both, by using multilevel partitioning. Multilevel partitioning allows each
partition to be subpartitioned. Each level must define at least two partitions.
Hashing
Teradata Database uses hashing to distribute data for tables with
a PI to disk storage and uses indexes to access the data. Because the
architecture of Teradata Database is massively parallel, it requires an efficient
means of distributing and retrieving its data. That efficient method is
hashing. Virtually all
Teradata Database indexes are based on (or partially based on) row
hash values rather than table column values. For PIs, Teradata Database obtains
a row hash by hashing the values of the PI columns. The row hash and a sequence
number, which is assigned to distinguish between rows with the same row hash
within a table, are collectively called a row identifier and uniquely identify
each row in a table. A partition identifier is also part of the row identifier
in the case of partitioned tables. For SIs, Teradata Database computes a hash
value using the hash of the values of the SI columns. This value is used for
access when an SI value is specified in the SQL. The SI subtable records the
hash value for the SI, the actual value of the index columns (for synonym resolution),
and a list of primary index row identifiers for the table being indexed.
Identity Columns
Identity columns are columns that have unique values for every row
in the table. Because those values confer uniqueness on every row, the system
can use the values to identify each row in a table. When a column is defined as
an identity column, Teradata Database automatically generates a unique numeric
value for the column in every row that is added to the table.
Identity columns can be used to generate unique values for UPI,
USI, and primary key columns. However, creating a UPI from combinations of
frequently queried columns is preferable to adding identity columns to tables
to serve this function.
Normalization
Normalization is the process of reducing a complex database schema
into a simple, stable one. Generally this process involves removing redundant
attributes, keys, and relationships from the conceptual data model.
Normal Forms
Normalization theory is constructed around the concept of normal forms that define a system of constraints. If a relation meets the
constraints of a particular normal form, we say that relation is in normal
form.
By definition, a relational database is always normalized to first
normal form, because the column values are always atomic. That is, a column can contain one and only one value or null.
But to simply leave it at that invites a number of problems including
redundancy and potential update anomalies. The higher normal forms were
developed to correct those problems.
First, Second, and Third Normal Forms
First, second, and third normal forms are stepping stones to the
Boyce-Codd normal form and, when appropriate, the higher normal forms.
First Normal Form
First normal form (1NF) is definitive of a relational database. If
we are to consider a database relational, then all relations in the database
are in 1NF.
We say a relation is in 1NF if all fields within that relation are
atomic. We sometimes refer to this concept as the elimination of repeating
groups from a relation. Furthermore, first normal form allows no hierarchies of
data values.
Second Normal Form
Second normal form (2NF) deals with the elimination of circular
dependencies from a relation. We say a relation is in 2NF if it is in 1NF and
if every non-key attribute is fully dependent on the entire Primary Key.
A non-key attribute is any attribute that is not part of the
Primary Key for the relation.
Third Normal Form
Third normal form (3NF) deals with the elimination of non-key
attributes that do not describe the Primary Key.
For a relation to be in 3NF, the relationship between any two
non-Primary Key columns, or groups of columns, in a relation must not be one-to-one in either direction.
We say attributes are mutually independent if none of them is
functionally dependent on any combination of the others. This mutual
independence ensures that we can update individual attributes without any
danger of affecting any other attribute in a row.
The following list of benefits summarizes the advantages of
implementing a normalized logical model in 3NF.
• Greater
number of relations
• More
PI choices
• Optimal
distribution of data
• Fewer
full table scans
Referential Integrity
Traditional referential integrity is the concept of relationships
between tables, based on the definition of a primary key and a foreign key. The
concept states that a row cannot exist in a table with a value (not null) for a
referencing column if an equal value does not exist in a referenced column.
Using referential integrity, you can specify columns within a referencing table that are foreign keys for columns in some other referenced table. You must define referenced columns as either primary key
columns or unique columns.
Referential integrity is a reliable mechanism that prevents accidental
database inconsistencies when you perform inserts, merges, updates, and
deletes.
Referential Integrity Terminology
We
use the following terms to explain the referential integrity concept.
Term
|
Definition
|
Parent Table
|
The table referred to by
a Child table. Also called the “referenced table.”
|
Child Table
|
A table in which the
referential constraints are defined. Also called the “referencing table.”
|
Parent Key
|
A candidate key in the
parent table.
|
Primary Key
|
With respect to
referential integrity, a primary key is a parent table column set that is
referred to by a foreign key column set in a child table.
|
Foreign Key
|
With respect to
referential integrity, a foreign key is a child table column set that refers
to a primary key column set in a parent table.
|
Referencing (Child) Table
We call the referencing table the Child table, and we call the
specified Child table columns the referencing columns. Referencing columns
should be of the same number and have the same data type as the referenced
table key.
Referenced (Parent) Table
A Child table must have a parent table, and the referenced table
is referred to as the Parent table. The parent key columns are the referenced
columns.
Importance of Referential Integrity
Referential integrity is important, because it keeps you from
introducing errors into your database. Suppose you have an Order Parts table
like the following.
Order Number
|
Part Number
|
Quantity
|
PK
|
Not Null
|
|
FK
|
FK
|
|
1
|
1
|
110
|
1
|
2
|
275
|
2
|
1
|
152
|
Part
number and order number, each foreign keys in this relation, also form the
composite primary key.
Suppose
you were to delete the row defined by the primary key value 1 in the PART
NUMBER table. The foreign key for the first and third rows in the ORDER PART
table would now be inconsistent, because there would be no row in the PART
NUMBER table with a primary key of 1 to support it. Such a situation shows a
loss of referential integrity.
Teradata
Database provides referential integrity to prevent this from happening. If you
try to delete a row from the PART NUMBER table for which you have specified
referential integrity,
the
database management system will not allow you to remove the row if the part
number is referenced in child tables.
Besides data
integrity and data consistency, referential integrity provides these benefits.
Benefit
|
Description
|
Increases development
productivity
|
You do not need to code
SQL statements to enforce referential integrity constraints because Teradata
Database automatically enforces referential integrity.
|
Requires fewer written
programs
|
All update activities are
programmed to ensure that referential integrity constraints are not violated,
because Teradata Database enforces referential integrity in all environments.
Additional programs are not required.
|
Allows optimizations
|
Referential integrity
allows optimizations to occur, such as join elimination.
|
No comments:
Post a Comment