Friday, 2 October 2015

Teradata Data Distribution and Data Access Methods


Teradata Database Indexes
An index is a physical mechanism used to store and access the rows of a table. Indexes on tables in a relational database function much like indexes in books, they speed up information retrieval.
In general, Teradata Database uses indexes to:
Distribute data rows.
Locate data rows.
Improve performance.

Indexed access is usually more efficient than searching all rows of a table.
Ensure uniqueness of the index values.
Only one row of a table can have a particular value in the column or columns defined as a unique index.
Teradata Database supports the following types of indexes:
Primary                                 Partitioned Primary
Secondary                             Join
Hash                                     Special indexes for referential integrity
These indexes are discussed in the following sections.
Primary Indexes
You can create a table with a Unique Primary Index (UPI), a Non-Unique Primary Index (NUPI), or No Primary Index (NoPI).
IF you create a table with ...
THEN ...
a UPI
the PI is a column, or columns, that has no duplicate values.
a NUPI
the PI is a column, or columns, that may have duplicate values.
NoPI
there is no PI column and rows are not hashed based on any column values.
Primary Indexes and Data Distribution
Unique Primary Indexes (UPIs) guarantee uniform distribution of table rows.
Nonunique Primary Indexes (NUPIs) can cause skewed data. While not a guarantor of uniform row distribution, the degree of uniqueness of the index will determine the degree of uniformity of the distribution. Because all rows with the same PI value are distributed to the same AMP, columns with a small number of distinct values that are repeated frequently do not make good PI candidates.
The most efficient access methods to get data in a table is through the PI. For this reason, choosing a PI should take the following design goal into consideration: choosing a PI that gives good distribution of data across the AMPs must be balanced against choosing a PI that reflects the most common usage pattern of the table.
If you do not explicitly specify a primary index when a table is created, Teradata Database uses the first column as the nonunique primary index by default.
User tables can also be created explicitly without a primary index. These NoPI tables are used for special purposes. NoPI tables are typically used as staging tables to allow faster data loading. NoPI tables must be used for tables that will use Teradata Columnar partitioning.
Primary Key
A Primary Key (PK), a term that comes from data modeling, defines a column, or columns, that uniquely identify a row in a table. Because it is used for identification, a PK cannot be null. There must be something in that column, or columns, that uniquely identify it.
Moreover, PK values should not be changed. Historical information, as well as relationships with others tables, may be lost if a PK is changed or re-used.
A PK is a logical relational database concept. It may or may not be the best column, or columns, to choose as a PI for a table.
Foreign Key
A Foreign Key (FK) identifies table relationships. They model the relationship between data values across tables. Relational databases, like Teradata Database, permit data values to associate across more than one table.
Thus each FK a table may have must exist somewhere as a PK. That is, there must be referential integrity between FKs and PKs.
Relationships Between Primary Indexes and Primary Keys
The following table describes some of the relationships between PKs and PIs.
Primary Key
Primary Index
Identifies a row uniquely.
Distributes rows.
Does not imply access path.
Defines most common access path.
Must be unique.
May be unique or nonunique.
May not be null.
May be null.
Causes a Unique Primary Index (UPI) or Unique Secondary Index (USI) to be created.
N/A
Constraint used to ensure referential integrity.
Physical access mechanism.
Required by Teradata Database only if referential integrity checks are to be performed.
Defined for most production tables. Some staging tables may not have a primary index (NoPI table).
If Teradata Database performs referential integrity checks, then the column limit is 64.
If Teradata Database performs no referential integrity checks, then there is no arbitrary column limit.
64-column limit.
Values should not be changed if you want to maintain data integrity and preserve historical relations among tables.
Values can be changed.
The columns chosen for the UPI of a table are frequently the same columns identified as the PK during the data modeling process, but no hard-and-fast rule makes this so. In fact, physical database design considerations often lead to a choice of columns other than those of the primary key for the PI of a table.
Secondary Indexes
Secondary Indexes (SIs) allow access to information in a table by alternate paths, and can improve performance by avoiding full table scans.
Although SIs add to table overhead, in terms of disk space and maintenance, you can drop and recreate SIs as needed.
SIs:
Do not affect the distribution of rows across AMPs.
Can be unique or nonunique.
Can be created for complex data types, such as geospatial data.
Are used by the Optimizer when the indexes can improve query performance.
Can be useful for NoPI tables.
Comparison of Primary and Secondary IndexesThe following table provides a brief comparison of PI and SI features.
Feature
Primary
Secondary
Can be unique or nonunique
Both
Both
Affects row distribution
Yes
No
Create and drop dynamically
No
Yes
Improves access
Yes
Yes
Create using multiple data types
Yes
Yes
Requires separate physical structure
No
Yes, a subtable
Requires extra processing overhead
No
Yes
Join Indexes
A Join Index (JI) is an indexing structure containing columns from one or more base tables.
Some queries can be satisfied by examining only the JI when all referenced columns are stored in the index. Such queries are said to be covered by the JI. Other queries may use the JI to qualify a few rows, then refer to the base tables to obtain requested columns that are not stored in the JI. Such queries are said to be partially-covered by the index.
Because Teradata Database supports multitable, partially-covering JIs, all types of JIs, except the aggregate JI, can be joined to their base tables to retrieve columns that are referenced by a query but are not stored in the JI. Aggregate JIs can be defined for commonly-used aggregation queries.
Much like SIs, JIs impose additional processing on insert and delete operations and update operations which change the value of columns stored in the JI. The performance trade-off considerations are similar to those for SIs.
Single-table Join Indexes
Join indexes are similar to base tables in that they support a primary index, which can be used for direct access to one or a few rows.
A single-table JI is a an index structure that contains rows from only a single-table. This type of structure has been found to be very useful by Teradata Database users because it provides an alternative approach (primary index) to directly accessing data.
Multitable Join Indexes
When queries frequently request a particular join, it may be beneficial to predefine the join with a multitable JI. The Optimizer can use the predefined join instead of performing the same join repetitively.
Aggregate Join Indexes
Aggregate operations calculate a single value from individual column values in several rows of a table. Sums and averages calculated from sets of column values are examples of aggregate operations. If the same aggregate operation is frequently performed on the same columns, an aggregate JI can provide improved query performance because Teradata Database can use the aggregate index to satisfy queries, rather than repeating the aggregate calculations for every query. You can define aggregate JIs on one or more tables.
Sparse Join Indexes
Indexes include a subset of the columns of one or more tables. Typically, indexes include the column values from all rows in the table. Sparse join indexes further limit the index to include only a subset of the table rows, in addition to a subset of table columns. Sparse JIs can include one or more tables, and can also be aggregate JIs.
If queries frequently are limited to a subset of the rows of a table, for example, rows with a specific value in one column, a sparse JI that includes only those rows can improve performance by providing a more limited data set to be queried.
Comparison of Index Types
Teradata Database does not require or allow users to explicitly dictate how indexes should be used for a particular query. The Optimizer costs all of the reasonable alternatives and selects the one that is estimated to be the least expensive.
The object of any query plan is to return accurate results as quickly as possible. Therefore, the Optimizer uses an index or indexes only if the index speeds up query processing. In some cases, the Optimizer processes the query without using any index.
Optimizer index selection for a query plan:
Can have a direct impact on overall Teradata Database performance.
Is not always a straightforward process.
Is based partly on usage expectations.
The following table assumes execution of a simple SELECT statement and explains the strengths and weaknesses of some of the various indexing methods.
This access method…
Has the following strengths…
And the following possible drawbacks…
Unique Primary Index (UPI)
is the most efficient access method when the SQL statement contains the PI value
involves one AMP and one row
requires no spool file (for a simple SELECT)
can obtain the most granular locks
none, in the context of a SELECT statement specifying a PI value. However, a poorly chosen PI can cause poor overall performance in a large workload.
Nonunique Primary Index (NUPI)
provides efficient access when the SQL statement contains the PI value
involves one AMP
can obtain granular locks
may not require a spool file as long as the number of rows returned is small
may slow down INSERTs for a SET table with no USIs.
may decrease the efficiency of SELECTs containing the PI value when some values are repeated in many rows.
Unique Secondary Index (USI)
provides efficient access when the SQL statement contains the USI values, and you do not specify PI values
involves two AMPs and one row
requires no spool file (for a simple SELECT)
requires additional overhead for INSERT, UPDATE, MERGE, and DELETE statements.
Nonunique Secondary Index (NUSI)
provides efficient access when the number of rows per value in the table is relatively small
involves all AMPS and probably multiple rows
provides access using information that may be more readily available than a UPI value, such as employee last name, compared to an employee number
may require a spool file
requires additional overhead for INSERT, UPDATE, MERGE, and DELETE statements.
will not be used by the Optimizer if the number of data blocks accessed is a significant percentage of the data blocks in the table because the Optimizer will determine that a full table scan is cheaper.
Full table scan
accesses each row only once
provides access using any arbitrary set of column conditions
examines every row.
usually requires a spool file possibly as large as the base table.
Multitable join index (JI)
can eliminate the need to perform certain joins and aggregates repetitively
may be able to satisfy a query without referencing the base tables
can have a different PI from that of the base table
can replace an NUSI or a USI
requires additional overhead for INSERT, UPDATE, MERGE, and DELETE statements for any of the base tables that contribute to the multitable JI.
usually is not suitable for data in tables subjected to a large number of daily INSERT, UPDATE, MERGE, and DELETE statements.
imposes some restrictions on operations performed on the base table.
Single-table join index (JI)
or
hash index
can isolate frequently used columns (or their aggregates for JIs only) from those that are seldom used
can reduce number of physical I/Os when only commonly used columns are referenced
can have a different PI from that of the base table
requires additional overhead for INSERT, UPDATE, MERGE, and DELETE statements.
imposes some restrictions on operations performed on the base table.
Sparse join index (JI)
can be stored in less space than an ordinary JI
reduces the additional overhead associated with INSERT, UPDATE, MERGE, and DELETE statements to the base table when compared with an ordinary JI
can exclude common values that occur in many rows to help ensure that the Optimizer chooses to use the JI to access less common values
requires additional overhead for INSERT, UPDATE, MERGE, and DELETE statements to the base table.
imposes some restrictions on operations performed on the base table.
Partitioned Tables
Partitioning stores related groups of data in physical proximity to improve the performance of queries that are likely to require that data. To satisfy a query, Teradata Database can employ an optimization called “partition elimination” to limit data searches to only those partitions containing data relevant to the query.
A row-partitioned table assigns rows to a particular partition within an AMP based on a userdefined partitioning expression that defines how the rows should be grouped for storage (horizontal partitioning). The partitioning expression is defined when a table is created or altered.
The rows of a row-partitioned table are assigned to an appropriate partition based on the value of the partitioning expression. Tables without primary indexes can also be partitioned by column (vertical partitioning). Whereas row partitioning allows sets of rows to be stored in separate partitions based on a
partitioning expression, column partitioning allows sets of columns (including just a single column) to be stored in separate partitions. Like row partitioning, column partitioning can improve performance for some types of queries by allowing for partition elimination, whereby only the column data relevant to a particular query is searched during the processing of that query.
Multilevel Partitioned Tables
A table or join index may be column partitioned, row partitioned, or both, by using multilevel partitioning. Multilevel partitioning allows each partition to be subpartitioned. Each level must define at least two partitions.
Hashing
Teradata Database uses hashing to distribute data for tables with a PI to disk storage and uses indexes to access the data. Because the architecture of Teradata Database is massively parallel, it requires an efficient means of distributing and retrieving its data. That efficient method is hashing. Virtually all
Teradata Database indexes are based on (or partially based on) row hash values rather than table column values. For PIs, Teradata Database obtains a row hash by hashing the values of the PI columns. The row hash and a sequence number, which is assigned to distinguish between rows with the same row hash within a table, are collectively called a row identifier and uniquely identify each row in a table. A partition identifier is also part of the row identifier in the case of partitioned tables. For SIs, Teradata Database computes a hash value using the hash of the values of the SI columns. This value is used for access when an SI value is specified in the SQL. The SI subtable records the hash value for the SI, the actual value of the index columns (for synonym resolution), and a list of primary index row identifiers for the table being indexed.
Identity Columns
Identity columns are columns that have unique values for every row in the table. Because those values confer uniqueness on every row, the system can use the values to identify each row in a table. When a column is defined as an identity column, Teradata Database automatically generates a unique numeric value for the column in every row that is added to the table.
Identity columns can be used to generate unique values for UPI, USI, and primary key columns. However, creating a UPI from combinations of frequently queried columns is preferable to adding identity columns to tables to serve this function.
Normalization
Normalization is the process of reducing a complex database schema into a simple, stable one. Generally this process involves removing redundant attributes, keys, and relationships from the conceptual data model.
Normal Forms
Normalization theory is constructed around the concept of normal forms that define a system of constraints. If a relation meets the constraints of a particular normal form, we say that relation is in normal form.
By definition, a relational database is always normalized to first normal form, because the column values are always atomic. That is, a column can contain one and only one value or null.
But to simply leave it at that invites a number of problems including redundancy and potential update anomalies. The higher normal forms were developed to correct those problems.
First, Second, and Third Normal Forms
First, second, and third normal forms are stepping stones to the Boyce-Codd normal form and, when appropriate, the higher normal forms.
First Normal Form
First normal form (1NF) is definitive of a relational database. If we are to consider a database relational, then all relations in the database are in 1NF.
We say a relation is in 1NF if all fields within that relation are atomic. We sometimes refer to this concept as the elimination of repeating groups from a relation. Furthermore, first normal form allows no hierarchies of data values.
Second Normal Form
Second normal form (2NF) deals with the elimination of circular dependencies from a relation. We say a relation is in 2NF if it is in 1NF and if every non-key attribute is fully dependent on the entire Primary Key.
A non-key attribute is any attribute that is not part of the Primary Key for the relation.
Third Normal Form
Third normal form (3NF) deals with the elimination of non-key attributes that do not describe the Primary Key.
For a relation to be in 3NF, the relationship between any two non-Primary Key columns, or groups of columns, in a relation must not be one-to-one in either direction.
We say attributes are mutually independent if none of them is functionally dependent on any combination of the others. This mutual independence ensures that we can update individual attributes without any danger of affecting any other attribute in a row.
The following list of benefits summarizes the advantages of implementing a normalized logical model in 3NF.
Greater number of relations
More PI choices
Optimal distribution of data
Fewer full table scans
Referential Integrity
Traditional referential integrity is the concept of relationships between tables, based on the definition of a primary key and a foreign key. The concept states that a row cannot exist in a table with a value (not null) for a referencing column if an equal value does not exist in a referenced column.
Using referential integrity, you can specify columns within a referencing table that are foreign keys for columns in some other referenced table. You must define referenced columns as either primary key columns or unique columns.
Referential integrity is a reliable mechanism that prevents accidental database inconsistencies when you perform inserts, merges, updates, and deletes.
Referential Integrity Terminology
We use the following terms to explain the referential integrity concept.
Term
Definition
Parent Table
The table referred to by a Child table. Also called the “referenced table.”
Child Table
A table in which the referential constraints are defined. Also called the “referencing table.”
Parent Key
A candidate key in the parent table.
Primary Key
With respect to referential integrity, a primary key is a parent table column set that is referred to by a foreign key column set in a child table.
Foreign Key
With respect to referential integrity, a foreign key is a child table column set that refers to a primary key column set in a parent table.
Referencing (Child) Table
We call the referencing table the Child table, and we call the specified Child table columns the referencing columns. Referencing columns should be of the same number and have the same data type as the referenced table key.
Referenced (Parent) Table
A Child table must have a parent table, and the referenced table is referred to as the Parent table. The parent key columns are the referenced columns.
Importance of Referential Integrity
Referential integrity is important, because it keeps you from introducing errors into your database. Suppose you have an Order Parts table like the following.
Order Number
Part Number
Quantity
PK
Not Null
FK
FK
1
1
110
1
2
275
2
1
152
Part number and order number, each foreign keys in this relation, also form the composite primary key.
Suppose you were to delete the row defined by the primary key value 1 in the PART NUMBER table. The foreign key for the first and third rows in the ORDER PART table would now be inconsistent, because there would be no row in the PART NUMBER table with a primary key of 1 to support it. Such a situation shows a loss of referential integrity.
Teradata Database provides referential integrity to prevent this from happening. If you try to delete a row from the PART NUMBER table for which you have specified referential integrity,
the database management system will not allow you to remove the row if the part number is referenced in child tables.
Besides data integrity and data consistency, referential integrity provides these benefits.
Benefit
Description
Increases development productivity
You do not need to code SQL statements to enforce referential integrity constraints because Teradata Database automatically enforces referential integrity.
Requires fewer written programs
All update activities are programmed to ensure that referential integrity constraints are not violated, because Teradata Database enforces referential integrity in all environments. Additional programs are not required.
Allows optimizations
Referential integrity allows optimizations to occur, such as join elimination.


Credits:

No comments:

Post a Comment