Chapter 39
Hashing Functions
“Love, like a chicken salad or restaurant hash, must be taken with blind faith or it loses its flavor. “
Table of Contents Chapter 39 – Hashing Functions
– Hashing Functions on Teradata
– The HASHROW Function in a real-world Example
– The HASHBUCKET Function in a real-world Example
– The HASHAMP Function in a real-world Example
– A Great HASHAMP Function for Large Tables
– A Real-World HASBAKHAMP Function Example
– A Great way to see distribution for Primary and Fallback rows
Hashing Functions on Teradata
Teradata uses Parallel Processing and the most important aspect of this is to spread the rows of a table equally among the AMPs, who read and write data.
Below are the four Hashing Functions available to see the distribution of data:
HASHROW HASHAMP
HASHBUCKET HASHBAKAMP
Teradata uses parallel processing with its architecture of AMPs and PEs. The Primary Index (PI) must exist whenever you create a table because it is the sole determinant of which AMPs owns which rows. This concept pertains to data storage and data retrieval. Picking the proper column(s) for the PI is extremely important for distribution and therefore, performance. The hashing functions introduced in this section provide information pertaining to the selection of the AMP where each individual row is stored.
There are hashing functions, which can be incorporated into SQL, to produce and use the same hash value result for testing current or future distribution levels. In other words, these functions can be used to evaluate the distribution of the rows within any or all tables or determine the acceptability of other columns as a potential primary index.
The HASHROW Function
The basic syntax for using the HASHROW function follows:
HASHROW( [ <data-column-value> […, <data-column-value>…] ] )
Below is an actual example
SELECT HASHROW (NULL) AS NULL_HASH
,HASHROW(‘Coffing’) AS Name_HASH
,HASHROW() AS NO_HASH ;
NULL_HASH | NAME_HASH | NO_HASH |
00000000 | 4B1075E1 | FFFFFFFF |
If you ran this query over and over again you would get the same result.
Notice that you can Hash a Null and understand that the hash is consistent so running this query with those values will return the same answer time and time again. The HASHROW function is used to produce the 32-bit binary (BYTE(4) data type) Row Hash that is stored as part of the data row. It can return a maximum of 4,294,967,295 unique values. The values produced range from 0 to FFFFFFFF.
The HASHROW Function in a real-world Example
Now that the functionality has been demonstrated on the previous page, a more realistic use might be the following to examine the data distribution and determine the average number of rows per value:
SELECT
COUNT(*)/COUNT(DISTINCT(HASHROW(Student_Id))) AS
AVG_ROW_CT
FROM Student_Table;
AVG_ROW_CT |
1 |
From the above answer of 1 it is a great sign that the data is perfectly distributed or even Unique. As good as this is, the HASHROW function does not provide a lot more help in the evaluation process. However, when combined with the other Hashing Functions, it yields some very helpful data demographics.
The HASHBUCKET Function
The basic syntax for using the HASHBUCKET function follows:
HASHBUCKET( [ <row-hash-value> ] )
Below is an actual example
SELECT HASHBUCKET(NULL) AS NULL_BUCKET
,HASHBUCKET() AS NO_BUCKET;
NULL_BUCKET | NO_BUCKET |
? | 1048575 |
If you ran this query over and over again you would get the same result.
The HASHBUCKET function is used to produce the 16-bit binary Hash Bucket (the DSW) that is used with the Hash Map to determine the AMP that should store and retrieve the data row. It can return a maximum of just over 1,000,000 unique values. The values range from 0 to 1,048,575, not counting the NULL as a potential result. The input to the HASHBUCKET is the 32-bit Row Hash value.
The HASHBUCKET Function in a real-world Example
SELECT COUNT(*) AS NBR_ROWS
,HASHBUCKET(HASHROW (Student_ID)) AS Bucket_No
FROM Student_Table
GROUP BY 2 ;
NBR_ROWS | Bucket_No |
1 | 1007154 |
1 | 499675 |
1 | 1020598 |
1 | 1019019 |
1 | 1017675 |
1 | 579440 |
1 | 466041 |
1 | 975629 |
1 | 729631 |
1 | 542497 |
Now that the functionality has been demonstrated on the previous page, a more realistic use might be to see the number of rows in each Hash Bucket for a column in a table:
The HASHAMP Function
The basic syntax for using the HASHAMP function follows:
HASHAMP( <hash-bucket> )
Below is an actual example
SELECT HASHAMP(NULL) AS NULL_BUCKET
,HASHAMP () AS NO_BUCKET;
NULL_BUCKET | NO_BUCKET |
? | 1 |
The HASHAMP function returns the identification number of the primary AMP for any Hash Bucket number. The input to the HASHAMP function is an integer value. When no value is passed to the HASHAMP function, it returns a number that is one less than the number of AMPs in the current system configuration. If any other data type is passed to it, a run-time error occurs.
The HASHAMP Function in a real-world Example
SELECT Student_ID
,HASHBUCKET(HASHROW(Student_ID)) AS Bucket_No
,HASHAMP(HASHBUCKET(HASHROW(Student_ID))) AS AMP_No
FROM Student_Table
ORDER BY 1 ;
Student_ID | Bucket_No | AMP_No |
123250 | 1019019 | 0 |
125634 | 542497 | 1 |
231222 | 579440 | 2 |
234121 | 1017675 | 3 |
260000 | 975629 | 4 |
280023 | 499675 | 3 |
322133 | 1007154 | 2 |
324652 | 729631 | 1 |
333450 | 1020598 | 2 |
423400 | 466041 | 4 |
Now that the functionality has been demonstrated on the previous page, a more realistic use might be to see the AMP Number for each Student to check for distribution.
A Great HASHAMP Function for Large Tables
SELECT COUNT(*) “Count”
,HASHAMP(HASHBUCKET(HASHROW(Product_ID))) AS AMP_No
FROM Sales_Table
GROUP BY 2
ORDER BY 2 ;
Count | AMP_No |
14 | 0 |
7 | 1 |
The example above is shown using a smaller table on a small system, but the query will brilliantly show the row counts for each AMP. This is a very valuable report.
The HASHBAKAMP Function
The basic syntax for using the HASHBAKAMP function follows:
HASHBAKAMP ( <hash-bucket> )
Below is an actual example
SELECT HASHBAKAMP (NULL) AS NULL_BUCKET
,HASHBAKAMP () AS NO_BUCKET;
NULL_BUCKET | NO_BUCKET |
? | 1 |
The HASHBAKAMP function returns the identification number of the Fallback AMP for any Hash Bucket number. The input to the HASHBAKAMP function is an integer. When no value is passed to the HASHAMP function, it returns a number that is one less than the number of AMPs in the current system configuration. If any other data type is passed to it, a run-time error occurs.
A Real-World HASBAKHAMP Function Example
SELECT Student_ID
,HASHBUCKET(HASHROW(Student_ID)) AS Bucket_No
,HASHBAKAMP(HASHBUCKET(HASHROW(Student_ID)))
AS BAK_AMP_No
FROM Student_Table ORDER BY 3;
Student_ID | Bucket_No | BAK_AMP_No |
333450 | 1020598 | 0 |
322133 | 1007154 | 0 |
231222 | 579440 | 0 |
125634 | 542497 | 1 |
260000 | 975629 | 1 |
234121 | 1017675 | 1 |
123250 | 1019019 | 1 |
324652 | 729631 | 1 |
280023 | 499675 | 1 |
423400 | 466041 | 1 |
The example above shows the Student_ID, the Bucket_No and the AMP that contains the fallback rows for each Student_ID.
A Great way to see distribution for Primary and Fallback rows
SELECT SUM(NbrRows) AS “Rows Per AMP” ,AMP_Nbr
FROM (SEL COUNT(*),
HASHBAKAMP(HASHBUCKET(HASHROW(Student_ID)))
FROM Student_Table GROUP BY 2
UNION ALL
SEL COUNT(*) ,HASHAMP(HASHBUCKET(HASHROW(Student_ID)))
FROM Student_Table GROUP BY 2) DT (NbrRows, AMP_Nbr)
GROUP BY 2
ORDER BY 2 ;
Rows Per AMP | AMP_Nbr |
10 | 0 |
10 | 1 |
The following SELECT can help determine that situation by finding all the Primary rows with their AMP and all the FALLBACK rows with their AMPs and than adding them together for the total (notice it uses a derived table to consolidate the rows counts):