Chapter 35
The SAMPLE Function
“The universe extends beyond the mind of man, and is more complex than the small sample one can study,”
Table of Contents Chapter 35 – The SAMPLE Function
– The SAMPLE Function and Syntax
– A SAMPLE Example that asks for Multiple Samples
– A SAMPLE Example with the SAMPLEID
– A SAMPLE Example WITH REPLACEMENT
– A SAMPLE Example with Four 10% Samples
– A SAMPLE with Conditional Logic
– Aggregates and A SAMPLE using a Derived Table
– Using Random to SELECT a Percentage of Rows
– Using Random and Aggregations
The SAMPLE Function and Syntax
The syntax for the SAMPLE function:
SAMPLE [ WITH REPLACEMENT ] [
RANDOMIZED ALLOCATION ]
[ WHEN <condition> THEN ]
{<number-of-rows> | <percentage>}
[…,<number-of-rows> |
<percentage>]
[ ELSE {<number-of-rows> |
<percentage } END ]
The Sampling function (SAMPLE) permits a SELECT to randomly return rows from a Teradata database table. It allows the request to specify either an absolute number of rows or a percentage of rows to return. Additionally, it provides an ability to return rows from multiple samples.
SAMPLE Function Examples
Bring back 5 rows
SELECT *
FROM Student_Course_Table
SAMPLE 5 ;
5 Rows Returned
Student_ID | Course_ID |
280023 | 210 |
260000 | 400 |
125634 | 100 |
125634 | 220 |
333450 | 500 |
Bring back 25% of the rows
SELECT *
FROM Student CourseTable
SAMPLE .25 ;
4 Rows Returned
Student_ID | Course_ID |
125634 | 100 |
231222 | 210 |
234121 | 100 |
322133 | 300 |
The above example uses the SAMPLE to get a random sample of the sales table. Notice that five rows came back because we asked for a SAMPLE 5 in the first example. In the second example we got a percentage of 25% of the rows.
A SAMPLE Example that asks for Multiple Samples
Bring back two samples
of 25%, but no duplicates!
SELECT *
FROM Student_Course_Table
SAMPLE .25, .25
ORDER BY 1,2 ;
8 Rows Returned
Student_ID | Course_ID |
123250 | 100 |
125634 | 210 |
125634 | 220 |
231222 | 220 |
260000 | 400 |
280023 | 210 |
322133 | 300 |
333450 | 500 |
Sometimes, a single sampling of the data is not sufficient. The SAMPLE function can be used to request more than one sample by listing either the number of rows or the percentage of the rows to be returned. The above example uses the SAMPLE function to request multiple samples:
A SAMPLE Example with the SAMPLEID
SELECT Student_ID
,Course_ID
,SAMPLEID
FROM Student_Course_Table
SAMPLE 5, 5, 5
ORDER BY 3, 1, 2 ;
Bring back 3 Samples with
each sample having 5 rows.
Student_ID | Course_ID | SAMPLEID |
125634 | 100 | 1 |
125634 | 220 | 1 |
322133 | 300 | 1 |
324652 | 200 | 1 |
333450 | 400 | 1 |
123250 | 100 | 2 |
125634 | 200 | 2 |
231222 | 220 | 2 |
280023 | 210 | 2 |
322133 | 220 | 2 |
231222 | 210 | 3 |
234121 | 100 | 3 |
260000 | 400 | 3 |
333450 | 500 | 3 |
Although multiple samples were taken, the rows came back as a single answer set consisting of 5 rows, 5 rows and then 4 rows of the data. The SAMPLEID column name can be used to distinguish between each sample. The last sample only brought back 4 rows because there are only 14 rows in the table and there will be no duplicates.
A SAMPLE Example WITH REPLACEMENT
SELECT Student_ID
,Course_ID
,SAMPLEID
FROM Student_Course_Table
SAMPLE WITH REPLACEMENT 5, 5, 5
ORDER BY 3, 1, 2 ;
Bring back 3 Samples with each sample having 5 rows.
You can have duplicates now!
Student_ID | Course_ID | SAMPLEID |
125634 | 100 | 1 |
231222 | 210 | 1 |
231222 | 210 | 1 |
280023 | 210 | 1 |
280023 | 210 | 1 |
125634 | 100 | 2 |
125634 | 220 | 2 |
231222 | 220 | 2 |
322133 | 220 | 2 |
324652 | 200 | 2 |
234121 | 100 | 3 |
260000 | 400 | 3 |
322133 | 220 | 3 |
324652 | 200 | 3 |
333450 | 500 | 3 |
At the same time, you may wish for rows to be available for all samples. The above example uses the SAMPLE WITH REPLACEMENT function with the SAMPLEID to request multiple samples and denote which sample each row came from.
A SAMPLE Example with Four 10% Samples
SELECT Student_ID
,Course_ID
,SAMPLEID
FROM Student_Course_Table
SAMPLE .1, .1, .1, .1
ORDER BY SAMPLEID ;
Bring back 4 Samples with each sample having 10% of the rows.
Student_ID | Course_ID | SAMPLEID |
125634 | 220 | 1 |
123250 | 100 | 2 |
324652 | 200 | 3 |
280023 | 210 | 4 |
The above example uses the SAMPLE function with the SAMPLEID to request multiple samples as a percentage and denote which sample each row came from. Although 10% of 14 rows is 1.4, it can only return a whole row and therefore, 1 row is returned per sample. Also, since SAMPLEID is a column, it can be used as the sort key.
A Randomized SAMPLE
SELECT Student ID
,Course ID
,SAMPLEID
FROM Student_Course_Table
SAMPLE RANDOMIZED ALLOCATION .1, .1, .1, .1;
Bring back 4 Samples with each sample having 10% of the rows and do a random sample across the entire population.
Student_ID | Course_ID | SAMPLEID |
125634 | 100 | 2 |
125634 | 200 | 3 |
125634 | 220 | 1 |
234121 | 100 | 4 |
By default the SAMPLE function does a proportional sampling across all AMPs in the system. Therefore it is not a simple random sample across the entire population of rows. If you wish a random sample across the entire population, use the RANDOMIZED ALLOCATION as seen above.
A SAMPLE with Conditional Logic
SELECT Student_ID
,Course_ID
,SAMPLEID
FROM Student_Course_Table
SAMPLE RANDOMIZED ALLOCATION
WHEN Course_ID >200 THEN .1, .1 ELSE .2, .2
END
ORDER BY 3;
Bring back two Samples with one row per sample if the Course_ID is <= 200.
Bring back two Samples with two rows each if the Course_ID > 200.
Student_ID | Course_ID | SAMPLEID |
234121 | 100 | 1 |
125634 | 100 | 2 |
333450 | 500 | 3 |
231222 | 220 | 3 |
125634 | 220 | 4 |
322133 | 220 | 4 |
The above query brings back two Samples with one row per sample if the Course_ID is <= 200. Else it bring back two Samples with two rows each if the Course_ID > 200. This means it will attempt to bring back six records total in four different samples.
Aggregates and A SAMPLE using a Derived Table
SELECT count(distinct(Course_ID))
FROM (SEL Course_ID FROM Student_Course_Table
SAMPLE 5) DT ;
COUNT(Distinct(Course_ID)) |
4 |
A second run of the same SELECT might very well yield these results:
COUNT(Distinct(Course_ID)) |
5 |
Although they look like Aggregates, they are not normally compatible with them in the same SELECT list. As demonstrated here, aggregation can be performed, however, they must be calculated in a temporary or derived table.
The above example uses the SAMPLE function to request multiple samples to create a derived table (See Temporary Tables chapter). Then, the unique rows will be counted to show the random quality of the SAMPLE function:
Random Number Generator
The syntax for RANDOM is:
RANDOM(<low-literal-value>, <high-literal-value>)
The example below uses the RANDOM function to return a random number between 1 and 20:
SELECT RANDOM(1, 20);
Random(1,20) |
14 |
The RANDOM function generates a random number that is inclusive for the numbers specified in the SQL that is greater than or equal to the first argument and less than or equal to the second argument.
The RANDOM function may be used in the SELECT list, in a CASE, in a WHERE clause, in a QUALIFY, in a HAVING and in an ORDER BY. The RANSOM function can be used creatively to provide some powerful functionality within SQL.
Using Random to SELECT a Percentage of Rows
The next SELECT uses RANDOM to randomly select 5% of the rows from the table:
SELECT *
FROM Sales_Table
WHERE RANDOM(1, 100) = 5;
Product_ID | Sale_Date | Daily_Sales |
1000 | 09/28/2000 | 48850.40 |
There is roughly a 5% (1 out of 100) chance that a row will be returned using RANDOM in the WHERE clause, completely at random. Since SAMPLE randomly selects rows out of spool, currently RANDOM will be faster than SAMPLE. However, SAMPLE will be more accurate regarding the number of rows being returned with both the percent and row count.
Using Random and Aggregations
This example uses RANDOM to randomly generate a number that will determine which rows from the aggregation will be returned:
SELECT Product_ID, COUNT(Daily_Sales)
FROM Sales_Table
GROUP BY 1
HAVING COUNT(Daily_Sales) > RANDOM(1, 10) ;
Product_ID | Count(Daily_Sales) |
2000 | 7 |
3000 | 7 |
This last example uses RANDOM to randomly generate a number that will determine which rows from the aggregation will be returned. Whenever a random number is needed within the SQL, RANDOM is a great tool.