Chapter 3: Variables and Data Types

Variable Names

DECLARE Statement

DATA Step Conversion—Numerics

DATA Step Conversion—Characters

DATA Step Conversion—Date Types

DATA Step Conversion—Binary Types

DATA Step Conversion—Saving Your Table

More about Dates

Operations and Assignments

IF Operator

SELECT Operator

Arrays

Temporary Arrays

Variable Arrays

Deferred Dimensioning

Array Dimensions

Array Assignment

Missing Values and NULL Values

ANSI Mode and SAS Mode

Testing for Missing or NULL Values

The SAS DATA step is a powerful programming language based on two data types—numeric (double-precision-floating point) and fixed-length character. The data type can be explicitly stated using the LENGTH, FORMAT, or ATTRIB statement. Or, it can be implicitly determined based on the result of a calculation. Because there are only two data types, implicit determination is very common in DATA step programs. However, because there is no need to explicitly declare variables, subtle errors can creep into DATA step programs when variables are misspelled. This is not so in PROC DS2.

In PROC DS2, a new DECLARE statement is used to define the variable. Remember that identifiers in DS2 must be declared before they can be used. As a result, subtle errors because of misspelled variables are eliminated.

DS2 provides a rich array of data types. These data types match the data types found in the ANSI standards for RDBMS. Because there are now many more data types, the need for the DECLARE statement becomes obvious. When there are multiple numeric types, you need a way to tell DS2 which numeric type to use.

Note: Explicitly declaring variables can be overridden by setting the PROC DS2 option SCOND or the system option DS2COND to NONE, NOTE, or WARNING. This is not recommended.

Here are the data types that are available:

Character (Four Types)

•   CHAR(n)—fixed-length, maximum of n characters. CHAR(10) takes 10 bytes. This is the same as the DATA step character data type.

•   VARCHAR(n)—variable length, maximum of n characters. VARCHAR(10) takes a maximum of 10 bytes. If the string was ‘Peter’, only five bytes are allocated.

•   NCHAR(n), NVARCHAR(n)—this is the same as CHAR, but it uses a Unicode national character set. Depending on the platform, the Unicode national character set takes two to four bytes per character; NCHAR(10) could take 20 or 40 bytes to store.

Numeric (Seven Types)

•   TINYINT—signed integers from -128 to 127.

•   SMALLINT—signed integers from -32,768 to 32,767.

•   INTEGER—signed integers from -2,147,483,648 to 2,147,483,647 (10 digits).

•   BIGINT—signed integers from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807.

•   DECIMAL(p,s)/NUMERIC(p,s)—a signed, exact, fixed-point decimal number. Precision (p) specifies the total number of digits to accept. Scale (s) is the number of digits after the decimal point. For example, decimal(10,2) stores decimal numbers up to 10 digits with a two-digit fixed-point fractional portion, such as 12345678.90.

•   DOUBLE/FLOAT—a signed, double-precision floating-point number. Numbers of large magnitude with an unknown number of (but many) digits of precision to the right of the decimal point. Unlike the integer and decimal data types, which store exact results, floating-point numbers result in approximate or non-exact results. This is the same as the DATA step numeric data type.

•   REAL—a signed, single-precision, floating-point number.

Binary (Two Types)

•   BINARY(n)—fixed-length binary, where n is the maximum number of bytes.

•   VARBINARY(n)—variable-length binary, where n is the maximum number of bytes.

Date and Time (Three Types)

•   DATE—a calendar date

•   TIME(p)—time (hh:mm:ss.nnnn), where p is the number of decimal precision on the seconds; precision is dependent on the back-end database.

•   TIMESTAMP(p)—both DATE and TIME with decimal precision; precision is dependent on the back-end database.

Variable Names

DS2 uses the generic term “identifier” to refer to variables. It also uses “identifier” to refer to any other element that has a name, such as method, table, or schema. In this book, the terms “variable” and “column” are used interchangeably when referring to a variable, and the terms “data set” and “table” are used interchangeably when referring to a data set.

All identifiers in DS2 follow the SQL:1999 standard, which means that names can be up to 256 character long! Of course, they must follow the naming conventions set by the underlying data store. So, for SAS data sets, you are limited to 32 characters. The same basic naming conventions of the SAS DATA step apply. Variables begin with a letter or underscore and can consist of letters, numbers, and underscores, but no special characters. However, there are some differences.

Unlike the DATA step, DS2 has reserved words (that is, names that cannot be used as identifiers). For example, the following code works in a DATA step:

data namesDS;

     name = "peter";

     date = '01jan2015'd;

     output;

run;

The same code in DS2 produces an error:

proc ds2;

data names (overwrite=yes);

DECLARE char(20)  name;

DECLARE date      DATE;

method init();

name   = 'peter';

date = date'2015-01-01';

output;

end;

enddata;

run;

quit;

ERROR: Compilation error.

ERROR: Parse encountered type when expecting identifier.

ERROR: Line 602: Parse failed: DECLARE date  >>> DATE <<< ;

You get an error because date is a reserved word. If you insist on having a reserved word as a variable name, you can create a delimited variable:

proc ds2;

data names (overwrite=yes);

DECLARE char(20)  name;

DECLARE date     "date";

method init();

name   = 'peter';

"date" = date'2015-01-01';

output;

end;

enddata;

run;

quit;

The variable is enclosed in double quotation marks, both in the declaration and in the assignment. There will be other times you need to use a delimited variable, specifically when an identifier contains special characters or when you are referencing a database schema or table that is case sensitive. Otherwise, all identifiers are case insensitive. The following variable names are all equivalent:

lastname

LASTNAME

lastName

lastnamE

“lastname”

Note the last one (“lastname”). Because the name “lastname” is not a reserved word and does not contain special characters, the delimiters (the double quotation marks) can be stripped.

DECLARE Statement

To let DS2 know how much memory is needed for a variable, all variables must be declared before they are used. Here is what the DECLARE statement looks like:

DECLARE <data type> <variable list> [HAVING <options>]

In this statement, <data type> is a valid data type. <variable list> is one or more variable names that are separated by spaces. Each variable that is listed has a specified data type. HAVING is optional. HAVING options are LABEL, FORMAT, and INFORMAT.

Here are some examples of valid DECLARE statements:

DECLARE date birthday;

DECLARE date birthday having format yymmdd10.;

DECLARE date birthday having format yymmdd10. label ‘Date of Birth’;

DECLARE date “date”   having format yymmdd10. label ‘Date as a Key Word’;

DECLARE tinyint i j k;

DECLARE decimal(15,2) profit format dollar15.2 label ‘Profit’;

Here is a valid DECLARE statement:

DECLARE char(10) a b c having format $20. label 'string a'

                              format $15. label 'string b'

                              format $10. label 'string c';

However, all three variables (a, b, and c) end up having the same format ($10.) and the same label (‘string c’). If you want each variable to have a different format and label, then each variable must have its own DECLARE statement.

When it executes, PROC DS2 allocates memory and correctly processes the data based on the data type specified. There are two issues that veteran DATA step programmers must be aware of—type conversion during processing and type conversion when writing data.

DATA Step Conversion—Numerics

What happens when you divide two numbers? In the DATA step, there is no question what happens. The dividend is divided by the divisor to return the quotient:

18 / 6 = 3

18 is the dividend, 6 is the divisor, and 3 is the quotient. In the DATA step, all three numbers are double-precision floating-point numbers, even though they all look like integers. In DS2, not all numbers are doubles. In the previous example, what if the dividend was declared as a double, and the divisor was declared was an integer? Before you can divide in DS2, both operands must be the same data type. This means that one of the two must be coerced (converted) to match the other. For numerics, this means that when one operand is higher precedence than the other, the lower order is promoted to the higher. If both operands are the same data type, there is no conversion and he result is the same data type as the operands. Here is the order of precedence (with 1 being the highest) when the operands are different data types:

1.   Double
If either operand is a double, the other is promoted to double and the result is a double. A real value is promoted to double for all operations.

2.   Decimal
If either operand is a decimal, the other is promoted to decimal and the result is a decimal.

3.   Bigint
If either operand is a bigint, the other is prompted to bigint and the result is a bigint.

4.   All others
Both operands are converted to integers and the result is an integer.

In the DATA step, none of this applies because all operands are doubles. But, in DS2, beware. You can end up with results that are computationally correct, but not the results that you expected. Consider a conversion from Celsius to Fahrenheit:

F = (9 / 5) * C + 32

This computation gives you an incorrect result for all temperature values except 0C. Why? Because 9 and 5 are both integers, the result of the division is also an integer. In this case, the result is 1, which is the correct result when you divide these integers and is computationally correct. However, this is not the answer that you expect. 0C converts correctly because any number multiplied by 0 results in 0. Even if the variable C were specified as a double, the expression in the parentheses in the equation takes precedence. You could improve the result by changing the equation:

F = (9 * C) / 5 + 32

If the variable C is an integer, all operations on the right side resolve to integers, so the result is an integer. This means that a temperature of 1C would resolve to 33F.

9 * 1=9       (two integers, result is integer)

9 / 5=1       (two integers, result is integer)

1 + 32=33     (two integers, result is integer)

If the value of C is a double, all operations on the right side resolve to doubles, so the result is a double. This means that a temperature of 1C would resolve to 33.8F:

9 * 1.=9.      (9 is promoted to double, result is double)

9. / 5=1.8     (5 is promoted to double, result is double)

1.8 + 32=33.8 (32 is promoted to double, result is double)

You could improve the formula even more:

F = (9. * C) / 5 + 32

Again, the right side resolves to 33.8 even if the variable C is declared as an integer because you specified the 9 as a double.

Although the right side resolves to 33.8, you have to know how F should be declared. If F is also a double (real or decimal), then F is assigned 33.8. On the other hand, if F is an integer, then F is 33 and the decimal part is dropped.

There is also another potential problem. The previous equation converts typical room temperatures to and from Celsius and Fahrenheit using only integer values. Suppose that you are monitoring all the rooms in all the office towers owned by a global real estate conglomerate and you need to adjust temperatures up or down as needed. You could start by declaring variables C and F as tinyint. This would work until the day the remote temperature sensor was placed above the kettle used to boil water. At some point, the temperature would surpass 127F (~52.8C), and you would have an integer overflow—a number greater than the largest (or smallest) number that the data type can handle. The equation would give incorrect results.

DATA Step Conversion—Characters

What happens when you try to combine characters and numbers, for example, in a concatenation? Basically, all non-character data types have an implicit toString() method that converts the underlying data representation into a visual (character) representation. This means that any non-character data type in a string function is automatically converted to a character string without the conversion warnings that you get in a DATA step program:

NOTE: Numeric values have been converted to character values at the places given by:

      (Line):(Column).

This is a boon if you do not like to see warnings in your log! You can also use the put() function to explicitly convert the variable using a specific format.

What happens when you have a character string that looks like a number and you use it in a numeric operation? Character strings have an implicit toNumber() method that converts the visual (character) representation of the number to the data representation of the number. The keyword here is “number.” A character type can be converted to a numeric type, but not to a data or binary type.

DATA Step Conversion—Date Types

Date types (date, time, timestamp) can be automatically converted to any character type, but not to any other type. Similarly, a data type cannot be automatically converted to another date type. For DATA step programmers, this means that you cannot simply add numbers to a DATE variable to increment the date. You have to use a function to increment the date.

DATA Step Conversion—Binary Types

Binary types can be automatically converted to any character type and to other binary types, but not to another data type.

DATA Step Conversion—Saving Your Table

PROC DS2 converts data during processing. But do you know how PROC DS2 converts data when it writes it back to a data table? DS2 is designed to interact with SQL databases seamlessly. In general, DS2 data types match up with most RDBMS data types. The SAS 9.4 DS2 Language Reference Manual provides tables that show which DS2 data type matches a specific RDBMS data type. But, if you work with SAS tables, it is a different story. SAS tables can store only double-precision floating-point numbers and fixed-length character variables.

Binary data are simplest. Binary data cannot be saved to a SAS data set. If you use binary data, you must first explicitly assign it to a character variable, and then you can save the character variable.

Here are some characteristics of character variables:

•   CHAR(n) is saved as is.

•   VARCHAR(n) is promoted to CHAR(n) and saved.

•   NCHAR(n) is saved as a fixed-length character variable, typically encoded as Unicode UTF-8.

•   NVARCHAR(n) is saved as a fixed-length character variable, typically encoded as Unicode UTF-8.

Unicode UTF-8 is usually a multiple-byte encoding. When you are reading and writing UTF-8 from or to a SAS data set, ensure that no truncation takes place.

The date, time, and timestamp data types are implicitly converted to a SAS date, time, and datetime variable and saved with a format. If a format is specified when the variable is declared, it is used. Otherwise, the default formats date9., time8., and datetime19.2 are used. Remember, all SAS dates are numbers, so DS2 dates are converted to doubles and saved.

All DS2 numeric data types are converted to doubles and saved. You must make sure that precision is not lost when you save high-precision numerics (bignint, decimal) to a SAS data set.

More about Dates

To beginner SAS users and even advanced SAS users, SAS dates are a common source of confusion. A SAS date is simply a number. The number represents the number of days since 1960-01-01. To increment a SAS date by one week, you add 7 to the date. To determine the number of days between two dates, you subtract one from the other. To turn a numeric representation (for example, 20454) into a visual representation (2016-01-01), you need only to apply the appropriate SAS format. If you want to display the same date in a report in more than one format, you do not need multiple variables, you need only different formats.

Then, ISO 8601 date formats and functions were introduced. ISO 8601 is a standard designed to clarify and simplify the exchange of date values. In addition to setting standards for the representation of date and time values, the standard introduced the ability to manage incomplete dates. (Having some information is better than having no information, right?) For example, you might know that an event happened on the first of the month, but you do not know the year or the month. With this information, you could not create a SAS date. However, ISO 8601 enables you to create a date variable even when you do not have a fully specified date. When the incomplete variable value is passed to another system, that system might have more information to complete the value.

One of the features of ISO 8601 is the inclusion of time zone as part of the date. SAS dates and the new date types do not have a time zone attribute. To process dates and times using time zones, DS2 offers a pre-defined package, TZ. TZ has several methods that handle time zones and ISO 8601 dates. See the SAS 9.4 DS2 Language Reference Manual for more information about this package.

PROC DS2 does not offer any of the ISO 8601 date formats to display dates in ISO 8601 format. It does not support the IS8601_CONVERT() call routine to convert dates and durations into other dates durations. If your application uses ISO 8601 dates and durations, you have to continue using the DATA step.

Operations and Assignments

PROC DS2 follows the same rules of operation as the DATA step except for character constants. In the DATA step, a character constant can be enclosed in single or double quotation marks. PROC DS2 follows the SQL:1999 standard where double quotation marks are used to delimit identifiers. Here is an example in DS2:

name = “Peter”;

This code assigns the value of the variable Peter to the variable name. It does not assign the string Peter to the variable name, which is what you would expect. To assign the string Peter to the variable name, the string must be enclosed in single quotation marks:

name = ‘Peter’;

If you are a seasoned DATA step programmer, you are probably asking yourself, “How do I embed a macro variable in the string to resolve it properly?”

SAS would not be SAS without the ability to have macro variables properly resolve issues. SAS has a new autocall macro, %TSLIT(), that helps resolve macros in DS2 strings. Because %TSLIT() is an autocall macro, it can also be used in the DATA step.

%let paul   = PAUL;

%let number = 1;

proc ds2;

data trio (overwrite=yes);

dcl varchar(30) a b ;

dcl varchar(50) "Peter&Paul" ;   

dcl varchar(50) peter paul;

dcl varchar(50) both;

dcl char(1) x;

method init();

    "peter&paul" = 'Where is Mary';  

    peter = 'peter';

    paul  = 'paul';

both = "Peter&Paul" ; put '3. ' both=;   

    both = 'Peter&Paul' ;   put '4. ' both=;   

    both = %tslit(Peter&Paul); put '5. ' both=;   

    both = %tslit(we have Peter&Paul but no Mary); put '6. ' both=;  

    x = 'X';

    b = %tslit(x&number. x&number. x); put '7. ' b=;  

end;

enddata;

run;

quit;

   A variable name with an embedded special character is declared.

   The string Where is Mary is assigned to the variable Peter&Paul. The variable name is not case sensitive.

   The value of the variable Peter&Paul is assigned to the string variable both. Now, the string variable both contains the string Where is Mary.

   The string Peter&Paul is assigned to the string variable both. Now, the string variable both contains the string Peter&Paul.

   The macro variable &Paul is concatenated to the string variable Peter. Now, the variable both contains the string PeterPaul.

   More verbiage is added to the concatenation of the macro variable &Paul to the string variable Peter. Now, the variable both contains the string we have PeterPaul but no Mary.

   Multiple macro variables are included. The variable b contains the string x1 x1 x.

Here is the log from the previous code:

3.  both=Where is Mary

4.  both=Peter&Paul

5.  both=PeterPaul

6.  both=we have PeterPaul but no Mary

7.  b=x1 x1 x

IF Operator

PROC DS2 offers a new way to evaluate a series of IF/THEN statements: the IF operator. The IF operator works much like the DATA step IFN()/IFC() functions in that the IF operator assigns a value to a variable like the IFN()/IFC() functions can be used to assign a value.

proc ds2;

data showIF (overwrite=yes);

DECLARE double  dResult;

method run();

    DECLARE integer i;

    do i = 1 to 110;

       dResult = IF i <= 50 then 1 else 2;  

       output;

    end;

end;

enddata;

run;

quit;

1.   The IF operator assigns a value of 1 or 2 to dResult.
The IF operator can be extended to have multiple checks like the DATA step’s IF/THEN/ELSE. The IF operator stops checking conditions as soon as it finds a “true” result. If it is possible, arrange your tests so that the most likely to be true are evaluated first.

proc ds2;

data showIF (overwrite=yes);

DECLARE double  dResult;

method run();

    DECLARE integer i;

    do i = 1 to 110;

       dResult =     IF i <=  10  then 1               

                     else   if  i <=  20  then 2       

                     else   if  i <=  30  then 3

                     else   if  i <=  60  then 4

                     else   if  i <= 100 then 5

                     else   NULL;                      

       output;

    end;

end;

enddata;

run;

quit;

   The IF operator assigns a value to dResult.

   else if is used for multiple statements.

   You need a catchall. In this case, NULL is assigned when no conditions are met.

Because the IF operator returns a result, it can be used in more complex expressions, much like a function can.

proc ds2;

data ficoFactor (overwrite=yes);

DECLARE decimal(6,2)  cFactor;

method run();

    set ficoScores;

    cFactor = ficoScore * (IF ficoScore >= 750   then 1.5

                           else if ficoScore >= 700   then 1.1

                           else if ficoScore >= 600   then 1.0

                           else if ficoScore >= 0     then 0.75

                           else NULL);

end;

enddata;

run;

quit;

   The value of ficoScore from the SAS data set ficoFactor is multiplied by the result of the IF operator. No intermediate variables are required.

SELECT Operator

The SELECT operator works like the DATA step’s SELECT statement (which is not to be confused with the SQL SELECT statement). There is one exception—like the IF operator, the SELECT operator returns a value. Like the DATA step’s SELECT statement, the SELECT operator can be specified with a SELECT expression:

result =  SELECT (grade)

when (‘A’) ‘Honours’

when (‘B’) ‘Very Good’

when (‘C’) ‘Just Pass’

when (‘D’) ‘Needs Remedial Help’

otherwise ‘Fail’

end;

In this code, each when clause is an exact (true or false) match. When used without a SELECT expression, the when clauses have expressions that are evaluated. The SELECT operator code can be rewritten as follows:

proc ds2;

data ficoFactor (overwrite=yes);

DECLARE decimal(6,2)  cFactor;

method run();

    set ficoScores;

    cFactor = ficoScore *

            (SELECT

                when( ficoScore >= 750 )  1.5

                when( ficoScore >= 700 )  1.1

                when( ficoScore >= 600 )  1.0

                when( ficoScore >= 0   )  0.75

                otherwise NULL

            END);

end;

enddata;

run;

quit;

Just like the IF operator, the SELECT operator stops checking conditions as soon as it finds a “true” result.

After some non-rigorous testing consisting of a few runs of the ficoFactor data (which has 516,233,127 rows), there were negligible differences in CPU time and elapsed time (less than one second) between the IF operator and the SELECT operator.

Arrays

PROC DS2 supports both temporary arrays and variable arrays. As in the DATA step, all elements of the array must be of same data type.

Temporary Arrays

As the name indicates, temporary array elements are not written to the PDV and consequently are not written to an output data set. In this sense, they operate like a DS2 local variable. A temporary array can be defined with a local or global scope. However, even when the temporary array is declared with a global scope, the elements of a temporary array are not written to the PDV or saved to the output data set.

In a DATA step, you use _TEMPORARY_ when you want a temporary array. This is not necessary or allowed in DS2. In DS2, to create a temporary array, you use a DECLARE statement:

DECLARE <data type> arrayName[dim]  [HAVING…];

The difference between a scalar declare and an array declare is the addition of the array dimension [dim]. Here is how to declare an array called months with 12 integers:

DECLARE integer months[12];

You do not need to specify _TEMPORARY_. The HAVING clause is optional. Because the elements of a temporary array are not written to the PDV, for a temporary array, the HAVING clause is primarily used for documentation purposes.

Variable Arrays

As the name indicates, variable arrays are based on DS2 global variables, which means the VARARRAY statement must have global scope (that is, outside the method). To declare a variable array, you use a variation of the declare statement called VARARRAY:

VARARRAY <data type> arrayName[dim] [variable list] [HAVING…];

Here is how to declare a variable array called months with 12 integers:

VARARRAY integer months[*] month1 – month12;

This automatically creates 12 new integer variables, named month1 through month12, if they do not exist in the input data.

Deferred Dimensioning

DS2 offers a new (and welcomed) way to dimension an array—deferred dimensioning. Unlike in the DATA step, where you must explicitly define the dimensions of an array either by explicitly stating the number of elements or by explicitly listing the variables that form the array elements, DS2 can defer dimensioning. That is, dimensioning the array is deferred until the entire PROC DS2 program is parsed and all possible array elements are identified. Deferred dimensioning enables you to dimension one array based on the dimension of another array or based on variables declared later in the PROC DS2 program.

Let’s create a data set and explicitly define the array elements as you would in a DATA step:

proc ds2;

data in (overwrite=yes);

     vararray int    ax[*] x1-x3;

     vararray double months[*] jan  mar  may  jul  sep  dec;

     vararray double month[*]  mon1 mon3 mon5 mon7 mon9 mon12;

     method run();

         declare int i;

         do i = 1 to dim(ax);

            ax[i] = i;

         end;

         do i = 1 to dim(months);

             months[i] = i + 5;

             month[i]  = i + 10;

         end;

         output;

         put ax[*]=;

         put months[*]=;

     end;

enddata;

run;

quit;

Read in the data and create arrays with deferred dimensioning:

proc ds2;

data array (overwrite=yes);

    vararray double x[*] x:;  

    vararray int out[dim(x)];  

    vararray double months[*] mon:;  

    vararray int nmonth[*] jan feb mar apr may jun jul aug sep oct nov dec;  

    declare double tx[dim(months)] ;  

    method run();

      set in;  

end;

enddata;

run;

quit;

   DS2 creates an array named x with variables that begin with x. Because no variables have been declared, the vararray statement is deferred. The vararray x is not dimensioned immediately. If there was a HAVING clause, it would be deferred, too.

   DS2 creates an array named out with the same number of variables as the x array. Its variables are named out1, out2, …outn. Because x has yet to be dimensioned, the array out is not dimensioned immediately.

   DS2 creates an array named months with variables that begin with month. Because no variables have been declared, the vararray statement is deferred. The vararray months is not dimensioned immediately.

   The nmonth array is dimensioned immediately because all elements are specified.

   A temporary array is declared with the same number of elements as the months array.

   The data set in is opened. Here is what DS2 sees:

•   x1 to x3. The vararray x statement is completed and x is dimensioned to 3 elements. If there was a HAVING clause, it would be processed.

•   mon1 mon3 mon5 mon7 mon9 mon12. The vararray months statement is completed and months is dimensioned to 6 elements. If there was a HAVING clause, it would be processed.

Now that the array x dimension is known, the vararray out statement is completed. The vararray out is dimensioned to 3 elements.

Now that the array months dimension is known, the declare tx statement is completed. The array tx is dimensioned to 6 elements.

Deferred dimensioning simplifies processes where data are denormalized. For example, every day a new column is added to the data to represent the sales for that day. Each time the input data are read, the arrays are automatically dimensioned, meaning there is no need to change the program and no need to query the input table to determine the array dimensions.

Array Dimensions

Both temporary and variable arrays can be multi-dimensioned. Furthermore, the array bounds can be specified in PROC DS2 as they can in the DATA step:

proc ds2;

data test (overwrite=yes);

   vararray double x[*]   x: ;    

   vararray double dx2[4,2] x: ;    

   vararray double dx3[2011:2014, 5:6] x:;   

   method run();

        set in;

   end;

enddata;

run;

quit;

   A one-dimension array is created.

   A two-dimension array is created.

   A two-dimension array with specified row and column indices is created.

In the previous example, all three arrays have a total of 8 elements (based on the input table created).

Array Assignment

One array can be assigned to another array using the := operator.

dx2 := x;

dx3 := x;

If both arrays are the same data type, then an element-by-element transfer is made. If the arrays are different data types, DS2 uses the rules of data conversion to do the assignment. In the following example, eight double variables are created in one step. In the next step, they are read into an array of doubles. That array is then assigned to two different arrays.

proc ds2;

data in (overwrite=yes);

     vararray double    ax[*] x1-x8;

     method run();

         declare int i;

         do i = 1 to dim(ax);

            ax[i] = i + 0.88;

         end;

         output;

         put ax[*]=;

     end;

enddata;

run;

quit;

proc ds2;

data test (overwrite=yes);

   vararray double x[*]   x:;

   vararray int dx2[4,2];

   vararray char(10) dx3[2011:2014, 5:6];

   method run();

        set in;

        put 'x';

        put x[*]=;

        dx2 := x;

        put;

        put 'dx2';

        put dx2[*]=;

        dx3 := x;

        put;

        put 'dx3';

        put dx3[*]=;

   end;

enddata;

run;

quit;

Here are the results:

x

x[1]=1.88 x[2]=2.88 x[3]=3.88 x[4]=4.88 x[5]=5.88 x[6]=6.88 x[7]=7.88 x[8]=8.88

dx2

dx2[1,1]=1 dx2[1,2]=2 dx2[2,1]=3 dx2[2,2]=4 dx2[3,1]=5 dx2[3,2]=6 dx2[4,1]=7 dx2[4,2]=8

dx3

dx3[2011,5]=1.88       dx3[2011,6]=2.88       dx3[2012,5]=3.88       dx3[2012,6]=4.88

dx3[2013,5]=5.88       dx3[2013,6]=6.88       dx3[2014,5]=7.88       dx3[2014,6]=8.88

PROC DS2 follows the DATA step convention of row major assignment. In other words, in a 4x2 array, the eight values map into one-dimension array elements as follows:

1 2

3 4

5 6

7 8

Keep the following facts in mind when you are assigning uneven- sized arrays to one another:

•   If the left side has fewer elements, only that number of elements are copied.

•   If the left side has more elements, the elements from the right side are copied and the remaining elements on the left side are set to missing.

proc ds2;

data test (overwrite=yes);

   vararray double x[*]   x:;

   vararray int dx2[4,2];

   vararray char(10) dx3[2011:2014, 5:6];

   declare double tx[3] ;

   method run();

        set in;

        put 'x';

        put x[*]=;   

        dx2 := x;

        put;

        put 'dx2';

        put dx2[*]=;  

        dx3 := x;

        put;

        put 'dx3';

        put dx3[*]=;   

        tx := (1. 2. 3.);    

        dx3 := tx;    

        put;

        put 'dx3';

        put dx3[*]=;  

   end;

enddata;

run;

quit;

Here are the results:

X    

x[1]=1.88 x[2]=2.88 x[3]=3.88 x[4]=4.88 x[5]=5.88 x[6]=6.88 x[7]=7.88 x[8]=8.88

dx2  

dx2[1,1]=1 dx2[1,2]=2 dx2[2,1]=3 dx2[2,2]=4 dx2[3,1]=5 dx2[3,2]=6 dx2[4,1]=7 dx2[4,2]=8

dx3       

dx3[2011,5]=1.88      dx3[2011,6]=2.88     dx3[2012,5]=3.88       dx3[2012,6]=4.88

dx3[2013,5]=5.88      dx3[2013,6]=6.88     dx3[2014,5]=7.88       dx3[2014,6]=8.88

dx3  

dx3[2011,5]=1         dx3[2011,6]=2         dx3[2012,5]=3         dx3[2012,6]=

dx3[2013,5]=           dx3[2013,6]=           dx3[2014,5]=           dx3[2014,6]=

   Vararray x has eight floating-point elements.

   Vararray dx2 has eight integer elements. Note that the assignment dx2 := x caused the elements from x to be truncated.

   Vararray dx3 has eight char(10) elements. Note that the assignment dx3 := x coerced the floating-point elements to be the character representation.

   The temporary array tx is assigned three floating-point values.

   The temporary array tx has only three elements. These are assigned to the first three elements of dx3. The remaining five elements are set to missing.

   Vararray dx3 has eight elements. The first three are values from the tx array, and the last five are missing as a result of the dx3 := tx assignment.

In this example, after the array tx2 was assigned to dx3 (with dx3 := tx), the last five elements of dx3 are missing.

Missing Values and NULL Values

For most DATA step programmers, SAS missing values and NULL values are the same. In terms of the DATA step programs in this book, missing and NULL values act the same but are fundamentally different. A SAS missing value actually has a value (for a number, it can be up to 28 different values). From the lowest to highest value, here are the values:

Value Description
._ Dot underscore. The smallest numeric missing value.
. Dot. The standard numeric missing value
.A - .Z Dot A through dot Z.

For a character variable, a missing value is an empty string or blank string.

You can use missing values to convey valuable information. But, you wouldn’t want a missing value to take part in any statistical calculation. For example, in a medical study, the third-week follow-up visit for some patients might not be required. Rather than code the third-week visit as the standard dot (.), it could be coded as .N for Not Applicable. When you calculate the mean of third-week visits, patients for whom the third-week visit was not applicable are not used. And, you can easily select these patients by filtering using .N. Using these missing values adds value that would otherwise be lost.

The good news is that you still have SAS missing values. The bad news is that they are limited to the SAS character data types containing double and fixed-length.

On the other hand, database systems have NULL values. NULL means that the column has no value. This is fundamentally different from a SAS missing value, which does have a value.

ANSI Mode and SAS Mode

SAS provides two modes for processing data:

•   SAS mode

•   ANSI mode

The difference between these two modes is the way that they handle missing and NULL values when reading from and writing to a database.

SAS Mode

This is the default mode. In SAS mode, when you are writing to a SAS data set, the missing values (both numeric and character) are preserved. When you are writing to an ANSI SQL database, all missing numeric values (…, .A-.Z) are converted to NULL. A missing character value is stored as a blank.

ANSI Mode

In ANSI mode, all SAS missing numeric values are converted to NULL when both reading and writing. A missing character value is converted to a blank, which is a known value. When processing a SAS table in ANSI mode, information based on missing values can be lost.

To process in ANSI mode, use the ANSIMODE option in PROC DS2:

PROC DS2 ANSIMODE;

Testing for Missing or NULL Values

One way to help you understand the fundamental difference between a SAS missing value and a NULL value is to test for a value. Suppose you submit this code:

if varName = . then do;

Because . is a value, this test returns a 1 (was equal) or a 0 (was not equal). Suppose you submit this code:

if varName = NULL then do;

You will never get a true result because NULL is nothing, and nothing cannot match anything.

Both of these statements have a problem, and they can both be easily fixed. Recall the first statement:

if varName = . then do;

This statement should be changed:

if missing(varName) then do;

This capture all of the missing values and it captures NULL values. If you want to only get results for the one missing value, then you do not need to change the code!

Recall the second statement:

if varName = NULL then do;

This statement should be changed:

if null(varName) then do;

If you want only missing values and not NULL values, then submit the following statement:

if missing(varName) and not (null(varName)) then do;

Although this statement seems confusing, keep in mind that SAS missing values affect only two SAS data types—DOUBLE and CHAR. All other data types can have only NULL values. In addition, missing values are SAS only. Once the data leave SAS and are written to an SQL table, all of the missing values become NULL values. Similarly, because SAS data tables do not support NULL values, all NULL values are converted to SAS missing values.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset