In this chapter we discuss Python “types” along with string and numeric formatting. Python has different “types,” some of which are built into the language, some of which are added by third parties, and some of which are created by Python code. These types can represent all kinds of data and allow Python to be used as a general-purpose programming language in addition to its use as a data processing language. As a general rule, SAS programmers need not concern themselves with data types. This is because the data model used to store variables in Foundation SAS datasets (.sas7bdat) is a simple one. The data “types” for SAS are either numeric or character. Internally SAS uses floating-point representation for numeric values.
The SAS language handles a tremendous amount of detail without user intervention when reading and writing numeric or character data. For example, SAS informats (both SAS-supplied and user-defined ones) provide the mappings for reading various data types for numeric and character data inputs. Similarly, SAS formats provide the mappings needed to write various data types.
We begin by examining numerics, which include the Boolean type and the operators used for truth testing. This is followed by an overview of what Python refers to as strings and SAS refers to as character variables. Finally, we discuss how Python formats both numerics and strings.
Numerics
Python has three distinct numeric types:
1. Integers
2. Floating point
3. Complex numbers
In addition, Booleans are a subtype of integers and are discussed in detail later in this chapter. Complex numbers will not be discussed as they are outside the scope of this book. In Python, numbers are either numeric literals or created as the result of built-in operators or functions. Any numeric literal containing an exponent sign or a decimal point is mapped to the floating-point type. Whole numbers, including hexadecimal, octal, and binary numbers, are mapped to the integer type.
Mixed Types
In this example, x is an integer and y is a float. The product of x and y is assigned to z, which Python then casts as a float. This illustrates the rule Python uses for mixed arithmetic operations: the operand with the narrower type is widened to the type of the other operand. This example also neatly illustrates the compactness of Python as a language. Similar to the SAS language, there is no need to declare variables and their associated data types, as they are inferred from their usage.
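The mixed-type rule can be sketched as follows. The values are illustrative assumptions, not the original listing.

```python
x = 1        # whole-number literal, mapped to int
y = 1.5      # literal with a decimal point, mapped to float
z = x * y    # mixed arithmetic: the int operand is widened to float

print(type(x), type(y), type(z))
print(z)

# Hexadecimal, octal, and binary literals are integers;
# a literal with an exponent sign is a float.
print(type(0x1F), type(0o17), type(0b101), type(1e3))
```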
SAS Data Types
You are not likely to encounter issues related to data type differences when working with SAS and Python. For the most part, issues related to mapping data types arise when reading data from external environments, particularly with relational databases. We will discuss these issues in detail in Chapter 6, “pandas Readers and Writers.”
Python Operators
Similar to SAS, the Python interpreter permits a wide range of mathematical expressions and functions to be combined. Python’s expression syntax is very similar to the SAS language, using the operators +, -, *, and / for addition, subtraction, multiplication, and division, respectively. And like SAS, parentheses (()) are used to group operations for controlling precedence.
Python Arithmetic Operations Precedence
Precedence | Operation | Results |
---|---|---|
1 | x ** y | x to the power y |
2 | divmod(x, y) | The pair (x // y, x % y) |
3 | float(x) | x converted to floating point |
4 | int(x) | x converted to integer |
5 | abs(x) | Absolute value of x |
6 | +x | x unchanged |
7 | -x | x negated |
8 | x % y | Remainder of x / y |
9 | x // y | Floored quotient of x and y |
10 | x * y | Product of x and y |
11 | x / y | Quotient of x and y |
12 | x - y | Difference of x and y |
13 | x + y | Sum of x and y |
Rows are listed from highest to lowest priority. In practice, %, //, *, and / share a single precedence level and are evaluated left to right, as do - and +. The entries divmod(), float(), int(), and abs() are function calls rather than operators.
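A few of these operations in action, as a minimal sketch with illustrative values:

```python
x, y = 7, 2

power = x ** y          # 49
pair = divmod(x, y)     # (3, 1), the same as (x // y, x % y)
floored = x // y        # 3
remainder = x % y       # 1
quotient = x / y        # 3.5, true division always returns a float

# ** binds more tightly than unary minus, so this is -(7 ** 2)
negated_power = -x ** y

# Multiplication binds more tightly than addition unless parentheses intervene
no_parens = 2 + 3 * 4       # 14
with_parens = (2 + 3) * 4   # 20

print(power, pair, floored, remainder, quotient)
print(negated_power, no_parens, with_parens)
```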
Boolean
Boolean Value Tests for 0 and 1
In contrast, SAS does not have a Boolean data type. As a result, SAS Data Step code is often constructed as a series of cascading IF-THEN/DO blocks used to perform Boolean-style truth tests. SAS does have implied Boolean test operators, however. An example is the END= variable option on the SET statement. This feature is used as an end-of-file indicator when reading a SAS dataset. The value assigned to the END= variable option is initialized to 0 and set to 1 when the SET statement reads the last observation in a SAS dataset.
Other SAS functions also use implied Boolean logic. For example, the FINDC function used to search strings for characters returns a value of 0, or false, if the search excerpt is not found in the target string. Every Python object can be interpreted as a Boolean and is either True or False.
The following values evaluate False:
- None
- False
- 0 (for integer, float, and complex)
- Empty strings
- Empty collections such as "", (), [], {}
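The built-in bool() function makes these truth values visible. The following sketch also shows that True and False behave as the integers 1 and 0.

```python
falsy_values = [None, False, 0, 0.0, 0j, "", (), [], {}]
results = [bool(value) for value in falsy_values]
print(results)   # every entry is False

# Booleans are a subtype of integers
print(isinstance(True, int))    # True
print(True == 1, False == 0)    # True True
print(True + True)              # 2
```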
Comparison Operators
Python Comparison Operations
Operation | Meaning |
---|---|
< | Strictly less than |
<= | Less than or equal |
> | Strictly greater than |
>= | Greater than or equal |
== | Equal |
!= | Not equal |
is | Object identity |
is not | Negated object identity |
Python Equivalence Test
In this example, x is assigned the value 32.0 and y is assigned 32. Lines 3 through 6 illustrate the Python IF/ELSE construct. As one expects x and y evaluate to the same arithmetic value. Note Python uses == to test for equality in contrast to SAS which uses =.
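A sketch of the equivalence test, reconstructed from the description above rather than copied from the original listing. The final lines contrast == (equal values) with is (same object).

```python
x = 32.0
y = 32
if x == y:
    result = "True"
else:
    result = "False"
print(result)

# == compares values; is compares object identity
same_value = x == y    # True: 32.0 and 32 are arithmetically equal
same_object = x is y   # False: a float object and an int object are distinct
print(same_value, same_object)
```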
Python IS Comparison
Python IS Comparison 2
Boolean Tests for Empty and Non-empty Sets
The first Boolean test returns False because the string is empty or null. The second Boolean test returns True. This is a departure from how SAS handles missing character variables. In SAS, zero or more whitespaces (ASCII 32) assigned to a character variable is considered a missing value.
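The difference can be sketched as follows. The use of strip() to mimic the SAS treatment of blanks is an assumption added for illustration.

```python
empty = ""
blank = " "

print(bool(empty))   # False: the string has no characters
print(bool(blank))   # True: a single space is still a character

# One way to treat blank-only strings as missing, similar to SAS
is_missing = blank.strip() == ""
print(is_missing)    # True
```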
Chapter 3, “pandas Library,” goes into further detail on missing value detection and replacement.
Boolean Chained Comparisons
Boolean Chained Comparisons 2
In Listing 2-9, 10 < x evaluates True and x < 20 evaluates False making the expression False.
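A chained comparison is equivalent to joining the individual comparisons with and. In this sketch the value of x is assumed to be 20, which is consistent with the result described above.

```python
x = 20

chained = 10 < x < 20            # False, because x < 20 fails
expanded = 10 < x and x < 20     # the equivalent explicit form
print(chained, expanded)

x = 15
print(10 < x < 20)               # True: both comparisons hold
```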
Python Numeric Inequality
SAS Numeric Inequality
Using a _NULL_ Data Step, variables x and y are assigned numeric values 2 and 3, respectively. The IF-THEN/ELSE statement is used along with a PUT statement to write to the SAS log. Since the value 2 does not equal 3, the inequality test with ^= evaluates true and ‘True’ is written. The ELSE condition is not executed.
Boolean String Equality
This Boolean comparison returns False since the first character in object s1 is “S” and the first character in object s2 is “s”.
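Both comparisons can be sketched in Python with the values used in the SAS examples. The case-insensitive comparison at the end is an addition for illustration.

```python
x = 2
y = 3
print(x != y)      # True: 2 does not equal 3

s1 = "String"
s2 = "string"
print(s1 == s2)    # False: "S" and "s" are different characters

# Normalizing case first makes the comparison case-insensitive
print(s1.lower() == s2.lower())   # True
```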
SAS String Equality
Using a _NULL_ Data Step, the variables s1 and s2 are assigned the character values ‘String’ and ‘string’, respectively. The IF-THEN/ELSE statement is used along with a PUT statement to write to the SAS log. Since the character variable s1 value of ‘String’ does not match the character variable s2 value of ‘string’, the IF statement evaluates false. The ELSE statement is executed, resulting in ‘False’ written to the SAS log.
IN/NOT IN
IN and NOT IN Comparisons
The in operator evaluates True if a specified sequence is found in the target string; otherwise, it evaluates False. The not in operator evaluates False if the specified sequence is found in the string; otherwise, it evaluates True.
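A minimal sketch of both operators, using an illustrative string:

```python
s = "Hello World"

found = "World" in s          # True: the sequence occurs in s
missing = "world" in s        # False: the match is case-sensitive
negated = "world" not in s    # True: the sequence does not occur in s

print(found, missing, negated)
```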
AND/OR/NOT
Python Boolean Operations Precedence
Precedence | Operation | Results |
---|---|---|
1 | not x | If x is false, then True; False otherwise. |
2 | x and y | If x is false, its value is returned; otherwise, y is evaluated and the resulting value is returned. |
3 | x or y | If x is true, its value is returned; otherwise, y is evaluated and the resulting value is returned. |
The operator not yields True if its argument is false; otherwise, it yields False.
The expression x and y first evaluates x; if x is False, its value is returned; otherwise, y is evaluated and the resulting value is returned.
The expression x or y first evaluates x; if x is True, its value is returned; otherwise, y is evaluated and the resulting value is returned.
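Because and and or return one of their operands rather than a strict True or False, the results can be inspected directly, as in this sketch:

```python
print(0 and 5)    # 0: the first operand is false, so it is returned
print(3 and 5)    # 5: the first operand is true, so the second is returned
print(0 or 5)     # 5: the first operand is false, so the second is returned
print(3 or 5)     # 3: the first operand is true, so it is returned
print(not 0)      # True: not always yields a Boolean

# A common idiom: fall back to a default when a value is empty
name = "" or "unknown"
print(name)
```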
Boolean AND/OR Precedence
The Boolean and operator has a higher precedence than or. In the first example, True and False evaluates False. Therefore, the second evaluation becomes False or True, which evaluates True. The second example in Listing 2-15 illustrates the use of parentheses to further clarify the Boolean operation. Parentheses have the highest precedence order.
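The following sketch shows the grouping Python applies. The third and fourth expressions are added to illustrate a case where parentheses change the result.

```python
first = True and False or True       # grouped as (True and False) or True
second = (True and False) or True    # parentheses make the grouping explicit
third = True or False and False      # grouped as True or (False and False)
fourth = (True or False) and False   # parentheses override the default grouping

print(first, second, third, fourth)
```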
Python Boolean and
SAS Boolean AND Operator
The FINDC function searches the character variable s3 left to right for the character ‘r’. This function returns the location of the first occurrence of the character ‘r’, in this case, position 6. This causes the first half of the IF predicate to evaluate true. Following AND is the second half of the IF predicate, which uses the FINDC function to search for a blank character (ASCII 32), found at position 7. This predicate also evaluates true. Since both IF predicates evaluate true, the statement following THEN executes and writes ‘True’ to the SAS log.
Python Boolean or
SAS Boolean OR
The FINDC function searches the character variable s4 left to right for the character ‘y’. This function returns the location of the first occurrence of the character ‘y’, in this case, position 6. This causes the first half of the IF predicate to evaluate true. Since the first IF predicate evaluates true, the statement following THEN executes and writes ‘True’ to the SAS log. The ELSE statement is not executed.
Numerical Precision
Boolean Equivalence
Contents of List x
Python Sum Operation
As an aside, you can see another example illustrating the simplicity of Python. There are no variables to declare and in this case no expressions or assignments made.
The explanation for these results is how floating-point numbers are represented in computer hardware as base 2 fractions. And as it turns out, 0.1 cannot be represented exactly as a base 2 fraction. It is an infinitely repeating fraction.
Fortunately, there are straightforward remedies to this challenge. Similar to SAS, the Python Standard Library has a number of built-in numeric functions such as round(). A built-in function means it is available to the Python interpreter and does not require the importing of any additional packages.
Python’s round() function returns a number rounded to a given precision after the decimal point. If the number of digits after the decimal is omitted from the function call or is None, the function returns the nearest integer to its input value.
Python round Function
The print() function contains a Boolean comparison operation similar to the one found in Listing 2-20. Without the use of the built-in round() function, this Boolean equivalency test returns False.
The round() function rounds the total object to the nearest integer. In contrast to the line above it, here the Boolean equality operator == returns True since 0.999… has been rounded to the integer value of 1.
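The behavior can be reproduced with a short sketch. It accumulates the values in an explicit loop, mirroring the SAS DO loop, because newer Python releases may apply a more accurate summation inside the built-in sum() function and return exactly 1.0.

```python
x = [0.1] * 10     # ten copies of 0.1

total = 0.0
for value in x:
    total += value

print(total)                # 0.9999999999999999
print(total == 1)           # False
print(round(total) == 1)    # True: rounded to the nearest integer
print(round(total, 3))      # 1.0: rounded to three decimal places
```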
SAS Round Function
This SAS program uses a DO/END loop to accumulate values into the total variable. The inc variable, set to a numeric value of 0.1, is a stand-in for the items in the Python list. The first IF statement on line 15 performs a comparison of the accumulated values into variable total with the numeric variable one having an integer value of 1. Similar to the Python example, the first half of this IF predicate (.999…= 1) evaluates false and the second half of the IF predicate executes indicating the comparison is false.
The IF statement on line 18 uses the ROUND function to round the variable total value (.999…) to the nearest integer value (1). This IF predicate now evaluates true and writes to the log. Line 19 does not execute.
The last line of the program writes the value of the variable total using the SAS-supplied 8.3 format which displays the value 1.000. The internal representation for the variable total remains .9999999999.
Strings
In Python, a string is an ordered sequence of Unicode characters. Strings are immutable, meaning they cannot be updated in place. Any method applied to a string, such as replace() or split(), returns a new object rather than modifying the original string. Strings are enclosed in either single quotes (') or double quotes (").
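Immutability can be seen in a brief sketch using an illustrative string:

```python
s = "Hello World"

t = s.replace("World", "Python")   # returns a new string
u = s.upper()                      # also returns a new string
print(s)    # the original is unchanged
print(t)
print(u)

# Assigning to a position in the string is not allowed
try:
    s[0] = "J"
except TypeError as err:
    error_message = str(err)
    print("TypeError:", error_message)
```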
If a string needs to include quotes as a part of the string literal, then backslash (\) is used as an escape character. Alternatively, like SAS, one can use a mixture of single quotes (') and double quotes ("), assuming they are balanced.
Python String Assignment and Concatenation
Python uses the plus symbol (+) for string concatenation. SAS provisions an extensive set of functions for string concatenation operations to provide finer controls for output appearances.
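Quoting and concatenation can be sketched together. The string values are illustrative.

```python
escaped = 'It\'s a string'    # backslash escapes the embedded single quote
mixed = "It's a string"       # double quotes enclose a single quote
print(escaped == mixed)       # True: both spell the same string

s1 = "Python"
s2 = "for SAS users"
s3 = s1 + " " + s2            # + concatenates strings
print(s3)
```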
SAS Character Assignment and Concatenation
Python upper() Method
SAS UPCASE Function
Python Multiline String
Observe how three consecutive single quotes (''') are needed to define a multiline string. A triple-quoted string, the same form used for a Docstring, preserves the spacing and line breaks in the string literal.
Python count Method
The count() method illustrated in the preceding example is one of a number of methods used by Python for sophisticated string manipulation tasks. You can think of these methods as being similar to SAS functions. In the case of Python, methods are associated with and act upon a particular object. In the preceding example, s7 is a string object used to hold the value of the Docstring. For the remainder of this book, we use the Python nomenclature object rather than variable when referring to code elements assigned values.
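A sketch of a multiline string and the count() method follows. The text assigned to s7 is an assumption chosen for illustration.

```python
s7 = '''Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.'''

print(s7)                    # line breaks are preserved on output
print(s7.count("better"))    # 3 occurrences of the substring
print(s7.count("\n"))        # 2 embedded newline characters
```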
The methods available for the built-in string object are found in the Python Standard Library 3.7 documentation at https://docs.python.org/3/library/stdtypes.html#string-methods.
String Slicing
Python uses indexing on a number of different objects, with similar behavior from one object type to the next. With a sequence of characters (a string), Python automatically creates an index with a start position of zero (0) for the first character in the sequence and increments to the end position of the string (length - 1). The index can be thought of as a one-dimensional array.
Python string slicing is a sophisticated form of parsing. By indexing a string using offsets separated by colons, in the form string[start:stop:step], Python returns a new object identified by these offsets:
- start identifies the lower-bound position of the string, which is inclusive.
- stop identifies the upper-bound position of the string, which is non-inclusive.
- step indicates every nth item, with a default value of one (1).
- Negative index values count from right to left.
Seeing a few examples will help to clarify.
At times you may find it easier to refer to characters toward the end of the string. Python provides an “end-to-beginning” indexer with a start position of –1 for the last character in the string and decrements to the beginning position.
Python Sequence Indexing
Character | H | e | l | l | o | (space) | W | o | r | l | d |
---|---|---|---|---|---|---|---|---|---|---|---|
Index value | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
Index value R to L | -11 | -10 | -9 | -8 | -7 | -6 | -5 | -4 | -3 | -2 | -1 |
A number of SAS character handling functions have modifiers to enable scanning from right to left as opposed to the default behavior of scanning left to right.
Python String Slicing, Example 1
In this example, the index contains no colon, so the single value 0 is treated as the position to return, which is the first letter of the string.
SAS SUBSTR Function
In contrast to Python, with an index start position of 0, SAS uses an index start position of 1. The SAS SUBSTR function scans the character variable s (first argument), starts at position 1 (second argument), and extracts 1 character position (third argument).
Python String Slicing, Example 2
In other words, in this example, index position 5 maps to the whitespace (blank) separating the two words and is not returned.
Python String Slicing, Example 3
When the index start value is greater than the length of the sliced sequence, Python does not raise an error; rather, it returns an empty (null) string. Recall from the preceding discussion on Boolean comparisons that empty objects evaluate False.
Python String Slicing, Example 4
When the index stop value is greater than the length of the sliced sequence, then the entire sequence is returned.
Python String Slicing, Example 5
Since the stop index position is not inclusive, the last character in the sequence is not included.
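The slicing behaviors described so far can be collected in one sketch. It assumes the string 'Hello World' from the indexing table; the specific offsets are illustrative and may differ from the original listings.

```python
s = "Hello World"

first_char = s[0]        # 'H': a single index returns one character
first_word = s[:5]       # 'Hello': index 5, the blank, is not included
past_end = s[20:]        # '': a start beyond the end returns an empty string
whole = s[:20]           # 'Hello World': a stop beyond the end returns everything
drop_last = s[:-1]       # 'Hello Worl': the stop position is non-inclusive
every_other = s[::2]     # 'HloWrd': a step of 2 takes every second character

print(first_char, first_word, repr(past_end), whole, drop_last, every_other)
print(bool(past_end))    # False: the empty string evaluates False
```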
Python String Slicing, Example 6
Python String Slicing, Example 7
With the first slice operation, because there is a single index value, it defaults to the start value. With a negative value, the slice operation begins at the end of the sequence and proceeds right to left decrementing the index value by 1 (assuming the step value remains the default value of 1).
In the second slice operation, a negative start value larger than the sequence length to be sliced is out of range and therefore raises an IndexError.
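Negative indexing can be sketched the same way, again assuming the string 'Hello World'. Note that a single out-of-range index raises IndexError, whereas an out-of-range slice is quietly clamped to the ends of the string.

```python
s = "Hello World"

last_char = s[-1]      # 'd': -1 is the last character
last_word = s[-5:]     # 'World': the final five characters

# A single index outside the range -11..10 raises IndexError
try:
    s[-12]
    raised = False
except IndexError:
    raised = True

clamped = s[-12:]      # a slice with the same offset returns the whole string

print(last_char, last_word, raised, clamped)
```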
Python String Quoting
Formatting
In the day-to-day work of data analysis, a good deal of energy is devoted to formatting of numerics and strings in our reports and analysis. We often need values used in our program output formatted for a more pleasing presentation. This often includes aligning text and numbers, adding symbols for monetary denominations, and mapping numerics into character strings. In this section we introduce the basics of Python string formatting. Throughout the rest of the book, we will encounter additional examples.
In the preceding examples, we saw illustrations of basic string manipulation methods. Python also provides methods and syntax for formatting strings.
Formatting Strings
Formatting Python strings involves defining a string constant containing one or more format codes. The format codes are fields to be replaced, enclosed by curly braces ({ }). Anything not contained in a replacement field is considered literal text, which is unchanged on output. The format arguments to be substituted into the replacement fields can be either keyword arguments (e.g., {gender}) or positional arguments (e.g., {0}, {1}).
Format Method with a Positional Argument
The argument “Female” from the format() method is substituted into the replacement field designated by {0} contained inside the string constant literal text. Also notice the use of the backslash (\) to escape the single quote to indicate a possessive apostrophe for the string literal ‘subject’.
Format Method Specification
In Listing 2-41, the format specification in the replacement field uses the alignment option {0:>10} to force the replacement field to be right aligned with a width of ten characters. By default the field width is the same size as the string used to fill it. In subsequent examples we use this same pattern for format specifications to control the field width and appearances of numerics.
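A sketch of both forms follows. The sentence text is reconstructed from the description above and is an assumption, not the original listing.

```python
# Positional argument substituted into {0}; \' escapes the apostrophe
plain = 'The subject\'s gender is {0}'.format("Female")
print(plain)

# {0:>10} right-aligns the value in a field ten characters wide
aligned = 'The subject\'s gender is {0:>10}'.format("Female")
print(aligned)

field_only = '{0:>10}'.format("Female")
print(repr(field_only), len(field_only))
```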
Format Method with Positional Arguments
Format Method with Keyword Arguments
Combining Format Method Keyword and Positional Arguments
Notice that when combining positional and keyword arguments in a call to format(), positional arguments must be listed first, followed by keyword arguments. The replacement fields inside the string may appear in any order.
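The three argument styles can be sketched side by side. The strings and values are illustrative assumptions.

```python
# Positional arguments only
positional = "Scale {0} runs from {1} to {2}".format("ratings", 1, 5)

# Keyword arguments only
keyword = "Scale {name} runs from {low} to {high}".format(name="ratings", low=1, high=5)

# Combined: the positional argument precedes the keyword arguments in the call,
# although the fields may appear in any order inside the string
combined = "From {low} to {high} on the {0} scale".format("ratings", low=1, high=5)

print(positional)
print(keyword)
print(combined)
```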
Beginning with Python 3.6, formatted string literals or f-strings were introduced as an improved method for formatting. f-strings are designated with a preceding f and curly braces containing the replacement expression. f-strings are evaluated at runtime allowing the use of any valid expression inside the string. Consider Listing 2-45.
In this example, the formula for calculating the area of a circle is enclosed within a set of curly braces ({ }). At execution time, the results are calculated and printed as a result of calling the print( ) function.
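An f-string of this kind might look like the following sketch. The radius value and the .2f format specification are assumptions added for illustration.

```python
import math

radius = 4

# The expression inside the braces is evaluated at runtime;
# :.2f limits the result to two decimal places
message = f"Area of a circle with radius {radius}: {math.pi * radius ** 2:.2f}"
print(message)
```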
Formatting Integers
The pattern for applying formats to integers is similar to that of strings. The main difference is that the replacement field deals with formatting numeric values. And as indicated previously, some format specifications apply regardless of the data type being formatted. For example, field padding is common to all data types, whereas a comma separator (to indicate thousands) is only applied to integers and floats.
Decimal Right Aligned
In this example, we use a positional argument for the format() method along with the format specification {:>20} to indicate we want the decimal value right aligned with a field width of 20.
Combining Format Specifications
In this example, the format specification {:>10,d} indicates the field is right justified with a width of 10. The ,d part of the specification indicates the digits use a comma as the thousands separator. This example uses a single print() function requiring a new line indicator after the first number in order to see the effect of the alignment.
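Both specifications can be sketched as follows, with illustrative values. The '\n' in the second call places each number on its own line so the alignment is visible.

```python
wide = "{:>20}".format(1234567)     # right aligned in a field of 20
print(wide)

# Right aligned, width 10, comma as the thousands separator
two_lines = "{:>10,d}\n{:>10,d}".format(123, 1234567)
print(two_lines)
```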
Python Displaying Different Base Values
SAS Displaying Different Base Values
This example reads on input the numeric value 99 and uses a PUT statement to write this value to the log using 8., hex2., octal., and binary8. formats.
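The Python counterpart uses the presentation types d, x, o, and b. This is a minimal sketch using the same value, 99.

```python
n = 99

decimal = "{:d}".format(n)      # '99'
hexadecimal = "{:x}".format(n)  # '63'
octal = "{:o}".format(n)        # '143'
binary = "{:b}".format(n)       # '1100011'
binary8 = "{:08b}".format(n)    # zero padded to eight digits

print(decimal, hexadecimal, octal, binary, binary8)
```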
Python Format for Leading 0’s
SAS Format for Leading 0’s
This example uses the SAS-supplied z4. format shown on line 7.
Python Leading Plus Sign
The format specification {:+3d} indicates a preceding plus sign (+) using a field width of 3.
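Leading zeros and a leading sign can be sketched together. The values are illustrative.

```python
zero_padded = "{:04d}".format(42)    # pad with zeros to a width of 4
plus_sign = "{:+3d}".format(1)       # always show the sign, width 3
minus_sign = "{:+3d}".format(-1)     # negative values show a minus sign

print(zero_padded)
print(repr(plus_sign), repr(minus_sign))
```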
SAS Leading Plus Sign
PROC FORMAT is used to create a PICTURE format and is called on line 11.
Formatting Floats
Python Decimal Places
Python Percent Format
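Float formatting follows the same pattern as integers. The sketch below, with illustrative values, controls the number of decimal places with f and displays a proportion as a percentage with %.

```python
two_places = "{:.2f}".format(3.14159)          # round to two decimal places
with_commas = "{:,.2f}".format(1234567.891)    # thousands separator plus decimals
percent = "{:.1%}".format(0.25)                # multiply by 100 and append %

print(two_places, with_commas, percent)
```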
SAS Percent Format
Datetime Formatting
Strictly speaking, datetime is not a Python built-in type; the name refers to the datetime module and its classes for manipulating date and datetime values. These next examples illustrate using the strftime(format) method for date, datetime, and time handling. Python date, datetime, and time objects all support the strftime(format) method, which returns a string representing the date or time. Directives in the format argument control the appearance of that string. In other words, the strftime(format) method constructs strings from the date, datetime, and time objects rather than manipulating these objects directly.
Python Import Datetime
Up to this point, all of the Python examples we have seen are executed using a built-in interpreter. We have not needed to rely on additional Python modules or programs. In order to load other Python programs or modules, we use the import statement. Here, the first line in our example imports the objects datetime, date, and time from the Python module datetime.
In our example we also create the now object on line 2. In our program, the value associated with the now object is like a snapshot of time (assuming we did not execute line 2 again).
Calling the print() function for the now object displays the current date and time at which this program executed.
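The import and the creation of the now object can be sketched as follows. The printed value depends on when the code runs.

```python
from datetime import datetime, date, time

now = datetime.now()    # a snapshot of the current date and time
print(now)
print(type(now))
```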
STRFTIME(format)
The strftime(format) directives are used to control the appearances of the value associated with the now object created in Listing 2-57. The now object holds the datetime returned from the datetime.now() function. These formatting directives are able to parse as well as format and control the appearances of the output.
Also notice the nl object assigned the value '\n' used in this example. This is a new line indicator that lets a single call of the print() function print multiple physical lines of output.
Formatting Directives Used in Listing 2-58
Directive | Meaning |
---|---|
%A | Weekday |
%B | Month |
%d | Day of month |
%Y | Century and year |
%c | Date and time |
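The directives in the table can be sketched with a fixed datetime so the output is repeatable. The date chosen is an assumption, and the weekday and month names shown assume the default English locale.

```python
from datetime import datetime

stamp = datetime(2016, 10, 9, 19, 54, 20)   # fixed value for a repeatable result
nl = '\n'                                   # new line indicator

weekday = stamp.strftime('%A')       # full weekday name
month = stamp.strftime('%B')         # full month name
day = stamp.strftime('%d')           # zero-padded day of month
year = stamp.strftime('%Y')          # four-digit year
date_time = stamp.strftime('%c')     # locale-dependent date and time

print('Weekday: ' + weekday + nl +
      'Month: ' + month + nl +
      'Day: ' + day + nl +
      'Year: ' + year + nl +
      'Date and time: ' + date_time)
```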
SAS Datetime Example
The output for the SAS variable date of 20736 is obviously not the actual date; rather, it represents the number of days since January 1, 1960, which is the epoch start used by SAS. Likewise, the output for the SAS variable now, 71660.228, is a count of seconds. A SAS time value counts seconds from midnight, whereas a SAS datetime value counts seconds from the epoch start.
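As a worked check, the SAS date value can be converted in Python by adding the day count to the SAS epoch. The sketch also assumes that 71660.228 is a time value measured in seconds since midnight.

```python
from datetime import date, timedelta

sas_epoch = date(1960, 1, 1)
sas_date_value = 20736                        # days since the SAS epoch

calendar_date = sas_epoch + timedelta(days=sas_date_value)
print(calendar_date)                          # 2016-10-09

# Assumption: 71660.228 counts seconds since midnight
sas_time_value = 71660.228
hours, remainder = divmod(int(sas_time_value), 3600)
minutes, seconds = divmod(remainder, 60)
print(hours, minutes, seconds)                # 19 54 20
```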
Summary
In this chapter we covered Python types and basic formatting. We understand how Python and SAS variable assignments can be made without declarations and that types need not be declared as they are inferred. We also introduced Python string slicing and formatting. Throughout the remainder of the book, we will build on these concepts.
Up to this point, we have discussed various features from the Python Standard Library related to data analysis. Chapter 3, “pandas Library,” describes the pandas data structure. The pandas library is a “higher-level” capability which makes Python an outstanding language for conducting real-world data analysis tasks.