Chapter 11
Security

¡Señor! ¡Señor! ¡Señor!

Emerging from behind a pile of rubble, having just swung down from a semi-prostrate ramón nut tree (Brosimum alicastrum), I was mid-selfie when a bellowing baritone brashly overtook the still breeze.

The agitated words were undeniably Spanish, but I was too preoccupied with the grandeur of the moment to attempt a translation.

“Que? Como?!” I returned with a perfunctory cry, still focused solely on capturing the panoramas—if no longer the stillness—that the outcrop provided. After all, this was Tikal, Guatemala, famed pre-Columbian Maya metropolis and UNESCO World Heritage Site—and some squawker was not about to ruin my ruins!

Pocketing the camera and reeling around to more bellows, I at last saw the gentleman, a miniature security guard who had worked up quite a sweat to summit the pyramid, apparently in haste. Standing behind the security cordon—the chain-link fence at which he was wildly gesticulating—the guard let loose guttural screeches that began to draw a congregation of onlookers.

As I clambered over, sacrificing vistas to confront the commotion, the situation grew somewhat clearer. I stood on rough, precarious terrain, while the guard and tourists stood on smooth, hewn stone. Yes, there were clear boundary markings at the top of Temple IV but, having scampered up the impassable backside, how was I to know I was out of bounds?

Reaching for what I assumed could only be a concealed machete, the guard pulled a radio from his hip, alerting the authorities that all was in order. At that moment, I hoped that my ascent up the back country at least had been mistaken for a troop of howler monkeys, as I half-leapt, half-crept up the tree-laden terrain, branches shaking and breaking in my wake—but the howlers unfortunately were absent from his radio report.

Now “safely” on the right side of the chain-link, I could see how from his vantage it might appear as though I'd illegally crossed the barrier (giving hope to wide-eyed tourists looking to do the same), but I wasn't about to attempt to explain in Spanish that Tikal needed better signage or security at the bottom of the pyramid.

After enjoying a few moments atop the temple taking in unparalleled views (and fondly reminiscing that my archaeology professor had once told me that the Maya built pyramids solely because mosquitos can't fly that high), I descended the contemporary wooden stairway, the guard mirroring my every step.

Countless security guards roam the expansive Tikal landscape—and thank God for their service. There have been periods in Tikal's distant past where bandidos (bandits) or even troops of howler monkeys have given tourists quite a scare, but Guatemala has invested significant resources in eliminating and mitigating threats to tourists to provide security.

Security aims to protect assets, and at Tikal, assets comprise not only tourists but also the environment, natural habitat, and untold archaeological treasures that may still lie unearthed. Thus, while the guard may have been bellowing because he thought I had intentionally crossed a security cordon, was he protecting me or the decrepit pile of 1,500-year-old rubble I had ascended? It was probably a little of both.

Software security is similarly poised not only to protect software but also to protect the environment in which it operates. While the guard didn't want me to fall and break myself, he was also acting instinctively to protect a site historically ravaged by treasure hunters.

The security cordon and accompanying signage—to which the guard was pointing, albeit the unlettered side that I could not view—acted as security controls, a barrier limiting the interaction between tourists and the environment to protect both. The chain-link and guard didn't eliminate but helped mitigate the risk of falling to one's death. Where threats to software can't be eliminated, software security controls act to reduce or eliminate vulnerabilities that threats might exploit, thus reducing risk.

Security and security controls also aren't intended to be static devices but should mature over time as threats or risks evolve. Each time I've returned to Tikal, there's been at least one additional temple that's been cordoned off and which I'm no longer able to scale due to safety concerns. Software security, too, should continue to incorporate new security measures as additional threats or vulnerabilities are identified, or as the level of acceptable risk changes over time.

Implementing security often comes with a hefty price tag. At Tikal, guard salaries, chain-link fences, signage, ancillary wooden staircases, and lighting all aim to promote security—none of which are free. Stakeholders in software development must also prioritize what combination of security controls will be emplaced to achieve an acceptable level of risk, thus essentially valuating security.

Designing and developing SAS software with security in mind helps stakeholders identify threats, calculate risks, and eliminate or mitigate known vulnerabilities. Similar to security cordons and security guards, SAS best practices can help facilitate secure software execution in a secure environment.

DEFINING SECURITY

Security is “the protection of system items from accidental or malicious access, use, modification, destruction, or disclosure.”1 Information security typically is described using the CIA triad that supports the confidentiality, integrity, and availability of software and resultant data products. Confidentiality ensures that only authorized users have access to software and data and, in SAS environments, is generally maintained through physical security, user authentication, and other access controls. Integrity ensures that software does what it says it will do, through development best practices, testing, and validation. Availability, discussed most extensively in chapter 4, “Reliability,” ensures that software and its components (including external macros, formats, data, and other files) are accessible whenever needed.

Software security is important because of the tremendous investment made to produce software within the software development life cycle (SDLC). After planning, design, development, testing, validation, acceptance, and a possible beta operational phase, stakeholders need to have confidence that software will function correctly when needed. Through integrity principles, stakeholders can be assured that software will not be altered after validation, thus preserving the benefits of quality assurance gained through formalized testing and validation. Moreover, integrity facilitates the development of software that does not damage itself or its environment. Through availability principles, SAS software can more flexibly adapt to locked data sets, missing data, invalid configuration files, and other exceptions that could otherwise cause software failure.

This chapter introduces and applies the CIA information security triad to SAS software development. Because SAS software is commonly executed within end-user development environments protected by access controls at the infrastructure and organizational level, the chapter focuses on integrity and availability rather than confidentiality. Best practices are introduced, such as the generation of checksums for stabilized SAS software, the encapsulation of SAS macros to prevent leakage, and fail-safe process flows. Because of the precisely choreographed interaction that must occur between SAS processes running in parallel, SAS techniques are demonstrated that facilitate secure parallel processing.

CONFIDENTIALITY

Confidentiality is “the degree to which the software product provides protection from unauthorized disclosure of data or information, whether accidental or deliberate.”2 While confidentiality typically refers to securing data, in software development environments that don't produce open-source code, software itself must be secure from penetration. Through compilation and encryption, code is rendered unreadable to users to prevent modification or reverse engineering of proprietary methods and techniques.

Base SAS, however, is much more likely to be utilized in end-user development environments in which software is developed and used by the same cadre of SAS practitioners. Because many of the users of SAS software are themselves—or work in conjunction with—developers, maintaining software confidentiality is typically not a priority. In fact, in data analytic development environments, code transparency is often more desirable than code confidentiality because stakeholders may need to review data modeling, transformation, and analytic techniques. In more stringent environments, such as clinical trials and other medical research, software readability is paramount because code must additionally be audited.

Data confidentiality is more likely to be a concern or requirement in SAS environments than software confidentiality. SAS/SECURE provides both proprietary and industry-recognized encryption methods (e.g., RC2, RC4, DES, TripleDES, AES) which can be implemented manually as well as automatically through organizational policy. Passwords can also be encrypted using the PWENCODE procedure. Starting in SAS 9.4, the SAS/SECURE module is included in Base SAS rather than being licensed separately. For more information on data encryption, please reference Encryption in SAS® 9.4, Fifth Edition.3
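
As a brief illustration of the latter, the following sketch shows how a password can be encoded with the PWENCODE procedure; the password and the METHOD= value shown here are placeholders rather than recommendations:

* encode a password so that plain text never appears in programs or logs;
proc pwencode in='MyPassword123' method=sas003;
run;

The encoded string is written to the SAS log and can be copied into configuration files or connection statements in place of the plain-text password.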

INTEGRITY

Integrity is “the degree to which the accuracy and completeness of assets are safeguarded.”4 In a software development sense, the term assets typically refers to software itself, the data it utilizes, and the data or other products that it produces. Data integrity is not a focus of this text, but is discussed extensively in the comprehensive must-read Data Quality for Analytics.5 Many of the practices that facilitate higher quality data, however, can be implemented as quality controls for metadata (e.g., macro variables) utilized by SAS software.

Another aspect of integrity is the extent to which the asset has been safeguarded. In other words, has it been protected from intentional or accidental alteration? Data pedigree demonstrates the source of data to the specificity of the data steward, data set, or even observation. For example, if an observation requires further interrogation, metadata in the pedigree can direct stakeholders toward the original, raw data source for comparison or validation. The SDLC similarly seeks to ensure software pedigree, in that stakeholders questioning some aspect of software can review requirements documentation, versioning documentation, a software test plan, test cases, and test results to understand and gain confidence in SAS software.

Pedigree operates akin to chain-of-custody documentation, demonstrating transfer of ownership—of data through data processes or software through the SDLC. Checksums provide an alternative method to demonstrate that assets—whether software or data—have not been altered. A checksum uses a cryptographic hash function to reduce a data set, SAS program, or other file to a hexadecimal string that uniquely identifies the file. Thus, while checksums cannot demonstrate the source of files or speak to their pedigree, they can validate software or data products by demonstrating that the checksum of the current version of the file matches the checksum of the original file. Pedigree and checksums together can instill confidence in stakeholders that software and its required components demonstrate integrity.

Data Integrity

Data are often said to have integrity if they are valid, accurate, and have not been altered. Data accuracy is often difficult to determine or demonstrate without comparison to the data source or real-world constructs, which supports the importance of data pedigree. For example, to determine whether a data transformation procedure is operating correctly, developers often use test cases to validate input and expected output. However, the original data still could be inaccurate (e.g., missing, incorrect) despite an accurate transformation process.

Data validity, on the other hand, describes data that are of an expected type, structure, quantity, quality, and completeness. Thus, data validity doesn't speak to whether data are accurate but rather whether data prima facie appear as they should. Thus, a test data set created to test software functionality should be valid but cannot be accurate because the observations don't reflect any real-world constructs. Data integrity is typically enforced through data integrity constraints and quality controls that flag, expunge, or delete invalid data. While not demonstrated with regard to SAS data sets, quality controls that validate macro parameters and macro variables are demonstrated later in the “Macro Validation” section.
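
For SAS data sets themselves, one common quality control is a check integrity constraint created with the DATASETS procedure. The following sketch is illustrative only; the data set, constraint name, and business rule are hypothetical:

* create a small data set and add a check constraint that rejects invalid values;
data work.patients;
   length gender $1;
   gender='M'; output;
   gender='F'; output;
run;
proc datasets library=work nolist;
   modify patients;
   ic create valid_gender = check(where=(gender in ('M','F')))
      message='Gender must be M or F';
quit;

Once the constraint exists, inserts or updates that violate the rule are rejected, preserving data validity at the data set level.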

To demonstrate that a data set has not been altered, both data and accompanying metadata must be analyzed. It is not enough solely to show that data are equivalent because this leaves open the possibility that data were altered for some purpose and subsequently changed back to their original structure, content, and values. While the COMPARE procedure compares the contents of a data set (i.e., structure, variables, and values), it does not examine SAS metadata and thus cannot sufficiently demonstrate data integrity. Rather, metadata contained within SAS data sets and available through the DICTIONARY.Tables data set must be interrogated.

The following code creates a PERM.Dict data set that contains metadata describing the PERM.Mydata data set:

libname perm 'c:\perm';
data perm.mydata;
   length quote $50;
   quote="Because that's how you get ants!";
run;
proc sql;
   create table perm.dict as
   select *
   from dictionary.tables
   where upcase(libname)="PERM" and upcase(memname)="MYDATA";
quit;

If the code is rerun, the data contained in Mydata will be identical to the data produced in the first run; however, the metadata from Mydata (contained in Dict) will differ because the creation date and modification date (CRDATE and MODATE, respectively) will have been updated. Because SAS metadata are contained within SAS data sets and cannot be separated from the data themselves, cryptographic checksums (discussed later in the “Checksums” section) unfortunately cannot be used to validate new copies of SAS data sets. For example, if you change a data set name, library, or create date—yet leave all actual data unchanged—the checksum will still be altered despite the identical data. This characteristic differs from SAS programs because the metadata in program files (e.g., file name, create date) are not included in the binary contents of the file but are maintained by the operating system (OS). The validation of SAS program files with checksums is demonstrated in the “Checksums” and “Implementing Checksums” sections.
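
The changing metadata can be observed directly by querying DICTIONARY.Tables after each run of the DATA step, as the following quick check demonstrates:

* display the creation and modification dates, which change every time the DATA step is rerun;
proc sql;
   select memname, crdate, modate
   from dictionary.tables
   where upcase(libname)="PERM" and upcase(memname)="MYDATA";
quit;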

Therefore, when comparing two data sets, validation of data and metadata must occur through SAS rather than through external methods. One limited use of checksums with SAS data sets is to validate that a static data set remains unchanged. For example, a team might want to demonstrate integrity of a SAS format STATES that it had created, which converts state acronyms to their full names. SAS data sets can be used to generate SAS formats with the CNTLIN option in the FORMAT procedure, as demonstrated in the “SAS Formatting” section in chapter 17, “Stability.” For example, the following code builds an abbreviated STATES format from the PERM.States data set:

data perm.states;
   input fmtname: $char8. type: $char1. start: $char4. label: $25.;
   datalines;
states c CA California
states c OH Ohio
states c PA Pennsylvania
states c MD Maryland
states c VA Virginia
;
run;
proc format cntlin=perm.states;
run;

Of note, the TYPE variable “c” is required for character data when using the CNTLIN option. In this scenario, once the format is created, the team wants to ensure that the format is not modified thereafter. One method to help provide this integrity would be to build the SAS format (using the FORMAT procedure) before using the format in software, but only after validating the checksum of the PERM.States data set. If the checksum matches the original checksum that was produced when the PERM.States data set was initially created, this demonstrates that the data set has not been modified, and further validates the integrity of the STATES format.

If the DATA step that produces PERM.States is ever rerun, however, the data set metadata will be modified, and a new checksum value will need to be generated and recorded. Thus, while this method demonstrates a theoretical use of checksums to facilitate data integrity, in practice, data integrity should be validated through SAS data and metadata. For example, a much simpler solution to ensure the integrity of the STATES format would be to run the COMPARE procedure on the current PERM.States data set and a baseline data set like BASELINE.States.
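
Such a comparison might look like the following sketch, in which the BASELINE libref and its location are assumptions:

libname baseline 'c:\baseline'; * hypothetical location of the retained baseline copy;
proc compare base=baseline.states compare=perm.states;
run;

A report stating that no unequal values were found supports the integrity of the control data set and, by extension, the STATES format built from it.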

Macro Encapsulation

Encapsulation and loose coupling are principles of modular software design, discussed extensively in chapter 14, “Modularity.” Encapsulation strives to protect software from unintended results of individual code modules, as well as to isolate code modules from each other except where necessary. Therefore, encapsulation essentially builds black boxes around code modules (demonstrating why unit testing is referred to as black-box testing), while loose coupling ensures that tiny pinholes in those boxes prescribe limited and highly structured communication between various boxes.

To the extent possible, when a SAS parent process calls a child process, the child process should receive only the information and data required to function. The child process correspondingly should return only the information (and results, if applicable) that are required. Because the parent process in theory cannot see inside the black box (that surrounds the child process), SAS macros should contain return codes that demonstrate success, failure, and possibly gradations thereof that can be interpreted by the parent to drive program flow dynamically. While the use of return codes to support a quality assurance exception handling framework is discussed extensively in chapter 3, “Communication,” and chapter 6, “Robustness,” the following sections focus on ensuring that only necessary and valid information is passed back and forth between parent and child processes.

Leaky Macros

Anyone who has ever taken the SAS Advanced Programmer certification exam will be familiar with some variation of the following question type:

When the following SAS code is submitted, what is the final value of the macro variable &MAC?
%macro test(mac=3);
%let mac=4;
%mend;
%global mac;
%let mac=1;
%test(mac=2);
A) 1
B) 2
C) 3
D) 4

In this variation of the question, the answer is A (1) because the value of the parameter MAC (2), the default parameter value (3), and the local macro variable (4) within %TEST do not influence the global macro variable &MAC that was assigned outside the %TEST macro. Thus, because the %TEST macro contains a parameter MAC, references inside the macro to &MAC reflect the local macro variable &MAC that exists only inside the macro.

However, when the parameter MAC is removed from the macro invocation, references to &MAC inside %TEST now reference the global macro variable &MAC. This causes the assignment of &MAC inside %TEST to change the global macro variable &MAC to 4:

%macro test();
%let mac=4; * intended to be a local variable but read as global;
%mend;
%global mac;
%let mac=1;
%test;
%put MAC: &mac;
MAC: 4

To remedy this interference between local and global macro variables of the same name, the %LOCAL macro statement must be used to initialize all local macro variables that are not identified as parameters in the macro definition. By defining &MAC as a local macro variable inside %TEST, the assignment of the value 4 to the local macro variable &MAC no longer conflicts with the value of the global macro variable &MAC:

%macro test();
%local mac;
%let mac=4;
%mend;
%global mac;
%let mac=1;
%test;
%put MAC: &mac;
MAC: 1

This potential conflict illustrates the importance of initializing all local macro variables inside macros with the %LOCAL statement. Macro variables that are required to be used by the parent process or other parts of the program external to the macro must be explicitly initialized with the %GLOBAL statement to ensure they persist outside the child process. While it's also good form to uniquely name macro variables to avoid potential confusion and conflict, use of %LOCAL and %GLOBAL statements remains a best practice.

Variable Variable Names

No, it's not a misprint, although Microsoft Word does despise this section heading. A security risk posed by global macro variables occurs when they are defined dynamically—that is, the macro variable name itself is mutable. For example, the %OHSOFLEXIBLE macro allows its parent process to specify the name of the return code that will be generated as a global macro variable:

* child process;
%macro ohsoflexible(rc=);
%let syscc=0;
%global &rc; * dynamic assignment;
%let &&rc=FAILURE;
* do something;
%if &syscc>0 %then %let &&rc=you broke it!;
%else %let &&rc=it worked!;
%put &&rc: &&&rc;
%mend;

By first assigning and later testing the value of &SYSCC, the macro can determine whether warnings or runtime errors occurred, and if the macro completed correctly, this is communicated through the dynamically named return code that is specified in the parameter RC:

* parent process;
%ohsoflexible(rc=MyReTuRnCoDe);
MyReTuRnCoDe: it worked!

At first glance, this appears to be a viable solution, but its flexibility can be its downfall. If the parent process needs to reference the macro variable &MYRETURNCODE, it cannot because while the macro variable value (it worked!) is encoded in the macro variable &MYRETURNCODE, the macro variable itself is not a global macro variable. In fact, a separate global macro variable would need to be created inside %OHSOFLEXIBLE just so the return code macro variable could be globally referenced.

Aside from the unnecessary complexity that this creates, dynamically named macro variables pose very real risks to software integrity. The first risk is that the variable name will be invalid, which can occur if a global macro variable initialization is attempted inside a macro in which a local variable of the same name already exists. A second risk is that an existing macro variable—typically a global macro variable—will be overwritten. For example, the previous parent process is not necessarily aware of individual return codes used by other macros or processes within the software or SAS session, so it could easily overwrite other return codes or values in the global symbol table.

One example of dynamically named global macro variables is demonstrated in the “Passing Parameters with SYSPARM” section in chapter 12, “Automation,” which flexibly parses the SYSPARM parameter to create one or more global macro variables. SAS doesn't get any more flexible than this, but the risks must be understood:

* accepts a comma-delimited list of parameters in VAR1=parameter one, VAR2=parameter two format;
%macro getparm;
%local i var val;
%let i=1;
%if %length(&sysparm)>0 %then %do;
   %do %while(%length(%scan(%quote(&sysparm),&i,','))>1);
      %let var=%scan(%scan(%quote(&sysparm),&i,','),1,=);
      %let val=%scan(%scan(%quote(&sysparm),&i,','),2,=);
      %global &var;
      %let &var=&val;
      %let i=%eval(&i+1);
      %end;
   %end;
%mend;

When %GETPARM is saved to Getparm.sas and executed in batch mode from the command line, if the SYSPARM value “tit1=Title One, tit2=Title Two” is passed, then the program creates and assigns two global macro variables, &TIT1 and &TIT2, to Title One and Title Two, respectively. The risks are identical to those described earlier—an invalid macro variable name could be supplied, or the variable could already exist and be overwritten. However, if macro variables truly need to be created dynamically, then input controls should be implemented to minimize potential risk.

Input controls should demonstrate that the intended name of the macro variable being dynamically created meets SAS variable naming conventions. This can be accomplished through a reusable macro (not demonstrated) that ensures the proposed macro variable name is 32 characters or less and contains no special characters. Additional quality controls could ensure that SAS automatic macro variable names were not included and that existing user-defined macros were not overwritten. Input validation is discussed more in the “Macro Validation” section, in which parameters can also be made more reliable through this quality control technique.

The %SYMEXIST macro function demonstrates whether a local or global macro variable exists and can be utilized to ensure that duplicate macro variables are not dynamically created. For example, in the %GETPARM macro, the %SYMEXIST function could be implemented immediately before the %GLOBAL statement to cause the macro to abort if the macro variable name passed through the parameter is already in use.
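
A hedged sketch of how %GETPARM might be revised with both controls follows; the %GETPARM_SAFE name, the NVALID check, and the error messages are illustrative assumptions rather than code from chapter 12:

* hypothetical revision of %GETPARM that validates names before creating global macro variables;
%macro getparm_safe;
%local i var val;
%let i=1;
%if %length(&sysparm)>0 %then %do;
   %do %while(%length(%scan(%quote(&sysparm),&i,%str(,)))>1);
      %let var=%sysfunc(strip(%scan(%scan(%quote(&sysparm),&i,%str(,)),1,=)));
      %let val=%scan(%scan(%quote(&sysparm),&i,%str(,)),2,=);
      %if %sysfunc(nvalid(&var,v7))=0 %then %put ERROR: &var is not a valid macro variable name.;
      %else %if %symexist(&var) %then %put ERROR: &var already exists and will not be overwritten.;
      %else %do;
         %global &var;
         %let &var=&val;
         %end;
      %let i=%eval(&i+1);
      %end;
   %end;
%mend;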

Macro Validation

In modular software design, SAS macros can exhibit childlike behavior, in that a parent process invokes a macro and temporarily transfers program control to that child process. Where communication is required from the parent, parameters can be passed through the macro invocation. Because most parent processes require some form of validation that the child completed correctly, return codes can be generated within the child and passed to the parent. Thus, validation ensures both that parameters have valid structure, format, and content, and that return codes accurately depict macro success, failure, or exceptions that may have been encountered.

Parameter Validation

End-user development is often exempt from the extensive input validation required by more traditional software applications. For example, an entry form on a website must not only validate user responses, but also ensure that attackers are not attempting to gain access to the site through buffer overflow, SQL injection, or other hacking techniques. At the other end of the spectrum, software products with no front-facing components typically have no need for comprehensive user input controls because the threat of malice does not exist. Notwithstanding, as macro complexity grows and especially where modular software design is espoused, some level of parameter validation may be warranted in SAS software.

The following %DOSOMETHING macro alters output based on the parameter VAR and is intended to represent dynamic processing that might occur based on some parameterized value:

%macro dosomething(var=);
%if &var=YES %then %put Path1;
%else %if &var=NO %then %put Path2;
%mend;
%dosomething(var=YES);

Because the VAR parameter is hardcoded in the macro invocation, no attempt to validate the input is made. However, if the %DOSOMETHING macro is intended to have only two paths (initiated by the value of either YES or NO), then the lack of parameter validation introduces a vulnerability because the current conditional logic intimates a third path if neither YES nor NO is selected. In relatively static software in which parameters are rarely or never modified, the risk of failure is low and this vulnerability—the lack of validation—is often ignored. However, where parameter values are numerous, often modified, relatively complex, or generated dynamically from macro variables, the risk increases and parameter validation becomes more valuable.
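
One possible revision, sketched below, closes this implicit third path by validating the parameter explicitly; the error handling shown is illustrative only:

%macro dosomething(var=);
%let var=%upcase(&var);
%if &var=YES %then %put Path1;
%else %if &var=NO %then %put Path2;
%else %do;
   %put ERROR: The VAR parameter must be YES or NO but was: &var;
   %return;
   %end;
%mend;
%dosomething(var=maybe);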

Parameter validation becomes increasingly important as modular software is developed, reused, and repurposed. As complexity grows, the likelihood increases that parameter values may represent macro variables passed from a parent to a grandchild through the child process. And, while this type of modular design can greatly increase software reuse, it also requires higher levels of validation to ensure that parameters being passed are valid. The following code demonstrates the typical relationship between parent, child, and grandchild processes in which the parameter VAR is passed from the parent through the child (where it is modified) and on to the grandchild:

* grandchild;
%macro print(printvar=);
%put VAR: &printvar;
%mend;
*child;
%macro define(var=);
%if &var=YES %then %print(printvar=&var2);
%else %if &var=NO %then %print(printvar=NO &var2);
%else %return;
%mend;
*parent;
%define(var=YES);

In this example, the child process (%DEFINE) is responsible for validating the parameters it passes to the %PRINT macro. This validation would be based on business rules that, for example, might require that unless the VAR parameter was YES or NO the macro would be aborted, as depicted. Other business rules might instead accept YES as valid and default all other parameter values to the NO path. Because parent processes have limited visibility into the inner workings of child processes, and typically no visibility into grandchild processes, validation is essential to achieving integrity in complex, modular design. In actual production software, in addition to aborting a process when an exception occurs, a return code should be generated that can be used by the parent process to drive program flow dynamically and possibly terminate the program or perform other actions.

Perseverating Macro Variables

Because the Base SAS language does not inherently provide a method to generate return codes from macros, return codes must be faked through global macro variables, as described and demonstrated in the “Faking It in SAS” section in chapter 3, “Communication.” In third-generation languages (3GLs), return codes are limited in scope and are passed directly from child process to parent. However, because SAS return codes must hitch a ride on global macro variables, the return codes effectively are broadcast to the entire program, thus violating both encapsulation and loose coupling principles of modularity, discussed throughout chapter 14, “Modularity.”

For example, the following code demonstrates two macros run in series in the same program. Each macro aims to provide integrity by validating software completion with a return code. However, contamination occurs because the same return code is used in both macros:

libname perm 'c:\perm';
data perm.mydata;
run;
%macro one();
%global rc;
%if %sysfunc(exist(perm.mydata)) %then %let rc=found it!;
%else %let rc=missing!;
%mend;
%macro two();
%global rc; * intended to be a separate return code;
%put RC inside two: &rc;
%mend;
%one;
%two;

When executed, despite the global macro variable &RC being defined within each macro, the results are confounded because of interference between the macros. The following output demonstrates this error, as “found it!” has no business appearing inside the %TWO macro:

%one;
%put RC: &rc;
RC: found it!
%two;
RC inside two: found it!

To remedy this type of leaky macro, global macro variables should always be initialized to an empty string or some other meaningful value immediately after the %GLOBAL macro statement. These two additional statements now prevent contamination from the %ONE macro into the %TWO macro:

libname perm 'c:\perm';
data perm.mydata;
run;
%macro one();
%global rc;
%let rc=;
%if %sysfunc(exist(perm.mydata)) %then %let rc=found it!;
%else %let rc=missing!;
%mend;
%macro two();
%global rc; * intended to be a separate return code;
%let rc=;
%put RC inside two: &rc;
%mend;
%one;
%two;

While the value of &RC in %ONE now can no longer interfere with the value of &RC in %TWO, the initialization of &RC in the %TWO macro does overwrite the value of &RC from %ONE. One vulnerability has been eliminated, but the threat still exists of macro variable collision. Thus, in addition to the best practice of always immediately initializing global macro variables after creation, return codes and other global macro variables should ideally have unique names to prevent possible confusion and collision. One method to eliminate this risk is to append _RC or RC after the macro name if a single return code is generated. In some cases, the return code can be imbedded in another macro variable by using the in-band signaling technique introduced in the “In-Band” section in chapter 3, “Communication.”

Even worse than global macro variables that perseverate between macros are global macro variables that perseverate between separate programs. When SAS programs are run in interactive mode, it's common to execute software, check the log, and then execute subsequent (possibly unrelated) programs without first terminating the SAS session. All global macro variables created by the first program will still exist and potentially contaminate the execution environment for all subsequent software run in that session. An example of this is demonstrated in the “Failure” section in chapter 4, “Reliability.” To eliminate this vulnerability, when production software is executed manually in interactive mode, a best practice is always to run software in a fresh SAS session in which no other code has been previously executed.
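
When a fresh session is impractical, the global symbol table can at least be inspected before execution to reveal macro variables perseverating from earlier programs (a quick check rather than a remedy):

* list user-defined global macro variables remaining in the current session;
%put _GLOBAL_;
* or view names and values in tabular form;
proc sql;
   select name, value
   from dictionary.macros
   where scope='GLOBAL';
quit;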

Default to Failure

A common pattern when operationalizing out-of-band return codes in SAS is to represent successful macro completion with an empty macro variable. This is preferred because the %LENGTH macro function can be used to test for process success—if the macro variable is empty, the length will be zero, and no exception, warning, or runtime error will have occurred. In the following code, a parent process calls a child process and must determine whether the child process succeeded or failed:

* child process;
%macro child();
%global childRC;
%let childRC=;
* do something;
%if 2=2 %then %do; * simulate failed process;
   %let childRC=you broke me!;
   %end;
%mend;
* parent process;
%macro parent();
%global parentRC;
%let parentRC=;
%child;
%if %length(&childRC)>0 %then %do;
   %let parentRC=blame my child!;
   %return;
   %end;
* later processes that depend on success of the child;
%mend;

Because 2 always equals 2, the child process will always fail (as currently hardcoded), which produces the following output when executed:

%parent;
%put child: &childRC;
child: you broke me!
%put parent: &parentRC;
parent: blame my child!

This technique is very effective and, as demonstrated in the “Exception Inheritance” section in chapter 6, “Robustness,” facilitates multiple levels of parent–child relationships so that exceptions, warnings, and runtime errors can be inherited appropriately through modular software. One caveat when using zero-length return codes to represent process success is the possibility of returning a false positive return code—that is, claiming victory when the process in fact failed. Thus, if a zero-length return code is intended to represent process success, it should only be assigned at process completion when it can be demonstrated that no exceptions or runtime errors were encountered.

For example, in the previous child process, if a runtime error or fault occurs in the “*do something;” line, the return code &CHILDRC already would have been assigned a value indicating macro success. If the macro abruptly exited and was unable to assign an appropriate error code to &CHILDRC, then &CHILDRC would be invalid, essentially lying to the parent and stating that no error had occurred when in fact one had. One best practice when using return codes to assess process success or failure is to assign a value like GENERAL FAILURE (rather than blank or SUCCESS) initially, as demonstrated in the revised child process:

* child process;
%macro child();
%global childRC;
%let childRC=GENERAL FAILURE;
* do something;
%if 2=2 %then %do; * simulate failed process;
   %let childRC=you broke me!;
   %end;
%else %let childRC=;
%mend;

Only in the final step of the macro (once success is confirmed) should the return code reflecting success be assigned. This often incorporates analysis of the &SYSCC macro variable to demonstrate that no warning or runtime error occurred within the macro or module. Thus, if &SYSCC is 0 and other prerequisites (or business rules) have been met, the return code is set to a value that represents successful completion.
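
A minimal sketch of this pattern, extending the previous child process, follows; the business rule shown (only &SYSCC is inspected) is an assumption made for illustration:

* child process that defaults to failure and claims success only in its final step;
%macro child();
%global childRC;
%let childRC=GENERAL FAILURE;
%let syscc=0;
* do something;
%if &syscc=0 %then %let childRC=; * success assigned only after &SYSCC is confirmed to be 0;
%else %let childRC=warning or runtime error encountered;
%mend;
%child;
%put child: &childRC;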

Toward Software Stability

When software is tested, validated, and accepted by a customer, it effectively is certified to be accurate and meet all functional and performance requirements. A formalized test plan (discussed in chapter 16, “Testability”) can further enhance software integrity by demonstrating how the software was shown to meet stated testing requirements. Moreover, a risk register (discussed in chapter 1, “Introduction”) can demonstrate known vulnerabilities at the time of software release—essentially risk that has been accepted by stakeholders. These artifacts can collectively present a more comprehensive view of software reliability to stakeholders and represent a certification that software will perform as stated. Stakeholders in part place confidence in software commensurate with the level of quality with which it was developed and tested.

Once production software is modified, its integrity is degraded until it can be retested, revalidated, and rereleased. If modified software is not subsequently tested and validated commensurate with the level of quality assurance with which it was originally developed, errors or latent defects may unknowingly introduce vulnerabilities and risk. Especially in development environments that espouse modular, reusable code, subtle modifications to one module could have much broader implications for other programs that rely on that code base. Even when modifications do not adversely affect software, until revised software can be retested to demonstrate this truth, software integrity is inherently diminished.

Software stability is a common objective and often a prerequisite for production software that is automated and scheduled. From a security standpoint, stakeholders need some guarantee that software has not been modified since its last testing and acceptance. In many end-user development environments, the guarantee is a SAS practitioner stating that the software has not been modified or that the modifications were insignificant. In more formalized development environments, stored processes, macros, and programs are saved in central repositories on servers rather than client machines to prevent ad hoc modification. When additional integrity is required to guarantee that software has not been modified in any way, checksums can be generated to validate bit-level comparison of SAS program files.

Stored Macros and Stored Processes

When software is developed through modular design, as individual modules are developed and tested, they can be hardened and placed into a central repository for general use. This practice greatly supports code reuse and repurposing because the code modules have integrity, having already been thoroughly tested and vetted. Code reuse libraries and reuse catalogs can help organize central code repositories and are described and demonstrated in chapter 18, “Reusability.” The SAS application facilitates stability of code modules through the SAS Autocall Macro Facility, the SAS Stored Compiled Macro Facility, and stored processes.

The SAS Autocall Macro Facility enables SAS macros to be stored centrally but compiled locally at execution time. Stored macros, on the other hand, are precompiled, thus saving some time at execution, but their source code must be maintained separately from the executable files, resulting in duplication. While use of these facilities is not demonstrated in this text, each supports software security, as production code can be segregated from the development environment to prevent accidental modification, as discussed in the SAS® Macro Language Reference.6 SAS stored processes offer similar advantages in that they store SAS programs centrally on a server rather than on distributed client machines, thus providing greater stability and security.7
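
The following sketch shows how these facilities are typically configured; the C:\perm\reuse location, the REUSE libref, and the %EXAMPLE macro are assumptions, not references to an actual repository:

* autocall macro facility: macros saved as .sas files in a central folder and compiled at first use;
options mautosource sasautos=('c:\perm\reuse' sasautos);
* stored compiled macro facility: macros precompiled into a central catalog;
libname reuse 'c:\perm\reuse';
options mstored sasmstore=reuse;
%macro example() / store source des='hardened, tested module';
%put This macro is compiled into a catalog in the REUSE library.;
%mend;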

Checksums

A checksum is defined as a “fixed-length string of bits calculated from a message of arbitrary length, such that it is unlikely that a change of one or more bits in the message will produce the same string of bits, thereby aiding detection of accidental modification.”8 Checksums—sometimes referred to as hashes or hash algorithms—are strings of hexadecimal characters that uniquely represent files or text in a substantially reduced format. One-way cryptographic hash functions (e.g., MD5, SHA-256) evaluate a file or text and produce the checksum. Checksums provide a reliable method of demonstrating software integrity by validating when two files are identical and, conversely, by elucidating corrupt files when checksums do not match.

For example, when you download software from the Internet, a checksum is typically performed at download completion to determine whether the entire file was downloaded correctly. A bit-level comparison is made to ensure that the downloaded copy of the software is identical to the original. The checksum of the original file on the hosting server already would have been produced, thus the download process needs to generate a checksum only for the downloaded file. If the checksums are found to differ, this represents file corruption because the downloaded software does not match the original. However, when the checksums are identical, file integrity is quickly and confidently demonstrated.

SAS software can benefit from this same methodology, and by comparing a current SAS program to its verified original, stakeholders are assured that the certified operational instance has not been altered since software testing and customer acceptance. However, rather than inefficiently comparing each line of code between two programs, comparison of their two respective checksums provides equivalent integrity.

Base SAS hashing functions include MD5, SHA256, and SHA256HEX, each of which can create a checksum. For example, the following code computes the SHA-256 hash (in hexadecimal format) for the provided text:

%let sha=%sysfunc(sha256hex(It is pitch black. You are likely to be eaten by a grue.));
%put &sha;
5A3F8C31D8D529D52DE843FB5C5B9B5EA9EFFF3CC04B4F3C21880B6C9615B797
%put LEN: %length(&sha);
LEN: 64

To determine whether a new string matches the text string, the hash value of the new string can be created and compared to the previous hash value. Because hash values uniquely reference content, if the hash values are identical, then the strings they represent are also identical. This same hashing technique, when applied to files such as SAS programs, produces checksums that can be used to compare and validate software. While this example represents two sentences as a 64-character checksum (which is not tremendously efficient), a 10,000-line SAS program would also be represented by a 64-character checksum, demonstrating the tremendous efficiency of comparing two checksums to validate file integrity rather than having to compare 10,000 lines of code.

While Base SAS hash functions such as SHA256HEX are useful in producing checksums for text strings, they are not intended to produce checksums for binary files. Although it is theoretically possible to read a SAS program into a character variable and create a checksum, language and encoding issues could produce unexpected or invalid results. A much more straightforward method is to use any object-oriented programming (OOP) language to produce the file checksum natively.

To demonstrate this technique, the string “It is pitch black. You are likely to be eaten by a grue.” is saved to the text file C:\perm\grue.txt and the following Python code is run:

# PYTHON CODE!!! NO SEMICOLONS HERE!!!
import hashlib
block = 65536
sha = hashlib.sha256()
with open(r'c:\perm\grue.txt', 'rb') as f:
    buff = f.read(block)
    while len(buff) > 0:
        sha.update(buff)
        buff = f.read(block)
print(sha.hexdigest())

The Python code produces the same checksum (SHA-256 hash value) as the previous SAS SHA256HEX function:

5A3F8C31D8D529D52DE843FB5C5B9B5EA9EFFF3CC04B4F3C21880B6C9615B797

Once software has been developed, tested, hardened, and accepted by a customer, immediate computation of the program file's checksum can be utilized as a baseline for later software validation. When subsequent checksums are calculated for the program file and are validated against the baseline checksum, this demonstrates that the original version of the software is still being executed and confirms that no modifications have been made (which would reduce software integrity).

Implementing Checksums

Unlike SAS data sets, SAS program files contain no metadata such as file name or creation date. Thus, when a hash function produces a checksum of the binary contents of a SAS program file, only the program itself is analyzed. This is critical because it ensures that malleable file metadata do not influence checksum values. For example, a SAS program can be renamed, moved to a different folder, and saved with a newer date, and so long as the contents of the file are unchanged, the checksum will remain static. The discrepancy between SAS data sets and program files, introduced briefly in the “Data Integrity” section, results because SAS data sets contain proprietary metadata that cannot be removed (but which are highly variable), whereas SAS programs are text files whose metadata are managed by the OS, not the SAS application.

To implement checksum integrity checking within a production environment, a checksum should be calculated on SAS software (including SAS programs, associated external macro modules, or static configuration files) after it has been tested, validated, and accepted by the customer. Only once the code has been hardened should the baseline checksums be generated, because even a single additional semicolon in a program will change and thus invalidate its checksum. As discussed in the “Reuse Catalog” section in chapter 18, “Reusability,” checksums are popularly included in reuse catalogs because they provide additional validation to SAS practitioners that code preserved in reuse libraries has not been modified since it was tested and validated.

Thus, one method to validate the integrity of production software is to programmatically calculate the checksum of the program file immediately before execution. When the calculated checksum is found to match the checksum located in the reuse catalog or other artifact, stakeholders are assured that the software running has not been modified since software testing and acceptance. The quality control process (not demonstrated) adds only a couple seconds to software execution but tremendous integrity to the operational environment. And, if a checksum does not validate, business rules can specify that the software should not execute or that a report be generated stating that software integrity could be compromised and that the software should be retested.
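
One hedged sketch of such a pre-execution control follows, assuming a Windows environment in which the certutil utility is available, the XCMD system option is enabled, and the grue.txt file and baseline checksum from the “Checksums” section are reused:

* hypothetical integrity check that compares a file's current checksum to its recorded baseline;
%let baseline=5A3F8C31D8D529D52DE843FB5C5B9B5EA9EFFF3CC04B4F3C21880B6C9615B797; * recorded at acceptance;
filename chk pipe 'certutil -hashfile "c:\perm\grue.txt" SHA256'; * Windows OS utility, an assumption;
data _null_;
   infile chk truncover;
   input line $200.;
   if _n_=2 then call symputx('current',upcase(compress(line))); * second output line holds the hash;
run;
%macro validate();
%if &current=&baseline %then %put NOTE: Checksum validated--software has not been modified.;
%else %put ERROR: Checksum mismatch--software integrity may be compromised.;
%mend;
%validate;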

The technical implementation of checksums is straightforward and, as demonstrated, requires very little code to operationalize. Thus, the only actual labor involved in enforcing integrity constraints is the maintenance of the roster of checksums. Every time software is tested and accepted by the customer, the software should be hardened and the checksum must be regenerated and recorded in the reuse catalog or other operational artifact. After the checksum is generated, no modifications can be made to the software. A library of current checksums can be maintained in a SAS data set and, where a reuse library and reuse catalog are in use, linked to software modules listed in the reuse catalog. Thus, before reusing an existing module of code within new software, developers can be immediately assured they are utilizing the most recent (and securely tested) version of software by validating its checksum against the reuse catalog.

Checksums for Enterprise Guide Projects

Don't go there! This section is included only as a caveat. Like SAS data sets, SAS Enterprise Guide project files (.egp) are riddled with SAS metadata. To illustrate, open SAS Enterprise Guide and create a new program that contains the following text, so the checksum value can be compared to those in the previous “Checksums” section:

It is pitch black. You are likely to be eaten by a grue.

Don't worry that this text doesn't represent SAS syntax. Save the SAS Enterprise Guide project as C:\perm\test_checksum_EG.egp and exit SAS Enterprise Guide. Note the date-time stamp of the project file in your OS, because it will be changing. To generate the project file's checksum, execute the following Python code, which can be saved as C:\perm\test_checksum_EG.py:

# PYTHON CODE!!! NO SEMICOLONS HERE!!!
import hashlib
block = 65536
sha = hashlib.sha256()
with open(r'c:\perm\test_checksum_EG.egp', 'rb') as f:
    buff = f.read(block)
    while len(buff) > 0:
        sha.update(buff)
        buff = f.read(block)
print(sha.hexdigest())

The checksum output follows, but yours will differ:

8746d07fc11753a1037c66b671961310585ff0d531796c5c9fd5dddd3ea61f54

Now run the SAS Enterprise Guide project (which will generate runtime errors since SAS can't figure out how to survive a grue attack). Save the project file, exit SAS Enterprise Guide, and rerun the Python code to generate the new checksum:

edc769ecd734991213051671b2b243612485b2d422a1882c40c0af606c741285

Again, your checksum will differ from this one but, more importantly, the two checksums differ from each other—demonstrating that the project file was modified simply by executing and saving the file. As one final experiment, reopen SAS Enterprise Guide and the C:\perm\test_checksum_EG project file. Run the project (which again produces runtime errors) and then attempt to exit SAS Enterprise Guide. You'll be asked if you want to save changes—which is interesting because you didn't make any changes. Do not save changes, exit SAS Enterprise Guide, and rerun the Python code to generate a familiar checksum value:

edc769ecd734991213051671b2b243612485b2d422a1882c40c0af606c741285

The “changes” that SAS Enterprise Guide was referencing were actually changes to project metadata that are inherently saved as part of all SAS Enterprise Guide project files. Because the changes were not saved, SAS Enterprise Guide did not update those metadata, so the most recent checksum matches the previous value. To view these metadata, peel back the covers on the project file by renaming test_checksum_EG.egp to test_checksum_EG.zip so that the OS will recognize it instead as a compressed file. Open the zip file and go spelunking—the project.xml file is a text file that includes information about the create date, last modification date, and who was doing the modifying. In more complex project files, references in the XML file point to subordinate compressed directories that contain program files.

Due to the complexity and malleability of project files, checksums are useless as an integrity control to demonstrate project file stability. However, by linking to external SAS programs from inside projects (rather than creating programs inside projects), the type of integrity control demonstrated previously in the “Implementing Checksums” section can be implemented on individual SAS program files that a SAS Enterprise Guide project file utilizes.

AVAILABILITY

Availability is “the degree to which a software component is operational and available when required for use.”9 Availability requires that software as well as all required components (including data sets, configuration files, control tables, and other files) exist where they should, are valid, and are accessible to necessary stakeholders. As a central principle of software reliability, availability is discussed throughout chapter 4, “Reliability.” However, from a risk management perspective, best practices can facilitate more secure software by eliminating common threats to data set and file availability.

Inaccessible data sets are one of the leading causes of SAS software failure, even in production software that should be designed to be reliable and robust to known threats. For example, if the SORT procedure is executing and sorting data into the PERM.Mydata data set, the procedure holds an exclusive lock that restricts all other users and processes from accessing the data set. However, this locked data set can be seen from two perspectives—the internal perspective of the software executing the SORT procedure and the external perspective of users or processes waiting for (or attempting) access to the data set.

Chapter 6, “Robustness,” discusses data set availability from the external perspective of processes trying to gain access to locked data sets, demonstrating techniques that can avoid runtime errors and software failure when data sets are locked. However, from the internal perspective of the software performing the SORT that has successfully gained access to the data set, security rather than robustness is the priority. For example, this security aims to maximize the availability of PERM.Mydata to other potential users or processes, dictating that the SORT procedure should act swiftly and responsibly. Swiftness encompasses execution efficiency principles and development best practices, while responsibility requires that the lock on PERM.Mydata be released as quickly as possible, enabling competing processes to access the data subsequently. The following sections demonstrate techniques that can increase the availability of SAS data sets from the security rather than the robustness perspective.

Fail-Safe Path

The fail-safe path is the secure conclusion to a failed macro, process, or program. In failing safe, software must first detect the fatal exception or runtime error and, through exception handling, terminate the affected code. Failing safe is described throughout chapter 6, “Robustness,” and demonstrated within an exception handling framework. Failing safe is also discussed in chapter 5, “Recoverability,” as the harmless principle of the TEACH mnemonic, which requires that software should not damage its input, output, data products, or other aspects of its environment during or as a result of failure.

Software can cause damage through a number of modalities when it fails. For example, if exceptions are not handled or are handled improperly, critical data sets can be overwritten with invalid data. This can cause loss of business value as customers and other stakeholders are forced to wait while data or data products are corrected or recreated. While failure may be inevitable in some circumstances, graceful failure paths can reduce or eliminate unnecessary damage that can otherwise occur.

While the causes of failure and types of resultant damage are legion and beyond the scope of this text, a few specific threats to security are discussed in the following sections. When file streams are opened or data sets are explicitly locked with the LOCK statement, these accesses must be terminated within the fail-safe path to ensure that the files or data sets are not rendered inaccessible by other users and processes. Other idiosyncrasies can occur when SAS logs are being directed to an external file (via the PRINTTO procedure) and software does not fail safe. The following sections demonstrate the importance of unlocking data sets, closing file streams, and terminating other unresolved accesses as part of graceful software termination.

Closing Data Streams

The OPEN and FOPEN input/output (I/O) functions can be used within a DATA step to open data streams to SAS data sets and other files, respectively. For example, to determine the number of variables in a data set programmatically, SAS practitioners can interrogate the SAS dictionary tables or they can use the OPEN function to retrieve these metadata. The following code opens the data set Mydata and assigns the value 2 to the macro variable &NVARS, representing the number of variables in the data set:

data mydata;
   length char1 $10 num1 8;
   char1='tacos';
   num1=5;
run;
data _null_;
   dsid=open('work.mydata','i');
   nvars=attrn(dsid,'nvars');
   call symputx('nvars',nvars);
run;

When OPEN is utilized within a DATA step, the data stream lasts only as long as the DATA step. Even if the DATA step terminates abruptly with a runtime error, the data stream is always closed automatically and the shared file lock is released. For this reason, I/O functionality within the DATA step is secure and the corresponding CLOSE function is unnecessary.

However, OPEN, FOPEN, and other I/O functions are commonly invoked using the %SYSFUNC macro function to facilitate I/O functionality within DATA steps, SAS procedures, or anywhere else. For example, the following %GETVARS macro calculates the number of variables in a data set and is much more flexible and reusable than the previous code because it does not require a DATA _NULL_ step to create &NVARS:

%macro getvars(dsn=);
%let syscc=0;
%global nvars;
%let nvars=;
%global getvarsRC;
%let getvarsRC=FAILURE;
%local dsid;
%let dsid=%sysfunc(open(&dsn,i));
%if &dsid>0 %then %do;
   %let nvars=%sysfunc(attrn(&dsid,nvars));
   %let close=%sysfunc(close(&dsid)); %* CLOSE requires the data set identifier returned by OPEN;
   %end;
%else %do;
   %let getvarsRC=data set could not be opened;
   %return;
   %end;
%if &syscc=0 %then %let getvarsRC=;
%mend;
%getvars(dsn=work.mydata);
%put VARS: &nvars;
%put RC: &getvarsRC;

For this reason, a similar methodology in which I/O functions are invoked with %SYSFUNC is typically preferred. The one disadvantage to this macro-driven technique is the lack of automatic file stream termination, as occurs when OPEN or FOPEN are invoked without %SYSFUNC from within a DATA step. The CLOSE function must be executed any time that the OPEN function is successfully invoked with %SYSFUNC or the file stream will remain open for the duration of the SAS session.

For example, if %SYSFUNC(OPEN) fails to open a data stream and returns a value of 0, then no stream is opened, so no closure is necessary. However, if %SYSFUNC(OPEN) conversely returns a value greater than 0, then %SYSFUNC(CLOSE) must explicitly be used to close the data stream. If the CLOSE function is not utilized, the data stream will remain open indefinitely, maintaining a shared file lock on the data set for the duration of the SAS session. Other users and processes would subsequently be able to open or view the data set, but would be unable to modify or delete the data set while the shared lock unnecessarily persisted.

To ensure that data streams are closed, the fail-safe path must ensure that all data streams opened with %SYSFUNC(OPEN) or %SYSFUNC(FOPEN) are closed with %SYSFUNC(CLOSE) or %SYSFUNC(FCLOSE), respectively. In more complex processes in which multiple I/O functions may be strung together through nested logic branches, multiple CLOSE or FCLOSE functions may be required to ensure that if a failure occurs, all opened data streams are closed before process or program termination.
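For example, the following sketch (the %COMPARENVARS macro and its messages are illustrative and not drawn from earlier chapters) demonstrates a fail-safe path that closes only the data streams that were successfully opened before terminating:

%macro comparenvars(dsn1=, dsn2=);
%local dsid1 dsid2 rc;
%let dsid1=%sysfunc(open(&dsn1,i));
%let dsid2=%sysfunc(open(&dsn2,i));
%if &dsid1=0 or &dsid2=0 %then %do;
   %put One or both data sets could not be opened;
   %* fail-safe path: close only the streams that were actually opened;
   %if &dsid1>0 %then %let rc=%sysfunc(close(&dsid1));
   %if &dsid2>0 %then %let rc=%sysfunc(close(&dsid2));
   %return;
   %end;
%put &dsn1 has %sysfunc(attrn(&dsid1,nvars)) variables and &dsn2 has %sysfunc(attrn(&dsid2,nvars)) variables;
%let rc=%sysfunc(close(&dsid1));
%let rc=%sysfunc(close(&dsid2));
%mend;
%comparenvars(dsn1=work.mydata, dsn2=work.doesnotexist);

Because Work.Doesnotexist cannot be opened, the macro follows the fail-safe path, closes the stream to Work.Mydata, and returns without leaving either data set locked.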

Unlocking Data Sets

A file lock is created every time SAS accesses a data set, whether to view, modify, or delete the data set. The vast majority of SAS file locks are implicit—that is, they occur automatically, and SAS releases the lock as soon as the DATA step or procedure has terminated. In some cases, such as when a file lock is implicitly created by the OPEN or FOPEN I/O functions (when invoked through the %SYSFUNC macro function), an explicit call to the CLOSE or FCLOSE function is required to release the file lock. This idiosyncrasy is demonstrated earlier in the “Closing Data Streams” section. The LOCK statement is the only method through which SAS creates explicit file locks; thus, it requires an explicit LOCK CLEAR statement to release the lock, as demonstrated in the following code:

libname perm 'c:\perm';
data perm.mydata;
   length char1 $10;
   char1='tacos';
run;
%macro sortstuff;
lock perm.mydata;
%if &syslckrc=0 %then %do;
   proc sort data=perm.mydata;
      by char1;
   run;
   lock perm.mydata clear;
   %end;
%else %put somebody in my data!;
%mend;
%sortstuff;

The LOCK statement was historically used to prevent separate SAS sessions, processes, or users from accessing a data set that was in use, but has been largely deprecated due to errors in LOCK functionality. Within legacy code, however, whenever LOCK is implemented, a corresponding LOCK CLEAR statement must follow. A more reliable method to lock shared data sets exclusively is demonstrated in the “Mutexes and Semaphores” section in chapter 3, “Communications.”

Missing: One SAS Log

The PRINTTO procedure can be used to redirect the SAS log to a text file. Log files can be invaluable in ferreting out sources of failure, and some environments require that production software logs be maintained for posterity to validate the success of processes. Another use of PRINTTO is to redirect the SAS log temporarily to a file so that notes in the log can be parsed to alter program flow dynamically. For example, because the SETINIT procedure doesn't create a usable return code or output data set that can be programmatically assessed, the log results from SETINIT must be saved to a text file and immediately parsed to determine if the SAS environment contains specific licensed SAS modules. This technique is demonstrated in the “SAS Component Portability” section in chapter 10, “Portability.” A similar use of PRINTTO is demonstrated in the “FULLSTIMER Automated” section in chapter 8, “Efficiency,” in which FULLSTIMER metrics from the SAS log are saved to a text file and subsequently parsed and analyzed via the %READFULLSTIMER macro.

This use of PRINTTO is reprised and captures the FULLSTIMER metrics that depict system resource utilization during the DATA step:

libname perm 'c:\perm';
%let path=%sysfunc(pathname(perm));
options fullstimer;
proc printto log="&path/out.txt" new;
run;
data perm.subset;
   set perm.sortme (where=(num1<.1));
run;
proc printto;
run;
%readfullstimer(textfile=&path/out.txt);

If the program is abruptly terminated during the DATA step, however, the log destination will still be directed toward the text file, not the SAS log window. The second PRINTTO procedure is required to redirect the log back to the default SAS window. In this scenario, at least two errors can occur when software terminates abruptly while the log is being sent to a file. First, if additional code is executed in the same SAS session, the log will continue to be sent to the log file, which may perplex SAS practitioners who can't find results of their code, while also adulterating the log file with irrelevant or unrelated results. Second, as long as the stream to the log file remains open, SAS maintains a file lock on the log file that prevents its deletion. Therefore, a later process intended to delete the log after parsing it (not shown) can also fail because SAS maintains the file lock until the second issuance of the PRINTTO procedure.

Thus, whenever PRINTTO is implemented to redirect the log destination, exception handling routines should ensure that the second PRINTTO procedure is implemented as part of the fail-safe path so that logs generated subsequent to the termination do not disappear into thin air or cause cascading failures.
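For example, the following sketch (a hypothetical wrapper, not from the earlier chapters) guarantees that the default log destination is restored even when the captured step fails:

%macro capturelog(logfile=);
%let syscc=0;
proc printto log="&logfile" new;
run;
data perm.subset;
   set perm.sortme (where=(num1<.1));
run;
%* fail-safe path: always restore the default log destination, even after an exception;
proc printto;
run;
%if &syscc>0 %then %put The captured step failed--its log was saved to &logfile;
%mend;
%capturelog(logfile=&path/out.txt);

Because the second PRINTTO procedure executes unconditionally before the macro terminates, subsequent log output is never stranded in the text file and the file lock on the log file is released.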

Euthanizing Programs

In most instances, the fail-safe path is invoked because an exception, warning, or runtime error is encountered during execution that requires process or program termination. Thus, a functional failure precipitates the need to fail gracefully. However, occasionally a performance failure—notably, the failure to meet execution time requirements—necessitates program termination. For example, if a SAS process is expected to complete in less than ten minutes but is still running an hour later, some business rules might recommend or require that the process be terminated. Too often, however, production SAS software is terminated for this reason by manually killing the job rather than by building this logic into automated quality controls within a quality assurance exception handling framework. As demonstrated in this section, Base SAS provides the tools to force processes to terminate automatically when they exceed execution time thresholds.

Consider the following code that creates and sorts a data set and which simulates a process within production software:

libname perm 'c:\perm';
data perm.mydata (drop=i);
   length num1 8;
   do i=1 to 100000000;
      num1=round(rand('uniform')*10);
      output;
      end;
run;
proc sort data=perm.mydata;
   by num1;
run;

The code executes in approximately 90 seconds, which will vary by environment and can be modified by changing the number of observations created. But imagine if the same code were expected (and required) to execute instead in 10 seconds or less. In taking 90 seconds (nine times longer than expected), this performance failure would undoubtedly leave SAS practitioners with the difficult decision of whether to kill the job manually or allow it to continue unfettered on its slow and likely errant path. Moreover, allowing a rogue process to continue processing after it has already demonstrated substantial deviation from performance norms poses a security risk, especially where the cause of the delay remains unknown while the code continues to execute.

To improve both the performance and security of production software, one best practice is to modularize processes so that they can be invoked with the SYSTASK statement. The SYSTASK COMMAND statement spawns the SAS program as a batch job; the WAITFOR statement with the TIMEOUT option pauses the parent session until the batch job completes or the execution time threshold is reached; and a subsequent SYSTASK KILL statement terminates the batch job if it is still running. This method ensures that the SAS session in which the batch job is running is terminated promptly, providing a fail-safe path when performance failures are automatically detected.

To demonstrate this method, the previous SAS code should be saved in the file C:\perm\makeandsort.sas. This program is called as a batch job, enabling the parent process to terminate it if the execution time threshold is exceeded. The parent process follows and should be saved as C:\perm\engine.sas:

libname perm 'c:\perm';
%put %sysfunc(putn(%sysfunc(datetime()),datetime17.));
systask command """%sysget(SASROOT)\sas.exe"" -noterminal -nosplash -nostatuswin -noicon -sysin ""c:\perm\makeandsort.sas"" -log ""c:\perm\makeandsort.log""" status=rc_makeandsort taskname=task_makeandsort;
waitfor task_makeandsort timeout=20;
%put SYSRC: &sysrc;
%put %sysfunc(putn(%sysfunc(datetime()),datetime17.));
systask kill task_makeandsort;

The SYSTASK statement is explained in detail in chapter 12, “Automation,” but the TASKNAME option is required in this instance because the WAITFOR statement uses it to reference this specific batch job. Because the SAS process (now saved as Makeandsort.sas) is required to complete in 10 seconds or less, the TIMEOUT option has been set at 20 seconds, arbitrarily specifying that the job should be killed if it has not completed in twice its expected execution time. Because the process in this scenario is taking more than 20 seconds to execute (approximately 90 seconds), it will be terminated automatically after 20 seconds.

The following output is produced by the Engine.sas program and demonstrates that the batch job did time out. The passage of time is shown in the log with the DATETIME function, and the &SYSRC automatic macro variable is –1, indicating that the WAITFOR statement timed out before the batch job completed:

libname perm 'c:\perm';
NOTE: Libref PERM was successfully assigned as follows:
      Engine:        V9
      Physical Name: c:\perm
%put %sysfunc(putn(%sysfunc(datetime()),datetime17.));
11MAR16:22:19:27
systask command """%sysget(SASROOT)\sas.exe"" -noterminal -nosplash -nostatuswin -noicon
! -sysin ""c:\perm\makeandsort.sas"" -log ""c:\perm\makeandsort.log""" status=rc_makeandsort
! taskname=task_makeandsort;
waitfor task_makeandsort timeout=20;
NOTE: WAITFOR timed out.
%put SYSRC: &sysrc;
SYSRC: -1
%put %sysfunc(putn(%sysfunc(datetime()),datetime17.));
11MAR16:22:19:47
systask kill task_makeandsort;

Because the SYSTASK statement specifies that a log file (for the batch job only) should be created as C:\perm\makeandsort.log, the following output is written to that log:

libname perm 'c:\perm';
NOTE: Libref PERM was successfully assigned as follows:
      Engine:        V9
      Physical Name: c:\perm
%put %sysfunc(putn(%sysfunc(datetime()),datetime17.));
11MAR16:22:19:27
data perm.mydata (drop=i);
   length num1 8;
       do i=1 to 100000000;
      num1=round(rand('uniform')*10);
      output;
      end;
run;
NOTE: The data set PERM.MYDATA has 100000000 observations and 1 variables.
NOTE: DATA statement used (Total process time):
      real time           11.55 seconds
      cpu time            11.03 seconds

Note that the batch job had time to complete the DATA step, which is demonstrated in the log, but did not complete the SORT procedure. Because the SYSTASK KILL statement, issued after WAITFOR times out, causes an abrupt termination of the batch job (and the SAS session in which it is running), whatever process is executing when the KILL statement is received is not written to the log. However, processes that executed prior to the termination are recorded and can be used to investigate the cause of the delay.

Because the &SYSRC return code is –1 when a timeout occurs (and 0 when execution occurs within the specified time limit), the &SYSRC value can be assessed programmatically to drive further exception handling routines. For example, upon discovering that a batch job has timed out, an email could notify stakeholders of this exception or any number of other actions could be performed, thus increasing the quality and responsiveness of software. And, because this technique can kill rogue processes that are likely wasting system resources and may be causing other harm, software security and efficiency can be substantially increased through euthanasia.
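For example, the unconditional SYSTASK KILL statement in Engine.sas could be replaced with conditional logic such as the following sketch, in which the %PUT statements stand in for site-specific notification routines:

%macro handletimeout;
%if &sysrc ne 0 %then %do;
   %* the batch job exceeded its threshold, so euthanize it and alert stakeholders;
   systask kill task_makeandsort;
   %put Batch job task_makeandsort was killed after exceeding its execution time threshold;
   %end;
%else %put Batch job task_makeandsort completed within its execution time threshold;
%mend;
%handletimeout;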

Complexities of Concurrency

Parallel processing is demonstrated throughout chapter 7, “Execution Efficiency,” chapter 9, “Scalability,” and chapter 12, “Automation,” as a best practice that can substantially increase software performance by decreasing execution time. The parallel processing model described most often involves spawning multiple SAS sessions concurrently—through manual execution, batch jobs, or the SYSTASK statement—and enabling them to communicate with each other to divide and conquer tasks or data sets. However, with this added communication and complexity comes the responsibility that parallelized software be secure to ensure parallel processes do not damage or interfere with each other.

A common parallel processing paradigm entails a SAS program engine (AKA controller or driver) that spawns batch jobs (child processes) or otherwise coordinates software executing in other SAS sessions. In this parallel processing model, software security must encompass not only the program engine but also its children or other concurrent software with which it is interacting. For example, if multiple batch jobs are spawned through SYSTASK, exception handling within the engine should monitor the success and failure of the respective batch jobs and ensure that their respective return codes and other information do not conflict within the engine. In addition, outside of the engine, additional measures must ensure that log files, data products, or other output from batch jobs or other concurrently executing programs do not conflict with each other.

Salt is “a variable incorporated as secondary input to a one-way or encryption function that derives password verification data.”10 Salting is used most commonly in cryptography, in which some unique value is internally added to a construct (such as a password) to strengthen it. Because the salted portion of the construct is typically unknown to users and only used for internal processing and validation, it facilitates passwords that are less vulnerable to attack. The use of salts in cryptography is beyond the scope of this text, but salts can be conceptually utilized in parallel processing to identify constructs uniquely to prevent collisions.

By appending a known, unique token—which can be as straightforward as an incremental number—to macro variable names, SAS data set names, log file names, data product names, and other relevant constructs, collisions are avoided because identically named entities are not created. Salts are used inside engines to ensure that batch job metadata and return codes do not conflict, and are used in child processes to ensure that their respective data products, log files, and other output do not conflict.

Salting Your Children

To eliminate the threat of collisions between child processes running in batch mode or other software running concurrently in different SAS sessions, salt can be applied to data set names, log file names, and other output. For example, the following program uses I/O functions to determine all variables in a data set and iteratively sorts the data set by each of the variables. That is, the data set Mydata is sorted first by the variable CHAR1 and next by the variable CHAR2. Because the output data sets need to be saved uniquely, a salt—the incremental macro variable &I—is appended to the output data set name to avoid collisions.

data mydata;
   length char1 $10 char2 $15;
   char1='one';
   char2='two';
run;
%macro serialsort(dsn=);
%local dsid;
%local vars;
%local i;
%let dsid=%sysfunc(open(&dsn));
%let vars=%sysfunc(attrn(&dsid,nvars));
%do i=1 %to &vars;
   proc sort data=&dsn out=&dsn._&i;
      by %sysfunc(varname(&dsid,&i));
   run;
   %end;
%let close=%sysfunc(close(&dsid)); %* close the data stream, as described in the Closing Data Streams section;
%mend;
%serialsort(dsn=mydata);

The code produces two data sets, Mydata_1 and Mydata_2, thus ensuring that file names are unique and do not conflict. With subtle modification of the code, these data sets instead could have been named after their respective sorted variable, or with any other dynamically generated salt values that would guarantee uniqueness.
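For example, the following variation (the %SERIALSORTBYNAME name is illustrative) salts each output data set name with the name of the sorted variable rather than with the loop index, producing Mydata_char1 and Mydata_char2:

%macro serialsortbyname(dsn=);
%local dsid vars i var close;
%let dsid=%sysfunc(open(&dsn));
%let vars=%sysfunc(attrn(&dsid,nvars));
%do i=1 %to &vars;
   %let var=%sysfunc(varname(&dsid,&i));
   %* the variable name itself serves as the salt in the output data set name;
   proc sort data=&dsn out=&dsn._&var;
      by &var;
   run;
   %end;
%let close=%sysfunc(close(&dsid));
%mend;
%serialsortbyname(dsn=mydata);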

Relevant to parallel processing, salts must be used in SYSTASK invocations wherever multiple copies of the same software are being asynchronously spawned. For example, the following two lines of code are demonstrated in the “Decomposing SYSTASK” section in chapter 12, “Automation”:

systask command """%sysget(SASROOT)\sas.exe"" -noterminal -sysin ""&perm\means.sas"" -log ""&perm\means1.log"" -print ""&perm\means1.lst""" status=rc_means1 taskname=task_means1;
systask command """%sysget(SASROOT)\sas.exe"" -noterminal -sysin ""&perm\means.sas"" -log ""&perm\means2.log"" -print ""&perm\means2.lst""" status=rc_means2 taskname=task_means2;

Note that the log files and output files must be uniquely named to ensure that they don't collide in the environment during execution. In more dynamic software, rather than generating log file names like Means1 and Means2 manually, the salt value would be dynamically generated and incremented, as is demonstrated in the following section.

Salting Your Parents

When batch jobs are spawned with the SYSTASK statement, the WAITFOR statement can be implemented to force the engine to pause while one or more batch jobs complete. Without WAITFOR implemented, SYSTASK continues processing and can asynchronously spawn other batch jobs that will execute concurrently. In addition to ensuring that the batch jobs themselves do not collide, batch job metadata within the engine may also need to be salted to ensure they are uniquely named. SYSTASK is discussed extensively throughout chapter 12, “Automation.”

To demonstrate this threat of collision, the %SERIALSORT macro from the “Salting Your Children” section is retrofitted so that it can be run in parallel. Because macro variables cannot be passed as parameters to external SAS sessions, the values for the data set name, variable name, and variable position—DSN, VAR, and I, respectively—must instead be supplied through the SYSPARM parameter. The engine that choreographs the respective SAS children is demonstrated and saved as C:\perm\parallelsort.sas:

libname perm 'c:\perm';
data perm.mydata;
   length char1 $10 char2 $15;
   char1='one';
   char2='two';
run;
%macro parallelsort(dsn=);
%local dsid;
%local vars;
%local var;
%local i;
%let dsid=%sysfunc(open(&dsn));
%let vars=%sysfunc(attrn(&dsid,nvars));
%do i=1 %to &vars;
   %let var=%sysfunc(varname(&dsid,&i));
   systask command """%sysget(SASROOT)\sas.exe"" -noterminal -nosplash -nostatuswin -noicon -sysparm ""&dsn &var &i"" -sysin ""c:\perm\serialsort.sas"" -log ""c:\perm\serialsort_&i..log""" status=rc_sort&i taskname=task_sort&i;
   %end;
%let close=%sysfunc(close(&dsid)); %* release the data stream before waiting on the batch jobs;
waitfor _all_;
%put Sorting Done!;
%mend;
%parallelsort(dsn=perm.mydata);

Because the SYSTASK statement is executed iteratively—as many times as there are variables in the data set—its metadata (i.e., the macro variable named by the STATUS option and the task name specified by the TASKNAME option) and output (i.e., log files) must be uniquely named utilizing the salt value &I. Without this convention, the batch job task names and return code macro variables could not be distinguished and would overwrite each other. For example, were Parallelsort.sas required to be more robust, exception handling would need to validate the individual return codes &RC_SORT1 and &RC_SORT2 to demonstrate that their respective batch jobs completed without error. While this exception handling is not demonstrated, the uniquely named return codes facilitate this endeavor through salting.
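For example, a sketch such as the following (the %CHECKSORTS name is illustrative) could be invoked from within %PARALLELSORT immediately after the WAITFOR statement to confirm, via the salted status macro variables, that each batch job completed:

%macro checksorts(jobs=);
%local i;
%do i=1 %to &jobs;
   %if %symexist(rc_sort&i) %then %do;
      %if &&rc_sort&i=0 %then %put Batch job task_sort&i completed successfully;
      %else %put Batch job task_sort&i failed with return code &&rc_sort&i;
      %end;
   %end;
%mend;

Invoked as %CHECKSORTS(jobs=&vars) inside the engine, the nested call can resolve &RC_SORT1 and &RC_SORT2 because macro variable resolution searches enclosing scopes.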

The revised engine now invokes multiple simultaneous instances of the following batch job, saved as C:\perm\serialsort.sas.

libname perm 'c:\perm';
%let dsn=%scan(&sysparm,1,,S);
%let var=%scan(&sysparm,2,,S);
%let i=%scan(&sysparm,3,,S);
proc sort data=&dsn out=&dsn._&i;
   by &var;
run;

Because macro parameters cannot be passed from the parent process (Parallelsort.sas) to the child process (Serialsort.sas), the SYSTASK statement passes parameters via the SYSPARM parameter, which is read and parsed as the &SYSPARM automatic macro variable inside Serialsort.sas. And, because multiple copies of the child process execute concurrently, the salt &I is used again to ensure that output data sets and log files do not collide and overwrite each other. By parallelizing the serial process demonstrated in the “Salting Your Children” section, all SORT procedures run simultaneously and performance is greatly improved.

SECURITY IN THE SDLC

Direct references to security are less common in data analytic development environments and projects because malicious threats are an unlikely source of risk. Thus, aspects of security more often are couched in terms of integrity and availability objectives during software planning and design discussions. For example, how will software integrity be demonstrated so that stakeholders are assured that production software has not been modified since testing and acceptance? Or, what methods will be implemented to ensure that data sets and other required software components are available when needed?

Notwithstanding the shift in focus that security may experience in data analytic development environments, many security objectives espouse best practices that should be implemented within all software regardless of other performance requirements. For example, leaky macros should never be tolerated because encapsulation principles can easily eliminate this vulnerability. In this sense, secure coding often equates to supporting software development best practices rather than requirements codified in project documentation.
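For example, the contrast below (the variable name is illustrative) shows how a single %LOCAL statement encapsulates a macro and prevents it from overwriting values in the calling environment:

%macro leaky;
%let i=99; %* updates any existing macro variable named I in an enclosing scope;
%mend;

%macro contained;
%local i; %* I is confined to this macro and cannot damage its environment;
%let i=99;
%mend;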

Other security solutions may be more complicated to effect and are typically implemented at the organizational or team level rather than for one software product. For example, integrity controls such as checksums are powerful tools that can demonstrate the integrity and stability of software, but they are typically implemented as a comprehensive solution that supports a corpus of software rather than one program. Thus, conversations about software security often occur at the team or organization level as opposed to during planning and design of a particular software product.
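Although a comprehensive checksum framework is beyond the scope of this text, the following sketch hints at what such a control might look like in Base SAS; the file path is illustrative, and the chained MD5 digest is a simple fingerprint rather than a standard file checksum:

data _null_;
   length rec $32767 digest $32;
   retain digest;
   infile 'c:\perm\engine.sas' truncover lrecl=32767 end=eof;
   input rec $char32767.;
   * fold each record into the running digest so the final value reflects the entire file;
   digest=put(md5(trim(digest)||trim(rec)),$hex32.);
   if eof then put 'Fingerprint: ' digest;
run;

Recomputing the fingerprint later and comparing it to a stored value can demonstrate that a production program has not been modified since testing and acceptance.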

Requiring Security

Security requirements are typically organized around the CIA triad of confidentiality, integrity, and availability. Confidentiality is not a focus of this text because it is more commonly managed in the SAS application through the SAS administrator, metadata content, and OS controls and user privileges. SAS data sets can be encrypted programmatically, and this may be a requirement in some organizations, especially where data are transmitted beyond the organization through email, web services, or other electronic modality.
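For example, a minimal illustration of programmatic encryption follows; the ENCRYPTKEY value is a placeholder, and AES encryption of data sets requires SAS 9.4 or later:

libname perm 'c:\perm';
data perm.confidential (encrypt=aes encryptkey=examplekey1);
   set perm.mydata;
run;

Thereafter the data set cannot be read without supplying the same ENCRYPTKEY value.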

Due to the overwhelming focus in SAS literature on data quality, integrity requirements for SAS software products too often are focused narrowly on data integrity to the exclusion of code integrity. For example, requirements might state that a critical third-party data set ingested by software must demonstrate validity through a series of quality controls that validate the data or metadata. While data integrity is paramount in data analytic software, software integrity should be equally valued.

Where requirements seek to improve the quality of software products by introducing data quality control techniques, software integrity checks such as checksums that can instantaneously validate software and its components should also be included in software requirements. Integrity requirements often also speak to the level of stability required of software, for example, by stating that production software should be saved as stored processes or as separate macro modules.

This language can help demonstrate to developers and other stakeholders that software is intended to remain stable for some intended duration once placed into production. In a Waterfall environment, this could be several months, when a planned (or unplanned) software update might be released to eliminate defects or provide additional functionality or improved performance. In an Agile environment, on the other hand, that software stability might only last the length of one iteration, when upgrades to the SAS software are integrated as the next batch of working code is released. Thus, regardless of the type of development environment that exists—whether phase-gate or rapid development—an expectation of software stability must be demonstrated if software integrity is to be achieved.

Availability, the third facet of the CIA triad, is more commonly associated with software reliability and robustness requirements. Thus, where software availability is a stated requirement, technical specifications typically state the percent of the time that software must be operational, a maximum number of failures that software is permitted to have per month, or some other reliability-centric metric. While closing data streams, unlocking data sets, and salting dynamic constructs so they don't collide represent technical best practices that should always be followed, they typically are not included in requirements documentation. Availability as it relates to software requirements is further discussed in the “Reliability in the SDLC” sections in chapter 4, “Reliability.”

Measuring Security

Because security aims to eliminate threats, successful security policies are typically measured through reliability metrics. Thus, if a software product or its data products lack integrity, this will be demonstrated through functional failures that either terminate with runtime errors or produce invalid, inaccurate results. For example, if SAS practitioners fail to predict that data sets stored in persistent, shared SAS libraries may be locked, missing, or otherwise unavailable, then failures will result due to the lack of availability.

Security does differ from reliability in that security is more outward-focused. In measuring reliability, stakeholders may want to establish the percent of time that software was functional or the number of failures that occurred within some timeframe. Assessing software security, on the other hand, would additionally involve assessing how many times and to what extent software adversely affected the reliability or security of other software or the production environment as a whole. At the process level, security aims to measure not only if a code module is secure but moreover that the module does nothing to violate the functionality of other external software modules or programs.

This is an important distinction because a SAS module could appear to be very secure if it contains a macro definition that accepts a limited number of predefined parameters. Thus, the macro might appear to be encapsulated from and only loosely coupled to other components of the software. However, if the same macro also creates a plethora of global macro variables that overwrite existing macro variables, or if it overwrites SAS data sets or data products used by other components, then despite the reliability of the module, it does not demonstrate security because it harms its environment. Security, therefore, must always be considered from the perspective of not only the software being assessed but also the environment with which it interacts, which includes external software and data products.

WHAT'S NEXT?

A common objective of production software is to remove the human element so it can run unfettered, allowing SAS practitioners to perform more important and interesting tasks, such as analyzing data products or making data-driven decisions. Once software has achieved a necessary level of reliability, has been stabilized, and has been tested sufficiently to predict future performance, software can be automated. The next chapter demonstrates successful software automation—a well-deserved reward for SAS practitioners who have labored over a quality product.

NOTES
