Data Management

Introduction

Why is the data coded? Doesn’t it seem easier to use words that are more descriptive (e.g., diabetes, no diabetes)? There are a number of reasons why data collections, particularly large ones, are coded. Some computer programs, such as Excel, have limited capability for processing alphanumerics. When data is collected on paper (such as from questionnaires), it is more efficient and less error prone to transcribe the data into electronic form using codes. Often the codes come from a predefined list of allowed values: during collection, the textual definition is chosen and the corresponding code is recorded. Large data collections are often accompanied by a codebook, which gives the variable and code definitions.
While coded data is efficient for information processing, it is not ideal when communicating statistical results to stakeholders. Therefore, we will “decode” the data prior to analysis.

Setting the JMP Modeling Type

Level (or scale) of measurement describes the relationship between the values that a variable can assume. The values of a nominal variable represent different categories, for example, gender or geographic region. An ordinal variable’s values have an implied ordering, such as in a severity of illness rating with levels minor, moderate, major, and extreme. An interval scale applies to numeric variables where intervals have the same interpretation throughout the scale, such as with temperature. Ratio scales have an absolute zero, for example, currency or age. A JMP modeling type is assigned to each column to indicate the level of measurement for that variable. There are three modeling types: nominal, ordinal, and continuous. The continuous type is assigned to variables measured on an interval or ratio scale.
By default, JMP assigns a modeling type of continuous to columns containing numbers. It is important to assign each JMP column the proper modeling type so that the appropriate statistical analysis is performed. All of the columns in CreatinineLevels.jmp are initially set to the continuous modeling type. The patient_id column contains a unique anonymized patient identification number and should be assigned a nominal modeling type. To change the modeling type, right-click the icon next to patient_id in the column list on the left of the data table and select “Nominal” as shown in Figure 6.1 Changing Modeling Type.
Figure 6.1 Changing Modeling Type
Notice that the icon for patient_id now appears as a small red histogram.
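The correspondence between level of measurement and modeling type can be written down as a small lookup. The following Python sketch is only an illustration and is not part of JMP; the function name `modeling_type` and the scale assigned to the Creatinine column (ratio) are our own assumptions.

```python
# Illustrative lookup: level of measurement -> JMP modeling type.
# This is not JMP code; the names below are hypothetical.
MODELING_TYPE_BY_SCALE = {
    "nominal": "Nominal",
    "ordinal": "Ordinal",
    "interval": "Continuous",
    "ratio": "Continuous",
}


def modeling_type(scale: str) -> str:
    """Return the modeling type for a given level of measurement."""
    return MODELING_TYPE_BY_SCALE[scale.strip().lower()]


# Assumed scales for a few columns of the example data table.
column_scales = {
    "patient_id": "nominal",   # an identifier, not a quantity
    "Diabetes": "nominal",     # coded 0/1 indicator
    "Creatinine": "ratio",     # assumption: a measured concentration
}

for name, scale in column_scales.items():
    print(f"{name}: {modeling_type(scale)}")
```

Note that patient_id maps to Nominal even though its values are numbers: the modeling type follows the meaning of the values, not how they are stored.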

Renaming the Coded Variable Levels

The columns Diabetes, CAD, Outcome, and Race_Cat are all coded nominal variables. They will be easier to interpret if they are decoded into more descriptive names. The columns associated with the presence of disease are examples of binary or indicator variables, which are nominal variables with two possible values. It is conventional to assign a value of 0 when the disease is absent and a value of 1 when the disease is present. These variable levels can be easily renamed with the Recode feature selected from the Cols menu using the option “In Place” to replace the codes with descriptive names. The completed Recode dialog for Diabetes is shown in Figure 6.2 Recoding the Diabetes Column.
Figure 6.2 Recoding the Diabetes Column
The same recoding is completed for CAD and Outcome, and the same technique can be applied to the Race_Cat column. (See the Data Definitions section for the decoded values.)
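Outside of JMP, the same decoding step amounts to a lookup against the codebook. The Python sketch below illustrates the idea under stated assumptions: the function `recode` and the sample values are hypothetical, and the labels follow the 0 = absent, 1 = present convention described above.

```python
# Hypothetical codebook for the Diabetes column: 0 = absent, 1 = present.
DIABETES_LABELS = {0: "No diabetes", 1: "Diabetes"}


def recode(values, codebook):
    """Replace each code with its label; keep missing values as None."""
    decoded = []
    for code in values:
        if code is None:
            decoded.append(None)
        elif code in codebook:
            decoded.append(codebook[code])
        else:
            # An unknown code usually signals a data-entry problem.
            raise ValueError(f"code {code!r} is not in the codebook")
    return decoded


# Made-up sample values, for illustration only.
diabetes_codes = [0, 1, 1, None, 0]
print(recode(diabetes_codes, DIABETES_LABELS))
```

Raising an error on an unrecognized code, rather than passing it through silently, is a design choice that surfaces transcription errors before the analysis begins.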

Creating and Populating a Column

The data set does not contain a variable that allows us to directly estimate the proportion of patients with a creatinine level greater than 2. A new indicator variable can easily be created for this purpose. When a variable is created from existing data, it is referred to as derived data.
To estimate the proportion of patients that have a creatinine level greater than 2, create a column that contains a nominal variable that will serve as an indicator. Add a new column by double-clicking in the blank column header to the right of the Outcome column. Right-click the new column header and select Column Info from the menu. Figure 6.3 Setting the Column Name and Data and Modeling Types shows the completed dialog where a descriptive column name has been added.
Figure 6.3 Setting the Column Name and Data and Modeling Types
The JMP Formula Editor allows us to populate this new column with the indicator corresponding to the creatinine level. Selecting “Formula” from the Column Properties drop-down menu will cause the JMP Formula Editor to be displayed. To create the formula, select Conditional > If from the formula group list at the far left. In the expression box, add Creatinine from the Columns list, select the >= operator from the Comparison group, and then enter 2. To the right of the arrow (⇒) following the expression, enter “Stage 1”, and to the right of the arrow associated with the else clause, enter “No disease”. The completed formula is shown in Figure 6.4 Creating Values in a New Column Using the Formula Editor.
Figure 6.4 Creating Values in a New Column Using the Formula Editor
Apply the formula to the new column. The assignment of character values will cause the column to have a character data type and a nominal modeling type. The data file is now ready for analysis.
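The logic of the derived column, and the proportion it lets us estimate, can be sketched in a few lines of Python. This is an illustration rather than JMP code: the function name, the handling of missing values, and the sample creatinine values are all assumptions. The sketch uses a strict comparison, matching the “greater than 2” wording; use `>=` instead if a level of exactly 2 should count as Stage 1.

```python
def creatinine_indicator(level, threshold=2.0):
    """Label a creatinine level as "Stage 1" or "No disease".

    Assumption: a missing level (None) stays missing.
    """
    if level is None:
        return None
    return "Stage 1" if level > threshold else "No disease"


# Made-up creatinine values, for illustration only.
creatinine = [0.9, 2.0, 2.4, 3.1, 1.2]

# Derived (character, nominal) column.
indicator = [creatinine_indicator(x) for x in creatinine]

# Proportion of non-missing rows labeled "Stage 1".
observed = [v for v in indicator if v is not None]
proportion = sum(v == "Stage 1" for v in observed) / len(observed)

print(indicator)
print(proportion)
```

Because the indicator holds text labels rather than numbers, it is naturally treated as a nominal variable, just as the character values do in the JMP column.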
Last updated: October 12, 2017