4

Process Models for Data Mining and Analysis

This chapter includes an overview of two complementary analytical process models: the Central Intelligence Agency (CIA) Intelligence Process,1 and the CRoss Industry Standard Process for Data Mining (CRISP-DM),2 as well as an integrated process model for Actionable Mining and Predictive Analysis that is specific to the application of data mining and predictive analytics in the public safety and security setting.

All of these models emphasize the analytical process over specific tools or techniques. In addition, they have been conceptualized as iterative processes, meaning simply that the analytical process can and should be repeated as conditions change or new information becomes available. Rather than representing a failure of the analysis or created model, the need for repetition can serve to validate successful analysis and information-based operations. When used to support information-based operations, data mining tools are similar to a public safety time machine in that they offer the ability to characterize, anticipate, predict, and even prevent certain crimes. For example, “risk-based deployment” strategies are based on the concept that identifying and characterizing what is likely to happen in new areas supports proactive deployment. Once crime has been suppressed in a particular area, the next steps could include analysis and evaluation of displacement, which occurs when a particular crime pattern or trend has been moved to another location. This would include a similar analytical process on a new set of data that accurately reflect the current conditions, including the positive changes associated with an earlier iteration of the model and the resulting operational plan. This development of effective, information-based tactics and strategies can allow managers and command staff to target their resources specifically and more effectively, which increases the likelihood of successful operations.

An iterative process is also important because crime and criminals change. As the players change, so do the underlying patterns and trends in crime. For example, preferences for illegal drugs often cycle between stimulants and depressants. Markets associated with distribution of cocaine frequently differ from those associated with heroin. Therefore, as the drug of choice changes, so will the associated markets and related crime. Similarly, seasonal patterns and even weather can change crime patterns and trends, particularly if they affect the availability of potential victims or other targets. For example, people are less likely to stroll city streets during a torrential downpour, which limits the availability of potential “targets” for street robberies; temperature extremes might be associated with an increased prevalence of vehicles left running to keep either the air conditioning or heat on, which increases the number of available targets for auto theft. Successful police work also will require periodic “refreshing” of the model. The models will change subtly as offenders are apprehended and removed from the streets. These revised models will reflect the absence of known offenders, as well as the emergence of any new players. For example, illegal drug markets frequently experience changes in operation and function associated with changes in players.

In many ways, identifying changing patterns or players is the most exciting outcome of the data mining process because it underscores the surprise and discovery that can be associated with the analysis. This process is not linear, with a clear beginning and end. Rather, it is better represented by an iterative cycle in which the answers to the initial questions almost inevitably beget additional questions, representing the beginning of the next analysis cycle. Another way to visualize this process of forward iteration is to consider a funnel-shaped spiral rather than a flat circle. The spirals get increasingly tight as the solution moves closer to the idealized target. The concepts of spiral processing, integration, and development are increasingly being used to describe an iterative process with forward progression. Although this language may change over time, the important feature of a spiral model of iterative crime or intelligence analysis is that the subsequent iterations reflect progress and movement toward the best fit.

Sequential iterations of the analysis process can be used to further refine models, which ultimately may result in more specific, targeted tactics, strategies, and responses. For example, it is common knowledge that there are different types of homicide that can be defined by their motives (e.g., domestic, drug related, sexual).3 Categorizing homicides has value from an investigative perspective in that knowledge of the type of homicide or motive generally serves to shorten the list of potential suspects, which ultimately enhances investigative efficacy. Similarly, analysis of the victims of violence has resulted in the identification of distinct victim groups. This underscores the finding that different people are at risk for different reasons at different times and that global violence injury prevention programs might be less than adequate if they do not address the unique constellation of risk associated with specific victim groups.4 Therefore, additional analysis and characterization of the unique factors associated with specific groups of victims can be used to guide the creation of meaningful prevention and response strategies. This subject is addressed in greater detail in Chapter 11.

Data mining is as much analytical process as it is math and statistics. In fact, the general rule is that the data mining process is 80:20—80% preparation and 20% analysis. The specific elements or tasks associated with the data mining process will be addressed separately and in greater detail in subsequent chapters; however, a general overview of these models is provided below in an effort to highlight their similarities and functional relationships. Specific analytical protocols based on these process models also will be provided in relevant chapters.

4.1 CIA Intelligence Process

To highlight the multiple steps or detailed process associated with the transformation of data and information into intelligence, the CIA has developed an Intelligence Process model. The CIA intelligence model has been divided into six stages: Needs, Collection, Processing and Exploitation, Analysis and Production, Dissemination, and Feedback.5

Needs

During the needs or “requirements” phase,6 intelligence information priorities are determined. It is during this phase that conflicting or competing priorities are identified and resolved or rank ordered. The CIA model underscores the dynamic and changing nature of the Intelligence Process, emphasizing that the “answers” to questions frequently represent the starting point for subsequent iterations of the process. Therefore, these identified needs or requirements can and should be reevaluated as conditions or priorities change.

Collection

The intelligence community places particular emphasis on the collection of raw data and information that form the basis for finished intelligence products, creating agencies assigned exclusively to the collection, processing and exploitation, and analysis of specific intelligence sources. The CIA model specifies five basic collection modalities:7

1. Signals Intelligence (SIGINT)—SIGINT is a general category that includes information obtained from intercepted signals. Subdisciplines within this category include Communications Intelligence (COMINT) and Electronic Intelligence (ELINT).
2. Imagery Intelligence (IMINT)—IMINT includes information obtained through satellite, aerial, and ground-based collection methods.
3. Measurement and Signature Intelligence (MASINT)—MASINT includes technical data that are not SIGINT or IMINT. Sources for MASINT intelligence can include, but are not limited to, radar, nuclear, seismic, and chemical and biological intelligence.
4. Human-Source Intelligence (HUMINT)—HUMINT, as the name implies, includes intelligence gathered from human sources. This collection discipline has been divided further and can include clandestine activities as well as overt collection efforts, debriefing, and official contacts.
5. Open-Source Information (OSINT)—OSINT includes information available publicly and can include but is not limited to newspapers, radio, television, and the Internet.

Processing and Exploitation

The processing and exploitation phase includes the preparation and transformation of data into a format that can be analyzed. The inclusion of this step underscores the complexity of some forms of collected intelligence information. Single agencies may be almost entirely responsible for the processing and exploitation of specific categories of technically derived intelligence, which supports the critical importance of subject matter or domain expertise in the process.

Analysis and Production

It is during the analysis and production phase of the process that raw data and information are converted into intelligence products. These products may be relatively brief and limited in depth or coverage, or they may be longer and represent a more comprehensive study of a particular topic or issue. These finished intelligence studies also may include the integration of multiple sources of information, which affords a greater depth of analysis and insight.

Dissemination

Dissemination includes the distribution of intelligence products to the intelligence community, policy makers, the military, or other consumers of intelligence. Intelligence products may be developed rapidly, based on emerging or rapidly changing events; they may be regular reports, such as the President’s Daily Briefing; or they may reflect the results of a long-term study or analysis, such as the National Intelligence Estimates.

Feedback

The inclusion of feedback in the model supports the continuous and iterative nature of the Intelligence Process. Information provided by the consumers of finished intelligence can be used to guide new areas of inquiry or identify gaps in information that need to be filled or otherwise addressed. Feedback also may be used to adjust priorities or emphasis.

Summary

The CIA Intelligence Process is well suited to the functions and needs of the intelligence community. The scope, breadth, and applicability of this approach to such a diverse range of functions and responsibilities within the intelligence community are admirable, and have been highlighted by the model’s relative longevity, as well as the frequency with which it has been adopted, cited, and imitated. The level of detail associated with this process model, however, is not sufficient to support specific analytical strategies or approaches, including data mining. In all fairness, we should point out that it would be extremely difficult if not impossible to develop a general analytical process model or strategy that addressed accurately, and in specific detail, the unique challenges and idiosyncrasies associated with each collection discipline or modality.

4.2 CRISP-DM

What the CIA model brings in terms of specificity to intelligence, and by extension to applied public safety and security analysis, the CRISP-DM process model brings to data mining as a process, a focus that is reflected in its origins. Several years ago, representatives from a diverse array of industries gathered to define the best practices, or standard process, for data mining.8 The result of this effort was the CRoss Industry Standard Process for Data Mining (CRISP-DM). The CRISP-DM process model was based on the direct experience of data mining practitioners, rather than scientists or academics, and represents a “best practices” model for data mining that was intended to transcend professional domains. Data mining is as much analytical process as it is specific algorithms and models. Like the CIA Intelligence Process, the CRISP-DM process model has been broken down into six steps: business understanding, data understanding, data preparation, modeling, evaluation, and deployment.9

Business Understanding

Perhaps the most important phase of the data mining process includes gaining an understanding of the current practices and overall objectives of the project. During the business understanding phase of the CRISP-DM process, the analyst determines the objectives of the data mining project. Included in this phase are an identification of the resources available and any associated constraints, overall goals, and specific metrics that can be used to evaluate the success or failure of the project.

Data Understanding

The second phase of the CRISP-DM analytical process is the data understanding step. During this phase, the data are collected and the analyst begins to explore and gain familiarity with the data, including form, content, and structure. Knowledge and understanding of the numeric features and properties of the data (e.g., categorical versus continuous data) will be important during the data preparation process and essential to the selection of appropriate statistical tools and algorithms used during the modeling phase. Finally, it is through this preliminary exploration that the analyst acquires an understanding of and familiarity with the data that will be used in subsequent steps to guide the analytical process, including any modeling, evaluate the results, and prepare the output and reports.

Data Preparation

After the data have been examined and characterized in a preliminary fashion during the data understanding stage, the data are then prepared for subsequent mining and analysis. This data preparation includes any cleaning and recoding as well as the selection of any necessary training and test samples. It is also during this stage that any necessary merging or aggregating of data sets or elements is done. The goal of this step is the creation of the data set that will be used in the subsequent modeling phase of the process.

Modeling

During the modeling phase of the project, specific modeling algorithms are selected and run on the data. Selection of the specific algorithms employed in the data mining process is based on the nature of the question and the outputs desired. These algorithms can be categorized into two general groups: rule induction models or decision trees, and unsupervised learning or clustering techniques. For example, scoring algorithms or decision tree models are used to create decision rules, based on known categories or relationships, that can be applied to unknown data. Unsupervised learning or clustering techniques are used to uncover natural patterns or relationships in the data when group membership or category has not been identified previously. Additional considerations in model selection and creation include the ability to balance accuracy and comprehensibility. Some extremely powerful models, although very accurate, can be very difficult to interpret and thus validate. On the other hand, models that generate output that can be understood and validated frequently compromise overall accuracy in order to achieve this.
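To make the comprehensibility side of that tradeoff concrete, the following is a minimal sketch using scikit-learn in Python; the incident fields, labels, and data are hypothetical and purely illustrative. A shallow decision tree may give up some accuracy, but it yields explicit rules that an end user can read, validate, and apply to new cases.

```python
# A minimal sketch of readable decision rules; all fields and values are hypothetical.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical incidents labeled with a known category (e.g., part of a linked series).
incidents = pd.DataFrame({
    "hour":       [23, 2, 14, 22, 3, 15, 1, 13],
    "weapon":     [1, 1, 0, 1, 1, 0, 1, 0],   # 1 = firearm displayed
    "commercial": [0, 0, 1, 0, 0, 1, 0, 1],   # 1 = commercial target
    "series":     [1, 1, 0, 1, 1, 0, 1, 0],   # known label
})
X, y = incidents[["hour", "weapon", "commercial"]], incidents["series"]

# A shallow tree trades some accuracy for rules an end user can read and validate.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))

# The fitted rules can then be applied to new, unlabeled incidents.
new_cases = pd.DataFrame({"hour": [0, 12], "weapon": [1, 0], "commercial": [0, 1]})
print(tree.predict(new_cases))
```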

Evaluation

During the evaluation phase of the project, the models created are reviewed to determine their accuracy as well as their ability to meet the goals and objectives of the project identified in the business understanding phase. Put simply: Is the model accurate, and does it answer the question posed?

Deployment

Finally, the deployment phase includes the dissemination of the information. The form of the information can include tables and reports as well as the creation of rule sets or scoring algorithms that can be applied directly to other data.

Summary

This model has worked very well for many business applications;10 however, law enforcement, security, and intelligence analysis can differ in several meaningful ways. Analysts in these settings frequently encounter unique challenges associated with the data, including timely availability, reliability, and validity. Moreover, in almost all cases the output needs to be comprehensible and easily understood by nontechnical end users while being directly actionable in the applied setting. Finally, unlike in the business community, the cost of errors in the applied public safety setting frequently is life itself. Errors in judgment based on faulty analysis or interpretation of the results can put citizens as well as operational personnel at risk for serious injury or death.

Table 4-1 Comparison of the CRISP-DM and CIA Intelligence Process Models.

CRISP-DM                  CIA Intelligence Process
Business understanding    Needs
Data understanding        Collection
Data preparation          Processing and exploitation
Modeling                  Analysis and production
Evaluation                Dissemination
Deployment                Feedback

The CIA Intelligence Process has unique features associated with its use in support of the intelligence community, including its ability to guide sound policy and information-based operational support. The importance of domain expertise is underscored in the intelligence community by the existence of specific agencies responsible for the collection, processing, and analysis of specific types of intelligence data. The CRISP-DM process model highlights the need for subject matter experts and domain expertise, but emphasizes a common analytical strategy that has been designed to transcend professional boundaries and that is relatively independent of content area or domain. The CIA Intelligence Process and CRISP-DM models are well suited to their respective professional domains; however, they are both somewhat limited in directly addressing the unique challenges and needs related to the direct application of data mining and predictive analytics in the public safety and security arena. Therefore, an integrated process model specific to public safety and security data mining and predictive analytics is outlined below. Like the CIA model, this model recognizes not only a role but also a critical need for analytical tradecraft in the process; and like the CRISP-DM process model, it emphasizes the fact that effective use of data mining and predictive analytics truly is an analytical process that encompasses far more than the mathematical algorithms and statistical techniques used in the modeling phase.

4.3 Actionable Mining and Predictive Analysis for Public Safety and Security

The CRISP-DM model highlights the importance of domain expertise and analytical tradecraft. As depicted in Figure 4-1, the steps both preceding and following the modeling phase require significant domain expertise and understanding of operational requirements. If we assume, as mentioned earlier, that 80% of the data mining process is in the data preprocessing and preparation steps, an effective data mining process model for public safety and security will specifically address these steps. This preparation should include focusing on the unique limitations and challenges associated with applied public safety and security analysis. In most situations, once the data preprocessing and output have been addressed, commercially available software packages can be used for the actual modeling. To address this requirement for operationally relevant data preprocessing and output, the Actionable Mining and Predictive Analysis for Public Safety and Security model has been created. The Actionable Mining model includes the following steps:


Figure 4-1 In the CRISP-DM process model, the steps preceding and following the modeling phase require significant domain expertise and understanding of the operational requirements.

1. Question or challenge
2. Data collection and fusion
3. Operationally relevant preprocessing
a. Recoding
b. Variable selection
4. Identification, characterization, modeling
5. Public safety–specific evaluation
6. Operationally actionable output

Question or Challenge

Sometimes the analyst is faced with specific questions: Are these crimes linked? When are burglaries most frequent? Do people buy drugs in the rain? Other times, however, the task initially manifests itself as a vague question or challenge that requires some preliminary work to identify or structure a specific question. For example, it is not unusual to be presented with a series of telephone calls or financial transactions and then to be asked whether there is any sort of pattern worthy of additional investigation. Therefore, during the initial phase of the process, the general question or challenge is identified and converted into a specific question that will be answered by the data mining process. This question will be used to structure the analytical design plan, guide the process, and ultimately evaluate the fit and value of the answer.

It is also during this stage that current procedures and reports should be reviewed. In the business community, it is desirable to work directly with the client or recipient of the data mining results at this phase to ensure that the end product addresses the specific business questions or challenges and otherwise meets their needs. In law enforcement, intelligence, and security analysis, it is imperative to collaborate directly with the anticipated recipient or end user of the analytical products, particularly operational personnel. Working with the intended recipients of the data mining results can make the difference between generating analytical end products that might be interesting but have little to no value to their recipients, and those analytical results that can be translated directly into the operational environment to support and enhance information-based decisions. One possible consequence of overlooking or omitting this step includes “producing the right answers to the wrong questions.”11 In the applied setting, if the results cannot be used in the field, then data mining becomes little more than an academic exercise. Therefore, it is never too early to begin to consider what the output should look like and how it will be used, as this can have implications for the remaining steps. The question or challenge phase also is a good point in the process to identify evaluation criteria or other metrics of success that can be reviewed later to evaluate the success of the process. Again, these criteria should include the operational value of the analytical products, as well as traditional measures of accuracy and reliability.

Data Collection and Fusion

In the CIA Intelligence Process model, data collection is a separate and distinct step; however, data collection is merged with preliminary analysis and exploration in the CRISP-DM process model. This difference in emphasis most likely speaks to the different professional disciplines associated with the two analytical process models and to the cost and difficulty of their respective collection efforts. In the intelligence community, the collection of data and information for analysis can represent a significant function of an entire agency and consume a major portion of the budget, particularly as the technical complexity and required resources associated with the collection process increase. As outlined above, collection is so important to the entire intelligence process that it has been divided further into separate collection disciplines. The data collected for analysis in the business setting generally are less difficult to obtain and may even reflect some foresight and analytical input regarding structure, form, and content.

Public safety and security data generally lie somewhere in between these two perspectives. Most public safety and security organizations do not have dedicated collection efforts or the ability to effectively utilize some technically challenging sources currently available to the intelligence community (e.g., SIGINT). Public safety and security data and information generally assume the form of standard incident reports, citizen complaint data, and some narrative information. That is not to say that unusual or unorthodox data resources cannot play a significant role in public safety analysis. It is not unreasonable to consider that the economy, special events, seasonal changes, or even weather might affect crime trends and patterns, particularly if these trends significantly impact the movement and associated access to victim populations. For example, street robberies in a nightclub area might decrease during heavy rain if the robberies normally are associated with patrons leisurely strolling around. Similarly, auto theft might increase when it is extremely cold, as citizens leave keys in their cars while preheating them. Therefore, thinking outside the box regarding useful data can result in more comprehensive and accurate models of criminal activity. In this case, the size of the box is limited only by the creativity of the analysts, their willingness to explore additional approaches, and their legitimate and ethical access to data and information.

Most, if not all, data analyzed in the public safety and security arena were collected for some other purpose, which can affect data form, content, and structure. Crime incident reporting forms generally are not created with data mining in mind. Moreover, some of the most valuable information in an incident report frequently is included in the unstructured narrative portion of the report. It is in this narrative section that information relating to modus operandi and other important behavioral indicators can be found. Unfortunately, it is this section of the report that also contains misspellings, typographical errors, and incomplete and missing information, as well as other inconsistencies, all of which significantly limit the analysts’ ability to effectively exploit the information.

Integration or fusion of multiple data resources also is started during this data collection and fusion phase and can be continued through the data preprocessing stage, when the data set is created for modeling and analysis. Fusion of data and information across collection modalities, data subsets, or separate locations can be desirable or even necessary. Common types of data integration include any necessary linking of required tables with relational data resources, including incident-based reporting systems, as well as any required linking of data that have been stored in separate files. This can include files that are maintained in time-limited samples due to the amount of information. For example, citizen complaint data or calls for service might be stored in monthly files, which will need to be combined to support analysis of longer patterns and trends. Similarly, separate victim and suspect tables might need to be linked to support an analysis of victim selection or victim-perpetrator relationships.
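The sketch below illustrates the two linking tasks just described, combining monthly extracts and joining victim and suspect tables. It assumes pandas; the file names, column names, and incident key are hypothetical.

```python
# A minimal sketch of data fusion tasks; file and field names are hypothetical.
import pandas as pd

# Combine monthly calls-for-service extracts into a single frame so that
# longer patterns and trends can be analyzed across month boundaries.
monthly_files = [f"calls_for_service_2006_{m:02d}.csv" for m in range(1, 13)]
calls = pd.concat([pd.read_csv(f) for f in monthly_files], ignore_index=True)

# Link separate victim and suspect tables on a shared incident number to
# support analysis of victim selection and victim-perpetrator relationships.
victims = pd.read_csv("victims.csv")     # incident_id, victim_age, ...
suspects = pd.read_csv("suspects.csv")   # incident_id, suspect_age, ...
linked = victims.merge(
    suspects, on="incident_id", how="inner", suffixes=("_victim", "_suspect")
)
```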

Fusion and integrated analysis of multiple data resources may add value to the process, or may be required to explore a single series or pattern of crime. For example, bank and telephone records can be linked to reveal and model important patterns associated with illegal sales, distribution, or smuggling, while weather data might provide clues to patterns of crime that are affected by seasonal changes or localized weather patterns. Again, the only limitations to the data used are the creativity and insight of the analysts and the legal authority to access and use the information.

Public safety officials in many areas now are recognizing the value of regional analysis of crime trends and patterns. By linking data that span jurisdictional boundaries, individual localities can gain an understanding of regional trends and patterns that is not possible with locality-specific data.12 Regional fusion centers also may represent a unique path for the acquisition of more sophisticated analytical software if the expense is distributed over a region. While the cost of some powerful data mining tools might exceed a local budget, the cost could be distributed across localities through the establishment of a regional fusion center or coordinated analytical effort.

Finally, linking regional data resources also can be used to increase the frequency of rare events and support effective analysis. For example, some terrorist groups have shown a preference for multiple, simultaneous, yet geographically distinct attacks. While incidents of hostile surveillance are extremely rare, the ability to combine data across similar or otherwise linked locations provides a unique opportunity to more fully characterize and model a larger pattern of behavior.

Operationally Relevant Preprocessing

As mentioned above, data preprocessing and preparation generally account for approximately 80% of the data mining process. This phase of the data mining process assumes even greater importance in public safety and security analysis, given the limitations frequently associated with public safety–related data as well as the need for operationally relevant analytical products. Moreover, an additional limitation encountered in applied public safety and security analysis is the fact that not all data resources and variables are available when they are needed. Therefore, to address these issues, we divide the preprocessing step into operationally relevant recoding and variable selection.

Recoding

The recoding phase of data preparation includes both transformation and cleaning. Perhaps the most important function in this step is the creation of a data inventory. This data inventory helps the analysts identify what they know as well as what they do not know or what might be missing. The data organization and management function can be extremely powerful, particularly in analytical tasks supporting the investigative process. In the behavioral analysis of violent crime or cold case investigation, one of the first tasks conducted during the preliminary case review and evaluation is to organize the evidence and identify any gaps, inconsistencies, or missing information. In some cases, this review and organization of the case materials is sufficient to solve the crime by revealing information or clues that had been masked by disorganization. Sometimes, just identifying the fact that there is a missing piece in the investigative puzzle can provide new insight, which further underscores the importance of this relatively unglamorous task.

Similar to the CRISP-DM model and other analytical strategies, the data inventory should include a listing of the various data elements and any attributes that might be important to subsequent steps in the process (e.g., categorical versus continuous data). Also important is the identification of missing data, as well as what the missing data actually mean. For example, do blank fields in a report indicate a negative response or the absence of a particular feature or element, or do they indicate a failure to ask a question or gather the relevant information? The true meaning of missing data can have significant implications not only for the analysis but also for the interpretation of the results. Therefore, decisions about missing data and the interpretation of any subsequent analyses and derived results should be made with considerable caution. That being said, data and information available for applied public safety and security analysis almost always arrive with at least some missing data. This is an occupational hazard that requires subject matter expertise and knowledge of what effect it will have on operational decisions.
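A minimal sketch of such a data inventory, assuming pandas, follows; the incident file and field names are hypothetical. The goal is simply to surface each element's type, missingness, and cardinality before any modeling decisions are made.

```python
# A minimal data inventory sketch; the file and field names are hypothetical.
import pandas as pd

incidents = pd.read_csv("incidents.csv")

# One row per data element: its type (categorical vs. continuous matters in
# later steps), how often it is missing, and how many distinct values it takes.
inventory = pd.DataFrame({
    "dtype": incidents.dtypes.astype(str),
    "n_missing": incidents.isna().sum(),
    "n_unique": incidents.nunique(),
})
print(inventory)

# A blank "weapon" field might mean no weapon was involved, or that the
# question was never asked; that call requires domain knowledge, so flag
# the ambiguity explicitly rather than silently imputing a value.
incidents["weapon"] = incidents["weapon"].fillna("not recorded")
```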

Additional data quality issues also are evaluated at this stage. Some quality issues can be addressed, while others cannot. One particularly challenging issue is the duplication of records, which is addressed in greater detail in Chapter 5. Moreover, the incident-based reporting rules now create a situation in which multiple crimes, victims, and/or suspects can be included in a single crime incident. While this increases the richness of crime data, it also significantly increases the complexity of the data and the associated analytical requirements. Crimes with more than one victim and/or perpetrator can be counted more than once, which makes it difficult to count crimes and affects the analyst’s ability to analyze them. Again, it frequently is up to the analyst to make informed decisions regarding data quality, cleaning, and whether to include or disregard duplicate records.
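The sketch below shows one way to handle the counting problem just described, assuming pandas and a hypothetical incident key; whether duplicates are dropped or retained is an analytical decision, not a mechanical one.

```python
# A minimal sketch of handling duplicate records; fields are hypothetical.
import pandas as pd

incidents = pd.read_csv("incidents.csv")

# Under incident-based reporting, one crime can appear once per victim or
# per charge. Count distinct incidents when the question is "how many crimes?"
n_crimes = incidents["incident_id"].nunique()

# Or reduce to one row per incident; which row to keep (and whether to keep
# any duplicates at all) is an informed judgment call for the analyst.
one_per_incident = incidents.drop_duplicates(subset="incident_id", keep="first")
```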

Other data quality issues include the reliability and validity of the data. Although this is covered in detail in subsequent chapters, it is important to note that victims and witnesses frequently are unreliable in their reporting. Poor lighting, the passage of time, extreme fear, and other distractions can alter recall and reporting accuracy. Further, some suspects, witnesses, and even victims have been known to intentionally distort the facts or to outright lie. While some of these data quality issues can be addressed through cleaning and recoding of the data, many will remain unresolved, contributing to a level of uncertainty and necessary caution regarding the interpretations. Therefore, this step includes any necessary cleaning and recoding as well as the selection of training and test data. Ideally, the training and test samples will be constructed using random assignment methodologies. Given the extremely low frequency of some patterns of criminal behavior, however, alternate sampling methods may be required to ensure adequate representation of data in each sample.
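A minimal sketch of the sampling step, assuming scikit-learn, follows; the outcome field is hypothetical. Stratified assignment is one example of the alternate methods alluded to above: it preserves the rate of a rare outcome in both the training and test samples.

```python
# A minimal sketch of stratified training/test assignment; fields are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split

incidents = pd.read_csv("incidents.csv")

# Random assignment, stratified on a rare outcome so that both samples
# contain enough positive cases to support modeling and evaluation.
train, test = train_test_split(
    incidents,
    test_size=0.3,
    stratify=incidents["linked_series"],  # hypothetical rare outcome flag
    random_state=42,                      # reproducible assignment
)
```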

Most of the data and information available for public safety and security analysis require some level of recoding. Whether it involves categorizing crimes by motive or creating new variables based on MO or some other behavioral feature, recoding the data in an operationally relevant manner is essential to effective analysis, as well as to the creation of meaningful analytical output that will have direct value in the applied setting. In response to the unique importance of time, space, and the nature of the incident or threat to most public safety and security analytical tasks and associated operational decisions, our research group at RTI International has developed the Trinity Sight™ analytical framework, which is described in greater detail in Chapter 6.

Finally, it is during this phase that the analysts begin to explore and probe the data. It is during this very important process that the analysts gain familiarity with the data, particularly the idiosyncrasies or limitations that will affect their interpretation of the findings. Therefore, the data understanding phase can be truly creative as analysts begin to identify and reveal interesting patterns or trends, which might have an impact on the analytical strategy, or even refine or change the original question.

Variable Selection

This step includes an assessment of the data resources available, based on the inventory created as well as any existing constraints on the process and assumptions made. Again, the applied setting puts a considerable number of constraints on the ability to identify and access data in a timely fashion and translate analytical products back into the operational environment. Consequently, the selection of the variables that will be used in subsequent modeling steps is extremely important in applied data mining and predictive analytics, and requires significant domain knowledge. Factors that should be considered in the selection of variables include not only the operational value of the variables selected but also their availability.

Operational Value

Many relationships identified or models created are interesting, but have no value in the applied setting because they are not operationally actionable. For example, we found that the use of a sawed-off shotgun was related to an increased likelihood that a victim would be assaulted during an armed robbery. This was a very interesting finding, but one with limited value to the overall objective of the analytical process, which was to develop information-based patrol deployment strategies. Social scientists might examine the relationship between weapon selection and the propensity for violence, but it is very difficult to proactively deploy resources for sawed-off shotguns, which significantly limits the operational value of this finding. Therefore, significant domain expertise is required in the variable selection process to ensure that the variables selected will support the creation of operationally actionable models and output. Like the tree that falls in the woods with nobody there to hear it, no matter how interesting some analytical output might be to the analyst, it has little to no value if colleagues in the field cannot use it.

On the other hand, a related finding that the amount of money taken during an armed robbery was associated with an increased likelihood of assault initially appeared to be similarly limited in its value in deployment decisions, yet additional review of the findings suggested otherwise. Discussion with the operational personnel revealed that two specific victim populations were noteworthy for the amount of cash that they carried and their risk for robbery-related assault: drug dealers and illegal immigrants. It is not unusual for street-level drug dealers and other players in illegal drug markets to carry large amounts of cash. Moreover, violence frequently is used to enforce and regulate behavior in this setting,13 so it is not surprising to find an increased likelihood of assault associated with drug-related robberies. Illegal immigrants frequently carry large amounts of cash because their immigration status limits their access to traditional financial institutions. In many cases, they are targeted by robbers specifically for this reason and are assaulted when they resist efforts to steal their money. This issue underscores the importance of domain expertise and a close working relationship with the ultimate recipients of the analysis.

Availability and Timeliness

One of the biggest challenges in translating the data mining process to the applied setting of public safety and security has been creating models with operational value and relevance. Elegant, very precise models can be created in the academic setting when accurate and reliable data are readily available and the outcomes are known. In the applied setting, however, suspects lie; incident reports frequently are incomplete; victims can be confused; witnesses are less than forthcoming; and information is limited, unreliable, or otherwise unavailable when it is needed. All of this limits the availability of and timely access to information, not to mention its reliability and validity. Ultimately, these factors can restrict the analytical pace, process, and interpretation, as well as the overall value of the results. Therefore, to increase the likelihood of success, a good understanding of what data are available and when, how the results will fit into the investigative pace or affect its tempo, how the analytical products will be used, and any other key assumptions or constraints is important to structuring the analysis.

Identification, Characterization, and Modeling

During the identification, characterization, and modeling phase of the project, specific statistical algorithms are selected and applied to the data in an effort to identify, characterize, and model the data of interest. Although the unique aspects of the data collected will guide selection of the specific modeling algorithms, the statistical algorithms used in data mining can be categorized into two general groups: unsupervised learning or clustering techniques, and rule induction models or decision trees. Unsupervised learning or clustering techniques group the data into sets based on similar attributes or features. These techniques also can be used to detect data that are anomalies or significantly different from the rest of the sample.

Rule induction models capitalize on the fact that criminal behavior can be relatively predictable or homogeneous, particularly regarding common or successful MO features. Specific attributes or behavioral patterns can be characterized and modeled using rule induction models, which resemble decision trees. These models can be based on empirically determined clusters identified using the unsupervised learning techniques or those predetermined by the analyst. Rule induction models can be used to characterize and model known patterns of behavior. These models then can be applied to new data in an effort to quickly identify previously observed, known patterns and categorize unknown behavior.

Although it can be helpful to categorize the specific modeling tools into two groups, they are not mutually exclusive. Unsupervised learning and decision tree models can be used in sequence or in successive analytical iterations on the same data resources to identify, characterize, and model unique patterns, clusters, attributes, or events. For example, in some situations it is enough to know that “something” is in the data, whether unusual events or trends and patterns. In other situations, identification of a case, pattern, or trend of interest represents only the first step in the analytical process. Subsequent analytical steps and processes then will be used to further characterize and/or model the data or event of interest, so it is entirely possible to use unsupervised learning approaches to initially explore and characterize the data, followed by rule induction models or decision trees to further characterize and model these preliminary findings, as sketched below. The modeling approach selected depends on the available data and resources, as well as the operational requirements, analytical tradecraft and preferences, and domain expertise.
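The following is a minimal sketch of that sequence, assuming scikit-learn and hypothetical robbery features: clustering first surfaces natural groupings, and a decision tree then restates those groupings as readable rules that can be applied to newly reported incidents.

```python
# A minimal sketch of sequential unsupervised-then-supervised modeling;
# the file and field names are hypothetical.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier, export_text

incidents = pd.read_csv("robberies.csv")
features = incidents[["hour", "weapon_code", "premise_code"]]

# Step 1: unsupervised learning groups incidents by similar attributes;
# the number of clusters is itself an analytical judgment call.
cluster_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)

# Step 2: a decision tree characterizes the clusters as explicit rules,
# which end users can validate and apply to new cases.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(features, cluster_labels)
print(export_text(tree, feature_names=list(features.columns)))
```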

Public Safety and Security-Specific Evaluation

During the evaluation phase of the process, the models created are reviewed to determine whether they answered the question or challenge identified at the beginning of the process. It is also during this step that the models are evaluated to determine whether the analytical output meets the needs of the end users and is actionable in the applied setting. Some modeling methods are extremely complex and only can be deployed as automated scoring algorithms, while results generated by other models can be interpreted readily and are directly actionable. Of particular importance to data mining in the applied public safety and security setting is the ability to translate the analytical output to the field in support of operational tactics and strategy. Overly complex models, while accurate and reliable, can be somewhat limited if they are too difficult to interpret. Therefore, analysts should work closely with the end users during the data evaluation phase of the process to ensure that this particular goal is achieved.

Included in the evaluation phase of the process is a review of the overall accuracy of the model, as well as the type and nature of its errors. Predicting low-frequency events like crime can be particularly challenging, and the overall accuracy of the models created can be somewhat misleading with these low-frequency events. For example, a model would be correct 97% of the time if it always predicted “no” for an event with an expected frequency of 3%. Clearly, overall accuracy would be an unacceptable measure of the predictive value of this type of model. In these cases, the nature and direction of errors can provide a better estimate of the overall value of the model. By adjusting the “costs” associated with false positives or misses, the model can be refined to better predict low-frequency events. These costs can be balanced to create a model that accurately identifies cases of interest while limiting the number of false alarms. Unfortunately, analysts often are in the position of attempting to model infrequent or rare events, events that can change rapidly. Therefore, specific attention to errors and the nature of errors is required. In some situations, anything that brings decision accuracy above chance is adequate. In other situations, however, errors in analysis or interpretation can result in misallocated resources and wasted time, and can even cost lives. As always, significant domain expertise and extensive knowledge of the operational objectives, data resources, procedures, and goals are essential in creating predictive models that are operationally reasonable. It is essential that the analysts work closely with the operational end users during this phase of the process to ensure that the models are valuable and actionable in the applied setting and that any necessary compromises in accuracy are acceptable.
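The base-rate arithmetic above can be made concrete with a small simulation, assuming scikit-learn; the data are synthetic and purely illustrative. Weighting the rare class raises the “cost” of a miss relative to a false alarm, and the confusion matrix, rather than overall accuracy, shows how those errors trade off.

```python
# A minimal sketch of cost-sensitive evaluation on a rare event; synthetic data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic data with a roughly 3% event rate and some real signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=2000) > 2.2).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Always predicting "no" would already be ~97% accurate here, so overall
# accuracy is a misleading yardstick; weight the rare class instead.
model = RandomForestClassifier(class_weight={0: 1, 1: 10}, random_state=0)
model.fit(X_tr, y_tr)

# Rows are actual no/yes, columns predicted no/yes; the off-diagonal cells
# (false alarms and misses) are what the cost adjustment trades against.
print(confusion_matrix(y_te, model.predict(X_te)))
```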

Finally, it is important to evaluate the models created and relationships identified to ensure that they make sense. The importance of domain expertise and tacit knowledge in the interpretation and evaluation of analytical results cannot be overstated. On the other hand, it does not necessarily indicate a failure of the process if the analysis raises as many or more questions than it answers. The data mining process includes confirmation of known or suspected relationships as well as surprise and discovery. The knowledge discovery associated with unanticipated outcomes can greatly increase our understanding of crime and criminal behavior, and result in novel approaches to enhancing public safety.

Operationally Actionable Output

The ability to translate complex analytical output into a format that can be used directly in the operational setting to support prevention and enforcement strategies is critically important to effective data mining in the applied public safety and security setting. Sophisticated analytical tools, including data mining software applications, have been commercially available for several years, and complex analytical strategies are commonplace in academic criminal justice research. Only relatively recently, however, have these tools and approaches started to be used in the applied public safety and security arena, in large part because the analytical output generated by sophisticated algorithms and tools previously had little direct relevance to the applied setting. As discussed above, overly complex models, while accurate and reliable, can be somewhat limited if they are too difficult to interpret to be useful to the end users. On the other hand, innovative approaches to conveying complex analytical output in a format that not only is readily interpreted and understood by the end users but also leverages their tacit knowledge and domain expertise can add significant value to the analytical process and outcomes. Therefore, the critical importance of “operationally actionable” analysis and output will be referred to repeatedly throughout this text and addressed in more detail in Chapter 8.

Summary

The Actionable Mining and Predictive Analysis model presented above differs from the first two models in its specificity to the public safety and security domains, as well as in the inclusion of operationally relevant preprocessing and output. Specifically, this model includes operationally relevant recoding and variable selection, public safety and security-specific model evaluation, and an emphasis on operationally actionable output. Table 4-2 compares the three analytical process models covered in this chapter.

Table 4-2 Comparison of the CRISP-DM, CIA Intelligence Process and Actionable Mining and Predictive Analysis Analytical Process Models.

CRISP-DM                  CIA Intelligence Process       Actionable Mining and Predictive Analysis
Business understanding    Needs                          Question or challenge
Data understanding        Collection                     Data collection and fusion
Data preparation          Processing and exploitation    Operationally relevant preprocessing
Modeling                  Analysis and production        Identification, characterization, modeling
Evaluation                Dissemination                  Public safety–specific evaluation
Deployment                Feedback                       Operationally actionable output

Data mining is as much analytical process and tradecraft as it is specific mathematical algorithms and statistics. This process of data exploration and the associated surprise and discovery, which are the hallmarks of data mining, can be as exciting as they are challenging; analysts are rewarded with a progressively evolving list of questions to be answered as the data reveal additional insights and relationships.

The challenge facing public safety and security analysts lies in being able to craft an analytical process model that can accommodate differences in collection methodologies and functional domains, yet also transcend these differences in support of global applicability. The limitation in that approach, particularly in such a functionally diverse field as applied public safety and security analysis, is that different questions, sources, tactics, and strategies require different analytical approaches and sometimes significantly different analytical processes. Therefore, the Integrated Process Model can be thought of as being similar to a building code, which outlines the specific elements that should be addressed and offers a suggested sequence of steps that should be covered within the larger process. Following this analogy, the specific protocols for each unique analytical task are the blueprints that operationalize these broader elements and concepts for specific public safety, intelligence, and security analyses.

In keeping with this concept, the next five chapters address specific elements or phases in the Integrated Process Model. Following that, specific public safety and security questions, topics, and challenges are addressed in greater detail. Specific analytical “blueprints” are provided in each chapter, outlining a specific application of data mining in the applied public safety setting. While it is unlikely that these recommended analytical strategies will fit perfectly with every situation in any department, they should represent a reasonable approximation or template that analysts can apply to their particular situation. Hopefully, as the use of data mining and predictive analytics becomes more widespread in the applied setting and a critical mass of end users is attained, the availability of these blueprints will increase concomitantly.

4.4 Bibliography

1. Office of Public Affairs, Central Intelligence Agency. (1993). A consumer’s guide to intelligence. National Technical Information Service, Springfield, VA.

2. Chapman, P., Clinton, J., Kerber, R., Khabaza, T., Reinartz, T., Shearer, C., and Wirth, R. (1999). CRISP-DM 1.0: Step-by-step data mining guide. http://www.crisp-dm.org

3. Douglas, J.E., Burgess, A.W., Burgess, A.G., and Ressler, R.K. (1997). Crime classification manual: A standard system for investigating and classifying violent crimes. Jossey-Bass, Hoboken, NJ.

4. Lord, W.D., Boudreaux, M.C., and Lanning, K.V. (2001). Investigation of potential child abduction cases: A developmental perspective. FBI Law Enforcement Bulletin, April.

5. The intelligence community analytical processes and strategies are historically rich and very interesting. The following overview is not meant to be inclusive. Other process models include the FBI Intelligence Process (http://www.fbi.gov/intelligence/process.htm), which is very similar to the CIA model. For additional reading in this area, see: Lowenthal, M.M. (2003). Intelligence: From secrets to policy. CQ Press, Washington, D.C.

6. Lowenthal (2003).

7. CIA Handbook; and Air War College, Gateway to Intelligence (http://www.au.af.mil/au/awc/awcgate/awc-ntel.htm#humint).

8. Chapman, P., Clinton, J., Kerber, R., Khabaza, T., Reinartz, T., Shearer, C., and Wirth, R. (1999). CRISP-DM 1.0: Step-by-step data mining guide. CRISP-DM Consortium (www.crisp-dm.org).

9. Ibid.

10. For review, see Piatetsky-Shapiro, G. (1999). CRISP-DM: A proposed global standard for data mining. DS Star, 3, no. 15; www.taborcommunications.com/dsstar/99/0413/990413.html; KDnuggets (2002). What main methodology are you using for data mining? July; http://www.kdnuggets.com/polls/2002/methodology.htm; and Shearer, C. (2000). The CRISP-DM model: The new blueprint for data mining. Journal of Data Warehousing, 5, 13–22.

11. Chapman et al. (1999).

12. Faggiani, D., and McLaughlin, C.R. (1999). A discussion on the use of NIBRS data for tactical crime analysis. Journal of Quantitative Criminology, 15, 181–191.

13. Goldstein, P.J. (1985). The drugs/violence nexus: A tripartite conceptual framework. J Drug Issues, 15, 493–506.
