Chapter 1

Calculus Ratiocinator

Abstract

There is more need than ever to realize Leibniz's Calculus Ratiocinator proposal: a machine that simulates human cognition but without the inherent subjective biases of humans. This need is seen in how predictive models based upon observational data often vary widely across different blinded modelers or across standard automated variable selection methods applied to the same data. The unreliability is amplified in today's Big Data environment with its very high-dimension confounding candidate variables. The single biggest reason for this unreliability is uncontrolled error, which is especially prevalent with highly multicollinear input variables. Modelers are forced to make arbitrary or biased subjective choices to work around these problems because widely used automated variable selection methods, such as standard stepwise methods, are simply not built to handle such error. The stacked ensemble method that averages many different elementary models is reviewed as one way to avoid such bias and error and generate reliable predictions, but it has disadvantages including a lack of automation and a lack of transparent, parsimonious, and understandable solutions. A form of logistic regression that also models error events as a component of maximum likelihood estimation, called Reduced Error Logistic Regression (RELR), is then introduced as a method that avoids this multicollinearity error. An important neuromorphic property of RELR is that it shows stable explicit and implicit learning with small training samples and high-dimension inputs, as observed in neurons. Other important neuromorphic properties of RELR consistent with a Calculus Ratiocinator machine are also introduced, including the ability to produce unbiased, automatic, stable maximum probability solutions and stable causal reasoning based upon matched sample quasi-experiments. Given RELR's connection to information theory, these stability properties are the basis of the new stable information theory that is reviewed in this book, with wide-ranging causal and predictive analytics applications.

Keywords

Analytic science; Big data; Calculus Ratiocinator; Causal analytics; Causality; Cognitive neuroscience; Cognitive science; Data mining; Ensemble learning; Explanation; Explicit learning and memory; High dimension data; Implicit learning and memory; Information theory; Logistic regression; Machine learning; Matching experiment; Maximum entropy; Maximum likelihood; Multicollinearity; Neuromorphic; Neuroscience; Observational data; Outcome score matching; Prediction; Predictive analytics; Propensity score; Quasi-experiment; Randomized controlled experiment; Reduced error logistic regression (RELR); Stable information theory

“It is obvious that if we could find characters or signs suited for expressing all our thoughts as clearly and as exactly as arithmetic expresses numbers or geometry expresses lines, we could do in all matters insofar as they are subject to reasoning all that we can do in arithmetic and geometry. For all investigations which depend on reasoning would be carried out by transposing these characters and by a species of calculus.”

Gottfried Leibniz, Preface to the General Science, 1677.1


Near the end of his life, starting in 1703, Gottfried Leibniz engaged in a 12-year feud with Isaac Newton over who first invented the calculus and who committed plagiarism. All serious scholarship now indicates that both Newton and Leibniz developed calculus independently.2 Yet, stories about Leibniz's invention of calculus usually focus on this priority dispute with Newton and give much less attention to how Leibniz's vision of calculus differed substantially from Newton's. Whereas Newton was trained in mathematical physics and remained associated with academia during the most creative period of his career, Leibniz's early academic struggles in mathematics led him to become a lawyer by training and an entrepreneur by profession.3 So Leibniz's deep mathematical insights that led to calculus occurred away from any professional university association. Unlike Newton, whose mathematical interests seemed tied entirely to physics, Leibniz clearly had a much broader goal for calculus. He envisioned applications well beyond physics, in areas that seem to have nothing to do with mathematics. His dream application was for a Calculus Ratiocinator, which is synonymous with Calculus of Thought.4 This can be interpreted as a very precise mathematical model of cognition that could be automated in a machine to answer any important philosophical, scientific, or practical question that would traditionally be answered with subjective human conjecture.5 Leibniz proposed that if we had such a cognitive calculus, we could just say “Let us calculate”6 and always find the most reasonable answers uncontaminated by human bias.

In a sense, this concept of Calculus Ratiocinator foreshadows today's predictive analytic technology.7 Predictive analytics are widely used today to generate better-than-chance longer term projections for more stable physical and biological outcomes like climate change, schizophrenia, Parkinson's disease, Alzheimer's disease, diabetes, cancer, and optimal crop yields, and even good short-term projections for less stable social outcomes like marriage satisfaction, divorce, successful parenting, crime, successful businesses, satisfied customers, great employees, successful ad campaigns, stock price changes, and loan decisions, among many others. Until the widespread practice of predictive analytics that came with the introduction of computers in the past century, most of these outcomes were thought to be too capricious to have anything to do with mathematics. Instead, such questions were traditionally answered with speculative and biased hypotheses or intuitions often rooted in culture or philosophy (Fig. 1.1).


Figure 1.1 Gottfried Wilhelm Leibniz.8

Until very recently, standard computer technology could evaluate only a small number of predictive features and observations. But, we are now in an era of big data and high-performance massively parallel computing, so our predictive models should now become much more powerful. It would seem reasonable to expect that the traditional methods that selected important predictive features from small data will scale to high-dimension data and suddenly yield predictive models that are much more accurate and insightful. This would give us a new and much more powerful big data machine intelligence technology that is everything Leibniz imagined in a Calculus Ratiocinator. Big data massively parallel technology should thus theoretically allow completely new data-driven cognitive machines to predict and explain capricious outcomes in science, medicine, business, and government.

Unfortunately, it is not this simple, because observation samples are still fairly small in most of today's predictive analytic applications. One reason is that most real-world data are not representative samples of the population to which one wishes to generalize. For example, the people who visit Facebook or search on Google might not be a good representative sample of many populations, so smaller representative samples will need to be drawn if the analytics are to generalize well. Another problem is that many real-world data are not independent observations and instead are often repeated observations from the same individuals. For this reason, data also need to be downsampled significantly to yield independent observations. Still another problem is that even when there are many millions of independent representative observations, there is usually a much smaller number of individuals who did things like respond to a particular type of cancer drug, commit fraud, or respond to an advertising promotion in the recent past. The informative sample for a predictive model is the group of targeted individuals and a group of similar size that did not show such a response, and these are not usually big data samples in terms of large numbers of observations. So, the biggest limitation of big data in the sense of a large number of observations is that most real-world data are not “big” and instead have limited numbers of observations. This is especially true because most predictive models are not built from Facebook or Google data.9

Still, most real-world data are “big” in another sense: they are very high dimensional, given that interactions between variables and nonlinear effects are also predictive features. Previously, we did not have the technology to evaluate high dimensions of potentially predictive variables rapidly enough to be useful. The slow processing that was the reason for this “curse of dimensionality” is now behind us. So many might believe that this suddenly allows the evaluation of almost unfathomably high dimensions of data for the selection of important features in much more accurate and smarter big data predictive models, simply by applying the traditional widely used methods.

Unfortunately, the traditional widely used methods often do not give unbiased or non-arbitrary predictions and explanations, and this problem will become ever more apparent with today's high-dimension data.

1 A Fundamental Problem with the Widely Used Methods

There is one glaring problem with today's widely used predictive analytic methods that stands in the way of our new data-driven science. This problem is inconsistent with Leibniz's idea of an automated machine that can reproduce the very computations of human cognition, but without the subjective biases of humans. This problem is suggested by the fact that there are probably at least hundreds of predictive analytic methods in use today. Each method makes differing assumptions that would not be agreed upon by all, and all have at least one and sometimes many arbitrary parameters. This arbitrary diversity is defended by those who appeal to a “no free lunch” theorem, which argues that there is no one best method across all situations.10,11 Yet, when predictive modelers test various arbitrary algorithms based upon these methods to find a best model for a specific situation, they obviously can test only a tiny subset of the possibilities. So unless there is an obvious, very simple best model, different modelers will almost always produce substantially different arbitrary models with the same data.

As examples of this problem of arbitrary methods, there are different types of decision tree methods, like CHAID and CART, which use different statistical tests to determine branching. Even with the very same method, different user-provided parameters for splitting the branches of the tree will often give quite different decision trees that generate very different predictions and explanations. Likewise, there are many widely used regression variable selection methods, like stepwise and LASSO logistic regression, that all differ in the arbitrary assumptions and parameters employed to select important “explanatory” variables. Even with the very same regression method, different user choices for these parameters will almost always generate widely differing explanations and often substantially differing predictions. There are other methods, like Principal Component Analysis (PCA), Variable Clustering, and Factor Analysis, that attempt to avoid the variable selection problem by greatly reducing the dimensionality of the variables. These methods work well when the data match their underlying assumptions, but most behavioral data will not be easily modeled with those assumptions, such as the orthogonal components assumed by PCA, or the assumption in the other methods that one knows how to rotate the components to a nonorthogonal solution when there are an infinite number of possible rotations. Likewise, there are many other methods, like Bayesian Networks, Partial Least Squares, and Structural Equation Modeling, that modelers often use to make explanatory inferences. These methods each make differing arbitrary assumptions that often generate wide diversity in explanations and predictions. Likewise, there are a large number of fairly black box methods, like Support Vector Machines, Artificial Neural Networks, Random Forests, Stochastic Gradient Boosting, and various Genetic Algorithms, that are not completely transparent in their explanations of how predictions are formed, although some measure of variable importance often can be obtained. These methods can generate quite different predictions and important variables simply because of differing assumptions across the methods or differing user-defined modeling parameters within the methods.
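This sensitivity is easy to illustrate with a purely hypothetical sketch: synthetic data, arbitrary settings, and scikit-learn's DecisionTreeClassifier standing in for any tree method. The only point is that, with correlated candidate variables, the variable importances and split choices produced by the very same method can shift with the modeler's parameter choices and with the training sample.

```python
# Hypothetical sketch: one decision tree method, two arbitrary-but-reasonable
# parameter settings, two resamples of the same data. With correlated candidate
# variables, the resulting trees often disagree about which variable matters.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 1000
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.2 * rng.normal(size=n)   # near-copies of x1: multicollinear
x3 = 0.9 * x1 + 0.2 * rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = (x1 + 0.7 * rng.normal(size=n) > 0).astype(int)

settings = {"A": dict(max_depth=2, min_samples_leaf=5),
            "B": dict(max_depth=6, min_samples_leaf=100)}
for name, params in settings.items():
    for sample in range(2):                              # two bootstrap resamples
        idx = rng.integers(0, n, size=n)
        tree = DecisionTreeClassifier(random_state=sample, **params).fit(X[idx], y[idx])
        print(name, sample, np.round(tree.feature_importances_, 2))
```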

Because there are so many methods, and because all require unsubstantiated modeling assumptions along with arbitrary user-defined parameters, if you gave exactly the same data to 100 different predictive modelers, you would likely get 100 completely different models unless the solution was very simple. These differing models often would make very different predictions and would almost always generate different explanations, to the extent that the method produces transparent models that can be interpreted. In cases where regression methods are used and raw interaction or nonlinear effects are parsimoniously selected without accompanying main effects, the model's predictions are even likely to depend on how variables are scaled, so that measuring currency in Dollars versus Euros would give different predictions.12 Because of such variability, which can even defy basic principles of logic, it is unreasonable to interpret any of these arbitrary models as reflecting a causal and/or most probable explanation or prediction.

Because the widely used methods yield arbitrary and even illogical models in many cases, hardly can we say “Let us calculate” to answer important questions such as the most likely contribution of environmental versus genetic versus other biological factors in causing Parkinson's disease, Alzheimer's disease, prostate cancer, breast cancer, and so on. Hardly can we say “Let us calculate” when we wish to provide a most likely explanation for why there is climate change, or why certain genetic and environmental markers correlate with diseases, or why our business is suddenly losing customers, or how we might decrease costs and yet improve quality in health care. Hardly can we say “Let us calculate” when we wish to know the extent to which sexual orientation and other average gender differences are determined by biological factors or by social factors, when we wish to know whether stricter gun control policies would have a positive or negative impact on crime and murder rates, or when we wish to know whether austerity as an economic intervention tool is helpful or hurtful. Because our widely used predictive analytic methods are so influenced by completely subjective human choices, predictive model explanations and predictions about human diseases, climate change, and business and social outcomes will have substantial variability simply due to our cognitive biases and/or our arbitrary modeling methods. The most important questions of our day concern economic, social, medical, and environmental outcomes connected to human behavior as cause or effect, but our widely used predictive analytic methods cannot answer these questions reliably.

Even when the very same method is used to select variables, the important variables that the model selects as the basis of explanation are likely to vary across independent observation samples. This sampling variability will be especially prevalent if the observations available to train the model are limited, if there are many possible features that are candidates for explanatory variables, and if there is more than a modest correlation between at least some of the candidate explanatory variables. This problem of correlation between variables, or multicollinearity, is ultimately the real culprit, and it is almost always present with human behavior outcomes. Unlike many physical phenomena, behavioral outcomes usually cannot be understood in terms of easy-to-separate uncorrelated causal components. Models based upon randomized controlled experiments can avoid this multicollinearity problem through designs that yield orthogonal variables.13 Yet, most of today's predictive analytic applications necessarily must deal with observational data, as randomized experiments are usually simply not possible with human behavior in real-world situations. Leo Breiman, who was one of the more prominent statisticians in recent memory, referred to this inability to deal with multicollinearity error as “the quiet scandal of statistics” because the attempts to avoid it in traditional predictive modeling methods are arbitrary and problematic.14
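To see why multicollinearity inflates estimation error, it helps to recall the standard variance inflation result for ordinary least squares (logistic regression behaves analogously, although it has no such simple closed form):

$$\operatorname{Var}\!\left(\hat{\beta}_j\right) \;=\; \frac{\sigma^2}{(n-1)\operatorname{Var}(x_j)} \cdot \frac{1}{1 - R_j^2},$$

where $R_j^2$ is the proportion of variance in candidate variable $x_j$ explained by the other candidates. The factor $1/(1-R_j^2)$ is the variance inflation factor: as the candidates approach collinearity ($R_j^2 \rightarrow 1$), the sampling variance of the estimated regression weight grows without bound, so small changes in the training sample can flip which of several correlated variables appears important, or even flip the sign of its weight.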

There is a wonderful demonstration of this Breiman “quiet scandal” problem in a paper by Austin and Tu in 2004.15 They were interested in the possibility of using binary logistic regression to answer “Why do some people die after a heart attack?” With only 29 candidate variables, they observed 940 models with differing selected variables out of 1000 total models built from different training samples. They found similar instability regardless of whether backward, forward, or stepwise variable selection was used. Compounding the problem beyond this sampling issue, each of these variable selection methods has its own set of arbitrary parameters. These include the level of significance for a variable to be selected, whether and how correlated variables are removed before any variable selection, and whether further criteria are employed to ensure a parsimonious model.16 So the amount of variability in the variable selection models that could be produced from just 29 variables in this data set must be truly enormous. What compounds this problem even further is that the regression weight corresponding to a given selected variable often varies dramatically, and even reverses in sign, depending on the other variables selected in a solution.17
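This instability is easy to reproduce on synthetic data. The following Python sketch is a hypothetical illustration (it is not the Austin and Tu data or procedure): it repeats forward selection on bootstrap resamples of one data set with correlated candidates and counts how many distinct variable subsets get chosen.

```python
# Hypothetical sketch of variable selection instability: repeat forward
# selection on bootstrap resamples of the same data and count the distinct
# variable subsets that are chosen.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SequentialFeatureSelector

rng = np.random.default_rng(0)
n, p = 300, 20
latent = rng.normal(size=(n, 1))
X = 0.7 * latent + 0.7 * rng.normal(size=(n, p))        # correlated candidates
logit = X[:, :3].sum(axis=1)                            # only 3 variables truly matter
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))

selected = set()
for b in range(20):                                     # 20 bootstrap training samples
    idx = rng.integers(0, n, size=n)
    sfs = SequentialFeatureSelector(LogisticRegression(max_iter=2000),
                                    n_features_to_select=3, direction="forward", cv=5)
    sfs.fit(X[idx], y[idx])
    selected.add(tuple(sfs.get_support(indices=True)))

print(f"{len(selected)} distinct 3-variable models selected across 20 resamples")
```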

So not only do different widely used methods generate very different models, but even the same widely used variable selection method is likely to generate quite different models on separate occasions simply because of multicollinearity error and/or incorrect parameter choices. Because observations are usually limited, multicollinearity error will be a problem for these widely used predictive methods in most applications. Cognitive bias in the selection of models will also be a problem, even with large numbers of observations. Unreliability due to multicollinearity error and cognitive bias is seen even with fewer than 30 candidate features, as in the Austin and Tu study, where expert judgments to override the automated stepwise methods also produced highly unreliable variable selections.18 In today's Big Data high-dimension problems, where potentially millions and even billions of candidate features are possible once interaction and nonlinear effects are derived from a large number of input variables, the extent of variability due simply to multicollinearity error and cognitive bias is almost astronomical.

This variability is usually explained in the predictive analytics community with the claim that building a model is much more of an art than a science. It is true that the arbitrary parameters and assumptions within the widely used methods give the modeler substantial artistic license to choose the final form of the model. For this reason, a situation where each modeler tells a different explanatory story and makes different predictions with the same data is definitely much more like art than science. But this is actually the heart of the problem, because we do not wish to engage in art; we hope instead to have a science. So, if we seek a mechanical Calculus Ratiocinator that is completely devoid of cognitive bias and multicollinearity error in the prediction and explanation of outcomes, today's widely used predictive analytic methods will not get us there. In fact, these methods will likely cause greater confusion than resolution because they will all tend to give entirely different solutions depending upon the method employed, the arbitrary and subjective choices of the person who built the model, and the data sample used.

While diverse and arbitrary explanations are the most glaring problem, diverse and arbitrary predictions can be the most risky problem, especially with multicollinear high-dimension data. With high-dimension genetic microarray data, the MAQC-II study of 30,000 predictive models reported substantial evidence of predictive variability across different modelers, data samples, and traditional variable selection methods.19 Because this predictive modeling variability directly causes decision making variability, it creates a substantial amount of risk and uncertainty. In fact, the MAQC-II study reported that the single best predictor of model failure, in the sense of poor blinded sample predictive accuracy, was discrepant variable selection across different modelers and methods for models built from the very same data set. The MAQC-II study suggested that such discrepancy was likely to occur unless it was a rare and very easy problem in which selection of just a few features gave a highly predictive solution. Thus, variable selection variability due to cognitive bias and/or multicollinearity error is not only indicative of a problematic explanation but may also warn of risky real-world predictions.

To reiterate, predictive modeling with observational data is likely to generate highly suspect explanations and predictions when models are built with widely used methods, due to cognitive bias and multicollinearity error. These problems will be even more apparent with today's high-dimension data, because there are simply many more degrees of freedom to drive even greater variability. When we produce a model for questions like “why are we losing customers, and can we predict this loss next month?” or “why do some people die immediately after a heart attack, and can we predict this occurrence?” or “why do some people default on their loans, and can we predict who is most at risk in a potentially changing economy?” or “why has climate change occurred, and can we stop it from occurring?” or “what is the optimal treatment regimen in terms of cost that does not sacrifice quality of health care?”, we would hope that the model will not change when we build it using a different representative training data sample, another expert modeler, another method, or even another measurement scale for the units of variables. That, indeed, is unlikely to be the case with our widely used methods. Unfortunately, this variability will be greatly magnified in today's Big Data environment with ultrahigh-dimension data.

2 Ensemble Models and Cognitive Processing in Playing Jeopardy

One way to avoid the risks of an arbitrary, unstable model is simply to average different elementary models produced by different arbitrary methods and/or different expert modelers. Given appropriate sampling from the population of possible models, the greater the number and variety of elementary models that go into the average, the more likely it is that the grand average ensemble model will produce accurate and reliable predictions that are not arbitrary. The exceptional Jeopardy game show performance by IBM's Watson exemplifies this power of ensemble modeling. In early 2011, we were all mesmerized by how Watson was able to perform better than the most capable human Jeopardy players. While we do not know the fine details of the ensemble machine learning process used in Watson, we do know that it is based upon an ensemble average of many elementary models.

Although there are many ways to construct an ensemble model, we also know that the most successful ensemble models, like that which won the Netflix prize, have been produced as a grand average “stacked” ensemble model across hundreds of elementary models.20 In many other predictive machine learning applications, such blended or stacked ensemble models are usually the most accurate methods, as was evidenced by the best performing methods in the Heritage Health Prize contest in 2011. There are other ensemble methods like Random Forest and its related hybrid Stochastic Gradient Boosting that are still sometimes accurate and much easier to implement than the stacked ensemble method because they are stand-alone methods that do not require averaging models built from many different methods. Yet, methods based upon Random Forest still have significant bias and multicollinearity error problems,21 so a more comprehensive ensemble averaging across a large number of different methods as is done in the stacked ensemble seems to be the more effective way to remove bias and error problems. Indeed, Seni and Elder's analysis of why ensemble methods perform well suggests that the averaging creates more stable predictions that are less susceptible to the error problems that would be especially problematic with multicollinear features.22
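As a minimal hedged sketch of the stacking idea, a meta-learner can be trained on the cross-validated predictions of a few diverse base models. The data are synthetic and the three base learners are arbitrary placeholders; real stacked ensembles such as the Netflix prize winner blended hundreds of elementary models, not three.

```python
# Minimal sketch of a stacked ensemble: a meta-learner blends the
# cross-validated predictions of diverse elementary models.
from sklearn.datasets import make_classification
from sklearn.ensemble import (StackingClassifier, RandomForestClassifier,
                              GradientBoostingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=40, n_informative=8,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
                ("gbm", GradientBoostingClassifier(random_state=0)),
                ("svm", SVC(probability=True, random_state=0))],
    final_estimator=LogisticRegression(max_iter=1000),   # blends the base predictions
    cv=5)
stack.fit(X_train, y_train)
print("held-out accuracy:", round(stack.score(X_test, y_test), 3))
```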

If we can build stacked ensemble models that are not arbitrary, when a large enough assortment of models is sampled to produce the average, then this type of ensemble machine learning should give stable predictions that do not depend on the bias of any one individual modeler. Because stacked ensemble models also can simulate human cognition, as in Watson's Jeopardy performance, this may be a basis for Leibniz's Calculus Ratiocinator. In fact, Watson's performance seemed quite comparable with that of the best human Jeopardy players in all respects, except that he was faster and showed greater breadth of knowledge. Thus, Watson may have realized Leibniz's dream for an objective calculus of thought that is also a basis for our new data-driven cognitive machines. But even though Watson was highly successful at simulating human cognition, let us not move so fast. Let us first ask: what aspect of human cognition did Watson simulate?

While Watson's stacked ensemble models usually would be very good at retrieving names and other semantically appropriate words and phrases as required in Jeopardy, or at predicting movie ratings with accurate intuition as in Netflix, these models are not proficient at providing causal explanations for why a specific name or fact has been retrieved or why a certain prediction is intuited. Indeed, good prediction does not constitute a good causal explanation, as we know from Ptolemy's extremely complicated ancient astronomical model that put the earth at the center of the universe. The complexity of stacked ensemble models is one reason that they do not work as explanatory models. Causal explanations that humans can understand are always parsimonious, as the conscious human brain has a significant capacity limit that does not allow more than a small set of conscious features to be maintained, readily accessed, and understood. Easily understood explanations tend to chunk many elementary features into larger meaningful representations that are both very parsimonious and informative.23

Yet, the human brain is also capable of retrieving names or facts or correctly predicting actions without any conscious causal explanation of the features that cause these outcomes. This happens when a familiar name pops into our mind in response to a memory cue, but we have no conscious recollection of how we know that this name is correct in terms of an episode in our lives or associated facts that caused us to know it is true. This also happens all the time when we speak, as the words just seem to flow, and we do not usually need to consciously retrieve a reason for why we have chosen any particular word. As an example of how such automatic processing can be used in Jeopardy, the cue “Real estate owned with no claims on it is said to be free and …” causes the word “clear” to pop into my mind. But, since I never took a course in business law, I have no understanding as to why this particular word is retrieved, feels familiar, and gives the correct answer. I only noticed later that a frequently running television commercial, which I have probably seen at least 20 times but have always tried to ignore, uses this very same phrase. We will see that ensemble modeling would be very good at this type of purely predictive rote learning process that requires no associated conscious causal explanation or even any conscious learning intention.

There are certainly other times when I play Jeopardy and am conscious of the causal linkages that explain my memory. As an example, say that the Jeopardy category is First Ladies and the cue is “She was First Lady for 12 years and 39 days.” In this case, I immediately know that Franklin Roosevelt was President for more than two four-year terms, and I immediately realize that his wife was Eleanor. So, I respond with “Who was Eleanor Roosevelt?” In this particular Jeopardy example, some people may even remember an episode in their lives when they learned one or both of these two facts about Franklin and Eleanor Roosevelt. Ensemble modeling would not be able to produce this kind of explanation based upon a few associated features that caused the recollection memory.

3 The Brain's Explicit and Implicit Learning

We will eventually recognize the basic memory process that Watson seems to simulate. But first we need a more general overview of the brain's learning and memory processes. Like artificial intelligence and machine learning research, cognitive neuroscience is an evolving field where there are still many areas of knowledge that have yet to be clarified. Yet, there is now fairly good agreement among cognitive neuroscientists in viewing the brain's learning and memory processing along a slow explicit learning versus fast implicit learning continuum.24

Explicit learning is characterized as slower, conscious, intentional processing where dysfunction in the brain's medial temporal lobe structures including the hippocampus will disrupt storage and retrieval of more recent memories. Episodic memory, or the conscious memory for events in our lives, along with semantic memory, or the conscious memory for facts and concepts, are explicit processes because the new learning of these types of memory is disrupted by medial temporal lobe injuries. Implicit learning is characterized as a rapid, unconscious, automatic processing that takes place independently of the brain's medial temporal lobe structures. Implicit learning is exemplified in procedural memory skills such as bicycle riding, typing, walking, piano playing, and automobile driving.

Working memory can be described as a limited-capacity, short-term conscious memory system that is very important in learning and that interacts with and controls attention. Working memory is believed to arise through the temporary reactivation and reconfiguring of existing long-term memory representations.25,26 Working memory tasks can involve either deep or shallow processing in what might be considered intermittent cached updating of long-term memory. Deep processing requires effort and is slow because it elaborates on the meaning of stimuli, as when we determine whether words that we read belong to a certain category such as “fish”. The transfer of new associations into longer term explicit memories is likely to result from deep processing. Yet, long-term explicit memories are much less likely to result from shallow processing, as when we determine whether words visually presented to us are in upper or lower case.27 But shallow processing does lead to unconscious implicit long-term memories through rote repetition and priming. Repetition priming occurs when the previous presentation of a stimulus such as a word makes it more likely that this word, or a semantically related word, will be remembered even though we may not recall the previous presentation episode. Implicit memories due to rote repetition priming can be learned very rapidly with very few repetition trials.28

An important book by Nobel Laureate Daniel Kahneman titled Thinking, Fast and Slow reviews how these slow, deep, explicit and fast, shallow, implicit cognitive mechanisms are fundamental processes that define how the brain works and how human brain decisions are determined.29 Intuitive snap judgments are examples of fast, shallow, implicit processing, whereas well-reasoned causal explanations are examples of slow, deep, explicit processing. Much of Kahneman's book is concerned with how subjective cognitive biases cause errors in human decisions. My task in this book will be to propose how these implicit and explicit brain learning modes can be understood and modeled in terms of underlying machine learning mechanisms. Such understanding would allow cognitive machines similar to Leibniz's concept of a Calculus Ratiocinator that are devoid of arbitrary subjective biases and assumptions.

Playing the piano can be a good example of implicit learning. I took piano lessons for several years when I was in my early twenties, and my piano teacher incorporated a simple reward learning feedback process into his teaching. The teacher's rule was that I had to play a very small section of a piece 10 times in a row correctly before I could move to the next section; he would then apply the same principle to larger sections that grouped together the smaller sections. In its most mechanical form, the only feedback provided was a label that the targeted small section had been played correctly or incorrectly without any reason for why. As I began to play the target piece many times in a row correctly, I would notice that my mind would begin to wander. So I no longer needed to be conscious of the individual notes or chords that I was playing or even whether or not the teacher was about to give me feedback that I had played it incorrectly. When that occurred, my brain was able to predict and retrieve appropriate movements without my conscious awareness of this processing.

A similar story could be told for any implicit learning due to rote repetition. Conscious involvement is not necessary to retrieve these implicit memories or even to learn the causal explanations as to why they are remembered, although some reinforcement signal apparently must be present during procedural learning which requires motor responses. In implicit memory learning, our behavior eventually becomes guided by feeling rather than by thinking. During implicit memory retrieval, our performance even becomes impaired when we think too much, as when a basketball player may choke at the free throw line when they think about the causal steps in successful execution.

Usually, explicit and implicit learning happen concurrently in the normal human brain, but brain injury studies afford a way to observe these processes independently. As an example, there is striking brain injury patient evidence that implicit memories can be learned and retrieved without a functioning explicit memory learning system. New York Times writer Charles Duhigg is probably best known in the predictive analytics community for popularizing the story of the American retailer Target's efforts to predict buying habits related to pregnancy in his book The Power of Habit.30 Yet, Duhigg's book also contains quite informative material about the neuroscience of implicit memory. For example, Duhigg reports on his interviews with cognitive neuroscientist Larry Squire of the University of California—San Diego about a medial temporal lobe encephalitis brain injury patient known in the medical literature as E.P. What is most striking about E.P. is that although his brain injury destroyed his medial temporal lobe's ability to form new explicit memories, he was still able to learn and retrieve new implicit memories.31 When E.P.'s family moved to a new town after his brain injury occurred, he acquired the habit of going on regular long walks. Amazingly, he would never get lost even though he had no conscious recollection of these walks. Thus, rote repetitive inputs associated with routine responses that are often rewarded are apparently all that is needed to learn new complex implicit procedural memories, even without any explicit memory recall. Just as piano playing is a skill that can be acquired by chunking together elementary behaviors with reinforcement learning to result in a complex procedural memory, navigating a walk through a completely novel environment also appears to be such a skill.

The work of Ann Graybiel32 indicates that neural circuits involving the basal ganglia are necessary for implicit skill learning. This appears to arise from the transfer of neocortical input signals to basal ganglia output signals. As many as 10,000 neocortical neurons may converge on a single projection neuron in the striatum, which is the largest input component to the basal ganglia. These striatum projection neurons also converge with an estimated convergence rate of about 80:1 onto target neurons in primates. Thus, each feed-forward projection stage is similar to how a purely predictive machine ensemble model averages large numbers of elementary submodels into a single predictive output signal. Although this implicit memory computational process has more than a superficial resemblance to stacked ensemble predictive modeling, it is obviously much more complex given the large number of striatum neurons that all generate output predictions in parallel.33

This apparent explicit versus implicit dichotomy actually may be more of a continuum. This is suggested by the feeling of familiarity that occurs when we know that we have experienced an event previously, but cannot retrieve the causal reason. This familiarity process has some properties that are akin to more implicit memories in that it is automatic and rapid and does not involve conscious recollection of the reason for the memory. But, familiarity memory also has properties that are like more explicit memories because it involves the conscious awareness of familiarity and is disrupted in medial temporal lobe brain injuries involving the part of the medial temporal lobe which is adjacent to the hippocampus.34–36 Such an intermediate process might result from partial forgetting of previously learned explicit memories, or from partial learning of explicit memories.

What kind of memory would it take for a machine to play Jeopardy very rapidly and efficiently? At a minimum, this would require automatic and rapid memory mechanisms like primed implicit memories that were learned simply through rote repetition. This mechanism would respond to a cue like “Real estate owned with no claims on it is said to be free and …” by automatically choosing the highest probability prediction based upon previous rote repetition learning. This is consistent with how Watson's ensemble memory worked, as Watson retained the three most likely memories associated with each Jeopardy item and simply chose the most probable response. This type of implicit memory mechanism would be much faster than trying to recollect the causal linkages for the memory through conscious explicit semantic associations or previous learning episodes. So humans may also perform better in Jeopardy when they rely upon an automatic implicit memory mechanism based upon rote repetition priming. In fact, one of Watson's main advantages seemed to be that he was simply faster than the two human contestants.

Because machine stacked ensemble methods can simulate the implicit memory performance of humans, it may be reasonable to theorize that they are similar to the brain's implicit memory ensemble computations. Yet, the analogy is not perfect. This is because highly successful stacked ensemble machine learning is not automatic implicit learning like that which occurs in the brain. Instead, a human is required to decide which elementary models go into the stacked ensemble and to choose tuning parameters that weight the impact of individual models. In the most accurate stacked ensemble modeling, these manual interventions are typically trial-and-error processes that may take weeks, months, or even years to produce the best final model, whereas the brain can learn implicit memories very quickly in a small number of learning trials, as in rote repetition priming. Additionally, machine ensemble models are deployed as scoring models that no longer learn from new data, whereas the brain's implicit memory learning is capable of automatic continued new learning, as in the case of brain injury patient E.P. Ideally, any Calculus Ratiocinator should be capable of automatic learning because that removes much of the potential for human bias.

Although this necessary manual involvement in the learning phase is a limitation, today's machine ensemble models are still very good at implicit memory tasks that do not require a causal explanation for their performance. Whether it is a natural language process in Jeopardy, a recommendation process in Netflix,37 or the host of other purely predictive implicit memory processes in today's machine learning, ensemble models perform extremely well as long as the new predictive environment is stable and not different from the training environment. Yet, these accurate ensemble models are still limited in being just black box predictive machines. In contrast, any Calculus Ratiocinator must be able to generate causal explanations like those that arise from explicit memory learning.

In a sense, the computational models that we can now construct with artificial intelligence are an extension of our own brain. But this extension can be a less intelligent system that is only able to generate short-term predictions with no causal understanding, as might occur in insects, or it can be a very intelligent system that is also able to generate causal explanations, as occurs in intelligent mammals. To understand the importance of explicit memory learning in intelligent systems, we need to remember the famous patient H.M. After his surgery that removed both medial temporal lobes, H.M. completely lost the ability to learn new causal explanations. Every time he shifted the focus of his working memory, it was as though he was constantly waking without understanding the causal reason for how he got to his present situation. As an old man, he still believed that he was the same young man who had undergone the surgery, as he did not even recognize himself in the mirror. Obviously, we want our artificial intelligence models to learn the causal reasons for their predictions, or at least to assist us in understanding these causal reasons, or else they will not be very intelligent extensions of our brains.

4 Two Distinct Modeling Cultures and Machine Intelligence

A famous paper by Leo Breiman characterized the predictive analytics profession as being composed of two different cultures that differ in whether prediction or explanation is the ultimate goal.38 A recent paper by Galit Shmueli reviews this same explanation versus prediction distinction.39 On the one hand, we have purely predictive models, like stacked ensemble models, that do not allow for causal explanations. Even in cases where the features and parameters are somewhat transparent, pure prediction models have too many arbitrarily chosen parameters for causal understanding. Pure predictive models are most successful in today's machine learning and artificial intelligence applications like natural language processing, as in the example of Watson, where the predicted probabilities were very accurate and not likely to be arbitrary. Pure prediction models are suspect when the predicted probabilities or outcomes are inaccurate and likely to change across modelers and data samples. Even the most accurate pure predictive models only work when the predictive environment is stable, which is seldom the case for very long with human social behavior and other similarly chancy real-world outcomes. Sometimes we can update our predictions fast enough to avoid model failure due to a changing environment. But more often that is simply not possible because the model's predictions are for the longer term, as we cannot take back a 30-year mortgage loan once we have made the decision to grant it. Thus, there will be considerable risk when we need to rely upon an assumption that the predictive environment is stable. This point was the thesis of Emanuel Derman's recent book Models Behaving Badly, which argued that the US credit default crisis of 2008–2009 was greatly impacted by the failure of predictive models in a changing economy.40 To avoid the risk associated with black box models and environmental instability, there is a strong desire for models with accurate causal explanatory insights.

The problem is that the popular standard methods used to generate parsimonious “explanatory” models, like standard stepwise regression, do not really work to model putative causes because they often generate completely arbitrary, unstable solutions. This failure is reflected in Breiman's characterization of stepwise regression as the “quiet scandal” of statistics. Like many in today's machine learning community, Breiman ultimately saw no reason to select an arbitrary “explanatory” model and instead urged a focus on pure predictive modeling. Yet, any debate about whether the focus should be on one or the other of these two cultures really misses the idea that the brain seems to have evolved both types of modeling processes, as reflected in implicit and explicit learning. So, both of these modeling processes should be useful in artificial intelligence attempts to predict and explain, as long as there is a semblance to these two basic brain memory learning mechanisms.

The brain's implicit and explicit learning mechanisms both generate probabilistic predictions, but they otherwise have very different properties relating to the diffuse versus sparse characteristics of the underlying neural circuitry and reliance upon direct feedback in these networks.41 The brain's implicit memories seem to be built from large numbers of diffuse and redundant feed-forward neural connections as observed in Graybiel's work on procedural motor memory predictions projecting through basal ganglia. As seen in the patient E.P., these implicit memory mechanisms are not affected by brain injuries specific to the medial temporal lobe involving the hippocampus. In contrast, explicit learning seems to involve reciprocal feedback circuitry connecting relatively sparse representations in hippocampus and neocortex structures.42

Indeed, explicit memory representations in hippocampus appear to become sparse through learning. For example, a study by Karlsson and Frank43 examined hippocampus neurons in rats that learned to run a track to get a reward. In the early novel training, neurons were more active across larger fractions of the environment, yet tended to have low firing rates. As the environment became more familiar, neurons tended to be active across smaller proportions of the environment, and there appeared to be segregation into a higher and a lower rate group where the higher firing rate neurons were also more spatially selective. It is as though these explicit memory circuits were actually developing sparse explanatory predictive memory models in the course of learning. But unlike stepwise regression's unstable and arbitrary feature selection, the brain's sparse explicit memory features are stable and meaningful. Humans with at least average cognitive ability can usually agree on the important defining elements in recent shared episodic experiences that occur over more than a very brief period of time44 or in shared semantic memories such as basic important facts about the world taught in school.

Episodic memory involves the conscious recollection of one or more events or episodes, often in a causal chain, as when we remember the parsimonious temporal sequence of steps in a recent experience such as having just changed an automobile tire. Some cognitive neuroscientists now believe that similar explicit neural processes involving the medial temporal lobe are the basis of causal reasoning and planning. The only distinction is that the imagined causal sequence of episodes that would encompass the retrieved explicit memories is now projected into the future, as when we plan out the steps necessary for changing a tire.45 In fact, evidence suggests that the hippocampus may be necessary to represent the temporal succession of events as required in an understanding of causal sequences.46,47 In contrast to the brain's explicit learning and reasoning mechanisms, parsimonious variable selection methods like stepwise regression usually fail to select causal features and parameters in historical data due to multicollinearity problems. Because of such multicollinearity error, it is obvious that widely used predictive analytics methods are also not smart enough to reason about future outcomes in simulations with any degree of certainty.

Some may say that it would be impossible to model classic explicit “thought” processes in artificial intelligence because consciousness is necessarily involved in these processes. However, machine learning models that simulate the brain's explicit learning and reasoning processes should at least be useful to generate reasonable causal models. If these machine models provide unbiased most probable explanations that avoid multicollinearity error, then these models would be a realization of Leibniz's Calculus Ratiocinator. This is because they would allow us to answer our most difficult questions with data-driven most probable causal explanations and predictions.

5 Logistic Regression and the Calculus Ratiocinator Problem

While there are probably hundreds of different predictive analytics methods that have at least some usage today, regression and decision tree methods are reported to be the most popular tools.48 Of these most popular methods, logistic regression is probably the most widely used today, as it has shown very rapid adoption since it began to appear in statistical software packages in the 1980s.49 Much of the reason for this rapid rise in popularity is that it has effectively replaced older, less accurate methods that were based upon linear regression, like discriminant analysis50 and linear probability models.51 Another reason is that it has interpretation advantages, in the sense of returning solutions that are much more understandable compared with the older probit regression.52 Yet another reason is that logistic regression can be applied to all types of predictive analytics problems, including those with binary and categorical outcome variables, those with continuous outcome variables that are categorized into ordinal categories,53 and those with matched samples in conditional logistic regression models. Logistic regression has even been proposed for use in survival analysis, although it is not yet widely used there.54 Today, logistic regression is applied across most major predictive analytic application areas in fields like epidemiology, biostatistics, biology, economics, sociology, criminology, political science, psychology, linguistics, finance, marketing, human resources, engineering, and most areas of government policy.

Logistic regression can be derived from the much more general maximum entropy method that arises in information theory; this derivation returns the sigmoid form that describes the probability of a response in logistic regression without having to assume such a form up front.55 Given this fundamental connection to information theory, the thesis of this book is that the popularity of logistic regression stems from something very basic about how our brains structure information to understand the world. That is, our brains may produce explicit and implicit neural learning through a stable process that can be understood, through information theory, as a logistic regression process. The brain may use logistic regression here because it is the most natural calculus with which to compute accurate probabilities given binary neural signals. There simply are not many choices for how probabilities of binary signals might be computed, and if we limit ourselves to information theory considerations, then we arrive at logistic regression.
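A sketch of this standard derivation makes the connection concrete. Choose the conditional probabilities that maximize entropy subject only to matching the observed feature moments:

$$
\begin{aligned}
\max_{p}\;& -\sum_{i}\sum_{y\in\{0,1\}} p(y\mid \mathbf{x}_i)\,\ln p(y\mid \mathbf{x}_i)\\
\text{subject to}\;& \sum_{i} p(1\mid \mathbf{x}_i)\,x_{ij} \;=\; \sum_{i} y_i\,x_{ij}, \quad j=1,\dots,m, \qquad \sum_{y} p(y\mid \mathbf{x}_i)=1 .
\end{aligned}
$$

Introducing a Lagrange multiplier $\lambda_j$ for each moment constraint and solving gives

$$
p(1\mid\mathbf{x}) \;=\; \frac{1}{1+\exp\!\big(-\sum_{j}\lambda_j x_j\big)},
$$

which is exactly the logistic sigmoid, with the multipliers playing the role of the regression coefficients; maximizing this constrained entropy is the dual of maximizing the standard logistic likelihood.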

So in spite of standard logistic regression's limitations in handling high-dimension and multicollinearity problems, it still might be that we have discovered the essence of a basic probabilistic learning process that represents information at the most basic binary neural level. The heart of this argument is that any cognitive machine designed to predict and explain our world will make the most sense if it employs the same basic cognitive design principle as our brains. Of course, to buy into this argument one has to accept that our brains produce cognition based upon neural computation mechanisms that are essentially logistic regression, and one major purpose of this book is to lay out this case. Before making this case, though, it is worthwhile to point out that it would be most natural for us to understand our world through artificial intelligence devices that are neuromorphic, that is, designed to simulate the very binary information representation process that happens in real neural ensembles in the brain. And that is what we may have been doing in applying logistic regression to a wide variety of practical and scientific problems since the 1980s, in spite of its shortcomings in dealing with high-dimension multicollinear data.

This book will propose a slight modification to standard logistic regression called Reduced Error Logistic Regression (RELR) that does overcome these problems. RELR changes logistic regression so that those error properties most famously elucidated by Nobel Laureate Daniel McFadden are now explicitly modeled as a component of the regression. Because evidence reviewed in this book suggests that neurons are especially good at learning stable binary signal probabilities with small sample data given high-dimension and/or multicollinear features, RELR has the effect of making logistic regression a much more general and stable method that is proposed to be a closer match to the basic neural computation process. Note that RELR is pronounced as “yeller” with an r or “RELLER”.

The theory of neural computation and cognitive machines that is the basis of RELR is based upon the most stable and agreed upon aspects of information theory. RELR results from adding stability considerations to information theory and associated maximum likelihood theory to get a closer match to what may occur in neural computations. This theory is concerned with stability in all aspects of causal and predictive analytics. This includes stability in regression estimates and feature selection in pure prediction models, stability in regression estimates and feature selection in putative explanatory models, stability in updating the regression estimates in online incremental learning, and stability in regression estimates in causal effect estimates based upon matched sample observation data.

In the end, this book will propose that RELR is a viable solution to Leibniz's long-standing Calculus Ratiocinator problem in the sense that it can simulate stable neural computation and produces automatic maximum probability solutions that are not corrupted by human bias. RELR can be used in all the same applications in which standard logistic regression is used today. Yet, because RELR avoids the multicollinearity problems, this book will argue that RELR will have much wider application because its solutions are much more stable. The slight modification to standard logistic regression that is RELR simply ensures that error in measurements and inferences is effectively taken into account as a part of the basic logistic regression formulation. In the terms of regression modeling theory, RELR is an error-in-variables type of method56 that estimates the error in the independent variables in terms of well-known properties instead of assuming that such error is zero.
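For readers unfamiliar with the term, the generic errors-in-variables setup (stated here in its textbook form, not RELR's specific formulation, which later chapters develop) assumes that each observed predictor is a noisy version of an underlying true value:

$$x_{ij} \;=\; x^{*}_{ij} + u_{ij},$$

where $x^{*}_{ij}$ is the true value and $u_{ij}$ is measurement or inference error. Standard regression implicitly sets $u_{ij}=0$; when that assumption fails, coefficient estimates are distorted. In the simplest classical linear case with a single predictor, the estimate is attenuated toward zero by the reliability ratio $\sigma^{2}_{x^{*}}/(\sigma^{2}_{x^{*}}+\sigma^{2}_{u})$, and with multiple correlated predictors the distortions can go in either direction. Error-in-variables methods model $u_{ij}$ rather than ignoring it.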

The newer penalized forms of logistic regression also attempt to correct these multicollinearity deficiencies, although they are not error-in-variables methods because, unlike RELR, they do not estimate the error in the independent variables. Instead, LASSO and Ridge logistic regression and other similar methods employ penalty functions that smooth or “regularize” error and have the effect of shrinking regression coefficients to smaller magnitudes than would be observed in standard logistic regression.57 These shrinkage methods may require that a modeler use cross-validation through an independent validation sample to determine the degree of shrinkage that gives the best model, or they might even employ a predetermined definition of “best model” without any cross-validation. In any case, the “best model” will be arbitrary unless it is the maximum probability model, which none of these penalized methods are designed to estimate. Yet, the most fundamental problem with the penalized methods is that there is no agreement on what the penalizing function should be: LASSO uses a linear (absolute value) penalty, Ridge uses a quadratic penalty, Elastic Net combines both, and there are an endless number of other possibilities. These different penalizing functions usually give radically different looking models.58
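A small hedged sketch shows this on synthetic, deliberately multicollinear data with arbitrary penalty strengths; the only point is that the choice of penalty function alone reshapes the fitted coefficients.

```python
# Sketch: the same correlated data fit with three different penalty functions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p = 400, 10
latent = rng.normal(size=(n, 1))
X = 0.8 * latent + 0.6 * rng.normal(size=(n, p))      # strongly correlated inputs
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X[:, 0])))

fits = {
    "LASSO (L1)":  LogisticRegression(penalty="l1", solver="saga", C=0.5, max_iter=5000),
    "Ridge (L2)":  LogisticRegression(penalty="l2", solver="saga", C=0.5, max_iter=5000),
    "Elastic Net": LogisticRegression(penalty="elasticnet", solver="saga",
                                      l1_ratio=0.5, C=0.5, max_iter=5000),
}
for name, model in fits.items():
    model.fit(X, y)
    print(name, np.round(model.coef_.ravel(), 2))       # coefficient patterns differ
```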

Unlike the arbitrary penalized forms of logistic regression, RELR does not require any arbitrary or subjective choices on the part of the model builder in terms of error parameters, error functional forms, or “best model” definitions. Rather than leaving the effect of error to chance, as in standard logistic regression where it severely degrades the model's ability to predict and explain with high-dimension or multicollinear data, RELR explicitly models this error and largely removes it. The end result is that RELR gives reliable and valid predictions and explanations in exactly those error-prone situations, with small observation samples and/or multicollinear high-dimension data, where standard logistic regression fails miserably and may not even converge.

RELR can generate sparse models with relatively few features or more complex models with many redundant features. Implicit RELR is a diffuse method that usually returns many redundant features in high entropy models.59 Explicit RELR is a parsimonious feature selection method that returns much lower entropy models because of conditions that favor sparse feature sets.60 Both Implicit and Explicit RELR are maximum likelihood methods, but their maximum likelihood objective functions differ depending on whether diffuse or sparse feature selection is the goal. Much of this book, starting with the next chapter, explores the statistical basis of Implicit and Explicit RELR in maximum likelihood and maximum entropy estimation theory, along with the unique applications of each of these RELR methods and their putative analogues in neural computation.

RELR avoids the “no free lunch” problem because it changes the goal of machine learning. Instead of optimizing an arbitrary classification accuracy or cost function, as in other predictive analytic methods, the objective in both Explicit and Implicit RELR is defined through maximum likelihood criteria in which both outcome and error event probabilities are included in the model. It is recognized that because RELR solutions are the most probable solutions under the maximum likelihood assumptions, they will not always be the most accurate in every situation; there is always the possibility that a less probable model will be more accurate in a given situation. Yet, the available evidence indicates that RELR models are accurate and reliable, generalize well, and can be generated automatically without human bias. This is all that can be demanded of any machine learning solution.

Although Explicit RELR's feature selection learning may run very rapidly on today's massively parallel computers, it is still not a process that could ever be realized in real time. This is because it requires feedback to guide its explanatory feature selection, and this feedback is the small fraction of its processing that cannot be implemented in parallel. On the other hand, Implicit RELR can generate rapid feed-forward learning that is not quite real time but still very fast in embarrassingly parallel Big Data implementations that require no feedback. In this sense of the relative speed of processing and the extent to which feedback determines each step of the process, this book will propose that Explicit RELR is similar to slow, deep, explicit memory learning, whereas Implicit RELR is similar to fast, shallow, implicit memory learning in neural computation.
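As a rough illustration of the “embarrassingly parallel, no feedback” point, the sketch below scores independent partitions of a data set with the same fixed coefficients and simply concatenates the results; this is a generic data-parallel scoring sketch under assumed, hypothetical coefficients, not the actual Implicit RELR algorithm, but it shows why a pure feed-forward computation never needs one step to wait on another.

# Generic data-parallel scoring sketch: each chunk is scored with the same
# fixed coefficients, and no chunk's result feeds back into any other's.
import numpy as np
from concurrent.futures import ProcessPoolExecutor

BETA = np.array([0.8, -1.2, 0.3])            # hypothetical fixed coefficients

def score_chunk(chunk):
    """Feed-forward logistic scoring of one data partition."""
    return 1.0 / (1.0 + np.exp(-chunk @ BETA))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100_000, 3))        # stand-in for a large data set
    chunks = np.array_split(X, 8)            # partitions could live on 8 machines
    with ProcessPoolExecutor() as pool:
        scores = np.concatenate(list(pool.map(score_chunk, chunks)))
    print(scores[:5])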

So we will learn that the causal explanatory feature selection process guided by Explicit RELR is clearly slower than the rapid, purely predictive Implicit RELR, and this contrast is also the hallmark of explanatory reasoning compared with predictive intuition. Although Explicit RELR's most probable explanatory models are produced automatically in the computational trajectory that leads to an optimal solution, we will also see that their putative causal features require independent experimental testing. This is because the causal explanations produced are at best putative, most probable causal descriptions that clearly can be falsified with further and better data.

Explicit RELR's hypothesized explanations can be tested with randomized controlled experimental methods or with quasi-experimental methods that match observation samples to adequately test the validity of putative causal factors. However, quasi-experimental propensity score matching methods also suffer from many of the same estimation problems that have plagued variable selection in predictive analytics, because they, too, typically employ the same problematic and unstable standard stepwise logistic regression or decision tree methods. In this regard, we will see that RELR gives rise to a new matched sampling quasi-experimental method that is not a propensity score matching method but rather an outcome score matching method. RELR's outcome score matching quasi-experimental method holds all other factors to their expected outcome probability while varying the putative causal factor. Since the human brain does not perform randomized controlled experiments but obviously can discover causes, this book argues that a similar matched sample method could be the basis of the brain's causal reasoning. This quasi-experimental outcome score matching is an important aspect of RELR's Calculus Ratiocinator cognitive machine methodology.
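For readers unfamiliar with matched sample quasi-experiments, the sketch below shows the generic score-based matching step: fit a score model, then pair each treated record with the nearest-scored control. Everything here, including the plain logistic score model and the greedy nearest-neighbor pairing, is a simplified stand-in rather than RELR's method; in propensity score matching the score models the probability of treatment, whereas RELR's outcome score matching instead matches on the expected outcome probability implied by all factors other than the putative cause.

# Generic score-based matching sketch (a simplified stand-in, not RELR).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 4))                       # background covariates
treated = rng.binomial(1, 0.4, size=n).astype(bool)

# Score model: here, the propensity of treatment given the covariates.
score_model = LogisticRegression(max_iter=1000).fit(X, treated)
score = score_model.predict_proba(X)[:, 1]

# Greedy one-to-one nearest-neighbor matching on the score.
pairs, available = [], set(np.flatnonzero(~treated))
for t in np.flatnonzero(treated):
    if not available:
        break
    c = min(available, key=lambda j: abs(score[j] - score[t]))
    pairs.append((t, c))
    available.remove(c)

print(f"matched {len(pairs)} treated/control pairs")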

One view of the origin of a science is that it begins with technical innovation in a craft that has large practical application. This view holds that it is the attempt to make theoretical sense of such technological innovation that generates scientific principles, rather than vice versa. This logical positivist view was most famously held by the Austrian physicist Ernst Mach, who doubted the existence of atoms,61 and by the twentieth-century behavioral psychologist B.F. Skinner, who believed that psychology should develop without any attempt to understand cognition.62 Mach argued in the Science of Mechanics that our understanding of the concept of force came from our understanding of the effect of levers, so it is the practical invention that guides theoretical science rather than vice versa.63 People who work in predictive analytics, cognitive science, and artificial intelligence tend to be much more the craftsperson than the theoretical scientist, so there is almost no underlying theory in this evolving field. Indeed, standard logistic regression with all of its limitations has had significant practical application in the past 30 years, but there has been no fundamental theory that explains why it fails so miserably with multicollinearity and how to correct this limitation. In fact, RELR grew out of my years of business and science analytics practice with no real guiding theory except the attempt to understand this multicollinearity limitation in standard logistic regression and other predictive analytic methods.

Without underlying theory, predictive analytics is in danger of being viewed somewhat like alchemy: a collection of arbitrary techniques that sometimes produce an interesting result, but which also sometimes carry significant risk of harm. Just as medieval alchemists in China accidentally discovered gunpowder64 while seeking an elixir that would grant potency and immortality, predictive analytics today can also be accidental and random and produce risky results that are opposite to our intentions, as in the 2008 United States financial crisis. Just as with alchemy, our problem is that we have no theory to guide our quests. Alchemy's methods to turn base metals into gold were ultimately shown to be inconsistent with atomic theory. Similarly, Mach's extreme avoidance of atomic theory was proven to be impractical, as one cannot imagine the development of chemistry, physics, chemical engineering, biochemistry, electrical engineering, or even neuroscience without the concept of atoms.

In today's predictive modeling applications, it is very difficult to describe sophisticated approaches to prediction and explanation without using the neuromorphic term “learning”. Whether it is called statistical learning or machine learning, we are modeling a high-dimension, multicollinear computation process that, if stable and accurate, is likely similar to what occurs in neurons. So, the neuron is the fundamental structure here in much the same way that the atom is the fundamental structure in physical science. Thus, the theory of RELR that emerges as a basis for Calculus Ratiocinator cognitive machines is a view grounded in neuroscience. The crux of this theory is that the same maximum entropy concept that Boltzmann derived to characterize the maximum probability behavior of molecules also characterizes the logistic regression model of neural computation.
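To make this connection explicit, the textbook maximum entropy derivation (given here as background, not as RELR's specific formulation) runs as follows: maximize the entropy of a probability distribution subject to normalization and moment constraints,

$$ \max_{p}\; -\sum_{j} p_j \ln p_j \quad \text{subject to} \quad \sum_j p_j = 1, \qquad \sum_j p_j\, f_i(j) = \bar{f}_i , $$

whose Lagrangian solution is the exponential form

$$ p_j = \frac{\exp\!\big(\sum_i \lambda_i f_i(j)\big)}{\sum_k \exp\!\big(\sum_i \lambda_i f_i(k)\big)} , $$

which is Boltzmann's maximum probability distribution; for a binary outcome $y$ conditioned on features $\mathbf{x}$, the same form reduces to the logistic regression model

$$ P(y = 1 \mid \mathbf{x}) = \frac{1}{1 + e^{-\mathbf{x}^{T}\boldsymbol{\beta}}} . $$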
