Notes and References

Preface

1. Rice, D.M. and Hagstrom, E.C. “Some evidence in support of a relationship between human auditory signal-detection performance and the phase of the alpha cycle”, Perceptual and Motor Skills 69 (1989): 451–457.

2. Rice, D.M., Buchsbaum, M.S., Starr, A., Auslander, L., Hagman, J. and Evans, W.J. “Abnormal EEG slow activity in left temporal areas in senile dementia of the Alzheimer type”, Journals of Gerontology: Medical Sciences 45(4) (1990): 145–151. The critical refining aspect was the use of the average recording reference, which gave a purer view of the potential field and the abnormality in the temporal lobes.

3. Rice, D.M., Buchsbaum, M.S., Hardy, D. and Burgwald, L. “Focal left temporal slow EEG activity is related to a verbal recent memory deficit in a nondemented elderly population”, Journals of Gerontology: Psychological Sciences 46(4) (1991): 144–151.

4. Scoville, W.B. and Milner, B. “Loss of recent memory after bilateral hippocampal lesions”, Journal of Neurology, Neurosurgery and Psychiatry 20(1) (1957): 11–21.

5. Boon, M.E., Melis, R.J.F., Rikkert, M.G.M.O. and Kessels, R.P.C. “Atrophy in the medial temporal lobe is specifically associated with encoding and storage of verbal information in MCI and Alzheimer patients”, Journal of Neurology Research 1(1) (2011): 11–15.

6. Henneman, W.J.P., Sluimer, J.D., Barnes, J., van der Flier, W.M., Sluimer, I.C, Fox, N.C., Scheltens, P., Vrenken, H. and Barkhof, F. “Hippocampal atrophy rates in Alzheimer disease: added value over whole brain volume measures”, Neurology 72 (2009): 999–1007.

7. Duara, R., Loewenstein, D.A., Potter, E., Appel, J., Greig, M.T., Urs, R., Shen, Q., Raj, A., Small, B., Barker, W., Schofield, E., Wu, Y. and Potter, H. “Medial temporal lobe atrophy on MRI scans and the diagnosis of Alzheimer disease”, Neurology 71(24) (2008): 1986–1992.

8. McKhann, G.M., Knopman, D.S., Chertkow, H., Hyman, B.T., Jack, C.R. Jr., Kawas, C.H., Klunk, W.E., Koroshetz, W.J., Manly, J.J., Mayeux, R., Mohs, R.C., Morris, J.C., Rossor, M.N., Scheltens, P., Carrillo, M.C., Thies, B., Weintraub, S. and Phelps, C.H. “The diagnosis of dementia due to Alzheimer's disease: recommendations from the National Institute on Aging-Alzheimer's Association workgroups on diagnostic guidelines for Alzheimer's Disease”, Alzheimer's and Dementia 7(3) (2011): 263–269.

9. Kuhn, T.S. The Structure of Scientific Revolutions (1st ed. 1962, 4th ed. 2012: Chicago, University of Chicago Press).

10. Marchione, M. “Alzheimer drug shows some promise in mild disease”, Bloomberg Businessweek, October 8, 2012.

11. Khosla, D., Singh, M. and Rice, D.M. “Three dimensional EEG source imaging via the maximum entropy method”, IEEE Nuclear Science Symposium and Medical Imaging Record 3 (1995): 1515–1519.

Chapter 1

1. Leibniz, G. The Art of Discovery (1685); In Wiener, P., Leibniz: Selections (1951: Scribner).

2. Bardi, J.S. The Calculus Wars: Newton, Leibniz and the Greatest Mathematical Clash of all Time (2006: New York, Avalon Group/Thunder's Mouth Press).

3. Ibid.

4. Ibid. The word “calculus” is derived from the Latin word for a small stone that was used to make calculations. As Bardi points out, this was Leibniz's own invented term, which Newton did not use. The word “ratio” has origins in the Latin word for reason, and according to Merriam-Webster (http://www.merriam-webster.com/dictionary/ratiocination), the noun “ratiocination” means the process of exact thinking or a reasoned train of thought. So a Calculus Ratiocinator might be taken to be a device that produces exact thinking through computation. Ratiocinator is pronounced by combining the words “ratio”, “sin”, “ate” and “or”.

5. Leibniz, G. On the Art of Combination (1666); In Parkinson, G.H.R., Logical Papers (1966: Oxford, Clarendon Press).

6. Leibniz, The Art of Discovery, op. cit.

7. Today's widely used predictive analytics methods include Support Vector Machines, Artificial Neural Networks, Decision Trees, Standard Logistic and Ordinary Least Squares Regression, Genetic Programming, Random Forests, Ridge, LASSO, and LARS Regression, and many others, as reviewed through the course of this book.

8. Public domain image: http://en.wikipedia.org/wiki/File:Gottfried_Wilhelm_von_Leibniz.jpg.

9. Dejong, P. “Most data isn't big and businesses are wasting money pretending it is”, Quartz, May 6, 2013, http://qz.com/81661/most-data-isnt-big-and-businesses-are-wasting-money-pretending-it-is/.

10. Wolpert, D.H. “The Lack of A Priori Distinctions between Learning Algorithms”, Neural Computation 8(7) (1996): 1341–1390.

11. Wolpert, D.H. and Macready, W.G. “No Free Lunch Theorems for Optimization”, IEEE Transactions on Evolutionary Computation 1(1) (1997): 67–82.

12. Nelder, J.A. “Functional Marginality and Response Surface Fitting”, Journal of Applied Statistics 27 (2000): 109–112. This scaling problem was passionately described by Nelder throughout his career, and this referenced article is just one example.

13. Standardized variables with a mean of 0 and a standard deviation of 1 are orthogonal when they are uncorrelated.
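As a minimal numeric sketch of this point (illustrative values only, not from the book): two standardized variables that are uncorrelated have a zero inner product, which is the definition of orthogonality.

```python
# Two illustrative variables that are uncorrelated by construction
x = [1.0, -1.0, 1.0, -1.0]
y = [1.0, 1.0, -1.0, -1.0]

n = len(x)

def standardize(v):
    """Rescale to mean 0 and (population) standard deviation 1."""
    mean = sum(v) / len(v)
    sd = (sum((vi - mean) ** 2 for vi in v) / len(v)) ** 0.5
    return [(vi - mean) / sd for vi in v]

zx, zy = standardize(x), standardize(y)

# For standardized variables, the correlation is just the average product,
# so zero correlation implies a zero inner product (orthogonality)
corr = sum(a * b for a, b in zip(zx, zy)) / n
dot = sum(a * b for a, b in zip(zx, zy))
print(corr, dot)   # both 0
```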

14. Breiman, L. “The Little Bootstrap and Other Methods for Dimensionality Detection in Regression: X-Fixed Prediction Error”, Journal of the American Statistical Association 87 (1992): 738–754.

15. Austin, P.C. and Tu, J.V. “Automated Variable Selection Methods for Logistic Regression Produced Unstable Models for Predicting Acute Myocardial Infarction Mortality”, Journal of Clinical Epidemiology 57 (2004): 1138–1146.

16. The Akaike Information Criterion and Bayesian Information Criterion are two such parsimony criteria commonly used in logistic regression.
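As a hedged sketch of these two criteria (standard textbook definitions; the numeric values below are illustrative, not from any cited study): both penalize the maximized log likelihood by the number of parameters k, with the BIC penalty growing with the sample size n.

```python
import math

def aic(log_likelihood, k):
    """Akaike Information Criterion: 2k - 2*logL (lower is better)."""
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n):
    """Bayesian Information Criterion: k*ln(n) - 2*logL (lower is better)."""
    return k * math.log(n) - 2 * log_likelihood

# Illustrative values: maximized logL = -120 with k = 5 parameters, n = 200 cases
a = aic(-120.0, 5)        # 250.0
b = bic(-120.0, 5, 200)   # about 266.5
print(a, b)
```

Because ln(n) exceeds 2 once n > 7, BIC penalizes added parameters more heavily than AIC in all but tiny samples, which is why it tends to select sparser models.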

17. Wickham, H. “Exploratory Model Analysis”, JSM Proceedings 2007; available online at http://had.co.nz/model-vis/2007-jsm.pdf.

18. Austin and Tu, op. cit. There was only a very slight improvement in the reliability of variable selections when experts also helped to select variables.

19. Shi, L. et al. “The MicroArray Quality Control (MAQC)-II Study of Common Practices for the Development and Validation of Microarray-Based Predictive Models”, Nature Biotechnology 28(8) (2010): 827–838.

20. Sill, J. et al. “Feature-Weighted Linear Stacking”, 2009; available online at www.arxiv.org.

21. Some of these problems are due to a tendency to give biased estimates when data are of different types, such as categorical and continuous variables. See Strobl, C. et al. “Bias in Random Forest Variable Importance Measures: Illustrations, Sources and a Solution”, BMC Bioinformatics 8 (2007): 25. Variables ranked as important are also more likely to be correlated predictors. The proposed modification of the original Random Forest method helps somewhat, but it does not eliminate this problem with multicollinear variables. See Strobl, C. et al. “Conditional Variable Importance for Random Forests”, BMC Bioinformatics 9 (2008): 307.

22. Seni, G. and Elder, J. Ensemble Methods in Data Mining: Improving Accuracy through Combining Predictions, (2010: Morgan and Claypool Publishers).

23. Miller, G.A. “The Magical Number Seven, Plus or Minus Two: Some Limits on Our Capacity for Processing Information”, Psychological Review 63 (1956): 81–97. It might be more accurate to say that these small-number limitations apply to conscious objects, as a given object might be made from a larger number of features. Still, the brain appears able to hold only a limited set of features in consciousness.

24. Dew, I.T. and Cabeza, R. “The Porous Boundaries between Explicit and Implicit Memory: Behavioral and Neural Evidence”, Annals of the New York Academy of Sciences 1224 (2011): 174–190.

25. Baddeley, A.D. and Hitch, G.J. “Working Memory”, In Bower, G.A. (Ed.), The Psychology of Learning and Motivation: Advances in Research and Theory, Vol. 8, pp. 47–89, (1974: New York, Academic Press). Baddeley, A.D. “The Episodic Buffer: A New Component of Working Memory?” Trends in Cognitive Sciences 4(11) (2000): 417–423.

26. Fell, J. and Axmacher, N. “The Role of Phase Synchronization in Memory Processes”, Nature Reviews Neuroscience 12 (2011):105–118.

27. Craik, F.I.M. and Tulving, E. “Depth of Processing and the Retention of Words in Episodic Memory”, Journal of Experimental Psychology: General 104(3) (1975): 268–294.

28. Warrington, E.K. and Weiskrantz, L. “Amnesic Syndrome: Consolidation or Retrieval?” Nature 228 (1970): 628–630.

29. Kahneman, D. Thinking, Fast and Slow (2011: New York, Farrar, Straus and Giroux).

30. Duhigg, C. The Power of Habit (2012: New York, Random House).

31. Stefanacci, L., Buffalo, E.A., Schmolck, H. and Squire, L.R. “Profound Amnesia after Damage to the Medial Temporal Lobe: A Neuroanatomical and Neuropsychological Profile of Patient E.P.”, Journal of Neuroscience 20(18) (2000): 7024–7036.

32. Graybiel, A.M. “The Basal Ganglia and Chunking of Action Repertoires”, Neurobiology of Learning and Memory 70 (1998): 119–136.

33. In the basal ganglia, this predictive model learning occurs in parallel at each projection stage across a large population of basal ganglia neurons, and this circuitry also includes local feed-forward interneuron connections. See Graybiel, A.M. “Network-Level Neuroplasticity in Cortico-Basal Ganglia Pathways”, Parkinsonism and Related Disorders 10 (2004): 293–296.

34. Dew and Cabeza, Annals of the New York Academy of Sciences, op. cit.

35. Unlike more explicit memories, familiarity memory does not appear to be disrupted by selective hippocampus injury, even though select human hippocampus neurons actually show rapid old–new familiarity discriminations; these responses are probably too fast to support conscious recollection. See Rutishauser, U., Mamelak, A.N. and Schuman, E.M. “Single-Trial Learning of Novel Stimuli by Individual Neurons of the Human Hippocampus–Amygdala Complex”, Neuron 49(6) (2006): 805–813.

36. Many cognitive neuroscientists believe that familiarity shares an underlying mechanism with repetition priming, because studies of the temporal course of the brain's electric activity during recognition memory tasks suggest that both operate in the same rapid response time frame and are difficult to dissociate. An alternative view is that these recognition memory processes simply operate concurrently. See Dew and Cabeza, Annals of the New York Academy of Sciences, op. cit.

37. An interesting side note is that even though stacked ensemble models built from hundreds of submodels were most accurate in the Netflix competition, a linear ensemble blend of just two submodels, one built from the Singular Value Decomposition (SVD) method and the other built from the Restricted Boltzmann Machines method, was accurate enough and far simpler to implement. So this far simpler linear ensemble blend of just two submodels was put into production in the recommendation system at Netflix. See Amatriain, X. and Basilico, J. “Netflix Recommendations: Beyond the 5 Stars (Part 1)”, Netflix Tech Blog, April 6, 2012.

38. Breiman, L. “Statistical Modeling: The Two Cultures”, Statistical Science 16(3) (2001): 199–231.

39. Shmueli, G. “To Explain or To Predict?” Statistical Science 25(3) (2010): 289–310.

40. Derman, E. Models Behaving Badly: Why Confusing Illusion with Reality Can Lead to Disaster, on Wall Street and in Life (2011: New York, Free Press).

41. Buzsáki, G. Rhythms of the Brain (2006: Oxford, Oxford University Press).

42. Buzsáki, Rhythms of the Brain, op. cit. See p. 297 for discussion of how only a small fraction of cells in the hippocampus fire in any given place in an environment, and pp. 284–287 for discussion of feedback in the medial temporal lobe and hippocampus, along with the entorhinal reciprocal connection to neocortex.

43. Karlsson, M.P. and Frank, L.M. “Network Dynamics Underlying the Formation of Sparse, Informative Representations in the Hippocampus”, Journal of Neuroscience 28(52) (2008): 14271–14281.

44. There are important exceptions to the rule that humans with normal cognitive ability can all agree on shared episodic experiences. These exceptions involve more remote experiences in which suggestion is also involved in creating false memories. Childhood memories are especially susceptible to these false suggestion effects. See Loftus, E. “Creating False Memories”, Scientific American 277 (1997): 70–75.

45. Buckner, R.L. and Carroll, D.C. “Self-Projection and the Brain”, Trends in Cognitive Sciences 11(2) (2006): 49–57.

46. Lehn, H., Steffenach, H.A., van Strien, N.M., Veltman, D.J., Witter, M.P. and Håberg, A.K. “A Specific Role of the Human Hippocampus in Recall of Temporal Sequences”, The Journal of Neuroscience 29(11) (2009): 3475–3484.

47. It has also been pointed out that it is misleading to characterize medial temporal lobe patients as having conscious recollection problems because they can clearly recollect remote experiences. They also show problems with tasks where stimuli relations are learned unconsciously, so their memory problems may be better characterized as problems in remembering relations between stimuli. In any case, the hippocampus does seem necessary to experience consciously recollected relations between recent events. See Eichenbaum, H. and Cohen, N.J. From Conditioning to Conscious Recollection: Memory Systems of the Brain (2004: Oxford, Oxford University Press).

48. The 2011 Data Mining Survey by Rexer Analytics concluded that while a wide variety of methods are used in predictive analytics, regression and decision tree methods are used most frequently. See http://www.rexeranalytics.com.

49. See the discussion of this rise in popularity in the preface of Hosmer, D.W. and Lemeshow, S. Applied Logistic Regression, 2nd Edition (2000: New York, Wiley). They mention an “explosion” in the usage of logistic regression over the period from the late 1980s, when their first edition came out, through 2000, when their second edition appeared. Every indication is that this rapid rise in popularity has continued.

50. Peng, C.Y.J. and So, T.S.H. “Logistic Regression Analysis and Reporting: A Primer”, Understanding Statistics 1(1) (2002): 31–70.

51. See Aldrich, J.H. and Nelson, F.D. Linear Probability, Logit, and Probit Models, (1984: London, Sage Publications).

52. Ibid.

53. Hosmer, D.W. and Lemeshow, S. Applied Logistic Regression, (2000: New York, Wiley).

54. Allison, P.D. “Discrete-Time Methods for the Analysis of Event Histories”, Sociological Methodology 13 (1982): 61–98. Chapter 3 will detail some of the nuances of using logistic regression in survival analysis, and why it has not been widely used in this case.

55. Mount, J. “The Equivalence of Logistic Regression and Maximum Entropy Models”, online at http://www.win-vector.com/dfiles/LogisticRegressionMaxEnt.pdf.

56. There are many reasons reviewed through this book that the error-in-variables modeling in logistic regression is a much simpler problem than the linear regression case, even though linear regression has received far more study. For an overview of this method applied to linear regression see Gillard, J. “An Overview of Linear Structural Models in Errors in Variables Regression”, Revstat – Statistical Journal 8(1) (2010): 57–80.

57. Hastie, T., Tibshirani, R. and Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd Edition, 5th printing (2011: New York, Springer), online publication.

58. For a concise summary of how modelers typically attempt to overcome some of the problems that arise with the LASSO, see the comment by blogger Karl R. (March 18, 2013, 10:55 am) on Andrew Gelman's blog post “Tibshirani Announces New Research Result: A Significance Test for the LASSO”, http://andrewgelman.com/2013/03/18/tibshirani-announces-new-research-result-a-significance-test-for-the-lasso/. The most common remedy is simply to decorrelate the independent variables so that they are no longer multicollinear; standard approaches are to drop variables or to average them through methods like Principal Components Analysis. Because one has to remove multicollinearity to make optimal use of the LASSO, the LASSO is not a very viable solution to multicollinearity error problems. The Elastic Net method, which combines LASSO and Ridge Regression at the added cost of returning many more nonzero variables than the LASSO, might be a more viable way to handle multicollinearity error problems, although a study reviewed later in the book suggests that methods closer to RELR are more effective: Haury, A.-C., Gestraud, P. and Vert, J.-P. “The Influence of Feature Selection Methods on Accuracy, Stability and Interpretability of Molecular Signatures”, PLoS One 6(12) (2011): e28210. http://dx.doi.org/10.1371/journal.pone.0028210.

59. This is based upon the maximum entropy subject to linear constraints method first used by Boltzmann in the original ensemble modeling in the nineteenth century; maximum entropy subject to constraints and maximum likelihood solutions are identical in logistic regression. Note that if one assumes that some sort of averaging of elementary models is required to qualify as an ensemble method, then technically Implicit RELR is not an ensemble method because it returns a solution across many redundant features automatically. However, this Implicit RELR solution is actually the most probable weighted average that would be obtained by averaging many simpler sparse solutions, so it returns a solution that would have been formed through optimal weighted averaging of elementary models. RELR's Implicit “ensemble” method is thus similar to the usage of the term ensemble in statistical mechanics, where maximum entropy modeling describes the most likely ensemble of microstates, corresponding to possible molecular configurations, in terms of macro feature constraints such as temperature, pressure and energy.

60. In the commercial software implementation of RELR as a SAS language macro, Explicit RELR is currently called Parsed RELR and Implicit RELR is called Fullbest RELR. The generic explicit and implicit terms are used throughout the book because they are more descriptive and more general.

61. Mach, E. “The Guiding Principles of My Scientific Theory of Knowledge and Its Reception by My Contemporaries”, In Physical Reality, Edited by Toulmin, S. (1970: New York, Harper Torchbooks).

62. Skinner, B.F. About Behaviorism (1970: New York: Vintage).

63. Mach, E. The Science of Mechanics: A Critical and Historical Account of its Development (1919: Chicago, Open Court Publishing), originally published in 1883.

64. Kelly, J. Gunpowder: Alchemy, Bombards, and Pyrotechnics: The History of the Explosive that Changed the World (2005: New York, Perseus Books Group).

Chapter 2

1. Boltzmann, L. from Populäre Schriften, Essay 3. Address to a Formal meeting of the Imperial Academy of Science, May 29, 1886. In Brian McGuinness (ed.), Ludwig Boltzmann: Theoretical Physics and Philosophical Problems, Selected Writings (1974: New York, Springer).

2. Perrin, J.B. “Discontinuous Structure of Matter”, Lecture given on December 11, 1926, Nobel Lectures, Physics 1922–1941, (1965: Amsterdam, Elsevier).

3. As reviewed by Frigg and Werndl, it was eventually pointed out that Boltzmann's claim that systems naturally evolve toward higher entropy holds only under further ergodic assumptions, and even then the system would only stay close to its maximum entropy value most of the time. In thermodynamics, ergodicity can be roughly taken to mean that the system has the same behavior when averaged over time as when averaged over space, so that particles will have an equal probability of being in all possible position and momentum states over long periods of time. Frigg, R. and Werndl, C. “Entropy—A Guide for the Perplexed”. In Probabilities in Physics, Edited by Beisbart, C. and Hartmann, S. (2010: Oxford, Oxford University Press).

4. Shannon, C. and Weaver, W. The Mathematical Theory of Communication, (1949: Urbana, University of Illinois Press). The Shannon expression carries a minus sign that Boltzmann's entropy does not because it is written in terms of probabilities, whose logarithms are negative. When all probabilities p(j) are equal to a constant, then p(j) = 1/W for W possible states and Shannon's expression reduces to logW, which is Boltzmann's entropy with the scaling constant k = 1. Shannon's expression is more general than Boltzmann's because it does not require an arbitrary scaling constant specific to a given statistical mechanics implementation. The base e logarithm form is used in this book because it gives solutions equivalent to those of maximum likelihood logistic regression.
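A small sketch of the equal-probability case (a standard result, not a passage from Shannon's text): for W equally likely states, the base-e Shannon entropy equals ln W, matching Boltzmann's S = k ln W with k = 1.

```python
import math

W = 8                      # number of equally likely states
p = [1.0 / W] * W          # uniform probability distribution

# Shannon entropy in base e: H = -sum over j of p(j) * ln p(j)
H = -sum(pj * math.log(pj) for pj in p)

print(H, math.log(W))      # both equal ln 8
```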

5. Jaynes, E. “Information theory and statistical mechanics”, Physical Review 106 (1957): 620–630.

6. This Jaynes Maximum Entropy Principle has not been without controversy. See Howson, C. and Urbach, P. Scientific Reasoning: The Bayesian Approach, (2006: La Salle, Open Court). They critique the Jaynes notion that objective prior probability can be obtained by maximizing entropy only on the basis of historical data and known constraints. Their critique centers on problems with continuous information entropy definitions, which may not entirely apply to the discrete Shannon entropy that is the basis of the logistic regression in this book. These authors are subjective Bayesians as opposed to Jaynes' objective Bayesian approach, and they view probability as a measure of degree of belief. This distinction will be discussed in detail in the next chapter.

7. Lucas, K. and Roosen, P. “Transdisciplinary Fundamentals” In Emergence, Analysis and Evolution of Structures: Concepts and Strategies Across Disciplines. Edited by Lucas, K. and Roosen, P. (2010, Berlin-Heidelberg, Springer-Verlag): 5–73.

8. Public domain image: http://en.wikipedia.org/wiki/File:Bentley_Snowflake9.jpg.

9. The Hodgkin and Huxley model uses an equation that describes how the concentration gradient in the distribution of a charged molecule across a divide such as a cell membrane is related to an electrostatic potential gradient. This effect can be understood as a mechanism that maximizes entropy subject to constraints in accordance with the second law of thermodynamics. This idea is basic to all concepts of how graded potentials arise across cell membranes, along with what causes ions to move across synaptic gates that temporarily open to produce ionic currents and corresponding graded potential changes that ultimately influence the probability of spiking at the soma as reviewed in the next chapter. See Hodgkin, A.L. and Huxley, A.F. “A Quantitative Description of Membrane Current and its Application to Conduction and Excitation in Nerve”, Journal of Physiology 117 (1952): 500–544.
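The concentration-to-voltage relation described in this note is the Nernst equilibrium potential. The following is a sketch using illustrative potassium concentrations (values assumed for illustration, not taken from the cited paper):

```python
import math

R = 8.314       # gas constant, J/(mol*K)
F = 96485.0     # Faraday constant, C/mol
T = 310.0       # body temperature, K
z = 1           # valence of K+

# Illustrative K+ concentrations (mM): outside vs inside the cell
c_out, c_in = 5.0, 140.0

# Nernst equilibrium potential: the voltage at which the electrostatic
# gradient exactly balances the concentration gradient
E = (R * T) / (z * F) * math.log(c_out / c_in)   # volts

print(E * 1000)   # roughly -89 mV, near the typical K+ reversal potential
```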

10. Golan, A., Judge, G. and Perloff, J.M. “A Maximum Entropy Approach to Recovering Information from Multinomial Response Data”, Journal of the American Statistical Association 91 (1996): 841–853.

11. Allison, P.D. “Convergence Failures in Logistic Regression”, SAS Global Forum, Paper 360, (2008): 1–11.

12. Hosmer, D.W. and Lemeshow, S. Applied Logistic Regression, op. cit.

13. Aldrich, J.H. and Nelson, F.D. Linear Probability, Logit, and Probit Models, (1984: London, Sage Publications).

14. Mount, J. “The Equivalence of Logistic Regression and Maximum Entropy Models”, op. cit.

15. Egan, A. “Some Counterexamples to Causal Decision Theory”, The Philosophical Review 116(1) (2007): 93–114.

16. McFadden, D. “Conditional Logit Analysis of Qualitative Choice Behavior”, In Zarembka, P. (ed.), Frontiers in Econometrics (1974: New York, Academic Press): 105–142.

17. Train, K. Discrete Choice Methods with Simulation, (2nd Edition 2009: Cambridge, Cambridge University Press).

18. Ibid.

19. Ibid.

20. Hosmer, D.W. and Lemeshow, S. Applied Logistic Regression, op. cit.

21. Wall, M.M., Dai, Y. and Eberly, L.E. “GEE Estimation of a Misspecified Time-Varying Covariate: An Example with the Effect of Alcoholism on Medical Utilization”, Statistics in Medicine 24(6) (2005): 925–939.

22. These are not really random effects but rather random intercepts, although the term “random effects” is still widely applied to this case. On the other hand, the term “random coefficient” is often applied when coefficients in regression effects are also modeled to vary randomly across the clustering units.

23. One recently proposed alternative, implemented in linear regression, is to force through design the controversial random effects assumption that the random error effects across the clustering units are uncorrelated with the nested lower-level explanatory variable effects. See Bartels, B.L., “Beyond ‘Fixed versus Random Effects’: A Framework for Improving Substantive and Statistical Analysis of Panel, Time-Series Cross-Sectional, and Multilevel Data”, online publication originally presented at the Faculty Poster Session, Political Methodology Conference, Ann Arbor, MI, July 9–12, 2008. This avoids explanatory regression coefficients that are biased by such correlation, but it still may not be appropriate depending upon whether the random effects assumption is reasonable to begin with. In some cases, it may be reasonable to believe that the unobserved factors in the clustering units correlate with the explanatory variables, so the random effects assumption is simply not appropriate in those cases.

24. Train, K. Discrete Choice Methods with Simulation, op. cit.

25. Hensher, D.A. and Greene, W.H. “The Mixed Logit Model: The State of Practice and Warnings for the Unwary”, online paper.

26. Ibid.

27. Ibid.

28. Hosmer, D.W. and Lemeshow, S. Applied Logistic Regression, op. cit.

29. Gillard, J. “An Overview of Linear Structural Models in Errors in Variables Regression”, op. cit.

30. For example, see review by Rhonda Robertson Clark, “The Error-in-Variables Problem in the Logistic Regression Model”, Doctoral Dissertation, University of North Carolina-Chapel Hill (1982).

31. Ibid.

32. However, significant coverage anomalies still exist in small samples in standard measures of such binomial proportion error. See Agresti, A. and Coull, B.A. “Approximate is Better than ‘Exact’ for Interval Estimation of Binomial Proportions”, American Statistician 52 (1998): 119–126.

33. Gillard, J. “An Overview of Linear Structural Models in Errors in Variables Regression”, op. cit.

34. RELR is only applied to the cases of binary and ordinal regression, but any model with multiple qualitative categories that would be modeled with multinomial standard logistic regression also can be modeled as separate binary RELR models. The main body of the book presents the RELR binary model in conjunction with the previous sections, but Appendix A1 describes the special case of RELR ordinal regression.

35. For another treatment of error modeling in maximum entropy and logistic regression see Golan, A., Judge, G. and Perloff, J.M. “A Maximum Entropy Approach to Recovering Information from Multinomial Response Data”, op. cit. Golan, A., Judge, G. and Perloff, J.M. Maximum Entropy Econometrics, (1996: New York, Wiley). The specification of the RELR error differs from this previous work in that RELR assumes that the log odds of error is approximately inversely proportional to t, where t can be defined for each independent variable feature as discussed in Appendix A1.

36. Luce, R.D. and Suppes, P. “Preference, Utility and Subjective Probability”, In Luce, R.D., Bush, R.R. and Galanter, E. (eds), Handbook of Mathematical Psychology 3 (1965: New York, Wiley and Sons) 249–410.

37. McFadden, D. “Conditional Logit Analysis of Qualitative Choice Behavior”, op. cit.

38. Davidson, R.R. “Some Properties of Families of Generalized Logistic Distributions”, In Ikeda, S. et al. (eds), Statistical Climatology, Developments in Atmospheric Science 13 (1980: New York, Elsevier).

39. George, E.O. and Ojo, M.O. “On a Generalization of the Logistic Distribution”, Annals of the Institute of Statistical Mathematics 32(2, A) (1980): 161–169. George and Ojo showed that when the standardized logit coefficient is multiplied by a scaling constant, these two probability distributions differ by less than 0.005% with a small number of degrees of freedom of less than 10 and this difference rapidly goes to zero with larger numbers of degrees of freedom.

40. This was originally shown in Rice, D.M. “Generalized Reduced Error Logistic Regression Machine”, Section on Statistical Computing—JSM Proceedings (2008): 3855–3862.

41. When only a prediction is desired, which is a typical application of Implicit RELR, balancing samples may not add much because models are not interpreted in terms of explanations. Still there can be an advantage even here in that fewer total observations are used in balanced samples. Yet, whenever Explicit RELR is computed to generate an explanation, balanced samples are strongly recommended because with no missing data the solution will not be dependent upon how the variance estimates in t values were computed. The formula for t values that is based upon Pearson correlations shown in the Appendix is used in the error model expression and feature reduction unless special conditions apply. A Student’s t value that assumes equal variance in two independent groups also gives approximately the same error modeling estimates here as does Welch’s t in balanced binary RELR samples without missing data. Yet, Welch’s t is always used in some special conditions as noted in the Appendix, as for example this can give noticeably more stable Explicit RELR feature selections when missing feature values are imbalanced across binary outcomes.

42. This issue of the Nelder marginality principle was discussed in the first chapter, Nelder, J.A. “Functional Marginality and Response Surface Fitting”, op. cit.

43. Moosbrugger, H., Schermelleh-Engel, K., Kelava, A., Klein, A.C. “Testing Multiple Nonlinear Effects in Structural Equation Modeling: A Comparison of Alternative Estimation Approaches”. Invited Chapter in Teo, T. & Khine, M.S. (eds), Structural Equation Modelling in Educational Research: Concepts and Applications (In press, Rotterdam, Sense Publishers).

44. Rice, D.M. “Generalized Reduced Error Logistic Regression Machine”, op. cit.

Chapter 3

1. Fienberg, S. “When Did Bayesian Inference Become ‘Bayesian’?” Bayesian Analysis 1(1) (2006): 1–40.

2. Frigg, R. and Werndl, C. “Entropy—A Guide for the Perplexed”, op. cit.

3. Ibid.

4. Tversky, A. and Kahneman, D. “Judgment under Uncertainty: Heuristics and Biases”, Science 185 (1974): 1124–1131.

5. Ibid.

6. The KL divergence is not a metric because it is not symmetric and does not satisfy the triangle inequality, the property which guarantees that the shortest path between two points is a straight line.
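A minimal numeric sketch of this point (the two distributions are illustrative): swapping the arguments of the KL divergence changes its value, so it fails the symmetry required of a metric.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

P = [0.5, 0.5]
Q = [0.9, 0.1]

d_pq = kl(P, Q)   # about 0.511
d_qp = kl(Q, P)   # about 0.368

print(d_pq, d_qp)  # unequal: KL is asymmetric
```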

7. Eguchi, S. and Copas, J. “Interpreting Kullback–Leibler Divergence with the Neyman–Pearson Lemma”, Journal of Multivariate Analysis 97 (2006): 2034–2040.

8. Eguchi, S. and Kano, Y. “Robustifing Maximum Likelihood Estimation” (2001): online publication.

9. Allison, P.D. Sociological Methodology, op. cit.

10. Hedeker, D. “Multilevel Models for Ordinal and Nominal Variables”, In de Leeuw, J. and Meijer, E. (eds), Handbook of Multilevel Analysis (2007: New York, Springer), pp. 239–276. Hedeker brings multilevel modeling into his survival analysis methods, but as will be reviewed in Chapter 4, RELR does not require multilevel parameters, so the parts of his method that relate to multilevel parameters do not apply to RELR.

11. Allison, P.D. Sociological Methodology, op. cit. Hedeker, D., op. cit.

12. Allison, P.D. Sociological Methodology, op. cit. Hedeker, D., op. cit.

13. Allison, P.D. Sociological Methodology, op. cit.

14. Allison, P.D. Sociological Methodology, op. cit. Hedeker, D., op. cit.

15. Hedeker, D., op. cit.

16. Hedeker, D., op. cit.

17. RELR will, however, exhibit the classic almost complete separation problems in high-dimension data relative to the sample size unless the sample is perfectly balanced with a zero intercept, as reviewed in an upcoming example. See Allison, P.D. “Convergence Failures in Logistic Regression”, op. cit. for a description of this problem.

18. Maximum entropy and maximum log likelihood return the identical solution, which is the maximum probability solution, when known constraints are entered into a model. Yet the meaning of these two measures differs when the constraints are not known, because for a given set of feature constraints the two measures have the same magnitude but opposite sign. So, in both standard logistic regression and RELR, a search across all possible selected feature sets will find that the maximum entropy solution is a different solution than the maximum log likelihood solution. In standard logistic regression, the maximum entropy solution across all possible feature sets is the trivial solution with no constraints at all, so nothing is effectively learned by maximizing uncertainty in feature selection. On the other hand, the maximum log likelihood solution across all possible feature sets contains all the features, no matter how many. Such a solution is usually associated with extreme overfitting and very poor prediction generalization to new data samples unless there are very few total feature constraints in relation to the sample size. The problem is that each additional feature added in standard logistic regression always increases the log likelihood value in the fit to the training sample. So in standard logistic regression, the log likelihood value is also not a good measure for choosing an optimal feature selection set that would generalize well in prediction to new data samples.
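The claim that each added feature can only raise the training log likelihood can be checked directly. The sketch below (my own illustration, not RELR code) fits a sequence of nested standard logistic regressions on simulated data in which only the first feature is real, and records the maximized training log likelihood at each step; it never decreases as noise features are added.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, p = 200, 6
X = rng.normal(size=(n, p))
# Only feature 0 actually drives the binary outcome; features 1..5 are noise.
y = (rng.random(n) < 1 / (1 + np.exp(-X[:, 0]))).astype(float)

def train_loglik(cols):
    """Maximized training log likelihood of a standard logistic fit."""
    Z = X[:, cols]
    def nll(w):
        eta = Z @ w
        # negative log likelihood, written stably: sum log(1+e^eta) - y'eta
        return np.sum(np.logaddexp(0.0, eta)) - y @ eta
    res = minimize(nll, np.zeros(len(cols)), method="BFGS")
    return -res.fun

# Nested feature sets {0}, {0,1}, ..., {0,...,5}.
lls = [train_loglik(list(range(k))) for k in range(1, p + 1)]
print(lls)   # monotonically non-decreasing in the training sample
```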

19. Seni and Elder, op. cit.

20. While this is implemented as a backward selection process in current software implementations, it is possible that an equivalent process could be implemented as a forward selection process which simply adds features based upon t-value magnitudes while controlling for bias in even versus odd polynomial features. Since the only goal of this feature selection is prediction, backward and forward selection processes can give identical selections and predictions. But when they differ in real-world data, it is because backward selection maintains linear features longer in the process; linear features are more prevalent than nonlinear features because binary and categorical variables cannot be nonlinear. In such cases, it is usually an advantage that linear features are maintained longer, because nonlinear features are often likely to be spurious or even false-positive correlation effects as reviewed in the last chapter.

21. Should a tie between two or more solutions ever occur, this method simply selects the solution with the fewest features.

22. Seni, G. and Elder, J., op. cit.

23. Seni, G. and Elder, J., op. cit.

24. One obvious, very minor change that may sometimes lead to a slight improvement in observation log likelihood fit would be to allow either an odd- or even-powered polynomial feature to be dropped at each step, but to require that both occur within every two steps. This may sometimes yield slightly larger observation log likelihood values across all steps than dropping both one odd- and one even-powered polynomial feature at each step, but it requires double the processing. The reason that it has not been implemented in Implicit RELR is that the improvement in the log likelihood is very small and not significant, whereas the savings of half the processing effort with the current implementation is substantial.

25. van der Laan, M.J. and Rose, S. “Targeted Learning: Causal Learning for Observational and Experimental Data”, (Springer, New York, 2012). See Chapter 3 on Super Learners.

26. Rice, D.M. JSM Proceedings, op. cit. Note that the very poor performance of Partial Least Squares (PLS) is more about the fact that many of the predictive features were binary features, which is actually a violation of best practice assumptions in PLS and Principal Component Analysis (PCA). Yet, this is a clear example of the problems that arise when restrictive assumptions of standard predictive analytic methods like PCA and PLS are not met.

27. Mitchell, T. “Generative and Discriminative Classifiers: Naïve Bayes and Logistic Regression”, Online draft.

28. Haury, A.C., Gestraud, P. and Vert, J.P. “The Influence of Feature Selection Methods on Accuracy, Stability and Interpretability of Molecular Signatures”, PLoS ONE 6(12) (2011): e28210. http://dx.doi.org/10.1371/journal.pone.0028210.

29. Hand, D.J. “Measuring Classifier Performance: A Coherent Alternative to the Area Under the ROC Curve”. Machine Learning 77 (2009): 103–123.

30. Hosmer, D.W. and Lemeshow, S., op. cit. They give a more detailed discussion of Wald and Likelihood Ratio test theory and computations in logistic regression, along with other similar tests such as the Score test. A very succinct description is that Likelihood Ratios are used to compute a χ2 that estimates the effect of removing one parameter. The reduced model must be completely nested within the fuller model; that is, everything is the same in both models except that the parameter being tested is zeroed in the reduced model. In this test, minus two times the difference in log likelihood between the reduced model and the fuller model, or −2ΔLL, is distributed as a χ2 variable with one degree of freedom.
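The mechanics are a few lines; the two log likelihoods below are hypothetical values chosen for illustration:

```python
from scipy.stats import chi2

# Hypothetical log likelihoods for a reduced model nested in a fuller model
# (identical except that one parameter is zeroed in the reduced model).
ll_reduced, ll_full = -42.10, -38.67

lr = -2.0 * (ll_reduced - ll_full)   # likelihood ratio statistic
p_value = chi2.sf(lr, df=1)          # chi-square with one degree of freedom
print(lr, p_value)
```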

31. For example, it is expected that t-values that measure the reliability of correlation effects and that are closer to zero would vary less in magnitude across independent samples than larger magnitude t-values. In large enough samples where t-values that reflect the reliability of Pearson correlations would be expected to be stable, such as 1000 observations, it would not be surprising if a t-value varied between 248 and 265 across two independent samples, because this would translate into a correlation varying between 0.992 and 0.993, which we know intuitively is likely to be rounding error. But it would be surprising if a t-value varied between 0 and 17 across two similar samples, because this would reflect the very unlikely occurrence of a correlation varying between 0 and 0.475 across these two samples.
Fisher developed a test called the z′ test to correct for the tendency of Pearson correlations to have differential variability across their range and thus to allow correlations to be compared with one another. But here we are talking about how t-values that measure correlation effects have a different type of variability across their range. Yet Fisher's test would predict that a correlation that changes between 0 and 0.475 across two independent samples of 1000 observations is a very unlikely occurrence, whereas a fluctuation between 0.992 and 0.993 is a likely occurrence. See Fisher, R.A. “On the ‘Probable Error’ of a Coefficient of Correlation Deduced from a Small Sample”, Metron 1 (1921): 3–32.
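The numbers in this note can be reproduced from the standard formulas t = r√(n − 2)/√(1 − r²) and Fisher's z′ = ½ ln((1 + r)/(1 − r)); the sketch below is only an arithmetic check of the note's examples.

```python
import math

def t_from_r(r, n):
    """t statistic for testing a Pearson correlation r in a sample of size n."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

n = 1000
# Near the top of the range, a large change in t is a tiny change in r:
print(t_from_r(0.992, n), t_from_r(0.993, n))   # roughly 248 and 266
print(t_from_r(0.475, n))                       # roughly 17

def fisher_z(r):
    """Fisher's z' variance-stabilizing transform of a correlation."""
    return 0.5 * math.log((1 + r) / (1 - r))

# Comparing two independent sample correlations: difference of z' values over
# the standard error sqrt(1/(n1 - 3) + 1/(n2 - 3)).
se_diff = math.sqrt(2.0 / (n - 3))
z_small = (fisher_z(0.993) - fisher_z(0.992)) / se_diff   # well under 1.96: likely
z_large = (fisher_z(0.475) - fisher_z(0.0)) / se_diff     # far above 1.96: unlikely
print(z_small, z_large)
```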

32. Different types of data in terms of different domains may require somewhat different considerations in reasonable feature reduction short lists for Explicit RELR, but there will be much less variability within a domain. Like Implicit RELR, Explicit RELR's feature selection is clearly multivariate in the sense of being able to handle interaction effects. Yet, unlike Implicit RELR, its regression coefficients in its final models are often not proportional to t-value magnitudes. Still, it does have similarities to Implicit RELR in that the relative magnitude and signs of the regression coefficients are not substantially affected by the presence or absence of other features in the model.
Depending upon the starting point in terms of the magnitude of the t-values that are considered as candidate features, Explicit RELR may miss causal features that have very small univariate effects. This would happen, for example, with causal features that are almost perfectly negatively correlated to one another, if they cancel and have little effect on the dependent variable. But such extreme cancellation, while certainly possible, is unlikely. As shown in the simulations of the last chapter, even when linear and cubic features are simulated to have regression coefficients of the same magnitude but oppositely correlated, the t-values observed in RELR are still substantial with such opposing features and consistent with a Taylor series approximation to the sine function which simply gives less weight to the cubic effect. One may view Explicit RELR as obtaining the most likely feature selection under the condition that parsimony is a strong objective, but this does not mean that Explicit RELR will necessarily select causal features.

33. This can be observed as follows: the regression coefficients shown in Fig. 3.2(a) are from an RELR model with 176 features, whereas those in Fig. 3.2(c) and (d) are from an RELR model based upon the same data set with 9 features; standard error se values are higher with 9 features, so stability s, or 1/se, is lower with 9 remaining features.

34. The χ² values that are required to compute the magnitude of the derivative in Eqn (3.5b) can be computed using methods such as the Likelihood Ratio, the Score or the Wald approximation. Yet, commercial software implementations of RELR use the Wald approximation to calculate χ² and get the corresponding χ value. Note that a very easy way to compute this derivative that avoids the chain rule in the appendix is to realize that d(βs)²/ds is simply 2β²s or 2βχ, since s = 1/se and χ = β/se.
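A finite-difference check confirms the shortcut; β and se below are made-up values.

```python
# Check that d(beta*s)^2/ds = 2*beta^2*s = 2*beta*chi, where s = 1/se and
# chi = beta/se; beta and se are arbitrary illustrative values.
beta, se = 0.8, 0.25
s, chi = 1.0 / se, beta / se

f = lambda s_: (beta * s_) ** 2
h = 1e-6
numeric = (f(s + h) - f(s - h)) / (2 * h)   # central difference
analytic = 2 * beta * chi                   # equals 2 * beta**2 * s
print(numeric, analytic)
```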

35. Available at http://www.umass.edu/statdata/statdata/data/.

36. In RELR, interaction and nonlinear effects are always computed between standardized variables, and then the interaction or nonlinear feature is itself standardized. This removes the marginality problems related to differential scaling when interaction and nonlinear effects are introduced in regression modeling, as all such effects are on the same relative scale as main effects.

37. King, G. and Zeng, L. “Logistic Regression in Rare Events Data”, Political Analysis 9 (2001): 137–163. Copy at http://j.mp/lBZoIi. In accord with their Eqn (7) on p. 144 in this article, the adjusted intercept α can be defined as α = −ln((1 − τ)/τ) in the case of a perfectly balanced stratified sample by design, where τ is an estimate of the fraction of 1's that would be found in the population, obtained from the proportion of 1's prior to sample balancing in the original Explicit RELR sample used in training.
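As a quick sanity check of this adjustment (the 2% event rate below is a made-up example): with a zero intercept in the balanced training model, the adjusted intercept alone recovers the population base rate.

```python
import math

def adjusted_intercept(tau):
    """Intercept adjustment for a perfectly balanced stratified sample,
    where tau is the estimated population fraction of 1's."""
    return -math.log((1 - tau) / tau)

# Made-up rare event: 2% of 1's in the population, sample balanced 50/50.
alpha = adjusted_intercept(0.02)
prob_at_baseline = 1 / (1 + math.exp(-alpha))
print(alpha, prob_at_baseline)   # the adjustment recovers the 2% base rate
```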

38. Hosmer, D.W. and Lemeshow, S., op. cit. They point out that when the intercept is used, there should be K − 1 design or dummy variable levels in standard logistic regression. In the case shown in Fig. 3.4(b), where there is no intercept, RELR and standard logistic regression under the Hosmer and Lemeshow criteria will have the same number of levels. But more generally, RELR will always have K dummy variables for categorical variables with K levels.

39. Additionally, interaction effects can be formed with each specific categorical level feature. Because minimal KL divergence learning (sequential online learning) usually cannot be applied in the case of nonlongitudinal data where correlated observations cannot be assumed to be prior to one another, RELR only uses similar dummy-coded categorical features for all such correlated observations. For example, a standardized dummy code for a specific school (coded for School A versus Not School A) would be used for each different school in a model that compares high and low test scoring students. Such a categorical feature corresponding to each school will control for correlations among students within each school, as each school will have a unique regression coefficient parameter. Should a feature like a specific school be selected as a feature, then clearly this feature cannot be generalized to other schools. This specific school feature would be treated as a missing value in a more general scoring or updating sample that does not include that specific school. Even though any effect related to a specific school cannot be generalized beyond that school, at least the inclusion of an effect like that for a specific school controls for correlations in the sample that are caused by attendance at that school. And RELR's ability to handle unlimited numbers of categorical levels as standardized dummy variables even in small data samples and select unique effects for each such dummy variable greatly aids in the interpretation of such models.

40. The final selected one-feature Explicit RELR model depicted in Fig. 3.4(b) was not completely nested within the final two-feature model depicted in Fig. 3.4(d) because these models had a different number of pseudo-observations. However, if we introduce a binary parameter to reflect whether or not pseudo-observations corresponding to a given feature were zeroed, then this results in 1 df for this parameter and 1 df for the extra feature selected. Thus, a nested chi-square test based upon −2 times the difference in the RLL observed in an independent cross-sectional holdout sample in Fig. 3.4(c) relative to that in Fig. 3.4(a) would be appropriate, as −2 times the difference in log likelihood between one model and another model nested within it generates a χ2 statistical test. This yielded a value of −2[(−42.10) − (−38.67)] = 6.86, which was significant (p < 0.05).

41. van der Laan, M.J. and Rose, S., op. cit.

42. Hastie, T., Tibshirani, R. and Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, (2011: Springer, New York) 2nd edition, online publication.

43. Ibid. Note that this definition of AIC multiplies the right-hand side by N, the training sample size, which is appropriate if one is only comparing relative values of AIC at a fixed training sample size as in this definition.

44. Thomas Ball. Modeling First Year Achievement Using RELR, Report based upon independent research on RELR, July 2011. Can be downloaded at http://www.riceanalytics.com/_wsn/page13.html. Note that Parsed RELR was the name used at that time for what was essentially Explicit RELR without zero intercepts and perfectly balanced samples.

Chapter 4

1. Newton, I. The Mathematical Principles of Natural Philosophy, trans. Andrew Motte (London, 1729), pp. 387–393.

2. Padmanabhan, T. “Thermodynamical Aspects of Gravity: New Insights”, Reports on Progress in Physics 73(4) (2009): 6901; Verlinde, E.P. “On the Origin of Gravity and the Laws of Newton”, Journal of High Energy Physics 29 (2011): 1–26.

3. Ioannidis, J.P.A. “Why Most Published Research Findings Are False”, PLoS Med 2(8) (2005): e124 http://dx.doi.org/10.1371/journal.pmed.0020124.

4. Rosenbaum, P.R. and Rubin, D.B. “Constructing a Control Group using Multivariate Matched Sampling Incorporating the Propensity Score”, The American Statistician 39 (1985): 33–38.

5. Imai, K. “Covariate Balancing Propensity Score”, can be downloaded from Imai's Princeton University homepage.

6. Austin, P.C. “The Performance of Different Propensity-Score Methods for Estimating Differences in Proportions (Risk Differences or Absolute Risk Reductions) in Observational Studies”, Statistics in Medicine 29(20) (2010): 2137–2148.

7. Rubin, D.B. “The Design versus the Analysis of Observational Studies for Causal Effects: Parallels with the Design of Randomized Trials”, Statistics in Medicine 26 (1) (2007): 20–36.

8. Pearl, J. “Remarks on the Method of the Propensity Score”, Statistics in Medicine 28 (2009): 1415–1424.

9. Gelman, A. “Resolving Disputes between Pearl, J. and Rubin, D. on Causal Inference”, http://andrewgelman.com/2009/07/disputes_about/. Note that Pearl is actually simply rephrasing a comment by another participant Phil in the first part of this quote before he agrees that it is the “crux of the matter”.

10. Ibid.

11. Frölich, M. “Propensity Score Matching without Conditional Independence Assumption—with an Application to the Gender Wage Gap in the United Kingdom”, Econometrics Journal 10 (2007): 359–407.

12. Shadish, W.R. and Clark, M.H. Unpublished work reviewed by Rubin, D.B., (2007): op. cit.

13. Rose and van der Laan propose a method for causal learning that is not a propensity score method. Instead, they employ a two-stage process where the initial stage is what they call a Super Learner, designed to build a best predictive model from a large number of candidates as mentioned in the last chapter. Their Super Learner process could be an ensemble modeling process that averages separate candidate models, or it could be a process that just returns one parsimonious model if a good parsimonious model is supplied as a candidate by a user. In any case, as reviewed in the last chapter, this Super Learner process does require a user to specify a loss function that defines a best model. The goal of this best model is to provide an initial model that controls for all covariates in an assessment of a causal treatment effect. The second stage of this process is what Rose and van der Laan term Targeted Maximum Likelihood Estimation or TMLE. They describe TMLE as a semiparametric method that provides an optimal tradeoff between bias and variance in the estimation of targeted effects. For binary outcomes, TMLE updates the original Super Learner model in a logistic regression step that views the original Super Learner model as an intercept offset and a given targeted variable as the independent variable. In this way, TMLE is targeted toward the parameter of interest to return an estimate that has relatively low bias compared to a straight logistic regression with no such controls. The relatively unbiased targeted effect that is measured in TMLE may be interpreted as a causal factor when causal assumptions are made. Like TMLE, RELR's approach to causal reasoning completely avoids the construction of a first-stage propensity score model.
Yet TMLE may be designed to test preexisting causal hypotheses rather than to discover new causal hypotheses, whereas RELR's causal reasoning process is designed to do both, as it employs Explicit RELR to discover new putative causal features. There are other aspects of RELR's approach to causal effect estimates that are similar only in a very superficial way to how TMLE estimates effects in binary outcomes, as RELR also makes use of offset estimates. Yet RELR uses the offset to adjust the covariate weights rather than the putative causal effects, and another critical distinction is that RELR is used instead of standard logistic regression in the offset regression. Because TMLE is based upon standard maximum likelihood logistic regression in its assessment of targeted effects, it may return causal effects that are more biased than those from RELR.
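The offset idea described here can be sketched in a few lines. This is a minimal illustration of the targeting step only, not the authors' TMLE software; the data, the stage-one linear predictor, and the effect size are all simulated assumptions.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
n = 500
treat = rng.integers(0, 2, n).astype(float)   # targeted (putative causal) variable
covar = rng.normal(size=n)                    # a covariate to be controlled
true_effect = 0.8                             # simulated treatment effect
eta_true = 0.5 * covar + true_effect * treat
y = (rng.random(n) < 1 / (1 + np.exp(-eta_true))).astype(float)

# Stage-one stand-in: a fixed linear predictor playing the initial model's role.
offset = 0.5 * covar

# Stage two: logistic regression with the stage-one model as an intercept
# offset and the targeted variable as the only free parameter.
def nll(eps):
    eta = offset + eps * treat
    return np.sum(np.logaddexp(0.0, eta)) - y @ eta

eps_hat = minimize_scalar(nll).x
print(eps_hat)   # targeted estimate, in the neighborhood of the simulated 0.8
```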

14. Rubin, D.B. “Using Multivariate Matched Sampling and Regression Adjustment to Control Bias in Observational Studies”, Journal of the American Statistical Association 74 (1979): 318–328.

15. Other hypotheses can be tested other than concerning Explicit RELR's selected features and whether matched subgroups dictated by whether they are above or below the mean of these features differ. However, these hypothesis-testing decisions need to be made in advance of viewing the data used in the test. Obviously when hypotheses become a part of the process, there is a greater possibility of subjective biases which would be avoided by the automated procedures outlined in Appendix A8.

16. Austin, P.C. “A Critical Appraisal of Propensity Score Matching in the Medical Literature between 1996 and 2003”, Statistics in Medicine 27(12) (2008): 2037–2049.

17. Deza, M.M. and Deza, E. Encyclopedia of Distances, (2013: Berlin, Springer-Verlag 2nd Edition).

18. Topsøe, F. “Some Inequalities for Information Divergence and Related Measures of Discrimination”, IEEE Transactions on Information Theory 46 (2000): 1602–1609.

19. Ibid. This Topsøe distance is simply twice the better known Jensen–Shannon divergence measure when equal weight is given to each probability distribution that is matched. The Topsøe distance is not a metric, but its square root is a metric as it has the important triangular property. In this RELR causal reasoning application, the closest matching based upon the Topsøe distance is exactly the same matching as that based upon the square root of the Topsøe distance, so this assures that the matching is based upon a well-behaved metric.
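Both properties are easy to verify on small distributions (made up here for illustration): the square root obeys the triangle inequality, and since the square root is monotone, ordering candidate matches by Topsøe distance or by its square root picks the same closest match.

```python
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def topsoe(p, q):
    """Topsoe distance: twice the Jensen-Shannon divergence, equal weights."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return kl(p, m) + kl(q, m)

p, q, r = [0.9, 0.1], [0.5, 0.5], [0.1, 0.9]

root = lambda a, b: math.sqrt(topsoe(a, b))
# The square root satisfies the triangle inequality on this triple...
print(root(p, r), root(p, q) + root(q, r))
# ...and the ordering of distances from p is unchanged by the square root.
print(topsoe(p, q) < topsoe(p, r), root(p, q) < root(p, r))
```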

20. Eguchi, S. and Copas, J., op. cit.

21. Schneidman, E., Bialek, W. and Berry, M.J. “An Information Theoretic Approach to the Functional Classification of Neurons”, Advances in Neural Information Processing Systems 15 (2003): 197–204.

22. See case study from STATA manual which shows that conditional logistic regression and McNemar's test produce very similar results; www.stata-press.com/manuals/stata10/clogit.pdf.

23. Pearl, J. Causality, (2009: Cambridge University Press, Cambridge 2nd Edition).

24. Hosmer, D.W. and Lemeshow, S., op. cit.

25. Ibid. They reference Breslow and Day (1980) as the source in the literature for this fact.

26. Mitchell, T., op. cit.

27. Ibid.

28. Gelman, A., Carlin, J.B., Stern, H.S. and Rubin, D.B. Bayesian Data Analysis, (2004: Boca Raton, Chapman and Hall/CRC 2nd Edition).

29. Ibid.

30. Crabbe, M. and Vandebroek, M. “Improving the Efficiency of Individualized Designs for the Mixed Logit Choice Model by Including Covariates”, Online paper.

31. Orme, B. and Howell, J. “Applications of covariates within Sawtooth Software's HB/CBC program: theory and practical example”, Sawtooth Software Conference Papers (2009: Sequoia, WA, Sawtooth Software).

32. Chen, D. “Sisterhood of Classifiers: A Comparative Study of Naïve Bayes and Noisy-or Networks”, Online publication.

33. Ibid.

34. Twardy, C.R. and Korb, K.B. “Causal Interaction in Bayesian Networks”, Online paper.

35. Chickering, D.M., Heckerman, D. and Meek, C. “Large-sample Learning of Bayesian Networks is NP-hard”, Journal of Machine Learning Research 5 (2004): 1287–1330.

36. Pearson, K., Lee, A. and Bramley-Moore, L. “Genetic (Reproductive) Selection: Inheritance of Fertility in Man”, Philosophical Transactions of the Royal Society, Series A 173 (1899): 534–539.

37. Simpson, E.H. “The Interpretation of Interaction in Contingency Tables”, Journal of the Royal Statistical Society, Series B 13 (1951): 238–241.

38. Charig, C. R., Webb, D. R., Payne, S. R., Wickham, O. E. “Comparison of Treatment of Renal Calculi by Operative Surgery, Percutaneous Nephrolithotomy, and Extracorporeal Shock Wave Lithotripsy”, British Medical Journal (Clinical Research Ed.) 292 (6524) (1986): 879–882.

39. Pearl, J. Causality, op. cit.

40. Rijmen, F. “Bayesian Networks with a Logistic Regression Model for the Conditional Probabilities”, International Journal of Approximate Reasoning 48(2) (2008): 659–666.

41. Chen, D., op. cit.

42. These comparisons are also problematic because these studies used the Receiver Operating Characteristic area under the curve (ROC AUC) measure. The ROC AUC has been a widely used measure of classification accuracy, but it has recently been shown to have some fundamental flaws. These include that the AUC measure fluctuates randomly, so that an accurate confidence interval often cannot be obtained, and that the validity of the AUC measure is questionable when the ROC curves from different models cross. These fundamental problems may often exist even at very large sample sizes. The fluctuation can be significant, much like a stopwatch that randomly varies by 1 s in a 100 m sprint, so it can affect the outcome of comparisons. Hand, D., op. cit.

43. Gevaert, O., De Smet, F., Kirk, E., Van Calster, B., Bourne, T., Van Huffel, S., Moreau, Y., Timmerman, D., De Moor, B. and Condous, G. “Predicting the Outcome of Pregnancies of Unknown Location: Bayesian Networks with Expert Prior Information Compared to Logistic Regression”, Human Reproduction 21(7) (2006): 1824–1831.

44. Le, Q.A., Strylewicz, G., Doctor, J.N. “Detecting Blood Laboratory Errors Using a Bayesian Network: An Evaluation on Liver Enzyme Tests”, Medical Decision Making 31(2) (2011): 325–337.

45. Gopnik, A., Glymour, C., Sobel, D., Schulz, L.E., Kushnir, T. and Danks, D. “A Theory of Causal Learning in Children: Causal Maps and Bayes Nets”, Psychological Review 111(1) (2004): 3–32.

Chapter 5

1. Cajal, S. Ramón y. “Estructura de los Centros Nerviosos de las Aves”, Revista Trimestral de Histología Normal y Patológica. First published in 1888 and translated in Clarke, E. and O'Malley, C.D. The Human Brain and Spinal Cord: A Historical Study Illustrated by Writings from Antiquity to the Twentieth Century, (1996: San Francisco, Norman Publishing 2nd revised edition).

2. Ibid.

3. Hocking, A.B. and Levy, W.B. “Computing Conditional Probabilities in a Minimal CA3 Pyramidal Neuron”, Neurocomputing 65–66 (2005): 297–303.

4. Aosaki, T., Tsubokawa, H., Ishida, A., Watanabe, K., Graybiel, A.M. and Kimura, M. “Responses of Tonically Active Neurons in the Primate's Striatum Undergo Systematic Changes during Behavioral Sensorimotor Conditioning”, The Journal of Neuroscience 14(6) (1994): 3969–3984. In this study involving implicit memory learning in monkeys, neural responses in basal ganglia, linked to targeted licking responses that were rewarded when auditory cues were presented, were observed within 10 min of training. These neural responses occurred after about 100 auditory cues and after much lower numbers of targeted licking responses.

5. Rutishauser, Mamelak, and Schuman (2006): Neuron, op. cit. This study in humans actually showed single trial learning in select medial temporal lobe neurons in old–new visual image recognition memory learning, although it took the recorded population six trials on average to reach 90% discrimination accuracy in a predictive model.

6. The idea that field properties within the brain could have significant effects in the brain's cognitive function has received empirical support in recent years. As one example, see Nunez, P.L. and Srinivasan, R. “A Theoretical Basis for Standing and Traveling Brain Waves Measured with Human EEG with Implications for an Integrated Consciousness.” Clinical Neurophysiology, 117(11) (2006 November): 2424–2435.

7. Not all neurons have graded potentials that directly produce spiking, as electrotonic neurons only produce graded potentials, which may then help cause spiking in adjacent neurons through gap junction (electrical) synapses.

8. Public domain image available at http://en.wikipedia.org/wiki/File:PurkinjeCell.jpg.

9. Avitan, L., Teicher, M. and Abeles, M. “EEG Generator—A Model of Potentials in a Volume Conductor”, Journal of Neurophysiology, 102(5) (2009): 3046–3059.

10. Note that binary variables in RELR are also viewed as interval variables that reflect a probability of occurrence between zero and one; missing values in independent variables are given mean-imputed values, which are interior points of this interval. The rationale behind mean imputation is reviewed in the next chapter. Note that ordinal variables are also possible as either independent or dependent variables in RELR, and care should be taken to ensure that these variables indicate rank when used to compute Pearson correlations in the error model and feature reduction.

11. http://commons.wikimedia.org/wiki/File:Artificial_Neuron_Scheme.png.

12. Hebb's idea is often simplified as “neurons that fire together wire together”, but this may be too simple because it does not account for negative correlations in spiking between adjacent neurons in a feedforward network, which would occur in the case of inhibitory synaptic connections. See Hebb, D.O. The Organization of Behavior (1949: New York, John Wiley & Sons). The research area known as spike time dependent plasticity is largely based upon Hebb's original ideas about neural learning.

13. The mathematical details are shown in an appendix, which is based upon the JSM paper that can be downloaded from www.riceanalytics.com. See above note.

14. Izhikevich, E.M. “Simple Model of Spiking Neurons”, IEEE Transactions on Neural Networks 14 (2003): 1569–1572.

15. Electronic versions of the figure and reproduction permissions are freely available at www.izhikevich.com.

16. Izhikevich, E.M. Dynamical Systems in Neuroscience, (2007: Cambridge MA, MIT Press). The properties of these coefficients are described on p. 155 in this reference, along with an example of how the sign of the b coefficient that determines the linear v term in the time derivative of u would determine whether this u variable has an amplifying or resonating effect. If we drop the linear v term and u variable altogether, this gives a quadratic integrate and fire neuron as shown on p. 80 of this reference. So, interesting dynamic variations are possible that reflect real neural dynamic properties simply by varying these parameters within the simple model.
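The simple model referenced in these notes can be run in a few lines. The sketch below uses the regular-spiking parameter values (a = 0.02, b = 0.2, c = −65, d = 8) published in the 2003 paper; the constant input current and the Euler step size are made-up choices for illustration.

```python
# Euler simulation of Izhikevich's simple spiking model:
#   v' = 0.04 v^2 + 5 v + 140 - u + I,   u' = a (b v - u),
#   with reset v <- c, u <- u + d when v reaches the 30 mV cutoff.
a, b, c, d = 0.02, 0.2, -65.0, 8.0   # regular-spiking parameters
v, u = -65.0, b * -65.0              # start at rest
dt, I = 0.5, 10.0                    # step (ms) and constant input (assumed)
spikes = 0
for _ in range(2000):                # 1000 ms of simulated time
    if v >= 30.0:                    # apply the reset before integrating
        v, u = c, u + d
        spikes += 1
    v += dt * (0.04 * v * v + 5.0 * v + 140.0 - u + I)
    u += dt * a * (b * v - u)

print(spikes)   # with this drive there is no resting equilibrium: tonic firing
```

Varying a, b, c, d as the note describes reproduces the other firing classes (bursting, fast-spiking, and so on) without changing the equations.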

17. Touboul, J. “Importance of the Cutoff Value in the Quadratic Adaptive Integrate-and-Fire Model”, Neural Computation 21 (2009): 2114–2122.

18. Izhikevich, E.M. “Hybrid Spiking Models”, Philosophical Transactions of the Royal Society 368(A) (2010): 5061–5070.

19. Touboul, J. and Brette, R. “Spiking Dynamics of Bidimensional Integrate-and-Fire Neurons”, SIAM Journal on Applied Dynamical Systems 8(4) (2009): 1462–1506.

20. Izhikevich, Dynamical Systems in Neuroscience, op. cit., pp. 106–107.

21. Markram, H. “On simulating the brain—the next decisive years”, Lecture given at International Supercomputing Conference, July 2011. Video at http://www.kurzweilai.net/henry-markram-simulating-the-brain-next-decisive-years.

22. Izhikevich, Dynamical Systems in Neuroscience, op. cit., pp. 292–294.

23. Gerstner, W. and Kistler, W. Spiking Neuron Models: Single Neurons, Populations, Plasticity, (2002: Cambridge, Cambridge University Press). See section on multi-compartment modeling at http://icwww.epfl.ch/∼gerstner/SPNM/node29.html.

24. Izhikevich, E.M. and Edelman, G.M. “Large-Scale Model of Mammalian Thalamocortical Systems”, PNAS 105(9) (2008): 3593–3598.

25. In the Izhikevich simple neural dynamics model, depending upon the signal strength in a given dendritic compartment, the signals that are passed on to the next compartment can be either passive and degrade with distance or active and maintain strength with distance as a dendritic spike. The output of each dendritic compartment's signal is then passed to the next compartment in a trajectory leading to the axonal hillock.

26. London, M. and Hausser, M. “Dendritic Computation”, Annual Review of Neuroscience 28 (2005): 503–532.

27. Gerstner and Kistler, Spiking Neuron Models, Single Neurons, Populations, Plasticity (2002): op. cit.

28. Izhikevich and Edelman, PNAS (2008): op. cit. See description of synaptic dynamics and plasticity in their appendix.

29. Schneidman, E., Berry, M.J., Segev, R. and Bialek, W. “Weak Pairwise Correlations Imply Strongly Correlated Network States in a Neural Population”, Nature 440(20) (2006): 1007–1012.

30. RELR allows very specific higher order nonlinear or interaction effects to be selected without selecting lower order effects. In other words, it is not subject to the principle of marginality suggested by the late John Nelder, which would require, for example, that two main effect terms always be selected whenever an interaction between these two effects is selected. Nelder pointed out that arbitrary scaling of a variable relative to others will arbitrarily influence the regression model if marginality is not honored, as in the selection of main effects when an interaction is selected. The reason that RELR is not subject to this limitation is that all independent variables, including all interaction and nonlinear effects, are standardized to become z-scores with a mean of 0 and a standard deviation of 1. Additionally, each interaction is computed based upon standardized variables, so whether temperature is initially measured in Celsius or Fahrenheit units has no influence on the RELR model. See Nelder, Journal of Applied Statistics (2000): op. cit.
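The Celsius/Fahrenheit point can be demonstrated directly: standardizing each variable before forming the interaction, and then standardizing the interaction itself, removes any dependence on the original units. The data below are simulated for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

def z(x):
    """Standardize to mean 0, standard deviation 1."""
    return (x - x.mean()) / x.std()

celsius = rng.normal(20.0, 8.0, 1000)
other = rng.normal(size=1000)
fahrenheit = celsius * 9 / 5 + 32   # same quantity, arbitrary change of units

# Interaction formed from standardized variables, then itself standardized:
inter_c = z(z(celsius) * z(other))
inter_f = z(z(fahrenheit) * z(other))

print(np.max(np.abs(inter_c - inter_f)))   # identical up to rounding error
```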

31. Rutishauser, Mamelak, and Schuman (2006): Neuron, op. cit.

32. Xu, F. and Tenenbaum, J.B. “Word Learning as Bayesian Inference”, Psychological Review 114 (2007): 245–272.

33. Xu, F. and Tenenbaum, J.B. “Sensitivity to Sampling in Bayesian Word Learning”, Developmental Science 10 (2007): 288–297.

34. Izhikevich, Dynamical Systems in Neuroscience, op. cit., p. 2.

35. If we assume 1000 independent linear inputs with two-way interactions and no nonlinear effects, then we get approximately 1000²/2 or 500,000 two-way interactions, which together with the 1000 linear effects gives 501,000 total independent variables.

36. We simply multiply our 500,000 two-way interactions by four and add 4 × 1000 effects for the linear main effects and the associated quadratic, cubic, and quartic effects, to give 2,004,000 candidate effects.
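The counting in notes 35 and 36 can be reproduced in a few lines (a sketch; the two-way interaction count is approximated as n²/2 rather than the exact n(n − 1)/2, following the notes):

```python
# Candidate-effect counts for n = 1000 linear inputs, as in notes 35 and 36.
n_inputs = 1000

two_way = n_inputs ** 2 // 2           # 500,000 two-way interactions (approx.)
linear_total = two_way + n_inputs      # 501,000 total with linear main effects

# Four polynomial forms per interaction, plus linear, quadratic, cubic,
# and quartic forms of each main effect.
poly_total = 4 * two_way + 4 * n_inputs

print(linear_total, poly_total)  # 501000 2004000
```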

37. Dahlquist, G. and Björck, Å. Numerical Methods, (1974: Englewood Cliffs, Prentice Hall). See pp. 101–103 on the Runge Phenomenon. This Dahlquist and Björck rule is that the degree of polynomial n must be less than 2√m, where m equals the number of sampled points, in a polynomial interpolation system of equations. In a multivariate regression model with many linear main effects, polynomial effects, and interaction effects, n may be considered to refer to the total number of independent variable effects and m to the number of training sample observations. This is because the number of variables in a system of polynomial interpolating equations also includes all interaction terms and nonlinear terms. Note that when there are a small number of linear variables with no interaction or nonlinear terms, as traditionally occurred in standard regression methods before the era of Big Data due to processing speed constraints, this rule is basically the traditional “10:1 rule” that was used to determine how many independent variable effects could be employed in a regression model given the number of observations. For example, if we have 400 observations, so that m = 400, and if we assume that each independent variable is like a separate polynomial effect in an interpolating system of equations, then the Dahlquist and Björck rule tells us that we can choose 40 linear main effect variables, or n = 40 in this case, which is exactly the traditional 10:1 rule that we must have 10 observations for every variable.
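A minimal check of the worked example in this note (the function name is ours, not from the source):

```python
import math

# Dahlquist and Björck bound: the polynomial degree n must be less than
# 2 * sqrt(m), where m is the number of sampled points.
def max_effects(m):
    return int(2 * math.sqrt(m))

# With m = 400 observations the bound allows n = 40 effects, which is the
# traditional 10:1 observations-to-variables rule.
print(max_effects(400))        # 40
print(400 / max_effects(400))  # 10.0
```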

38. Ibid. Assuming that a neuron fires 100 times a second, we get 360,000 spike responses in an hour. If we employ the Dahlquist and Björck rule and let m be the number of spike responses and N the number of total features including interaction effects, then m = N²/2 = 501,000²/2, which implies that m = 125,500,500,000 spikes. So at the rate of 360,000 spikes/h, we have 125,500,500,000 spikes/360,000 spikes/h, or about 348,612 h, which is about 40 years for even minimally reliable learning.
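Note 38's arithmetic, under its own assumptions (100 spikes/s and m = N²/2), can be sketched as:

```python
# Required spikes and learning time under note 38's assumptions.
N = 501_000                      # total candidate features (note 35)
m = N ** 2 // 2                  # 125,500,500,000 required spike responses
spikes_per_hour = 100 * 60 * 60  # 360,000 spikes/h at 100 spikes per second

hours = m / spikes_per_hour      # about 348,612 h
years = hours / (24 * 365)       # about 40 years

print(f"{m} spikes, {hours:.0f} h, {years:.1f} years")
```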

39. Random Forest methods suffer from their own problems that have been highlighted in recent years such as biased solutions that are still corrupted by multicollinearity and other variable selection problems. See Strobl, C. et al. (2007) op. cit.

40. The instability of L2 norm Penalized Logistic Regression parameters with small samples is exemplified in a model reviewed in Rice, D.M. “Generalized Reduced Error Logistic Regression Machine”, Section on Statistical Computing—JSM Proceedings (2008): 3855–3862.

41. Poggio, T. and Girosi, F. “Extensions of a Theory of Networks for Approximation and Learning: Dimensionality Reduction and Clustering”, AI Memo/CBIP Paper 1167(44) (1990): 1–18.

42. Mitchell, T.M. “Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression”, Chapter 1 in Machine Learning, Online book (2010): 1–17.

43. Hebb, D.O. (1949). The organization of behavior. New York: Wiley & Sons.

44. Rosenblatt, F. “The Perceptron—A Perceiving and Recognizing Automaton”, Report 85-460-1, Cornell Aeronautical Laboratory (1957). The perceptron performed classification but not regression, so the perceptron is still quite distinct from today's standard maximum likelihood regression.

45. Minsky, M.L. and Papert, S.A. Perceptrons (1969: Cambridge MA, MIT Press).

46. Rumelhart, D.E., Hinton, G.E. and Williams, R. J. “Learning Internal Representations by Error Propagation”, In Rumelhart, D.E., McClelland, J.L. and the PDP research group. (eds), Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundations (1986: Cambridge MA, MIT Press) 318–362.

47. Clopath, C. and Gerstner, W. “Voltage and Spike-Timing Interact in STDP – A Unified Model”, Frontiers in Synaptic Neuroscience 2 (2010) 1–11.

48. Poirazi, P., Brannon, T. and Mel, B.W. “Pyramidal Neuron as a 2-Layer Neural Network”, Neuron 37 (2003): 989–999.

49. London and Hausser, Annual Review of Neuroscience, (2005): op. cit.

50. Barra, A., Bernacchia, A., Santucci, E., Contucci, P. “On the equivalence of Hopfield Networks and Boltzmann Machines”, Neural Networks 34 (2012): 1–9.

51. Hopfield, J.J. “Neural Networks and Physical Systems with Emergent Collective Computational Abilities”, Proceedings of the National Academy of Sciences of the United States of America 79 (1982): 2554–2558.

52. Hinton, G.E. “Boltzmann Machine”, Scholarpedia 2(5) (2007): 1668. Note that IBM's current neuromorphic cognitive architecture is based in part on Restricted Boltzmann Machines. See https://dl.dropboxusercontent.com/u/91714474/Papers/021.IJCNN2013.Applications.pdf.

53. Hopfield, J.J. “Hopfield Network”, Scholarpedia 2(5) (2007): 1977. Associative memory networks based upon Hopfield networks or Restricted Boltzmann Machines seem to have been applied in engineering much more than in business or medicine to date, which is similar to other artificial neural network methods. But as reviewed earlier the Netflix model that was actually used was a simple ensemble of a Restricted Boltzmann Machine and an SVM model; Netflix Tech Blog op cit.

54. Hinton, G.E., op. cit.

55. Liou, C.Y. and Yuan, S.K. “Error Tolerant Associative Memory”, Biological Cybernetics 81 (1999): 331–342. Another significant development beyond what occurred in the 1980s research is that Hopfield associative memory models do lend themselves to real-time hardware configurations using memristors. See Farnood Merrikh-Bayat and Saeed Bagheri-Shouraki, Efficient Neuro-fuzzy System and its Memristor Crossbar-based Hardware Implementation, http://arxiv.org/pdf/1103.1156v1.pdf. Other advances developed by Saffron Technologies now appear to avoid the pitfalls related to physical scaling limitations of the crossbar memristor implementations (Personal Communication, Manny Aparicio, Co-founder, and Paul Hofmann, CTO, Saffron Technology, Cary, NC).

56. Storkey, A. “Increasing the Capacity of a Hopfield Network without Sacrificing Functionality”, Artificial Neural Networks—ICANN'97 (1997): 451–456. However, the memristor implementations referenced in the previous note, though somewhat more complex, might be a major improvement with huge capacity increases. Yet associative memory networks are still limited in that they only work with binary features, so human judgment is still involved in producing binary features from non-binary variables in real world applications, as this apparently cannot yet be done as an automatic associative learning process. So it is not yet a hands-free approach to machine learning.

57. Riesenhuber, M. and Poggio, T. “A New Framework of Object Categorization in Cortex” Biologically Motivated Computer Vision, First IEEE International Workshop Proceedings, (2000): 1–9.

58. The unsupervised K-means clustering approach used in Poggio's group yields units that respond with binary signals tuned to one dimension of the visual input, such as the orientation of lines in a specific direction in the visual field. This selective tuning to one dimension of information is similar to how simple cells in the primary visual cortex respond. In these Poggio simulations, later higher level visual processing that models inferotemporal cortex object categorization incorporates such features as inputs and classifies the objects using classic supervised neural networks.

59. Kohonen, T. “Self-Organized Formation of Topologically Correct Feature Maps”, Biological Cybernetics 43 (1982): 59–69.

60. Izhikevich, E.M. “Polychronization: Computation with Spikes”, Neural Computation 18 (2006): 245–282.

61. Edelman, G. Neural Darwinism: The Theory of Neuronal Group Selection (1987: New York, Basic Books).

62. Calvin, W.H. The Cerebral Code (1996: Cambridge, MIT Press).

63. Izhikevich, E.M. and Edelman, G.M. “Large-Scale Model of Mammalian Thalamocortical Systems”, PNAS 105 (2008): 3593–3598.

64. Another approach to circumvent problems with the standard supervised/unsupervised learning paradigm is semi-supervised learning, which initially seeds a set of responses with human generated labels that determine supervised learning and then proceeds with unsupervised learning from this initial set. See Mitchell, “Semi-Supervised Learning over Text” (2006): op. cit. However, semi-supervised learning suffers from requiring humans to label responses, so it is obviously not an automated machine learning solution. Under specific seeding conditions, semi-supervised learning is argued to be a model for reinforcement learning. See Damljanović, D. Natural Language Interfaces to Conceptual Models, Ph.D. thesis, The University of Sheffield, Department of Computer Science, July 2011. Also see Sutton, R.S. and Barto, A.G. Reinforcement Learning: An Introduction (MIT Press, Cambridge, MA, 1998) for a more general discussion of machine reinforcement learning. Note that reinforcement learning also can be modeled as a time varying hazard model similar to survival analysis. See Alexander, W.H. and Brown, J.W., “Hyperbolically Discounted Temporal Difference Learning”, Neural Computation 22(6) (2010): 1511–1527.

65. Van Belle, T. “Is Neural Darwinism Darwinism?” Artificial Life 3(1) (1997): 41–49.

66. Fernando, C., Karishma, K.K. and Szathmáry, E. “Copying and Evolution of Neuronal Topology”, PLoS One 3(11) (2008): e3775. http://dx.doi.org/10.1371/journal.pone.0003775.

67. Izhikevich, E.M. and Edelman, G.M. “Large-Scale Model of Mammalian Thalamocortical Systems”, PNAS 105 (2008): 3593–3598.

68. Furber, S.B., Lester, D.R., Plana, L.A., Garside, J.D., Painkras, E., Temple, S. and Brown, A.D. “Overview of the SpiNNaker System Architecture Computers”, IEEE Transactions on Computers PP(99) (2012) 1.

69. Hodgkin and Huxley, Journal of Physiology (1952): op. cit.

70. John, E.R., op. cit.

71. Nunez, P.L. and Srinivasan, R., op. cit.

72. Anastassiou, C.A., Perin, R., Markram, H., Koch, C. “Ephaptic Coupling of Cortical Neurons”, Nature Neuroscience 14(2) (2011): 217.

Chapter 6

1. Schopenhauer, A. The World as Will and Representation, translated by Payne, E.F.J. (1958: Indian Hills, Colorado, The Falcon's Wing).

2. Alivisatos, A.P., Chun, M., Church, G.M., Greenspan, R.J., Roukes, M.L. and Yuste, R. “The Brain Activity Map Project and the Challenge of Functional Connectomics”, Neuron 74(6) (2012): 970–974.

3. John, E.R. “The Neurophysics of Consciousness”, Brain Research Reviews 39 (2002): 1–28.

4. Ibid.

5. Ibid.

6. The scalp EEG is not only much noisier, but is also most sensitive to field potentials that arise from current oriented in the direction of electrodes in apical pyramidal neurons of the cerebral cortex. See Pizzagalli, D.E. “Electroencephalography and High-Density Electrophysiological Source Localization”, chapter prepared for Cacioppo, J.T. et al., Handbook of Psychophysiology (3rd Edition), online preprint. The MEG, on the other hand, is most sensitive to current that flows in a tangential direction with respect to the surface of the scalp.

7. Penfield, W. and Erickson, T.C. Epilepsy and Cerebral Localization: A Study of the Mechanism, Treatment and Prevention of Epileptic Seizures (1941: Springfield IL, Thomas, C.C.).

8. Rice, D.M. “The EEG and the Human Hypnagogic Sleep State”, Master's Empirical Research Thesis, (1983: Durham NH, University of New Hampshire).

9. Schacter, D.L. “The Hypnagogic State: A Critical Review of the Literature”, Psychological Bulletin 83 (1976): 452–481.

10. Tononi, G. “Consciousness as Integrated Information: A Provisional Manifesto”, Biological Bulletin 215 (2008): 216–242.

11. Rice, D.M. The EEG and the Human Hypnagogic Sleep State, Master's Empirical Research Thesis, (1983: Durham NH, University of New Hampshire).

12. https://commons.wikimedia.org/wiki/File:Brain_headBorder.jpg.

13. Nunez, P.L. and Srinivasan, R., op. cit.

14. Hameroff, S. “The ‘Conscious Pilot’—Dendritic Synchrony Moves through the Brain to Mediate Consciousness”, Journal of Biological Physics 36 (1) (2009): 71–93.

15. Anderson, C.H. and DeAngelis, G.C. “Redundant Populations of Simple Cells Represent Wavelet Coefficients in Monkey V1”, Journal of Vision 5(8) (2005): article 669.

16. Hubel, D.H. and Wiesel, T.N. “Receptive Fields, Binocular Interaction and Functional Architecture in the Cat's Visual Cortex”, Journal of Physiology 160 (1962): 106–154.

17. Berens, P., Ecker, A.S., Cotton, R.J., Ma, W.J., Bethge, M., and Tolias, A.S. “A Fast and Simple Population Code for Orientation in Primate V1”, The Journal of Neuroscience 32(31) (2012): 10618–10626.

18. Gross, C.G. “Genealogy of the Grandmother Cell”, Neuroscientist 8(5) (2002): 512–518.

19. Connor, C. “Friends and grandmothers”, Nature 435 (7045) (2005): 1036–1037.

20. Original image in Lesher, G.W. and Mingolla, E. “The Role of Edges and Line-ends in Illusory Contour Formation”, Vision Research 33(16) (1993): 2263–2270; Redrawn with permission from Elsevier.

21. Wertheimer, M., Laws of Organization in Perceptual Forms, First published as Untersuchungen zur Lehre von der Gestalt II, in Psychologische Forschung, 4 (1923): 301–350. Translation published in Ellis, W., A Source Book of Gestalt Psychology (1938: London, Routledge and Kegan Paul).

22. Singer, W. “Neuronal Synchrony: A Versatile Code for the Definition of Relations?”, Neuron 24 (1999): 49–65.

23. Singer, W. “Binding by Synchrony”, Scholarpedia 2(12) (2007): 1657.

24. Ibid.

25. Gray, C.M., Konig, P., Engel, A.K., Singer, W. “Oscillatory Responses in Cat Visual Cortex Exhibit Inter-Columnar Synchronization which Reflects Global Stimulus Properties”, Nature 338 (1989): 334–337.

26. John, E.R., op. cit.

27. Anastassiou, C.A., Perin, R., Markram, H., Koch, C., op. cit.

28. Werning, M. and Maye, A. “The Cortical Implementation of Complex Attribute and Substance Concepts Synchrony, Frames, and Hierarchical Binding”, Chaos and Complexity Letters 2(2/3) (2007): 435–452.

29. Engel, A.K., König, P., Kreiter, A.K. and Singer, W. “Interhemispheric Synchronization of Oscillatory Neuronal Responses in Cat Visual Cortex”, Science 252 (1991): 1177–1179.

30. Thiele, A. and Stoner, G. “Neuronal Synchrony Does Not Correlate with Motion Coherence in Cortical Area MT”, Nature 421 (6921) (2003): 366–370.

31. Dong, Y., Mihalas, S., Qiu, F., von der Heydt, R. and Niebur, E. “Synchrony and the Binding Problem in Macaque Visual Cortex”, Journal of Vision 8 (7) (2008): 1–16.

32. Singer, W. Binding by synchrony, op. cit.

33. Eckhorn, R., Bruns, A., Saam, M., Gail, A., Gabriel, A. and Brinksmeyer, H.J. “Flexible Cortical Gamma–Band Correlations Suggest Neural Principles of Visual Processing”, Visual Cognition 8(3–5) (2001): 519–530.

34. Izhikevich, E.M. and Hoppensteadt, F.C. “Polychronous Wavefront Computations”, International Journal of Bifurcation and Chaos 19(5) (2009): 1733–1739.

35. The Werning and Maye (2007) model, op. cit., also allows for traveling waves.

36. Hameroff, S. “The ‘Conscious Pilot’—Dendritic Synchrony Moves through the Brain to Mediate Consciousness”, op. cit.

37. Hebb, D.O. The organization of behavior, op. cit.

38. Pallas, S.L. (ed.) Developmental Plasticity of Inhibitory Circuitry, Introductory chapter (also by same author), pp. 3–12, (2010: New York, Springer).

39. Rutishauser, U., Ross, I.B., Mamelak, A.N., Schuman, E.M. “Human Memory Strength is Predicted by Theta-Frequency Phase-Locking of Single Neurons”, Nature 464(7290) (2010): 903–907.

40. Sjöström, J. and Gerstner, W. “Spike Timing Dependent Plasticity”, Scholarpedia (2010): 1362.

41. Clopath, G. and Gerstner, W. “Voltage and Spike-Timing Interact in STDP—A Unified Model”, Frontiers in Synaptic Neuroscience 2 (2010): 1–11.

42. Ibid. The previous references are a good starting point for research on the fine details of this mechanism along with current reasonable speculations.

43. Gaba synapses may start out by being excitatory but may become inhibitory over the course of development. See Pallas, S.L. (ed.) Developmental Plasticity of Inhibitory Circuitry, op. cit.

44. Rice, D.M. and Hagstrom, E.C. “Some Evidence in Support of a Relationship between Human Auditory Signal-Detection Performance and the Phase of the Alpha Cycle”, op. cit.

45. Busch, N.A. and VanRullen, R. “Spontaneous EEG Oscillations Reveal Periodic Sampling of Visual Attention”, PNAS 107(7) (2010): 16038–16043.

46. Desimone, R. “Neural Synchrony and Selective Attention”, Lecture given at Boston University on Feb. 27, 2008. http://www.youtube.com/watch?v=GdDMzV26WSk.

47. Desimone, R., op. cit.

48. Buzsáki, G. Rhythms of the Brain, op. cit.

49. Axmacher, N., Henseler, M.M., Jensend, O., Weinreich, I., Elger, C.E. and Fell, J. “Cross-Frequency Coupling Supports Multi-Item Working Memory in the Human Hippocampus”, PNAS 107(7) (2010): 3228–3233.

50. Fell, J. and Axmacher, N. “The Role of Phase Synchronization in Memory Processes”, Nature Reviews Neuroscience 12 (2011): 105–118.

51. Kamiński, J., Brzezicka, A. and Wróbel, A. “Short-Term Memory Capacity (7 ± 2) Predicted by Theta to Gamma Cycle Length Ratio”, Neurobiology of Learning and Memory 95(1) (2011): 19–23.

52. Lisman, J.E. and Idiart, M.A.P. “Storage of 7 ± 2 Short-Term Memories in Oscillatory Subcycles”, Science 267 (1995): 1512–1514.

53. Fell, J. and Axmacher, N. “The Role of Phase Synchronization in Memory Processes”, op. cit.

54. Craik, F.I.M. and Tulving, E., op. cit.

55. Carlqvist, H, Nikulin, V.V., Strömberg, J.O. and Brismar, T. “Amplitude and Phase Relationship between Alpha and Beta Oscillations in the Human Electroencephalogram”, Medical and Biological Engineering and Computing 43(5) (2005): 599–607.

56. Nikulin, V.V. and Brismar, T. “Phase Synchronization between Alpha and Beta Oscillations in the Human Electroencephalogram”, Neuroscience 137(2) (2006): 647–657.

57. Palva, J.P., Palva, S. and Kaila, K. “Phase Synchrony among Neuronal Oscillations in the Human Cortex”, The Journal of Neuroscience 25(15) (2005): 3962–3972.

58. Roopun, A.K., Kramer, M.A., Carracedo, L.M., Kaiser, M., Davies, C.H., Traub, R.D., Kopell, N.J. and Whittington, M. “Temporal Interactions between Cortical Rhythms”, Frontiers in Neuroscience 2(2) (2008): 145–154.

59. Sauseng, P., Klimesch, W., Gruber, W.R., Birbaumer, N. “Cross-Frequency Phase Synchronisation: A Brain Mechanism of Memory Matching and Attention”, Neuroimage 40 (2008): 308–317.

60. Gaona, C.M., Sharma, M., Freudenburg, Z.V., Breshears, J.D., Bundy, D.T., Roland, J., Barbour, D.L., Schalk, G. and Leuthardt, E.C. “Nonuniform High-Gamma (60–500 Hz) Power Changes Dissociate Cognitive Task and Anatomy in Human Cortex”, The Journal of Neuroscience 31(6) (2011): 2091–2100.

61. Jacobs, J. and Kahana, M.J. “Neural Representations of Individual Stimuli in Humans Revealed by Gamma–Band Electrocorticographic Activity”, The Journal of Neuroscience 29(33) (2009): 10203–10214.

62. Canolty, R.T., Edwards, E., Dalal, S.S., Soltani, M., Nagarajan, S.S., Kirsch, H.E., Berger, M.S., Barbaro, N.M. and Knight, R.T. “High Gamma Power is Phase-Locked to Theta Oscillations in Human Neocortex”, Science 313(5793) (2006): 1626–1628.

63. Voytek, B., Canolty, R.T., Shestyuk, A., Crone, N.E., Parvizi, J. and Knight, R.T. “Shifts in Gamma Phase–Amplitude Coupling Frequency from Theta to Alpha over Posterior Cortex during Visual Tasks”, Frontiers in Human Neuroscience 4:191 (2010): http://dx.doi.org/10.3389/fnhum.2010.00191.

64. Kassam, K.S., Markey, A.R., Cherkassky, V.L., Loewenstein, G., Just, M.A. “Identifying Emotions on the Basis of Neural Activation”, PLoS One 8(6) (2013): e66032. http://dx.doi.org/10.1371/journal.pone.0066032.

Chapter 7

1. Published by Vigyan Prasar, New Delhi, India.

2. Abitz, M., Nielsen, R.D., Jones, E.G., Laursen, H., Graem, N., Pakkenberg, B. “Excess of Neurons in the Human Newborn Mediodorsal Thalamus Compared with that of the Adult”, Cerebral Cortex 17 (11) (2007): 2573–2578.

3. Huttenlocher, P.R. “Synaptic Plasticity and Elimination in Developing Cerebral Cortex”, American Journal of Mental Deficiency 88(5) (1984): 488–496.

4. Petanjek, Z., Judaš, M., Šimić, G., Rašin, M.R., Uylings, H.B.M., Rakic, P. and Kostović, I. “Extraordinary Neoteny of Synaptic Spines in the Human Prefrontal Cortex”, PNAS 108(32) (2011): 13281–13286.

5. Changeux, J.P. Neuronal Man (Laurence Garey, translator) (1985: New York: Pantheon Books).

6. Black, J.E. and Greenough, W.T. “Induction of Pattern in Neural Structure by Experience: Implications for Cognitive development”, In Lamb, M.E., Brown, A.L. & Rogott, B. (eds), Advances in Developmental Psychology Vol. 4 (pp. 1–50). (1986: Hillsdale NJ, Erlbaum).

7. Fu, M., Yu, X., Lu, J., Zuo, Y. “Repetitive Motor Learning Induces Coordinated Formation of Clustered Dendritic Spines in vivo”, Nature 483 (2012): 92–95.

8. Uhlhaas, P.J., Roux, F., Rodriguez, E., Rotarska-Jagiela, A. and Singer, W. “Neural Synchrony and the Development of Cortical Networks”, Trends in Cognitive Sciences 14(2) (2010): 72–80.

9. Shaw, P., Greenstein, D., Lerch, J., Clasen, L., Lenroot, R., Gogtay, N., Evans, A., Rapoport, J., Giedd, J. “Intellectual Ability and Cortical Development in Children and Adolescents”, Nature 440 (2006): 676–679.

10. Ibid.

11. Choi, Y.Y., Shamosh, N.A., Cho, S.H., DeYoung, C.G., Lee, M.J., Lee, J.M., Kim, S.I., Cho, Z.H., Kim, K., Gray, J.R. and Lee, K.H. “Multiple Bases of Human Intelligence Revealed by Cortical Thickness and Neural Activation”, The Journal of Neuroscience 28(41) (2008):10323–10329.

12. Sowell, E.R., Peterson, B.S., Thompson, P.M., Welcome, S.E., Henkenius, A.L., Toga, A.W. “Mapping Cortical Change across the Human Life Span”, Nature Neuroscience 6 (2003): 309–315.

13. Burke, S.N. and Barnes, C.A. “Neural Plasticity in the Ageing Brain”, Nature Reviews Neuroscience 7 (2006): 30–40. They report that one area where substantial age-related neuronal loss is observed is area 8a of dorsolateral prefrontal cortex in nonhuman primates where a 30% age-related neuronal loss is associated with working memory deficits.

14. Hof, P.R. and Morrison, J.H. “The Aging Brain: Morphomolecular Senescence of Cortical Circuits”, Trends in Neurosciences 27(10) (2004): 607–613.

15. Burke, S.N. and Barnes, C.A., op. cit.

16. Duan, H., Wearne, S.L., Rocher, A.B., Macedo, A., Morrison, J.H. and Hof, P.R. “Age-related Dendritic and Spine Changes in Corticocortically Projecting Neurons in Macaque Monkeys”, Cerebral Cortex 13 (2003): 950–961.

17. Fernández, A., Arrazola, J., Maestú, F., Amo, C., Gil-Gregorio, P., Wienbruch, C. and Ortiz, T. “Correlations of Hippocampal Atrophy and Focal Low-Frequency Magnetic Activity in Alzheimer Disease: Volumetric MR Imaging-Magnetoencephalographic Study”, American Journal of Neuroradiology 24(3) (2003): 481–487.

18. Grunwald, M., Hensel, A., Wolf, H., Weiss, T. and Gertz, H.G. “Does the Hippocampal Atrophy Correlate with the Cortical Theta Power in Elderly Subjects with a Range of Cognitive Impairment?” Journal of Clinical Neurophysiology 24(1) (2007): 22–26.

19. Breslau, J., Starr, A., Sicotte, N., Higa, J. and Buchsbaum, M.S. “Topographic EEG Changes with Normal Aging and SDAT”, Electroencephalography and Clinical Neurophysiology 72(4) (1989): 281–289.

20. Rice, D.M. et al. Journals of Gerontology: Medical Sciences (1990): op. cit.

21. Visser, S.L., Hooijer, C., Jonker, C., Van Tilburg, W. and De Rijke, W. “Anterior Temporal Focal Abnormalities in EEG in Normal Aged Subjects; Correlations with Psychopathological and CT Brain Scan Findings”, Electroencephalography and Clinical Neurophysiology 66(1) (1987): 1–7.

22. Rice, D.M. et al. Journals of Gerontology: Psychological Sciences (1991): op. cit.

23. Boon, M.E., Melis, R.J.F., Rikkert, M.G.M.O., Kessels, R.P.C. “Atrophy in the Medial Temporal Lobe is specifically Associated with Encoding and Storage of Verbal Information in MCI and Alzheimer Patients”, Journal of Neurology Research 1(1) (2011): 11–15.

24. Henneman, W.J.P., Sluimer, J.D., Barnes, J., van der Flier, W.M., Sluimer, I.C., Fox, N.C., Scheltens, P., Vrenken, H. and Barkhof, F. “Hippocampal Atrophy Rates in Alzheimer Disease: Added Value over Whole Brain Volume Measures”, Neurology 72 (2009): 999–1007.

25. Ibid.

26. Duara, R., Loewenstein, D.A., Potter, E., Appel, J., Greig, M.T., Urs, R., Shen, Q., Raj, A., Small, B., Barker, W., Schofield, E., Wu, Y. and Potter, H. “Medial Temporal Lobe Atrophy on MRI Scans and the Diagnosis of Alzheimer Disease”, Neurology 71(24) (2008): 1986–1992.

27. Mosconi, L. “Brain Glucose Metabolism in the Early and Specific Diagnosis of Alzheimer's Disease: FDG-PET studies in MCI and AD”, European Journal of Nuclear Medicine and Molecular Imaging 32(4) (2005): 486–510.

28. At present, efforts to use PET scans to image the putative causal beta amyloid protein in living patients have not yet received FDA approval.

29. Koivunen, J., Scheinin, N., Virta, J.R., Aalto, S., Vahlberg, T., Någren, K., Helin, S., Parkkola, R., Viitanen, M. and Rinne, J.O. “Amyloid PET Imaging in Patients with Mild Cognitive Impairment: A 2-year Follow-up Study”, Neurology 76(12) (2011): 1085–1090.

30. Marchione, M., op. cit.

31. Wechsler Memory Scale, Revised. Now it is Wechsler Memory Scale-IV, marketed by Pearson, San Antonio, TX.

32. Stories like this one have been part of the Wechsler Memory Scale for many years. This particular story comes from a talk given at the Washington University Alzheimer's Center in St Louis by Albert, M.S., http://www.alz.washington.edu/NONMEMBER/SPR10/Albert.pdf.

33. This simulation gives almost complete separation, as there is almost no error other than rounding error. In a case with substantial noise, the initial episodic learning often may show greater magnitudes of weights with smaller episodic samples, and those weights may decrease somewhat with larger episodic samples as demonstrated in Fig. 3.1. Under the RELR model, this leads to an interpretation that a larger magnitude of positive or negative weights in novel learning episodes will reflect overfitting if that novel learning episode is not representative of future data. These learning-related changes in RELR regression coefficient weight magnitudes can be interpreted as a model of long term potentiation (LTP) and long term depression (LTD). Rice, D.M. “Simulated Learning and Memory Effects in RELR”, Society for Neuroscience 2013 Conference Abstracts, in press.

34. Howieson, D.B., Mattek, N., Seeyle, A.M., Dodge, H.H., Wasserman, D., Zitzelberger, T. and Kaye, J.A. “Serial Position Effects in Mild Cognitive Impairment”, Journal of Clinical and Experimental Neuropsychology 33(3) (2011): 292–299.

35. Ibid. Note that the point made in the next sentence of Chapter 7 about Explicit RELR and missing values in Alzheimer's disease is corrected as well as possible in actual real-world implementations of RELR through the usage of Welch's t values to handle missing data in the RELR error model and feature reduction, as explained in the appendix. But obviously, the brain does not have good mechanisms to correct the susceptibility of its explicit memory to missing information due to neural loss in Alzheimer's disease.

36. Lehn, H., Steffenach, H.A., van Strien, N.M., Veltman, D.J., Witter, M.P. and Håberg, A.K. “A Specific Role of the Human Hippocampus in Recall of Temporal Sequences”, The Journal of Neuroscience 29(11) (2009): 3475–3484.

37. Kesner, R.P., Gilbert, P.E., Barua, L.A. “The Role of the Hippocampus in Memory for the Temporal Order of a Sequence of Odors”, Behavioral Neuroscience 116(2) (2002): 286–290.

38. Farovik, A., Dupont, L.M., Eichenbaum, H. “Distinct Roles for Dorsal CA3 and CA1 in Memory for Sequential Non-spatial Events”, Learning and Memory 17(1) (2009): 12–17.

39. Thorvaldsson, V., MacDonald, S.W.S., Fratiglioni, L., Winblad, B., Kivipelto, M., Laukka, E.J., Skoog, I., Sacuiu, S., Guo, X., Östling, S., Börjesson-Hanson, A., Gustafson, D., Johansson, B. and Bäckman, L. “Onset and Rate of Cognitive Change before Dementia Diagnosis: Findings from Two Swedish Population-Based Longitudinal Studies”, Journal of the International Neuropsychological Society 17(1) (2011): 154–162.

40. Rice et al. Journals of Gerontology (1991), op. cit.

41. Anderson, K.L., Rajagovindan, R., Ghacibeh, G.A., Meador, K.J. and Ding, M. “Theta Oscillations Mediate Interaction between Prefrontal Cortex and Medial Temporal Lobe in Human Memory”, Cerebral Cortex 20 (7) (2010): 1604–1612.

42. Rutishauser, U., Ross, I.B., Mamelak, A.N., Schuman, E.M. “Human Memory Strength is Predicted by Theta-Frequency Phase-Locking of Single Neurons”, op. cit.

43. Greenwood, P.M. and Parasuraman, R. “Neuronal and Cognitive Plasticity: A Neurocognitive Framework for Ameliorating Cognitive Aging”, Frontiers in Aging Neuroscience 2:150 (2010).

44. Ibid.

45. Milgram, N.W., Head, E., Muggenburg, B., Holowachuk, D., Murphey, H., Estrada, J., Ikeda-Douglas, C.J., Zicker, S.C. and Cotman, C.W. “Landmark Discrimination Learning in the Dog: Effects of Age, an Antioxidant Fortified Food, and Cognitive Strategy”, Neuroscience and Biobehavioral Reviews 26 (2002): 679–695. While these are very interesting results and an excellent study, this study might not have been able to control exercise independently from the cognitive stimulation treatment through the inclusion of walking.

46. Fabre, C., Chamari, K., Mucci, P., Massé-Biron, J., Préfaut, C. “Improvement of Cognitive Function by Mental and/or Individualized Aerobic Training in Healthy Elderly Subjects”, International Journal of Sports Medicine 23(6) (2002): 415–421.

47. Owen, A.M., Hampshire, A., Grahn, J.A., Stenton, R., Dajan, S., Burns, A.S., Howard, R.J. and Ballard, C.G. “Putting Brain Training to the Test”, Nature 465 (2010): 775–779.

48. Borness, C., Proudfoot, J., Crawford, J., Valenzuela, M. “Putting Brain Training to the Test in the Workplace: A Randomized, Blinded, Multisite, Active-Controlled Trial”, PLoS One 8(3) (2013): e59982. http://dx.doi.org/10.1371/journal.pone.0059982.

49. “Descartes and the Pineal Gland”, Stanford Encyclopedia of Philosophy.

50. Anastassiou, C.A., Perin, R., Markram, H., Koch, C., op. cit.

51. Edelman, G.M. “Naturalizing Consciousness: A Theoretical Framework”, PNAS 100(9) (2003): 5520–5524.

52. Edelman, G.M., Gally, J.A. and Baars, B.J. “Biology of Consciousness”, Frontiers in Psychology 25 January (2011): http://dx.doi.org/10.3389/fpsyg.2011.00004.

53. McFadden, J. “The CEMI Field Theory: Seven Clues to the Nature of Consciousness”. In Tuszynski, J.A. The Emerging Physics of Consciousness pp. 385–404 (2006: Berlin: Springer).

54. Tse, P.U. The Neural Basis of Free Will, (Cambridge, MIT Press, 2013).

55. Hameroff, S. (October 22, 2010). “Clarifying the tubulin bit/qubit—Defending the Penrose-Hameroff Orch OR Model of Quantum Computation in Microtubules”. Google Workshop on Quantum Biology. http://www.youtube.com/watch?v=LXFFbxoHp3s.

56. Hameroff, S. “The ‘Conscious Pilot’—Dendritic Synchrony Moves through the Brain to Mediate Consciousness”, op. cit.

57. Kurzweil, R. Your Brain in the Cloud, http://www.youtube.com/watch?v=0iTq0FLDII4, published on Feb. 19, 2013.

Chapter 8

1. Leibniz, G. 1685, The Art of Discovery, op. cit.

2. Mutual information is sensitive to both linear and nonlinear effects, whereas Pearson correlation is only sensitive to linear effects unless nonlinear effects are defined through polynomial terms as is done in RELR. Yet, all important models of neural dynamics reviewed in Chapter 5 are also based upon similar polynomial expressions as RELR, and it is unclear how to formulate models of neural dynamics that are capable of separable linear and nonlinear effect learning with mutual information, whereas all linear and nonlinear effects can be separated in RELR’s correlation-based learning and are thus capable of separable spike-timing dependent learning as would be required if such effects code learned features. The larger issue is the lack of agreement in how to define these information theory concepts not used in the present book. To get a sense of different operational definitions of minimum description length computation and Kolmogorov Complexity, see Ming Li and Paul Vitányi, An Introduction to Kolmogorov Complexity and its Applications, (2008, New York, Springer-Verlag) and Volker Nannen. “A short introduction to Model Selection, Kolmogorov Complexity and Minimum Description Length”, online publication. There is similar diversity in how to use or define multivariate mutual information to measure interactions. See for example reviews and proposals by A. Margolin, K. Wang, A. Califano, and I. Nemenman, Multivariate Dependence and Genetic Networks Inference. IET Syst Biol 4, 428 (2010). A. Jakulin and I. Bratko, Quantifying and Visualizing attribute interactions, (2003b) arxiv.org/abs/cs/0308002v1 online publication. A basic problem is that feature reduction based upon mutual information will not discriminate linear and nonlinear effects for a given variable, but this confusion is compounded with interactions. 
For example, some multivariate mutual information definitions of interaction, such as interaction information, have surprising negative-sign relationships that can defy simple interpretation. In addition, these methods might be interpreted to require that main effects be included prior to selecting interactions, just as the generalized linear model avoids the marginality scaling problem (Nelder, J.A., op. cit.), but this can exclude important interaction effects or include superfluous main effects.
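The contrast drawn in this note can be sketched numerically. The example below is illustrative only and is not from the book: for a purely nonlinear dependence such as y = x², Pearson correlation with x is near zero, while a plug-in histogram estimate of mutual information is clearly positive; adding the polynomial term x² as a feature, in the spirit of RELR, recovers the dependence through correlation alone and identifies it as nonlinear, which mutual information by itself does not do.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)
y = x**2 + 0.1 * rng.standard_normal(10_000)  # purely nonlinear dependence

def pearson(a, b):
    """Pearson correlation of two samples."""
    a = (a - a.mean()) / a.std()
    b = (b - b.mean()) / b.std()
    return float((a * b).mean())

def mutual_information(a, b, bins=20):
    """Plug-in (histogram) estimate of I(A; B) in nats."""
    joint, _, _ = np.histogram2d(a, b, bins=bins)
    p_ab = joint / joint.sum()
    p_a = p_ab.sum(axis=1, keepdims=True)   # marginal of A
    p_b = p_ab.sum(axis=0, keepdims=True)   # marginal of B
    nz = p_ab > 0
    return float((p_ab[nz] * np.log(p_ab[nz] / (p_a @ p_b)[nz])).sum())

print(pearson(x, y))             # near 0: linear correlation misses y = x^2
print(pearson(x**2, y))          # near 1: the polynomial term exposes it
print(mutual_information(x, y))  # clearly positive, but does not say
                                 # whether the effect is linear or nonlinear
```

The histogram estimator is the simplest of the many operational definitions alluded to in this note; bin count and sample size both change the estimate, which is part of the definitional disagreement being described.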

3. Rice, D.M. Society for Neuroscience Abstracts, 2013, op. cit.

4. Davies, P., Keynote address, American Geriatric Society, 1990.

5. http://archive.org/details/KQED_20120206_200000_Charlie_Rose. The idea that Alzheimer’s could start in the medial temporal lobe was also proposed on the basis of evidence that end-stage damage was substantially heavier in the entorhinal cortex, but such neuropathology data still gave no understanding of how the onset might occur at least 10 years earlier than the clinical diagnosis. See Van Hoesen, G.W., Hyman, B.T. and Damasio, A.R. “Entorhinal cortex pathology in Alzheimer’s disease”, Hippocampus 1(1) (1991): 1–8. Our work did provide a direct linkage between recent memory and temporal lobe abnormalities in living older people who were not demented, as well as a very precise estimate of the 10-year average delay between the onset of these abnormalities and the clinical diagnosis of dementia.

6. Kahneman, D. Thinking, Fast and Slow, op. cit.

7. Ibid.

8. Ibid.

9. Ibid.

10. Ibid.

11. Ibid.

12. Ibid.

13. Shi, L. et al., “The MicroArray Quality Control (MAQC)-II Study of Common Practices for the Development and Validation of Microarray-Based Predictive Models”, op. cit.

14. http://www.pbs.org/wgbh/nova/body/cracking-your-genetic-code.html, March 29, 2012.

15. This link, http://message.snopes.com/showthread.php?t=30482, leads to a 2008 discussion of this ad among women car buyers, with a link to the YouTube video.

16. Lowenstein, R. When Genius Failed: The Rise and Fall of Long-Term Capital Management, (2000, Random House Trade Paperbacks, New York).

17. Taleb, N.N. The Black Swan: The Impact of the Highly Improbable, (2007, Random House, New York).

18. Salmon, F. “The Formula that Killed Wall Street”, Wired Magazine February 23, 2009.

19. Abrahams, C.R. and Zhang, M. Credit Risk Assessment: The New Lending System for Borrowers, Lenders, and Investors (2009, Wiley and SAS Business Series, Hoboken, N.J. and Cary, N.C.).

20. Morgenson, G. and Story, L. “Bank’s Collapse in Europe Points to Global Risks”, New York Times October 22, 2011.

21. Derman, E. Models Behaving Badly, op. cit.

22. Tierney, J. “Findings; Social Scientist Sees Bias Within”, New York Times February 8, 2011.

23. Krugman, P. “The Excel Depression”, Opinion Editorial New York Times April 19, 2013.

Appendix

1. Golan, A., Judge, G. and Perloff, J.M., op. cit.

2. Mount, J., op. cit.

3. Luce, R.D. and Suppes, P., op. cit.; McFadden, D., op. cit.

4. Hosmer, D.W. and Lemeshow, S., op. cit.

5. Mount, J., op. cit.

6. Mitchell, T., op. cit.

7. Hopefully, there is no confusion in using t to denote time here, given the Student’s t and other t-distribution values used to compute the error model.

8. Allison, P.D. Sociological Methodology, op. cit.

9. Hedeker, D., op. cit. He reviews how other parameters can be temporally dependent. However, as stated in Chapter 4, RELR does not require multilevel parameters, so his comments about multilevel parameters do not apply.

10. King, G. and Langche, Z., op. cit.

11. Topsøe, F., op. cit.
