Glossary of Algorithms and Methods in Data Science

  • k-Nearest Neighbors algorithm: An algorithm that classifies a data item by the majority class among the k closest neighbors of that item in the training data.
  • Naive Bayes classifier: A way to classify a data item using Bayes' theorem on conditional probabilities, P(A|B) = (P(B|A) * P(A)) / P(B), under the additional assumption that the variables in the data are independent of each other.
  • Decision Tree: A model that classifies a data item into the class at the leaf node reached by following the branches of the tree that match the item's properties.
  • Random Decision Tree: A decision tree in which every branch is formed using only a random subset of the available variables during its construction.
  • Random Forest: An ensemble of random decision trees, each constructed on a random subset of the data sampled with replacement; a data item is assigned to the class with the majority vote from the trees.
  • K-means algorithm: A clustering algorithm that divides a dataset into k groups such that the members of each group are as similar as possible, that is, as close to each other as possible.
  • Regression analysis: A method for estimating the unknown parameters of a functional model that predicts an output variable from the input variables; for example, estimating a and b in the linear model y = a*x + b.
  • Time series analysis: The analysis of time-dependent data; it mainly covers the analysis of trend and seasonality.
  • Support vector machines: A classification algorithm that finds the hyperplane dividing the training data into the given classes; this hyperplane is then used to classify new data.
  • Principal component analysis: A technique that transforms the data into a new coordinate system of orthogonal components, ordered by how much of the variance in the data each component explains; it is often used to reduce the dimensionality of the data before further analysis.
  • Text mining: The search for and extraction of text, and its possible conversion into numerical data for use in data analysis.
  • Neural networks: A machine learning model consisting of a network of simple classifiers, each making a decision based on the input or on the results of other classifiers in the network.
  • Deep learning: Machine learning based on neural networks with many layers, in which successive layers learn increasingly abstract representations of the data.
  • A priori association rules: Rules observed in the training data, on the basis of which future data can be classified.
  • PageRank: A search algorithm that assigns the greatest relevance to the page that has the greatest number of incoming web links from the most relevant pages on a given search term. In mathematical terms, PageRank computes an eigenvector of the link matrix whose entries represent these measures of relevance.
  • Ensemble learning: A method of learning in which several different learning algorithms are combined to reach a final conclusion.
  • Bagging: A method of classifying a data item by the majority vote of classifiers trained on random subsets of the training data sampled with replacement.
  • Genetic algorithms: Machine learning algorithms inspired by genetic processes; for example, an evolution in which the classifiers with the best accuracy are selected and trained further.
  • Inductive inference: A machine learning method that learns the rules which produced the observed data.
  • Bayesian networks: A graphical model representing random variables together with their conditional dependencies.
  • Singular value decomposition: A factorization of a matrix that generalizes the eigendecomposition; it is used, for example, in least squares methods.
  • Boosting: A machine learning meta-algorithm that makes a prediction based on an ensemble of classifiers trained sequentially, each focusing on the examples misclassified by its predecessors, thereby reducing the error of the estimation.
  • Expectation maximization: An iterative method that searches for the model parameters maximizing the likelihood of the model, especially when the model depends on unobserved latent variables.
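
The k-Nearest Neighbors entry can be illustrated with a minimal sketch in plain Python; the points, labels, and choice of k below are made up for illustration and are not from the text:

```python
from collections import Counter
import math

def knn_classify(train, query, k=3):
    """Classify `query` by the majority vote of its k closest training points.

    `train` is a list of (point, label) pairs; points are tuples of floats.
    A minimal illustrative sketch, not a library implementation.
    """
    # Sort the training pairs by Euclidean distance to the query point
    by_distance = sorted(train, key=lambda pl: math.dist(pl[0], query))
    # Count the labels of the k nearest neighbors and take the majority
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

data = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
        ((4.0, 4.0), "B"), ((4.2, 3.9), "B")]
print(knn_classify(data, (1.1, 1.0), k=3))  # → A (two of the three nearest are "A")
```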
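The Bayes' theorem formula from the Naive Bayes entry, P(A|B) = P(B|A) * P(A) / P(B), amounts to one line of arithmetic; the probability values below are hypothetical numbers chosen only to show the calculation:

```python
def bayes(p_b_given_a, p_a, p_b):
    """P(A|B) = P(B|A) * P(A) / P(B), as in the Naive Bayes entry."""
    return p_b_given_a * p_a / p_b

# Hypothetical illustration: an event A with prior probability 1%,
# evidence B seen 90% of the time when A holds, and 10% of the time overall.
print(bayes(0.9, 0.01, 0.1))  # ≈ 0.09
```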
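The K-means entry can likewise be sketched with the classic alternation of an assignment step and an update step (Lloyd's algorithm); the sample points below are invented for illustration:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Group `points` (tuples of floats) into k clusters of nearby points.

    A minimal sketch of Lloyd's algorithm, not a library implementation.
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # pick k initial centers from the data
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each point joins its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            clusters[i].append(p)
        # Update step: move each center to the mean of its cluster
        for j, cl in enumerate(clusters):
            if cl:
                centers[j] = tuple(sum(coord) / len(cl) for coord in zip(*cl))
    return centers, clusters

pts = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (8.5, 8.5)]
centers, clusters = kmeans(pts, k=2)  # separates the two groups of nearby points
```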
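Finally, the regression analysis entry's example of estimating a and b in y = a*x + b can be carried out with the ordinary least squares formulas; the data points below are invented and chosen to lie exactly on the line y = 2x + 1:

```python
def fit_line(xs, ys):
    """Ordinary least squares estimates of a and b in y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x
    a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    # Intercept: the fitted line passes through the point of means
    b = mean_y - a * mean_x
    return a, b

a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])  # data on the line y = 2x + 1
print(a, b)  # → 2.0 1.0
```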