Time series and social network analysis

In this section, we provide some insights into the challenges of developing a large-scale ML pipeline from time series and social network data.

Time series analysis

Time series data often arises when, for example, monitoring industrial processes or tracking corporate business metrics. One of the fundamental differences between modeling data via time series methods and via other methods is that time series analysis accounts for the fact that data points taken over time may have an internal structure.

This structure might include autocorrelation, trends, or seasonal variation that should be taken into account. Regression analysis in this regard is mostly used to test theories. The goal is to test whether the current values of one or more independent time series are correlated with the current values of another time series.
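As a rough illustration of the autocorrelation mentioned above, the following minimal sketch computes the sample autocorrelation of a series at a given lag using only the Python standard library. The function name `autocorrelation` and the sample data in `values` are hypothetical and chosen purely for illustration; they do not come from Spark or any time series library.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at the given lag."""
    n = len(series)
    if not 0 <= lag < n:
        raise ValueError("lag must satisfy 0 <= lag < len(series)")
    mean = sum(series) / n
    # Denominator: total squared deviation from the mean.
    variance_sum = sum((x - mean) ** 2 for x in series)
    if variance_sum == 0:
        raise ValueError("autocorrelation is undefined for a constant series")
    # Numerator: co-deviation of each value with the value `lag` steps earlier.
    covariance_sum = sum(
        (series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n)
    )
    return covariance_sum / variance_sum


# Hypothetical metric whose pattern repeats every 4 observations.
values = [1.0, 3.0, 5.0, 3.0] * 6

same_phase = autocorrelation(values, 4)      # lag equal to the period
opposite_phase = autocorrelation(values, 2)  # lag equal to half the period
```

A series with a repeating pattern should show a high autocorrelation at a lag equal to its period and a low or negative one at half the period, which is one simple way to spot seasonality.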

To develop large-scale predictive analytics applications, time series analysis techniques can be applied to real-valued continuous data, discrete numeric data, categorical variables, or even discrete symbolic data. In this section, however, a time series is a sequence of floating-point values, each linked to a timestamp. In particular, we try as hard as possible to use the term time series to mean a univariate time series, although in other contexts it sometimes refers to a series of multiple values at the same timestamp.

An instant is the vector of values in a collection of time series corresponding to a single point in time. An observation is a tuple (timestamp, key, value), that is, a single value in a time series or instant. In a nutshell, a time series has four main characteristics:

  • Series with trends: Observations increase or decrease over time; trends are persistent, long-term movements.
  • Series data with seasonality: Observations stay high and then drop off, and the pattern repeats from one period to the next as regular periodic fluctuations, say, within a 12-month period.
  • Series data with cyclic component: Repeating swings or movements that last more than one year, for example, when recessions in a business occur in cycles.
  • Random variation: An unpredictable component that gives time series graphs an irregular or zigzag appearance. It consists of erratic or residual fluctuations.
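To make these components more concrete, here is a minimal sketch of a classical additive decomposition, under the assumption that the series can be written as trend + seasonal + residual. It estimates the trend with a centered moving average and the seasonal part by averaging the detrended values at each position within the period. The cyclic component is not separated out here; in this simple scheme it ends up in the trend or the residual. The names `moving_average` and `decompose`, the odd-period restriction, and the sample data are all assumptions made for illustration.

```python
def moving_average(series, window):
    """Centered moving average; `window` must be odd. Edge values are None."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    trend = [None] * len(series)
    for t in range(half, len(series) - half):
        trend[t] = sum(series[t - half : t + half + 1]) / window
    return trend


def decompose(series, period):
    """Additive split of `series` into trend, seasonal, and residual parts."""
    if len(series) < 2 * period - 1:
        raise ValueError("series is too short for the given period")
    trend = moving_average(series, period)
    # Average the detrended values at each position within the period.
    buckets = [[] for _ in range(period)]
    for t, (x, m) in enumerate(zip(series, trend)):
        if m is not None:
            buckets[t % period].append(x - m)
    seasonal_means = [sum(b) / len(b) for b in buckets]
    seasonal = [seasonal_means[t % period] for t in range(len(series))]
    # Whatever trend and seasonality do not explain is the residual.
    residual = [
        None if m is None else x - m - s
        for x, m, s in zip(series, trend, seasonal)
    ]
    return trend, seasonal, residual


# Hypothetical data: a linear trend plus a pattern repeating every 3 steps.
pattern = [1.0, 0.0, -1.0]
series = [2.0 * t + pattern[t % 3] for t in range(12)]
trend, seasonal, residual = decompose(series, period=3)
```

On real data the residual would hold the random variation; on this synthetic series it should be close to zero because the data contains no noise.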

Because of these challenging characteristics, it is difficult to develop practical machine learning applications for time series data. At the time of writing, to our knowledge, there is only one package available for time series data analysis on Spark, developed by Cloudera: the Spark-TS library. In this library, each time series is typically labeled with a key that enables identifying it among a collection of time series.
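The terminology introduced earlier (observation, key, and instant) can be sketched in plain Python as follows. This is not the Spark-TS API; the function names `to_series` and `instant`, the sensor keys, and the values are hypothetical and serve only to show how keyed observations relate to univariate series and instants.

```python
from collections import defaultdict

# Hypothetical observations as (timestamp, key, value) tuples.
observations = [
    ("2016-01-01", "sensor_a", 20.5),
    ("2016-01-01", "sensor_b", 18.0),
    ("2016-01-02", "sensor_a", 21.0),
    ("2016-01-02", "sensor_b", 18.4),
]


def to_series(observations):
    """Group observations into one univariate series per key, ordered by time."""
    series = defaultdict(list)
    # ISO-formatted date strings sort chronologically.
    for timestamp, key, value in sorted(observations):
        series[key].append((timestamp, value))
    return dict(series)


def instant(series_by_key, timestamp):
    """Vector of values across all series at one timestamp (keys in sorted order)."""
    vector = []
    for key in sorted(series_by_key):
        lookup = dict(series_by_key[key])
        # None marks a series that has no value at this timestamp.
        vector.append(lookup.get(timestamp))
    return vector


series_by_key = to_series(observations)
second_day = instant(series_by_key, "2016-01-02")
```

Here each key identifies one univariate series within the collection, and an instant cuts across all of them at a single point in time.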

The current implementation of Spark itself does not provide any algorithm for time series data analysis. However, since this is an emerging and trending topic of interest, we hope to see at least some algorithms implemented in Spark in coming releases. In Chapter 10, Configuring and Working with External Libraries, we will provide more insight into how to use these kinds of third-party packages with Spark.

Social network analysis

A social network is made up of nodes (points) and associated links (edges), which are the identifiable categories of analysis. The nodes might include information about people, groups, and organizations. Typically, this information is the main priority and concern for any type of social experimentation and analysis. The links capture the social contacts and exchanges of information that expand social interaction, as on Facebook, LinkedIn, Twitter, and so on. It follows that organizations embedded in networks of larger social processes, links, and nodes influence one another.

According to Otte E. et al. (Social network analysis: A powerful strategy, also for the information sciences, Journal of Information Science, 28: 441-453), Social Network Analysis (SNA) is the mapping and measuring of relationships between connected people or groups. It is also used to find the flows between people, groups, organizations, and information processing entities.

SNA can be used to show the distinction between the three most popular individual centrality measures: degree centrality, betweenness centrality, and closeness centrality. Degree centrality signifies how many links are incident on a node, that is, how many ties a node has.
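As a minimal sketch, degree centrality can be computed directly from an adjacency mapping. The small friendship network below is hypothetical, and dividing by n - 1 (the largest possible degree in a simple graph) is one common normalization choice rather than something the text prescribes.

```python
# Hypothetical undirected friendship network as an adjacency mapping.
graph = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "cat"},
    "cat": {"ann", "bob", "dan"},
    "dan": {"cat", "eve"},
    "eve": {"dan"},
}


def degree_centrality(graph):
    """Number of ties per node, normalized by the maximum possible degree."""
    n = len(graph)
    if n < 2:
        raise ValueError("graph needs at least two nodes")
    return {node: len(neighbors) / (n - 1) for node, neighbors in graph.items()}


degrees = degree_centrality(graph)
```

In this network, `cat` has the most ties and therefore the highest degree centrality.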

Betweenness centrality is a centrality measure of a vertex within a graph; an analogous measure, edge betweenness, is defined for edges. Betweenness centrality signifies the number of times a node acts as a bridge along the shortest paths between other nodes.
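The following sketch computes unnormalized betweenness centrality for an undirected, unweighted graph using Brandes' algorithm, which runs one breadth-first search per node. The function name `betweenness_centrality` and the sample network are hypothetical; when several shortest paths connect a pair of nodes, each intermediate node receives a fractional share.

```python
from collections import deque


def betweenness_centrality(graph):
    """Unnormalized betweenness for an undirected, unweighted graph (Brandes)."""
    centrality = {node: 0.0 for node in graph}
    for source in graph:
        # BFS from source: count shortest paths (sigma) and record predecessors.
        order = []
        predecessors = {node: [] for node in graph}
        sigma = {node: 0 for node in graph}
        distance = {node: -1 for node in graph}
        sigma[source] = 1
        distance[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if distance[w] < 0:
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:
                    sigma[w] += sigma[v]
                    predecessors[w].append(v)
        # Accumulate pair dependencies in reverse BFS order.
        delta = {node: 0.0 for node in graph}
        for w in reversed(order):
            for v in predecessors[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != source:
                centrality[w] += delta[w]
    # Each undirected pair was counted once from each endpoint.
    return {node: value / 2.0 for node, value in centrality.items()}


# Hypothetical undirected friendship network as an adjacency mapping.
graph = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "cat"},
    "cat": {"ann", "bob", "dan"},
    "dan": {"cat", "eve"},
    "eve": {"dan"},
}

bridges = betweenness_centrality(graph)
```

Nodes such as `cat` and `dan`, which connect otherwise separate parts of the network, receive the highest scores, while nodes that lie on no shortest path between others score zero.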

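The closeness centrality of a node, on the other hand, is based on the average length of the shortest paths between that node and all other nodes in a connected graph, such as a social network. It is commonly defined as the reciprocal of that average, so that more central nodes receive higher scores.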
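A minimal sketch of closeness centrality follows, assuming an undirected, unweighted, connected graph and using the common convention of taking the reciprocal of the average shortest-path length, so that a higher score means a more central node. The function name `closeness_centrality` and the sample network are hypothetical.

```python
from collections import deque


def closeness_centrality(graph, node):
    """Reciprocal of the average shortest-path length from `node` to all others."""
    if len(graph) < 2:
        raise ValueError("graph needs at least two nodes")
    # Breadth-first search gives shortest-path lengths in an unweighted graph.
    distance = {node: 0}
    queue = deque([node])
    while queue:
        v = queue.popleft()
        for w in graph[v]:
            if w not in distance:
                distance[w] = distance[v] + 1
                queue.append(w)
    if len(distance) != len(graph):
        raise ValueError("graph must be connected")
    return (len(graph) - 1) / sum(distance.values())


# Hypothetical undirected friendship network as an adjacency mapping.
graph = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "cat"},
    "cat": {"ann", "bob", "dan"},
    "dan": {"cat", "eve"},
    "eve": {"dan"},
}

closeness_cat = closeness_centrality(graph, "cat")
closeness_eve = closeness_centrality(graph, "eve")
```

A node in the middle of the network, such as `cat`, reaches everyone in few steps and scores higher than a peripheral node such as `eve`.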

Tip

Interested readers are encouraged to read more about eigenvector centrality, Katz centrality, PageRank centrality, percolation centrality, cross-clique centrality, and alpha centrality for a proper understanding of statistical as well as social network centrality.

A social network is often represented as a connected graph (directed or undirected). As a result, it also involves graph data analysis, where people act as nodes and the connections or links act as edges. Moreover, collecting and analyzing large-scale data, and later developing predictive and descriptive analytics applications from social networks such as Facebook, Twitter, and LinkedIn, also involves social network data analysis. This includes link prediction, such as predicting relations or friendships; determining communities in social networks, which amounts to clustering on graphs; and determining opinion leaders in networks, which is essentially a PageRank problem if the graph data is structured appropriately.
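To illustrate how finding opinion leaders can be framed as a PageRank problem, here is a minimal power-iteration sketch in plain Python. It is not the GraphX implementation; the function name `pagerank`, the damping factor of 0.85, the fixed iteration count, and the `follows` graph are assumptions made for illustration. An edge from A to B is read as "A follows B", so rank flows toward accounts that many others follow.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank; `links` maps each node to the nodes it points to.

    Every target must also appear as a key in `links`.
    """
    nodes = list(links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread evenly.
        dangling = sum(rank[node] for node in nodes if not links[node])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for node in nodes:
            targets = links[node]
            if targets:
                # Each node splits its damped rank equally among its targets.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical directed "follows" graph.
follows = {
    "ann": ["cat"],
    "bob": ["cat"],
    "dan": ["cat"],
    "cat": ["ann"],
}

ranks = pagerank(follows)
leader = max(ranks, key=ranks.get)
```

The node with the highest rank is the candidate opinion leader; in this small graph that is the account followed by everyone else.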

Spark has a dedicated API for the analysis of graphs, called GraphX. This API can be used, for instance, to search for spam, rank search results, determine communities in social networks, or search for opinion leaders; this is not a complete list of the ways it can be applied to analyze graphs. We will discuss using GraphX in more detail later in this chapter.
