3
Data Development Mechanisms

Since the advent of IT and the Internet, the amount of data stored in digital form has been growing rapidly. An increasing number of data silos are being created across the world. Individuals are putting more and more publicly available data on the web, and many companies collect information on their clients and their behaviour. At the same time, many industrial and commercial processes are controlled by computers, and the results of medical tests are retained for analysis.

The increase in data produced by companies, individuals, scientists and public officials, coupled with the development of IT tools, offers new analytical perspectives. Today, not only is the quantity of digitally stored data much larger, but the type of data is also far more varied. The data analyzed is no longer necessarily structured in the way it was in earlier analyses, but can now take the form of text, images, multimedia content, digital traces, connected objects, etc.

Faced with this volume and diversification, it is essential to develop techniques to make the best use of all of these stocks in order to extract the maximum amount of information. Several techniques exist, such as “data mining”, which are not new but build on the principles of descriptive and predictive methods. They are an integral part of data analysis, especially when the volume of data is substantial. Companies are increasingly aware of the potential benefits of Big Data in terms of developing new products and services, transforming their business model, improving customer knowledge or exploring new areas of development.

To make the most of Big Data, the issue is not limited to the “simple” technical questions of collection, storage and processing speed. The use of Big Data requires rethinking the processes of collecting, processing and managing data. It is necessary to put in place governance that enables the classification of data and the sorting of analytical priorities. Every company holds a deposit of value in its data; Big Data generates that value through the analysis of these data.

It is the “analysis” applied to data that justifies Big Data, not the collection of data itself. We must therefore allow companies access to the “information” which is likely to generate the most value. But the process whereby information acquires value is that of “competitive intelligence”. It becomes increasingly useful once the company has a large amount of information stored in a database. This is the best way for an entire company to take hold of Big Data in order to create value. If data development is itself a new strategic challenge, then governance is a required tool for success.

3.1. How do we develop data?

The amount of data (structured and unstructured) which originates from diverse sources and is produced in real time leads us to suggest that there is a “Malthusian law of data”. Malthus noted that the quantity of food only increased arithmetically, whilst the number of humans increased in a geometric progression [MAL 08]. Similarly, Pool [POO 84] notes that the supply of data increases exponentially whilst the amount of data consumed increases, at best, in a linear manner.
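
As a purely illustrative formalization of this analogy (the symbols, growth rate g and slope a below are assumptions introduced for the sketch, not figures from the text), the two regimes can be written as:

\[ S_t = S_0\,(1+g)^t \quad \text{(data supplied: geometric growth)}, \qquad C_t = C_0 + a\,t \quad \text{(data consumed: arithmetic growth)}. \]

For any positive g and a, the ratio S_t / C_t grows without bound, so the share of the available data that is actually consumed shrinks over time.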

This explosion of data volumes will gradually increase as the Internet of Things develops. Human beings are now joined in their “data production” by a multitude of sensors, whether at home or in their means of transport. In parallel, the production of data outside the traditional boundaries of a company (and its information systems) is growing exponentially, particularly via social media. The data produced are largely unstructured (photos, videos, etc.).

The data market is, therefore, in a situation where the quantity offered is much higher than the quantity demanded. This phenomenon is essentially due to the fact that our mental capacities and our available time to deal with the information are limited. The increasing digitization of our activities, the ever-increasing ability to store digital data and the subsequent accumulation of all kinds of information generate a new industry that focuses on the analysis of large volumes of data.

Big Data has been described as “the next frontier for innovation, competition and productivity” [MAN 11]. It offers businesses unprecedented opportunities for increased revenues, performance and knowledge related to their business, the market and their clients. Data from social media is a perfect example.

Using technology to capture data from Facebook, LinkedIn, Twitter, etc., a company can refine brand management, find new distribution channels and strengthen the customer relationship. The accumulation of data through diverse information systems has a potential value that companies are not always aware of. Even if they do not necessarily know how to use these data themselves, they hold resources that they do not yet value; this data and its usage is a source of capital for these businesses.

The challenge of valuing data is working out, intelligently, which data are usable. Solutions and methods have been put in place by companies such as Google, Amazon and others for whom “Big Data” really makes sense. These organizations are characterized by their strong capacity for innovation, particularly in “data science”. It is crucial to highlight the mapping of data by companies and their subsequent capacity to extract value from it. We need to define a business priority, put solutions in place to address it, and only then see whether the resulting system is labelled Big Data or not.

Using these resources is therefore the main challenge: in the context of the Big Data phenomenon, the challenge lies in recognizing and anticipating the crossing of numerous data sources, not merely in the eventual generation of value. Companies must be aware of the wealth of data they already hold internally, and of how to exploit it to extract value, before seeking to expand it with external data. The question is no longer which data need to be stored, but what can be done with these data. Capturing and mixing data becomes highly strategic as a way to reveal new knowledge.

The analysis of large volumes of data can bring clear insights that allow for a competitive advantage. But this requires tools to process data in large volumes, at high speed and in various formats. This calls for an efficient and cost-effective infrastructure: “cloud computing”. The cloud has become a reality in the lives of companies. It provides a means of supporting the volume of data and the advanced analytical applications capable of bringing added value to the business. The development of cloud computing is closely linked to the phenomenon of Big Data.

Hence, we must know how to elevate these technologies to the scale of billions of information units. It is also about developing a culture of data, its operational logic and its visual interpretation. The challenge today is to develop technologies to visualize massive data flows: “data visualisation”. Visualisation is a way for operational decision-makers to best understand and use the results of the analysis. The visual restitution provides the level of abstraction needed to understand the “bulk data” and to give it meaning. Correlations between data allow for the extraction of new knowledge.


Example 3.1. Data centres or digital factories

The development of increasingly fast and powerful tools is pushing decision-makers to seek “decision support” that is rapid, efficient and visual, available in any geographical area and hence on the web. In our society of information and communication, statistics play an increasingly important role. They are omnipresent. Every day, particularly through the media, we are all exposed to a wealth of information from quantitative studies: opinion polls, “barometers” of the popularity of politicians, satisfaction surveys and indicators of national accounts.

Beyond the requirements of many professions, modern life itself demands the reading and understanding of the most diverse statistical data: market research, dashboards, financial statistics, forecasts, statistical tests, indicators, etc. Statistics are the basis of the social and economic information system. They are nowadays extremely accessible thanks to the tremendous development of databases, notably via the Internet, or can easily be built using the spreadsheets integrated into the most common software. Amongst the available techniques are the following (a brief illustrative sketch in Python follows the list):

  1. – descriptive techniques: they seek to highlight existing information that is hidden by the volume of data (customer segmentation and the search for product associations on receipts). These techniques help reduce, summarize and synthesize data, for example:
    1. - factor analysis: projection of data in graphical form to obtain a visualization of the overall connections (proximities and oppositions) between the various data;
    2. - automatic classification (clustering/segmentation): grouping of homogeneous data to highlight a segmentation of individuals into classes;
    3. - search for associations (analysis of receipts): this involves spotting dependencies between the objects or individuals observed.
  2. – predictive techniques: predictive analysis allows for very accurate projections to identify new opportunities (or threats) and thus anticipate appropriate responses to the situation. These techniques aim to extrapolate new information from existing sources (this is the case with scoring). They are statistical methods which analyse the relation between one or several dependent variables and a set of independent variables, such as:
    1. - classification/discrimination: the dependent variable is qualitative;
    2. - discriminant analysis/logistic regression: these find rules for allocating individuals to their groups;
    3. - decision trees: they allow the individuals of a population to be divided into classes; we begin by selecting the variable that best separates the individuals into sub-populations, called nodes, with respect to the target variable;
    4. - neural networks: from the field of artificial intelligence, this is a set of interconnected nodes carrying weighted values. They are trained by adjusting the weights until an optimal solution is found or a fixed number of iterations is reached;
    5. - forecast: the dependent variable is continuous or quantitative;
    6. - linear regression (simple and multiple): it allows the variation of a dependent variable to be modeled with regard to one or more independent variables, and thus the evolution of the former to be predicted in relation to the latter;
    7. - general linear model: it generalizes linear regression with continuous explanatory variables.
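
The following sketch is purely illustrative and uses simulated data; the variable names, parameters and library choices (Python with scikit-learn) are assumptions, not elements taken from the examples above. It shows how two of these families are typically invoked: an automatic classification (k-means clustering) as a descriptive technique and a multiple linear regression as a predictive one.

# Illustrative sketch only: simulated data, hypothetical variable names.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Descriptive: group 200 fictitious customers into homogeneous segments
# from two behavioural variables (e.g. basket value, visit frequency).
customers = rng.normal(size=(200, 2))
segments = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(customers)

# Predictive: model a dependent variable (e.g. turnover) from two
# independent variables, then extrapolate to new observations (scoring).
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=200)
model = LinearRegression().fit(X, y)

print(np.bincount(segments))          # size of each customer segment
print(model.coef_, model.predict(rng.normal(size=(5, 2))))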

Example 3.2. An example of data processing for the startup “123PRESTA” in 2010. For a color version of the figure, see www.iste.co.uk/monino/data.zip


Example 3.3. Forecasting of time series using neural networks, the turnover of large retail stores. For a color version of the figure, see www.iste.co.uk/monino/data.zip

These techniques are bound to develop further in order to improve data processing. The software required for this processing must therefore be able to detect interesting information: the capacity to combine more data reinforces interconnectivity in the company, which increases responsiveness. Another aspect of the Big Data revolution, which reinforces the power of mathematical formulas to help explain data, is “algorithms”. Beyond data collection and storage, algorithmic intelligence is indispensable for making sense of data. Algorithms are used to establish correlative models, ordering, sorting and prioritizing data and making them intelligible through correlation or prediction models.
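
As a minimal illustration of this algorithmic layer (the series are simulated and the variable names are hypothetical), the sketch below computes a correlation matrix on a few business indicators and keeps only the strongest pairwise relations, the kind of elementary building block on which such correlative models rest.

# Minimal sketch: correlation screening on simulated business indicators.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500
traffic = rng.normal(100, 10, n)               # fictitious web traffic
sales = 0.8 * traffic + rng.normal(0, 5, n)    # partly driven by traffic
returns = rng.normal(20, 3, n)                 # unrelated indicator

df = pd.DataFrame({"traffic": traffic, "sales": sales, "returns": returns})
corr = df.corr()

# Keep only strong correlations (|r| > 0.5) above the diagonal.
strong = corr.where(np.triu(np.abs(corr).to_numpy() > 0.5, k=1))
print(corr.round(2))
print(strong.stack())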

Companies can now understand complex phenomena and share these analyses to increase their collective intelligence. But it should be noted that every company must define its own way of extrapolating data. To do this they need the help of someone often called a “data scientist”, who must be capable of analyzing enormous quantities of data to find correlations. However, these correlations must be applicable, cost-effective and achievable. The necessary qualities are therefore: an understanding of the use of data, creativity in interpreting data and confirming correlations, and an understanding of the company in order to master the use of data and find models which improve its profitability1.

The “Big Data” phenomenon, therefore, raises several questions relating to the technological developments it may create and the value these may have. It requires the development of new interdisciplinary training programs, resulting in operational expertise particularly suited to the field of “Big Data”, at the interface of IT, statistics and mathematics, as well as a strategic vision to design new services and products and to deploy advanced decision-making systems.

Another important factor in the process of using data is linked to the “quality” of data. Data, the basic elements of the process, constitute a vital and historical asset for the company. The quality of the information processed by Big Data is directly linked to the quality of the datasets entered, whether they originate from within or outside the company. The quality of data is an important factor in the phenomenon of Big Data because dependency on unreliable data can lead to bad decisions, and therefore, bad choices.


Example 3.4. Chaos, exponents of Hurst and Bootstrap. An example applied to the Paris Stock Exchange [MAT 05]

“Data” is no longer only structured and relational; it is unstructured and heterogeneous content (reviews, videos, images, sounds, sensor data, etc.). Recognizing images or objects, defining variables, resolving semantic subtleties and crossing different data sources all belong to the analysis of unstructured data.

Thus “open data” is another source of data; it is not merely an additional quantity of data added to Big Data but rather reliable data that can be referenced. The volume is not the real problem: the challenge of Big Data is identifying intelligent data and being able to interpret it to improve the competitiveness and performance of the company. The goal of interoperability is to mix a great amount of data from diverse sources. As such, an “Open Data” approach represents a new business model in which the company is itself a producer of data.

In a context of crisis and accelerating cycles, companies must continuously optimize their productivity and their operational efficiency. Poor quality data (customer, product, supplier, structure, etc.) has a direct impact on the competitiveness, efficiency and responsiveness of the company. To transform data into value it is necessary to invest in technologies: mastering the techniques, methods and tools will fulfil this function. But technology alone will not settle the issue.

Collecting data without prior justification, without defining a specific strategy, may be much less profitable than expected. The absence of unified and controlled management can also have serious consequences, such as losing control of operational risks. This therefore requires a genuine governance policy that goes beyond simple collection and processing operations. The aim is to help decision-making in a context where “information”, which is at the heart of “competitive intelligence”, has become a major strategic asset for companies.

The use of masses of collected data requires sorting, testing, processing, analysis and synthesis. It is through this process that raw data collected during the search will be transformed into “strategic information”. The interpretation of the collected data through analysis and synthesis will represent the level of success of the “competitive intelligence” process.

The data development process, which is used to generate highly valuable information for companies and aims to help them take advantage of “Big Data”, consists of collecting, processing and managing large volumes of data to identify what they can bring to the company, as well as measuring and monitoring their use, their consumption patterns, their storage and their governance. The latter encompasses the preceding steps and establishes processes and good operational practices as and when required.

3.2. Data governance: a key factor for data valorization

It is difficult for companies to make the best use of the ever-increasing volume of varied data. Taking advantage of Big Data, which is often summarized by four, or sometimes seven, characteristics (the 4 Vs and 7 Vs respectively), appears to be a major challenge. Moreover, automatically generated data often requires new analytical techniques, representing an additional challenge. According to the available statistics, 80% of data in companies is unstructured. In addition, documents in plain text, videos, images, etc. add to the variety of data types.

Textual data, log data, mobile data, videos, etc., have disrupted traditional processing technologies because they are not structured data. The new challenge is the processing of these new data formats. A portion of these resources is stored in databases, developed in order to be managed and made available for use. Once again we are faced with the problem of the quality of data.

In this regard, how can companies know that the available data is reliable and suitable for use in order to extract information (see the pyramid of knowledge)? Treating erroneous and unreliable data in the same way as other data will likely distort the analysis and, therefore, the subsequent decisions. This risk related to the quality of data is at the origin of a new concept, “Smart Data”. We must, therefore, recognize the importance of searching for the right data (in the warehouse), data that is reliable and useful and that will allow information to be obtained and value to be created, rather than focusing only on the processing.

In these circumstances the company has to find a solution that facilitates the processing of this volume of data and its availability for very different analytical purposes. Moreover, the substantive discussions on the control of a company’s information holdings lead it to adopt data governance principles. For a company to be able to value its data as a corporate asset, it needs to change its culture and the way data is managed.

This means mixing the business’s data, which is already available and structured, with data coming from Big Data (that is, unstructured data). Data governance is a set of functional domains that allow for managing, integrating, analysing and governing data.

Data governance is defined as “the capacity to mix reliable and timely data to increase analytics on a large scale, regardless of data source, the environment or the role of the users”2. This is a quality control framework aiming to evaluate, manage, operate, optimize, control, monitor, maintain and protect corporate data. Based on this foundational concept, users can easily explore and analyze mixed and validated data from an entire company.

Once the relevant data has been obtained, it must be formatted properly to be analyzed. Data ready for analysis must comply with all the relevant rules of governance, notably in being reliable, consistent, timely and properly mixed. The introduction of analysis tools based on massive volumes entails:

  1. – access to existing data, but for uses that are not necessarily listed. These uses may lead to violations of security policies, especially with regard to regulatory constraints (protection of privacy, health data, banking data, etc.);
  2. – imports from social networks, the Web or other sources of new data that has not yet been listed in the company. These new data require a policy review that takes into account their particular sensitivity;
  3. – the results of the analysis are themselves new data to be protected.

Security and governance become critically important. In March 2014, Gartner published a report [GAR 14] which highlighted that: “the information security managers should not manage the security of Big Data in isolation but need rules that cover all data silos to prevent this management turning into chaos”. It recommends evaluating the implementation of data protection with respect to the security rules around databases, unstructured data, cloud storage and Big Data.

The introduction of data analysis solutions requires both a review of existing policies to integrate new uses of data and an extension of policies to incorporate issues specific to new data. This requires that the lifecycles of the data collected are optimized so that the needs are met instantly and treated properly. This highlights the interaction between multiple actors (policy makers, analysts, consultants, technicians, etc.) in a group dynamic that allows for a combination of knowledge: better understanding, better analysis of the situation and production of information that is necessary for taking action.

To ensure relevant collection, and before moving on to analyzing data, it is essential to define for what analytical needs the data will be researched and collected, and with what techniques and tools. This requires defining an interdepartmental responsibility across the company. The transformation of data and information into knowledge is the result of a collective process centered on the shared success of problem-solving. It should be noted that this principle refers to the notion of collective intelligence [DUP 04]. The concept of knowledge thus draws on collective, community-based processes, apprehended at the organisational and managerial level and in line with the strategy of the company.

With the sheer volume of data that is accumulating in most organizations, it is not always easy to find the right data or even convert them in order to extract a useful analytical view. Efficient data governance is essential to preserve the quality of data, because its volume and its complexity grow with the number of interactions within the company (internal environment) and between the company and its external environment (customers, suppliers, competitors, etc.).

The first step in building a governance framework is identifying the extent of the requirement by comparing the existing practices and the target set for good governance:

  1. – manage all data and associated data;
  2. – manage the data consumption process;
  3. – manage the data lifecycle in a cross-domain view of the company;
  4. – improve the quality of data.

Data governance involves a set of people, processes and technologies to ensure the quality and value of a company’s data. The technological part of governance combines data quality, integration and data management. The success of a data governance strategy involves various technical challenges. Firstly, companies need to integrate different data sources and implement quality measures to improve their functionalities. They should then establish and improve the quality of their data. Finally, collaborative work is required for the different teams to work together on data quality issues.

In other words, it seems necessary to steer managerial and organizational practices so as to develop a growing organization based on creativity and the use of knowledge. This brings us to a very important concept in the business world, “competitive intelligence” (CI), as a method of data governance that goes beyond the mere mastery of strategic information. CI helps change the representations that the company’s actors have of their environment. In this sense, the strategic and collective dimension of CI lies in the triptych: ownership, interpretation and action [SAI 04, MOI 09].

CI will indeed help anticipate new opportunities for growth that the company can seize on by modifying its activities. To make the company more innovative, we must adapt the strategy by controlling the most decisive information. CI analyzes information, anticipates developments and makes decisions accordingly, thus developing information security, data protection, the anticipation of risks and the control of its information system in and beyond its networks. The set of fields that constitute competitive intelligence, such as knowledge management, information protection and lobbying, can be grouped into the overall concept of strategic intelligence.

CI is a mode of governance whose purpose is the control and protection of strategic and relevant information for any economic actor. It can be:

  1. – offensive when it collects, analyzes and disseminates useful information to economic actors;
  2. – defensive when protecting strategic information for the company from malicious acts or negligence, whether internal or external.

With the advent of the new knowledge-based economy, industrial issues in companies have become more complex. In this light, CI is not confined to the management of data flows, but is fundamentally interested in their interpretation and use in creating knowledge. The pervasiveness of the Big Data phenomenon complicates the control of information that may lead to the extraction of knowledge. In contrast, CI practices facilitate the measurement, management and increase of the value of data, which in turn influences the decision-making process to obtain the best possible results through the global sharing of knowledge.


Example 3.5. Short videos presenting CI

Our age of information is characterized by an exponential increase in data production and processing, but also by a massive increase in the speed of data transmission as well as the speed of access to stored data. We probably cannot fight Big Data, but if we want to extract a real profit from it, it is essential to master it through a strategy that creates value from data. It is only through a governance of reliable data that businesses’ data can become a strategic asset, which ultimately brings a competitive advantage and increased value to companies.

The main objective of data governance is to ensure that the use of Big Data meets a formal strategy aiming to obtain accurate results: it ensures that all the relevant questions are asked for the best use of Big Data. The governance of large volumes of data therefore remains to be built within companies. This requires establishing a hierarchy of data as well as ensuring their protection, which enforces general compliance throughout the company.

3.3. CI: protection and valuation of digital assets

Today, a company’s success is inextricably linked to its control of information. Access to information, its analysis, dissemination and presentation are all elements that determine success. Information is now instantly and readily available to all. It has become an economic asset and represents a valued product generated by sharing and exchange. Depending on its value, this information can become sensitive or even strategic; it is therefore linked to the notion of secrecy because it represents a key factor that affects the whole economy.

This exponential increase in the amount of available data creates new opportunities for the use of this information. The value of all of this data is nullified if there is no process in place for transforming the available information through analysis. As an overview, the factors that assisted the rise of Big Data and which continue to develop it are [SYL 15]:

  1. – the automation of data exchange;
  2. – a storage revolution;
  3. – the advent of new data science;
  4. – the progress of data visualisation;
  5. – new opportunities for monetization.

We will see just as many applications grow in strength over the coming years provided that businesses adapt their operations to this development. To do so, they must be innovative, using and mixing data from different connected objects constantly. This requires them to transform data (available in different forms, often unorganized and collected through various channels) into information, and then into knowledge. Unfortunately, the necessary information is currently hidden in a huge mass of data.

The exponential growth of data coupled with the use of algorithms has contributed to the emergence of the concept of “Big Data”. CI practices cannot ignore this phenomenon or the revolution affecting data processing. The success of a CI strategy thus relies on the capacity of businesses to manage and use the mountains of data that they hold, in other words on “data mining”. This operation attaches even more value to information, which must then be protected.

The use of Big Data allows data to be mixed so as to obtain a precise mapping. This should provide quality information that allows a company to evaluate alternatives, in order to better decide its behaviour and ensure the safe management of its holdings as part of an approach based on collective intelligence. Companies must now face, in real time, a significant increase in the available data that could influence the decision-making process.

Several events help situate the concept of CI, and thus shed some light on its origins and evolution by identifying the conditions that led to its emergence and development. The methods and tools of CI help to validate collected data (from different reliable sources) into coherent information tailor-made to the company’s profile and needs.

To manage such volumes of data and information, it is absolutely essential to have sorting and selection methods that are both pragmatic and effective. Competitive intelligence is a set of steps, from surveillance to researching strategic information. These steps are implemented by companies in order to monitor their environment, increase their competitiveness and manage the risks associated with their activity.


Figure 3.1. A model of economic intelligence [MON 12]

The CI process will allow companies to be both reactive in adapting to change and proactive by having a pre-emptive attitude to understand the dynamics of their environment better. Setting up a CI approach centered on Big Data at the heart of a business is an effective response to the challenges of an increasingly complex and unpredictable globalized world. In this context, it is necessary to control information flows from the whole company before taking a strategic decision.

The development of a CI approach at the heart of a company can only be carried through with everyone’s participation. The confrontation of differing levels of responsibility, highlighted by a strategic line, is the best way to help the decision-maker to make the correct decision at the right time. Estimating the quality and reliability of information and determining its usefulness for the company is undoubtedly the most important part of CI.

To widen its competitive advantage, the company must be able to create an asymmetry of information to its advantage. Information is becoming a strategic issue, not only for state security but equally to defend a country’s overall competitiveness. The significance of information as a raw material for the business world was only recently discovered: economic information is unlike any commodity in our society.

CI aims for a better control of information in order to help the decision-making process. Even seemingly derisory information can, after processing and cross-checking, have an economic value. Information is thus an economic and even historical good for the whole company. Moreover, from the point of view of the amount of data, CI provides a global vision of the company’s environment, monitors the whole informational sphere (markets, technologies, etc.) and allows strategic information to be extracted from the available mass. CI has become one of the growing activities around the world; it can be perceived as both:

  1. – an informational approach which includes a set of operations or processes through which collected information becomes usable and worthwhile. This approach also aims to harmonize research, processing, distribution and protection in-line with the requirements in that context and with the actors involved. The management of these actions is the purpose of the monitoring activity, which is essential to the competitive intelligence process;
  2. – mediation between the different actors:
    1. - the decision-makers: involved in this process at different levels, most often solicited through several roles, and functions,
    2. - the monitors: responsible for providing useful information in the best available conditions of speed and cost, according to an explicitly or implicitly formulated information request.

CI can be considered as a new managerial practice for corporate strategy, enabling a company to improve its competitiveness. It constitutes a strategic response to the questions and concerns of decision-makers, as well as improving the strategic position of the company in its environment. According to Salmon et al. [SAL 97], it consists of “giving decision makers the necessary information to define their medium and long-term goals in a better manner”. CI produces an “info action” enabling pro-activity and interactivity [BUL 02]. This principle of dynamic use of information takes place within a coherent structure and is based on the phenomenon of collective intelligence.

According to Martre7 [MAT 94], CI can be defined as a set of coordinated actions of research, processing and delivery of useful information to economic actors. These various actions are carried out legally, with all the necessary protection guaranteed for preserving the company’s holdings, quickly and cost-effectively. Useful information is that which is needed by the different levels of the company or community to develop a coherent strategy, and the tactics necessary for the company to achieve its objectives and improve its position in a competitive environment.

These actions, at the heart of the company, are organized around an uninterrupted cycle which generates a shared vision of corporate goals. CI stems from a strategic vision and promotes interactions between different business, community, territory, national, transnational and state actors. This definition helps create a global vision of the environment in which CI must emerge. The coordination of these actions offers the company a view on its different relations with its environment through constant monitoring of the behaviour of its competitors and the realities of the market.

The analysis of a larger amount of data in real time is likely to improve and accelerate decisions in multiple sectors, from finance to health, including research in both. The new analytical power is seen as an opportunity to invent and explore new methods that are able to detect correlations in the available data. Everything that relates to data or CI is now qualified as “Big Data”, and it seems that this craze will soon reach its peak of enthusiasm, as happened with “cloud computing”, which saw existing offers renamed and entire organisations move to “cloud” technology overnight.

3.4. Techniques of data analysis: data mining/text mining

Nowadays, we have a lot more data to manage and process: business transactions, scientific data, satellite images, text documents, text reports and multimedia channels. Due to this large increase in data, the design and implementation of efficient tools allowing users to access the “information” they consider “relevant” is increasingly necessary, as this information provides a better understanding of their environment, keeps them up to date with the market and helps them focus their strategy.

The diversity and growing volumes of data that rapidly converge in a business make accessing and processing them very difficult. These reservoirs of knowledge must be explored in order to understand their meaning, to identify the relationships between the data and to build models explaining their behaviour. Faced with huge amounts of data, new requirements have emerged for making the best managerial choices; these concern the automatic summarizing of the information held in stored data in order to discover new knowledge.

In order to meet these needs, a set of structures, approaches and tools – some new and some pre-existing – were grouped under the name “data mining”. The birth of data mining, a term initially used by statisticians, is essentially due to the convergence of two factors:

  1. – the exponential increase, within businesses, of data related to their activities (customer data, inventories, manufacturing and accounting) that contain vital strategic information which can assist decision-making;
  2. – very rapid advances in hardware and software.

The aim of data mining8 is thus the extraction of value from data in corporate information systems. Processing a multitude of data held within the organization in order to achieve better decision-making is therefore often seen as a major challenge for companies.

Data mining covers the discovery of new structures in large data sets and involves statistical methods, artificial intelligence and database management. It is therefore a field at the intersection of statistics and information technology, with the aim of discovering structures in large data sets. Data mining as a process (as automated as possible) goes from taking basic information in a “data warehouse” to the decision-making forum, bringing “added informational value” at each step, up to the automatic release of actions based on the synthesized information.

Data mining is the process of automatically extracting predictive information from large databases. According to the Gartner Group, this process can be repetitive or interactive depending on the objectives. We can say that the main task of data mining is to automatically extract useful information from data and make it available to decision makers. As a process, it is relevant to note that data mining refers to both tools and a highly developed computer technology. Importantly, the human must not be removed from the process; they must remain fully involved in each phase. The different phases of the process are therefore (a short sketch following the list illustrates phases 3 to 6):

  1. – understanding the task: this first phase is essential in understanding the objectives and requirements of the task, in order to integrate them into the data mining project and to outline a plan to achieve them;
  2. – understanding the data: this involves collecting and becoming familiar with the available data. It also involves identifying data quality problems at the earliest possible opportunity, developing initial intuitions and detecting the first subsets and hypotheses to be analyzed;
  3. – preparation of data: this phase is comprised of all the steps necessary to build datasets that will be used by the model(s). These steps are often performed several times based on the proposed model and the results of analysis already carried out. It involves extracting, transforming, formatting, cleaning and storing data appropriately. Data preparation constitutes about 60-70% of the work;
  4. – modeling: this is where modeling methodologies of statistics come into play. Models are often validated and built with the help of business analysts and quantitative method experts, called “data scientists”. There are in most cases several ways of modeling the same problem of data mining and several techniques for adjusting a model to data;
  5. – evaluation of the model: at this stage, one or several models have been built. It must be verified that their results are satisfactory and coherent, notably in relation to the targets;
  6. – the use of the model: the development of the model is not the end of the data mining process. Once information has been extracted from the data, it still needs to be organized and presented so as to make it usable by the recipient. This can be as simple as providing a descriptive summary of data or as complex as implementing a comprehensive data mining process for the final user. In any case, it is always important that the user understands the limits of the data and of the analysis, and that their decisions are made accordingly.
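
The sketch below gives a purely illustrative, compressed view of phases 3 to 6 on simulated data; the meaning of the variables, the choice of a decision tree and the accuracy criterion are assumptions made for the example, not prescriptions from the text.

# Illustrative sketch of phases 3-6 (preparation, modeling, evaluation, use)
# on a simulated dataset; variable meanings and thresholds are assumptions.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 4))                 # prepared, cleaned indicators
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # fictitious target (e.g. churn)

# Preparation: split the dataset into training and validation sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Modeling: scaling followed by a decision tree, one of the techniques cited earlier.
model = make_pipeline(StandardScaler(), DecisionTreeClassifier(max_depth=3))
model.fit(X_train, y_train)

# Evaluation: check that the results are satisfactory before using the model.
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Use: score new observations and present the results to the recipient.
print("scores:", model.predict(rng.normal(size=(3, 4))))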

Data mining is a technology that creates value and extracts information and knowledge from data. The development of this technology is the result of an increase in digital data which, relative to their abundance, is underexploited without the proper tools and expertise. This technology is based on a variety of techniques (artificial intelligence, statistics, information theories, databases, etc.) that require diverse skills at a high level.

The success of this discipline has grown with the size of the databases. We can say that with the rise of the phenomenon of Big Data we have now entered a phase of mastery of the discipline. The challenge of Big Data for companies now is not so much the capacity for analysis but rather two issues that tend to be ignored [GAU 12]:

  1. – data collection methods must remain known and controlled to ensure that data mining analysis does not produce any counter-productive effects for the company;
  2. – the analysis of large amounts of data must not be done at the expense of their quality. Not all data have the same purpose, and not all of them add value to the company.

The types of models that can be discovered depend on the data mining tasks used. There are two types of data mining tasks: descriptive tasks, which describe the general properties of existing data, and predictive tasks, which make forecasts from available data. The main data mining tasks, across the variety of areas of knowledge explored, are the following (the association task is sketched in code after the list):

  1. – description: the importance of this task is to allow the analyst to interpret the results, either from a data mining model or an algorithm, in the most efficient and transparent manner. Thus the results of the data mining model should describe clear characteristics that can lead to an interpretation and an intuitive explanation;
  2. – estimation: the main interest of this task is to arrange the results in order to retain only the most valuable information; this technique is mainly used in marketing in order to offer deals to the best prospective clients;
  3. – segmentation: this consists of, for example, allocating customers into homogeneous groups which should then be addressed by specific means adapted to the characteristics and needs of each group;
  4. – classification: this concerns aggregating data or observations into groups of similar objects. The data are separated into homogeneous subgroups, called clusters: classes whose members are similar to each other and, by definition, different from the members of other groups;
  5. – prediction: the results of the prediction are unknown, which differentiates it from estimation;
  6. – association: this function of data mining allows for discovering which variables go together and what rules will help quantify the relationships between two or more variables.
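
As a minimal illustration of the association task (the receipts, items and thresholds below are invented for the sketch), the following lines count how often pairs of products appear together on fictitious receipts and derive elementary support and confidence measures.

# Minimal association-rule sketch on fictitious receipts (analysis of receipts).
# Support and confidence thresholds are arbitrary illustrative choices.
from itertools import combinations
from collections import Counter

receipts = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "coffee"},
    {"bread", "butter", "coffee"},
    {"butter", "milk"},
]

pair_counts = Counter()
item_counts = Counter()
for basket in receipts:
    item_counts.update(basket)
    pair_counts.update(combinations(sorted(basket), 2))

n = len(receipts)
for (a, b), count in pair_counts.items():
    support = count / n                  # share of receipts containing both items
    confidence = count / item_counts[a]  # estimate of P(b | a)
    if support >= 0.4 and confidence >= 0.6:
        print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")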

However, the heterogeneity of data sources and their characteristics means that data mining alone is not enough. With the evolution of the Web (the transition towards the Semantic Web) there has been an explosion of textual data; this is unstructured data which opens several possibilities for companies, who cannot ignore its existence or its impact on their ecosystem (suppliers, competitors and customers). The Web is another factor that justifies the extraction of knowledge from texts. Indeed, with the Web, unstructured data (such as text) has become the predominant type of online data.

In this context, useful information is not only found in quantitative numerical data but also in texts. The tools for accessing and collecting textual data must be capable of operating on HTML documents from the Web as well as on bibliographical or textual databases. This is the analysis of text, or “text mining”, a technique for extracting knowledge from unstructured documents or texts by using different computer algorithms.

This means modeling linguistic theories using computers and statistical techniques. The use of information technology to automate the synthesis of texts is not particularly recent. Hans Peter Luhn, a researcher at IBM and the real creator of the term business intelligence in 1958, published a report titled “The Automatic Creation of Literature Abstracts” [LUH 58]. This study is directly available from the IBM website. Text mining requires firstly recognizing words or phrases, then identifying their meaning and their relationships, in order to process, interpret and select a text. The selection criteria are of two types:

  1. – novelty: this consists of discovering relationships, notably the implications that were not explicit (indirect or between two distant elements in the text);
  2. – similarity or contradiction: in relation to another text or in response to a specific question, this consists of discovering texts that best match a set of descriptors in the original application.

Text mining generates information on the content of a particular document; this information will then be added to the document, thus enriching it. The main applications of text mining concern surveys and data analysis projects for which certain responses come in an unstructured or textual form (for example electronic messages, comments, suggestions in a satisfaction survey with open questions, descriptions of medical symptoms given by patients to practitioners, claims, etc.) and are best integrated into the overall analysis.


Example 3.6. A base of e-prospects with Statistica as an example of data processing

These techniques are often used to produce predictive models to automatically classify texts, for example to direct emails to the most appropriate recipient or to distinguish between “spam” and important messages automatically (a brief classification sketch in Python follows the list below). Text mining thus facilitates:

  1. – the automatic classification of documents;
  2. – the preparation of a document overview without reading it;
  3. – the feeding of databases;
  4. – the monitoring of documents.
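
The sketch below is a purely illustrative spam/important classifier trained on a tiny, invented corpus; the messages, labels and the choice of TF-IDF with a naive Bayes model are assumptions made for the example.

# Illustrative sketch: classifying short texts as spam or important ("ham").
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

messages = [
    "win a free prize now", "cheap offer click here", "limited offer win money",
    "meeting moved to Monday", "please review the attached report", "project deadline next week",
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

classifier = make_pipeline(TfidfVectorizer(), MultinomialNB())
classifier.fit(messages, labels)

print(classifier.predict(["free money offer", "report for the Monday meeting"]))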

Two approaches can be considered:

  1. – the statistical approach: it produces information on the number of occurrences of a word in a document;
  2. – the semantic approach: relies on an external element which is the repository, which can be static (keywords) or dynamic; it implements logic (i.e. information that is deduced from the repository).

Text mining is a technique which automates the processing of large volumes of text content to extract the key trends and to statistically identify the different topics that arise. Techniques of text mining are mainly used for data already available in digital format. Online text mining can be used to analyze the content of incoming emails or comments made on forums and social media.

The demand for different types of “data mining” can only increase. If this demand develops, it will steer research in the field of “data mining” towards numerical data, text, images, etc., and towards the development of viable systems. This is an essential consideration that companies will have to take on board. The procedural methods of mining and of updating models – with a view to automating decisions and decision-making – must be designed in conjunction with data storage systems in order to ensure the interest and usefulness of these systems for the company.

3.5. Conclusion

The Big Data phenomenon has changed data management procedures because it introduces new issues concerning the volume, transfer speed and type of data being managed. The development of the latest technologies such as smartphones, tablets, etc., provides quick access to tools; the Internet becomes a space for data processing thanks to broadband. As previously mentioned, this phenomenon is characterized by the size or volume of data, but the speed and type of data are also to be considered. Concerning the type, Big Data is often associated with unstructured content (web content, client comments, etc.), which presents a challenge for conventional storage and computing environments. In terms of speed, the challenge is handling the rate at which the information is created.

Thanks to various new technologies, it is now possible to analyse and use large masses of data from website log files, analysis of opinions on social networks, video streaming and environmental sensors. Most of the data collected in distributed file systems are unstructured, such as text, images or video. Some processing is relatively simple (for example computing simple counts), whereas other processing requires more complex algorithms that must be developed specifically to operate efficiently on a distributed file system.
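
The sketch below is a toy, single-machine imitation of the map/reduce pattern commonly used on distributed file systems; the documents and the word-count task are invented for the illustration and say nothing about any particular framework.

# Toy single-machine imitation of the map/reduce pattern used on
# distributed file systems; documents and splitting logic are illustrative.
from collections import Counter
from functools import reduce

documents = [
    "big data changes data management",
    "data governance preserves data quality",
]

# Map: each "node" turns its document into partial word counts.
partial_counts = [Counter(doc.split()) for doc in documents]

# Reduce: the partial results are merged into a global count.
total = reduce(lambda a, b: a + b, partial_counts, Counter())
print(total.most_common(3))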

The quality of data affects the merits and appropriateness of strategic decisions in the company, hence the need to consider the “information value chain”. All the actors in the information value chain must be made strongly aware of the identification – and especially the reporting – of quality defects; the solution lies in a communication campaign and in tools for catching data quality defects. The data governance approach is a pragmatic approach that formalizes and distributes data management responsibilities throughout the value chain.
