35.1 Introduction

While the chapters in Sections II through IV described the experimental systems and prototypes we have developed in big data management and analytics (BDMA) and big data security and privacy (BDSP), and the previous chapters in Section V discussed our exploratory research in BDMA and BDSP together with our approach to developing experimental infrastructures and courses in these fields, this chapter describes future directions for BDSP and BDMA. In particular, we summarize the discussions at the National Science Foundation (NSF)-sponsored workshop on BDSP (including applications of BDMA to BDSP) held in Dallas, Texas on September 16–17, 2014. Our goal is to build a community in BDSP to explore its challenging research problems. We also presented the results of the workshop at the National Privacy Research Strategy meeting in Washington, DC to help set the direction for research and education on these topics.

Recently, a few workshops and panels have been held on BDSP; examples include the ACM CCS workshop on Big Data Security, ACM SACMAT, and panels at the IEEE Big Data Conference. These events have been shaped by different communities of researchers. For example, the ACM CCS workshop series focuses on big data for security applications, while the IEEE Big Data Conference focuses on cloud security issues. Furthermore, these workshops and panels address only a limited subset of the technical issues surrounding BDSP. For example, the ACM CCS workshop does not appear to address the privacy issues arising from regulations or the security violations resulting from data analytics.

To address the above limitations, we organized a workshop on Big Data Security and Privacy on September 16–17, 2014 in Dallas, Texas, sponsored by the NSF [NSF]. The participants consisted of interdisciplinary researchers in high-performance computing, systems, data management and analytics, cyber security, network science, healthcare, and the social sciences, who came together to determine the strategic direction for BDSP. NSF has made substantial investments in both cyber security and big data, so it is critical that the two areas work together to determine the direction for big data security. We made a submission based on the workshop results to the National Privacy Research Strategy [NPRS] and gave a presentation at the NITRD (Networking and Information Technology Research and Development) Privacy Workshop [NITRD]. This chapter serves as the workshop report, describing the issues in BDSP as well as the presentations and discussions at the workshop. We hope that this effort will help toward building a community in BDSP.

The organization of this chapter is as follows. Section 35.2 describes the issues surrounding BDSP; the workshop participants were given these issues to build upon during the workshop discussions. A summary of the workshop presentations is provided in Section 35.3, and a summary of the workshop discussions in Section 35.4. Next steps are discussed in Section 35.5. Figure 35.1 illustrates the topics discussed in this chapter.


Figure 35.1 Research issues in big data security and privacy.

35.2 Issues in BDSP

35.2.1 Introduction

This section describes issues in BDSP that were given to the workshop participants to motivate the discussions. These issues include both security and privacy for big data as well as BDMA for cyber security. While big data has roots in many technologies, database management is at its heart. Therefore, in this section, we will discuss how data management has evolved and will then focus on the BDSP issues.

Database systems technology has advanced a great deal during the past four decades, from the legacy systems based on network and hierarchical models to relational and object database systems. Database systems can now be accessed via the web, and data management services have been implemented as web services. Due to the explosion of web-based services, unstructured data management, social media, and mobile computing, the amount of data to be handled has increased from terabytes to petabytes and zettabytes in just two decades. Such vast amounts of complex data have come to be known as big data. Not only must big data be managed efficiently, it also has to be analyzed to extract useful nuggets that enhance businesses and improve society. This has come to be known as big data analytics.

Storage, management, and analysis of large quantities of data can also result in security and privacy violations. Data often has to be retained for various reasons, including regulatory compliance. The retained data may contain sensitive information and could violate user privacy. Furthermore, manipulating big data, such as combining sets of different types of data, can itself result in security and privacy violations: even when personally identifiable information has been removed from the raw data, the derived data may contain private and sensitive information. For example, raw data about a person may be combined with the person's address, which may be sufficient to identify the person.
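The linkage risk just described can be made concrete with a small sketch. The datasets, field names, and values below are entirely hypothetical; the point is only that a join on quasi-identifiers (here, zip code and birth date) can re-identify records from which names were removed:

```python
# Illustrative linkage attack on hypothetical data: a "de-identified"
# dataset joined with a public record re-identifies an individual.

# Released dataset: names removed, but quasi-identifiers retained.
medical = [
    {"zip": "75080", "birth": "1975-03-12", "diagnosis": "diabetes"},
    {"zip": "75080", "birth": "1982-07-01", "diagnosis": "asthma"},
]

# Public dataset (e.g., a voter roll) with the same quasi-identifiers.
voters = [
    {"name": "Alice Smith", "zip": "75080", "birth": "1975-03-12"},
    {"name": "Bob Jones", "zip": "75219", "birth": "1990-11-30"},
]

def link(released, public):
    """Join the two datasets on the quasi-identifiers (zip, birth)."""
    index = {(p["zip"], p["birth"]): p["name"] for p in public}
    return [
        (index[(r["zip"], r["birth"])], r["diagnosis"])
        for r in released
        if (r["zip"], r["birth"]) in index
    ]

print(link(medical, voters))  # [('Alice Smith', 'diabetes')]
```

Even two innocuous attributes can act as a unique key, which is why removing direct identifiers alone is not a sufficient privacy guarantee.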

Different communities are working on the big data challenge. For example, the systems community is developing technologies for massive storage of big data, the network community is developing solutions for managing very large networked data, and the data community is developing solutions for efficiently managing and analyzing large sets of data. Big data research and development is being carried out in academia, industry, and government research labs. However, little attention has been given to security and privacy considerations for big data. Security cuts across multiple areas, including systems, data, and networks, so we need these communities to come together to develop solutions for BDSP.

This section describes some of the issues in BDSP. An overview of BDMA is provided in Section 35.2.2. Security and privacy issues are discussed in Section 35.2.3. BDMA for cyber security is discussed in Section 35.2.4. Our goal of building a community is discussed in Section 35.2.5.

35.2.2 Big Data Management and Analytics

BDMA research is proceeding in three directions. They are as follows:

1. Building infrastructure and high-performance computing techniques for the storage of big data.

2. Data management techniques, such as integrating multiple data sources (both big and small) and indexing and querying big data.

3. Data analytics techniques that manipulate and analyze big data to extract nuggets.

We will briefly review the progress made in each of the areas. With respect to building infrastructures, technologies such as Hadoop and MapReduce as well as Storm are being developed for managing large amounts of data in the cloud. In addition, main memory data management techniques have advanced so that a few terabytes of data can be managed in main memory. Furthermore, systems such as Hive and Cassandra as well as NoSQL databases have been developed for managing petabytes of data.

With respect to data management, traditional data management techniques such as query processing and optimization strategies are being examined for handling petabytes of data. Furthermore, graph data management techniques are being developed for the storage and management of very large networked data.

With respect to data analytics, the various data mining algorithms are being implemented on Hadoop- and MapReduce-based infrastructures. Additionally, data reduction techniques are being explored to reduce the massive amounts of data into manageable chunks while still maintaining the semantics of the data.

In summary, BDMA techniques include extending current data management and mining techniques to handle massive amounts of data as well as developing new approaches including graph data management and mining techniques for maintaining and analyzing large networked data.

35.2.3 Security and Privacy

The collection, storage, manipulation, and retention of massive amounts of data have resulted in serious security and privacy concerns. Various regulations are being proposed to handle big data so that the privacy of individuals is not violated. For example, even if personally identifiable information is removed from the data, an individual can be identified when the data is combined with other data. This is essentially the inference and aggregation problem that data security researchers have been exploring for the past four decades. The problem is exacerbated with the management of big data, as many different sources of data related to various individuals now exist.

In some cases, regulations may themselves cause privacy to be violated. For example, collected data (e.g., e-mail data) may have to be retained for a certain period of time (usually 5 years), and as long as such data is kept, there is a potential for privacy violations. Too many regulations can also stifle innovation. For example, if a regulation requires that raw data be kept as is and not manipulated, or that models not be built from the data, then corporations cannot analyze the data in innovative ways to enhance their business.

Therefore, one of the main challenges for ensuring security and privacy when dealing with big data is to come up with a balanced approach toward regulations and analytics. That is, how can an organization carry out useful analytics and still ensure the privacy of individuals? Numerous techniques for privacy-preserving data mining, privacy-preserving data integration, and privacy-preserving information retrieval have been developed. The challenge is to extend these techniques for handling massive amounts of often networked data.

Another security challenge for BDMA is to secure the infrastructures. Many of the technologies that have been developed, including Hadoop, MapReduce, Hive, Cassandra, PigLatin, Mahout, and Storm, do not have adequate security protections. The question is, how can these technologies be secured and at the same time ensure high-performance computing?

Next, big data management strategies such as access methods, indexing, and query processing have to be secured. The question is: how can policies for different types of data, such as structured, semistructured, unstructured, and graph data, be integrated? And since big data may result from combining data from numerous sources, how can the quality of the data be ensured?

Finally, the entire area of security, privacy, integrity, data quality, and trust policies has to be examined within the context of big data security. What are the appropriate policies for big data? How can these policies be handled without affecting performance? How can these policies be made consistent and complete?

This section has listed some of the challenges with respect to security and privacy for big data. We need a comprehensive research program that will identify the challenges and develop solutions for BDSP. Security cannot be an afterthought; that is, we cannot bolt security onto each and every big data technology after it has been developed. We need a comprehensive strategy so that security is incorporated while the technology is being developed. We also need to determine the appropriate types of policies and regulations to enforce before big data technologies are employed by an organization. This means researchers from multiple disciplines have to come together to determine what the problems are and explore solutions. These disciplines include cyber security and privacy, high-performance computing, data management and analytics, network science, and policy management.

35.2.4 Big Data Analytics for Security Applications

While the challenges discussed in Section 35.2.3 deal with securing big data and ensuring the privacy of individuals, BDMA techniques can also be used to solve security problems. For example, an organization can outsource activities such as identity management, email filtering, and intrusion detection to the cloud, because massive amounts of data are collected for such applications and this data has to be analyzed. Cloud data management is just one example of big data management. The question is: how can the developments in big data management and analytics techniques be used to solve security problems such as malware detection, insider threat detection, intrusion detection, and spam filtering?

35.2.5 Community Building

The various issues surrounding BDSP were discussed at the beginning of the workshop and five keynote presentations were given at the workshop that addressed many of these issues. In addition, several position papers were submitted by the workshop participants and subsequently, presentations based on these papers were given. These papers and presentations set the stage for the two breakout sessions held during the workshop. One of these sessions focused on the security and privacy issues while the other focused on the applications. The presentations and the discussions at the workshop are summarized in Sections 35.3 and 35.4 of this report. Our goal is to build a community in BDSP.

We have made some progress toward this goal over the past two years. In particular, we participated in the BDSP workshops organized at the Women in Cyber Security Conference series in Dallas in 2016 and in Tucson in 2017. In addition, we organized a Women in Data Science and Engineering Workshop in San Diego as part of the IEEE International Conference on Data Engineering (ICDE) series. We also continue to present papers and tutorials at various big data-related conferences.

35.3 Summary of Workshop Presentations

This section summarizes the presentations at the workshop. These presentations and the position papers can be found at http://csi.utdallas.edu/events/NSF/NSF%20papers%202014.htm.

35.3.1 Keynote Presentations

We had five keynote presentations to motivate the workshop participants. These keynote presentations discussed the various BDSP initiatives at NIST, Honeywell, and IBM as well as provided an overview of some of the research challenges. The opening keynote given by Wo Chang from NIST discussed the initiatives at NIST on big data and provided an overview of the big data workgroup. Later Arnab Roy from Fujitsu provided some details of the work by the BDSP subgroup of this workgroup. Elisa Bertino from Purdue discussed issues and challenges of providing security with privacy. Raj Rajagopalan from Honeywell discussed BDSP challenges for industrial control systems. Sandeep Gopisetty from IBM discussed the Big Data Enterprise efforts at IBM while Murat Kantarcioglu from UT Dallas provided an overview of the BDSP initiatives at UT Dallas.

There were several presentations given by the workshop participants. Below we give a summary of these presentations.

35.3.1.1 Toward Privacy Aware Big Data Analytics

Barbara Carminati from the University of Insubria Italy described a framework for privacy aware big data analytics. This framework included layers for privacy policy specifications, a unified query model, fine-grained enforcement, and a dashboard. She went on to discuss the functions of each layer.

35.3.1.2 Formal Methods for Preserving Privacy While Loading Big Data

Brian Blake from the University of Miami discussed how formal methods can be incorporated into approaches to handle privacy violations when multiple pieces of information are combined. In particular, he discussed the creation of a software life cycle and framework for big data testing.

35.3.1.3 Authenticity of Digital Images in Social Media

Balkirat Kaur from North Carolina A&T State University discussed novel solutions for detecting tampered images in social media. In particular, she discussed an approach for creating and capturing image signatures.

35.3.1.4 Business Intelligence Meets Big Data: An Overview of Security and Privacy

Claudio Ardagna from the University of Milano in Crema discussed the notions of full data and zero latency analysis within the context of BDSP.

35.3.1.5 Toward Risk-Aware Policy-Based Framework for BDSP

James Joshi from the University of Pittsburgh described a framework for BDSP that takes risk into consideration. He discussed how realizing such a framework involves the integration of policy engineering and risk management approaches.

35.3.1.6 Big Data Analytics: Privacy Protection Using Semantic Web Technologies

Csilla Farkas from the University of South Carolina discussed the use of semantic web technologies for representing policies and data and subsequently reasoning about these policies to prevent security and privacy violations.

35.3.1.7 Securing Big Data in the Cloud: Toward a More Focused and Data-Driven Approach

Ragib Hasan from the University of Alabama at Birmingham described the challenges in secure cloud computing and discussed a data-driven approach to providing solutions. In particular, he discussed the need to examine the data life cycle and ensure trustworthy computation and attribution. He stated that provenance should be a fundamental component of clouds.

35.3.1.8 Privacy in a World of Mobile Devices

Tim Finin from the University of Maryland, Baltimore County discussed approaches to providing privacy in a mobile computing environment. He stated that our privacy is at risk due to the proliferation of mobile devices and discussed ways of ensuring privacy.

35.3.1.9 Access Control and Privacy Policy Challenges in Big Data

Ram Krishnan from the University of Texas at San Antonio stated that data is being used in ways that were unforeseen at the time of collection. He then discussed the challenges of access control and privacy policy specification and enforcement for big data applications.

35.3.1.10 Timely Health Indicators Using Remote Sensing and Innovation for the Validity of the Environment

David Lary from The University of Texas at Dallas, who is a natural scientist by training, discussed the big data challenges of remote sensing with applications in human health. This presentation provided an overview of an application that manages and analyzes big data and showed the need to handle data privacy.

35.3.1.11 Additional Presentations

The workshop also included additional presentations, among them "Big Noise in Big Data: Research Challenges and Opportunities in Heterogeneous Sensor Data Integration" by Calton Pu from Georgia Tech and "Accelerating the Performance of Private Information Retrieval Protocols Using Graphical Processing Units" by Gabriel Ghinita from the University of Massachusetts in Boston. Calton demonstrated the integration of heterogeneous sensor data and discussed the need for data security and privacy, while Gabriel discussed approaches and challenges for private information retrieval. Presentations related to BDSP were also given by Anna Squicciarini from Pennsylvania State University and Guofei Gu from Texas A&M University; topics included social media privacy and malware attacks. Finally, Andrew Greenhut from Raytheon said a few words about security and privacy needs for defense applications.

35.3.1.12 Final Thoughts on the Presentations

As can be seen, the presentations covered a wide range of topics including security and privacy issues as well as applications such as healthcare. In addition, various types of frameworks for BDSP were also discussed. Technologies discussed included social media, image processing, mobile data, and sensor information management. These presentations set the stage for the workshop discussions that took place as part of the breakout sessions. The discussions are summarized in Section 35.4.

35.4 Summary of the Workshop Discussions

35.4.1 Introduction

This section provides a summary of the discussions on BDSP at the NSF workshop. The workshop consisted of keynote presentations, presentations by the participants, and workgroup discussions. We organized two workgroups: one on BDSP led by Dr. Elisa Bertino and the other on big data analytics for cyber security led by Dr. Murat Kantarcioglu. While the major focus of the workshop was on privacy issues arising from BDMA, we also had some stimulating discussions on applying big data management and analytics techniques to cyber security. Therefore, this section summarizes the discussions of both workgroups.

The organization of this section is as follows. The philosophy behind BDSP is discussed in Section 35.4.2. Privacy-enhanced techniques are discussed in Section 35.4.3. A framework for big data privacy is discussed in Section 35.4.4. Research challenges and interdisciplinary approaches to big data privacy are discussed in Section 35.4.5. An overview of BDMA techniques for cyber security is provided in Section 35.4.6.

35.4.2 Philosophy for BDSP

As discussed by Bertino [BERT14], technological advances and novel applications, such as sensors, cyber-physical systems, smart mobile devices, cloud systems, data analytics, and social networks, are making it possible to capture, quickly process, and analyze huge amounts of data from which to extract information critical for security-related tasks. In the area of cyber security, such tasks include user authentication, access control, anomaly detection, user monitoring, and protection from insider threats [BERT12]. By analyzing and integrating data collected on the internet and the web, one can identify connections and relationships among individuals that may in turn help with homeland protection. By collecting and mining data concerning user travels and disease outbreaks, one can predict how disease spreads across geographical areas. And these are just a few examples; there are certainly many other domains where data technologies can play a major role in enhancing security.

The use of data for security tasks, however, raises major privacy concerns [THUR02]. Collected data, even if anonymized by removing identifiers such as names or social security numbers, may, when linked with other data, lead to re-identifying the individuals to whom specific data items are related. Also, as organizations such as governmental agencies often need to collaborate on security tasks, datasets are exchanged across different organizations, making these datasets available to many different parties. Apart from the use of data for analytics, security tasks such as authentication and access control may require detailed information about users. An example is multifactor authentication, which may require, in addition to a password or a certificate, user biometrics. Recently proposed continuous authentication techniques extend user authentication with information such as user keystroke dynamics to constantly verify the user's identity. Another example is location-based access control [DAMI07], which requires users to provide the access control system with information about their current location. As a result, detailed user mobility information may be collected over time by the access control system. This information, if misused or stolen, can lead to privacy breaches.

It would then seem that in order to achieve security, we must give up privacy. However, this is not necessarily the case. Recent advances in cryptography are making it possible to work on encrypted data, for example, to perform analytics on encrypted data [LIU14]. However, much more needs to be done, as the specific data privacy techniques to use depend heavily on the specific use of the data and the security tasks at hand. Moreover, current techniques are still not able to meet the efficiency requirements of big datasets.

In what follows, we first discuss a few examples of approaches that help reconcile security with privacy. We then discuss some aspects of a framework for data privacy. Finally, we summarize research challenges and provide an overview of the multidisciplinary research needed to address them.

35.4.3 Examples of Privacy-Enhancing Techniques

Many privacy-enhancing techniques have been proposed over the last 15 years, ranging from cryptographic techniques, such as oblivious data structures [WANG14] that hide data access patterns, to data anonymization techniques that transform the data to make it more difficult to link specific data records to specific individuals. For further references, we refer the reader to specialized venues such as the Privacy Enhancing Technologies Symposium (PETS) series (https://petsymposium.org/2014/) and journals such as Transactions on Data Privacy (http://www.tdp.cat/). However, many such techniques either do not scale to very large datasets or do not specifically address the problem of reconciling security with privacy. At the same time, a few approaches do focus on efficiently reconciling security with privacy, and we discuss them below:

Privacy-preserving data matching: Record matching is typically performed across different data sources with the aim of identifying common information shared among these sources. An example is matching a list of passengers on a flight against a list of suspicious individuals. However, matching records from different data sources often conflicts with privacy requirements on the data owned by the sources. Cryptographic approaches such as secure set intersection protocols may alleviate such concerns, but these techniques do not scale to large datasets. Recent approaches based on data transformation and mapping into vector spaces [SCAN07], and on combinations of secure multiparty computation (SMC) with data sanitization approaches such as differential privacy [KUZU13] and k-anonymity ([INAN12], [INAN08]), have addressed scalability. However, work remains on developing privacy-preserving techniques suitable for complex matching based, for example, on semantic matching. Security models and definitions also need to be developed to support security analysis and proofs for solutions combining different security techniques, such as SMC and differential privacy.
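As a loose illustration of the matching idea only (not of the secure set intersection or SMC protocols cited above), the sketch below uses a keyed hash (HMAC) under a key shared by the two data owners, so that a third-party matcher sees only opaque tokens. The key and records are hypothetical, and this simplification is vulnerable to dictionary attacks over small identifier domains if the key leaks:

```python
import hashlib
import hmac

def tokens(records, key):
    """Replace each identifier with a keyed hash (HMAC-SHA256)."""
    return {hmac.new(key, r.encode(), hashlib.sha256).hexdigest()
            for r in records}

key = b"shared-secret-key"  # agreed upon by the two data owners

passengers = {"alice@example.com", "carol@example.com"}
watchlist = {"carol@example.com", "dave@example.com"}

# A third-party matcher sees only opaque tokens, yet can compute the
# intersection; a data owner maps matched tokens back by recomputing.
matched = tokens(passengers, key) & tokens(watchlist, key)
print(len(matched))  # 1 common record
```

A production protocol would add the vector-space mappings or SMC-plus-sanitization machinery discussed above to obtain provable guarantees.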

Privacy-preserving collaborative data mining: Conventional data mining is typically performed on big centralized data warehouses collecting all the data of interest. However, centrally collecting all data poses several privacy and confidentiality concerns when the data belongs to different organizations. One approach to addressing such concerns is based on distributed collaborative protocols by which the organizations retain their own datasets and cooperate to learn the global data mining results without revealing the data in their individual datasets. Fundamental work in this area includes (i) techniques allowing two parties to build a decision tree without learning anything about each other's datasets except what can be inferred from the final decision tree [LIND00] and (ii) specialized collaborative privacy-preserving techniques for association rules, clustering, and k-nearest neighbor classification [VAID06]. These techniques are, however, still very inefficient. Novel approaches based on cloud computing and new cryptographic primitives should be investigated.
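A classic, minimal example of this collaborative flavor is the ring-based secure sum, sketched below under simplifying assumptions (honest-but-curious parties, no collusion); the party values are hypothetical:

```python
import random

def secure_sum(local_values, modulus=2**32):
    """Ring-based secure sum: the initiator adds a random mask, each
    party adds its own value mod m, and the initiator removes the mask
    at the end. No party sees another party's individual input."""
    mask = random.randrange(modulus)      # known only to the initiator
    running = mask
    for v in local_values:                # each party adds its own value
        running = (running + v) % modulus
    return (running - mask) % modulus     # initiator removes the mask

# Three hospitals jointly count positive cases without revealing
# their individual counts to one another.
print(secure_sum([120, 45, 310]))  # 475
```

Real protocols must additionally defend against colluding neighbors on the ring, which is one reason the cited techniques are more involved.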

Privacy-preserving biometric authentication: Conventional approaches to biometric authentication require recording the biometric templates of enrolled users and then matching them against the templates provided by users at authentication time. Biometric templates represent sensitive information that needs to be strongly protected, and in distributed environments in which users interact with many different service providers, their protection becomes even more complex. A recent approach addresses this issue by combining perceptual hashing techniques, classification techniques, and zero-knowledge proof of knowledge (ZKPK) protocols [BERT14]. Under this approach, the biometric template of a user is processed to extract a string of bits, which is then further processed by classification and other transformations. The resulting bit string is used, together with a random number, to generate a cryptographic commitment. This commitment represents an identification token that reveals nothing about the original input biometrics. The commitment is then used in the ZKPK protocol to authenticate the user. This approach has been engineered for secure use on mobile phones. Much work remains, however, to reduce the false rejection rates. Also, different approaches to authentication and identification need to be investigated based on recent homomorphic encryption techniques.
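The commitment step can be illustrated with a deliberately simplified hash-based sketch. Note that [BERT14] uses cryptographic commitments together with a ZKPK protocol, whereas the sketch below only shows the hiding/binding idea and reopens the commitment directly; the bit string and parameters are hypothetical:

```python
import hashlib
import secrets

def commit(bits):
    """Commit to a biometric-derived bit string: token = H(bits || r).
    The token reveals nothing about `bits` without the random nonce r."""
    r = secrets.token_bytes(16)
    token = hashlib.sha256(bits.encode() + r).hexdigest()
    return token, r

def verify(token, bits, r):
    """Check that (bits, r) opens the stored commitment token."""
    return hashlib.sha256(bits.encode() + r).hexdigest() == token

template_bits = "1011001110001011"        # output of hashing/classification
token, nonce = commit(template_bits)      # token stored at enrollment

# At authentication the user reopens the commitment; the real scheme
# instead proves knowledge of the opening in zero knowledge.
print(verify(token, template_bits, nonce))  # True
```

The binding property means a user cannot later open the token to a different bit string, while hiding means the server learns nothing about the biometric from the token alone.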

35.4.4 Multiobjective Optimization Framework for Data Privacy

Although there have been attempts to come up with a privacy solution/definition that can address many different scenarios, we believe that there is no one-size-fits-all solution for data privacy. Instead, multiple dimensions need to be tailored to different application domains to achieve practical solutions. First of all, different domains require different definitions of data utility. For example, if we want to build privacy-preserving classification models, 0/1 loss could be a good utility measure, whereas for privacy-preserving record linkage, the F1 score could be a better choice. Second, we need to understand the right definitions of privacy risk. For example, in data sharing scenarios, the probability of re-identification given certain background knowledge could be considered the right measure of privacy risk, while ε=1 could be considered an appropriate risk level for differentially private data mining models. Finally, the computation, storage, and communication costs of the given protocols need to be considered. These costs can be especially significant for privacy-preserving protocols that involve cryptography. Given these three dimensions, one can envisage a multiobjective framework where different dimensions are emphasized:

Maximize utility, given risk and cost constraints: This is suited to scenarios where limiting certain privacy risks is paramount.

Minimize privacy risk, given utility and cost constraints: In some scenarios (e.g., medical care), significant degradation of utility may not be allowed. In this setting, the parameter values of the protocol (e.g., ε in differential privacy) are chosen so as to do the best we can in terms of privacy given the utility constraints. Note that in some scenarios there may be no parameter settings that satisfy all the constraints.

Minimize cost, given utility and risk constraints: In some cases (e.g., cryptographic protocols), we may want to find the protocol parameter settings that allow for the least expensive protocol satisfying all the utility and risk constraints.
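The second formulation above can be sketched as a simple search over the differential-privacy parameter ε. The utility, risk, and cost curves below are hypothetical placeholders standing in for the domain-specific definitions the framework calls for:

```python
# Sketch of "minimize privacy risk, given utility and cost constraints",
# searching over candidate epsilon values. The curves are hypothetical:
# in differential privacy, larger epsilon means more utility but also
# more disclosure risk.

def utility(eps):
    return 1 - 1 / (1 + eps)   # grows with epsilon

def risk(eps):
    return eps / (1 + eps)     # grows with epsilon

def cost(eps):
    return 1.0                 # constant cost in this toy setting

def min_risk(candidates, min_utility, max_cost):
    """Pick the feasible epsilon with the smallest privacy risk."""
    feasible = [e for e in candidates
                if utility(e) >= min_utility and cost(e) <= max_cost]
    if not feasible:
        return None  # no parameter setting satisfies the constraints
    return min(feasible, key=risk)

candidates = [0.1, 0.5, 1.0, 2.0, 4.0]
print(min_risk(candidates, min_utility=0.5, max_cost=2.0))  # 1.0
```

The other two formulations fall out of the same pattern by swapping which quantity is optimized and which become constraints.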

To better illustrate these dimensions, consider the privacy-preserving record matching problem addressed in [INAN12]. Existing solutions to this problem generally follow two approaches: sanitization techniques and cryptographic techniques. In [INAN12], a hybrid technique that combines these two approaches is presented. This approach enables users to make trade-offs between privacy, accuracy, and cost. This is similar to the multiobjective optimization framework discussed in this chapter. These multiobjective optimizations are achieved by using a blocking phase that operates over sanitized data to filter out pairs of records, in a privacy-preserving manner, that do not satisfy the matching condition. By disclosing more information (e.g., differentially private data statistics), the proposed method incurs considerably lower costs than those for cryptographic techniques. On the other hand, it yields matching results that are significantly more accurate when compared to the sanitization techniques, even when privacy requirements are high. The use of different privacy-parameter values allows for different cost, risk, and utility outcomes.
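The differentially private statistics mentioned above are typically released via a noise-addition mechanism. The sketch below shows the standard Laplace mechanism for a count query; the count and ε values are illustrative:

```python
import math
import random

def laplace_noise(scale):
    """Sample from Laplace(0, scale) via the inverse-CDF method."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def dp_count(true_count, epsilon, sensitivity=1.0):
    """Release a count with epsilon-differential privacy: the noise
    scale is sensitivity/epsilon, so a smaller epsilon (stronger
    privacy) means more noise and lower utility."""
    return true_count + laplace_noise(sensitivity / epsilon)

random.seed(0)
print(round(dp_count(1000, epsilon=0.5)))  # close to 1000, but perturbed
```

This is exactly the utility-versus-risk trade-off of the framework: ε tunes how much the disclosed statistics help the blocking phase versus how much individual-level information they leak.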

To enable the multiobjective optimization framework for data privacy, we believe that more research needs to be done to identify appropriate utility, risk, and cost definitions for different application domains. Defining correct and realistic privacy risks is especially paramount. Many human actions, ranging from oil extraction to airline travel, involve risks and benefits. In many cases, such as trying to develop an aircraft that never malfunctions, avoiding all risks is either too costly or impossible. Similarly, we believe that avoiding all privacy risks for all individuals would be too costly. In addition, assuming that an attacker may know everything is too pessimistic. Therefore, privacy risk definitions under realistic attacker scenarios are needed.

35.4.5 Research Challenges and Multidisciplinary Approaches

Comprehensive solutions to the problem of security with privacy for big data require addressing many research challenges and adopting multidisciplinary approaches. We outline significant directions in what follows:

Data confidentiality: Several data confidentiality techniques and mechanisms exist, the most notable being access control systems and encryption. Both techniques have been widely investigated. However, access control systems for big data require approaches for the following:

Merging large numbers of access control policies. In many cases, big data entails integrating data originating from multiple sources; these data may be associated with their own access control policies (referred to as “sticky policies”), and these policies must be enforced even when the data is integrated with other data. Therefore, policies need to be integrated and conflicts resolved.
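One way such conflicts could be resolved when merging sticky policies is a deny-overrides strategy, one common choice among several (permit-overrides and more elaborate schemes also exist). The policy representation below is a hypothetical simplification for illustration.

```python
# Toy merge of per-source sticky policies with deny-overrides conflict
# resolution: if any source denies, the merged decision is DENY;
# otherwise any PERMIT wins; with no opinion at all, default to DENY.

def merge_policies(policies):
    """Combine per-source decision functions into one decision function."""
    def decide(subject, action, resource):
        decisions = [p(subject, action, resource) for p in policies]
        if "DENY" in decisions:
            return "DENY"  # deny-overrides
        return "PERMIT" if "PERMIT" in decisions else "DENY"  # default deny
    return decide

# Two hypothetical source policies traveling with integrated data.
source_a = lambda s, a, r: "PERMIT" if s == "alice" else "DENY"
source_b = lambda s, a, r: "DENY" if r == "salary" else "PERMIT"

decide = merge_policies([source_a, source_b])
decide("alice", "read", "report")  # both sources permit
decide("alice", "read", "salary")  # source_b's sticky DENY wins
```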

Automatically administering authorizations for big data and in particular granting permissions. If fine-grained access control is required, manual administration on large datasets is not feasible. We need techniques by which authorizations can be automatically granted, possibly based on the user's digital identity, profile, and context, and on the data contents and metadata.
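The idea of deriving grants from user attributes and data metadata, rather than from per-record manual administration, can be sketched as follows. The attribute names, tags, and rule format are assumptions made up for this example.

```python
# Toy attribute-based grant: a small rule base maps (user attributes,
# data tag) combinations to permitted actions, so authorizations are
# derived automatically instead of being assigned record by record.

RULES = [
    # (required user attributes, required data tag, permitted action)
    ({"role": "analyst"}, "public", "read"),
    ({"role": "clinician", "unit": "icu"}, "patient-data", "read"),
]

def authorized(user, data_tags, action):
    """Grant if some rule's attributes are all present in the user profile."""
    for attrs, tag, act in RULES:
        if (act == action and tag in data_tags
                and all(user.get(k) == v for k, v in attrs.items())):
            return True
    return False

user = {"role": "clinician", "unit": "icu"}
authorized(user, {"patient-data"}, "read")  # granted by the second rule
```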

Enforcing access control policies on heterogeneous multimedia data. Content-based access control is an important type of access control by which authorizations are granted or denied based on the content of the data. It is critical when dealing with video surveillance applications, which are important for security; at the same time, such videos have to be protected for privacy. Supporting content-based access control requires understanding the content of the protected data, which is very challenging when dealing with large multimedia data sources.

Enforcing access control policies in big data stores. Some recent big data systems allow their users to submit arbitrary jobs written in programming languages such as Java. For example, in Hadoop, users can submit arbitrary MapReduce jobs written in Java. This creates significant challenges in enforcing fine-grained access control efficiently for different users. Although there is some existing work ([ULUS14]) that tries to inject access control policies into submitted jobs, more research is needed on how to efficiently enforce such policies in recently developed big data stores.
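The injection idea can be sketched at a high level: wrap the user-submitted map function so it only ever sees authorized records. Systems such as Vigiles [ULUS14] do this at a much lower level inside the platform; the record format and policy predicate below are hypothetical.

```python
# Toy policy injection for a MapReduce-style job: the user's mapper is
# wrapped so unauthorized records are filtered out before user code runs.

def wrap_with_policy(user_map, is_authorized):
    """Return a mapper that applies the policy before the user's code."""
    def guarded_map(record):
        if not is_authorized(record):
            return []  # unauthorized records are simply invisible
        return user_map(record)
    return guarded_map

# A user-submitted mapper: emit (word, 1) pairs per record.
def word_count_map(record):
    return [(w, 1) for w in record["text"].split()]

# Hypothetical policy: this user may not see records marked sensitive.
policy = lambda r: not r.get("sensitive", False)

mapper = wrap_with_policy(word_count_map, policy)
out = mapper({"text": "big data", "sensitive": False})
hidden = mapper({"text": "secret", "sensitive": True})
```

The challenge noted above is doing this efficiently and safely when the job is arbitrary compiled code rather than a cooperative Python function.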

Automatically designing, evolving, and managing access control policies. When dealing with dynamic environments where sources, users, and applications as well as the data usage are continuously changing, the ability to automatically design and evolve policies is critical to make sure that data is readily available for use while at the same time assuring data confidentiality. Environments and tools for managing policies are also crucial.

Privacy-preserving data correlation techniques: A major issue arising from big data is that by correlating many (big) datasets, one can extract unanticipated information. Relevant issues and research directions that need to be investigated include the following:

Techniques to control what is extracted and to check that what is extracted can be used and/or shared.

Support for both personal privacy and population privacy. In the case of population privacy, it is important to understand what is extracted from the data as this may lead to discrimination. Also, when dealing with security with privacy, it is important to understand the trade-off of personal privacy and collective security.

Efficient and scalable privacy-enhancing techniques. Several such techniques have been developed over the years, including oblivious RAM, secure multiparty computation, multi-input encryption, and homomorphic encryption. However, they are not yet practically applicable to large datasets. We need to engineer these techniques, using, for example, parallelization, to fine-tune their implementations, and perhaps combine them with other techniques, such as differential privacy (as in the case of the record linkage protocols described in [SCAN07]). A further possible approach in this respect is to first use anonymized/sanitized data, and then, depending on the specific situation, to obtain specific nonanonymized data.
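One of the techniques mentioned above, differential privacy, admits a compact illustration: the Laplace mechanism adds noise scaled to the query's sensitivity divided by ε. The dataset, predicate, and ε value below are illustrative choices, not drawn from any of the cited protocols.

```python
# Minimal Laplace mechanism for a counting query. A count has
# sensitivity 1 (one person changes the count by at most 1), so the
# noise scale is 1/epsilon.

import math
import random

def laplace_noise(scale):
    """Sample Laplace(0, scale) noise via inverse transform sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1, u) * math.log(1 - 2 * abs(u))

def dp_count(records, predicate, epsilon):
    """Differentially private count of records satisfying the predicate."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

data = [{"age": a} for a in (23, 31, 45, 52, 67)]
noisy = dp_count(data, lambda r: r["age"] >= 40, epsilon=0.5)
```

Smaller ε means more noise and stronger privacy, which is the parameter knob referred to in the multiobjective discussion earlier in this chapter.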

Usability of data privacy policies. Policies must be easily understood by users. We need tools for the average user, and we need to understand user expectations in terms of privacy.

Approaches for data services monetization. Instead of selling data, organizations owning datasets can sell privacy-preserving data analytic services based on these datasets. The question to be addressed then is: how would the business model around data change if privacy-preserving data analytic tools were available? Also, if data is considered a good to be sold, are there regulations concerning contracts for buying/selling data? Can these contracts include privacy clauses requiring, for example, that the users to whom the data pertains have been notified?

Data publication. Perhaps we should abandon the idea of publishing data, given the privacy implications, and instead require the user of the data to utilize a controlled environment (perhaps located in a cloud) for using the data. In this way, it would be much easier to control the proper use of the data. An open issue would be the case of research data used in universities and the repeatability of data-based research.

Privacy implications for data quality. Recent studies have shown that people lie, especially on social networks, because they are not sure that their privacy is preserved. This results in a decrease in data quality that in turn affects decisions and strategies based on these data.

Risk models. Different types of relationships between risks and big data can be identified: (a) big data can increase privacy risks, and (b) big data can reduce risks in many domains (e.g., national security). The development of models for these two types of risk is critical in order to identify suitable trade-offs and the privacy-enhancing techniques to be used.

Data ownership. The question of who owns a piece of data is often difficult to answer. It is perhaps better to replace this concept with that of stakeholder. Multiple stakeholders can be associated with each data item. The concept of stakeholder ties well with risks. Each stakeholder would have different (possibly conflicting) objectives, and this can be modeled according to multiobjective optimization. In some cases, a stakeholder may not be aware of the others. For example, a user to whom the data pertains (and who is thus a stakeholder in the data) may not be aware that a law enforcement agency is using the data. Technological solutions need to be investigated to resolve such conflicts.

Human factors. All solutions proposed for privacy, and for security with privacy, need to be investigated to determine the human involvement (e.g., how the user would interact with the data and carry out his/her specific tasks concerning the use and/or protection of the data) in order to enhance usability.

Data lifecycle framework. A comprehensive approach to privacy for big data needs to be based on a systematic data lifecycle approach. Phases in the lifecycle need to be identified, together with their privacy requirements and implications. Relevant phases include

Data acquisition: We need mechanisms and tools to prevent devices from acquiring data about other individuals (relevant when devices like Google Glass are used); for example, can we come up with mechanisms that automatically block devices from recording/acquiring data at certain locations, or that notify a user that recording devices are nearby? We also need techniques by which each recorded subject has a say about the use of the data.

Data sharing: Users need to be informed about their data being shared with or transferred to other parties.

Addressing the above challenges requires multidisciplinary research drawing from many different areas, including computer science and engineering, information systems, statistics, risk models, economics, social sciences, political sciences, human factors, and psychology. We believe that all these perspectives are needed to develop effective solutions to the problem of privacy in the era of big data as well as to reconcile security with privacy.

35.4.6 BDMA for Cyber Security

To protect important digital assets, organizations are investing in new cyber security tools that need to analyze big data ranging from log files to e-mail attachments to prevent, detect, and recover from cyber attacks [KAR14]. As a part of this workshop, we explored the following topics:

What is different about big data management analytics (BDMA) for cyber security? The workshop participants pointed out that BDMA for cyber security needs to deal with adaptive and malicious adversaries who can potentially launch attacks to avoid being detected (e.g., data poisoning, denial-of-service, and denial-of-information attacks). In addition, BDMA for cyber security needs to operate in high-volume environments (e.g., data coming from multiple intrusion detection systems and sensors) and high-noise environments (i.e., constantly changing normal system usage data mixed with stealthy advanced persistent threat-related data). One of the important points that came out of this discussion is that we need BDMA tools that can integrate data from hosts, networks, social networks, bug reports, mobile devices, and Internet of Things sensors to detect attacks.

What is the right BDMA architecture for cyber security? We also discussed whether we need different types of BDMA system architectures for cyber security. Based on the use cases discussed, participants felt that existing BDMA system architectures can be adapted for cyber security needs. One issue pointed out was that a successful BDMA system for cyber security must support real-time data analysis. For example, once a certain type of attack is known, the system needs to be updated to look for such attacks in real time, including re-examining historical data to see whether prior attacks have occurred.

Data sharing for BDMA for cyber security: It emerged quickly during our discussions that cyber security data needs to be shared both within and across organizations. In addition to the obvious privacy, security, and incentive issues in sharing cyber security data, participants felt that we need common languages and infrastructures to capture and share such data. For example, we need to represent certain low-level system information (e.g., memory and CPU states) so that it can be mapped to similar cyber security incidents.
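The "common language" point can be illustrated with a minimal structured incident record; real exchange formats such as STIX define far richer schemas, and the field names below are a hypothetical simplification.

```python
# Toy shareable incident record: a fixed schema lets organizations map
# low-level system state and indicators onto comparable incidents, and
# JSON serialization gives a transport-neutral exchange form.

import json

def make_incident(kind, indicators, low_level_state):
    return {
        "kind": kind,                     # e.g., "credential-theft"
        "indicators": indicators,         # shareable observables
        "system_state": low_level_state,  # memory/CPU context at detection
    }

incident = make_incident(
    "credential-theft",
    ["198.51.100.7", "evil.example.com"],
    {"cpu_pct": 93, "resident_mem_mb": 4096},
)
shared = json.dumps(incident)             # serialized form for exchange
```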

BDMA for preventing cyber attacks: There was substantial discussion on how BDMA tools could be used to prevent attacks. One idea that emerged is that BDMA systems that can easily track sensitive data using captured provenance information can potentially detect attacks before too much sensitive information is disclosed. Based on this observation, provenance-aware BDMA systems will be needed for cyber attack prevention. Also, BDMA tools for cyber security can potentially mine useful attacker information, such as motivations, technical capabilities, and modus operandi, to prevent future attacks.

BDMA for digital forensics: BDMA techniques could be used for digital forensics by combining or linking different data sources. The main challenge that emerged was identifying the right data sources for digital forensics. In addition, answers to the following questions were not immediately clear: What data to capture? What to filter out (big noise in big data)? What data to link? What data to store, and for how long? How to deal with machine-generated content and the Internet of Things?

BDMA for understanding the users of the cyber systems: Participants believed that BDMA could be used to mine human behavior to learn how to improve systems. For example, an organization may send phishing e-mails to its users and carry out security retraining for those who are fooled by such a phishing attack. In addition, BDMA techniques could be used to build a normal behavior model per user and find significant deviations from the norm.
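The per-user behavior model idea can be sketched minimally: build a baseline from a user's historical activity counts and flag observations that deviate strongly. The activity metric (daily logins) and the z-score threshold are assumed example values.

```python
# Toy per-user baseline: flag today's activity count if it lies more
# than `threshold` population standard deviations from the historical
# mean. Real systems would use richer features and models.

import statistics

def is_anomalous(history, today, threshold=3.0):
    """Return True if today's count deviates strongly from the baseline."""
    mean = statistics.mean(history)
    std = statistics.pstdev(history)
    if std == 0:
        return today != mean  # any change from a constant baseline is odd
    return abs(today - mean) / std > threshold

logins = [4, 5, 3, 6, 5, 4, 5]  # a user's daily login counts
is_anomalous(logins, 5)         # typical day, not flagged
is_anomalous(logins, 40)        # large deviation, flagged
```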

Overall, during our workshop discussions, it became clear that all of the above topics raise significant research challenges and that more research needs to be done to address them. Furthermore, regardless of whether we are using BDMA for cyber security or for other applications (e.g., healthcare, finance), it is critical to design scalable BDMA solutions. These include parallel BDMA techniques as well as BDMA techniques implemented on cloud platforms such as Hadoop/MapReduce, Storm, and Spark. In addition, we need to explore the use of BDMA systems such as HBase and CouchDB for various applications.

35.5 Summary and Directions

This chapter has explored the issues surrounding BDSP as well as applying BDMA techniques for cyber security. As massive amounts of data are being collected, stored, manipulated, merged, analyzed, and expunged, security and privacy concerns will explode. We need to develop technologies to address security and privacy issues throughout the lifecycle of the data. However, technologies alone will not be sufficient. We need not only to understand the societal impact of data collection, use, and analysis, but also to formulate appropriate laws and policies for such activities. Our workshop explored initial directions for addressing some of the major challenges we face today. We need an interdisciplinary approach in which technologists, application specialists, social scientists, policy analysts, and lawyers work together to come up with viable and practical solutions.

This chapter has described the security and privacy issues for big data as well as the issues that need to be investigated for applying BDMA techniques to cyber security. We have also provided summaries of the workshop presentations and discussions. We made a submission to the National Privacy Research Strategy on October 16, 2014 that was based on the workshop summary. We also participated in the National Privacy Research Strategy Conference in Washington, DC, February 18–20, 2015, and gave a presentation of the workshop summary at this event. Our goal is to build a community in BDSP. In addition, the National Institute of Standards and Technology (NIST) has developed a report on security and privacy for big data [NIST1]. This report was published by NIST's BDSP Working Group, which is part of NIST's Big Data Working Group. Furthermore, NIST is also developing NICE (the National Initiative for Cybersecurity Education) [NIST2], a framework for cyber security education. We need to incorporate BDSP, as well as BDMA for cyber security, into this framework. Therefore, it is important for the different agencies to continue to work together and develop strategies not only for research but also for education in BDSP as well as in applying BDMA to cyber security.

References

[BERT12]. E. Bertino, Data Protection from Insider Threats, Morgan & Claypool, 2012.

[BERT14]. E. Bertino, “Security with Privacy—Opportunities and Challenges” Panel Statement, COMPSAC, Vasteras, Sweden, pp. 436–437, 2014.

[DAMI07]. M. Damiani, E. Bertino, B. Catania, P. Perlasca, “GEO-RBAC: A Spatially Aware RBAC,” ACM Transactions on Information and System Security, 10 (1), Article No. 2, 2007.

[HAML14]. S. Khan, K. Hamlen, M. Kantarcioglu, “Silver Lining: Enforcing Secure Information Flow at the Cloud Edge,” IC2E, Boston, MA, pp. 37–46, 2014.

[INAN08]. A. Inan, M. Kantarcioglu, E. Bertino, M. Scannapieco, “A Hybrid Approach to Private Record Linkage,” ICDE, Cancun, Mexico, pp. 496–505, 2008.

[INAN12]. A. Inan, M. Kantarcioglu, G. Ghinita, E. Bertino, “A Hybrid Approach to Private Record Matching,” IEEE Transactions on Dependable Secure Computing (TDSC), 9 (5), 684–698, 2012.

[KAR14]. S. Kar, “Gartner Report: Big Data will Revolutionize Cyber Security in the Next Two Years,” cloudtimes.org, Feb. 12, 2014.

[KUZU13]. M. Kuzu et al. “Efficient Privacy-Aware Record Integration,” In Proceedings of Joint 2013 EDBT/ICDT Conferences, EDBT’13, Genoa, Italy, Mar. 18–22, ACM, 2013.

[LIND00]. Y. Lindell and B. Pinkas, “Privacy Preserving Data Mining,” In Advances in Cryptology, Aug. 20–24, Springer-Verlag, Berlin, 2000.

[LIU14]. D. Liu, E. Bertino, X. Yi, “Privacy of Outsourced K-Means Clustering,” In Proceedings of the 9th ACM Symposium on Information, Computer and Communication Security, Jun. 4–6, Kyoto, Japan, pp. 123–134, 2014.

[NITRD]. http://csi.utdallas.edu/events/NSF/NPRS%20Workshop%20Presentation.pdf

[NPRS]. https://www.nitrd.gov/cybersecurity/nprsrfi102014/BigData-SP.pdf

[NSF]. http://csi.utdallas.edu/events/NSF/NSF%20workshop%202014.htm

[NIST1]. https://bigdatawg.nist.gov/

[NIST2]. https://www.nist.gov/itl/applied-cybersecurity/nice

[SCAN07]. M. Scannapieco, I. Figotin, E. Bertino, A. Elmagarmid, “Privacy Preserving Schema and Data Matching,” In Proceedings of 2007 ACM SIGMOD International Conference on Management of Data, Beijing, China, pp. 653–664, 2007.

[THUR02]. B. Thuraisingham, “Data Mining, National Security, Privacy and Civil Liberties,” SIGKDD Explorations, 4 (2), 1–5, 2002.

[ULUS14]. H. Ulusoy et al., “Vigiles: Fine-Grained Access Control for MapReduce Systems,” In 2014 IEEE International Congress on Big Data (BigData Congress), Anchorage, AK, pp. 40–47, 2014.

[VAID06]. J. Vaidya, Y. Zhu, C. Clifton, “Privacy Preserving Data Mining,” Advances in Information Security, 19, Springer, New York, pp. 1–121, 2006.

[WANG14]. H.X. Wang, K. Nayak, C. Liu, E. Shi, E. Stefanov, Y. Huang, “Oblivious Data Structures,” IACR Cryptology ePrint Archive, 185, 2014.
