5.1 Introduction

Chapters 2 through 4 discussed concepts in data security and privacy, data mining, and data mining for cyber security. These three supporting technologies are part of the foundation for the concepts discussed in this book. For example, Section II describes stream data analytics for large datasets. In particular, we discuss an innovative technique called “novel class detection,” in which we integrate data mining with stream data management technologies. Section III describes our approach to applying the stream mining techniques discussed in Section II to insider threat detection. We utilize the cloud platform for managing and analyzing large datasets, and we will see throughout this book that cloud computing is at the heart of managing such datasets. In addition, some of our experimental systems, to be discussed in Section IV, utilize semantic web technologies. Therefore, in this chapter, we discuss two additional technologies that we have used in several chapters of this book: cloud computing and semantic web technologies.

Cloud computing has emerged as a powerful computing paradigm for service-oriented computing. Many computing services are being outsourced to the cloud. Such cloud-based services can be used to host various cyber security applications such as insider threat detection and identity management. Another concept that is being used for a variety of applications is the notion of the semantic web. The semantic web is essentially a collection of technologies for producing machine-understandable web pages. These technologies can also be used to represent any type of data, including schemas for big data and malware data. Some of our analytics and security investigations are based on data represented using semantic web technologies.

The organization of this chapter is as follows. Section 5.2 discusses cloud computing concepts. Concepts in semantic web are discussed in Section 5.3. Semantic web and security concepts are discussed in Section 5.4. Cloud computing frameworks based on semantic web are discussed in Section 5.5. This chapter is concluded in Section 5.6. Figure 5.1 illustrates the concepts discussed in this chapter.


Figure 5.1 Cloud computing and semantic web technologies.

5.2 Cloud Computing

5.2.1 Overview

The emerging cloud computing model attempts to address the explosive growth of web-connected devices and to handle massive amounts of data. It is defined and characterized by massive scalability and new Internet-driven economics. Google introduced the MapReduce framework for processing large amounts of data on commodity hardware. Apache’s Hadoop Distributed File System (HDFS) is emerging as a superior software component for cloud computing, combined with integrated parts such as MapReduce ([HDFS], [DEAN04]). Clouds such as HP’s Open Cirrus Testbed are utilizing HDFS. This in turn has resulted in numerous social networking sites with massive amounts of data to be shared and managed. For example, we may want to analyze multiple years of stock market data statistically to reveal a pattern, or to build a reliable weather model based on several years of weather and related data. To handle such massive amounts of data distributed at many sites (i.e., nodes), scalable hardware and software components are needed.

In this chapter, we will discuss some preliminaries in cloud computing and semantic web. We will first introduce what is meant by cloud computing. While various definitions have been proposed, we will adopt the definition provided by the National Institute of Standards and Technology (NIST). This will be followed by a service-based paradigm for cloud computing. Next, we will discuss the various key concepts including virtualization and data storage in the cloud. We will also discuss some of the technologies such as Hadoop and MapReduce.

The organization of this section is as follows. Cloud computing preliminaries will be discussed in Section 5.2.2. Virtualization will be discussed in Section 5.2.3. Cloud storage and data management issues will be discussed in Section 5.2.4. Cloud computing tools will be discussed in Section 5.2.5. Figure 5.2 illustrates the components addressed in this section.


Figure 5.2 Cloud computing components.

5.2.2 Preliminaries

As stated in [CLOUD], cloud computing delivers computing as a service, while in traditional computing, it is provided in the form of a product. Therefore, users pay for the services based on a pay-as-you-go model. The services provided by a cloud may include hardware services, systems services, data services, and storage services. Users of the cloud need not know where the software and data are located; that is, the software and data services provided by the cloud are transparent to the user. NIST has defined cloud computing to be the following [NIST]:

Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.

The cloud model is composed of multiple deployment models and service models. These models are described next.

Cloud Deployment Models

There are multiple deployment models for cloud computing. These include the public cloud, community cloud, hybrid cloud, and private cloud. In a public cloud, the service provider typically provides services over the World Wide Web that can be accessed by the general public. Such a cloud may provide free services or pay-as-you-go services. In a community cloud, a group of organizations get together and develop a cloud. These organizations may have a common objective to provide features such as security and fault tolerance. The cost is shared among the organizations. Furthermore, the cloud may be hosted by the organizations or by a third party. A private cloud is a cloud infrastructure developed specifically for an organization. It could be hosted by the organization or by a third party. A hybrid cloud consists of a combination of public and private clouds. In a hybrid cloud, an organization may use the private cloud for highly sensitive services, while using the public cloud for less sensitive services and taking advantage of what the World Wide Web has to offer. Kantarcioglu and his colleagues have stated that the hybrid cloud is the deployment model of the future [KHAD12a]. Figure 5.3 illustrates the cloud deployment models.


Figure 5.3 Cloud deployment models.

Service Models

As stated earlier, cloud computing provides a variety of services. These include Infrastructure as a Service (IaaS), Platform as a Service (PaaS), Software as a Service (SaaS), and Data as a Service (DaaS). In IaaS, the cloud provides a collection of hardware and networks for use by the general public or organizations. The users install operating systems and software to run their applications, and they are billed according to the resources they utilize. In PaaS, the cloud provider supplies systems software such as operating systems (OS) and execution environments. The users load their applications and run them on the hardware and software infrastructure provided by the cloud. In SaaS, the cloud provider supplies the applications for the users to run. These could be, say, billing applications, tax computing applications, and sales tools. The cloud users access the applications through cloud clients. In DaaS, the cloud provides data to the cloud users. Data may be stored in data centers that are accessed by the cloud users. Note that while DaaS used to denote Desktop as a Service, more recently it denotes Data as a Service. Figure 5.4 illustrates the service models.


Figure 5.4 Cloud service models.

5.2.3 Virtualization

Virtualization essentially means creating something virtual rather than actual. The thing virtualized could be hardware, software, memory, or data. The notion of virtualization has existed for decades with respect to computing. Back in the 1960s, the concept of virtual memory was introduced. Virtual memory gives the application program the illusion that it has contiguous working memory. A mapping translates the virtual memory to the actual physical memory.
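
To make the virtual-to-physical mapping concrete, the following Python sketch translates addresses through a small page table. It is only an illustration under stated assumptions: the 4096-byte page size, the `page_table` contents, and the `translate` function are hypothetical and do not describe any particular operating system.

```python
PAGE_SIZE = 4096  # bytes per page (an assumed, commonly used size)

# Hypothetical page table: virtual page number -> physical frame number.
# The virtual pages are contiguous (0, 1, 2); the physical frames are not.
page_table = {0: 7, 1: 2, 2: 9}


def translate(virtual_address: int) -> int:
    """Map a virtual address to a physical address via the page table."""
    page, offset = divmod(virtual_address, PAGE_SIZE)
    if page not in page_table:
        raise LookupError(f"page fault: virtual page {page} is not mapped")
    return page_table[page] * PAGE_SIZE + offset


if __name__ == "__main__":
    for address in (0, 4100, 8192):
        print(address, "->", translate(address))
```

The program sees addresses 0 through 12287 as one contiguous range, even though the three pages sit in unrelated physical frames.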

Hardware virtualization is a basic notion in cloud computing. It creates virtual machines hosted on a real computer with an OS. This means that while the actual machine may be running a Windows OS, through virtualization it may provide a Linux machine to the users. The actual machine is called the host machine, while the virtual machine is called the guest machine. The virtual machine monitor, also known as the hypervisor, is the software that runs the virtual machine on the host computer.

Other types of virtualization include OS-level virtualization, storage virtualization, data virtualization, database virtualization, and network virtualization. In OS-level virtualization, multiple virtual environments are created within a single OS. In storage virtualization, the logical storage is abstracted from the physical storage. In data virtualization, the data is abstracted from the underlying databases. In network virtualization, a virtual network is created. Figure 5.5 illustrates the various types of virtualization.


Figure 5.5 Types of virtualization.

As we have stated earlier, at the heart of cloud computing is the notion of the hypervisor, or virtual machine monitor. Hardware virtualization techniques allow multiple OSs (called guests) to run concurrently on a host computer. These multiple OSs share virtualized hardware resources. Hypervisor is not a new term; it was first used in the mid-1960s in the IBM 360/65 machines. There are different types of hypervisors. In one type, the hypervisor runs on the host hardware and manages the guest OSs. Both VMware and Xen, which are popular virtual machine monitors, are based on this model. In another type, the hypervisor runs within a conventional OS environment. Virtual machines are also incorporated into embedded systems and mobile phones. Embedded hypervisors have real-time processing capability. Some details of virtualization are provided in [VIRT].

5.2.4 Cloud Storage and Data Management

In a cloud storage model, the service providers store massive amounts of data for customers in data centers. Those who require storage space lease it from the service providers, who are the hosting companies. The actual location of the data is transparent to the users. What is presented to the users is virtualized storage; the storage managers map the virtual storage to the actual storage and manage the data resources for the customers. A single object (e.g., the entire video database of a customer) may be stored in multiple locations, and each location may store objects for multiple customers. Figure 5.6 illustrates cloud storage management.


Figure 5.6 Cloud storage management.
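
The mapping from virtualized storage to physical locations can be sketched with a toy storage manager. Everything in the example is an assumption made for illustration: the `StorageManager` class, the data center names, the replica count, and the least-loaded placement rule are hypothetical and are not drawn from any real cloud provider.

```python
from collections import defaultdict


class StorageManager:
    """Toy storage manager: maps logical objects to physical locations."""

    def __init__(self, locations, replicas=2):
        self.locations = list(locations)
        self.replicas = replicas
        self.placement = {}                # (customer, object) -> locations
        self.contents = defaultdict(set)   # location -> {(customer, object)}

    def store(self, customer, obj):
        """Place an object on the least-loaded locations (ties: list order)."""
        ranked = sorted(self.locations, key=lambda loc: len(self.contents[loc]))
        chosen = ranked[: self.replicas]
        self.placement[(customer, obj)] = chosen
        for loc in chosen:
            self.contents[loc].add((customer, obj))
        return chosen

    def locate(self, customer, obj):
        """Resolve a logical object name to its physical locations."""
        return self.placement[(customer, obj)]


if __name__ == "__main__":
    manager = StorageManager(["dc-east", "dc-west", "dc-south"])
    print(manager.store("alice", "videos"))
    print(manager.store("bob", "logs"))
```

A customer only ever names the object; the manager decides where the copies live, so one object can occupy several locations and one location can hold objects for several customers.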

Virtualizing cloud storage has many advantages. Users need not purchase expensive storage devices. Data could be placed anywhere in the cloud. Maintenance such as backup and recovery is provided by the cloud. The goal is for users to have rapid access to the cloud. However, because the owners of the data do not have complete control of their data, there are serious security concerns with respect to storing data in the cloud.

A database manager that runs on the cloud is a cloud database manager. There are multiple ways to utilize a cloud database manager. In the first model, users purchase a virtual machine image and run the database on the virtual machines. The second model is the database as a service model, in which the service provider maintains the databases. The users make use of the database services and pay for the service. An example is the Amazon Relational Database Service, which is a Structured Query Language (SQL) database service with a MySQL interface [AMAZ]. In the third model, the cloud provider hosts a database on behalf of the user. Users can either utilize the database service maintained by the cloud or run their own databases on the cloud. A cloud database must optimize its query, storage, and transaction processing to take full advantage of the services provided by the cloud. Figure 5.7 illustrates cloud data management.


Figure 5.7 Cloud data management.

5.2.5 Cloud Computing Tools

Processing large volumes of provenance data requires sophisticated methods and tools. In recent years, cloud computing tools, such as the cloud-enabled NoSQL systems MongoDB and CouchDB as well as frameworks such as Hadoop, have offered appealing alternatives and great promise for systems with high availability, scalability, and elasticity ([CATT11], [CHOD10], [ANDE10], [WHIT10]). In this section, we briefly survey these systems and their applicability and usefulness for processing large-scale datasets. More details on some of these systems will be provided in Chapter 7.

Apache Hadoop

Apache Hadoop is an open source software framework that allows batch processing tasks to be performed on vast quantities of data [WHIT10]. Hadoop uses HDFS, a Java-based open source distributed file system whose design is modeled on the Google File System. HDFS provides several advantages such as data replication and fault tolerance [GHEM03]. HDFS uses a master/slave architecture that consists of a single namenode process (running on the master node) and several datanode processes (usually one per slave node).

MapReduce
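
A MapReduce job runs a map phase, a sort phase, and a reduce phase over data blocks held by the worker nodes. As a rough illustration only, the Python sketch below imitates those phases in a single process for a word-count task. The function names (`map_phase`, `sort_phase`, `reduce_phase`, `run_job`) are hypothetical and are not part of the Hadoop API; a real job would distribute the blocks across nodes.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(block):
    """Emit a (word, 1) pair for every word in one input block."""
    return [(word.lower(), 1) for word in block.split()]


def sort_phase(pairs):
    """Sort the intermediate pairs by key so equal keys become adjacent."""
    return sorted(pairs, key=itemgetter(0))


def reduce_phase(sorted_pairs):
    """Aggregate the values that share a key."""
    return {
        key: sum(value for _, value in group)
        for key, group in groupby(sorted_pairs, key=itemgetter(0))
    }


def run_job(blocks):
    """Run map on every block, then sort and reduce the combined output."""
    intermediate = []
    for block in blocks:  # in a cluster, each block would sit on its own node
        intermediate.extend(map_phase(block))
    return reduce_phase(sort_phase(intermediate))


if __name__ == "__main__":
    print(run_job(["the cloud stores data", "the data is big"]))
```

The sort step matters because `groupby` only merges adjacent items; this mirrors the way the framework brings all values for a key together before a reducer sees them.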

A MapReduce job consists of three phases: (1) a “map” phase, in which each slave node performs some computation on the data blocks of the input that it has stored; the output of this phase is a set of key–value pairs based on the computation that is performed; (2) an intermediate “sort” phase, in which the output of the map phase is sorted based on keys; and (3) a “reduce” phase, in which a reducer aggregates the various values for a shared key and then further processes them before producing the desired result.

CouchDB

Apache CouchDB is a distributed, document-oriented database which can be queried and indexed in a MapReduce fashion [ANDE10]. Data is managed as a collection of JSON documents [CROC06]. Users can access the documents with a web browser via HTTP, and they can query, combine, and transform documents with JavaScript.

HBase

Apache HBase is a distributed, versioned, column-oriented store modeled after Google’s Bigtable, written in Java. Organizations such as Mendeley, Facebook, and Adobe are using HBase [GEOR11].

MongoDB

MongoDB is an open source, schema-free, (JSON) document-oriented database written in C++ [CHOD10]. It is developed and supported by 10gen and is part of the NoSQL family of database systems. MongoDB stores structured data as JSON-like documents with dynamic schemas (MongoDB calls the format BSON), making the integration of data in certain types of applications easier and faster.

Hive

Apache Hive is a data warehousing framework that provides the ability to manage, query, and analyze large datasets stored in HDFS or HBase [THUS10]. Hive provides basic tools to perform extract-transform-load (ETL) operations over data, project structure onto the extracted data, and query the structured data using a SQL-like language called HiveQL. HiveQL performs query execution using the MapReduce paradigm, while allowing advanced Hadoop programmers to plug in their custom-built MapReduce programs to perform advanced analytics not supported by the language. Some of the design goals of Hive include dynamic scale-out, user-defined analytics, fault tolerance, and loose coupling with input formats.

Apache Cassandra

Apache Cassandra is an open source distributed database management system [HEWI10]. Apache Cassandra is a fault tolerant, distributed data store which offers linear scalability allowing it to be a storage platform for large high-volume websites. Cassandra is designed to handle big data workloads across multiple nodes with no single point of failure. Its architecture is based on the understanding that system and hardware failures can and do occur.

5.3 Semantic Web

As we have mentioned earlier in this chapter, some of the experimental big data systems we have developed utilize cloud and semantic web technologies. While cloud computing was the subject of Section 5.2, in this section we provide an overview of semantic web technologies.

While the current web technologies facilitate the integration of information from a syntactic point of view, there is still a lot to be done to handle the different semantics of various systems and applications. That is, current web technologies depend a lot on the “human-in-the-loop” for information management and integration. Tim Berners-Lee, the father of the World Wide Web, realized the inadequacies of current web technologies and subsequently strived to make the web more intelligent. His goal was to have a web that would essentially relieve humans of the burden of having to integrate disparate information sources as well as to carry out extensive searches. He then came to the conclusion that one needs machine-understandable web pages and the use of ontologies for information integration. This resulted in the notion of the semantic web [LEE01]. The web services that take advantage of semantic web technologies are semantic web services.

A semantic web can be thought of as a web that is highly intelligent and sophisticated, so that one needs little or no human intervention to carry out tasks such as scheduling appointments, coordinating activities, searching for complex documents, and integrating disparate databases and information systems. While much progress has been made toward developing such an intelligent web, there is still a lot to be done. For example, technologies such as ontology matching, intelligent agents, and markup languages are contributing a lot toward developing the semantic web. Nevertheless, one still needs the human to make decisions and take actions. Since the 2000s, there have been many developments on the semantic web. The World Wide Web Consortium (W3C) is specifying standards for the semantic web [W3C]. These standards include specifications for XML (eXtensible Markup Language), RDF (Resource Description Framework), and interoperability.

Figure 5.8 illustrates the layered technology stack for the semantic web. This is the stack that was developed by Tim Berners-Lee. Essentially the semantic web consists of layers where each layer takes advantage of the technologies of the previous layer. The lowest layer is the protocol layer, and this is usually not included in the discussion of the semantic technologies. The next layer is the XML layer. XML is a document representation language. While XML is sufficient to specify syntax, semantics such as “the creator of document D is John” are hard to specify in XML. Therefore, the W3C developed RDF, which uses XML syntax. The semantic web community then went further and came up with a specification of ontologies in languages such as the Web Ontology Language (OWL). Note that OWL addresses the inadequacies of RDF. In order to reason about various policies, the semantic web community has come up with web rule languages such as the Semantic Web Rule Language (SWRL). Next, we will describe the various technologies that constitute the semantic web.


Figure 5.8 Technology stack for the semantic web.

5.3.1 XML

XML is needed due to the limitations of the Hypertext Markup Language (HTML) and the complexities of the Standard Generalized Markup Language (SGML). XML is an extensible markup language specified by the W3C and designed to make the interchange of structured documents over the Internet easier. An important aspect of XML used to be Document Type Definitions, which define the role of each element of text in a formal model. XML schemas have now become critical to specify the structure of data. XML schemas are also XML documents [BRAY97].

5.3.2 RDF

RDF is a standard for describing resources on the semantic web. It provides a common framework for expressing information about resources so that it can be exchanged between applications without loss of meaning. RDF is based on the idea of identifying things using web identifiers (called uniform resource identifiers (URIs)), and describing resources in terms of simple properties and property values [KLYN04].

The RDF terminology $T$ is the union of three pairwise disjoint infinite sets of terms: the set $U$ of URI references, the set $L$ of literals (itself partitioned into two sets, the set $L_p$ of plain literals and the set $L_t$ of typed literals), and the set $B$ of blanks. The set $U \cup L$ of names is called the vocabulary.

Definition 5.1 (RDF Triple)

An RDF triple $(s, p, o)$ is an element of $(U \cup B) \times U \times T$. An RDF graph is a finite set of triples.

An RDF triple can be viewed as an arc from $s$ to $o$, where $p$ is used to label the arc. This is represented as $s \xrightarrow{p} o$. We also refer to the components of the ordered triple $(s, p, o)$ as the subject, predicate, and object of the triple.
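
The typing constraint on triples can be checked mechanically. The sketch below models the three kinds of terms as small Python classes and tests whether a candidate triple lies in (U ∪ B) × U × T. The class names, the `is_valid_triple` helper, and the `example.org` document URI are hypothetical; the `creator` property URI is the Dublin Core one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class URI:
    value: str


@dataclass(frozen=True)
class Literal:
    value: str


@dataclass(frozen=True)
class Blank:
    label: str


def is_valid_triple(s, p, o):
    """Check (s, p, o) against (U ∪ B) × U × T, where T = U ∪ L ∪ B."""
    return (
        isinstance(s, (URI, Blank))
        and isinstance(p, URI)
        and isinstance(o, (URI, Literal, Blank))
    )


# Hypothetical example data: "the creator of document D is John".
doc = URI("http://example.org/docD")
creator = URI("http://purl.org/dc/elements/1.1/creator")
graph = {(doc, creator, Literal("John"))}  # an RDF graph is a set of triples

if __name__ == "__main__":
    print(all(is_valid_triple(*triple) for triple in graph))
```

Because the term classes are frozen dataclasses they are hashable, so a graph can be held as an ordinary Python set, matching the definition of a graph as a finite set of triples.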

RDF has a formal semantics which provides a dependable basis for reasoning about the meaning of an RDF graph. This reasoning is usually called entailment. Entailment rules state which implicit information can be inferred from explicit information. In general, it is not assumed that complete information about any resource is available in an RDF query. A query language should be aware of this and tolerate incomplete or contradicting information. The notion of class and operations on classes are specified in RDF through the concept of RDF schema [ANTO08].
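
As an illustration of how entailment rules derive implicit information, the sketch below applies two RDF Schema rules until nothing new can be inferred: rdfs9 (an instance of a class is also an instance of its superclasses) and rdfs11 (subClassOf is transitive). The triples use abbreviated names as plain strings, and the `rdfs_closure` function and the `ex:` data are hypothetical; a full RDFS reasoner implements many more rules.

```python
RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"


def rdfs_closure(graph):
    """Apply rules rdfs9 and rdfs11 until no new triple can be inferred."""
    inferred = set(graph)
    while True:
        new = set()
        for (s, p, o) in inferred:
            for (s2, p2, o2) in inferred:
                if p2 != SUBCLASS or o != s2:
                    continue
                if p == RDF_TYPE:      # rdfs9: instances inherit superclasses
                    new.add((s, RDF_TYPE, o2))
                elif p == SUBCLASS:    # rdfs11: subClassOf is transitive
                    new.add((s, SUBCLASS, o2))
        if new <= inferred:
            return inferred
        inferred |= new


EXPLICIT = {
    ("ex:alice", RDF_TYPE, "ex:Analyst"),
    ("ex:Analyst", SUBCLASS, "ex:Employee"),
    ("ex:Employee", SUBCLASS, "ex:Person"),
}

if __name__ == "__main__":
    for triple in sorted(rdfs_closure(EXPLICIT) - EXPLICIT):
        print(triple)
```

The closure contains the fact that `ex:alice` is a person even though no explicit triple says so, which is exactly the kind of implicit information entailment makes available to a query.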

5.3.3 SPARQL

SPARQL (SPARQL Protocol and RDF Query Language) [PRUD06] is a powerful query language. It is a key semantic web technology and was standardized by the RDF Data Access Working Group of the W3C. SPARQL syntax is similar to SQL, but it has the advantage that its queries can span multiple disparate data sources that consist of heterogeneous and semistructured data. SPARQL is based around graph pattern matching [PRUD06].

Definition 5.2 (Graph Pattern)

A SPARQL graph pattern expression is defined recursively as follows:

1. A triple pattern is a graph pattern.

2. If $P_1$ and $P_2$ are graph patterns, then the expressions ($P_1$ AND $P_2$), ($P_1$ OPT $P_2$), and ($P_1$ UNION $P_2$) are graph patterns.

3. If $P$ is a graph pattern, $V$ is a set of variables, and $X \in U \cup V$, then ($X$ GRAPH $P$) is a graph pattern.

4. If $P$ is a graph pattern and $R$ is a built-in SPARQL condition, then the expression ($P$ FILTER $R$) is a graph pattern.
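
The recursive definition maps naturally onto a recursive evaluator. The sketch below is a simplified illustration, not a SPARQL engine: patterns are nested Python tuples, variables are strings beginning with `?`, the GRAPH operator is omitted, and a FILTER condition is an ordinary Python predicate over a binding. All names and the `ex:` data are hypothetical.

```python
def is_var(term):
    return isinstance(term, str) and term.startswith("?")


def match_triple(pattern, graph):
    """Return each variable binding under which the pattern is in the graph."""
    solutions = []
    for triple in graph:
        binding = {}
        for pat_term, term in zip(pattern, triple):
            if is_var(pat_term):
                if binding.setdefault(pat_term, term) != term:
                    break
            elif pat_term != term:
                break
        else:
            solutions.append(binding)
    return solutions


def compatible(m1, m2):
    """Two bindings are compatible if they agree on every shared variable."""
    return all(m2[var] == val for var, val in m1.items() if var in m2)


def join(left, right):
    return [{**m1, **m2} for m1 in left for m2 in right if compatible(m1, m2)]


def evaluate(expr, graph):
    """Evaluate a nested-tuple graph pattern over a set of triples."""
    op = expr[0]
    if op == "TRIPLE":
        return match_triple(expr[1], graph)
    if op == "AND":
        return join(evaluate(expr[1], graph), evaluate(expr[2], graph))
    if op == "UNION":
        return evaluate(expr[1], graph) + evaluate(expr[2], graph)
    if op == "OPT":  # keep left bindings even when the right side is missing
        left, right = evaluate(expr[1], graph), evaluate(expr[2], graph)
        kept = [m for m in left if not any(compatible(m, r) for r in right)]
        return join(left, right) + kept
    if op == "FILTER":
        return [m for m in evaluate(expr[1], graph) if expr[2](m)]
    raise ValueError(f"unsupported operator: {op}")


GRAPH_DATA = {
    ("ex:alice", "ex:worksFor", "ex:acme"),
    ("ex:bob", "ex:worksFor", "ex:acme"),
    ("ex:alice", "ex:email", "alice@example.org"),
}
QUERY = ("OPT",
         ("TRIPLE", ("?p", "ex:worksFor", "ex:acme")),
         ("TRIPLE", ("?p", "ex:email", "?e")))

if __name__ == "__main__":
    for solution in evaluate(QUERY, GRAPH_DATA):
        print(solution)
```

The OPT case shows how the language tolerates incomplete information: the employee without an email address still appears in the result, just without a binding for `?e`.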

5.3.4 OWL

OWL [MCGU04] is an ontology language that has more expressive power and reasoning capabilities than RDF and RDF schema (RDF-S). It has additional vocabulary along with a formal semantics. OWL has three increasingly expressive sublanguages: OWL Lite, OWL DL, and OWL Full. These are designed for use by specific communities of implementers and users. The formal semantics in OWL is based on description logics (DL), which are decidable fragments of first-order logic.

5.3.5 Description Logics

DL is a family of knowledge representation (KR) formalisms that represent the knowledge of an application domain [BAAD03]. It defines the concepts of the domain (i.e., its terminology) as sets of objects called classes, and it uses these concepts to specify properties of objects and individuals occurring in the domain. A DL is characterized by a set of constructors that allow one to build complex concepts and roles from atomic ones.

$\mathcal{ALCQ}$: A DL language $\mathcal{ALCQ}$ consists of a countable set of individuals Ind, a countable set of atomic concepts CS, a countable set of roles RS, and the concepts built on CS and RS as follows:

$C, D := A \mid \neg A \mid C \sqcap D \mid C \sqcup D \mid \exists R.C \mid \forall R.C \mid (\leq n\, R.C) \mid (\geq n\, R.C)$

where $A \in CS$, $R \in RS$, $C$ and $D$ are concepts, and $n$ is a natural number. Individuals are denoted by lowercase letters of the alphabet (e.g., a, b, c, …).

This language includes only concepts in negation normal form. The complement of a concept, $\neg(C)$, is inductively defined, as usual, by using the law of double negation, De Morgan's laws, and the dualities for quantifiers. Moreover, the constants $\top$ and $\bot$ abbreviate $A \sqcup \neg A$ and $A \sqcap \neg A$, respectively, for some $A \in CS$.

An interpretation $\mathcal{I}$ consists of a nonempty domain, $\Delta^{\mathcal{I}}$, and a mapping, $\cdot^{\mathcal{I}}$, that assigns

To each individual $a \in Ind$ an element $a^{\mathcal{I}} \in \Delta^{\mathcal{I}}$

To each atomic concept $A \in CS$ a set $A^{\mathcal{I}} \subseteq \Delta^{\mathcal{I}}$

To each role $R \in RS$ a relation $R^{\mathcal{I}} \subseteq \Delta^{\mathcal{I}} \times \Delta^{\mathcal{I}}$

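The interpretation is then extended to concepts as follows:

$(\neg A)^{\mathcal{I}} = \Delta^{\mathcal{I}} \setminus A^{\mathcal{I}}$

$(C \sqcap D)^{\mathcal{I}} = C^{\mathcal{I}} \cap D^{\mathcal{I}}$

$(C \sqcup D)^{\mathcal{I}} = C^{\mathcal{I}} \cup D^{\mathcal{I}}$

$(\exists R.C)^{\mathcal{I}} = \{x \in \Delta^{\mathcal{I}} \mid \exists y.\, (x, y) \in R^{\mathcal{I}} \wedge y \in C^{\mathcal{I}}\}$

$(\forall R.C)^{\mathcal{I}} = \{x \in \Delta^{\mathcal{I}} \mid \forall y.\, (x, y) \in R^{\mathcal{I}} \rightarrow y \in C^{\mathcal{I}}\}$

$(\leq n\, R.C)^{\mathcal{I}} = \{x \in \Delta^{\mathcal{I}} \mid \#\{y \mid (x, y) \in R^{\mathcal{I}} \wedge y \in C^{\mathcal{I}}\} \leq n\}$

$(\geq n\, R.C)^{\mathcal{I}} = \{x \in \Delta^{\mathcal{I}} \mid \#\{y \mid (x, y) \in R^{\mathcal{I}} \wedge y \in C^{\mathcal{I}}\} \geq n\}$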

We can define the notion of a knowledge base and its models. An $\mathcal{ALCQ}$ knowledge base is the union of the following.

1. A finite terminological set (TBox) of inclusion axioms that have the form $\top \sqsubseteq C$, where $C$ is called the inclusion concept.

2. A finite assertional set (ABox) of assertions of the form $a : C$ (concept assertion) or $(a, b) : R$ (role assertion), where $R$ is called the assertional role and $C$ is called the assertional concept.

We denote the set of individuals that appear in KB by Ind(KB). An interpretation $\mathcal{I}$ is a model of

An inclusion axiom $\top \sqsubseteq C$ ($\mathcal{I} \models \top \sqsubseteq C$) if $C^{\mathcal{I}} = \Delta^{\mathcal{I}}$

A concept assertion $a : C$ ($\mathcal{I} \models a : C$) if $a^{\mathcal{I}} \in C^{\mathcal{I}}$

A role assertion $(a, b) : R$ ($\mathcal{I} \models (a, b) : R$) if $(a^{\mathcal{I}}, b^{\mathcal{I}}) \in R^{\mathcal{I}}$

Let $K$ be the $\mathcal{ALCQ}$ knowledge base of a TBox $T$ and an ABox $\mathcal{A}$. An interpretation $\mathcal{I}$ is a model of $K$ if $\mathcal{I} \models \varphi$ for every $\varphi \in T \cup \mathcal{A}$. A knowledge base $K$ is consistent if it has a model. Moreover, for $\varphi$ an inclusion axiom or an assertion, we say that $K \models \varphi$ (in words, $K$ entails $\varphi$) if for every model $\mathcal{I}$ of $K$, $\mathcal{I} \models \varphi$ also holds.

The consistency problem for $\mathcal{ALCQ}$ is ExpTime-complete. The entailment problem is reducible to the consistency problem as follows:

Let $K$ be an $\mathcal{ALCQ}$ knowledge base and $d$ be an individual not belonging to Ind($K$). Then,

$K \models \top \sqsubseteq C$ iff $K \cup \{d : \neg C\}$ is inconsistent.

$K \models a : C$ iff $K \cup \{a : \neg C\}$ is inconsistent.

This shows that an entailment can be decided in ExpTime. Moreover, the inconsistency problem is reducible to the entailment problem, and so deciding an entailment is an ExpTime-complete problem too.
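
For a finite interpretation, the set-theoretic reading of the concept constructors can be computed directly. The Python sketch below evaluates a concept to the set of domain elements that belong to it and checks whether an interpretation satisfies an inclusion axiom whose right-hand side is that concept. It only checks one given interpretation; it is not a reasoner and does not decide consistency. The tuple encoding of concepts, the function names, and the example domain are all hypothetical.

```python
def successors(x, role, roles):
    """Return every y such that (x, y) is in the interpretation of `role`."""
    return {y for (a, y) in roles[role] if a == x}


def extension(concept, domain, concepts, roles):
    """Return the set of domain elements that belong to `concept`."""

    def ext(c):
        return extension(c, domain, concepts, roles)

    op = concept[0]
    if op == "atom":
        return set(concepts[concept[1]])
    if op == "not":
        return set(domain) - ext(concept[1])
    if op == "and":
        return ext(concept[1]) & ext(concept[2])
    if op == "or":
        return ext(concept[1]) | ext(concept[2])
    if op in ("some", "all"):          # exists R.C and forall R.C
        role, filler = concept[1], ext(concept[2])
        if op == "some":
            return {x for x in domain if successors(x, role, roles) & filler}
        return {x for x in domain if successors(x, role, roles) <= filler}
    if op in ("atmost", "atleast"):    # qualified number restrictions
        n, role, filler = concept[1], concept[2], ext(concept[3])
        counts = {x: len(successors(x, role, roles) & filler) for x in domain}
        if op == "atmost":
            return {x for x, k in counts.items() if k <= n}
        return {x for x, k in counts.items() if k >= n}
    raise ValueError(f"unknown constructor: {op}")


def satisfies_inclusion(concept, domain, concepts, roles):
    """True if every domain element belongs to `concept`."""
    return extension(concept, domain, concepts, roles) == set(domain)


# Hypothetical finite interpretation.
DOMAIN = {"ann", "bob", "srv1", "srv2"}
CONCEPTS = {"Admin": {"ann"}, "Server": {"srv1", "srv2"}}
ROLES = {"manages": {("ann", "srv1"), ("ann", "srv2"), ("bob", "srv1")}}
SERVER = ("atom", "Server")

if __name__ == "__main__":
    busy = ("atleast", 2, "manages", SERVER)
    print(extension(busy, DOMAIN, CONCEPTS, ROLES))
```

Note that the universal restriction holds vacuously for elements with no successors, so in this interpretation every element belongs to the concept "manages only servers".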

5.3.6 Inferencing

The basic inference problem for DL is checking knowledge base consistency. A knowledge base $K$ is consistent if it has a model. The additional inference problems are the following.

Concept Satisfiability. A concept $C$ is satisfiable relative to $K$ if there is a model $\mathcal{I}$ of $K$ such that $C^{\mathcal{I}} \neq \emptyset$.

Concept Subsumption. A concept $C$ is subsumed by concept $D$ relative to $K$ if, for every model $\mathcal{I}$ of $K$, $C^{\mathcal{I}} \subseteq D^{\mathcal{I}}$.

Concept Instantiation. An individual $a$ is an instance of concept $C$ relative to $K$ if, for every model $\mathcal{I}$ of $K$, $a^{\mathcal{I}} \in C^{\mathcal{I}}$.

All these reasoning problems can be reduced to KB consistency. For example, concept $C$ is satisfiable with regard to the knowledge base $K$ if $K \cup \{a : C\}$ is consistent, where $a$ is an individual not occurring in $K$.

5.3.7 SWRL

SWRL extends the set of OWL axioms to include Horn-like rules, and it enables Horn-like rules to be combined with an OWL knowledge base [HORR04].

Definition 5.3 (Horn Clause)

A Horn clause $C$ is an expression of the form

$D_0 \leftarrow D_1 \wedge \cdots \wedge D_n$,

where each $D_i$ is an atom. The atom $D_0$ is called the head and the set $D_1 \wedge \cdots \wedge D_n$ is called the body. Variables that occur in the body at most once and do not occur in the head are called unbound variables; all other variables are called bound variables.

The proposed rules are of the form of an implication between an antecedent (body) and a consequent (head). The intended meaning can be read as: whenever the conditions specified in the antecedent hold, the conditions specified in the consequent must also hold. Both the antecedent (body) and consequent (head) consist of zero or more atoms. An empty antecedent is treated as trivially true (i.e., satisfied by every interpretation), so the consequent must also be satisfied by every interpretation. An empty consequent is treated as trivially false (i.e., not satisfied by any interpretation), so the antecedent must not be satisfied by any interpretation.

Multiple atoms are treated as a conjunction, and both the head and body can contain conjunctions of such atoms. Note that rules with conjunctive consequents could easily be transformed (via Lloyd-Topor transformations) into multiple rules each with an atomic consequent. Atoms in these rules can be of the form C(x), P(x, y), SameAs(x, y), or DifferentFrom(x, y), where C is an OWL description, P is an OWL property, and x, y are either variables, OWL individuals, or OWL data values.
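
The reading of a rule as "whenever the body holds, the head must hold" can be illustrated with a small forward-chaining loop over ground facts. This is a sketch under simplifying assumptions, not an SWRL engine: atoms are Python tuples, variables are strings beginning with `?`, every head variable is assumed to occur in the body, and the OWL semantics of classes and properties is ignored. The function names and the family facts are hypothetical.

```python
def is_var(term):
    return isinstance(term, str) and term.startswith("?")


def unify(atom, fact, binding):
    """Extend `binding` so that `atom` equals `fact`, or return None."""
    if len(atom) != len(fact) or atom[0] != fact[0]:
        return None
    binding = dict(binding)
    for term, value in zip(atom[1:], fact[1:]):
        if is_var(term):
            if binding.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return binding


def match_body(body, facts, binding=None):
    """Yield every binding that satisfies all body atoms (a conjunction)."""
    binding = binding or {}
    if not body:
        yield binding
        return
    for fact in facts:
        extended = unify(body[0], fact, binding)
        if extended is not None:
            yield from match_body(body[1:], facts, extended)


def substitute(atom, binding):
    return (atom[0],) + tuple(binding.get(term, term) for term in atom[1:])


def forward_chain(rules, facts):
    """Apply the rules repeatedly until no new fact can be derived."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for body, head in rules:
            # Materialize the matches first so `facts` is not mutated mid-scan.
            for binding in list(match_body(body, facts)):
                for atom in head:
                    derived = substitute(atom, binding)
                    if derived not in facts:
                        facts.add(derived)
                        changed = True
    return facts


# hasParent(?x, ?y) AND hasBrother(?y, ?z) -> hasUncle(?x, ?z)
UNCLE_RULE = (
    [("hasParent", "?x", "?y"), ("hasBrother", "?y", "?z")],
    [("hasUncle", "?x", "?z")],
)
FACTS = {("hasParent", "mary", "john"), ("hasBrother", "john", "bob")}

if __name__ == "__main__":
    print(forward_chain([UNCLE_RULE], FACTS) - FACTS)
```

The head is a list so that a rule with a conjunctive consequent can be written directly; splitting it into one rule per head atom would give the same result.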

5.4 Semantic Web and Security

We first provide an overview of security issues for the semantic web and then discuss some details on XML security, RDF security, and secure information integration, which are components of the secure semantic web. As more progress is made on investigating these various issues, we hope that appropriate standards will be developed for securing the semantic web. Security cannot be considered in isolation; it cuts across all layers.

For example, consider the lowest layer. One needs secure TCP/IP, secure sockets, and secure HTTP. There are now security protocols for these various lower layer protocols. One needs end-to-end security. That is, one cannot just have secure TCP/IP built on untrusted communication layers; we need network security. The next layer is XML and XML schemas. One needs secure XML. That is, access must be controlled to various portions of the document for reading, browsing, and modifications. There is research on securing XML and XML schemas. The next step is securing RDF. With RDF, not only do we need secure XML, but we also need security for the interpretations and semantics. For example, under certain contexts, portions of the document may be unclassified, while under certain other contexts the document may be classified.

Once XML and RDF have been secured, the next step is to examine security for ontologies and interoperation. That is, ontologies may have security levels attached to them. Certain parts of the ontologies could be secret, while certain other parts may be unclassified. The challenge is how to use these ontologies for secure information integration. Researchers have done some work on the secure interoperability of databases. We need to revisit this research and then determine what else needs to be done so that the information on the web can be managed, integrated, and exchanged securely. Logic, proof, and trust are at the highest layers of the semantic web. That is, how can we trust the information that the web gives us? Next we will discuss the various security issues for XML, RDF, ontologies, and rules.

5.4.1 XML Security

Various research efforts have been reported on XML security (see, e.g., [BERT02]). We briefly discuss some of the key points. The main challenge is whether to give access to all the XML documents or to parts of the documents. Bertino and Ferrari have developed authorization models for XML. They have focused on access control policies as well as on dissemination policies. They also considered push and pull architectures. They specified the policies in XML. The policy specification contains information about which users can access which portions of the documents. In [BERT02], algorithms for access control as well as computing views of the results are presented. In addition, architectures for securing XML documents are also discussed. In [BERT04] and [BHAT04], the authors go further and describe how XML documents may be published on the web. The idea is for owners to publish documents, subjects to request access to the documents, and untrusted publishers to give the subjects the views of the documents they are authorized to see. W3C is specifying standards for XML security. The XML security project is focusing on providing the implementation of security standards for XML. The focus is on XML-Signature Syntax and Processing, XML-Encryption Syntax and Processing, and XML Key Management. While the standards are focusing on what can be implemented in the near term, much research is needed on securing XML documents (see also [SHE09]).
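
The idea of granting access to portions of a document, and of computing the view a subject is authorized to see, can be illustrated with a small Python sketch built on the standard library's `xml.etree.ElementTree`. This is not the algorithm of [BERT02]; it is a simplified stand-in in which the policy is a hypothetical table from roles to the element tags they may not read, and the sample record is invented.

```python
import xml.etree.ElementTree as ET

DOC = """<record>
  <name>J. Smith</name>
  <department>Finance</department>
  <salary>90000</salary>
  <review><rating>4</rating><notes>confidential</notes></review>
</record>"""

# Hypothetical policy: role -> element tags the role is NOT allowed to read.
POLICY = {"manager": set(), "clerk": {"salary", "notes"}}


def authorized_view(xml_text, role):
    """Parse the document and prune every element the role may not read."""
    denied = POLICY[role]  # an unknown role raises KeyError, so access is denied
    root = ET.fromstring(xml_text)

    def prune(element):
        for child in list(element):  # copy: we remove children while iterating
            if child.tag in denied:
                element.remove(child)
            else:
                prune(child)

    prune(root)
    return root


if __name__ == "__main__":
    print(ET.tostring(authorized_view(DOC, "clerk"), encoding="unicode"))
```

Pruning a denied element removes its whole subtree, so a policy written at a coarse level automatically covers the nested content beneath it.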

5.4.2 RDF Security

RDF is the foundation of the semantic web. While XML is limited in providing machine-understandable documents, RDF addresses this limitation. As a result, RDF provides better support for interoperability as well as searching and cataloging. It also describes the contents of documents as well as relationships between various entities in the document. While XML provides syntax and notations, RDF supplements this by providing semantic information in a standardized way [ANTO08].

The basic RDF model has three components: resources, properties, and statements. A resource is anything described by RDF expressions. It could be a web page or a collection of pages. A property is a specific attribute used to describe a resource. An RDF statement is a resource together with a named property plus the value of that property. The components of a statement are the subject, the predicate, and the object. So, for example, if we have a sentence of the form “John is the creator of xxx,” then xxx is the subject (the resource), “creator” is the predicate (the property), and “John” is the object (here a literal). RDF diagrams, which are very much like entity relationship diagrams or object diagrams, are used to represent statements. It is important that the intended interpretation be used for RDF sentences. This is accomplished by RDF-S. A schema is a sort of dictionary that holds the interpretations of the various terms used in sentences.
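The subject, predicate, and object structure can be sketched in a few lines of Python. This is only an illustration: the example.org URIs are hypothetical placeholders for the resource “xxx,” the Statement type and the objects helper are names introduced here, and the Dublin Core “creator” URI is used as one plausible choice of property.

```python
from typing import NamedTuple

class Statement(NamedTuple):
    subject: str    # the resource being described
    predicate: str  # the property of that resource
    obj: str        # the value of the property (a resource or a literal)

# Dublin Core "creator" property, used here as an example predicate.
DC_CREATOR = "http://purl.org/dc/elements/1.1/creator"

# "John is the creator of xxx" plus one more statement (hypothetical URIs).
graph = [
    Statement("http://example.org/xxx", DC_CREATOR, "John"),
    Statement("http://example.org/yyy", DC_CREATOR, "Mary"),
]

def objects(graph, subject, predicate):
    """Return the values of `predicate` for `subject`, in graph order."""
    return [s.obj for s in graph
            if s.subject == subject and s.predicate == predicate]
```

Here objects(graph, "http://example.org/xxx", DC_CREATOR) answers the question “who is the creator of xxx?” by matching on the subject and the predicate and returning the object.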

More advanced concepts in RDF include the container model and statements about statements. The container model has three types of container objects: bag, sequence, and alternative. A bag is an unordered list of resources or literals. It is used to mean that a property has multiple values but the order is not important. A sequence is an ordered list of resources. Here the order is important. An alternative is a list of resources that represent alternatives for the value of a property. Various tutorials on RDF describe the syntax of containers in more detail. RDF also provides support for making statements about other statements. For example, with this facility, one can make statements of the form “The statement A is false,” where A is the statement “John is the creator of X.” Again, one can use object-like diagrams to represent containers and statements about statements. RDF also has a formal model associated with it. This formal model has a formal grammar. The query language for accessing RDF documents is SPARQL. For further information on RDF, we refer to the excellent discussion in the book by Antoniou and van Harmelen [ANTO08].
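Both containers and statements about statements reduce to ordinary triples. The sketch below, in Python, builds those triples using the standard RDF vocabulary (rdf:Bag, rdf:Seq, rdf:Alt, the numbered membership properties rdf:_1, rdf:_2, and so on, and the reification terms rdf:Statement, rdf:subject, rdf:predicate, rdf:object). The helper names container and reify, the ex: identifiers, and the ex:truthValue property are assumptions made for illustration.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

def container(node, kind, members):
    """Return the triples that describe a Bag, Seq, or Alt container."""
    if kind not in ("Bag", "Seq", "Alt"):
        raise ValueError("kind must be 'Bag', 'Seq', or 'Alt'")
    triples = [(node, RDF + "type", RDF + kind)]
    # Members are attached with the numbered properties rdf:_1, rdf:_2, ...
    for i, member in enumerate(members, start=1):
        triples.append((node, f"{RDF}_{i}", member))
    return triples

def reify(stmt_node, subject, predicate, obj):
    """Return the four triples that describe a statement as a resource."""
    return [
        (stmt_node, RDF + "type", RDF + "Statement"),
        (stmt_node, RDF + "subject", subject),
        (stmt_node, RDF + "predicate", predicate),
        (stmt_node, RDF + "object", obj),
    ]

# Statement A: "John is the creator of X" (hypothetical identifiers).
statement_a = reify("ex:A", "ex:X", "ex:creator", "John")
# "The statement A is false": a statement whose subject is statement A.
about_a = ("ex:A", "ex:truthValue", "false")

# A sequence of authors, where order matters.
authors = container("ex:authors", "Seq", ["John", "Mary"])
```

Because a reified statement is itself a resource (ex:A above), any further statement, including a security label, can take it as its subject. This is one reason the security questions raised for statements about statements are distinct from those for plain triples.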

Now, to make the semantic web secure, we need to ensure that RDF documents are secure. This would involve securing XML from a syntactic point of view. However, with RDF, we also need to ensure that security is preserved at the semantic level. The issues include the security implications of the concepts of resources, properties, and statements. That is, how is access control ensured? How can statements and properties about statements be protected? How can one provide access control at a finer level of granularity? What are the security properties of the container model? How can bags, sequences, and alternatives be protected? Can we specify security policies in RDF? How can we resolve semantic inconsistencies for the policies? What are the security implications of statements about statements? How can we protect RDF-S? These are difficult questions and we need to start research to provide answers. XML security is just the beginning. Securing RDF is much more challenging (see also [CARM04]).

5.4.3 Security and Ontologies

Ontologies are essentially representations of various concepts in order to avoid ambiguity. Numerous ontologies have been developed. These ontologies have been used by agents to understand the web pages and conduct operations such as the integration of databases. Furthermore, ontologies can be represented in languages such as RDF or special languages such as OWL. Now, ontologies have to be secure. That is, access to the ontologies has to be controlled. This means that different users may have access to different parts of the ontology. On the other hand, ontologies may be used to specify security policies just as XML and RDF have been used to specify the policies. That is, we will describe how ontologies may be secured as well as how ontologies may be used to specify the various policies.

5.4.4 Secure Query and Rules Processing

The layer above the secure RDF layer is the secure query and rule processing layer. While RDF can be used to specify security policies (see e.g., [CARM04]), the web rules language developed by W3C is more powerful for specifying complex policies. Furthermore, inference engines have been developed to process and reason about the rules (e.g., the Pellet engine developed at the University of Maryland). One could integrate ideas from the database inference controller that we have developed (see [THUR93]) with web rules processing to develop an inference or privacy controller for the semantic web. The query processing module is responsible for accessing the heterogeneous data and information sources on the semantic web. Researchers are examining ways to integrate techniques from web query processing with semantic web technologies to locate, query, and integrate the heterogeneous data and information sources.

5.5 Cloud Computing Frameworks Based on Semantic Web Technologies

In this section, we introduce a cloud computing framework that we have utilized in the implementation of our systems for malware detection as well as social media applications, some of which are discussed in this book. In particular, we will discuss our framework for RDF integration and provenance data integration.

5.5.1 RDF Integration

We have developed an RDF-based policy engine for use in the cloud for various applications including social media and information sharing applications. The reasons for using RDF as our data model are as follows: (1) RDF allows us to achieve data interoperability between the seemingly disparate sources of information that are cataloged by each agency/organization separately. (2) The use of RDF allows participating agencies to create data-centric applications that make use of the integrated data that is now available to them. (3) Since RDF does not require the use of an explicit schema for data generation, it can be easily adapted to ever-changing user requirements. The policy engine’s flexibility is based on its accepting high-level policies and executing them as rules/constraints over a directed RDF graph representation of the provenance and its associated data. The strength of our policy engine is that it can handle any type of policy that can be represented using RDF technologies, Horn logic rules (e.g., SWRL), and OWL constraints. The power of these semantic web technologies can be successfully harnessed in a cloud computing environment to provide the user with the capability to efficiently store and retrieve data for data-intensive applications. Storing RDF data in the cloud brings a number of benefits, such as scalability, resources and services available to users on demand, the ability to pay for services and capacity as needed, location independence, and guaranteed quality of service for users in terms of hardware/CPU performance, bandwidth, and memory capacity. We have examined the following efforts in developing our framework for RDF integration.

In [SUN10], the authors adopted the idea of Hexastore and considered both the RDF data model and HBase capabilities. They stored RDF triples in six HBase tables (S_PO, P_SO, O_SP, PS_O, SO_P and PO_S), which covered all combinations of RDF triple patterns. They indexed the triples with the HBase-provided index structure on the row key. They also proposed a MapReduce strategy for SPARQL basic graph pattern (BGP) processing, which is suitable for their storage schema. This strategy uses multiple MapReduce jobs to process a typical BGP. In each job, it uses a greedy method to select the join key and eliminates multiple triple patterns. Their evaluation results indicated that their approach worked well on large RDF datasets. In [HUSA09], the authors described a framework that uses Hadoop to store and retrieve large numbers of RDF triples. They described a schema to store RDF data in the HDFS. They also presented algorithms to answer SPARQL queries, which make use of Hadoop’s MapReduce framework. In [HUAN11], the authors introduced a scalable RDF data management system. They introduced techniques for (1) leveraging state-of-the-art single node RDF-store technology and (2) partitioning the data across nodes in a manner that helps accelerate query processing through locality optimizations. In [PAPA12], the authors presented H2RDF, which is a fully distributed RDF store that combines the MapReduce processing framework with a NoSQL distributed data store. Their system features unique characteristics that enable efficient processing of both simple and multijoin SPARQL queries on a virtually unlimited number of triples. These include join algorithms that execute joins according to query selectivity to reduce processing, and an adaptive choice between centralized and distributed (MapReduce-based) join execution for fast query responses. They claim that their system can efficiently answer both simple joins and complex multivariate queries, as well as scale up to 3 billion triples using a small cluster consisting of nine worker nodes. In [KHAD12b], the authors designed a Jena-HBase framework. Their HBase-backed triple store can be used with the Jena framework. Jena-HBase provides end users with a scalable storage and querying solution that supports all features from the RDF specification.
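The six-table layout described for [SUN10] can be illustrated with an in-memory sketch in Python. This is not the authors' implementation: HBase tables are replaced by Python dictionaries, and the reading of each table name (the letters before the underscore form the row key and the letters after it form the stored value) is an assumption based on the names alone. The sketch shows why six tables suffice: whichever positions of a triple pattern are bound, some table has exactly those positions as its row key.

```python
from collections import defaultdict

# Table name -> (triple positions forming the row key, positions stored as value).
# Positions: 0 = subject, 1 = predicate, 2 = object.
TABLES = {
    "S_PO": ((0,), (1, 2)),
    "P_SO": ((1,), (0, 2)),
    "O_SP": ((2,), (0, 1)),
    "PS_O": ((1, 0), (2,)),
    "SO_P": ((0, 2), (1,)),
    "PO_S": ((1, 2), (0,)),
}

def build_indexes(triples):
    """Store every triple once in each of the six tables."""
    indexes = {name: defaultdict(set) for name in TABLES}
    for t in triples:
        for name, (key_pos, val_pos) in TABLES.items():
            key = tuple(t[i] for i in key_pos)
            indexes[name][key].add(tuple(t[i] for i in val_pos))
    return indexes

def match(indexes, pattern):
    """Match a pattern (s, p, o); None marks an unbound position."""
    bound = {i for i, v in enumerate(pattern) if v is not None}
    if not bound:
        name = "S_PO"   # nothing bound: scan one table completely
    elif len(bound) == 3:
        name = "PS_O"   # all bound: look up by (p, s), then filter on o
    else:
        # Pick the table whose row key is exactly the bound positions.
        name = next(n for n, (k, _) in TABLES.items() if set(k) == bound)
    key_pos, val_pos = TABLES[name]
    if bound:
        key = tuple(pattern[i] for i in key_pos)
        rows = [(key, indexes[name].get(key, set()))]
    else:
        rows = list(indexes[name].items())
    results = set()
    for key, values in rows:
        for val in values:
            triple = [None, None, None]
            for i, v in zip(key_pos, key):
                triple[i] = v
            for i, v in zip(val_pos, val):
                triple[i] = v
            if all(p is None or p == t for p, t in zip(pattern, triple)):
                results.add(tuple(triple))
    return results
```

For example, after idx = build_indexes(triples), the pattern (None, "knows", "c") is answered from PO_S with a single key lookup. The cost of this design is storage: each triple is written six times in exchange for lookups that never need a secondary index.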

5.5.2 Provenance Integration

While our approach for assured information sharing in the cloud for social networking applications is general enough for any type of data including cyber security data, we have utilized provenance data as an example. We will discuss the various approaches that we have examined in our work on provenance data integration. More details of our work can be found in [THUR15].

In [IKED11], the authors considered a class of workflows which they call generalized map and reduce workflows. The input datasets are processed by an acyclic graph of map and reduce functions to produce output results. They also showed how data provenance (lineage) can be captured for map and reduce functions transparently. In [CHEB13], the authors explored and addressed the challenge of efficient and scalable storage and querying of large collections of provenance graphs serialized as RDF graphs in an Apache HBase database. In [PARK11], the authors proposed reduce and map provenance (RAMP) as an extension to Hadoop that supports provenance capture and tracing for workflows of MapReduce jobs. The work discussed in [ABRA10] proposed a system to show how HBase Bigtable-like capabilities can be leveraged for distributed storage and querying of provenance data represented in RDF. In particular, their ProvBase system incorporates an HBase/Hadoop backend, a storage schema to hold provenance triples, and a querying algorithm to evaluate SPARQL queries in their system. In [AKOU13], the authors introduced HadoopProv, a modified version of Hadoop that implements provenance capture and analysis in MapReduce jobs. Their system is designed to minimize provenance capture overheads by (i) treating provenance tracking in Map and Reduce phases separately and (ii) deferring construction of the provenance graph to the query stage. The provenance graphs are later joined on matching intermediate keys of the Map and Reduce provenance files.
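The general idea of capturing lineage transparently around map and reduce functions, and of deferring the join of map-side and reduce-side provenance to query time, can be sketched in Python as follows. This is a single-process toy and not the implementation of RAMP, HadoopProv, or any other system cited above. The function names, the use of list positions as record identifiers, and the word-count workload are all assumptions made for illustration.

```python
from collections import defaultdict

def run_with_provenance(records, map_fn, reduce_fn):
    """Run one map/reduce job, recording provenance for each phase separately.

    The user-supplied map_fn and reduce_fn are unchanged; the wrapper records
    lineage around them.
    """
    map_prov = defaultdict(set)   # intermediate key -> ids of input records
    groups = defaultdict(list)    # intermediate key -> mapped values
    for rec_id, rec in enumerate(records):
        for key, value in map_fn(rec):
            groups[key].append(value)
            map_prov[key].add(rec_id)

    outputs = {}
    reduce_prov = {}              # output key -> intermediate key it consumed
    for key, values in groups.items():
        outputs[key] = reduce_fn(key, values)
        reduce_prov[key] = key
    return outputs, map_prov, reduce_prov

def trace(output_key, map_prov, reduce_prov):
    """At query time, join the two provenance maps on the intermediate key."""
    return map_prov[reduce_prov[output_key]]

# Word count as a hypothetical workload.
records = ["a b", "b c", "a"]
outputs, map_prov, reduce_prov = run_with_provenance(
    records,
    map_fn=lambda line: [(word, 1) for word in line.split()],
    reduce_fn=lambda key, values: sum(values),
)
```

With this data, trace("a", map_prov, reduce_prov) returns the identifiers of the input lines that contributed to the count for “a.” Keeping the two provenance maps separate and joining them only when a trace is requested mirrors, in miniature, the deferred graph construction described for [AKOU13].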

5.6 Summary and Directions

This chapter has introduced the notions of the cloud and semantic web technologies. We first discussed concepts in cloud computing including aspects of virtualization. We also discussed the various service models and deployment models for the cloud and provided a brief overview of cloud functions such as storage management and data management. In addition, some of the cloud products, especially those that relate to big data technologies, were also discussed. Next, we discussed technologies for the semantic web including XML, RDF, ontologies, and OWL. This was followed by a discussion of security issues for the semantic web. Finally, we discussed cloud computing frameworks based on semantic web technologies. More details on cloud computing and semantic web can be found in [THUR07] and [THUR14].

Our discussion of cloud computing and semantic web will be useful in understanding some of the experimental systems discussed in Section IV of this book. For example, we have discussed experimental very large data processing systems that function in a cloud. We have also discussed access control for big data systems represented using semantic web technologies. These topics will be discussed in Section IV of this book.


