Hongming Cai⁎; Athanasios V. Vasilakos† ⁎School of Software, Shanghai Jiao Tong University, Shanghai, China
†Department of Computer Science, Electrical and Space Engineering, Luleå University of Technology, Luleå, Sweden
With the widespread adoption of Web of Things (WoT) technology, massive volumes of data are generated by large numbers of distributed sensors and diverse applications. WoT-related applications have emerged as an important area for both engineers and researchers. As a consequence, how to acquire, integrate, store, process and use these data has become an urgent and important problem for enterprises seeking to achieve their business goals. Based on an analysis of data processing functions, a framework is provided to identify the representation, management and processing areas of WoT data. Several associated functional modules are defined and described in terms of their key characteristics and capabilities. Then, current research on WoT applications is organized and compared to show the state-of-the-art achievements in the literature from the viewpoint of the data processing process. Next, several WoT storage techniques are discussed that enable WoT applications to move onto cloud platforms. Lastly, based on an analysis of application requirements, some future technical trends are proposed.
Web of things; Data storage; Cloud computing; Semantic processing; Linked data
With the widespread adoption of Web of Things (WoT) technology, massive volumes of data have been generated by large numbers of distributed sensors and diverse applications. WoT applications have emerged as an important area for both engineers and researchers. As a consequence, how to acquire, integrate, store, process and use these data has become an urgent and important problem for enterprises implementing business applications such as intelligent transportation, smart homes, intelligent manufacturing and smart healthcare systems.
The features of WoT data can be summarized as follows:
• Highly heterogeneous data: WoT data are acquired from many kinds of distributed sensors and applications. The data types range from structured data such as relational tables, through semi-structured data such as eXtensible Markup Language (XML) or Resource Description Framework (RDF) documents, to unstructured data such as images and videos.
• Massive dynamic data: WoT applications are typically connected to huge numbers of sensors or devices. Communication between these objects continuously generates large volumes of real-time, high-speed, uninterrupted data streams that change rapidly.
• Weakly semantic data: WoT data are event-driven, low-level data with little semantic meaning. Little business value can be extracted unless these raw data are integrated and processed.
Because WoT data are typically distributed, unstructured, event-based and time-related, interoperability among the massive data generated by heterogeneous WoT objects brings new challenges, especially in cloud environments. Different requirements arise for processing these massive data, covering the levels of data representation, data storage, data analysis and data utility. Traditional data storage focuses on resource measurement, management and provisioning in web-based environments. Therefore, Service Level Agreement (SLA) factors such as performance, scalability, availability, management and price are the main concerns of information infrastructure owners. Aiming to trace the latest progress in WoT-based data storage systems, a comprehensive process for WoT data applications and various relevant topics are discussed thoroughly.
First, based on data processing functional analysis, a framework is provided to identify the representation, storage, management, and processing areas of WoT data. Several associated functional modules are defined and described in terms of their key characteristics and capabilities.
Then, current research on WoT data storage is classified and compared. This paper provides a timely study of current WoT data storage methods, especially on cloud platforms, and a survey that describes the state-of-the-art techniques from the viewpoint of the data processing process.
Next, some WoT data storage techniques are presented that enable WoT applications to move onto cloud platforms. Key techniques related to WoT data storage, aimed at achieving higher availability and flexible resource provisioning, are discussed to provide an overview and essential information for current cloud-based WoT applications.
As WoT technologies evolve, a substantial number of related applications have been deployed in many industries. Based on an analysis of this research, some future technical trends are also described and discussed.
A common WoT framework consists of a perception layer, a network layer and an application layer. Based on the WoT data processing process, a framework for a cloud-based data storage system for WoT applications is given. The framework consists of several modules covering data storage, data representation, data management, inner and external data processing, and an optimization module based on the cloud platform, as Fig. 12.1 shows.
Descriptions of modules are given as follows:
• Data Storage Module: Considering that WoT data may be in structured, semi-structured or unstructured formats, effective data storage should combine different kinds of storage types into one body so as to build intelligent, complex WoT applications.
• Data Representation Module: How to define and describe heterogeneous data from distributed and mobile devices is a fundamental problem for the data processing process. Therefore, both simple models, such as events, messages, RDF and other data formats, and complex models, such as contextual information and semantic relations, are required to represent WoT data.
• Data Management Module: Because data from sensors are typically raw or low-level, different data management approaches are implemented based on data indices, metadata, semantic relations and linked data, so as to retrieve and access data from distributed data sources with high efficiency.
• Inner Data Operation Module: For processing data on a distributed platform, massive data processing mechanisms are constructed for parallel and distributed data processing, and querying and reasoning operations can be carried out in a more flexible way inside the platform.
• External Data Service Module: At the application level, data should be composed to construct functional services for business users, or to interoperate with other applications or services. High-level information then needs to be extracted, classified, abstracted and encapsulated for end-user utility.
• Cloud-based Data Optimization Module: Cloud platforms bring high efficiency to current WoT applications. Optimization methods are required when processing WoT data on a cloud platform to provide high performance, such as reduced I/O, scalability and availability.
On the whole, the framework of WoT data storage is critical because it comprises the general middleware and functional modules needed to implement real large-scale WoT applications. Considering that cloud platforms currently offer end users high efficiency and flexibility, much attention should be paid to enabling effective and intelligent data processing on cloud platforms.
In this section, following the above framework, related research is organized into data storage, data representation, data management, data operations for inner support, and data services for external application use.
After being obtained from different data sources, WoT data can be persisted for further processing. There are several data storage types. The relational database management system (RDBMS) is the basic and traditional storage type, which uses the Structured Query Language (SQL) as its query language. Beyond the RDBMS, many storage types have been extended or developed, such as Not only SQL (NoSQL) databases, databases based on the Hadoop Distributed File System (HDFS), in-memory databases, Bigtable databases and graph databases. The features of these different storage types are presented and discussed as follows.
The relational database management system has been a popular data storage type for a long time; it was proposed in 1970 in [1]. The model shields users from the details of how data are organized in the machine, and provides only a high-level query language to operate on data. However, with the development of Web 2.0 and cloud computing, the RDBMS has shown its shortcomings. With a static schema [2], non-linear query execution time and unstable query plans, the RDBMS is poor in scalability. For faster and more efficient operations on big data, the authors of [3] proposed the Cache Augmented Database Management System (CADBMS), which improves the speed of queries that read and write a certain part of the data by caching. The CADBMS is very useful for social network applications and other systems with a high read-write ratio.
Traditional database queries follow a simple policy: the defined constraints must be satisfied by each tuple in the query result. This policy is computationally efficient, as the database system can evaluate the query conditions on each tuple individually. However, many practical real-world problems require a collection of result tuples to satisfy constraints collectively, rather than individually. In [4], a new query model named package queries is presented that extends traditional database queries to handle complex constraints. The authors design PaQL, a SQL-based query language that supports the declarative specification of package queries.
The NoSQL database is also called a non-relational database. Data in a NoSQL database have no explicit types or patterns; they reside in different buckets, and related data are linked to each other. In fact, NoSQL is a general designation; such databases are usually divided into three main categories: key-value stores, document-based databases and column-oriented databases. Data are stored as key-value pairs in key-value databases like Amazon's SimpleDB, which supports both structured and unstructured data storage. Document-based databases such as MongoDB and Apache CouchDB store data as collections of documents, usually JSON-based; fields of any length and data of any type can easily be added. As for column-oriented databases, closely related data are linked as an extensible column, which differs from the strictly structured tables of an RDBMS.
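The three categories can be contrasted with a minimal sketch. This is an illustrative toy in plain Python, not the API of any of the products named above; all keys, field names and sensor values are invented for the example.

```python
# Key-value store: opaque values addressed only by their key.
kv_store = {}
kv_store["sensor:42:latest"] = b'{"temp": 21.5}'          # value is an opaque blob

# Document store: schema-free JSON-like documents whose fields are queryable.
doc_store = [
    {"_id": 1, "sensor": "temp-42", "reading": 21.5, "room": "lab"},
    {"_id": 2, "sensor": "hum-07",  "reading": 48.0},      # fields may differ per document
]
lab_docs = [d for d in doc_store if d.get("room") == "lab"]

# Column-oriented layout: values of one column stored contiguously,
# so a scan over a single attribute touches only that column.
col_store = {"sensor": ["temp-42", "hum-07"], "reading": [21.5, 48.0]}
avg = sum(col_store["reading"]) / len(col_store["reading"])

print(len(lab_docs), avg)
```

The design difference shows up in access patterns: the key-value store can only fetch by key, the document store can filter on any field, and the column layout favours per-attribute aggregation.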
In [5], the authors noted that NoSQL database systems nowadays need to make trade-offs among consistency, availability and partition tolerance to optimize for their applications. While a hybrid database system can use various kinds of database software and take advantage of their features for individual applications and workloads, how to make the database software work together to achieve the highest performance is still a challenging problem. They also provide an extensible database interface for integrating NoSQL databases and adding database operations.
NoSQL databases are mostly non-relational, distributed, open-source and horizontally scalable. Their main characteristics are a schema-free design, no joins, easy replication support, simple APIs and eventual consistency.
Hadoop is now one of the most popular MapReduce data storage solutions. However, the programming model of Hadoop is very low level, which prevents developers from reusing programs and makes them hard to maintain. On the storage side, HDFS [6], the distributed file system underlying Hadoop, can run on commodity devices with low cost and high fault tolerance. Its high throughput makes applications with big data sets more available and efficient.
Hive [7] is another open-source big data storage solution built on Hadoop. What makes Hive different is HiveQL, a SQL-like declarative language. Hive compiles HiveQL into MapReduce jobs executed on Hadoop. The language includes a type system and allows user-defined scripts and custom types. Hive also provides schemas and statistics functions, which make it more useful for query optimization, query compilation and data exploration.
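The compilation of a declarative aggregate into map and reduce stages can be sketched in a few lines. This is a single-process toy of the MapReduce pattern, not Hive's actual compiler; the `readings` rows, the HiveQL comment and all names are invented for illustration.

```python
from collections import defaultdict

# Rows as a HiveQL query such as
#   SELECT room, AVG(temp) FROM readings GROUP BY room;
# might see them; Hive compiles such a query into map and reduce stages.
rows = [("lab", 20.0), ("lab", 22.0), ("office", 25.0)]

def map_phase(rows):
    # Emit (key, value) pairs; the GROUP BY column becomes the key.
    for room, temp in rows:
        yield room, temp

def shuffle(pairs):
    # Group all values by key, as the framework does between the two stages.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # One reducer invocation per key computes the aggregate (AVG here).
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}

result = reduce_phase(shuffle(map_phase(rows)))
print(result)   # {'lab': 21.0, 'office': 25.0}
```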
Still, the experiments in [8] measure throughput against the number of files and show that Hadoop performs worse as that number grows. The bottlenecks are the size of the files used, the number of available data nodes and the number of reducers used.
In-memory database management systems (IMDBMSs) are designed for workloads such as On-Line Transaction Processing (OLTP) and On-Line Analytical Processing (OLAP). MonetDB and Vectorwise are traditional OLAP engines. More modern engines have since emerged, including Microsoft Hekaton, H-Store/VoltDB and Shore-MT. Recently, following the trend of executing OLTP and OLAP within the same system on the same database state, SAP's Hana and HyPer were developed [9].
In-memory databases offer great performance for data with high update rates, and can thus be used in many everyday services. One usage scenario for an IMDBMS is Location-Based Services (LBS), which play an important role in different areas of WoT applications. In [10], the authors combined a series of techniques to implement in-memory storage for LBS with high insertion efficiency.
Another scenario is managing vector spatial data, which is similar to the former. The authors of [11] concentrated on reducing I/O cost and improving algorithmic efficiency by designing and implementing a spatial data access system on the basis of an in-memory database.
In-memory databases can now process large datasets entirely in memory thanks to the growth of main memory capacity. However, main memory operations still cannot match the speed of the CPU. Therefore, the bottleneck of main-memory techniques lies in moving data from memory to the CPU caches. Mainstream research is turning to near-memory computation capabilities that make good use of hardware advantages. A near-data processing accelerator named JAFAR [12] is presented for pushing "select" queries down to memory instead of pulling data into the caches. By this means, select operations in a column-based data system can achieve up to a ninefold improvement.
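The benefit of pushing a select down to the storage side is easy to model in software. The sketch below only counts data movement in a toy column scan; it is not JAFAR and says nothing about real hardware speedups, and the column contents and predicate are invented.

```python
# A column of a million values; the predicate selects 0.1% of them.
column = list(range(1_000_000))
predicate = lambda v: v % 1000 == 0

def pull_then_filter(col):
    # Conventional path: every value crosses to the CPU, then is filtered in cache.
    moved, out = 0, []
    for v in col:
        moved += 1
        if predicate(v):
            out.append(v)
    return out, moved

def pushdown_select(col):
    # Near-data path: the predicate is evaluated at the storage side,
    # so only matching values cross the memory-to-cache boundary.
    hits = [v for v in col if predicate(v)]
    return hits, len(hits)

r1, traffic1 = pull_then_filter(column)
r2, traffic2 = pushdown_select(column)
assert r1 == r2
print(traffic1 // traffic2)   # data-movement reduction factor: 1000
```

Both paths return the same result; only the volume of data moved differs, which is exactly the quantity near-data processing targets.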
BigTable is a distributed storage system designed by Google; as its name suggests, it is intended to deal with data at large scale. Unlike another popular system, HDFS, BigTable supports only structured data. Thanks to BigTable's distributed features, developers and researchers can easily obtain a cloud storage solution for large-scale data tasks without building clusters themselves. However, the public cloud service also raises concerns for users: how to ensure the integrity of data in the cloud becomes a big issue in BigTable.
BigTable serves numerous projects at Google [13]. The data of these projects can amount to several petabytes across different data centres and servers. BigTable has consistently satisfied their demands for large data scale and low latency.
Aiming to enhance BigTable with an integrity solution, the authors of [14] present iBigTable. iBigTable consists of a series of security protocols based on a designed data structure and BigTable. These protocols efficiently ensure that the data returned by BigTable are intact. Moreover, iBigTable inherits the great features of BigTable and has good compatibility, which allows existing BigTable applications to be transferred to iBigTable with little change of code.
BigTable provides a flexible, high-performance solution for various products. It is implemented with three significant components: many tablet servers, a master server and a client-side library. Tablet servers manage sets of tablets, including handling read and write operations on loaded tablets and splitting overly large tablets into smaller ones. These servers are added to or removed from a cluster dynamically to accommodate changes in workload. The master server assigns each tablet to a tablet server in the cluster, detects changes in the set of tablet servers, balances the tablet servers' load, collects garbage files in the Google File System, and handles schema changes such as creating a table or a column family.
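The master's assignment and rebalancing responsibilities can be sketched as follows. This is a deliberately simplified model, not Google's protocol (which relies on Chubby locks and load statistics rather than hashing); the tablet names and server ids are invented.

```python
import hashlib

class Master:
    """Toy master: assigns each tablet (a row-key range) to a tablet server."""
    def __init__(self, servers):
        self.servers = list(servers)
        self.assignment = {}

    def assign(self, tablet):
        # Deterministic placement by hashing the tablet id over live servers.
        idx = int(hashlib.md5(tablet.encode()).hexdigest(), 16) % len(self.servers)
        self.assignment[tablet] = self.servers[idx]

    def remove_server(self, server):
        # Rebalance: tablets owned by a removed server are reassigned
        # among the remaining servers, mirroring the master's failover duty.
        self.servers.remove(server)
        for tablet, owner in list(self.assignment.items()):
            if owner == server:
                self.assign(tablet)

master = Master(["ts1", "ts2", "ts3"])
for t in ["users:a-m", "users:n-z", "logs:2016"]:
    master.assign(t)
master.remove_server("ts2")
print(sorted(master.assignment))
```

After `remove_server`, every tablet is again owned by a live server, which is the invariant the real master maintains when tablet servers join or leave the cluster.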
Graph databases utilize the features of graphs to provide scalable data storage. Queries are based on the nodes, properties and edges that represent or store data. Recently, more attention has been paid to graphs for their usability in modelling complicated structures.
In [15], experiments on the graph database Neo4j and the relational database MySQL were carried out. The results show that the graph database has a great advantage over the relational database for structural queries and full-text searches.
Graph databases have no schema, which makes them very suitable for storing XML documents and biological or chemical data. Compared to storing graphs, retrieving data efficiently from a large graph database via indices is more difficult, and therefore more desirable.
Aiming at graph mining, a novel graph indexing solution called gIndex [16] has been proposed. Unlike existing path-based methods, gIndex uses frequent substructures as the basic index units. Frequent substructures are highly stable under updates and capture the intrinsic features of the data, which makes them ideal for graph indices. However, the index can grow large in a big data warehouse, so two techniques are proposed to reduce its size: a size-increasing support constraint and discriminative fragments. Besides its elegant solution to graph indexing, gIndex also illustrates that data mining, especially frequent pattern mining, can greatly help indexing and query processing.
Querying graph databases is a major issue. Query navigation is the most important part and is heavily used in graph databases. Currently, reachability patterns with regular constraints are widely adopted for querying; XPath-like languages are an example [17]. XPath is widely used in XML navigation for its ability to express queries of interest, its easy query evaluation for fragments and its close connection to yardstick database query languages.
Inspired by the idea of using graphs to represent genome data, the authors of [18] carry out an investigation of graph-based databases. Researchers may build a database based on the adapted graph model to store genome data, which makes genome data storage and retrieval efficient and stable.
Based on the analysis of the above references, a comparison is given in Table 12.1.
Table 12.1
Comparison Between Data Storage Types
Product | MySQL | MongoDB | FB Cassandra | HBase | Amazon SimpleDB | SAP Hana | Google BigTable | Neo4j
Storage Type | RDBMS | NoSQL database | NoSQL database | HDFS-based database | NoSQL database | In-memory database | BigTable | Graph database
Data Model | Relational | Document-oriented | Column | Column | Document-oriented | Multi-column | Column | Graph
Interface | TCP/IP | TCP/IP | TCP/IP | HTTP/REST | TCP/IP | TCP/IP | TCP/IP | HTTP/REST
Data Storage | Disk | Disk | Disk | HDFS | S3 (Simple Storage Service) | Memory and disk | GFS | Disk
Query Method | SQL | Map/Reduce | Map/Reduce | Map/Reduce | String-based query language | SQL and MDX | Map/Reduce | Cypher query language
Replication | Asynchronous | Asynchronous | Asynchronous | Asynchronous | Asynchronous | Synchronous | Asynchronous/Synchronous | Asynchronous
Concurrency Control | Locks | Locks | Multi-Version Concurrency Control | Locks | None | – | Locks | Locks
Transactions | Local | No | Local | Local | No | No | Local | Local
Written In | C, C++ | C++ | Java | Java | Erlang | – | C, C++ | Java
Characteristics | Static schema, consistency, high availability, partition tolerance, persistence | Consistency, partition tolerance, persistence | High availability, partition tolerance, persistence | Consistency, partition tolerance, persistence | High availability, scalability | High availability | Consistency, high availability, partition tolerance, persistence | High availability, scalability
From the table, we can see that WoT data storage is similar to that of other applications. As a traditional storage type, the RDBMS has many complex restrictions that ensure reliability and consistency but also limit its scalability. Nevertheless, the SQL query language still plays a major role in both SQL and NoSQL databases. A NoSQL database has no tabular relations as a traditional RDBMS does, but it has its own mechanisms for data storage and retrieval. Among the NoSQL databases, HDFS-based databases and BigTable perform well in distributed storage systems, and they will be significant components in the big data era. In-memory databases improve performance on frequently updated data and show their power in geographic information systems and location-based applications. Graph databases are useful for graph storage and retrieval, which makes them significant for social networking and semantic web applications [19].
In general, to cope with the high heterogeneity of WoT data from distributed data sources, a popular approach is to combine different storage types, such as an RDBMS integrated with HDFS, so as to construct scalable data storage for WoT applications.
Data representation models are fundamental for WoT applications. Based on the level of data processing, these data models can be divided into three types: simple data models, integrated data models and semantic data models. Simple models are connected with sensor devices and include messages, events, pictures, videos and other data. An integrated model is composed of several simple models to construct an integrated view. A semantic model combines simple models and model relationships with related contextual data.
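The three levels can be illustrated with plain data structures. This is a schematic sketch, not any of the models surveyed below; the room, sensor and building names are invented, and the "semantic" level is shown as bare subject-predicate-object tuples in the spirit of RDF rather than with a real triple store.

```python
from datetime import datetime, timezone

# Simple model: a single sensor event.
event = {"sensor": "door-17", "type": "opened",
         "at": datetime(2016, 5, 1, 8, 30, tzinfo=timezone.utc).isoformat()}

# Integrated model: several simple models merged into one view of an entity.
entity_view = {"entity": "meeting-room-3",
               "events": [event],
               "image": "room3.jpg",     # reference to an unstructured payload
               "occupancy": 4}           # value fused from several sensors

# Semantic model: the same facts as subject-predicate-object triples,
# so relationships between entities become queryable data themselves.
triples = [
    ("meeting-room-3", "hasSensor", "door-17"),
    ("door-17", "observed", "opened"),
    ("meeting-room-3", "locatedIn", "building-A"),
]
sensors_of_room = [o for s, p, o in triples
                   if s == "meeting-room-3" and p == "hasSensor"]
print(sensors_of_room)   # ['door-17']
```

Moving down the list, each level adds context: the event alone carries little meaning, the integrated view ties it to an entity, and the triples make the relationships explicit enough to reason over.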
The authors of [20] identified the most significant concepts in WoT, namely physical entities, resources and services, which can be summarized as physical entities and the relationships between them. To describe these key concepts more accurately, the authors built an interlinked metadata model using microformats such as RDF and microdata to break the limitations of the HTML format and enhance the surface-representation metadata.
The authors of [21] explored a well-defined, extensible model for WoT information representation and organization. Based on three mainstream data types, namely object-cored organizing data, event-based explaining data and knowledge-based using data, the authors presented a model framework. It uses two of these data types as different layers, an object layer and an event layer, improving over the use of a single type only. The object layer, using object-based organizing data, represents all objects and the relations between them. The event layer contains event-based explaining data extracted from the raw detailed data processed by the object layer. The event layer regulates events and the relations between them based on an event semantic link network model with a given set of reasoning rules.
The authors of [22] focused on extracting event information from heterogeneous and massive raw data. They proposed an approach that effectively extracts events, and the internal links between them, from large datasets based on existing event types in a particular domain. The concepts of event, event type, link type and event schema are introduced, and a three-layer model consisting of a data-collecting layer, an event-extracting layer and a presenting layer is proposed to compress the redundant data.
Aiming to describe dynamic entities in WoT applications, the authors of [23] proposed a specification model for entity services. The model extends OWL-S with a service status ontology to describe the information involved in the services. The extension publishes entity status in real time and releases the information as dynamic services to requesters. With this method, the model constructs and executes transactions intelligently.
The authors of [24] proposed an approach of creating ontological models to describe connected objects to implement support of WoT and finally to achieve unified communication between objects. Moreover, a framework is presented to allow seamless integration of semantic models and objects of web applications.
Thing Broker [25] integrates WoT objects with different characteristics, based on different protocols and providing different interfaces and constraints, while remaining simple and flexible enough to meet the requirements of different applications. Thing Broker provides a uniform Twitter-like RESTful interface to different WoT objects. By providing one abstraction with configurable attributes to represent each WoT object, Thing Broker manages to accommodate all kinds of WoT objects, from physical entities to high-level services.
The authors of [26] delivered a formal model that provides a formal ontology representation of the relations between geographic events and observations. The model exploits SEGO, a rule-based mechanism, to infer information about events from in-situ observations, and it illustrates how ontological vocabularies can be exploited by a reasoning and querying approach to retrieve event data and sensing information.
The authors of [27] proposed an ontology-based WoT data model, called the continuum model, to reflect entities evolving in space and time. This model is important for studying history and predicting future trends; it can track spatial entities evolving through time and space, which plays an important role in capturing the semantics of the modelled phenomena. The model combines spatial functions with temporal capabilities.
The authors of [28] proposed a general methodology for developing consumable semantic data models for smart cities. It transforms large volumes of city data from different sources into a unified and integrated semantic data model (RDF/OWL) using different engineering approaches; it enables semantic interoperability at the concept level and supports application developers in designing advanced city services and applications.
Based on the complexity levels of the above references, a comparison is given in Table 12.2.
Table 12.2
Comparison Between Representative Data Models
Article | Title | Aim | Basic Data Structure | Main Methods | Representation Type
[21] | An extensible and active semantic model of information organizing for the Internet of Things | Intelligent reasoning | Object and event | Object-cored organizing data, event-based explaining data, and knowledge-based using data | Simple data model
[22] | Constructing the web of events from raw data in the web of things | To integrate heterogeneous and massive raw data | Event | Extracting events and their internal links from large-scale data | Simple data model
[23] | An OWL-S based specification model of dynamic entity services for Internet of Things | To construct and execute transactions intelligently | OWL | Extending OWL-S with a service status ontology to describe information involved in the services | Simple data model
[24] | Semantic surface representation of physical entity in the Web of things | To enhance the metadata elements of surface representations | RDF | Describing physical entities, resources and services by means of an interlinked metadata model | Integrated data model
[25] | Thing Broker: A Twitter for Things | To integrate WoT objects with different characteristics for further processing | Protocols, interfaces | Providing a uniform Twitter-like RESTful interface to different IoT objects | Integrated data model
[26] | A formal model to infer geographic events from sensor observations | To infer information about geographic events from observations | Ontology-based | Exploiting a rule-based mechanism called SEGO to infer information about events | Integrated data model
[27] | Continuum: A spatio-temporal data model to represent and qualify filiation relationships | To represent entities evolving in space and time | Ontology-based spatio-temporal data model | Tracking the evolution of spatial entities or objects through time, and combining the spatial functions provided by GeoSPARQL | Semantic data model
[28] | A smart city data model based on semantics best practice and principles | To enable semantic interoperability at the concept level | RDF, OWL | Transforming large city data sources of different natures into RDF/OWL | Semantic data model
In short, data representation serves further WoT data processing. Simple models such as events and RDF, combined with REST APIs, provide a common format for WoT applications. To support intelligent interaction for WoT at a contextual level, data representation should focus on integrated models built from multiple simple models such as sensor data, events, RDF and other formats. Considering not only data content but also data relationships, the semantic model based on ontologies and linked open data is a promising model for web-based applications, especially those involving social networks.
The WoT enables billions of smart things to be accessible through the RESTful architecture and protocols like HTTP and the Constrained Application Protocol (CoAP). Meanwhile, seamless integration and wide-scale interoperability are the critical challenges of WoT data management. WoT data management approaches can be divided into two kinds: metadata-based data indexing methods and semantic-based model annotation methods.
Metadata are a special kind of data defined for data management. They make data easy to organize and understand without requiring users to deal with the details of the access solution.
In [29], an efficient distributed metadata management scheme was proposed for data management on cloud platforms. It distributes metadata based on the parent directory path, and uses a hierarchical directory structure and a cooperative double-layer cache mechanism to access distributed data with reduced latency.
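The parent-directory-path distribution idea can be sketched in a few lines. This is a simplified illustration of the general technique, not the scheme of [29]; the server names and file paths are invented.

```python
import hashlib
from posixpath import dirname

METADATA_SERVERS = ["mds0", "mds1", "mds2"]

def metadata_server_for(path):
    # Hash the *parent directory* path, not the full path, so that all
    # entries of one directory land on the same metadata server and a
    # directory listing touches only a single node.
    parent = dirname(path)
    h = int(hashlib.sha1(parent.encode()).hexdigest(), 16)
    return METADATA_SERVERS[h % len(METADATA_SERVERS)]

# Sibling files share a parent and hence a metadata server; renaming a
# file (without moving it) does not relocate its metadata.
a = metadata_server_for("/sensors/room3/temp.log")
b = metadata_server_for("/sensors/room3/humidity.log")
print(a == b)   # True
```

Hashing the parent path trades some load balance for locality: a listing or lookup within one directory is served by one node instead of being scattered across the cluster.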
Mobile Metadata [30] was proposed to build a mobile code agent supporting client-side image retrieval in web-based computing environments. It packages both the data model and the mapping functions as a mashup and moves them to the client side. The model provides a clear client-side object model and view construction, quick query response times and better exploitation of network resources, and it is flexible to expand.
By embedding metadata to represent smart things, a system was developed to control and monitor the state of a WoT application environment [31]. The system produces a machine-readable state description of the application environment. A web request is generated when a smart device reaches a given state, and an application then performs the related operations to reconfigure the user's smart environment automatically. In this way, an intelligent application is implemented by means of metadata, even when huge numbers of requests are involved.
The authors of [32] presented a framework for semantic and location-based services exploiting enriched maps. In particular, the framework contains mainly an approach for semantically annotating crowd-sourced cartographic data, and an innovative ontology-based function for semantic-based searching and disposing capabilities in current navigation systems.
To extract and link related concepts from raw sensor data and represent them as a topical ontology, a clustering approach extended k-means was used on the basis of rules extracted from external sources. The authors of [33] introduced a knowledge acquisition technique for real-world data processing aiming at topical ontologies creation and evolution. Then, these concepts are marked to make them understandable for user for the purpose of data analysis and data reasoning, and a related system is proposed for software support.
On handing unstructured models, the authors of [34] provided a semi-automatic semantic annotation in visualization scene based on three-layer ontology. The three-layer ontology including general ontology, domain ontology and scene ontology is constructed to form a comprehensive knowledge representation for semantic annotation. It is effective for large scale model management.
In short, metadata and simple indexes work well for structured data management. However, since WoT data are often unstructured or semi-structured and reside in dynamic, contextual environments, semantic-based techniques have been widely studied, designed and applied in the past few years to overcome these challenges. With the rapid increase in the amount of data and their correlations, automatic metadata generation, ontology generation and evolution, and efficient, low-cost, dynamically updated semantic annotation for WoT data indexing and model annotation have attracted great attention.
In WoT applications, the data produced and consumed mostly comprise sensory data and generated data streams. Based on the data disposing process, related researches can be divided into data collection, data pre-disposing, information fusion and distributed data disposing.
A data collection protocol named EDAL [35] is modelled similarly to the open vehicle routing problem, which is proved to be NP-hard. EDAL is energy-efficient, delay-aware and lifetime-balancing, and is used to collect data in the wireless sensor network (WSN) domain.
The authors of [36] constructed a universal mobile data collection framework for WoT services. They addressed four basic requirements, namely task specification, task management, status sensing and data management. An architecture for general-purpose mobile data collection is also proposed, which separates the whole system into two parts: a back-end operating on the server side and a front-end on mobile devices.
Aiming to carry out data stream analytics, the OpenWoT approach [37] was proposed. It provides an event and clustering analytic server that collects sensor data from mobile devices and serves as an interface for data stream analysis. In detail, it uses intelligent servers and edge servers for real-time data collection, annotation and processing, and adds some extensions for WoT data streams.
A sensor data stream delivery system with different delivery cycles was proposed for WoT environments [38]. When connecting to servers and delivering sensor data streams, the system can adapt its computational and communication performance to the different requirements of clients.
The authors of [39] presented a hybrid system based on RFID and WSN called HRW, which integrates traditional RFID systems and WSN systems for efficient data collection. HRW has a set of smart nodes that have both RFID and WSN functions and replace RFID readers to gather data. Moreover, an enhanced data transmission algorithm that avoids data redundancy and unnecessary transmission overhead, and security mechanisms that prevent data manipulation and selective forwarding, are proposed.
Considering that redundant information is generated among nodes close to each other, a spatial correlation model [40] was proposed to minimize total consumption in the process of data preservation. The data preservation problem with data correlation can be transformed into the minimum cost flow problem, so a solution that is more efficient and closer to optimal than the greedy algorithm is proposed.
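The transformation described above can be sketched with NetworkX's min-cost-flow solver. The sensors, storage nodes, capacities and energy weights below are hypothetical, chosen only to illustrate how data preservation maps onto a flow network.

```python
import networkx as nx

# Hypothetical instance: two data-generating sensors must offload 3 and 2
# data units to two storage nodes; edge weights model transmission energy.
G = nx.DiGraph()
G.add_node("s1", demand=-3)   # negative demand = supply (data to preserve)
G.add_node("s2", demand=-2)
G.add_node("t1", demand=3)    # storage nodes absorb the offloaded data
G.add_node("t2", demand=2)
G.add_edge("s1", "t1", capacity=3, weight=1)  # cheap: spatially close
G.add_edge("s1", "t2", capacity=3, weight=4)
G.add_edge("s2", "t1", capacity=2, weight=5)
G.add_edge("s2", "t2", capacity=2, weight=1)

flow = nx.min_cost_flow(G)          # optimal routing of data units
cost = nx.cost_of_flow(G, flow)     # total (energy) cost of that routing
```

Here the solver sends each sensor's data to its cheap, nearby storage node, which is exactly the kind of globally optimal assignment a greedy per-node choice can miss.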
The authors of [41] proposed a five-layer system architecture for the integration of WSN and RFID, and chose Bluetooth and ZigBee as the communication protocols. To address the low efficiency and data redundancy of communications in this integration, a data-cleaning algorithm called the cross-redundant algorithm was proposed to achieve higher performance.
The authors of [42] proposed a classification method for data streams based on supervised classification techniques such as the Support Vector Machine (SVM), and reduced the volume of data by simple aggregation and approximation of density. These classification and labelling steps are the foundation of knowledge discovery in data streams.
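A minimal sketch of the idea, assuming scikit-learn: train an SVM on a labelled window of the stream, then aggregate a chunk of incoming readings before classifying it. The readings, labels and the plain averaging step are illustrative stand-ins for the paper's aggregation and density approximation.

```python
from sklearn.svm import SVC

# Hypothetical labelled window of a 1-D sensor stream with binary labels
# (e.g. 0 = "normal", 1 = "anomalous"), used to train the classifier.
X_train = [[0.1], [0.2], [0.15], [0.9], [0.95], [0.85]]
y_train = [0, 0, 0, 1, 1, 1]

clf = SVC(kernel="linear")
clf.fit(X_train, y_train)

# Volume reduction: average consecutive readings before classification
# (a simple stand-in for the aggregation/approximation step).
stream_chunk = [0.88, 0.92, 0.9, 0.94]
aggregated = [[sum(stream_chunk) / len(stream_chunk)]]
label = clf.predict(aggregated)[0]
```

The classifier then labels each aggregated chunk, so only one prediction per chunk flows downstream instead of one per raw reading.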
A framework for processing raw RFID data was proposed to reduce data uncertainty [43]. The framework is composed of two parts: a model tracking global objects and a model cleaning local RFID data. The former is implemented with a Markov-based model and the latter with a particle-filter-based approach.
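The local-cleaning idea can be sketched as a simple Markov-style forward filter over noisy tag readings for a single tag. The transition and detection probabilities below are illustrative assumptions, not values from [43].

```python
def forward_filter(observations, p_stay=0.9, p_detect=0.8, p_false=0.05):
    """Minimal Markov-style cleaning sketch for one RFID tag: estimates
    P(tag present) from noisy reader observations (1 = read, 0 = missed).
    All parameter values are illustrative."""
    belief = 0.5  # prior probability that the tag is present
    cleaned = []
    for obs in observations:
        # Transition: the tag tends to stay in its current state.
        pred = belief * p_stay + (1 - belief) * (1 - p_stay)
        # Observation update: readers miss tags and raise false positives.
        like_present = p_detect if obs else (1 - p_detect)
        like_absent = p_false if obs else (1 - p_false)
        num = like_present * pred
        belief = num / (num + like_absent * (1 - pred))
        cleaned.append(belief > 0.5)
    return cleaned

# A single dropped reading inside a solid read sequence is smoothed over.
cleaned = forward_filter([1, 1, 0, 1, 1])
```

The filter keeps believing the tag is present across the missed reading, which is precisely the uncertainty reduction raw RFID streams need.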
For communication and data collection among devices with heterogeneous network interfaces, a middleware consisting of a Multiple Protocol Transport Network (MPTN) gateway and a coordinated model was proposed [44]. Messaging and data alignment among multiple networks are implemented for concurrent data stream collection.
The authors of [45] proposed a privacy-preserving data collection technique that can be used in healthcare applications with sensors and RFID. It assures data secrecy via a data privacy protection mechanism and has been tested against various attacks. Moreover, it can be adapted to networks of different scales.
In [46], the authors applied OLAP techniques to sensor data to integrate data from different sources and gather correlated information for analysis and decision-making. An on-the-fly generation solution is proposed that uses metadata based on the W3C Semantic Sensor Network ontology and the W3C RDF Data Cube vocabulary to generate multidimensional data cubes.
An automatic segmentation methodology [47] was proposed for real-time high-level activity prediction. The end of the predicted activity can be marked automatically, and the training dataset can be divided into segments according to the previous tagging.
In [48], the authors presented an online sensor data segmentation methodology for real-time activity recognition. A two-layer strategy composed of sensor correlation and time correlation manipulation is introduced to facilitate dynamic segmentation.
A data stream clustering algorithm [49] was proposed that makes use of sliding windows and micro-cluster merging in order to batch farm products of similar quality on an agricultural WoT platform.
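A sketch of sliding-window micro-clustering in the spirit of [49]; the window size, merge radius and readings below are hypothetical.

```python
from collections import deque

class MicroCluster:
    """Minimal micro-cluster: keeps a count and a linear sum of values."""
    def __init__(self, point):
        self.n, self.s = 1, point

    @property
    def centre(self):
        return self.s / self.n

    def absorb(self, point):
        self.n += 1
        self.s += point

def cluster_stream(stream, window=4, radius=0.2):
    """Sliding-window micro-clustering sketch: each reading joins the
    first micro-cluster within `radius` of its centre, otherwise it
    seeds a new one; only the last `window` readings are considered."""
    recent = deque(maxlen=window)
    clusters = []
    for x in stream:
        recent.append(x)
        clusters = []
        for v in recent:  # rebuild micro-clusters over the current window
            for c in clusters:
                if abs(c.centre - v) <= radius:
                    c.absorb(v)
                    break
            else:
                clusters.append(MicroCluster(v))
    return [c.centre for c in clusters]

# Hypothetical quality scores: older readings fall out of the window.
centres = cluster_stream([1.0, 1.05, 3.0, 1.1, 2.95, 3.05])
```

Each final micro-cluster centre represents one quality batch; products whose scores fall into the same cluster would be batched together.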
The authors of [50] proposed an approach to implement operations such as queries over heterogeneous sensor data in RDF format. The approach can process multiple data resources simultaneously on the basis of ontologies and can also integrate heterogeneous sensor data. It constructs SPARQL query statements automatically and queries sensor data semantically according to users' requirements.
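Automatic SPARQL construction of this kind can be sketched as template filling from user requirements. The `ex:` vocabulary below is a hypothetical placeholder, not the ontology used in [50].

```python
def build_sensor_query(sensor_type, observed_property, min_value=None):
    """Sketch of automatic SPARQL construction from user requirements.
    The ontology terms (ex:Sensor classes, ex:observes, ex:hasValue)
    are illustrative placeholders."""
    lines = [
        "PREFIX ex: <http://example.org/sensors#>",
        "SELECT ?sensor ?value WHERE {",
        f"  ?sensor a ex:{sensor_type} ;",
        f"          ex:observes ex:{observed_property} ;",
        "          ex:hasValue ?value .",
    ]
    if min_value is not None:  # optional filter derived from requirements
        lines.append(f"  FILTER (?value >= {min_value})")
    lines.append("}")
    return "\n".join(lines)

# E.g. "temperature sensors whose reading is at least 30":
query = build_sensor_query("TemperatureSensor", "AirTemperature", min_value=30)
```

The generated string can then be submitted to any SPARQL endpoint holding the integrated RDF sensor data.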
In the WoT data environment, data change in type, state and analysis purpose. Rather than centralized master-server implementations, a parallel and partitioned data processing framework is needed to enable the execution of the MapReduce pattern in dynamic information infrastructures.
MapReduce is not perfect for every large-scale analytical task, and its high communication cost and redundant processing pose a big challenge for WoT applications. An approach that uses the MapReduce framework for large-scale graph data processing was given in [51]. The approach relies on density-based partitioning to build balanced partitions of a graph database over a set of machines. The experiments show satisfying performance and scalability for large-scale data processing. Furthermore, in [52], a technical framework for improving MapReduce itself was given.
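The MapReduce pattern itself can be sketched in a few lines, here computing vertex degrees of a small graph; the explicit shuffle step makes the communication cost mentioned above visible. This is a toy illustration, not the density-based partitioning of [51].

```python
from collections import defaultdict

def map_phase(edges):
    """Map: each edge emits (vertex, 1) for both endpoints."""
    for u, v in edges:
        yield (u, 1)
        yield (v, 1)

def shuffle(pairs):
    """Shuffle: group intermediate pairs by key. In a real cluster this
    step moves data between machines and dominates communication cost."""
    groups = defaultdict(list)
    for key, val in pairs:
        groups[key].append(val)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts per vertex to obtain its degree."""
    return {key: sum(vals) for key, vals in groups.items()}

edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
degrees = reduce_phase(shuffle(map_phase(edges)))
```

Balanced graph partitioning aims to place densely connected vertices on the same machine so that fewer intermediate pairs cross the shuffle boundary.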
In [53], a parallel distributed processing system was proposed for data analysis. The system manages dependency relations between data and data, as well as between data and analytic programs. It aims to make dependencies explicit and uses Hadoop Streaming for distributed parallel processing. Certain parts of a program are executed repeatedly, possibly with different data each time; the specification identifies these executions and checks dependencies separately at each execution.
In [54], a storage system with high security and scalability based on a revised secret sharing scheme was proposed. The system is composed of two scalable, flexible and reliable layers: a data layer and a system layer. Using a secret sharing scheme avoids the complicated key management required by traditional cryptographic algorithms. Moreover, multiple storage servers for WoT data work together to achieve large storage capacity, while individual servers can still join or leave flexibly at the system layer.
In [55], the authors proposed vRead, a programmable framework that connects HDFS I/O flows directly to application data. vRead enables VMs to read data directly from data-node disk images, which improves I/O flow without the overhead of virtualization.
Data operations play a fundamental role in inner platform support. Based on the disposing steps, a comparison is given in Table 12.3.
Table 12.3
Comparison Between Data Disposing Methods
| Researches | Data Resources | Semi-Structured Data | Data Stream | Generic | Main Methods | Topics |
| --- | --- | --- | --- | --- | --- | --- |
| EDAL [35] | Data in the WSN (wireless sensor network) domain | ✓ | ✓ | | 1. Modelled like open vehicle routing problems. 2. A centralized meta-heuristic. 3. A distributed heuristic. | Data Collection |
| Mobile Data Collection Framework [36] | Mobile sensor data | ✓ | ✓ | | Four basic requirements plus additional issues. | Data Collection |
| A Sensor Data Stream Delivery System [38] | Sensory data stream | ✓ | ✓ | | A sensor delivery system with flexible delivery cycles for different clients. | Data Collection (Data Stream Delivery) |
| Data Collection for Large-Scale Mobile Monitoring Applications [39] | Sensory data with tags | ✓ | | | Methods to improve data transmission efficiency, protect data privacy and avoid malicious selective forwarding in data transmission. | Data Collection |
| Data Alignment for Multiple Temporal Data Streams [44] | Data streams in heterogeneous networks | ✓ | ✓ | | 1. An MPTN gateway for messaging and data alignment among multiple networks. 2. A coordinated model to collect concurrent data streams and convert time. | Data Collection (Data Alignment) |
| Constructing the Web of Events from Raw Data in the Web of Things [22] | Heterogeneous and massive raw data | ✓ | ✓ | | 1. Conceptions (Event, Event Type, Link Type). 2. A three-layered model. | Data Pre-disposing (Information Extraction) |
| Data Preservation in Data-Intensive Sensor Networks with Spatial Correlation [40] | Sensory data | ✓ | ✓ | | Considering spatial correlation to reduce redundant information. | Data Pre-disposing |
| Data Cleaning for RFID and WSN Integration [41] | RFID data | ✓ | ✓ | | 1. A five-layer system architecture developed to integrate WSN and RFID. 2. Bluetooth and ZigBee selected as communication protocols. 3. ICRDC used for redundant data elimination. | Data Pre-disposing (Data Cleaning) |
| A Novel Learning Method to Classify Data Streams in the Internet of Things [42] | High-volume, multi-dimensional unlabelled data streams | ✓ | ✓ | | 1. Data stream classification based on SVM. 2. Dimension reduction based on SAX density. | Information Fusion (Data Classification and Labelling) |
| Automatic Sensor Data Stream Segmentation for Real-Time Activity Prediction in Smart Spaces [47] | Sensor data stream and time window | ✓ | ✓ | | Automatic segmentation based on the peak value of JWD. | Information Fusion (Activity Prediction) |
| A New Clustering Algorithm for Sensor Data Streams in an Agricultural IoT [49] | Various types of sensor data streams in an agricultural IoT | ✓ | ✓ | | A data stream clustering algorithm based on sliding windows and micro-cluster merging. | Information Fusion (Data Stream Clustering) |
| Dynamic Sensor Event Segmentation for Real-Time Activity Recognition in a Smart Home Context [48] | Sensory data | ✓ | ✓ | | An online sensor data segmentation with a two-layer strategy: sensor correlation and time correlation. | Information Fusion (Data Segmentation) |
| Parallel, Distributed, and Differential Processing System [53] | Sensory data | ✓ | ✓ | | Managing dependent relations between data and data, and between data and analytic programs. | Distributed Data Disposing |
(1) Data collection is always the first step in WoT applications. Considering that WoT data are often large-scale, dynamic and sampled at high rates, some researches focus on how to organize data collection tasks, and a large part of the researches focus on data transmission [35,38,49]. The authors of [36] built a general mobile data collection framework and discussed four basic requirements and some open issues in current mobile data collection frameworks. The authors of [37] made use of the OpenWoT middleware and designed an intelligent server for real-time acquisition. Moreover, some researches pay attention to open issues in data collection, such as privacy [45].
(2) Data pre-disposing is carried out to prepare for further data operations. According to the disposing purpose, data pre-disposing methods can be divided into data preservation [40], data cleaning [41,43], data alignment [44] and so on. The authors of [42] proposed a classification method for data streams based on supervised classification approaches such as SVM, and reduced the volume of data.
(3) After collecting data from sensors, how to dispose the semi-structured, streaming data to extract information is another problem. The authors of [22] proposed an approach to extract event information from heterogeneous and massive raw data. The authors of [47,48] proposed data segmentation methods for recognizing and predicting human activities, and the authors of [49] designed a new data stream clustering algorithm for the agricultural WoT platform. Some of these researches make use of sliding windows to implement real-time processing.
(4) Despite its evident merits such as scalability, fault tolerance and flexibility, MapReduce [51] has limitations in interactive or real-time processing when handling distributed WoT data disposing. It is not perfect for every large-scale analytical task [54], and the high communication cost and redundant processing pose a big challenge for IoT applications. Therefore, optimization work on WoT data is still needed for large-scale processing purposes.
To sum up, research on WoT data operations is mainly concentrated on the following respects.
Firstly, researchers will pay more attention to disposing the different characteristics of WoT data, such as the elimination of redundant data, the alignment and merging of heterogeneous data, and online analysis of dynamic data. Secondly, researchers will take integration with existing technologies into consideration. On the grounds that data in WoT applications are always large-scale and sampled at high rates, data disposing in WoT will be combined with distributed computing and streaming computing technologies, such as Hadoop and Storm [56]. Thirdly, researchers will focus on more usage scenarios, such as activity recognition, complex cooperation, etc.
Data services are used to provide functional support for WoT applications. The purposes of data service construction can be divided into three functional aspects: data interoperability, data-centric service composition and data analysis.
Although data interoperability development keeps innovating, some challenges remain. The researchers of [57] proposed a hub-centric framework for interoperability and validated it in a large-scale WoT environment.
A novel Semantic WoT framework was proposed based on the Constrained Application Protocol (CoAP) [58]. The framework supports annotated resource retrieval and logical ranking on the basis of semantic matchmaking services with a non-standard interface. To detect high-level events and specify them using machine-readable metadata, the framework also includes data mining approaches that deal with the raw data gathered.
The authors of [59] proposed SAMPLES to classify network traffic generated by mobile applications. SAMPLES is composed of an offline part and an online part. The offline part is a training system in charge of rule generation, and the online part is an engine for application identification and traffic classification. For each input flow, a subset of conjunctive rules is applied on the basis of pre-filtering conditions. Conjunctive rules are determined by lexical context and a unique application identifier in the HTTP header.
Mashup tools [60] are used for the development of WoT applications; they can connect the dataflow between applications and devices graphically. RESTful interfaces are generated from the WoT data models, which represent sets of sensors and actuators. Generic components extend existing mashup concepts and apply the concept of polymorphic functions, as in many programming languages.
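Generating RESTful interfaces from a WoT data model can be sketched as follows; the model layout and path scheme are hypothetical, not those of the tools in [60].

```python
def generate_rest_endpoints(data_model):
    """Sketch of deriving RESTful interfaces from a WoT data model:
    every sensor and actuator becomes a resource path that a mashup
    tool can wire into a dataflow. The model layout is hypothetical."""
    endpoints = {}
    for device, parts in data_model.items():
        for kind in ("sensors", "actuators"):
            for name in parts.get(kind, []):
                path = f"/devices/{device}/{kind}/{name}"
                # Sensors are read-only; actuators also accept commands.
                endpoints[path] = ["GET"] if kind == "sensors" else ["GET", "PUT"]
    return endpoints

# A hypothetical smart lamp with one sensor and one actuator.
model = {"lamp1": {"sensors": ["brightness"], "actuators": ["switch"]}}
endpoints = generate_rest_endpoints(model)
```

A mashup tool can then treat each generated path as a dataflow node, connecting sensor outputs (GET) to actuator inputs (PUT) graphically.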
By means of composing web services with data streams from WoT devices [61], WoT devices are connected with web services in an efficient and extensible way. Thus real-time communication, device integration and data stream mashups are elaborated.
DiscoWoT [62] provides a semantic discovery service that supports discovery of the functionality of smart things by humans and machines. The service is based on multiple discovery strategies, exposed through a RESTful interface, and the strategies can be created or updated by users at runtime.
To make the WoT smarter [63], data mining was introduced into applications. A system architecture for a WoT and big data mining system was proposed, in which many WoT devices are integrated to perceive the world and generate data continuously. The system focuses on the integration of devices and data mining technologies, where data mining functions are provided as services.
Condor [64] was proposed to handle data-parallel execution of analysis algorithms in WoT systems. The analytic processes are naturally data-parallel but their executions are not; therefore, how to execute these processes simultaneously within a fixed time becomes an important challenge. The framework's architecture allows any algorithms to be executed synchronously, treating them as black boxes.
In [65], an overview of issues and challenges in big data provenance research was presented, such as accessing big data, the minimum computational overhead requirement and so on.
To sum up, RESTful services are the main form of support for external applications. Application integration across heterogeneous and distributed environments is implemented by means of RESTful services, so a flexible application construction and execution environment is provided for application interaction. However, combining REST APIs with the inner distributed disposing environment for massive data analysis is not an easy task.
In WoT applications, massive data from sensors consume large storage space. Meanwhile, since different roles and tenants require different service and security levels, data should be isolated to meet various requirements of performance and safety. How to share and isolate these data in cloud platforms is the main challenge in WoT data storage.
The developments in cloud computing and WoT provide a promising way to support the growing number of WoT applications. CloudWoT [66] was proposed to integrate cloud computing and WoT and bridge the gap between them, bringing new opportunities in both technology and business.
The concept of Database-as-a-Service (DBaaS) [67] was introduced to move the operational burden from database users to service operators, which means that configuration, performance tuning, backup and so on become the responsibility of the service operators rather than the database users. Early DBaaS offerings such as Microsoft SQL Azure and Amazon RDS provide such services but do not pay much attention to multi-tenancy, flexible scalability or database privacy.
A new multi-layer vehicular data cloud [68] was presented with the support of cloud computing and WoT techniques. Two novel cloud services, a smart parking service and a vehicular data mining service, were also presented to analyze vehicle warranty data in the WoT environment. Two models based on Naïve Bayes and logistic regression were proposed that integrate all available sensors and devices in vehicles and roads.
Links as a Service (LaaS) [69] was proposed as an innovative abstraction in the cloud. It isolates network links to decrease interference in the cloud network. A unique set of links is assigned to each tenant, and these links form a virtual fat-tree. With these links, each tenant gets the same bandwidth and delay as if it were the only application in the shared cloud. Finally, the forwarding mechanism fits each tenant exactly.
A transactional DBaaS named Relational Cloud [70] was introduced to address the challenges of DBaaS. Relational Cloud has three significant technical characteristics: workload awareness, graph-based data partitioning and adjustable security. Firstly, the cloud has multiple tenants, and the system implements an approach that identifies workloads that can be co-located on a server to gain good performance and high consolidation. Secondly, by exploiting a graph-based data partitioning algorithm, the system achieves near-linear elastic scale-out no matter whether the transactions are simple or complicated. Thirdly, the system provides an adjustable security scheme that allows certain queries to run over encrypted data under secure conditions. The concept of workload awareness is a key underlying principle of the system design: by monitoring data accesses and query patterns, the system gathers useful information for various optimization and security functions, which eliminates configuration effort for users and operators.
As the adoption of cloud-based WoT is hindered by severe privacy concerns, the authors of [71] presented a comprehensive privacy solution to encourage its wide application in different areas. In this approach, potentially sensitive data are protected before being uploaded to the cloud, the privacy functionality is packaged as a service, users rather than developers decide whether information is private, and users can configure privacy easily through transparent interfaces.
To assure isolation among tenants, the authors of [72] proposed an approach based on a fitness function and made some optimizations to obtain accurate weights reflecting different requirements.
An abstraction data model for performance isolation named SQLVM [73] was presented by researchers at Microsoft. It is implemented by reserving key resources, including CPU, I/O and memory, for tenants on the database server. The main issue is that resource allocation in a relational database system is static, whereas the abstraction needs to allocate resources to tenants dynamically. Meanwhile, the overhead needs to stay low even as the scale grows very large, so overhead and scalability are also great challenges. SQLVM can effectively isolate the performance of a tenant from the others while these tenants are co-located on the same database server. Multiple scripted scenarios and a data collection and visualization framework are used to demonstrate SQLVM's performance isolation abstraction.
The authors of [74] focused on performance isolation when executing multi-tenant SaaS applications. They proposed a middleware architecture that uses a tenant-aware scheduler and profiler based on tenant-specific SLAs to enhance performance isolation. Their prototype shows satisfying preliminary results.
In [75], the authors presented a resource allocation method for multi-tenant cloud environments by understanding the subtle interference between network, compute and storage resources. The experiments provide insights that help cloud administrators decide how best to distribute virtual cores to physical cores, considering the effect of advanced virtual network technologies on remote block I/O performance.
To compare different performance isolation strategies, a standard metric is needed to quantitatively measure the capability of performance isolation in the cloud. The metric should treat the cloud environment as a black box by running external benchmarks. In [76], the authors proposed three different metrics and applied them to a simulated case that mocks various tenants sharing one SaaS application instance.
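One possible black-box metric of this kind (an illustrative assumption, not one of the three metrics proposed in [76]) is the relative latency degradation an abiding tenant experiences while a co-located tenant misbehaves.

```python
def isolation_metric(baseline_latency, disrupted_latency):
    """Illustrative black-box isolation metric: the relative latency
    degradation of an abiding tenant when another tenant is disruptive.
    0.0 means perfect isolation; larger values mean worse isolation."""
    return (disrupted_latency - baseline_latency) / baseline_latency

# Abiding tenant's mean response time alone vs. under a noisy co-tenant
# (hypothetical benchmark numbers, in milliseconds).
degradation = isolation_metric(100.0, 130.0)  # 30% slower under load
```

Because both latencies come from external benchmarks, the metric needs no knowledge of the cloud's internal scheduling, matching the black-box requirement above.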
The authors of [77] presented a tenant-isolated and fair system where every tenant in a data centre is isolated and shares key-value storage evenly. Previous resource allocation strategies rely on per-VM allocations and fixed rate limits to bring whole workloads to a high level. Pisces, proposed by the authors, assigns shared resources and services according to the weight of each tenant. The approach also works in co-located situations, and when requests for many partitions are skewed, time-varying or bottlenecked by different server resources.
In [78], the authors implemented an adaptive middleware that enables SaaS providers to efficiently enforce different and competing performance constraints in multi-tenant Software-as-a-Service (SaaS) applications. It can manage a combination of performance constraints in terms of latency, throughput and deadlines at a fine-grained level, and enables rapid responses to changing circumstances, while preserving the resource usage efficiency of application-level multi-tenancy.
In short, sharing and isolating data in cloud platforms is still the main challenge in WoT data storage, considering the characteristics of different applications. There is still a great contradiction between user authority and performance flexibility, and performance isolation has to be implemented at different levels with consideration of different data types. Therefore, how to implement a data management model that resolves the contradiction between secure sharing and performance isolation is the main difficulty in the current study of data management in cloud computing.
Currently, we are stepping into the new stage of Web 3.0, which attracts wider cooperation and crowd-sourcing in both information creation and information consumption. Therefore, data storage techniques for WoT applications have also moved forward to a new stage. Future technical tendencies and some open issues are given from the aspects of complex data representation, data storage and management, and real-time disposing mechanisms: smart contextual models for data representation, big linked data for semantic data storage and management, and data stream mining for real-time data analysis and application.
WoT data as a service faces issues of interoperability and re-usability for massive heterogeneous sensor data and data services. Therefore, how to develop a smart device as an intelligent and self-organizing contextual model in the cloud platform is an important open issue.
The authors of [79] designed a development platform named Semantic Web of Things (SWoT). The platform provides semantic-based WoT application templates for developers to construct interoperable SWoT applications, and high-level abstractions are used to add sensor measurements to templates, which helps reuse background and domain knowledge. Therefore, a unified platform for implementing interoperable semantic-based WoT applications is readily provided.
Considering that complex data associations are generated from different sources or complex data structures, extracting relevant information in multilingual contexts from massive amounts of unstructured, structured and semi-structured data is a challenging task. Various theories have been developed and applied to ease access to multicultural and multilingual resources. With the development of intelligent WoT applications, enhanced intelligence and contextualization models will enrich the WoT with more expressive semantic associations and support reasoning about social interactions between smart things. This will help smart things form convenient and powerful devices and environments for intelligent WoT applications.
Linked Data is defined as relationships or connections between data from different data sources such as databases and the Web. For effective data management, semantic annotation based on linked data offers a new direction in massive, complex, associated and contextual application scenes. These associated and contextual data play a critical role in intelligent applications.
The authors of [80] described and annotated WoT data streams by means of linked data. A novel semantic model containing observation and measurement data is built to create expressive descriptions of sensor streams, and the model is shown to be efficient, reducing the size of the stream data representations.
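Linked-data annotation of a single stream observation can be sketched by emitting N-Triples; the vocabulary URIs below are illustrative placeholders, not the semantic model of [80].

```python
def annotate_observation(stream_id, timestamp, value, unit):
    """Sketch of a linked-data annotation for one sensor stream
    observation, emitted as N-Triples. All URIs are hypothetical
    placeholders standing in for a real observation vocabulary."""
    obs = f"<http://example.org/stream/{stream_id}/obs/{timestamp}>"
    return "\n".join([
        f"{obs} <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> "
        "<http://example.org/ont#Observation> .",
        f"{obs} <http://example.org/ont#hasValue> \"{value}\" .",
        f"{obs} <http://example.org/ont#unit> \"{unit}\" .",
    ])

# A hypothetical temperature reading annotated as three RDF triples.
triples = annotate_observation("temp-42", 1700000000, 21.5, "Cel")
```

Because every observation becomes URIs and triples, the stream can be linked to other data sources and queried with standard semantic tooling.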
In short, driven by semantic technologies such as linked data and ontologies, we can predict that semantic data processing approaches will improve greatly in the near future, and a more natural and meaningful way of working with high-level information will become common in different WoT areas. Combined with natural language processing, semantic technology will be used to create more intelligent applications.
Unstructured data such as video cannot be stored in a structured database system for analysis purposes. Moreover, data mining on data streams from different data sources with non-persisted associations is a new but important issue. There are several directions for processing data streams with dynamic methods, for example, retrieving features from a continuous data stream to build data associations, or processing a whole fragment of a data stream by function transformation.
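The first direction, retrieving features from a continuous stream, can be sketched with a sliding window that reduces raw readings to compact feature tuples (the window size and the choice of statistics are arbitrary illustrations).

```python
def window_features(stream, size=4):
    """Sketch of stream feature retrieval: slide a fixed-size window
    over the readings and emit (mean, min, max) per window, so that
    downstream mining works on compact features, not raw values."""
    feats = []
    for i in range(len(stream) - size + 1):
        w = stream[i:i + size]
        feats.append((sum(w) / size, min(w), max(w)))
    return feats

# Hypothetical readings: two overlapping windows yield two feature tuples.
features = window_features([1.0, 2.0, 3.0, 4.0, 5.0], size=4)
```

Feature tuples like these can then be associated across streams (e.g. by comparing window means) without persisting the raw unstructured data.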
According to [81], WoT infrastructure should focus on real-time interaction in future research. Therefore, a WoT micro-benchmark is designed that combines cloud computing, service decomposition and multi-threaded programming, and the benchmark is evaluated over a real WoT system.
The Streaming Linked Data (SLD) framework [82] provides a pluggable system for the analysis of RDF streams. By means of a set of visualization widgets, data streams can be collected and analysed based on semantic techniques, and the Streaming Linked Data Format can be used flexibly in distributed environments.
Data stream mining involves uncertain reasoning based on partitioned data and utilizes intermediate results for high efficiency. When unstructured and semi-structured data are also involved in the process, many research and technical problems remain open.
As WoT technologies evolve and play an important role in many applications, this article surveys the timely literature to give an overview of WoT data storage research.
To provide clear insight into different WoT systems and techniques, a multi-layer WoT data storage framework is given first. Then related techniques are described and discussed from the view of the data disposing process, including data representation, storage, management, inner data operations and external data services.
The cloud platform is a popular information infrastructure for current WoT applications. Data isolation and multi-tenant data storage are discussed to provide critical and accurate knowledge of current WoT data management in cloud platforms. This is significant for current WoT applications in achieving higher availability and flexible resource provisioning in cloud platforms.
Aiming to indicate future tendencies in WoT data storage techniques, some open issues are given from the perspectives of complex data models, semantic data management and real-time data disposing.
In short, data storage techniques can be utilized to give intelligent WoT applications a competitive advantage. However, many efforts are still needed to respond to the highly heterogeneous, massive, dynamic and weakly semantic features of WoT data.