8.1 Introduction

Data streams are continuous flows of data generated from various computing machines such as clients and servers in networks, sensors, call centers, and so on. Analyzing these data streams has become critical for many applications, including network data, financial data, and sensor data. However, mining these ever-growing data is a big challenge to the data mining community ([CAI04], [CHEN02], [FAN04a], [GABE05], [GANT01]). Data stream classification ([CHI05], [DING02], [DOMI00], [GANT02], [GEHR99], [HULT01], [JIN03], [KUNC08], [LAST02], [MASU09a], [SCHO05], [WANG03], [WANG06], [WANG07]) is one major aspect of data stream mining. Data stream classification mainly consists of two steps: building (or learning) a classification model using historical labeled data, and classifying (or predicting the class of) future instances using the model. Building a classification model from a data stream is more challenging than building one from static data because of several unique properties of data streams. In this chapter, we discuss the challenges involved in analyzing such streams.

The organization of this chapter is as follows. Section 8.2 provides an overview of the challenges. The notions of infinite length and concept drift in streaming data are discussed in Section 8.3. The notion of concept evolution is discussed in Section 8.4. Aspects of limited labeled data are discussed in Section 8.5. The experiments we have carried out are discussed in Section 8.6. Our contributions to the field are discussed in Section 8.7. This chapter is summarized in Section 8.8.

8.2 Challenges

First, data streams are assumed to have infinite length. It is impractical to store and use all the historical data for learning, since it would require an infinite amount of storage and learning time. Therefore, traditional classification algorithms that require several passes over the training data are not directly applicable to data streams.

Second, data streams exhibit concept drift, which occurs when the underlying concept of the data changes over time. For example, consider the problem of credit card fraud detection, where the goal is to detect whether a particular transaction is authentic or fraudulent. Since the behavior of authentic users as well as the techniques of forgery change over time, what is considered authentic now may appear fraudulent next year, and vice versa. In other words, the characteristics/patterns of these two classes of data (i.e., fraudulent/authentic) change over time; the underlying concept (i.e., the class characteristics) of the data is dynamic. Traditional classification techniques that assume a static concept are not applicable to data streams. To address concept drift, a classification model must continuously adapt itself to the most recent concept.

Third, data streams also exhibit concept evolution, which occurs when a novel class appears in the stream. For example, consider an intrusion detection problem, where network traffic is analyzed to determine whether it is benign or contains some kind of intrusion. It is possible that a completely new kind of intrusion occurs in the network. In that case, traditional classification techniques, which assume that the total number of classes in the data is fixed, would misclassify the new intrusion either as benign or as a known intrusion. To cope with concept evolution, a classification model must be able to automatically detect novel classes when they appear, before being trained with labeled instances of the novel class.

Finally, high-speed data streams suffer from insufficient labeled data, because manual labeling is both costly and time-consuming. The speed at which data points can be labeled therefore lags far behind the speed at which they arrive in the stream; as a result, most of the data points in the stream are left unlabeled. Most classification algorithms apply a supervised learning technique, which requires completely labeled training data, meaning that all instances in the training data are labeled. Such techniques therefore suffer from the scarcity of labeled data, resulting in a poorly built classifier. To deal with this scarcity, we need a learning algorithm that is capable of producing a good classification model even if it is supplied with only partially labeled training data, in which only P% (P < 100) of the instances are labeled.

Most data stream classification techniques concentrate only on the first two issues, namely, infinite length and concept drift. Our goal is to address all four issues, providing a more realistic solution than the state of the art.

We propose three different techniques, each of which addresses two or more of these problems, as summarized in Table 8.1. For example, ECSMiner, the novel class detection technique proposed in Chapter 11, solves the infinite length, concept-drift, and concept-evolution problems.

Table 8.1

Data Stream Classification Problems and Proposed Solutions

Proposed Technique           Infinite Length   Concept Drift   Concept Evolution   Limited Labeled Data
MPC ensemble (Chapter 10)    Yes               Yes             No                  No
ECSMiner (Chapter 11)        Yes               Yes             Yes                 No
ReaSC (Chapter 12)           Yes               Yes             No                  Yes

8.3 Infinite Length and Concept Drift

Traditional data stream classification techniques solve the infinite length problem by providing a one-pass learning paradigm using one of two approaches: single-model ([DOMI00], [GEHR99], [HULT01]) or ensemble ([GAO07], [KOLT03], [KOLT05], [SCHO05], [STRE01], [WANG03]) classification. Single-model approaches incrementally update their classification model with the progression of the data stream. As opposed to batch learning algorithms, which require multiple passes over the training data, incremental learning algorithms require a single pass. However, some single-model classification techniques (e.g., [DOMI00]) assume a stationary data distribution and are therefore not applicable to concept-drifting data streams. Ensemble approaches maintain an ensemble of classification models, which is also updated incrementally. Many ensemble approaches, such as [MASU09a], [SCHO05], and [WANG03], follow a hybrid batch-incremental technique: the data stream is divided into equally sized chunks, and one classifier is trained on each data chunk using a batch-learning algorithm. Whenever a new classifier is trained, the existing ensemble is updated by replacing an older classifier with the newest one based on a particular replacement strategy (e.g., the accuracy of each classifier on the latest training chunk). Therefore, the ensemble itself is updated incrementally, keeping it up to date with the stream's evolution. Ensemble approaches require simpler operations to update the model than their single-model counterparts, and they can efficiently handle concept drift. In Chapter 10, we describe our proposed ensemble technique [MASU09a], the multipartition, multichunk (MPC) ensemble approach. The MPC ensemble addresses both the infinite length and concept-drift problems, and offers lower classification error than existing ensemble approaches.

As discussed earlier, we assume that the data stream is divided into equally sized chunks, where the chunk size is chosen so that all the data in a chunk fit into main memory. Each chunk, once labeled, is used to train classifiers. Three parameters control the MPC ensemble: v, r, and L. The parameter v determines the number of partitions (v = 1 means a single-partition ensemble), the parameter r determines the number of chunks (r = 1 means a single-chunk ensemble), and the parameter L controls the ensemble size. The MPC ensemble consists of Lv classifiers. The ensemble is updated whenever a new data chunk is labeled: we take the r most recent consecutive labeled data chunks and train v classifiers using v-fold partitioning of their data, then update the ensemble by choosing the best (by accuracy) Lv classifiers among the v newly trained classifiers and the existing Lv classifiers. Thus, the total number of classifiers in the ensemble is always kept constant. A minimal sketch of this update appears below.
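
To make the update concrete, here is a minimal sketch in Python. Scikit-learn decision trees stand in for the base learner; the function name, the chunk format, and the accuracy-based scoring are illustrative assumptions rather than the exact implementation of Chapter 10.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def update_mpc_ensemble(ensemble, recent_chunks, v, L):
    """One MPC update step (illustrative).

    ensemble:      current list of at most L*v trained classifiers
    recent_chunks: the r most recent labeled chunks, each an (X, y) pair
    v:             number of partitions (v-fold)
    L:             ensemble-size parameter (the ensemble holds L*v models)
    """
    # Pool the r most recent labeled chunks into one training set.
    X = np.vstack([Xc for Xc, _ in recent_chunks])
    y = np.concatenate([yc for _, yc in recent_chunks])

    # v-fold partitioning: train v classifiers, each on v-1 of the v folds.
    folds = np.array_split(np.random.permutation(len(y)), v)
    new_models = []
    for i in range(v):
        train_idx = np.concatenate([folds[j] for j in range(v) if j != i])
        new_models.append(DecisionTreeClassifier().fit(X[train_idx], y[train_idx]))

    # Replacement strategy: keep the best L*v of the old and new classifiers,
    # scored by accuracy on the latest labeled chunk.
    X_last, y_last = recent_chunks[-1]
    candidates = ensemble + new_models
    candidates.sort(key=lambda c: c.score(X_last, y_last), reverse=True)
    return candidates[:L * v]

Because only the v newest models can enter and only the weakest models leave, the ensemble size stays fixed at Lv while the ensemble tracks the most recent concept.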

Figure 8.1 illustrates concept drift in a two-dimensional data stream. Each box represents a chunk in the data stream. The white circles represent the negative class, and the black circles represent the positive class. The dark straight line (analogous to a hyperplane) inside each box separates the data classes, and defines the current concept. The dotted straight line represents the concept corresponding to the previous chunk. As a result of concept drift, some data points have different class labels in the current concept than in the previous concept. Blue circles represent the data points that are negative (positive) according to the current concept but positive (negative) according to the previous concept.

Figure 8.1 Illustrating concept drift in two-class data.

It should be noted that when a new data point appears in the stream, it may not be labeled immediately. We defer the ensemble updating process until the data points in the latest data chunk have been labeled, but we keep classifying new unlabeled data using the current ensemble. For example, consider the online credit card fraud detection problem. When a new credit card transaction takes place, its class ({fraud, authentic}) is predicted using the current ensemble. Suppose a fraudulent transaction has been misclassified as authentic. When the customer receives the bank statement, he will identify this error and report it to the authority. In this way, the actual labels of the data points are obtained, and the ensemble is updated accordingly.

8.4 Concept Evolution

Concept evolution occurs in data streams when novel classes emerge. It is important to detect concept evolution because, if a novel class appears in the stream, traditional stream classification techniques will misclassify its instances, reducing the accuracy and reliability of the classifier. To the best of our knowledge, no data stream classification technique to date addresses the concept-evolution problem. Our approach, ECSMiner, presented in Chapter 11, is an extension of the approach described in [MASU09a].

ECSMiner solves the concept-evolution problem by automatically detecting novel classes in a data stream. Novel class detection should be an integral part of any realistic data stream classification technique because of the evolving nature of streams. It can be useful in various domains, such as network intrusion detection ([KHAN07], [MASU08a], [MASU08b]), fault detection [CRUP04], malware detection ([MASU06], [MASU07a], [MASU07b], [MASU07c], [MASU08a]), and credit card fraud detection [WANG03]. For example, in the case of intrusion detection, a novel kind of intrusion might go undetected by traditional classifiers, but our approach should not only detect the intrusion but also deduce that it is a novel kind of intrusion. This discovery would lead to an intense analysis of the intrusion by human experts in order to understand its cause, find a remedy, and make the system more secure.

Figure 8.2 shows an example of concept evolution in a data stream in a two-dimensional feature space. The left graph shows the data distribution of two different classes (+ and −) in a data chunk. A rule-based learner learns two rules (shown in the figure) from this data chunk. The right graph shows the data distribution in the next chunk, where a novel class (denoted by x) has evolved. Instances of this class would be misclassified by the rules learned in the previous chunk, since class x was not present when the rules were learned. In fact, no traditional classification model can detect the novel class. ECSMiner provides a solution to the concept-evolution problem by enriching each classifier in the ensemble with a novel class detector. If the arrival of a novel class is discovered, potential novel class instances are separated and classified as novel class instances. Thus, a novel class can be automatically identified without manual intervention.

Figure 8.2 Illustrating concept evolution in a two-dimensional feature space.

The novel class detection technique proposed in ECSMiner differs from traditional one-class novelty detection techniques ([MARK03a], [ROBE00], [YAMA01]), which can only distinguish between normal and anomalous data. Traditional novelty detection techniques assume that there is only one normal class, and any instance that does not belong to it is an anomaly or novel class instance; therefore, they are unable to distinguish among different types of anomaly. ECSMiner, by contrast, offers a multiclass framework for the novelty detection problem: it can distinguish among different classes of data and discover the emergence of a completely novel class. Furthermore, traditional novelty detection techniques simply identify as outliers/anomalies the data points that deviate from the normal class. ECSMiner not only detects whether a single data point deviates from the existing classes, but also discovers whether a group of such outliers has the potential to form a novel class by showing strong cohesion among its members. Therefore, ECSMiner is a synergy of a multiclass classification model and a novel class detection model.

Traditional data stream classification techniques also make impractical assumptions about the availability of labeled data. Most techniques ([CHEN08], [HULT01], [YANG05]) assume that the true label of a data point becomes available as soon as it has been classified by the classification model, so that the existing model can be updated immediately using the labeled instance. In reality, we would not be so lucky in obtaining the label of a data instance immediately, since manual labeling of data is time-consuming and costly. For example, in a credit card fraud detection problem, the actual labels (i.e., authentic/fraud) of a customer's credit card transactions usually become available in the next billing cycle, after the customer reviews all the transactions in the last statement and reports fraudulent transactions to the credit card company. Thus, a more realistic assumption is that a data point is labeled Tl time units after its arrival. For simplicity, we assume that the ith instance in the stream arrives at the ith time unit, so Tl can be considered a time constraint imposed on the data labeling process. Note that traditional stream classification techniques assume Tl = 0. Finally, we impose another time constraint, Tc, on the classification decision: an instance must be classified by the classification model within Tc time units of its arrival. If it is assumed that there is no concept evolution, it is customary to have Tc = 0, that is, an instance should be classified as soon as it arrives. However, when novel concepts evolve, classification decisions may have to be postponed until enough instances have been seen by the model to gain confidence in deciding whether a novel class has emerged. Tc is the maximum allowable time up to which the classification decision can be postponed for any instance. Note that Tc < Tl must hold in any practical classification model; otherwise, we would not need the classifier at all, we could simply wait for the labels to arrive. We will discuss this issue in detail in Chapter 10.

Figure 8.3 illustrates the significance of Tl and Tc with an example. Here xk is the last instance that has arrived in the stream. Let xj be the instance that arrived Tc time units earlier, and xi be the instance that arrived Tl time units earlier. Then xi and all instances that arrived before xi (shown with dark-shaded area) are labeled, since all of them are at least Tl time units old. Similarly, xj and all instances that arrived before xj (both the light-shaded and dark-shaded areas) are classified by the classifier since they are at least Tc time units old. However, the instances inside the light-shaded area are unlabeled. Instances that arrived after xj (age less than Tc) are unlabeled, and may or may not be classified (shown with the unshaded area).

Figure 8.3 Illustration of time constraints Tl and Tc.
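
The bookkeeping implied by these constraints can be written down directly. The sketch below merely encodes the definitions of Tl and Tc; the function name and index ranges are ours, not the book's.

def stream_status(k, Tl, Tc):
    """Partition stream positions 1..k by the two time constraints,
    assuming the ith instance arrives at the ith time unit (Tc < Tl)."""
    i = max(k - Tl, 0)   # x_i and everything earlier is labeled
    j = max(k - Tc, 0)   # x_j and everything earlier must be classified
    labeled = range(1, i + 1)                    # dark-shaded area in Figure 8.3
    classified_unlabeled = range(i + 1, j + 1)   # light-shaded area
    pending = range(j + 1, k + 1)                # unshaded: may still be deferred
    return labeled, classified_unlabeled, pending

# Example: k = 100, Tl = 30, Tc = 10 gives instances 1..70 labeled,
# 71..90 classified but not yet labeled, and 91..100 possibly deferred.
print([len(r) for r in stream_status(100, 30, 10)])  # [70, 20, 10]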

Integrating classification with novel class detection is a nontrivial task, especially in the presence of concept drift and under time constraints. We assume an important property of each class: the data points belonging to the same class should be close to each other (cohesion) and far apart from the data points belonging to other classes (separation). If a test instance is well separated from the training data, it is identified as a raw outlier. Raw outliers that may appear as a result of concept drift or noise are filtered out; an outlier that passes the filter (called an F outlier) has the potential to be a novel class instance. However, we must wait to see whether more such F outliers appear in the stream that observe strong cohesion among themselves. If a sufficient number of such strongly cohesive F outliers are observed, a novel class is assumed to have appeared, and the F outliers are classified as novel class instances. However, we can defer the classification decision of a test instance at most Tc time units after its arrival, which makes the problem more challenging. Furthermore, we must keep detecting novel class instances in this unsupervised fashion for at least Tl time units from the arrival of the first novel class instance, since labeled training data of the novel class(es) will not be available before then.
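
Concretely, one might test each incoming point against cluster summaries of the training data and then check the cohesion of the accumulated F outliers, as in the sketch below. The centroid-plus-radius summaries, the thresholds, and both function names are illustrative assumptions; ECSMiner's actual outlier and cohesion statistics differ.

import numpy as np

def is_raw_outlier(x, centroids, radii):
    """A test point is a raw outlier if it falls outside every cluster
    summary (centroid + radius) built from the training data."""
    d = np.sqrt(((centroids - x) ** 2).sum(axis=1))
    return bool((d > radii).all())

def looks_like_novel_class(f_outliers, min_points=50, cohesion_threshold=1.0):
    """Declare a potential novel class only if enough F outliers have
    accumulated AND they are mutually cohesive (small mean pairwise
    distance). A toy stand-in for ECSMiner's cohesion/separation test."""
    if len(f_outliers) < min_points:
        return False
    pts = np.asarray(f_outliers)
    diffs = pts[:, None, :] - pts[None, :, :]
    mean_pairwise = np.sqrt((diffs ** 2).sum(axis=-1)).mean()
    return mean_pairwise < cohesion_threshold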

8.5 Limited Labeled Data

Almost all existing data stream classification techniques ([AGGA06], [CHEN08], [FAN04b], [GAO07], [HULT01], [KOLT05], [SCHO05], [WANG03], [YANG05]) are based on the impractical assumption that the true label of a data point becomes available immediately after it is tested by the classifier. In other words, the data stream is assumed to be completely labeled, meaning that the true labels of all historical data are known. This assumption is impractical because manual labeling of data is usually costly and time-consuming; in a streaming environment, where data arrive at high speed, it is not possible to manually label all the data as soon as they arrive. If it were possible, we would not need a classifier to begin with. Thus, in practice, only a small fraction of the stream can be labeled by human experts, so traditional stream classification algorithms have very few instances with which to update their models, leading to poorly built classifiers. A realistic data stream classification model should demonstrate satisfactory performance even if it is trained with insufficient labeled training data. In Chapter 12, we propose ReaSC, which is capable of building efficient classification models from a partially labeled data stream, in contrast to other stream classification techniques that require a completely labeled stream.

As an example, suppose an organization receives flight reports as text documents from all over the world at the rate of a thousand reports per day and categorizes the reports into different classes: normal, minor mechanical problem, minor weather problem, major mechanical problem, and so on. Based on the categories, warning messages are sent to the corresponding airlines and aviation authorities for proper action. Important decision-making activities such as flight planning, resource allocation, and personnel assignment are affected by these warnings; therefore, timely delivery of the warnings is necessary to avoid both financial loss and customer dissatisfaction. Without loss of generality, suppose only 200 of the reports can be labeled manually by human experts each day, so an automated stream document classification system is employed so that all 1,000 documents can be classified each day. If a traditional stream classification technique is used, it faces a trade-off when updating the classifier: either the classifier is updated with only the 200 labeled data points per day, or it must wait 5 days to be updated with all 1,000 labeled data points that arrived today. Neither option is acceptable, since the former leads to a poor classifier and the latter to an outdated classifier. To avoid these problems entirely, the organization would have to increase its manpower (and cost) fivefold and classify all 1,000 instances manually.

Considering these difficulties, we propose a technique (ReaSC) that updates the existing classification model using the available 200 labeled and 800 unlabeled instances, while achieving the same or better classification accuracy than a classification model that is updated using 1,000 labeled instances. Thus, ReaSC offers a practical data stream classifier that not only views the data stream classification problem from a realistic perspective, but also provides a cost-effective solution. Figure 8.4 compares the traditional approach with classification under limited labeled data. The traditional approach (Figure 8.4a) requires a completely labeled data chunk to update the existing model, whereas ReaSC requires only a partially labeled (P% labeled) data chunk. After the existing model is updated, the latest data chunk, which is entirely unlabeled, is classified using the model. Eventually, when the (i + 2)nd data chunk arrives, P% of the instances of the (i + 1)st data chunk are assumed to have been labeled, and that chunk is used to update the model. This process repeats indefinitely.

Figure 8.4 Illustrating the limited labeled data problem in data streams.

Black circles represent labeled instances, and white circles represent unlabeled instances. (a) Data stream classification with traditional classifiers requires fully labeled data to update the existing model. (b) Our approach requires only a limited amount of labeled data to update the existing model.

Naturally, stream data can be buffered and processed when the buffer is full, so we divide the stream into equally sized chunks and train a classification model from each chunk. We propose a semisupervised clustering algorithm to create K clusters from the partially labeled training data ([MASU08b], [WOOL09]). A summary of the statistics of the instances belonging to each cluster is saved as a microcluster, and the microclusters created from each chunk serve as a classification model for the nearest neighbor algorithm. To cope with concept drift, we keep an ensemble of L models. Whenever a new model is built from a new data chunk, we update the ensemble by choosing the best L models from the L + 1 available models (the previous L models and the new model), based on their individual accuracies on the labeled training data of the new chunk. In addition, we refine the existing models in the ensemble whenever a novel class of data evolves in the stream. A sketch of the chunk-level model appears below.
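
The following sketch assumes integer class labels, with -1 marking unlabeled instances, and assumes K exceeds the number of classes seen in the chunk. Label-seeded K-means stands in for the cluster-impurity-based semisupervised algorithm; the function names are ours, not the book's.

import numpy as np
from sklearn.cluster import KMeans

def build_chunk_model(X, y, K):
    """Build one chunk model from a partially labeled chunk (y == -1 marks
    unlabeled points). Each resulting cluster is kept as a microcluster:
    a (centroid, majority label) summary."""
    labeled = y != -1
    classes = np.unique(y[labeled])
    # Seed one centroid per known class so clusters respect the labels;
    # fill the remaining K - |classes| centroids with random points.
    seeds = np.array([X[y == c].mean(axis=0) for c in classes])
    fill = X[np.random.choice(len(X), K - len(seeds), replace=False)]
    km = KMeans(n_clusters=K, init=np.vstack([seeds, fill]), n_init=1).fit(X)

    microclusters = []
    for k in range(K):
        member_labels = y[(km.labels_ == k) & labeled]
        label = np.bincount(member_labels).argmax() if len(member_labels) else -1
        microclusters.append((km.cluster_centers_[k], label))
    return microclusters

def classify(x, microclusters):
    """Nearest-microcluster (1-NN over centroids) prediction."""
    _, label = min(microclusters, key=lambda m: np.linalg.norm(m[0] - x))
    return label

Here classify predicts with the nearest microcluster, mirroring the nearest neighbor classification described above, while the unlabeled points still shape the cluster boundaries during training.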

8.6 Experiments

We evaluate our approaches on several synthetic and real datasets. We use two kinds of synthetic datasets:

Synthetic data with only concept drift: This synthetic dataset was generated with a moving hyperplane. It contains 10 real-valued attributes and two classes. The hyperplane is moved gradually to simulate concept drift in the data (see the generator sketch following this list). Details of this dataset are discussed in the experiment sections of Chapters 10 through 12.

Synthetic data with concept drift and concept evolution: This synthetic dataset was generated from Gaussian distributions. It contains 20 real-valued attributes and 10 classes. Different classes were generated using different parameters of the (Gaussian) distributions. Concept drift is simulated by gradually changing the parameter values for each class, and concept evolution is simulated by changing the class distributions so that novel classes appear and old classes disappear at different times in the stream; see the generator sketches below. Details of this dataset are discussed in the experiment sections of Chapter 10.
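
Sketches of both generators follow, under illustrative parameter choices: the drift rates, the number of simultaneously active classes, and the rotation schedule are our assumptions, not the exact values used for the book's datasets.

import numpy as np

def hyperplane_stream(n_chunks, chunk_size, dim=10, drift=0.05, seed=0):
    """Chunks labeled by a hyperplane that moves a little after every
    chunk, producing gradual concept drift."""
    rng = np.random.default_rng(seed)
    w = rng.random(dim)
    for _ in range(n_chunks):
        X = rng.random((chunk_size, dim))
        y = (X @ w > w.sum() / 2).astype(int)    # which side of the plane
        yield X, y
        w += drift * rng.standard_normal(dim)    # move the hyperplane

def gaussian_evolving_stream(n_chunks, chunk_size, dim=20, n_classes=10,
                             active=3, shift_every=5, drift=0.02, seed=0):
    """Each class is a Gaussian whose mean drifts slowly (concept drift);
    a rotating window of `active` classes emits points, so classes appear
    and disappear over time (concept evolution)."""
    rng = np.random.default_rng(seed)
    means = 5.0 * rng.standard_normal((n_classes, dim))
    for t in range(n_chunks):
        start = (t // shift_every) % n_classes            # rotate active set
        classes = [(start + c) % n_classes for c in range(active)]
        labels = rng.choice(classes, size=chunk_size)
        X = means[labels] + rng.standard_normal((chunk_size, dim))
        yield X, labels
        means += drift * rng.standard_normal((n_classes, dim))  # drift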

We also use four different real datasets:

Botnet dataset: We generate real peer-to-peer (P2P) botnet traffic in a controlled environment, where we run a P2P bot named Nugache [LEMO06]. Here the goal is to classify network traffic as either benign or botnet.

KDD Cup 1999 intrusion detection dataset: This dataset contains TCP connection records extracted from LAN network traffic at MIT Lincoln Labs over a period of 2 weeks. We have used the 10% version of the dataset, which is more concentrated and challenging than the full version. Each instance in the dataset refers either to a normal connection or to an attack. There are 22 types of attacks, such as buffer overflow, portsweep, guess-passwd, neptune, rootkit, smurf, and spy. So there are 23 different classes of data, including the normal class, and the goal is to classify an instance into one of these classes. This dataset is discussed in the experiment sections of Chapters 11 and 12.

NASA Aviation Safety Reporting Systems (ASRS) dataset: This dataset contains around 150,000 text documents. Each document is actually a report corresponding to a flight anomaly. There are a total of 55 anomalies. The goal is to classify a document into one of these anomalies. Details of this dataset are discussed in Chapter 11.

Forest cover dataset: This dataset contains geospatial descriptions of different types of forests. It contains seven classes, and the goal is to classify an instance into one of the forest classes. Details of this dataset are discussed in Chapter 10.

We compare our approaches with state-of-the-art data stream classification techniques on these datasets. On each dataset, our approaches show significantly better performance in both running time and classification accuracy. We also analyze the sensitivity of the different parameters used in our techniques with respect to classification accuracy and running time.

8.7 Our Contributions

In this book, we propose solutions to the data stream classification problems described above, namely, infinite length, concept drift, concept evolution, and limited labeled data.

An MPC ensemble for classifying concept-drifting data streams

We propose a generalized MPC ensemble technique that significantly reduces the expected classification error over the existing single-partition, single-chunk ensemble methods. The MPC ensemble technique addresses both infinite length and concept drift.

We have theoretically justified the effectiveness of the MPC ensemble approach.

We apply the MPC ensemble on synthetically generated data as well as on real botnet traffic, and achieve better detection accuracies than other stream data classification techniques.

Classification and novel class detection in concept-drifting data streams (ECSMiner)

To the best of our knowledge, no other data stream classification technique addresses the concept-evolution problem. This is a major problem with data streams that must be dealt with. In this light, ECSMiner offers a more realistic solution to data stream classification. ECSMiner also addresses the infinite length and concept-drift problems.

ECSMiner offers a more practical framework for stream classification by introducing time constraints for delayed data labeling and for making classification decisions.

ECSMiner enriches a traditional classification model with a novel class detection mechanism.

We apply ECSMiner on both synthetic and real-world data and obtain much better results than the state-of-the-art stream classification algorithms.

Data stream classification with scarcely labeled training data (ReaSC)

We provide a solution to the more practical situation of data stream classification in which labeled data are scarce. We show that ReaSC, using only 20% labeled training data, achieves better classification accuracy on real-life datasets than other stream classification approaches that use 100% labeled training data. ReaSC addresses the infinite length, concept-drift, and limited labeled data problems.

ReaSC applies semisupervised clustering to build classification models. To the best of our knowledge, no other stream data classification algorithm applies semisupervised clustering.

We propose an efficient semisupervised clustering algorithm based on a cluster-impurity measure.

We believe that the proposed methods provide promising, powerful, and practical techniques to the stream classification problem in general.

8.8 Summary and Directions

This chapter has stressed the need for mining data streams and discussed the challenges involved: infinite length, concept drift, concept evolution, and limited labeled data. We also provided an overview of our approach to mining data streams. Specifically, our approach determines whether an item belongs to a preexisting class or to a novel class.

Our work has applications in numerous fields. In the real world, it could very well be the case that an item does not belong to a specific class. For example, it does not have to be either black or white; we could create a new color called "BLITE" that has properties of both black and white. Our approaches will be elaborated in Chapters 9 through 13.

References

[AGGA06]. C.C. Aggarwal, J. Han, J. Wang, P.S. Yu. “A Framework for On-Demand Classification of Evolving Data Streams,” IEEE Transactions on Knowledge and Data Engineering, 18 (5), 577–589, 2006.

[CAI04]. Y.D. Cai, D. Clutter, G. Pape, J. Han, M. Welge, L. Auvil, "MAIDS: Mining Alarming Incidents from Data Streams," In 23rd ACM SIGMOD International Conference on Management of Data, Paris, France, June 13–18, ACM, 2004.

[CHEN08]. S. Chen, H. Wang, S. Zhou, P.S. Yu, “Stop Chasing Trends: Discovering High Order Models in Evolving Data,” In ICDE ’08: Proceedings of the 24th International Conference on Data Engineering, Cancun, Mexico, April 7–12, pp. 923–932, IEEE Computer Society, 2008.

[CHEN02]. Y. Chen, G. Dong, J. Han, B.W. Wah, J. Wang, “Multi-Dimensional Regression Analysis of Time-Series Data Streams,” In VLDB ’02: Proceedings of the 28th International Conference on Very Large Data Bases, Hong Kong, China, August 20–23, pp. 323–334, VLDB Endowment, 2002.

[CHI05]. Y. Chi, P.S. Yu, H. Wang, R.R. Muntz, “Loadstar: A Load Shedding Scheme for Classifying Data Streams,” In SDM ’05: Proceedings of the 2005 SIAM International Conference on Data Mining, Newport Beach, CA, USA, April 21–23, p. 3, SIAM, 2005.

[CRUP04]. V. Crupi, E. Guglielmino, G. Milazzo, “Neural-Network-Based System for Novel Fault Detection in Rotating Machinery,” Journal of Vibration and Control, 10(8):1137–1150, 2004.

[DING02]. Q. Ding, Q. Ding, W. Perrizo, "Decision Tree Classification of Spatial Data Streams Using Peano Count Trees," In SAC '02: Proceedings of the 2002 ACM Symposium on Applied Computing, Madrid, Spain, March 10–14, pp. 413–417, ACM, 2002.

[DOMI00]. P. Domingos and G. Hulten, “Mining High-Speed Data Streams,” In KDD ’00: Proceedings of the 2000 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 20–23, Boston, MA, USA, pp. 71–80, ACM, 2000.

[FAN04a]. W. Fan, Y.-A. Huang, H. Wang, P.S. Yu, "Active Mining of Data Streams," In SDM '04: Proceedings of the 2004 SIAM International Conference on Data Mining, April 22–24, Lake Buena Vista, FL, USA, pp. 457–461, SIAM, 2004.

[FAN04b]. W. Fan, “Systematic Data Selection to Mine Concept-Drifting Data Streams,” In KDD ’04: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 22–25, Seattle, WA, USA, pp. 128–137, ACM, 2004.

[GABE05]. M.M. Gaber, A. Zaslavsky, S. Krishnaswamy, “Mining Data Streams: A Review,” ACM SIGMOD Record, 34 (2), 18–26, June 2005.

[GANT01]. V. Ganti, J. Gehrke, R. Ramakrishnan, “Demon: Mining and Monitoring Evolving Data,” IEEE Transactions on Knowledge and Data Engineering, 13 (1), 50–63, 2001.

[GANT02]. V. Ganti, J. Gehrke, R. Ramakrishnan, “Mining Data Streams Under Block Evolution,” ACM SIGKDD Explorations Newsletter, 3 (2), 1–10, 2002.

[GAO07]. J. Gao, W. Fan, J. Han. “On Appropriate Assumptions to Mine Data Streams,” In ICDM ’07: Proceedings of the 2007 International Conference on Data Mining, October 28–31, Omaha, NE, USA, pp. 143–152, IEEE Computer Society, 2007.

[GEHR99]. J. Gehrke, V. Ganti, R. Ramakrishnan, W.-Y. Loh, "BOAT: Optimistic Decision Tree Construction," In SIGMOD '99: Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, June 1–3, Philadelphia, PA, USA, pp. 169–180, ACM, 1999.

[HULT01]. G. Hulten, L. Spencer, P. Domingos, “Mining Time-Changing Data Streams,” In KDD ’01: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 26–29, San Francisco, CA, USA, pp. 97–106, ACM, 2001.

[JIN03]. R. Jin and G. Agrawal, “Efficient Decision Tree Construction on Streaming Data,” In KDD ’03: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 24–27, Washington, DC, pp. 571–576, ACM, 2003.

[KHAN07]. L. Khan, M. Awad, B.M. Thuraisingham, “A New Intrusion Detection System Using Support Vector Machines and Hierarchical Clustering,” VLDB Journal, 16 (4), 507–521, 2007.

[KOLT03]. J.Z. Kolter and M.A. Maloof, "Dynamic Weighted Majority: A New Ensemble Method for Tracking Concept Drift," In ICDM '03: Proceedings of the Third IEEE International Conference on Data Mining, November 19–22, Melbourne, FL, USA, pp. 123–130, IEEE Computer Society, 2003.

[KOLT05]. J.Z. Kolter and M.A. Maloof, "Using Additive Expert Ensembles to Cope with Concept Drift," In ICML '05: Proceedings of the Twenty-Second International Conference on Machine Learning, August 7–11, Bonn, Germany, pp. 449–456, Morgan Kaufmann, 2005.

[KUNC08]. L.I. Kuncheva and J. Salvador Sánchez, "Nearest Neighbour Classifiers for Streaming Data with Delayed Labelling," In ICDM '08: Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, December 15–19, Pisa, Italy, pp. 869–874, IEEE Computer Society, 2008.

[LAST02]. M. Last, “Online Classification of Nonstationary Data Streams,” Intelligent Data Analysis, 6(2), 129–147, 2002.

[LEMO06]. R. Lemos, "Bot Software Looks to Improve Peerage," http://www.securityfocus.com/news/11390, 2006.

[MARK03a]. M. Markou and S. Singh, “Novelty Detection: A Review—Part 2: Neural Network Based Approaches,” Signal Processing, 83(12), 2499–2521, 2003.

[MASU07a]. M.M. Masud, L. Khan, B.M. Thuraisingham, “E-mail Worm Detection Using Data Mining,” International Journal of Information Security and Privacy, 1 (4), 47–61, 2007.

[MASU07b]. M.M. Masud, L. Khan, B.M. Thuraisingham, “Feature Based Techniques for Auto-Detection of Novel Email Worms,” In PAKDD ’07: Proceedings of the 11th Pacific-Asia Conference on Knowledge Discovery and Data Mining, May 22–25, Springer-Verlag, Nanjing, China, pp. 205–216, 2007.

[MASU06]. M.M. Masud, L. Khan, E. Al-Shaer, "Email Worm Detection Using Naïve Bayes and Support Vector Machine," In ISI '06: Proceedings of the 2006 IEEE Intelligence and Security Informatics Conference, May 23–24, San Diego, CA, USA, pp. 733–734, IEEE Computer Society, 2006.

[MASU07c]. M.M. Masud, L. Khan, B.M. Thuraisingham, “A Hybrid Model to Detect Malicious Executables,” In ICC ’07: Proceedings of the 2007 IEEE International Conference on Communications, June 24–28, Glasgow, Scotland, pp. 1443–1448, IEEE Computer Society, 2007.

[MASU08a]. M.M. Masud, J. Gao, L. Khan, J. Han, B.M. Thuraisingham, "A Practical Approach to Classify Evolving Data Streams: Training with Limited Amount of Labeled Data," In ICDM '08: Proceedings of the 2008 International Conference on Data Mining, December 15–19, Pisa, Italy, pp. 929–934, IEEE Computer Society, 2008.

[MASU08b]. M.M. Masud, L. Khan, B.M. Thuraisingham, "A Scalable Multi-Level Feature Extraction Technique to Detect Malicious Executables," Information Systems Frontiers, 10 (1), 33–45, 2008.

[MASU09a]. M.M. Masud, J. Gao, L. Khan, J. Han, B.M. Thuraisingham, "A Multi-Partition Multi-Chunk Ensemble Technique to Classify Concept-Drifting Data Streams," In PAKDD '09: Proceedings of the 13th Pacific-Asia Conference on Knowledge Discovery and Data Mining, April 27–30, Bangkok, Thailand, pp. 363–375, Springer-Verlag, 2009 (also Advances in Knowledge Discovery and Data Mining).

[MASU09b]. M.M. Masud, J. Gao, L. Khan, J. Han, B.M. Thuraisingham, "Integrating Novel Class Detection with Classification for Concept-Drifting Data Streams," In ECML PKDD '09: Proceedings of the 2009 European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, volume II, September 7–11, Bled, Slovenia, pp. 79–94, Springer-Verlag, 2009.

[ROBE00]. S.J. Roberts, “Extreme Value Statistics for Novelty Detection in Biomedical Signal Processing,” In Proceedings of the First International Conference on Advances in Medical Signal and Information Processing, Bristol, UK, pp. 166–172, 2000.

[SCHO05]. M. Scholz and R. Klinkenberg, "An Ensemble Classifier for Drifting Concepts," In IWKDDS '05: Proceedings of the Second International Workshop on Knowledge Discovery in Data Streams, October 3–7, Porto, Portugal, pp. 53–64, 2005.

[WANG03]. H. Wang, W. Fan, P.S. Yu, J. Han, "Mining Concept-Drifting Data Streams Using Ensemble Classifiers," In KDD '03: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 24–27, Washington, DC, USA, pp. 226–235, ACM, 2003.

[WANG06]. H. Wang, J. Yin, J. Pei, P.S. Yu, J.X. Yu, “Suppressing Model Overfitting in Mining Concept-Drifting Data Streams,” In KDD ’06: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, August 20–23, pp. 736–741, ACM, 2006.

[WANG07]. P. Wang, H. Wang, X. Wu, W. Wang, B. Shi, "A Low-Granularity Classifier for Data Streams with Concept Drifts and Biased Class Distribution," IEEE Transactions on Knowledge and Data Engineering, 19 (9), 1202–1213, 2007.

[WOOL09]. C. Woolam, M.M. Masud, L. Khan, “Lacking Labels in the Stream: Classifying Evolving Stream Data with Few Labels,” In ISMIS ’09: Proceedings of the 18th International Symposium on Methodologies for Intelligent Systems, Springer, Prague, Czech Republic, September 14–17, pp. 552–562, 2009.

[YAMA01]. K. Yamanishi and J.-I. Takeuchi, "Discovering Outlier Filtering Rules from Unlabeled Data: Combining a Supervised Learner with an Unsupervised Learner," In KDD '01: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 26–29, San Francisco, CA, USA, pp. 389–394, ACM, 2001.

[YANG05]. Y. Yang, X. Wu, X. Zhu, “Combining Proactive and Reactive Predictions for Data Streams,” In KDD ’05: Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, Chicago, IL, USA, August 21–24, pp. 710–715, ACM, 2005.
