Formatted contents note |
Chapter 1 Introduction 1<br/>1.1 Why Data Mining? 1<br/>1.1.1 Moving toward the Information Age 1<br/>1.1.2 Data Mining as the Evolution of Information Technology 2<br/>1.2 What Is Data Mining? 5<br/>1.3 What Kinds of Data Can Be Mined? 8<br/>1.3.1 Database Data 9<br/>1.3.2 Data Warehouses 10<br/>1.3.3 Transactional Data 13<br/>1.3.4 Other Kinds of Data 14<br/>1.4 What Kinds of Patterns Can Be Mined? 15<br/>1.4.1 Class/Concept Description: Characterization and Discrimination<br/>1.4.2 Mining Frequent Patterns, Associations, and Correlations 17<br/>1.4.3 Classification and Regression for Predictive Analysis 18<br/>1.4.4 Cluster Analysis 19<br/>1.4.5 Outlier Analysis 20<br/>1.4.6 Are All Patterns Interesting? 21<br/>1.5 Which Technologies Are Used? 23<br/>1.5.1 Statistics 23<br/>1.5.2 Machine Learning 24<br/>1.5.3 Database Systems and Data Warehouses 26<br/>1.5.4 Information Retrieval 26<br/>1.6 Which Kinds of Applications Are Targeted? 27<br/>1.6.1 Business Intelligence 27<br/>1.6.2 Web Search Engines 28<br/>1.7 Major Issues in Data Mining 29<br/>1.7.1 Mining Methodology 29<br/>1.7.2 User Interaction 30<br/>1.7.3 Efficiency and Scalability 31<br/>1.7.4 Diversity of Database Types 32<br/>1.7.5 Data Mining and Society 32<br/>1.8 Summary 33<br/>1.9 Exercises 34<br/>1.10 Bibliographic Notes 35<br/>Chapter 2 Getting to Know Your Data 39<br/>2.1 Data Objects and Attribute Types 40<br/>2.1.1 What Is an Attribute? 40<br/>2.1.2 Nominal Attributes 41<br/>2.1.3 Binary Attributes 41<br/>2.1.4 Ordinal Attributes 42<br/>2.1.5 Numeric Attributes 43<br/>2.1.6 Discrete versus Continuous Attributes 44<br/>2.2 Basic Statistical Descriptions of Data 44<br/>2.2.1 Measuring the Central Tendency:
Mean, Median, and Mode 45<br/>2.2.2 Measuring the Dispersion of Data: Range, Quartiles, Variance, Standard Deviation, and Interquartile Range 48<br/>2.2.3 Graphic Displays of Basic Statistical Descriptions of Data 51<br/>2.3 Data Visualization 56<br/>2.3.1 Pixel-Oriented Visualization Techniques 57<br/>2.3.2 Geometric Projection Visualization Techniques 58<br/>2.3.3 Icon-Based Visualization Techniques 60<br/>2.3.4 Hierarchical Visualization Techniques 63<br/>2.3.5 Visualizing Complex Data and Relations 64<br/>2.4 Measuring Data Similarity and Dissimilarity 65<br/>2.4.1 Data Matrix versus Dissimilarity Matrix 67<br/>2.4.2 Proximity Measures for Nominal Attributes 68<br/>2.4.3 Proximity Measures for Binary Attributes 70<br/>2.4.4 Dissimilarity of Numeric Data: Minkowski Distance 72<br/>2.4.5 Proximity Measures for Ordinal Attributes 74<br/>2.4.6 Dissimilarity for Attributes of Mixed Types 75<br/>2.4.7 Cosine Similarity 77<br/>2.5 Summary 79<br/>2.6 Exercises 79<br/>2.7 Bibliographic Notes 81<br/>Chapter 3 Data Preprocessing 83<br/>3.1 Data Preprocessing: An Overview 84<br/>3.1.1 Data Quality: Why Preprocess the Data?
84<br/>3.1.2 Major Tasks in Data Preprocessing 85<br/>3.2 Data Cleaning 88<br/>3.2.1 Missing Values 88<br/>3.2.2 Noisy Data 89<br/>3.2.3 Data Cleaning as a Process 91<br/>3.3 Data Integration 93<br/>3.3.1 Entity Identification Problem 94<br/>3.3.2 Redundancy and Correlation Analysis 94<br/>3.3.3 Tuple Duplication 98<br/>3.3.4 Data Value Conflict Detection and Resolution 99<br/>3.4 Data Reduction 99<br/>3.4.1 Overview of Data Reduction Strategies 99<br/>3.4.2 Wavelet Transforms 100<br/>3.4.3 Principal Components Analysis 102<br/>3.4.4 Attribute Subset Selection 103<br/>3.4.5 Regression and Log-Linear Models: Parametric Data Reduction 105<br/>3.4.6 Histograms 106<br/>3.4.7 Clustering 108<br/>3.4.8 Sampling 108<br/>3.4.9 Data Cube Aggregation 110<br/>3.5 Data Transformation and Data Discretization 111<br/>3.5.1 Data Transformation Strategies Overview 112<br/>3.5.2 Data Transformation by Normalization 113<br/>3.5.3 Discretization by Binning 115<br/>3.5.4 Discretization by Histogram Analysis 115<br/>3.5.5 Discretization by Cluster, Decision Tree, and Correlation Analyses 116<br/>3.5.6 Concept Hierarchy Generation for Nominal Data 117<br/>3.6 Summary 120<br/>3.7 Exercises 121<br/>3.8 Bibliographic Notes 123<br/>Chapter 4 Data Warehousing and Online Analytical Processing 125<br/>4.1 Data Warehouse: Basic Concepts 125<br/>4.1.1 What Is a Data Warehouse? 126<br/>4.1.2 Differences between Operational Database Systems and Data Warehouses 128<br/>4.1.3 But Why Have a Separate Data Warehouse? 129<br/>4.1.4 Data Warehousing: A Multitiered Architecture 130<br/>4.1.5 Data Warehouse Models: Enterprise Warehouse, Data Mart, and Virtual Warehouse 132<br/>4.1.6 Extraction, Transformation, and Loading 134<br/>4.1.7 Metadata Repository 134<br/>4.2 Data Warehouse Modeling: Data Cube and OLAP 135<br/>4.2.1 Data Cube: A Multidimensional Data Model 136<br/>4.2.2 Stars, Snowflakes,
and Fact Constellations: Schemas for Multidimensional Data Models 139<br/>4.2.3 Dimensions: The Role of Concept Hierarchies 142<br/>4.2.4 Measures: Their Categorization and Computation 144<br/>4.2.5 Typical OLAP Operations 146<br/>4.2.6 A Starnet Query Model for Querying Multidimensional Databases 149<br/>4.3 Data Warehouse Design and Usage 150<br/>4.3.1 A Business Analysis Framework for Data Warehouse Design 150<br/>4.3.2 Data Warehouse Design Process 151<br/>4.3.3 Data Warehouse Usage for Information Processing 153<br/>4.3.4 From Online Analytical Processing to Multidimensional Data Mining 155<br/>4.4 Data Warehouse Implementation 156<br/>4.4.1 Efficient Data Cube Computation: An Overview 156<br/>4.4.2 Indexing OLAP Data: Bitmap Index and Join Index 160<br/>4.4.3 Efficient Processing of OLAP Queries 163<br/>4.4.4 OLAP Server Architectures: ROLAP versus MOLAP versus HOLAP 164<br/>4.5 Data Generalization by Attribute-Oriented Induction 166<br/>4.5.1 Attribute-Oriented Induction for Data Characterization 167<br/>4.5.2 Efficient Implementation of Attribute-Oriented Induction 172<br/>4.5.3 Attribute-Oriented Induction for Class Comparisons 175<br/>4.6 Summary 178<br/>4.7 Exercises 180<br/>4.8 Bibliographic Notes 184<br/>Chapter 5 Data Cube Technology 187<br/>5.1 Data Cube Computation: Preliminary Concepts 188<br/>5.1.1 Cube Materialization: Full Cube, Iceberg Cube,
Closed Cube, and Cube Shell 188<br/>5.1.2 General Strategies for Data Cube Computation 192<br/>5.2 Data Cube Computation Methods 194<br/>5.2.1 Multiway Array Aggregation for Full Cube Computation 195<br/>5.2.2 BUC: Computing Iceberg Cubes from the Apex Cuboid Downward 200<br/>5.2.3 Star-Cubing: Computing Iceberg Cubes Using a Dynamic Star-Tree Structure 204<br/>5.2.4 Precomputing Shell Fragments for Fast High-Dimensional OLAP 210<br/>5.3 Processing Advanced Kinds of Queries by Exploring Cube Technology 218<br/>5.3.1 Sampling Cubes: OLAP-Based Mining on Sampling Data 218<br/>5.3.2 Ranking Cubes: Efficient Computation of Top-k Queries 225<br/>5.4 Multidimensional Data Analysis in Cube Space 227<br/>5.4.1 Prediction Cubes: Prediction Mining in Cube Space 227<br/>5.4.2 Multifeature Cubes: Complex Aggregation at Multiple Granularities 230<br/>5.4.3 Exception-Based, Discovery-Driven Cube Space Exploration 231<br/>5.5 Summary 234<br/>5.6 Exercises 235<br/>5.7 Bibliographic Notes 240<br/>Chapter 6 Mining Frequent Patterns, Associations, and Correlations: Basic Concepts and Methods 243<br/>6.1 Basic Concepts 243<br/>6.1.1 Market Basket Analysis: A Motivating Example 244<br/>6.1.2 Frequent Itemsets, Closed Itemsets, and Association Rules 246<br/>6.2 Frequent Itemset Mining Methods 248<br/>6.2.1 Apriori Algorithm: Finding Frequent Itemsets by Confined Candidate Generation 248<br/>6.2.2 Generating Association Rules from Frequent Itemsets 254<br/>6.2.3 Improving the Efficiency of Apriori 254<br/>6.2.4 A Pattern-Growth Approach for Mining Frequent Itemsets 257<br/>6.2.5 Mining Frequent Itemsets Using Vertical Data Format 259<br/>6.2.6 Mining Closed and Max Patterns 262<br/>6.3 Which Patterns Are Interesting?—Pattern Evaluation Methods 264<br/>6.3.1 Strong Rules Are Not Necessarily Interesting 264<br/>6.3.2 From Association Analysis to Correlation Analysis 265<br/>6.3.3 A Comparison of Pattern Evaluation Measures 267<br/>6.4 Summary 271<br/>6.5
Exercises 273<br/>6.6 Bibliographic Notes 276<br/>Chapter 7 Advanced Pattern Mining 279<br/>7.1 Pattern Mining: A Road Map 279<br/>7.2 Pattern Mining in Multilevel, Multidimensional Space 283<br/>7.2.1 Mining Multilevel Associations 283<br/>7.2.2 Mining Multidimensional Associations 287<br/>7.2.3 Mining Quantitative Association Rules 289<br/>7.2.4 Mining Rare Patterns and Negative Patterns 291<br/>7.3 Constraint-Based Frequent Pattern Mining 294<br/>7.3.1 Metarule-Guided Mining of Association Rules 295<br/>7.3.2 Constraint-Based Pattern Generation: Pruning Pattern Space and Pruning Data Space 296<br/>7.4 Mining High-Dimensional Data and Colossal Patterns 301<br/>7.4.1 Mining Colossal Patterns by Pattern-Fusion 302<br/>7.5 Mining Compressed or Approximate Patterns 307<br/>7.5.1 Mining Compressed Patterns by Pattern Clustering 308<br/>7.5.2 Extracting Redundancy-Aware Top-k Patterns 310<br/>7.6 Pattern Exploration and Application 313<br/>7.6.1 Semantic Annotation of Frequent Patterns 313<br/>7.6.2 Applications of Pattern Mining 317<br/>7.7 Summary 319<br/>7.8 Exercises 321<br/>7.9 Bibliographic Notes 323<br/>Chapter 8 Classification: Basic Concepts 327<br/>8.1 Basic Concepts 327<br/>8.1.1 What Is Classification?
327<br/>8.1.2 General Approach to Classification 328<br/>8.2 Decision Tree Induction 330<br/>8.2.1 Decision Tree Induction 332<br/>8.2.2 Attribute Selection Measures 336<br/>8.2.3 Tree Pruning 344<br/>8.2.4 Scalability and Decision Tree Induction 347<br/>8.2.5 Visual Mining for Decision Tree Induction 348<br/>8.3 Bayes Classification Methods 350<br/>8.3.1 Bayes' Theorem 350<br/>8.3.2 Naive Bayesian Classification 351<br/>8.4 Rule-Based Classification 355<br/>8.4.1 Using IF-THEN Rules for Classification 355<br/>8.4.2 Rule Extraction from a Decision Tree 357<br/>8.4.3 Rule Induction Using a Sequential Covering Algorithm 359<br/>8.5 Model Evaluation and Selection 364<br/>8.5.1 Metrics for Evaluating Classifier Performance 364<br/>8.5.2 Holdout Method and Random Subsampling 370<br/>8.5.3 Cross-Validation 370<br/>8.5.4 Bootstrap 371<br/>8.5.5 Model Selection Using Statistical Tests of Significance 372<br/>8.5.6 Comparing Classifiers Based on Cost-Benefit and ROC Curves 373<br/>8.6 Techniques to Improve Classification Accuracy 377<br/>8.6.1 Introducing Ensemble Methods 378<br/>8.6.2 Bagging 379<br/>8.6.3 Boosting and AdaBoost 380<br/>8.6.4 Random Forests 382<br/>8.6.5 Improving Classification Accuracy of Class-Imbalanced Data 383<br/>8.7 Summary 385<br/>8.8 Exercises 386<br/>8.9 Bibliographic Notes 389<br/>Chapter 9 Classification: Advanced Methods 393<br/>9.1 Bayesian Belief Networks 393<br/>9.1.1 Concepts and Mechanisms 394<br/>9.1.2 Training Bayesian Belief Networks 396<br/>9.2 Classification by Backpropagation 398<br/>9.2.1 A Multilayer Feed-Forward Neural Network 398<br/>9.2.2 Defining a Network Topology 400<br/>9.2.3 Backpropagation 400<br/>9.2.4 Inside the Black Box: Backpropagation and Interpretability 406<br/>9.3 Support Vector Machines 408<br/>9.3.1 The Case When the Data Are Linearly Separable 408<br/>9.3.2 The Case When the Data Are Linearly Inseparable 413<br/>9.4 Classification Using Frequent Patterns 415<br/>9.4.1 Associative Classification 416<br/>9.4.2
Discriminative Frequent Pattern-Based Classification 419<br/>9.5 Lazy Learners (or Learning from Your Neighbors) 422<br/>9.5.1 k-Nearest-Neighbor Classifiers 423<br/>9.5.2 Case-Based Reasoning 425<br/>9.6 Other Classification Methods 426<br/>9.6.1 Genetic Algorithms 426<br/>9.6.2 Rough Set Approach 427<br/>9.6.3 Fuzzy Set Approaches 428<br/>9.7 Additional Topics Regarding Classification 429<br/>9.7.1 Multiclass Classification 430<br/>9.7.2 Semi-Supervised Classification 432<br/>9.7.3 Active Learning 433<br/>9.7.4 Transfer Learning 434<br/>9.8 Summary 436<br/>9.9 Exercises 438<br/>9.10 Bibliographic Notes 439<br/>Chapter 10 Cluster Analysis: Basic Concepts and Methods 443<br/>10.1 Cluster Analysis 444<br/>10.1.1 What Is Cluster Analysis? 444<br/>10.1.2 Requirements for Cluster Analysis 445<br/>10.1.3 Overview of Basic Clustering Methods 448<br/>10.2 Partitioning Methods 451<br/>10.2.1 k-Means: A Centroid-Based Technique 451<br/>10.2.2 k-Medoids: A Representative Object-Based Technique 454<br/>10.3 Hierarchical Methods 457<br/>10.3.1 Agglomerative versus Divisive Hierarchical Clustering 459<br/>10.3.2 Distance Measures in Algorithmic Methods 461<br/>10.3.3 BIRCH: Multiphase Hierarchical Clustering Using Clustering Feature Trees 462<br/>10.3.4 Chameleon: Multiphase Hierarchical Clustering Using Dynamic Modeling 466<br/>10.3.5 Probabilistic Hierarchical Clustering 467<br/>10.4 Density-Based Methods 471<br/>10.4.1 DBSCAN: Density-Based Clustering Based on Connected Regions with High Density 471<br/>10.4.2 OPTICS: Ordering Points to Identify the Clustering Structure 473<br/>10.4.3 DENCLUE: Clustering Based on Density Distribution Functions 476<br/>10.5 Grid-Based Methods 479<br/>10.5.1 STING: STatistical INformation Grid 479<br/>10.5.2 CLIQUE: An Apriori-like Subspace Clustering Method 481<br/>10.6 Evaluation of Clustering 483<br/>10.6.1 Assessing Clustering Tendency 484<br/>10.6.2 Determining the Number of Clusters 486<br/>10.6.3 Measuring Clustering Quality
487<br/>10.7 Summary 490<br/>10.8 Exercises 491<br/>10.9 Bibliographic Notes 494<br/>Chapter 11 Advanced Cluster Analysis 497<br/>11.1 Probabilistic Model-Based Clustering 497<br/>11.1.1 Fuzzy Clusters 499<br/>11.1.2 Probabilistic Model-Based Clusters 501<br/>11.1.3 Expectation-Maximization Algorithm 505<br/>11.2 Clustering High-Dimensional Data 508<br/>11.2.1 Clustering High-Dimensional Data: Problems, Challenges, and Major Methodologies 508<br/>11.2.2 Subspace Clustering Methods 510<br/>11.2.3 Biclustering 512<br/>11.2.4 Dimensionality Reduction Methods and Spectral Clustering 519<br/>11.3 Clustering Graph and Network Data 522<br/>11.3.1 Applications and Challenges 523<br/>11.3.2 Similarity Measures 525<br/>11.3.3 Graph Clustering Methods 528<br/>11.4 Clustering with Constraints 532<br/>11.4.1 Categorization of Constraints 533<br/>11.4.2 Methods for Clustering with Constraints 535<br/>11.5 Summary 538<br/>11.6 Exercises 539<br/>11.7 Bibliographic Notes 540<br/>Chapter 12 Outlier Detection 543<br/>12.1 Outliers and Outlier Analysis 544<br/>12.1.1 What Are Outliers?
544<br/>12.1.2 Types of Outliers 545<br/>12.1.3 Challenges of Outlier Detection 548<br/>12.2 Outlier Detection Methods 549<br/>12.2.1 Supervised, Semi-Supervised, and Unsupervised Methods 549<br/>12.2.2 Statistical Methods, Proximity-Based Methods, and Clustering-Based Methods 551<br/>12.3 Statistical Approaches 553<br/>12.3.1 Parametric Methods 553<br/>12.3.2 Nonparametric Methods 558<br/>12.4 Proximity-Based Approaches 560<br/>12.4.1 Distance-Based Outlier Detection and a Nested Loop Method 561<br/>12.4.2 A Grid-Based Method 562<br/>12.4.3 Density-Based Outlier Detection 564<br/>12.5 Clustering-Based Approaches 567<br/>12.6 Classification-Based Approaches 571<br/>12.7 Mining Contextual and Collective Outliers 573<br/>12.7.1 Transforming Contextual Outlier Detection to Conventional Outlier Detection 573<br/>12.7.2 Modeling Normal Behavior with Respect to Contexts 574<br/>12.7.3 Mining Collective Outliers 575<br/>12.8 Outlier Detection in High-Dimensional Data 576<br/>12.8.1 Extending Conventional Outlier Detection 577<br/>12.8.2 Finding Outliers in Subspaces 578<br/>12.8.3 Modeling High-Dimensional Outliers 579<br/>12.9 Summary 581<br/>12.10 Exercises 582<br/>12.11 Bibliographic Notes 583<br/>Chapter 13 Data Mining Trends and Research Frontiers 585<br/>13.1 Mining Complex Data Types 585<br/>13.1.1 Mining Sequence Data: Time-Series, Symbolic Sequences, and Biological Sequences 586<br/>13.1.2 Mining Graphs and Networks 591<br/>13.1.3 Mining Other Kinds of Data 595<br/>13.2 Other Methodologies of Data Mining 598<br/>13.2.1 Statistical Data Mining 598<br/>13.2.2 Views on Data Mining Foundations 600<br/>13.2.3 Visual and Audio Data Mining 602<br/>13.3 Data Mining Applications 607<br/>13.3.1 Data Mining for Financial Data Analysis 607<br/>13.3.2 Data Mining for Retail and Telecommunication Industries<br/>13.3.3 Data Mining in Science and Engineering 611<br/>13.3.4 Data Mining for Intrusion Detection and Prevention 614<br/>13.3.5 Data Mining and
Recommender Systems 615<br/>13.4 Data Mining and Society 618<br/>13.4.1 Ubiquitous and Invisible Data Mining 618<br/>13.4.2 Privacy, Security, and Social Impacts of Data Mining 620<br/>13.5 Data Mining Trends 622<br/>13.6 Summary 625<br/>13.7 Exercises 626<br/>13.8 Bibliographic Notes 628<br/>Bibliography 633<br/>Index 673 |