Discovering Knowledge in Data.pdf

Frontmatter
DISCOVERING
KNOWLEDGE IN DATA
An Introduction to Data Mining
DANIEL T. LAROSE
Director of Data Mining
Central Connecticut State University
A JOHN WILEY & SONS, INC., PUBLICATION
Copyright © 2005 by John Wiley & Sons, Inc. All rights reserved.
Published by John Wiley & Sons, Inc., Hoboken, New Jersey.
Published simultaneously in Canada.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form
or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as
permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior
written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to
the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400,
fax 978-646-8600, or on the web at www.copyright.com. Requests to the Publisher for permission should
be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken,
NJ 07030, (201) 748-6011, fax (201) 748-6008.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in
preparing this book, they make no representations or warranties with respect to the accuracy or
completeness of the contents of this book and specifically disclaim any implied warranties of
merchantability or fitness for a particular purpose. No warranty may be created or extended by sales
representatives or written sales materials. The advice and strategies contained herein may not be suitable
for your situation. You should consult with a professional where appropriate. Neither the publisher nor
author shall be liable for any loss of profit or any other commercial damages, including but not limited to
special, incidental, consequential, or other damages.
For general information on our other products and services please contact our Customer Care Department
within the U.S. at 877-762-2974, outside the U.S. at 317-572-3993 or fax 317-572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print,
however, may not be available in electronic format.
Library of Congress Cataloging-in-Publication Data:
Larose, Daniel T.
Discovering knowledge in data : an introduction to data mining / Daniel T. Larose
p. cm.
Includes bibliographical references and index.
ISBN 0-471-66657-2 (cloth)
1. Data mining. I. Title.
QA76.9.D343L38 2005
006.3'12—dc22
2004003680
Printed in the United States of America
10 9 8 7 6 5 4 3 2 1
Dedication
To my parents,
And their parents,
And so on...
For my children,
And their children,
And so on...
2004 Chantal Larose
CONTENTS
PREFACE
1 INTRODUCTION TO DATA MINING
What Is Data Mining?
Why Data Mining?
Need for Human Direction of Data Mining
Cross-Industry Standard Process: CRISP–DM
Case Study 1: Analyzing Automobile Warranty Claims: Example of the CRISP–DM Industry Standard Process in Action
Fallacies of Data Mining
What Tasks Can Data Mining Accomplish?
Description
Estimation
Prediction
Classification
Clustering
Association
Case Study 2: Predicting Abnormal Stock Market Returns Using Neural Networks
Case Study 3: Mining Association Rules from Legal Databases
Case Study 4: Predicting Corporate Bankruptcies Using Decision Trees
Case Study 5: Profiling the Tourism Market Using k-Means Clustering Analysis
References
Exercises
2 DATA PREPROCESSING
Why Do We Need to Preprocess the Data?
Data Cleaning
Handling Missing Data
Identifying Misclassifications
Graphical Methods for Identifying Outliers
Data Transformation
Min–Max Normalization
Z-Score Standardization
Numerical Methods for Identifying Outliers
References
Exercises
3 EXPLORATORY DATA ANALYSIS
Hypothesis Testing versus Exploratory Data Analysis
Getting to Know the Data Set
Dealing with Correlated Variables
Exploring Categorical Variables
Using EDA to Uncover Anomalous Fields
Exploring Numerical Variables
Exploring Multivariate Relationships
Selecting Interesting Subsets of the Data for Further Investigation
Binning
Summary
References
Exercises
4 STATISTICAL APPROACHES TO ESTIMATION AND PREDICTION
Data Mining Tasks in Discovering Knowledge in Data
Statistical Approaches to Estimation and Prediction
Univariate Methods: Measures of Center and Spread
Statistical Inference
How Confident Are We in Our Estimates?
Confidence Interval Estimation
Bivariate Methods: Simple Linear Regression
Dangers of Extrapolation
Confidence Intervals for the Mean Value of y Given x
Prediction Intervals for a Randomly Chosen Value of y Given x
Multiple Regression
Verifying Model Assumptions
References
Exercises
5 k-NEAREST NEIGHBOR ALGORITHM
Supervised versus Unsupervised Methods
Methodology for Supervised Modeling
Bias–Variance Trade-Off
Classification Task
k-Nearest Neighbor Algorithm
Distance Function
Combination Function
Simple Unweighted Voting
Weighted Voting
Quantifying Attribute Relevance: Stretching the Axes
Database Considerations
k-Nearest Neighbor Algorithm for Estimation and Prediction
Choosing k
Reference
Exercises