Introduction Pertemuan 01 Matakuliah: M0614 / Data Mining & OLAP Tahun : Feb - 2010.

26

Transcript of Introduction Pertemuan 01 Matakuliah: M0614 / Data Mining & OLAP Tahun : Feb - 2010.

Page 1: Introduction Pertemuan 01 Matakuliah: M0614 / Data Mining & OLAP Tahun : Feb - 2010.
Page 2: Introduction Pertemuan 01 Matakuliah: M0614 / Data Mining & OLAP Tahun : Feb - 2010.

Introduction Pertemuan 01

Matakuliah : M0614 / Data Mining & OLAP Tahun : Feb - 2010

Page 3: Introduction Pertemuan 01 Matakuliah: M0614 / Data Mining & OLAP Tahun : Feb - 2010.

Bina Nusantara

Pada akhir pertemuan ini, diharapkan mahasiswa

akan mampu :• Mahasiswa dapat menjelaskan konsep dasar dan

pengertian umum data mining. (C2)• Mahasiswa dapat menerangkan arsitektur data

mining. (C2)

Learning Outcomes

3

Page 4: Introduction Pertemuan 01 Matakuliah: M0614 / Data Mining & OLAP Tahun : Feb - 2010.

Bina Nusantara

Acknowledgments

These slides have been adapted from Han, J., Kamber, M., & Pei, Y. Data Mining: Concepts and Technique.

Page 5: Introduction Pertemuan 01 Matakuliah: M0614 / Data Mining & OLAP Tahun : Feb - 2010.

Bina Nusantara

• Why data mining• Data mining: On what kind of data?• Data mining functionality• Data mining algorithms• Integration data mining and data

warehousing• Architecture: Typical data mining system• Major issues in data mining• Summary

Outline Materi

5

Page 6: Introduction Pertemuan 01 Matakuliah: M0614 / Data Mining & OLAP Tahun : Feb - 2010.

Why Data Mining?• The Explosive Growth of Data: from terabytes to petabytes

– Data collection and data availability

• Automated data collection tools, database systems, Web, computerized

society

– Major sources of abundant data

• Business: Web, e-commerce, transactions, stocks, …

• Science: Remote sensing, bioinformatics, scientific simulation …

• Society and everyone: news, digital cameras, YouTube

• We are drowning in data, but starving for knowledge!

• “Necessity is the mother of invention”—Data mining—Automated analysis of

massive data sets

Page 7: Introduction Pertemuan 01 Matakuliah: M0614 / Data Mining & OLAP Tahun : Feb - 2010.

Why Mine Data? Commercial Viewpoint

• Lots of data is being collected and warehoused – Web data, e-commerce– purchases at department/

grocery stores– Bank/Credit Card

transactions

• Computers have become cheaper and more powerful

• Competitive Pressure is Strong – Provide better, customized services for an edge (e.g. in

Customer Relationship Management)

Page 8: Introduction Pertemuan 01 Matakuliah: M0614 / Data Mining & OLAP Tahun : Feb - 2010.

Mining Large Data Sets - Motivation

• There is often information “hidden” in the data that is not readily evident

• Human analysts may take weeks to discover useful information• Much of the data is never analyzed at all

0

500,000

1,000,000

1,500,000

2,000,000

2,500,000

3,000,000

3,500,000

4,000,000

1995 1996 1997 1998 1999

The Data Gap

Total new disk (TB) since 1995

Number of analysts

Page 9: Introduction Pertemuan 01 Matakuliah: M0614 / Data Mining & OLAP Tahun : Feb - 2010.

What Is Data Mining?

• Data mining (knowledge discovery from data)

– Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amount of data

– Data mining: a misnomer?

• Alternative names

– Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

• Watch out: Is everything “data mining”?

– Simple search and query processing

– (Deductive) expert systems

Page 10: Introduction Pertemuan 01 Matakuliah: M0614 / Data Mining & OLAP Tahun : Feb - 2010.

What is (not) Data Mining?

• What is not Data Mining?– Look up phone number in

phone directory– Query a Web search

engine for information about “Amazon”

• What is Data Mining?– Certain names are more

prevalent in certain US locations (O’Brien, O’Rurke, O’Reilly… in Boston area)

– Group together similar documents returned by search engine according to their context (e.g. Amazon rainforest, Amazon.com,)

Page 11: Introduction Pertemuan 01 Matakuliah: M0614 / Data Mining & OLAP Tahun : Feb - 2010.

Knowledge Discovery (KDD) Process• Data mining—core of

knowledge discovery process

Data Cleaning

Data Integration

Databases

Data Warehouse

Task-relevant Data

Selection

Data Mining

Pattern Evaluation

Page 12: Introduction Pertemuan 01 Matakuliah: M0614 / Data Mining & OLAP Tahun : Feb - 2010.

Data Mining and Business IntelligenceIncreasing potentialto supportbusiness decisions End User

Business Analyst

DataAnalyst

DBA

Decision

MakingData Presentation

Visualization Techniques

Data MiningInformation Discovery

Data ExplorationStatistical Summary, Querying, and Reporting

Data Preprocessing/Integration, Data Warehouses

Data SourcesPaper, Files, Web documents, Scientific experiments, Database Systems

Page 13: Introduction Pertemuan 01 Matakuliah: M0614 / Data Mining & OLAP Tahun : Feb - 2010.

Data Mining: Confluence of Multiple Disciplines

Data Mining

Database Technology Statistics

MachineLearning

PatternRecognition

AlgorithmOther

Disciplines

Visualization

Page 14: Introduction Pertemuan 01 Matakuliah: M0614 / Data Mining & OLAP Tahun : Feb - 2010.

Why Not Traditional Data Analysis?• Tremendous amount of data

– Algorithms must be highly scalable to handle such as tera-bytes of data

• High-dimensionality of data

– Micro-array may have tens of thousands of dimensions

• High complexity of data

– Data streams and sensor data

– Time-series data, temporal data, sequence data

– Structure data, graphs, social networks and multi-linked data

– Heterogeneous databases and legacy databases

– Spatial, spatiotemporal, multimedia, text and Web data

– Software programs, scientific simulations

• New and sophisticated applications

Page 15: Introduction Pertemuan 01 Matakuliah: M0614 / Data Mining & OLAP Tahun : Feb - 2010.

Data Mining: On What Kinds of Data?• Database-oriented data sets and applications

– Relational database, data warehouse, transactional database

• Advanced data sets and advanced applications

– Data streams and sensor data

– Time-series data, temporal data, sequence data (incl. bio-sequences)

– Structure data, graphs, social networks and multi-linked data

– Object-relational databases

– Heterogeneous databases and legacy databases

– Spatial data and spatiotemporal data

– Multimedia database

– Text databases

– The World-Wide Web

Page 16: Introduction Pertemuan 01 Matakuliah: M0614 / Data Mining & OLAP Tahun : Feb - 2010.

Data Mining: Classification Schemes

• General functionality

– Descriptive data mining

– Predictive data mining

• Different views lead to different classifications

– Data view: Kinds of data to be mined

– Knowledge view: Kinds of knowledge to be discovered

– Method view: Kinds of techniques utilized

– Application view: Kinds of applications adapted

Page 17: Introduction Pertemuan 01 Matakuliah: M0614 / Data Mining & OLAP Tahun : Feb - 2010.

Data Mining Functionalities• Multidimensional concept description: Characterization and discrimination

– Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions

• Frequent patterns, association, correlation vs. causality

– Diaper Beer [0.5%, 75%] (Correlation or causality?)

• Classification and prediction [Predictive]

– Construct models (functions) that describe and distinguish classes or concepts for future prediction

• E.g., classify countries based on (climate), or classify cars based on (gas mileage)

– Predict some unknown or missing numerical values

Page 18: Introduction Pertemuan 01 Matakuliah: M0614 / Data Mining & OLAP Tahun : Feb - 2010.

Data Mining Functionalities (2)• Cluster analysis [Descriptive]

– Class label is unknown: Group data to form new classes, e.g., cluster houses to find distribution patterns

– Maximizing intra-class similarity & minimizing interclass similarity• Outlier analysis

– Outlier: Data object that does not comply with the general behavior of the data– Noise or exception? Useful in fraud detection, rare events analysis

• Trend and evolution analysis– Trend and deviation: e.g., regression analysis– Sequential pattern mining: e.g., digital camera large SD memory– Periodicity analysis– Similarity-based analysis

• Other pattern-directed or statistical analyses

Page 19: Introduction Pertemuan 01 Matakuliah: M0614 / Data Mining & OLAP Tahun : Feb - 2010.

Data Mining Algorithms:

• Classification• Statistical Learning• Clustering• Sequential PatternsOthers :• Link Mining• Bagging and Boosting• Integrated Mining• Rough Sets• Graph Mining

Page 20: Introduction Pertemuan 01 Matakuliah: M0614 / Data Mining & OLAP Tahun : Feb - 2010.

Multi-Dimensional View of Data Mining• Data to be mined

– Relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial, time-series, text, multi-media, heterogeneous, legacy, WWW

• Knowledge to be mined

– Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.

– Multiple/integrated functions and mining at multiple levels

• Techniques utilized

– Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc.

• Applications adapted

– Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.

Page 21: Introduction Pertemuan 01 Matakuliah: M0614 / Data Mining & OLAP Tahun : Feb - 2010.

OLAP Mining: An Integration of Data Mining and Data Warehousing

• Data mining systems, DBMS, Data warehouse systems coupling

– No coupling, loose-coupling, semi-tight-coupling, tight-coupling

• On-line analytical mining data

– integration of mining and OLAP technologies

• Interactive mining multi-level knowledge

– Necessity of mining knowledge and patterns at different levels of abstraction by drilling/rolling, pivoting, slicing/dicing, etc.

• Integration of multiple mining functions

– Characterized classification, first clustering and then association

Page 22: Introduction Pertemuan 01 Matakuliah: M0614 / Data Mining & OLAP Tahun : Feb - 2010.

An OLAM Architecture

Data Warehouse

Meta Data

MDDB

OLAMEngine

OLAPEngine

User GUI API

Data Cube API

Database API

Data cleaning

Data integration

Layer3

OLAP/OLAM

Layer2

MDDB

Layer1

Data Repository

Layer4

User Interface

Filtering&Integration Filtering

Databases

Mining query Mining result

Page 23: Introduction Pertemuan 01 Matakuliah: M0614 / Data Mining & OLAP Tahun : Feb - 2010.

Major Issues in Data Mining• Mining methodology

– Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web

– Performance: efficiency, effectiveness, and scalability

– Pattern evaluation: the interestingness problem

– Incorporation of background knowledge

– Handling noise and incomplete data

– Parallel, distributed and incremental mining methods– Integration of the discovered knowledge with existing one: knowledge fusion

• User interaction

– Data mining query languages and ad-hoc mining

– Expression and visualization of data mining results

– Interactive mining of knowledge at multiple levels of abstraction• Applications and social impacts

– Domain-specific data mining & invisible data mining– Protection of data security, integrity, and privacy

Page 24: Introduction Pertemuan 01 Matakuliah: M0614 / Data Mining & OLAP Tahun : Feb - 2010.

Challenges of Data Mining

• Scalability• Dimensionality• Complex and Heterogeneous Data• Data Quality• Data Ownership and Distribution• Privacy Preservation• Streaming Data

Page 25: Introduction Pertemuan 01 Matakuliah: M0614 / Data Mining & OLAP Tahun : Feb - 2010.

Summary• Data mining: Discovering interesting patterns from large amounts of data

• A natural evolution of database technology, in great demand, with wide applications

• A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation

• Mining can be performed in a variety of information repositories

• Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.

• Data mining systems and architectures

• Major issues in data mining

Page 26: Introduction Pertemuan 01 Matakuliah: M0614 / Data Mining & OLAP Tahun : Feb - 2010.

Bina Nusantara

Dilanjutkan ke pert. 02Data Warehousing, Data Generalization, and

Online Analytical Processing