This page belongs to EPFL's web archive and is no longer updated.

Surender Reddy Yerva

A brief overview of data mining algorithms

** Summary of Data Mining Algorithms **

1R (or One Rule)
This is the simplest data mining method: build rules that all test a single attribute, and keep the attribute whose rules make the fewest errors on the training data.
+ Points:
Often performs well, surprisingly close to more sophisticated algorithms
Handles missing values (treated as just another attribute value)
Extremely fast

- Points:
Prone to overfitting
A single attribute may not capture everything
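As a sketch, 1R fits in a few lines of Python; the weather-style dataset and all names below are made up for illustration:

```python
from collections import Counter, defaultdict

def one_r(rows, target):
    """1R: for each attribute, predict the majority class for each of its
    values; keep the attribute whose rule makes the fewest training errors."""
    best = None
    for attr in (a for a in rows[0] if a != target):
        by_value = defaultdict(Counter)
        for row in rows:
            by_value[row[attr]][row[target]] += 1
        rule = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        errors = sum(sum(c.values()) - c.most_common(1)[0][1]
                     for c in by_value.values())
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    return best

# toy weather-style data, invented for illustration
weather = [
    {"outlook": "sunny",    "windy": "false", "play": "no"},
    {"outlook": "sunny",    "windy": "true",  "play": "no"},
    {"outlook": "overcast", "windy": "false", "play": "yes"},
    {"outlook": "rainy",    "windy": "false", "play": "yes"},
    {"outlook": "rainy",    "windy": "true",  "play": "no"},
]
attr, rule, errors = one_r(weather, "play")
```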

Statistical Modeling: Naive Bayes Classifier
Makes use of all attributes, assuming they are equally important and mutually independent.
From the training data, compute the prior probabilities and the likelihoods; from these, Bayes' rule gives the posterior probabilities.
Nominal attributes: plain frequency-based probabilities (Binomial or Multinomial distributions).
Numeric attributes: typically modeled with a Normal distribution.

Usage: Bayes model for document classification.

+ Points:
Simple approach
Impressive results on many datasets

- Points:
The independence assumption among attributes rarely holds in practice
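A minimal Naive Bayes for nominal attributes might look as follows; the training data is invented, and the Laplace add-one smoothing is an added assumption to avoid zero probabilities:

```python
from collections import Counter, defaultdict

def naive_bayes(train, target, instance):
    """Posterior P(class | instance) for nominal attributes, with
    Laplace (add-one) smoothing -- an assumption added to avoid zeros."""
    class_counts = Counter(r[target] for r in train)
    value_counts = defaultdict(Counter)   # (attribute, class) -> value counts
    domains = defaultdict(set)            # attribute -> observed values
    for r in train:
        for a, v in r.items():
            if a != target:
                value_counts[(a, r[target])][v] += 1
                domains[a].add(v)
    scores = {}
    for c, nc in class_counts.items():
        p = nc / len(train)               # prior
        for a, v in instance.items():     # likelihood per attribute
            p *= (value_counts[(a, c)][v] + 1) / (nc + len(domains[a]))
        scores[c] = p
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}   # normalized posteriors

train = [
    {"outlook": "sunny",    "windy": "false", "play": "no"},
    {"outlook": "sunny",    "windy": "true",  "play": "no"},
    {"outlook": "overcast", "windy": "false", "play": "yes"},
    {"outlook": "rainy",    "windy": "false", "play": "yes"},
    {"outlook": "rainy",    "windy": "true",  "play": "no"},
]
post = naive_bayes(train, "play", {"outlook": "sunny", "windy": "true"})
```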

Divide & Conquer: Decision Trees (Top-Down Induction of Decision Trees)
Choose one attribute as the root of the tree and partition the data into subsets according to its values; then repeat the procedure recursively on each subset. The challenge is choosing which attribute to split on at each step. Use INFORMATION GAIN, an entropy-based measure, to rank the candidate attributes.

Raw information gain overfits: an identity attribute (unique per instance) yields maximum information gain yet captures no useful trend. Use the GAIN RATIO measure instead of plain INFO GAIN to penalize such attributes.

Popular methods: ID3, C4.5.
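The entropy-based measures above can be sketched as follows; the toy rows, including the pathological "id" attribute, are invented for illustration:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def partition(rows, attr):
    groups = {}
    for r in rows:
        groups.setdefault(r[attr], []).append(r)
    return groups

def info_gain(rows, attr, target):
    """Class entropy minus the weighted entropy after splitting on attr."""
    n = len(rows)
    base = entropy([r[target] for r in rows])
    remainder = sum(len(sub) / n * entropy([r[target] for r in sub])
                    for sub in partition(rows, attr).values())
    return base - remainder

def gain_ratio(rows, attr, target):
    """Information gain divided by the entropy of the split itself,
    penalizing attributes with many distinct values."""
    split_info = entropy([r[attr] for r in rows])
    return info_gain(rows, attr, target) / split_info if split_info else 0.0

# "id" is unique per row, so it maximizes raw information gain
rows = [
    {"id": 1, "outlook": "sunny",    "play": "no"},
    {"id": 2, "outlook": "sunny",    "play": "no"},
    {"id": 3, "outlook": "overcast", "play": "yes"},
    {"id": 4, "outlook": "rainy",    "play": "yes"},
    {"id": 5, "outlook": "rainy",    "play": "no"},
]
```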

Covering Algorithms: Constructing rules
Take one class at a time and construct a rule that covers its instances, adding attribute tests to the rule until it reaches the desired accuracy.
Somewhat similar to decision trees, but each rule is grown independently.
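One way to picture rule growing is a PRISM-style greedy loop (an assumed variant; data and names invented): repeatedly add the attribute test that maximizes the rule's accuracy on the instances still covered.

```python
def grow_rule(rows, target, cls):
    """Greedily add attribute=value tests until the covered instances are
    all of class `cls` (or every attribute has been used)."""
    conditions, covered = {}, rows
    attrs = [a for a in rows[0] if a != target]
    while any(r[target] != cls for r in covered) and len(conditions) < len(attrs):
        best = None
        for a in attrs:
            if a in conditions:
                continue
            for v in {r[a] for r in covered}:
                sub = [r for r in covered if r[a] == v]
                acc = sum(r[target] == cls for r in sub) / len(sub)
                if best is None or acc > best[0]:
                    best = (acc, a, v)
        _, a, v = best
        conditions[a] = v
        covered = [r for r in covered if r[a] == v]
    return conditions, covered

rows = [
    {"outlook": "sunny",    "windy": "false", "play": "no"},
    {"outlook": "sunny",    "windy": "true",  "play": "no"},
    {"outlook": "overcast", "windy": "false", "play": "yes"},
    {"outlook": "rainy",    "windy": "false", "play": "yes"},
    {"outlook": "rainy",    "windy": "true",  "play": "no"},
]
conds, covered = grow_rule(rows, "play", "yes")
```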

Association Rules:
Rules that can involve (and predict) any combination of attributes, not just the class.
Both coverage (how many instances the rule applies to) and accuracy (how often it is correct) are taken into account.
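Coverage and accuracy are usually called support and confidence; a small sketch with an invented basket dataset:

```python
def support(transactions, itemset):
    """Coverage: fraction of transactions containing every item."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, lhs, rhs):
    """Accuracy: of the transactions matching the left-hand side,
    the fraction that also match the right-hand side."""
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)

baskets = [{"bread", "milk"},
           {"bread", "butter"},
           {"bread", "milk", "butter"},
           {"milk"}]
sup = support(baskets, {"bread", "milk"})        # coverage of bread -> milk
conf = confidence(baskets, {"bread"}, {"milk"})  # accuracy of bread -> milk
```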

Linear Models:
Numeric prediction: Linear regression
The class is modeled as a linear combination of the attributes; the weights are found by minimizing an error function, typically the mean squared error.

For classification, fit one linear regression model per class; with multiple classes this is called multiresponse linear regression.

Two drawbacks of multiresponse linear regression: (1) the values it produces are not probabilities; (2) least squares assumes the same (normal) error distribution for every class, which does not hold for 0/1 membership targets.
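For a single attribute the least-squares weights have a closed form; a minimal sketch (function name and data invented):

```python
def fit_line(xs, ys):
    """Closed-form least squares for one attribute: y = w0 + w1*x,
    minimizing the mean squared error."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    w1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    w0 = my - w1 * mx
    return w0, w1

w0, w1 = fit_line([0, 1, 2, 3], [1, 3, 5, 7])  # data lies exactly on y = 1 + 2x
```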

Logistic Regression: maps the linear combination through a sigmoid (logistic) function to obtain a class probability; the model is fit by maximizing the log-likelihood.

The Perceptron, Winnow, and Balanced Winnow algorithms learn a hyperplane separating the classes.
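A minimal perceptron sketch; the toy data and the fixed epoch count are assumptions for illustration:

```python
def perceptron(data, epochs=20):
    """On each misclassified instance, add (label * input) to the weight
    vector; the bias is folded in as a constant first component."""
    w = [0.0] * (len(data[0][0]) + 1)
    for _ in range(epochs):
        for x, y in data:                  # y is +1 or -1
            xb = [1.0] + list(x)
            if y * sum(wi * xi for wi, xi in zip(w, xb)) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, xb)]
    return w

# linearly separable toy data (AND-like), invented for illustration
data = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
w = perceptron(data)
```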

Instance-Based Learning

Given a new instance, find its nearest neighbor under some distance measure and assign that neighbor's class to the new instance.
kD-trees are efficient data structures for computing nearest neighbors, as are ball trees and metric trees.
The nearest-neighbor rule extends naturally to k-nearest neighbors, which takes a majority vote among the k closest instances.
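A brute-force k-nearest-neighbor classifier illustrates the idea (a kD-tree or ball tree would replace the linear scan; the points are invented):

```python
import math
from collections import Counter

def knn_predict(train, x, k=3):
    """Majority vote among the k nearest neighbors (Euclidean distance).
    Brute force: sorts all points instead of using a kD-tree."""
    nearest = sorted(train, key=lambda p: math.dist(p[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# invented 2-D points with two well-separated classes
train = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
         ((5, 5), "b"), ((6, 5), "b"), ((5, 6), "b")]
label = knn_predict(train, (0.5, 0.5))
```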

K-Means: a classic clustering algorithm; k is the number of clusters we are looking for.
K-Means can be augmented with kD-trees for efficient processing.
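A sketch of plain K-Means (Lloyd's algorithm) without the kD-tree speedup; the points and random seed are invented:

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Lloyd's algorithm: assign each point to its nearest centroid, then
    recompute each centroid as the mean of its cluster; repeat."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda j: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[j])))
            clusters[j].append(p)
        # empty clusters keep their previous centroid
        centroids = [tuple(sum(xs) / len(c) for xs in zip(*c)) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return centroids, clusters

# two well-separated invented groups
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, 2)
```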


Posted by Surender Reddy Yerva at 15:28