ADABoost (Adaptive Boosting) is a meta-algorithm used to improve classification results. The concept is to make a lot of weak classifiers cooperate to boost results. Adaptability means in this case that detection of the wrong classification makes the algorithm do more work on it (by changing the wages and setting algorithm to do more effort where it failed).
AdaBoost is sensitive to noisy data or outliers.
[http://www.cs.princeton.edu/~schapire/boost.html; Wu i inni "Top 10 algorithms in data mining" Springer 2008]
CART (regression and classification tress) – decision trees algorithm. Trees created by CART are binary – there are two branches coming out of the node. The algorithm goes as follow: look for every partition possible, and choose the best one (“goodness” criterium). To reduce the complexity there are some pruning (=cutting branches) techniques.
C4.5 is also decision trees algorithm. What differs is the possibility to create more-than-binary trees. It is also the ‘information gain” that decides about attributes selection. Attribute with the biggest information gain (or lowest entropy reduction) ensure classification with the lowest amount of information needed to classify correctly.
Entropy is a number of bits needed to send information about the result of the occurrence with probability p. In the possible spilt of the training set to the sub-sets, it is possible to calculate the requisition on the information (as an weighted sum of entropy for the sub-sets). Algorithm chooses optimal split, the one with the biggest information gain.
Disadvantages of the C4.5 algorithm are huge memory and processor capacity requirements, which are necessary to produce rules.
The C5.0 algorithm was presented in 1997, as a commercial version of C4.5. Important step ahead was made as tests provided, both with better classification results and supported types of data.
[Wu i inni "Top 10 algorithms in data mining" Springer 2008; Daniel Larose ?Odkrywanie wiedzy z danych? 2006 PWN, 118]
CRISP-DM stands for CRoss Industry Standard Process for Data Mining. It is a methodology used in processing data mining projects, as data exploration like the other business processing techniques demands a general guide to follow.
Basic methodology is split into four parts:
- problem identification
- data preprocessing (turn data into information, whatever it means)
- data exploration
- evaluation (result examination)
Data mining is in general a mechanism that let us make better decision in the future, by analysing (in very fancy way) past data. There are two moments in the data mining process which we have to be careful – when we discover a pattern, which can be false or when pattern is true, but useless. The 1st is a straight danger, because business decissions made on false basis simply cost money (sometimes awful lot of money). 2nd one has additional, hidden trap, because it becomes clear the rule is useless after implementing i – system doesn’t simply pass the reality check. Maintaining the methodology provides us with the mechanism to minimize probability of making such a mistake.
According to crisp-dm.org, the open methodology to keep data mining industrial process close to general business-and-research -problems solving strategy. System is divided into 6 steps:
- business problem and condition understanding
- data understanding
- data prepration
- modelling
- evaluation
- implementation
It is very important to notice, each step is strictly connected with results of previous one and it is necessary to jump serveral times between levels (not only in the order presented above!). It is also natural that result of one step causes returning to the start point of the project and reevaluating some opinions or foredesigns.
[M. Berry, G. Linoff ?Data Mining Techniques?, Wiley 2004.]
[Daniel Larose ?Odkrywanie wiedzy z danych? 2006 PWN, 5]