Naive Bayes is a statistical classificator. It is one of the oldest formal classification algorithm, but still – thanks to simplicity and efficiency – it is often used, for example in anti-spam mechanisms. This method is called: supervised classification – we are given a set of objects with classes assigned and using it we want to generate rules that help us assign future objects to classes.
MAP (maximal a posteriori classification) is very popular estimation method in bayesian statistic. It is said MAP is optimal – minimal error is achieved. The problem is when it comes to computation complexity, which is c^n (c – classes n – describing variables). However, naive bayes it is said the variables (components) are independent (conditionally independent). The point is – if it is true, NB gives also optimal results.
It may seem that independence presumption is too strict to adapt it in real world. Neverthless, activity before classification makes the difference, e.g. selection and elimination of corellated variables occurs always as the part of the methodology of data mining.
Sources: same as previous posts
A priori is algorithm used in affinity analysis. A set of rules is generated, usually implications, that describe dataset. Finding frequent datasets from the transaction database is a popular task among data mining appliances, which isn’t as simple as it initially seems. The reason is computational complexity, which goes extremaly extremely high when it comes to very large databases (I like the fancy way they described it in Top 10 Algorithm in DM: combinatorial explosion).
The idea of A priori is: find frequent itemsets (frequent means one with previously assigned level of support) and then generate rules that comply previously assigned level of confidence. There are candidate itemsets generated and they are the base to find n-element frequent itemsets (1st step in procedure is to find one-element frequent itemsets, and then repeat, eliminating itemsets which support is not sufficient). the procedure of generating candidate and frequent sets is repeated simply for the possible number. The main point exploits monotonicity: ?if an itemset is not frequent, any of its superset is never frequent? (again Top 10 Algo in DM). Smart way to eliminate itemsets.
A priori is one of the most important algorithms in data mining. The other ideas to make it even more efficient are e.g. the new ways to create candidate itemsets – hashing techniques (smaller candidate itemesets), partitioning (divide the problem into smaller ones and explore them separately – if only real-life problems work this way!) or sampling. Important improvement of A priori algorithm is FP-growth algorithm, which supports compression (without losing important information) and then partitioning.
Despite of the fact A priori is rather simple, easy implementation and proper results make it serious solution in many problems.
K Nearest Neighbours is a basic classification algorithm. The idea comes probably from the extension of Rote classifier, which is as simple as point system in ‘Whose line is it anyway’. System memorizes whole training set and classifies only items that have exactly same values as in training set. Obvious disadvantage is there will be a lot of unclassified objects. The “next generation” of the concept says the classification occurs using the value of the nearest point in dataset. Comparing to previous way it is a huge difference, but still – system is vulnerable to noise and outliers.
KNN is (comparing to previous strategies) a bit more sophisticated. Algorithm finds a group of k-objects in training set under the condition of “distance” and according to the findings classifies the new object to the previously given class (cluster), respecting weights set to neighbours. Important issues are:
- number of neighbours (it is important because it is in the name of the algo anyway)
- the meaning of distance
- training set is the basic
Parameters are very imporant to the results and I am going to write another post to discuss a little bit more about.
The procedure goes:
- Get the training set remembered (and prepared to update dynamically if data comes continously)
- Measure the distance between new object and object to training set, to find the nearests
- Use collected information to classify new object
In spite of the fact, building the model using kNN is not very difficult task, costs of classification are relatively high. Comparing new object with whole training set (lazy learning) is responsible for that and it is especially visible in large datasets. There are some techniques that reduce the amout of computation – from simply editing training set (sometimes results are even better than classification with larger database) to proximity graphs.
Sources: [Top 10 algorithms in data mining, Springer 2008]
K-means algorithm (centroid algorithm; LGB) is a simple algorithm to partition given data into clusters. The main purpose is to keep the similarity of the data and simultanously to keep the minimal error. K-means is greedy algo type. The main idea is: choose randomly c clusters and then set all the objects that are close to the randomly chosen cluster to it’s class. Then update the center of the set of objects – relocate the central point, and again check if all objects are clustered to the nearest centroid. If not – update. Repeat acordingly.
The k-means complexity is O(cni). The main disadvantages is outliers and noise can influence the result badly. The solution to the first problem is not to use means but medoids, which are not very sensitive to outliers (we just take the center of the set of data, not mean). There is also a danger to stuck in the local optimum – using k-means algo we always have to start several times using different starting sets of starting points.
However, k-means algorithm is the most popular partition algorithm. It is simple, easy to implement and scalable. Moreover, it is possible to use it with dynamic data. Improvement of k-means algorithm is generally connected with making it suitable for very large datasets – the best tries are kd-trees or triangular inequality.