Archive for the ‘general’ Category

naive Bayes

Posted on the February 16th, 2010 under data mining,definitions,general by

Naive Bayes is a statistical classifier. It is one of the oldest formal classification algorithms, but thanks to its simplicity and efficiency it is still often used, for example in anti-spam mechanisms. The method belongs to supervised classification: we are given a set of objects with classes already assigned, and we use it to generate rules that help us assign future objects to classes.

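MAP (maximum a posteriori) classification is a very popular estimation method in Bayesian statistics: it picks the class with the highest posterior probability given the observed variables. MAP is said to be optimal, meaning it achieves the minimal classification error. The problem is computational complexity, which grows exponentially with the number of describing variables, on the order of c^n (c: number of classes, n: number of describing variables). Naive Bayes, however, assumes that the variables (components) are conditionally independent given the class, so each variable can be handled separately. The point is that if this assumption is true, NB also gives optimal results.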
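A minimal sketch of how the independence assumption turns the MAP decision into simple counting. It is not taken from the sources of this post: the function names `train` and `classify`, the toy spam data and the add-one smoothing are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train(samples, labels):
    """Count class frequencies and, per class, the values of each variable."""
    class_counts = Counter(labels)
    # value_counts[(class, variable index)][value] -> number of occurrences
    value_counts = defaultdict(Counter)
    values_seen = defaultdict(set)  # variable index -> distinct values observed
    for features, label in zip(samples, labels):
        for i, value in enumerate(features):
            value_counts[(label, i)][value] += 1
            values_seen[i].add(value)
    return class_counts, value_counts, values_seen


def classify(model, features):
    """Return the class with the highest log posterior score (MAP decision)."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)  # log prior P(class)
        for i, value in enumerate(features):
            # Independence assumption: one factor P(value | class) per variable.
            # Add-one smoothing (an assumption of this sketch) avoids log(0).
            numerator = value_counts[(label, i)][value] + 1
            denominator = count + len(values_seen[i])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training set: (first word, contains a link?) -> class
samples = [("offer", "link"), ("offer", "nolink"), ("hello", "nolink"),
           ("hello", "link"), ("offer", "link")]
labels = ["spam", "spam", "ham", "ham", "spam"]
model = train(samples, labels)
print(classify(model, ("offer", "link")))
```

Instead of estimating one probability for every combination of variable values, the sketch only needs one small table per variable and class, which is where the efficiency comes from.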

It may seem that the independence assumption is too strict to hold in the real world. Nevertheless, the work done before classification makes a difference: selecting variables and eliminating correlated ones is always part of the data mining methodology.
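One possible way to eliminate correlated variables before classification is sketched below. The sources do not prescribe a specific method: the Pearson coefficient, the 0.9 threshold, the function names and the toy columns are all assumptions of this example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric columns."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no correlation information
    return cov / math.sqrt(var_x * var_y)


def drop_correlated(columns, threshold=0.9):
    """Keep a column only if it is not strongly correlated with one already kept."""
    kept = {}
    for name, values in columns.items():
        if all(abs(pearson(values, other)) < threshold for other in kept.values()):
            kept[name] = values
    return list(kept)


# Hypothetical data: the same height in two units, plus an unrelated variable.
columns = {
    "height_cm": [150, 160, 170, 180],
    "height_in": [59.1, 63.0, 66.9, 70.9],
    "age": [30, 22, 41, 35],
}
print(drop_correlated(columns))
```

Here `height_in` is dropped because it duplicates `height_cm`, while `age` stays. Note that this greedy filter depends on the order of the columns and only sees linear relations.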

Sources: same as previous posts

a priori

Posted on the February 12th, 2010 under data mining,definitions,general,web content mining by

A priori (usually written Apriori) is an algorithm used in affinity analysis. It generates a set of rules, usually implications, that describe the dataset. Finding frequent itemsets in a transaction database is a popular task in data mining applications, and it isn’t as simple as it initially seems. The reason is computational complexity, which gets extremely high for very large databases (I like the fancy way they described it in Top 10 Algorithms in DM: combinatorial explosion).

The idea of A priori is: find frequent itemsets (frequent means having at least a previously chosen level of support) and then generate rules that meet a previously chosen level of confidence. Candidate itemsets are generated, and they are the base for finding n-element frequent itemsets: the first step is to find one-element frequent itemsets, and then the procedure repeats for larger sets, eliminating itemsets whose support is not sufficient. Generating candidate and frequent sets is repeated for growing itemset sizes until no new frequent itemsets appear. The main point exploits monotonicity: “if an itemset is not frequent, any of its superset is never frequent” (again Top 10 Algo in DM). A smart way to eliminate itemsets.
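The two phases described above (frequent itemsets first, then rules) could be sketched roughly as follows. This is an unoptimized illustration, not a reference implementation: the function names, the thresholds and the shopping-basket transactions are assumptions of this example.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def frequent_itemsets(transactions, min_support):
    """Level-wise search: k-item candidates are built from frequent (k-1)-itemsets."""
    items = {item for t in transactions for item in t}
    level = {frozenset([i]) for i in items}
    level = {s for s in level if support(s, transactions) >= min_support}
    frequent = {}
    k = 1
    while level:
        for s in level:
            frequent[s] = support(s, transactions)
        k += 1
        # Join step: unite pairs of frequent sets that give a k-item set.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step (monotonicity): every (k-1)-subset must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in level for sub in combinations(c, k - 1))
        }
        level = {c for c in candidates if support(c, transactions) >= min_support}
    return frequent


def rules(frequent, min_confidence):
    """Yield (antecedent, consequent, confidence) for rules above the threshold."""
    for itemset, sup in frequent.items():
        for size in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, size)):
                # confidence(A -> B) = support(A and B) / support(A)
                confidence = sup / frequent[antecedent]
                if confidence >= min_confidence:
                    yield antecedent, itemset - antecedent, confidence


# Hypothetical transaction database
transactions = [{"bread", "milk"}, {"bread", "butter"},
                {"bread", "milk", "butter"}, {"milk"}]
freq = frequent_itemsets(transactions, min_support=0.5)
found = list(rules(freq, min_confidence=0.8))
print(found)
```

In this toy database the pair {milk, butter} is not frequent, so the three-item candidate is pruned without ever counting its support. That is the monotonicity trick at work.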

A priori is one of the most important algorithms in data mining. Other ideas make it even more efficient, e.g. new ways to create candidate itemsets: hashing techniques (smaller candidate itemsets), partitioning (divide the problem into smaller ones and explore them separately; if only real-life problems worked this way!) or sampling. An important improvement on the A priori algorithm is the FP-growth algorithm, which first compresses the database (without losing important information) and then partitions it.

Although A priori is rather simple, its easy implementation and sound results make it a serious solution to many problems.

Types inside web mining

Posted on the January 23rd, 2010 under definitions,general,web mining by

Looking at both data mining and web mining, there’s one main difference. While data mining generally operates on content, web mining also uses the structure of the data:

  • WEB CONTENT MINING operates on the content, which most of the time is text (maybe that’s why WM is also called text mining)
  • WEB LINKAGE MINING gets information from the structure of sites, i.e. the links between pages
  • WEB USAGE MINING gets information from logs, e.g. by tracking a user’s movement from one page to another

This opens up a number of possibilities that are not available in usual data mining.
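To make web usage mining a bit more concrete, here is a small sketch that counts how often users moved directly from one page to another. It assumes the logs have already been split into per-user sessions; the function name and the session data are hypothetical.

```python
from collections import Counter


def transition_counts(sessions):
    """Count direct moves from one page to the next across all sessions."""
    counts = Counter()
    for pages in sessions:
        # Pair every page with the page visited right after it.
        counts.update(zip(pages, pages[1:]))
    return counts


# Hypothetical sessions: each list is one user's pages in visit order.
sessions = [
    ["/home", "/blog", "/about"],
    ["/home", "/blog", "/contact"],
    ["/blog", "/about"],
]
counts = transition_counts(sessions)
print(counts.most_common(2))
```

Counts like these are only a starting point. In real logs, splitting the requests into sessions is the harder part.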

Web Mining – introduction

Posted on the November 2nd, 2009 under general,web mining by

Web mining is generally a branch of data mining. Before introducing web mining, I want to take one step back and present some thoughts about data mining.

Data mining, or data exploration, is a set of techniques used to automatically discover non-trivial relations, patterns and schemes in large data collections. In other words, we are looking for deeply hidden knowledge in very large datasets (in the web mining case: the Internet), and we only accept automatic solutions. Why? For better understanding. Having such a mechanism, we can ask much more difficult questions (compared to e.g. SQL).

At this point, we can say that web mining is data mining with the Internet as the dataset.

Let’s take a short look at the applications of web mining:

  • data classification (e.g. customers’ sentiment, reviews…)
  • natural language processing (NLP, but don’t confuse it with neuro-linguistic programming)
  • www personalization
  • knowledge management

Sources:

wazniak.mimuw.edu.pl – data mining (.pps), wikipedia, Bing Liu – Web Mining 2005 Tutorial.

Greetings.

Posted on the September 3rd, 2009 under general by

Hello.

This blog was created to support me on my way to exploring aspects of web mining. At first, I will try to include basic information about web mining, slowly increasing the level of the articles. Meanwhile I’ll write my thesis, which is also based on web content mining.

Another purpose is to discipline myself into getting more experienced in the subject, as it is a rather new matter for me. I hope to release up to 2 articles (entries) per week.

Greetings, author.