Archive for the ‘web mining’ Category

Pagerank

Posted on the May 22nd, 2010 under data mining,general,web mining by

PageRank, Larry Page’s algorithm, is probably the most popular and best-known use of web linkage mining. This context-free approach is simply a popularity contest in which the weight of each ‘vote’ depends on the importance of the site casting it: the better the site that links to my page, the bigger the gain in my rating. Under the hood, the importance of a site is measured by the probability of visiting it. How exactly the numbers are computed is Google’s secret, obviously (I bet naive Bayes is used somewhere in there ;).

What about reality? PageRank is vulnerable to spamming, and a lot of people cheat PR for a living. In short, a farm of sites (servicer) is created, and its coordinated work pulls the target site up in the ranking. There is also a language problem: how to deal with ambiguous keywords. Then there is a technical problem with pages that have no outgoing links (PR value thieves, as the PR popularity flows there and stays forever); it is solved, more or less well, by the taxation mechanism. The random jumping also helps with dead-end sites. Prediction mechanisms are worth mentioning too, as is using local resources to save time and computing power, e.g. processing data for a whole domain or server at once.
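
To make the taxation and random-jump ideas more concrete, here is a minimal sketch of the textbook power-iteration form of PageRank. It is not Google’s actual implementation; the function name `pagerank`, the damping value 0.85 and the toy link graph are my assumptions for illustration only.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook power-iteration PageRank.

    links: dict mapping each page to the list of pages it links to
           (a hypothetical toy input format, not a real crawler output).
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # "Taxation": a (1 - damping) share of all rank is redistributed
        # evenly, which models the surfer's random jump.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dead end: instead of letting the rank get stuck here,
                # spread it evenly over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
```

In this toy graph page "c" collects votes from three pages and ends up with the highest rank, while "d", which nobody links to, keeps only its taxation share.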

There are some modifications of the PageRank algorithm. An interesting one is topic-sensitive PageRank by T. Haveliwala. It adds contexts (topic-specific groups, like those in DMOZ), and the idea is to keep results close to a previously specified topic. The big advantage of this approach is that the search process can easily be personalized (a user-specific popularity ranking instead of the general one).

Weka

Posted on the May 20th, 2010 under data mining,definitions,hunting content creators,projects,web mining by

Weka is a collection of algorithms commonly used in data mining. It has both a graphical and a command-line interface; the latter is probably useful for more complicated projects, but for mine the simple Explorer was enough. Moreover, one can plug in one’s own Java code. Weka contains tools for data preparation (normalization, discretization and a bunch of others), classification, clustering, regression and association rules, not to mention well-developed visualization.

[Figure: Weka - explorer window]

I very much enjoyed working with Weka. After some struggling with the input data format (I used CSV) and a little practice, a wide choice of possibilities appeared. I used Weka in a Unix environment, Ubuntu 8.1.

The basic data format for Weka is ARFF (Attribute-Relation File Format), an ASCII file format that describes instances sharing a set of attributes. You can also choose another file format (.names, .data (C4.5), .csv, .libsvm, .dat, .bsi, .xrff), which is what happens most of the time, at least at the beginning of a project, when you have a lot of data from external sources such as MySQL databases or Excel.
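
To show what an ARFF file looks like, here is a small sketch that turns CSV text into ARFF text. The helper names `csv_to_arff` and `_is_number` and the sample columns are made up for illustration; the sketch assumes a header row, no missing values, and no spaces or commas inside names or values (real data would need quoting). Weka itself can load CSV directly, so treat this as a picture of the format rather than a required step.

```python
import csv
import io


def _is_number(text):
    """Return True if the text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def csv_to_arff(csv_text, relation="data"):
    """Convert CSV text (first row = column names) into ARFF text."""
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    header, data = rows[0], rows[1:]
    lines = ["@relation " + relation, ""]
    for col, name in enumerate(header):
        values = [row[col] for row in data]
        if all(_is_number(v) for v in values):
            lines.append("@attribute %s numeric" % name)
        else:
            # Non-numeric columns become nominal attributes that list
            # every distinct value seen in the column.
            lines.append("@attribute %s {%s}" % (name, ",".join(sorted(set(values)))))
    lines += ["", "@data"]
    lines += [",".join(row) for row in data]
    return "\n".join(lines) + "\n"


# Hypothetical sample: number of posts and an "active" flag per user.
print(csv_to_arff("posts,active\n120,yes\n3,no\n", relation="users"))
```

The output starts with the `@relation` line, then one `@attribute` line per column, and finally the `@data` section with one instance per line.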

There are some functions worth mentioning, like various kinds of filters (supervised or unsupervised), adding jitter or other kinds of random “pollution”, randomization, sampling and standardization. It is possible to use Perl commands or to visualize datasets in many ways. At any moment you can check the log to find out what is happening inside, or check memory usage; logging also takes place in the Ubuntu console from which the program was started.

I have to mention the “dancing” Kiwi shown while an algorithm is running. It is a strange feeling when you have to watch it for a couple of hours.

[Figure: Dancing Kiwi]

hunting content creators (1) – introduction

Posted on the May 14th, 2010 under hunting content creators,projects,social networks,web mining by

It is a fairly obvious statement that content creators are the engine of every social-networking site. Every owner of such a site knows that he provides only the machine, leaving the “stream of life” in the hands (and keyboards) of the most active users. Nothing says more than numbers: my research shows that only 0.5% of all users of my S-N site are responsible for 38% of the content created!

From the business point of view it is critical to have such users. When everybody wants to eat but there is nobody to plant crops, the result is starvation for most of the society. It is also said that valuable content has a magnetism of its own, attracting both users and search engines.

Hunting content creators should be high on the TO DO list after starting a UCC website. Connecting the dots between content creation in S-N sites and my interest in data mining resulted in an idea: use data mining to discover users who might be better-than-average content suppliers.

How to do it, having an 8-year-old Internet board database, full of profile information, with over 3200 users and over 115k posts? How will it affect the life of the community? What is the reliability of the research? And finally, what is the point (where is the money)?

As usual, there are a lot of questions, and the answers are given as probabilities. The next part of the picture will be revealed in the following post.

CRISP-DM

Posted on the April 2nd, 2010 under business,projects,web mining by

CRISP-DM stands for CRoss Industry Standard Process for Data Mining. It is a methodology for running data mining projects, since data exploration, like other business processes, demands a general guide to follow.

The basic methodology is split into four parts:

  1. problem identification
  2. data preprocessing (turning data into information, whatever that means)
  3. data exploration
  4. evaluation (result examination)

Data mining is, in general, a mechanism that lets us make better decisions in the future by analysing past data (in a very fancy way). There are two situations in the data mining process where we have to be careful: when we discover a pattern that is false, and when the pattern is true but useless. The first is a direct danger, because business decisions made on a false basis simply cost money (sometimes an awful lot of money). The second has an additional, hidden trap: it becomes clear that the rule is useless only after implementing it, when the system simply doesn’t pass the reality check. Sticking to the methodology gives us a mechanism to minimize the probability of making such mistakes.

According to crisp-dm.org, it is an open methodology meant to keep the industrial data mining process close to a general strategy for solving business and research problems. The process is divided into 6 steps:

  1. business problem and condition understanding
  2. data understanding
  3. data preparation
  4. modelling
  5. evaluation
  6. implementation

It is very important to notice that each step is strictly connected with the results of the previous one, and that it is necessary to jump several times between steps (not only in the order presented above!). It is also natural that the result of one step sends us back to the starting point of the project to reevaluate some opinions or preliminary designs.

[M. Berry, G. Linoff, “Data Mining Techniques”, Wiley 2004.]

[Daniel Larose, “Odkrywanie wiedzy z danych”, PWN 2006, 5]

SVM – support vector machines

Posted on the March 31st, 2010 under data mining,web mining by

SVM stands for support vector machines. The idea of this classification algorithm is to generate a border between objects that belong to different decision classes. A big advantage of this approach is that it needs only a simple training set; moreover, it can easily be used to solve multi-dimensional problems. The border between the objects is found by an iterative algorithm.

Types of SVM:

  • C-SVM
  • nu-SVM
  • epsilon-SVM regression
  • nu-SVM regression
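
To illustrate the idea of finding the border iteratively, here is a minimal sketch of a linear, soft-margin SVM trained by plain subgradient descent on the hinge loss. This is only one of several possible iterative solvers and not necessarily the one described in the source linked below; the names `train_linear_svm` and `svm_predict`, the parameter values and the toy points are assumptions for illustration.

```python
def train_linear_svm(points, labels, lam=0.01, lr=0.1, epochs=200):
    """Fit a border w.x + b = 0 by batch subgradient descent.

    labels must be +1 or -1; lam is the regularisation strength and
    lr a fixed learning rate (both are illustrative values).
    """
    n = len(points)
    dim = len(points[0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        # Subgradient of lam/2 * |w|^2 + average hinge loss.
        grad_w = [lam * wj for wj in w]
        grad_b = 0.0
        for x, y in zip(points, labels):
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            if margin < 1.0:
                # Only points inside the margin (or misclassified)
                # push the border around.
                for j in range(dim):
                    grad_w[j] -= y * x[j] / n
                grad_b -= y / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def svm_predict(w, b, x):
    """Return +1 or -1 depending on the side of the border."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


points = [(2.0, 2.0), (3.0, 3.0), (3.0, 2.0), (-2.0, -2.0), (-3.0, -3.0), (-2.0, -3.0)]
labels = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(points, labels)
```

After training, `svm_predict(w, b, point)` tells on which side of the border a new point falls. Kernel SVMs and the nu and epsilon variants listed above need a more elaborate solver than this sketch.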

[http://www.spc.com.pl/textbook/stathome_stat.html?http%3A%2F%2Fwww.spc.com.pl%2Ftextbook%2Fstmachlearn.html]

kNN – k nearest neighbours algorithm

Posted on the February 9th, 2010 under data mining,definitions,web mining by

K Nearest Neighbours is a basic classification algorithm. The idea probably comes from extending the Rote classifier, which is as simple as the point system in ‘Whose Line Is It Anyway’. The Rote classifier memorizes the whole training set and classifies only items that have exactly the same values as an item in the training set. The obvious disadvantage is that a lot of objects stay unclassified. The “next generation” of the concept classifies an object using the value of the nearest point in the dataset. Compared to the previous approach this is a huge difference, but the system is still vulnerable to noise and outliers.

kNN is (compared to the previous strategies) a bit more sophisticated. The algorithm finds a group of k objects in the training set that are closest to the new object under a chosen “distance”, and on that basis assigns the new object to one of the previously given classes, respecting the weights set for the neighbours. The important issues are:

  • the number of neighbours (it is important; it is in the name of the algo, after all)
  • the meaning of distance
  • the training set, which is the foundation

These parameters are very important to the results, and I am going to write another post to discuss them a little more.

The procedure goes:

  1. Memorize the training set (and be prepared to update it dynamically if data comes in continuously)
  2. Measure the distance between the new object and each object in the training set to find the nearest ones
  3. Use the collected information to classify the new object
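
The three steps can be sketched in a few lines. This is a minimal version under my own assumptions: Euclidean distance, an unweighted majority vote (the neighbour weights mentioned above are left out), and hypothetical names `knn_classify` and `euclidean`.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Straight-line distance between two feature tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(training_set, new_point, k=3):
    """Classify new_point by majority vote among its k nearest neighbours.

    training_set: list of (features, label) pairs. Step 1, "memorizing",
    is simply keeping this list around.
    """
    # Step 2: measure the distance to every training object, keep the k nearest.
    nearest = sorted(training_set, key=lambda item: euclidean(item[0], new_point))[:k]
    # Step 3: let the neighbours vote on the class.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated groups.
train = [((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
         ((5.0, 5.0), "b"), ((5.2, 4.9), "b"), ((4.8, 5.1), "b")]
label = knn_classify(train, (1.1, 1.0), k=3)
```

Note that every call sorts the whole training set, which is exactly the classification cost that becomes painful with large datasets.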

Although building a model with kNN is not a very difficult task, the cost of classification is relatively high. Comparing the new object with the whole training set (lazy learning) is responsible for that, and it is especially visible in large datasets. There are some techniques that reduce the amount of computation, from simply editing the training set (sometimes the results are even better than classification with the larger database) to proximity graphs.

Sources: [Top 10 algorithms in data mining, Springer 2008]

K-means

Posted on the February 5th, 2010 under data mining,definitions,web content mining,web mining by

The k-means algorithm (centroid algorithm; LBG) is a simple algorithm for partitioning given data into clusters. The main purpose is to keep similar data together while keeping the error minimal. K-means is a greedy type of algo. The main idea is: randomly choose c cluster centres, then assign every object to the class of the centre closest to it. Then update the centre of each set of objects (relocate the central point) and check again whether all objects are assigned to the nearest centroid. If not, update. Repeat until the assignment stops changing.
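
The loop described above can be sketched like this. It is a minimal version under my own assumptions: squared Euclidean distance, starting centres drawn at random from the data, and hypothetical names `kmeans`, `nearest_centroid` and `squared_distance`.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_centroid(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda i: squared_distance(point, centroids[i]))


def kmeans(points, c, iterations=100, seed=0):
    """Partition points (tuples of floats) into c clusters."""
    rng = random.Random(seed)
    centroids = rng.sample(points, c)  # random starting centres
    assignment = None
    for _ in range(iterations):
        new_assignment = [nearest_centroid(p, centroids) for p in points]
        if new_assignment == assignment:
            break  # nothing moved, so we are done
        assignment = new_assignment
        for idx in range(c):
            members = [p for p, a in zip(points, assignment) if a == idx]
            if members:  # an empty cluster keeps its old centre
                centroids[idx] = tuple(sum(coord) / len(members) for coord in zip(*members))
    return centroids, assignment


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, assignment = kmeans(points, 2)
```

On this toy data the two groups are so far apart that any starting pair of centres ends in the same partition; with real data the result depends on the seed, so it is worth trying several.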

The complexity of k-means is O(cni), where c is the number of clusters, n the number of objects and i the number of iterations. The main disadvantage is that outliers and noise can influence the result badly. A solution to this problem is to use medoids instead of means; they are not very sensitive to outliers (we take the most centrally located object of the set, not the mean). There is also a danger of getting stuck in a local optimum, so with k-means we always have to run the algorithm several times with different sets of starting points.

Nevertheless, k-means is the most popular partitioning algorithm. It is simple, easy to implement and scalable. Moreover, it can be used with dynamic data. Improvements to k-means are generally about making it suitable for very large datasets; the best attempts use kd-trees or the triangle inequality.

Types inside web mining

Posted on the January 23rd, 2010 under definitions,general,web mining by

Looking at data mining and web mining side by side, there is one main difference. While data mining generally operates on the content, web mining also uses the structure of the data:

  • WEB CONTENT MINING operates on the content, which most of the time is text (maybe that’s why WM is also called text mining)
  • WEB LINKAGE MINING gets information from the structure of sites
  • WEB USAGE MINING gets information from logs, e.g. by tracking a user’s movement from one page to another

This opens up a number of possibilities that are not available in ordinary data mining.
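
As a tiny taste of web usage mining, the sketch below counts how often users move from one page to another. The log format (user, timestamp, page), the function name `count_transitions` and the sample entries are all invented for illustration; real server logs need parsing and session splitting first.

```python
from collections import Counter, defaultdict


def count_transitions(log):
    """Count page-to-page moves in a log of (user_id, timestamp, page) entries."""
    visits = defaultdict(list)
    for user, timestamp, page in log:
        visits[user].append((timestamp, page))
    transitions = Counter()
    for user_visits in visits.values():
        user_visits.sort()  # order each user's visits by time
        pages = [page for _, page in user_visits]
        # Every pair of consecutive pages is one movement.
        for source, target in zip(pages, pages[1:]):
            transitions[(source, target)] += 1
    return transitions


# Hypothetical log entries.
log = [("u1", 1, "/home"), ("u2", 2, "/home"), ("u1", 3, "/forum"),
       ("u2", 4, "/forum"), ("u1", 5, "/profile")]
moves = count_transitions(log)
```

Here `moves[("/home", "/forum")]` is 2, because both users went from the home page to the forum.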

Web Mining – introduction

Posted on the November 2nd, 2009 under general,web mining by

Web mining is, generally speaking, a branch of data mining. Before introducing web mining, I want to take one step back and present some thoughts about data mining.

Data mining, or data exploration, is a set of techniques used to automatically discover non-trivial relations, patterns and schemes in large data collections. In other words, we are looking for deeply hidden knowledge in very large datasets (in the case of web mining, the Internet), and we only accept automatic solutions. Why? For better understanding. With such a mechanism we can ask much more difficult questions (compared to e.g. SQL).

At this point, we can say that web mining is data mining with the Internet as the dataset.

Let’s take a short look at the applications of web mining:

  • data classification (e.g. customers’ sentiment, reviews…)
  • natural language processing (NLP, but don’t confuse with neuro-linguistic programming)
  • www personalization
  • knowledge management

Sources:

wazniak.mimuw.edu.pl – data mining (.pps), wikipedia, Bing Liu – Web Mining 2005 Tutorial.