Web content mining is the part of the data mining domain that is closest to the classic definition of DM. Its aspects correspond to similar problems in classic data mining:
- automatic content extraction from web pages
- integration of the information
- opinion and review extraction
- knowledge synthesis
- noise detection and segmentation
Briefly put, the web content mining tasks listed above are solutions to more or less complicated problems connected with the automation of web usage, and they improve several aspects of daily Internet life, in both technical and non-technical matters.
The Internet is probably the world's biggest database. Moreover, its data is available through easily accessible techniques. Often it is important, detailed data that lets people achieve goals or use it in various realms. Data is held in various forms: text, multimedia, databases. Web pages follow the HTML standard (or another member of the markup-language family), which gives them a kind of structure, but not one sufficient for easy use in data mining. A typical website contains, in addition to the main content and links, various extras such as ads or navigation items. It is also widely known that most of the data on the Internet is redundant: a lot of information appears on different sites, in more or less similar form.
The deep web (hidden web, invisible web, invisible Internet) refers to the lower level of the global network. It does not appear in search engine results, and search tools do not index or list this area. It is said that a great part of the global web belongs to the deep web and stays hidden until a specific query, targeted at the right interface, triggers the content to appear. This also reveals some of the barriers that keep the data hidden: a specific interface, the need for specific knowledge about the data, high security (passwords), or simply a lack of linkage. It is also possible to block ranges of IP addresses, block automated access to interfaces (e.g. using CAPTCHA), or just keep data in a non-standard format. The reasons mentioned above are a natural barrier for crawlers and web robots, keeping part of the web outside the linked web.
Looking for a definition of Internet exploration, the easiest way is to describe it as the part of data mining in which web resources are explored. It is commonly divided into three areas:
- web content mining is the closest to "classic" data mining, as WCM mostly operates on text, and text is the most common way to put information on the Internet,
- web linkage mining aims to use the nature of the Internet, its connection structure, as the web is a set of documents connected by links,
- web usage mining looks for useful patterns in logs and documents containing the history of users' activity.
These three areas are also what distinguishes web mining from data mining, because the subject of the research is not only the data, but the structure and flow as well. Additionally, web mining takes data "as it is" (and the imagination of Internet content creators is wide when it comes to creating new forms), while data mining operates on rather structured data.
Finally, the general application of web mining goes beyond tweaking websites or data analysis. It can be used as a tool for improving tasks, projects and processes in companies and institutions, or as a method providing aid while solving technical or analytical problems. Web mining is currently used in ranking web pages, electronic trade, Internet advertising, reliability evaluation, recommendation systems, personalization of web services and more.
As I wrote in the previous part (content creators, part 1), discovering ubercreators and exploiting this knowledge should be an important part of the development of every social-networking site.
My project (idea) is to set up a system that finds content creators on a functioning Internet board, using data mining algorithms. Some details:
- a database (MySQL) with over 3k users and the parameters describing them (about 70),
- a selection of the parameters describing users must be made (manual; technically it comes down to selecting tables in the database, and the process could be automated if necessary),
- Weka is used as the set of classifiers and clustering algorithms (it is necessary to prepare the data for both the program and the algorithm).
Content creation on a discussion board is not a really complex issue. Although it is difficult to evaluate the value of the messages, in most cases it is not even necessary. It is enough to eliminate obvious cases of spamming and just let the snowball roll down the hill.
At a certain moment, discovering users with hidden potential to create valuable content can give an evolving society a serious boost. Given a set of users with parameters, with an emphasis on those parameters describing activity and "creative spirit", the algorithm does the rest of the job, clustering users into groups with a high level of similarity. The point is to use the results to give positive feedback to possible creators and exploit their potential.
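To illustrate the clustering step, here is a minimal pure-Python sketch. The two activity parameters (posts per month, average post length) and the user values are invented for illustration; the real project selects from about 70 database parameters and runs Weka's algorithms instead.

```python
import random

# Toy k-means over made-up user activity data: the goal is to separate
# likely content creators from passive users by similarity.

def dist2(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def kmeans(points, k, iterations=20, seed=1):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # random initial centers
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # assign every point to its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: dist2(p, centers[c]))
            clusters[nearest].append(p)
        # move each center to the mean of its members
        for i, members in enumerate(clusters):
            if members:
                centers[i] = tuple(sum(col) / len(members)
                                   for col in zip(*members))
    return centers, clusters

# (posts per month, average post length) for a handful of made-up users
users = [(40, 300), (35, 280), (50, 320),   # likely creators
         (1, 20), (2, 15), (0, 5)]          # passive users
centers, clusters = kmeans(users, k=2)
print(centers)
```

With data this cleanly separated, the algorithm ends up with one "creator" group and one "passive" group regardless of the random initialization.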
The most reliable way to measure results is implementing the model in a real-life system. However, it is also necessary to try some modelling first, because walking in the dark without even a flashlight to predict whether it is going to succeed is unacceptable in any business. Success means in this case quick development of the network society, with visible growth in valuable content and SEO parameters.
The next chapter covers the chosen parameters, the algorithm and the modelling.
PageRank, Larry Page's algorithm, is probably the most popular and well-known use of web linkage mining. This context-free approach is simply a popularity contest, where the weight of a 'vote' is measured by the importance of the originating site itself: the better the site linking to my page, the bigger the gain in my rating. Looking inside, the importance of a site is measured by the probability of visiting it; how Google actually gets the numbers is, obviously, a secret (I bet naive Bayes is used somewhere there ;)).
What about reality? PageRank is vulnerable to spamming, and a lot of people cheat PR for a living. In short, a farm of sites is created, and its coordinated work pulls the target site up in the ranking. There is also the language problem of how to deal with ambiguous keywords. Then there is a technical problem, solved more or less fine by the taxation (damping) mechanism: pages with no further linkage act as PR thieves, as the PR popularity flows there and stays forever. Random jumping also helps with such dead-end sites. Prediction mechanisms are worth mentioning too, as well as using local resources to save time and computing power, e.g. processing data for a whole domain or server.
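The mechanics are easy to sketch. Below is a minimal power-iteration PageRank on a hypothetical four-page web; the damping factor d models the random jump mentioned above (0.85 is the commonly cited value), and rank stuck in dead-end pages is redistributed over all pages.

```python
# Minimal PageRank power iteration on a toy link graph.

def pagerank(links, d=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # every page starts each round with the "taxation" share (1 - d) / n
        new_rank = {p: (1.0 - d) / n for p in pages}
        for page, outgoing in links.items():
            if not outgoing:
                # dead-end page: spread its rank evenly over all pages
                for p in pages:
                    new_rank[p] += d * rank[page] / n
            else:
                share = d * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
        rank = new_rank
    return rank

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
print(ranks)  # C collects the most "votes" in this graph
```

The ranks always sum to 1, so they can be read as the probabilities of a random surfer visiting each page.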
There are some modifications of the PageRank algorithm. An interesting one is topic-sensitive PageRank by T. Haveliwala. Contexts were added (topic-specific groups, like the DMOZ categories), and the idea is to keep results close to a previously specified topic. The big advantage of this approach is that personalization of the search process can easily be applied (a user-specific popularity ranking instead of the general one).
Weka is a collection of algorithms commonly used in data mining. There are both graphical and command-line interfaces; the second option is probably useful for more complicated projects, but for me it was enough to use the simple Explorer. Moreover, one can call Weka from one's own Java code. Weka contains tools for data preparation (normalization, discretization and a bunch of others), classification, clustering, regression and association rules, not to mention well-developed visualization.
Weka - explorer window
I enjoyed working with Weka very much. After some struggling with the input data format (I used CSV) and a little exercise, a wide range of possibilities appeared. I used Weka in a Unix environment, Ubuntu 8.1.
The basic data format for Weka is ARFF (Attribute-Relation File Format), an ASCII file format. It describes instances sharing a set of attributes. You can also use other file formats (.names, .data (C4.5), .csv, .libsvm, .dat, .bsi, .xrff), which happens most of the time, at least at the beginning of a project, when you have a lot of data from external sources like MySQL databases or Excel.
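To show the layout, here is a small sketch that builds an ARFF string from a few hypothetical user records (the relation name, attribute names and values are made up; a real project would export them from the database).

```python
# Build a minimal ARFF document: a header of @attribute declarations
# followed by a @data section with comma-separated instances.

def to_arff(relation, attributes, rows):
    """attributes: list of (name, type) pairs, e.g. ("posts", "numeric")."""
    lines = [f"@relation {relation}", ""]
    for name, atype in attributes:
        lines.append(f"@attribute {name} {atype}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(str(v) for v in row))
    return "\n".join(lines)

arff = to_arff(
    "users",
    [("posts", "numeric"), ("logins", "numeric"), ("creator", "{yes,no}")],
    [(120, 300, "yes"), (2, 15, "no")],
)
print(arff)
```

The `{yes,no}` type is a nominal attribute; numeric attributes are declared with the keyword `numeric`.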
There are some functions worth mentioning, like various kinds of filtration (supervised or not), jittering or other kinds of random "pollution", randomization, sampling and standardization. It is possible to use Perl commands or visualize datasets in many ways. At any moment you can check the log to find out what happens inside, or check memory usage; logging also appears in the Ubuntu console where the program was started.
I have to mention the "dancing" Kiwi bird shown while an algorithm works. A strange feeling, when you have to watch it for a couple of hours.
Dancing Kiwi
It is a kind of obvious statement that the motor of every social-networking site is its content creators. Each owner of a social-networking site knows that what he provides is only a machine, leaving the "stream of life" in the hands (and keyboards) of the most active users. Nothing says more than numbers: my research shows that only 0.5% of all users of my S-N site are responsible for 38% of the content created!
From the business point of view it is critical to have such users. In a situation where everybody wants to eat but nobody plants crops, the result is starvation for most of the society. It is also said that valuable content has a magnetism within, attracting both users and search engines.
Hunting for content creators should be high on the TO-DO list after starting a UCC website. Connecting the dots, content creation in S-N sites and my interest in data mining, resulted in an idea: use data mining to discover users who might be better-than-average content suppliers.
How to do it, having an 8-year-old Internet board database, full of profile information, with over 3200 users and over 115k posts? How will it affect the life of the society? What is the reliability of the research? And finally, what is the point (where is the money)?
As usual, there are a lot of questions, with answers given only as probabilities. The next part of the picture will be revealed in the following post.
AdaBoost (Adaptive Boosting) is a meta-algorithm used to improve classification results. The concept is to make a lot of weak classifiers cooperate to boost the results. Adaptivity means in this case that detecting a wrong classification makes the algorithm do more work on it (by changing the weights, setting the algorithm to put more effort where it failed).
AdaBoost is sensitive to noisy data or outliers.
[http://www.cs.princeton.edu/~schapire/boost.html; Wu et al., "Top 10 algorithms in data mining", Springer 2008]
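The reweighting loop can be sketched in a few lines. Below is a bare-bones AdaBoost on a one-dimensional toy set, with threshold "stumps" as the weak classifiers; the data and the number of rounds are made up for illustration.

```python
import math

# AdaBoost sketch: each round picks the stump with the lowest weighted
# error, then increases the weight of the points it misclassified.

def best_stump(xs, ys, w):
    """Pick the threshold/polarity stump with the lowest weighted error."""
    best = None
    for t in xs:
        for polarity in (1, -1):
            pred = [polarity if x >= t else -polarity for x in xs]
            err = sum(wi for wi, p, y in zip(w, pred, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, t, polarity)
    return best

def adaboost(xs, ys, rounds=3):
    n = len(xs)
    w = [1.0 / n] * n
    model = []  # list of (alpha, threshold, polarity)
    for _ in range(rounds):
        err, t, pol = best_stump(xs, ys, w)
        err = max(err, 1e-10)              # guard against division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, t, pol))
        # "do more work on it": raise the weight of misclassified points
        for i, (x, y) in enumerate(zip(xs, ys)):
            pred = pol if x >= t else -pol
            w[i] *= math.exp(-alpha * y * pred)
        total = sum(w)
        w = [wi / total for wi in w]
    return model

def predict(model, x):
    score = sum(a * (p if x >= t else -p) for a, t, p in model)
    return 1 if score >= 0 else -1

xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]   # no single stump can classify this pattern
model = adaboost(xs, ys)
print([predict(model, x) for x in xs])
```

No single stump classifies the +, +, -, -, +, + pattern, but the weighted vote of three stumps does, which is exactly the point of boosting weak classifiers.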
CART (Classification And Regression Trees) is a decision tree algorithm. Trees created by CART are binary: there are two branches coming out of each node. The algorithm goes as follows: examine every possible partition and choose the best one (by a "goodness" criterion). To reduce complexity, there are some pruning (branch-cutting) techniques.
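The core splitting step looks like this in a small sketch; I use weighted Gini impurity as the "goodness" criterion (one common choice for CART), with a made-up single numeric attribute.

```python
# Scan every binary partition of one numeric attribute and keep the
# split with the lowest weighted Gini impurity of the two branches.

def gini(labels):
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(xs, ys):
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue  # not a real partition
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(xs)
        if best is None or score < best[0]:
            best = (score, t)
    return best

xs = [1, 2, 3, 10, 11, 12]
ys = ["a", "a", "a", "b", "b", "b"]
print(best_split(xs, ys))  # → (0.0, 3): splitting at x <= 3 is perfect
```

A full CART implementation applies this recursively to each branch and then prunes; the sketch shows only one node's decision.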
C4.5 is also a decision tree algorithm. What differs is the possibility of creating more-than-binary trees. It is also the information gain that decides about attribute selection. The attribute with the biggest information gain (i.e. the greatest entropy reduction) ensures classification with the lowest amount of information needed to classify correctly.
Entropy is the number of bits needed to send information about the result of an event with probability p. For each possible split of the training set into sub-sets, it is possible to calculate the information requirement (as a weighted sum of the entropies of the sub-sets). The algorithm chooses the optimal split, the one with the biggest information gain.
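The calculation above fits in a few lines; the two-class counts below are a made-up toy example, not real data.

```python
import math

# Entropy of a label list, and the information gain of a candidate split:
# parent entropy minus the weighted sum of sub-set entropies.

def entropy(labels):
    n = len(labels)
    result = 0.0
    for y in set(labels):
        p = labels.count(y) / n
        result -= p * math.log2(p)
    return result

def information_gain(parent, subsets):
    n = len(parent)
    weighted = sum(len(s) / n * entropy(s) for s in subsets)
    return entropy(parent) - weighted

parent = ["yes"] * 9 + ["no"] * 5
split = [["yes"] * 6 + ["no"] * 2, ["yes"] * 3 + ["no"] * 3]
print(round(information_gain(parent, split), 3))  # → 0.048
```

C4.5 computes this gain for every candidate attribute and branches on the winner.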
The disadvantages of the C4.5 algorithm are its huge memory and processor capacity requirements, which are necessary to produce rules.
The C5.0 algorithm was presented in 1997 as a commercial version of C4.5. It was an important step forward, as tests showed both better classification results and more supported data types.
[Wu et al., "Top 10 algorithms in data mining", Springer 2008; Daniel Larose, "Odkrywanie wiedzy z danych" (Discovering Knowledge in Data), PWN 2006, p. 118]
CRISP-DM stands for CRoss-Industry Standard Process for Data Mining. It is a methodology used in carrying out data mining projects, as data exploration, like other business processes, demands a general guide to follow.
Basic methodology is split into four parts:
- problem identification
- data preprocessing (turning data into information, whatever that means)
- data exploration
- evaluation (result examination)
Data mining is in general a mechanism that lets us make better decisions in the future by analysing (in a very fancy way) past data. There are two moments in the data mining process where we have to be careful: when we discover a pattern that is false, or when a pattern is true but useless. The first is a straight danger, because business decisions made on a false basis simply cost money (sometimes an awful lot of money). The second has an additional, hidden trap, because it only becomes clear that the rule is useless after implementing it; the system simply doesn't pass the reality check. Maintaining the methodology provides us with a mechanism to minimize the probability of making such a mistake.
According to crisp-dm.org, it is an open methodology that keeps the industrial data mining process close to a general business-and-research problem-solving strategy. The system is divided into 6 steps:
- business problem and condition understanding
- data understanding
- data preparation
- modelling
- evaluation
- deployment
It is very important to notice that each step is strictly connected with the results of the previous one, and it is necessary to jump several times between levels (not only in the order presented above!). It is also natural that the result of one step causes a return to the starting point of the project and a re-evaluation of some opinions or initial designs.
[M. Berry, G. Linoff „Data Mining Techniques”, Wiley 2004.]
[Daniel Larose, "Odkrywanie wiedzy z danych" (Discovering Knowledge in Data), PWN 2006, p. 5]
SVM stands for Support Vector Machines. The idea of this classification algorithm is to generate a border between objects that belong to different decision classes. A big advantage of this approach is its simple training, and moreover it can easily be used to solve multi-dimensional problems. The border between the objects is found by an iterative algorithm.
Types of SVM:
- epsilon-SVM regression (ε-SVR)
- nu-SVM regression (ν-SVR)
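For intuition, here is a toy linear soft-margin SVM trained iteratively with stochastic sub-gradient descent on the hinge loss (a Pegasos-style sketch). The data is made up, and a constant 1.0 feature plays the role of a bias term; a real project would rather use LIBSVM or Weka's SMO implementation.

```python
import random

# Linear SVM via sub-gradient descent: shrink w for regularization each
# step, and pull it toward points that violate the margin.

def train_svm(points, labels, lam=0.01, epochs=500, seed=0):
    rng = random.Random(seed)
    w = [0.0] * len(points[0])
    t = 0
    for _ in range(epochs):
        order = list(range(len(points)))
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)          # decaying learning rate
            x, y = points[i], labels[i]
            margin = y * sum(wj * xj for wj, xj in zip(w, x))
            w = [wj * (1.0 - eta * lam) for wj in w]   # regularization shrink
            if margin < 1:                  # inside the margin: adjust w
                w = [wj + eta * y * xj for wj, xj in zip(w, x)]
    return w

def classify(w, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else -1

# two well-separated classes; the trailing 1.0 acts as the bias feature
points = [(0, 0, 1.0), (1, 0, 1.0), (0, 1, 1.0),
          (8, 8, 1.0), (9, 8, 1.0), (8, 9, 1.0)]
labels = [-1, -1, -1, 1, 1, 1]
w = train_svm(points, labels)
print([classify(w, p) for p in points])
```

The learned w defines the separating border; points closest to it (the support vectors) are the ones that keep triggering updates during training.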