Archive for the ‘data mining’ Category
Today I’d like to introduce a good online source of data mining and web mining knowledge: kdnuggets.com. As they advertise, it is a professional newsletter with over 12k subscribers. KD stands for Knowledge Discovery, which goes hand in hand with data mining.
You can find a lot of useful information from the industry there: news, modern data mining software, datasets and plenty of educational material. There is even a board with job offers from the knowledge management business. I strongly recommend visiting the KDnuggets website as a great online resource on web mining and data mining.
Here (part 3) you will find the previous part of the Hunting content creators series.
In the initial phase of a social networking site’s life, it has to be supplied with fuel: content. Just like a car engine, it needs a regular delivery of good-quality fuel to work properly. Usually, users are treated equally when it comes to creating content. Sometimes it is possible to become a moderator or an admin, but normally ‘user’ is the first and final stage of the ‘site career’.
Tracking the users who are content creators, and enabling them to work more efficiently and effectively, would let the site develop much quicker, which is good for both the owner and the users. The main goal of the research is to check whether web mining algorithms are useful for extracting knowledge about users who are potential content creators on social networking sites. The research was done using WEKA ver. 3.6.2 on Ubuntu Linux 8.10 (2.6.27-15-generic kernel), with a database from an internet board: 3200 users, about 120k messages, PhpBB by Przemo technology.
The researched internet board runs on PhpBB by Przemo and has 3200 users and 118,000 messages. The data is stored in a MySQL database, in a table named “users” that contains information about users.
20 parameters describing users have been chosen:
- user ID (ascending with registration date),
- account status (activated / not activated / deactivated, i.e. used but no longer active for some reason),
- username,
- session time,
- last visit,
- registration date,
- number of posts,
- interface language (Polish, English),
- avatar type (avatar from board, external source, none),
- e-mail,
- user webpage,
- where the user is from (normally a city or town, but the creativity of users is infinite),
- signature,
- instant messenger number,
- interests,
- birth date,
- sex,
- total time spent on the forum,
- number of visits.
Some parameters were removed from the research:
- user level,
- private messages details,
- other functions of forum script,
- additional profile fields,
- data about restrictions, etc.
During preparation, the data was exported to CSV format, which is supported by Weka. Some difficulties were encountered during the process:
- the CSV file had to be edited with an external tool,
- EOL characters were not accepted in the signature / from fields,
- the ‘ character was not accepted in the signature / from fields.
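The two field problems listed above can be worked around before the file reaches Weka. Here is a minimal sketch in Python that flattens line breaks and drops quote characters in free-text columns. The column names `user_from` and `user_sig` and the sample row are made up for illustration, they are not taken from the real board export.

```python
import csv
import io

def clean_field(value):
    """Flatten line breaks and drop quote characters in a free-text field."""
    # Replace every kind of line break with a single space.
    flattened = " ".join(value.splitlines())
    # Drop straight and curly quote characters.
    for ch in ("'", '"', "\u2018", "\u2019"):
        flattened = flattened.replace(ch, "")
    return flattened.strip()

def clean_csv(text, text_columns):
    """Return CSV text with the given free-text columns cleaned."""
    reader = csv.DictReader(io.StringIO(text, newline=""))
    out = io.StringIO(newline="")
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        for col in text_columns:
            row[col] = clean_field(row[col])
        writer.writerow(row)
    return out.getvalue()

# Hypothetical export: one user with a line break and an apostrophe.
raw = 'user_id,user_from,user_sig\n1,"Krakow\nor maybe not","it\'s me"\n'
cleaned = clean_csv(raw, ["user_from", "user_sig"])
print(cleaned)
```

Dropping the characters loses a bit of information, but for signature and location fields that is usually acceptable; escaping them instead would be the alternative.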
Data understanding and preparation is a very important process in business data mining. In an iterative management model it has to be repeated multiple times to get the best results.
The popularity of the Internet has changed the way feedback on products and services reaches their providers. The main consequence was a turn towards the individual client, because every single user of a product or service who is also an Internet user can publish his sentiment worldwide. Moreover, an opinion or review is accessible to a large number of potential or actual customers worldwide, which can work as either promotion or discouragement. Awareness of this fact makes providers and producers look for better ways to collect users’ opinions, usually to improve the quality of service and relations with customers. Such opinions can appear on the global network in various forms. Comments posted directly on producers’ websites are quite popular, but so are blogs (both producers’ and independent ones) and discussion boards (where professional testers meet regular users who just seek a product or service suitable for their simple needs). All of this obliges providers to be present and participate in the places where sentiment is published and discussed. Collecting marketing information and giving feedback and help has become an important service that improves relations with customers and increases the chance of succeeding in business.
Opinions and reviews on the Internet are subject to the same rules as any other kind of content: there is a lot of information and it is difficult to find the items most suitable to a given query. Moreover, Internet content changes constantly: a review noticed today can suddenly be removed, with several others found in its place. There is also no assurance that a review is reliable, and there is always a chance of falling victim to spoiled or malicious content. However, these drawbacks do not outweigh the advantages of Internet opinions and reviews, for example getting direct feedback from the customer, and providers try to access this knowledge in every possible way.
Web mining could be a method to improve the business mechanism: customer opinion -> producer -> positive change. Using the Internet as a source of feedback, a producer obtains knowledge about the functional features of his product and an opportunity to develop important domains, such as business intelligence, knowledge management or quality control. Thanks to data exploration, the data collection mechanism is automatic and tuned to the producer’s needs. The customer gains time savings and more suitable web search results.
Opinion classification is an automatic detection mechanism that recognizes the sentiment of a customer describing his experience with a product or service. Sentiment can be positive or negative, but a more detailed approach is also possible, for example whether the product is recommended or not recommended. In general, the problem is part of natural language processing. Customer sentiment research starts with data gathering and preparation, followed by a three-step process: tagging fragments of text (phrases of at least two words, used for classification), detecting their semantic orientation (with a certain probability) and calculating the probable overall sentiment. Example techniques for user sentiment classification are Support Vector Machines, Naive Bayes or EM.
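The three-step process described above can be illustrated with a toy example. This is only a sketch: the orientation scores are hard-coded by hand in a hypothetical `ORIENTATION` table, while a real system would estimate them from data, and the final label is just the sign of the average score.

```python
# Hypothetical orientation scores for two-word phrases; a real system would
# estimate them from data instead of hard-coding them.
ORIENTATION = {
    ("very", "good"): 1.0,
    ("works", "great"): 0.8,
    ("not", "recommended"): -1.0,
    ("poor", "quality"): -0.9,
}

def two_word_phrases(text):
    """Step 1: split the text into consecutive two-word phrases."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    words = [w for w in words if w]
    return list(zip(words, words[1:]))

def semantic_orientation(phrases):
    """Step 2: keep the orientation of every phrase found in the table."""
    return [ORIENTATION[p] for p in phrases if p in ORIENTATION]

def classify(text):
    """Step 3: average the orientations and map the result to a label."""
    scores = semantic_orientation(two_word_phrases(text))
    if not scores:
        return "unknown"
    mean = sum(scores) / len(scores)
    return "positive" if mean > 0 else "negative"

print(classify("Very good phone, works great."))
print(classify("Poor quality, not recommended!"))
```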
Customer review mining matters not only because a review contains the user’s sentiment, but also because it gives details about product usage or service features and their importance to the final evaluation. Because of this, review research is a more difficult task, but its effects could meaningfully improve the quality of a product or service. Thanks to a detailed description, the producer or provider can place himself “in the customer’s shoes” and, as a result of this feedback, improve. The process goes in steps: extracting the described features, extracting the customer’s sentiment regarding each described feature, and summarizing the results.
In conclusion, automating customer sentiment and opinion research with web mining mechanisms is potentially a very profitable business. A description of product usage or a simple summary of a transaction is a very popular way of giving customer feedback. Widespread products and services have collected many descriptions on the Internet, which makes checking all the opinions time- and resource-consuming. An automatic mechanism to collect and summarize such feedback would be very useful for producers, providers and customers alike.
PageRank, Larry Page’s algorithm, is probably the most popular and well-known use of web linkage mining. This context-free approach is simply a popularity contest, where the importance of a ‘vote’ is measured by the importance of the originating site itself. The better the site linking to my page is, the bigger the gain in rating I get. Looking inside, the importance of a site is measured by the probability of visiting it. The exact way to get the numbers is Google’s secret, obviously (I bet naive Bayes is used somewhere there ;).
What about reality? PageRank is vulnerable to spamming and a lot of people cheat PR for a living. In short, a farm of sites (a link farm) is created and its coordinated work pulls the target site up in the ranking. There is also the language problem of how to deal with ambiguous keywords. Then there is a technical problem, solved more or less fine by the taxation mechanism: pages with no further linkage (PR value thieves, as the PR popularity flows there and stays forever). Random jumping also helps with dead-end sites. Prediction mechanisms are also worth mentioning, as well as using local resources to save some time and computing power, e.g. processing data for a whole domain or server.
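To make the random jump and the dead-end problem more concrete, here is a minimal sketch of the textbook power-iteration form of PageRank in Python. It is not Google's actual implementation. The tiny link graph is invented, and the damping value 0.85 is just the commonly quoted default, used here as an assumption.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a dict mapping each page to the pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Random jump ("taxation"): every page receives a small equal share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its damped rank evenly among its out-links.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dead end: spread its rank evenly so nothing leaks away.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical mini-web: "e" is a dead end, "c" collects the most votes.
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"], "e": []}
ranks = pagerank(web)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

Without the dead-end branch the total rank would shrink with every iteration, which is exactly the "PR thief" effect described above.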
There are some modifications of the PageRank algorithm. An interesting one is topic-sensitive PageRank by T. Haveliwala. It adds contexts (topic-specific groups, like the DMOZ categories), and the idea is to keep results close to a previously specified topic. The big advantage of this approach is that personalization of the search process can be easily applied (a user-specific popularity ranking instead of the general one).
Weka is a collection of algorithms commonly used in data mining. It has both a graphical and a command line interface. The second is probably useful for more complicated projects; for me the simple Explorer was enough. Moreover, you can use it from your own Java code. Weka contains tools for data preparation (normalization, discretization and a bunch of others), classification, clustering, regression and association rules, not to mention well-developed visualization.

Weka - explorer window
I very much enjoyed working with Weka. After some struggling with the input data format (I used CSV) and a little exercise, a wide choice of possibilities appeared. I used Weka in a Unix environment, Ubuntu 8.10.
The basic data format for Weka is ARFF (Attribute-Relation File Format), an ASCII text format. It describes instances that share a set of attributes. You can choose another file format (.names, .data (C4.5), .csv, .libsvm, .dat, .bsi, .xrff), which is what happens most of the time, at least at the beginning of a project, when you have a lot of data from external sources like MySQL databases or Excel.
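For readers who have never seen an ARFF file, the sketch below builds one as plain text in Python. The relation name, attribute names and the two data rows are invented for illustration. The helper assumes values contain no spaces or commas, so it does no quoting.

```python
def to_arff(relation, attributes, rows):
    """Build ARFF text from attribute definitions and data rows.

    attributes: list of (name, type) pairs, where type is "numeric",
    "string", or a list of allowed nominal values.
    """
    lines = ["@relation " + relation, ""]
    for name, kind in attributes:
        if isinstance(kind, list):
            # Nominal attribute: list the allowed values in braces.
            kind = "{" + ",".join(kind) + "}"
        lines.append("@attribute {} {}".format(name, kind))
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"

# Hypothetical forum data: two numeric and two nominal attributes.
arff_text = to_arff(
    "forum_users",
    [
        ("posts", "numeric"),
        ("visits", "numeric"),
        ("sex", ["male", "female", "unknown"]),
        ("creator", ["yes", "no"]),
    ],
    [(120, 300, "male", "yes"), (2, 5, "unknown", "no")],
)
print(arff_text)
```

The header declares the attributes once, and every line after `@data` is one instance with values in the same order.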
There are some functions worth mentioning, like various kinds of filters (supervised or unsupervised), adding jitter or other kinds of random “pollution”, randomization, sampling and standardization. It is possible to use Perl commands or visualize datasets in many ways. At any moment you can check the log to find out what is happening inside, or check memory usage; logging also goes to the Ubuntu console from which the program was started.
I have to mention the “dancing” Kiwi shown while an algorithm is working. It is a strange feeling when you have to watch it for a couple of hours.
AdaBoost (Adaptive Boosting) is a meta-algorithm used to improve classification results. The concept is to make a lot of weak classifiers cooperate to boost results. Adaptive means here that detecting a wrong classification makes the algorithm work harder on it (by changing the weights so that more effort goes where it failed).
AdaBoost is sensitive to noisy data or outliers.
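The reweighting idea is easier to see in code. Below is a minimal sketch of AdaBoost in Python with one-dimensional threshold rules ("stumps") as the weak classifiers. The data set is invented, and the stump learner is a simplification chosen for this example, not the only possible weak classifier.

```python
import math

def best_stump(xs, ys, weights):
    """Pick the threshold and polarity with the lowest weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            preds = [polarity if x >= threshold else -polarity for x in xs]
            error = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best

def adaboost(xs, ys, rounds=3):
    """Train weighted decision stumps on 1-D data with labels in {-1, +1}."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        error, threshold, polarity = best_stump(xs, ys, weights)
        # Clamp the error to avoid log(0) and division by zero.
        error = min(max(error, 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - error) / error)
        ensemble.append((alpha, threshold, polarity))
        # Misclassified examples get heavier, correct ones lighter.
        for i, (x, y) in enumerate(zip(xs, ys)):
            pred = polarity if x >= threshold else -polarity
            weights[i] *= math.exp(-alpha * y * pred)
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Weighted vote of all stumps."""
    score = sum(a * (p if x >= t else -p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical data that no single threshold can separate.
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]
model = adaboost(xs, ys, rounds=3)
print([predict(model, x) for x in xs])
```

No single stump gets this data right, but the weighted vote of three of them does, which is the whole point of boosting.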
[http://www.cs.princeton.edu/~schapire/boost.html; Wu et al., "Top 10 algorithms in data mining", Springer 2008]
CART (Classification and Regression Trees) is a decision tree algorithm. Trees created by CART are binary: exactly two branches come out of each node. The algorithm goes as follows: look at every possible partition and choose the best one according to a “goodness” criterion. To reduce complexity, there are pruning (branch-cutting) techniques.
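As an illustration of the "try every partition, keep the best" step, here is a small Python sketch for a single numeric attribute. It uses Gini impurity as the goodness measure, which is a common choice for CART but an assumption here, and the post counts and labels are made up.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_binary_split(values, labels):
    """Try every threshold on one numeric attribute and keep the purest split."""
    best = None
    # Skip the smallest value: it would leave the left branch empty.
    for threshold in sorted(set(values))[1:]:
        left = [y for x, y in zip(values, labels) if x < threshold]
        right = [y for x, y in zip(values, labels) if x >= threshold]
        # Weighted impurity of the two branches; lower is better.
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical data: number of posts and whether the user is a content creator.
posts = [0, 2, 3, 40, 55, 120]
creator = ["no", "no", "no", "yes", "yes", "yes"]
print(best_binary_split(posts, creator))
```

A full tree would repeat this search over all attributes and then recurse into both branches.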
C4.5 is also a decision tree algorithm. What differs is the possibility of creating more-than-binary trees. Also, it is the “information gain” that decides attribute selection. The attribute with the biggest information gain (that is, the greatest entropy reduction) ensures classification with the lowest amount of information needed to classify correctly.
Entropy is based on the number of bits needed to send information about the result of an event with probability p, which is -log2(p); averaged over all possible results, it gives the entropy of a set. For a possible split of the training set into subsets, it is possible to calculate the information requirement (as a weighted sum of the entropies of the subsets). The algorithm chooses the optimal split, the one with the biggest information gain.
Disadvantages of the C4.5 algorithm are the large memory and processor capacity it requires to produce rules.
The C5.0 algorithm was presented in 1997 as a commercial version of C4.5. As tests have shown, it was an important step ahead, both in classification results and in the supported types of data.
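The entropy and information gain calculation can be written down in a few lines of Python. The sketch below uses invented yes/no labels and compares two hypothetical splits of the same training set.

```python
import math
from collections import Counter

def entropy(labels):
    """Average number of bits needed to encode the class of one example."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, subsets):
    """Entropy before the split minus the weighted entropy after it."""
    n = len(labels)
    remainder = sum(len(s) / n * entropy(s) for s in subsets)
    return entropy(labels) - remainder

# Hypothetical training set and two candidate splits.
labels = ["yes", "yes", "no", "no"]
clean_split = [["yes", "yes"], ["no", "no"]]
useless_split = [["yes", "no"], ["yes", "no"]]

print(information_gain(labels, clean_split))
print(information_gain(labels, useless_split))
```

The first split separates the classes completely and gains the full bit of information; the second leaves both subsets as mixed as the original set and gains nothing.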
[Wu et al., "Top 10 algorithms in data mining", Springer 2008; Daniel Larose, "Odkrywanie wiedzy z danych", PWN 2006, p. 118]
SVM stands for support vector machines. The idea of this classification algorithm is to generate a border between objects that belong to different decision classes. A big advantage of this approach is that training is simple; moreover, it can easily be used to solve multi-dimensional problems. The line between objects is generated by an iterative algorithm.
Types of SVM:
- C-SVM
- nu-SVM
- epsilon-SVM regression
- nu-SVM regression
[http://www.spc.com.pl/textbook/stathome_stat.html?http%3A%2F%2Fwww.spc.com.pl%2Ftextbook%2Fstmachlearn.html]
Classification and regression trees
Decision trees are one of the classification methods: structures consisting of nodes connected by branches. Unlike in nature, the root appears at the top of the structure and the branches go down, ending with leaves or leading to another node.
The main goal of the algorithm is to select attributes (both which ones and in what order matter) to obtain the highest confidence level. Decision trees fall under the supervised learning category.
It is possible to employ classification trees when:
- a training set with a defined target variable exists,
- the training set provides the algorithm with a representative group of records (enough examples),
- the target variable is discrete.
Bibl. [Daniel Larose, "Odkrywanie wiedzy z danych", PWN 2006, pp. 109, 111]
Naive Bayes is a statistical classifier. It is one of the oldest formal classification algorithms, but thanks to its simplicity and efficiency it is still often used, for example in anti-spam mechanisms. The method is a supervised classification: we are given a set of objects with classes assigned, and we use it to generate rules that help us assign future objects to classes.
MAP (maximum a posteriori) classification is a very popular estimation method in Bayesian statistics. MAP is said to be optimal: the minimal error is achieved. The problem is computational complexity, which is c^n (c: number of classes, n: number of describing variables). Naive Bayes, however, assumes that the variables (components) are conditionally independent. The point is that if this is true, NB also gives optimal results.
It may seem that the independence assumption is too strict to hold in the real world. Nevertheless, the work done before classification makes the difference, e.g. selection and elimination of correlated variables is always part of the data mining methodology.
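To show how the independence assumption keeps the computation cheap, here is a minimal Naive Bayes sketch in Python for the anti-spam case. The four training messages are invented, and add-one (Laplace) smoothing is an extra assumption of this example so that unseen words do not zero out a class.

```python
import math
from collections import Counter, defaultdict

def train(documents):
    """documents: list of (list_of_words, class_label) pairs."""
    class_counts = Counter(label for _, label in documents)
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for words, label in documents:
        word_counts[label].update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary

def classify(model, words):
    """Pick the class with the highest posterior (the MAP decision)."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label, count in class_counts.items():
        # log P(class) + sum of log P(word | class), with add-one smoothing.
        score = math.log(count / total_docs)
        total_words = sum(word_counts[label].values())
        for w in words:
            score += math.log(
                (word_counts[label][w] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training messages.
training = [
    ("cheap pills buy now".split(), "spam"),
    ("buy cheap watches".split(), "spam"),
    ("meeting agenda for monday".split(), "ham"),
    ("project meeting notes".split(), "ham"),
]
model = train(training)
print(classify(model, "cheap pills".split()))
print(classify(model, "meeting notes".split()))
```

Because each word contributes its own independent factor, the score is a simple sum of logarithms instead of a table over all word combinations.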
Sources: same as previous posts