Archive for the ‘definitions’ Category
Web linkage mining – one of the web exploration types – exploration of the interconnection in web is simply looking for useful information within connections between objects in Internet. There are various different connections, from locators (e.g. URL), then links between sites, to connections between tables in databases. Main task of web linkage mining is discovering and exploiting information, which can be used to better understanding the data. Usually label of the link is connected to both origin (source page) and destination (address, where link is pointing).
[netinsert=0.0.1.3.9.2.1]
Basic practical utilization of web linkage mining is search result rating (websites rating). The most well-known rating mechanisms are Pagerank (by Google), context Pagerank and Hubs & Authorities.
Rating depending on number of links pointing to the object isn’t new idea. There is a document stating, that in 1970 was a try to publish science articles rating, depending on number of quotation in other documents.
There are three approaches:
1. non-contextual (e.g. Pagerank)
The very example is Pagerank. Works as “popularity content”, the difference comparing to usual method is that quality of the backlink is measured by the quality of source page itself. The better linking page, the more is added to rating. Inside algorithm, importance of the page is defined by the probability of visiting the page – but specification of algorithm is not revealed by Google.
Regarding disadvantages – non-contextual algorighm is vulnerable to spamming. Server farms is easy mechanism to coordinate efforts resulting in pushing pages to the top of the rating in artificial way. Good solution is also anticipated, when it comes to keywords with several meanings. Serious problem is also pages collecting rating – that with no further links (this one is solved by special “page tax”).
On the other hand, non-contextual algorithm Pagerank has prediction mechanism that accelerates execution, usage of local resources, which reduces number of calculation needed to generate rating.
2. contextual (e.g. topic-specific Pagerank)
Topic-specified Page rank is a contextual method of www search result rating, made by T. Haveliwala. It is a modification of original Pagerank algorithm – there are thematical categories added (contexts), e.g. open directory dmoz and algorithm is taught to give priority to the pages which are close to the source documents in directory. The main idea is to keep results close to the given topic.
The advantage of contextual approach is – at least thoretical – possibility of rating’ personalization. Requirements of the user, who described his features and priorities before search could be meet better. For example, declarative 70% of sport interest and 30% art interest with possibility of contextual search in directory with consistent content segregation enables to generate results accordant to user profile.
3. detailed (Hubs and authorities algorithm)
H&A algorithm (proper nam: HITS – Hyperlink-induced topic search) is two moduled, responsible for collecting pages according to pattern, queried in search engine and calculating probability of classification document to the types described below:
- authorities – documents with important content from search engine point of view (e.g. definitions, information about topic, etc),
- hubs – documents containing important links or anchors
According to the algorighm’s logic, valuable page – authority – contains links in several hubs, when good hub contains links to several authorities.
Exploration of internet usage data is proceed to discover general patterns of users’ behavior. Results of the research are patterns of websites access, which could be use to improve page functionning or quality of service. Navigation on the webpage, server structure or ad presentation mechanism, e.g. potentially best place for ad, all those can be improved this way.
Log exploration could be use for:
- discovering characteristic of users,
- discovering association rules and subjections (e.g. between groups of users or users behaviors when exploring web),
- extrapolation (e.g. navigation path of the user exploring web),
- user classification,
- timing of web access analysis,
- web traffic analysis, leading to navigation paths discovering.
The most popular attempt is discovering frequent web users navigation paths. Server logs are researched and sequence patterns discovery algorithm, e.g. GSP, Prefixspan the most frequent ways of website exploration by users are found.
Basic difficulties of web usage mining process are identification of user session, because it could be serveral activities led by user during single stay on website. Another problem is limited amout of information in standard usage log, which decreases amount of potential profits form analysis.
Knowledge synthesis is knowledge database building process, from initially separate elements to the system. Following the paradigm of information search methods in Internet, search engines having given keywords as input, should generate sorted aggregation of the pages, that match keywords better than other pages. Then the most suitable result is chosen by user. It is useful and effective method, when user looks for specific information or definiton, but when it comes to open questions or more complicated queries.
Implementation of clustering of the web search results improved the mechanisms of Internet search process. When going step further the question appears, if it is possible to obtain full information from the search engines or albeit an aggregate providing consistent picture of researched topic?
In traditional knowledge synthesis, effect of the work is usually presented as systematic overwiev or metaanalysis. Description of the issue is made on behalf of scientific proofs and regarding methodology: problem has to be defined as well as sources (literature), data has to be understood and extracted, then researched. Results have to be summated, critically evaluated and finally – concluded. Such overview is knowledge aggregation. More detailed synthesis derives not only from scientific literature, but also from conferences, academic courses or scripts, etc. In some domains, important part of knowledge doesn’t exist in written form, which is a serious complication in research process. Regarding Internet knowledge synthesis, similar situation is with “deep web” or “dark web”. Automatic knowledge synthesis requires sometimes interconnection between several domains, such as artificial intelligence, semantic networks, data mining, but still keeping scientific and analytical approach.
Automatic knowledge synthesis research provides:
- automatically generated result on query regarding demanded topic, in consistent and comprehensive form, as a summary of the most important publication in the Internet or other source of data,
- possibility of getting answer on complex question, asked directly to the web browser,
- automatization of several parts of the vivid processes, where decisions would be made automatically by the machine, on the basis of knowledge extracted from systematic overview generated from Web or databases.
Web content mining is a part of data mining domain that is the closest one to the classic definition of DM. Web content mining aspects are related to the similar domains in classic data mining.
- automatic content extraction from web pages
- integration of the information
- opinion and rewievs extraction
- knowledge synthesis
- noise detection and segmentation
Briefly said, web content mining listed above are solutions for more or less complicated problems or issues, connected to automation of web usage, which lead to the improvement in several aspects of Internet daily life, considering both technical and non-technical matters.
Weka is a collection of the algorithms, commonly used in data mining. There are both graphic and command line interface, probably second possibility is useful for more complicated projects – for my it was enough to use simple explorer. Moreover, one can use personal java code. Weka contains tools for data prepration (normalization, discretization and the bunch of other), classificaton, clustering, regression, association rules, not to mention well expanded visualization.

Weka - explorer window
I enjoyed very much working with Weka. After some struggling with input data format (I used CSV), with a little exercise a wide choice of possibilities appeard. I used Weka in the Unix environment, Ubuntu 8.1.
Basic data format for Weka is arff: Attribute – Relation File Format, ascii file format. It describes instances which are sharing attributes. You can choose another file format ( .names, .data (C4.5), .csv, .libsvm. .dat, .bsi, .xrff.), what happens most of the time, at least at the beginning of projects, when you have a lot of data from external sources, like MySQL databases or Excel.
There are some functions worth mentioning, like various kinds of filtration, e.g. supervised or not, jitterizing or other kind of random “pollution”, randomizaton, sampling, standarization. It is possible to use Perl commands or visualize datasets in many ways. Every single moment you can check log to find out what happens inside or check memory, logging in the ubuntu-console, where program started also takes place.
I have to mention about “dancing” Kiwi when algorithm works. Strange feeling, when you have to watch in for a couple of hours. Dancing Kiwi
ADABoost (Adaptive Boosting) is a meta-algorithm used to improve classification results. The concept is to make a lot of weak classifiers cooperate to boost results. Adaptability means in this case that detection of the wrong classification makes the algorithm do more work on it (by changing the wages and setting algorithm to do more effort where it failed).
AdaBoost is sensitive to noisy data or outliers.
[http://www.cs.princeton.edu/~schapire/boost.html; Wu i inni "Top 10 algorithms in data mining" Springer 2008]
CART (regression and classification tress) – decision trees algorithm. Trees created by CART are binary – there are two branches coming out of the node. The algorithm goes as follow: look for every partition possible, and choose the best one (“goodness” criterium). To reduce the complexity there are some pruning (=cutting branches) techniques.
C4.5 is also decision trees algorithm. What differs is the possibility to create more-than-binary trees. It is also the ‘information gain” that decides about attributes selection. Attribute with the biggest information gain (or lowest entropy reduction) ensure classification with the lowest amount of information needed to classify correctly.
Entropy is a number of bits needed to send information about the result of the occurrence with probability p. In the possible spilt of the training set to the sub-sets, it is possible to calculate the requisition on the information (as an weighted sum of entropy for the sub-sets). Algorithm chooses optimal split, the one with the biggest information gain.
Disadvantages of the C4.5 algorithm are huge memory and processor capacity requirements, which are necessary to produce rules.
The C5.0 algorithm was presented in 1997, as a commercial version of C4.5. Important step ahead was made as tests provided, both with better classification results and supported types of data.
[Wu i inni "Top 10 algorithms in data mining" Springer 2008; Daniel Larose ?Odkrywanie wiedzy z danych? 2006 PWN, 118]
Classification and regression trees
Decision trees is one of the classification method – structures consisting nodes, connected with branches. Unlike the natural way, root appears on the top of the structure and branches go down ending with leaves or leading to another node.
Main goal of the algorithm is to select atributes (both whichever and sequence matter) to obtain highest conficence level. Decission trees fall under supervised learning category.
It is possible to employ classification trees, when:
- training set with defined target variable exists
- trainings set provides algorithm with representative group of records (enough examples)
- discrete target variables
Bibl. [Daniel Larose ?Odkrywanie wiedzy z danych? 2006 PWN, 109, 111]
Naive Bayes is a statistical classificator. It is one of the oldest formal classification algorithm, but still – thanks to simplicity and efficiency – it is often used, for example in anti-spam mechanisms. This method is called: supervised classification – we are given a set of objects with classes assigned and using it we want to generate rules that help us assign future objects to classes.
MAP (maximal a posteriori classification) is very popular estimation method in bayesian statistic. It is said MAP is optimal – minimal error is achieved. The problem is when it comes to computation complexity, which is c^n (c – classes n – describing variables). However, naive bayes it is said the variables (components) are independent (conditionally independent). The point is – if it is true, NB gives also optimal results.
It may seem that independence presumption is too strict to adapt it in real world. Neverthless, activity before classification makes the difference, e.g. selection and elimination of corellated variables occurs always as the part of the methodology of data mining.
Sources: same as previous posts
A priori is algorithm used in affinity analysis. A set of rules is generated, usually implications, that describe dataset. Finding frequent datasets from the transaction database is a popular task among data mining appliances, which isn’t as simple as it initially seems. The reason is computational complexity, which goes extremaly extremely high when it comes to very large databases (I like the fancy way they described it in Top 10 Algorithm in DM: combinatorial explosion).
The idea of A priori is: find frequent itemsets (frequent means one with previously assigned level of support) and then generate rules that comply previously assigned level of confidence. There are candidate itemsets generated and they are the base to find n-element frequent itemsets (1st step in procedure is to find one-element frequent itemsets, and then repeat, eliminating itemsets which support is not sufficient). the procedure of generating candidate and frequent sets is repeated simply for the possible number. The main point exploits monotonicity: ?if an itemset is not frequent, any of its superset is never frequent? (again Top 10 Algo in DM). Smart way to eliminate itemsets.
A priori is one of the most important algorithms in data mining. The other ideas to make it even more efficient are e.g. the new ways to create candidate itemsets – hashing techniques (smaller candidate itemesets), partitioning (divide the problem into smaller ones and explore them separately – if only real-life problems work this way!) or sampling. Important improvement of A priori algorithm is FP-growth algorithm, which supports compression (without losing important information) and then partitioning.
Despite of the fact A priori is rather simple, easy implementation and proper results make it serious solution in many problems.