Web linkage mining
Web linkage mining, one of the types of web exploration (web mining), is the exploration of interconnections on the web: the search for useful information in the connections between objects on the Internet. These connections range from locators (e.g. URLs), through links between sites, to relations between tables in databases. The main task of web linkage mining is to discover and exploit information that helps to understand the data better. A link's label is usually associated with both its origin (the source page) and its destination (the address the link points to).
The basic practical use of web linkage mining is the rating of search results (website rating). The best-known rating mechanisms are PageRank (by Google), contextual PageRank, and Hubs & Authorities.
Rating an object by the number of links pointing to it is not a new idea. There is a document stating that in 1970 there was an attempt to publish a rating of scientific articles based on the number of citations in other documents.
There are three approaches:
1. non-contextual (e.g. PageRank)
PageRank is the standard example. It works as a “popularity contest”; the difference from the usual method is that the quality of a backlink is measured by the quality of the source page itself. The better the linking page, the more it adds to the rating. Inside the algorithm, the importance of a page is defined by the probability of visiting that page, but Google does not reveal the full specification of the algorithm.
As for disadvantages, a non-contextual algorithm is vulnerable to spamming. Server farms are an easy mechanism for coordinating efforts that push pages to the top of the rating artificially. A good solution is also still awaited for keywords with several meanings. Another serious problem is pages that collect rating, that is, pages with no further links (this one is solved by a special “page tax”).
On the other hand, the non-contextual PageRank algorithm has a prediction mechanism that accelerates execution and makes use of local resources, which reduces the number of calculations needed to generate the rating.
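The idea of a rating defined by the probability of visiting a page can be illustrated with a short power-iteration sketch. It is not Google's implementation: the graph `toy_web`, the function name `pagerank`, and the damping value 0.85 are assumptions made for illustration. The sketch also assumes that the “page tax” corresponds to the share `1 - damping` of every page's rating that is redistributed evenly, and that the rating of pages with no outgoing links is spread over all pages.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration rating for a graph given as {page: [linked pages]}."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from a uniform rating
    for _ in range(iterations):
        # Rating held by pages without outgoing links is spread over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        # Every page receives an equal share of the "tax" and of the dangling rating.
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for source, targets in links.items():
            if not targets:
                continue
            # A source page splits its taxed rating equally among its links,
            # so a link from a highly rated page is worth more.
            share = damping * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical five-page web used only for this example.
toy_web = {
    "A": ["B", "C"],
    "B": ["C", "E"],
    "C": ["A"],
    "D": ["C"],  # D links out, but no page links to D
    "E": [],     # E collects rating and has no further links
}
ranks = pagerank(toy_web)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

In this toy graph, page D ends up with the lowest rating because nothing links to it, while the ratings of all pages still sum to 1.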
2. contextual (e.g. topic-specific PageRank)
Topic-specific PageRank is a contextual method of rating WWW search results, created by T. Haveliwala. It is a modification of the original PageRank algorithm: thematic categories (contexts) are added, e.g. from the open directory DMOZ, and the algorithm is taught to give priority to pages that are close to the source documents in the directory. The main idea is to keep the results close to the given topic.
The advantage of the contextual approach is the possibility, at least theoretical, of personalizing the rating. The requirements of a user who has described his features and priorities before the search can be met better. For example, a declared interest of 70% in sport and 30% in art, combined with contextual search in a directory whose content is consistently categorized, makes it possible to generate results that match the user's profile.
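The 70% sport and 30% art example can be sketched as follows. This is a minimal illustration, not Haveliwala's implementation: the function names `biased_pagerank` and `personalized_rating`, the graph `toy_web`, the seed lists in `topic_seeds`, and the damping value 0.85 are all hypothetical. The sketch assumes that each topic is represented by a handful of directory pages, that the random jump of the rating algorithm lands only on those pages, and that the per-topic ratings are blended using the weights of the user profile.

```python
def biased_pagerank(links, teleport, damping=0.85, iterations=50):
    """Rating in which the random jump lands only on the pages in `teleport`."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    total = sum(teleport.get(p, 0.0) for p in pages)
    jump = {p: teleport.get(p, 0.0) / total for p in pages}  # normalized jump weights
    rank = dict(jump)
    for _ in range(iterations):
        # Rating of pages without outgoing links also returns to the topic pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping + damping * dangling) * jump[p] for p in pages}
        for source, targets in links.items():
            for target in targets:
                new_rank[target] += damping * rank[source] / len(targets)
        rank = new_rank
    return rank


def personalized_rating(links, topic_seeds, profile):
    """Blend the per-topic ratings using the user's declared interest weights."""
    combined = {}
    for topic, weight in profile.items():
        seeds = {page: 1.0 for page in topic_seeds[topic]}
        for page, score in biased_pagerank(links, seeds).items():
            combined[page] = combined.get(page, 0.0) + weight * score
    return combined


# Hypothetical web and directory categories used only for this example.
toy_web = {
    "football": ["tennis", "news"],
    "tennis": ["football"],
    "painting": ["sculpture", "news"],
    "sculpture": ["painting"],
    "news": ["football", "painting"],
}
topic_seeds = {"sport": ["football", "tennis"], "art": ["painting", "sculpture"]}
profile = {"sport": 0.7, "art": 0.3}  # declared interests of the user

rating = personalized_rating(toy_web, topic_seeds, profile)
for page, score in sorted(rating.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

The link structure of the sport and art pages is deliberately symmetric here, so any difference between the ratings of "football" and "painting" comes only from the profile weights.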
3. detailed (Hubs and authorities algorithm)
The H&A algorithm (proper name: HITS, Hyperlink-Induced Topic Search) consists of two modules. The first collects pages that match the pattern queried in the search engine; the second calculates the probability of classifying a document as one of the types described below:
- authorities – documents with important content from the search engine's point of view (e.g. definitions, information about the topic, etc.),
- hubs – documents containing important links or anchors.
According to the algorithm's logic, a valuable page (an authority) is linked to from several hubs, while a good hub contains links to several authorities.
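The mutual reinforcement between hubs and authorities can be sketched as an iterative calculation. The sketch covers only the scoring module and assumes that the pages matching the query have already been collected; the function name `hits`, the graph `toy_web`, and the number of iterations are hypothetical choices made for illustration.

```python
import math


def hits(links, iterations=50):
    """Return (hub, authority) scores for a graph given as {page: [linked pages]}."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # An authority is as good as the hubs that link to it.
        auth = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]
        # A hub is as good as the authorities it links to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize both score vectors so the values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


# Hypothetical set of pages collected for one query.
toy_web = {
    "portal1": ["def_a", "def_b", "def_c"],
    "portal2": ["def_a", "def_b"],
    "blog": ["def_a"],
}
hub, auth = hits(toy_web)
print("best authority:", max(auth, key=auth.get))
print("best hub:", max(hub, key=hub.get))
```

In this toy graph "def_a" is linked to from all three link pages and becomes the strongest authority, while "portal1", which links to every authority, becomes the strongest hub.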