Web linkage mining, one of the types of web exploration, looks for useful information in the connections between objects on the Internet. These connections range from locators (e.g. URLs), through links between sites, to relations between tables in databases. The main task of web linkage mining is to discover and exploit information that helps to understand the data better. Usually the label of a link is connected to both its origin (the source page) and its destination (the address the link points to).
The basic practical use of web linkage mining is rating search results (rating websites). The best-known rating mechanisms are PageRank (by Google), contextual PageRank and Hubs & Authorities.
Rating an object by the number of links pointing to it isn’t a new idea. One document states that in 1970 there was an attempt to publish a rating of scientific articles based on the number of citations in other documents.
There are three approaches:
1. non-contextual (e.g. PageRank)
The prime example is PageRank. It works as a “popularity contest”; the difference from the usual method is that the quality of a backlink is measured by the quality of the source page itself. The better the linking page, the more it adds to the rating. Inside the algorithm, the importance of a page is defined by the probability of visiting that page, but Google does not reveal the full specification of the algorithm it runs.
As for disadvantages, the non-contextual algorithm is vulnerable to spamming. A farm of servers is an easy way to coordinate efforts that push pages to the top of the rating artificially. A good solution is also still awaited for keywords with several meanings. Another serious problem is pages that collect rating, that is, pages with no outgoing links (this one is solved by a special “page tax”).
On the other hand, the non-contextual PageRank algorithm has a prediction mechanism that accelerates execution, and it uses local resources, which reduces the number of calculations needed to generate the rating.
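The idea that a page's importance is the probability of visiting it can be illustrated with a small power-iteration sketch. This is not Google's implementation, only the textbook formulation. The damping factor of 0.85, the iteration count, the even redistribution of rank from pages without outgoing links, and the toy graph are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping a page to the list of pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank sitting on pages with no outgoing links is spread over all pages
        # (one common convention, assumed here).
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            for t in targets:
                # Each page passes its rank on in equal shares to the pages it links to.
                new_rank[t] += damping * rank[page] / len(targets)
        rank = new_rank
    return rank


# Hypothetical four-page web: C is linked from A, B and D; nobody links to D.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(graph)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph, page A ends up ranked above B although each has a single inbound link, because A's link comes from the highly rated page C.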
2. contextual (e.g. topic-sensitive PageRank)
Topic-sensitive PageRank is a contextual method of rating WWW search results, created by T. Haveliwala. It is a modification of the original PageRank algorithm: thematic categories (contexts) are added, e.g. from the open directory DMOZ, and the algorithm is taught to give priority to pages that are close to the source documents in the directory. The main idea is to keep results close to the given topic.
The advantage of the contextual approach is the possibility, at least in theory, of personalizing the rating. The requirements of a user who described his features and priorities before the search could be met better. For example, a declared 70% interest in sport and 30% interest in art, combined with contextual search in a directory whose content is consistently categorized, makes it possible to generate results that match the user profile.
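The 70% sport / 30% art example can be sketched by biasing the random jump of PageRank towards the pages listed under each topic. This is a simplified illustration, not Haveliwala's published procedure: the function names, the page names, the topic lists and the damping factor are hypothetical.

```python
def personalized_pagerank(links, teleport, damping=0.85, iterations=50):
    """PageRank whose random jump lands on pages in proportion to `teleport`."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    total = sum(teleport.get(p, 0.0) for p in pages)
    jump = {p: teleport.get(p, 0.0) / total for p in pages}
    rank = dict(jump)
    for _ in range(iterations):
        dangling = sum(rank[p] for p in pages if not links.get(p))
        # Jumps (and rank from dead-end pages) return to the topic pages only.
        new_rank = {p: (1.0 - damping + damping * dangling) * jump[p] for p in pages}
        for page in pages:
            targets = links.get(page, [])
            for t in targets:
                new_rank[t] += damping * rank[page] / len(targets)
        rank = new_rank
    return rank


def blend(topic_pages, weights):
    """Mix per-topic seed pages according to the user's declared interests."""
    teleport = {}
    for topic, weight in weights.items():
        seeds = topic_pages[topic]
        for p in seeds:
            teleport[p] = teleport.get(p, 0.0) + weight / len(seeds)
    return teleport


# Hypothetical directory with one seed page per topic.
graph = {
    "sport_news": ["match_report"],
    "match_report": ["sport_news"],
    "gallery": ["museum"],
    "museum": ["gallery"],
}
topics = {"sport": ["sport_news"], "art": ["gallery"]}
profile = {"sport": 0.7, "art": 0.3}
ranks = personalized_pagerank(graph, blend(topics, profile))
```

Because the two topic clusters in this toy graph do not link to each other, the sport pages together hold 70% of the rank and the art pages 30%, mirroring the declared profile.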
3. detailed (Hubs and Authorities algorithm)
The H&A algorithm (proper name: HITS, Hyperlink-Induced Topic Search) has two modules: one collects pages matching the pattern queried in the search engine, the other calculates the probability that a document belongs to one of the types described below:
- authorities – documents with important content from the search engine’s point of view (e.g. definitions, information about the topic, etc.),
- hubs – documents containing important links or anchors.
According to the algorithm’s logic, a valuable page (an authority) is linked from several hubs, while a good hub contains links to several authorities.
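The mutual reinforcement between hubs and authorities can be sketched as a short iteration. This is a minimal illustration that assumes the set of pages for the query has already been collected; the graph, the page names and the iteration count are hypothetical.

```python
import math


def hits(links, iterations=50):
    """links: dict mapping a page to the list of pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages linking to it.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        # Hub score: sum of the authority scores of the pages it links to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize both vectors so the scores stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


# Hypothetical result set: "portal" and "blog" link out, "wiki" and "docs" are linked to.
graph = {"portal": ["wiki", "docs"], "blog": ["wiki"], "wiki": [], "docs": []}
hub, auth = hits(graph)
print("best hub:", max(hub, key=hub.get))
print("best authority:", max(auth, key=auth.get))
```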
Internet usage data is explored to discover general patterns of users’ behavior. The results of the research are patterns of website access, which can be used to improve how a page works or the quality of service. Navigation on the web page, server structure or the ad presentation mechanism (e.g. the potentially best place for an ad) can all be improved this way.
Log exploration can be used for:
- discovering characteristics of users,
- discovering association rules and dependencies (e.g. between groups of users or between user behaviors when exploring the web),
- extrapolation (e.g. of the navigation path of a user exploring the web),
- user classification,
- analysis of the timing of web access,
- web traffic analysis, leading to the discovery of navigation paths.
The most popular approach is discovering frequent navigation paths of web users. Server logs are examined with a sequential pattern discovery algorithm, e.g. GSP or PrefixSpan, to find the most frequent ways in which users explore a website.
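As a rough illustration of what a frequent navigation path is, the sketch below counts contiguous page sequences that occur in at least a minimum number of sessions. It is deliberately much simpler than GSP or PrefixSpan (it handles only fixed-length, gap-free paths), and the sessions, the function name and the thresholds are made up for the example.

```python
from collections import Counter


def frequent_paths(sessions, length=2, min_support=2):
    """Return contiguous paths of `length` pages seen in >= min_support sessions."""
    counts = Counter()
    for session in sessions:
        seen = set()
        for i in range(len(session) - length + 1):
            seen.add(tuple(session[i:i + length]))
        counts.update(seen)  # each path counts once per session
    return {path: n for path, n in counts.items() if n >= min_support}


# Hypothetical sessions already extracted from a server log.
sessions = [
    ["/", "/products", "/cart"],
    ["/", "/products", "/about"],
    ["/blog", "/", "/products", "/cart"],
]
result = frequent_paths(sessions)
print(result)
```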
The basic difficulty of the web usage mining process is identifying user sessions, because a user may carry out several activities during a single stay on a website. Another problem is the limited amount of information in a standard usage log, which decreases the potential profit from the analysis.
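One common heuristic for the session identification problem is to start a new session whenever a visitor has been inactive for longer than a timeout. The sketch below assumes a 30-minute timeout and identifies users by IP address alone; both are assumptions, and both show why sessions reconstructed from a standard log are only approximate. The log entries are invented.

```python
def split_sessions(entries, timeout=1800):
    """entries: iterable of (user, timestamp_in_seconds, url), in any order."""
    sessions = []
    last_seen = {}  # user -> (timestamp of last request, index of open session)
    for user, ts, url in sorted(entries, key=lambda e: (e[0], e[1])):
        previous = last_seen.get(user)
        if previous is None or ts - previous[0] > timeout:
            # First request, or a long pause: open a new session.
            sessions.append((user, [url]))
            index = len(sessions) - 1
        else:
            index = previous[1]
            sessions[index][1].append(url)
        last_seen[user] = (ts, index)
    return sessions


# Hypothetical log entries: (IP address, seconds since start, requested URL).
log = [
    ("10.0.0.1", 0, "/"),
    ("10.0.0.1", 60, "/products"),
    ("10.0.0.2", 30, "/blog"),
    ("10.0.0.1", 4000, "/"),
]
sessions = split_sessions(log)
```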
The popularity of the Internet has changed the way feedback on services and products reaches their providers. The main consequence was a turn towards the individual client, because every user of a product or service who is also an Internet user can publish his sentiment worldwide. Moreover, an opinion or review is accessible to a large number of potential or actual customers worldwide, so it can work as either promotion or discouragement. Awareness of this fact makes providers and producers look for better ways to collect users’ opinions, usually to improve quality of service and relations with customers. Such opinions can appear on the global network in various forms. Comments posted directly on producers’ websites are quite popular, but so are blogs (both producers’ and independent ones) and discussion boards (where professional testers meet regular users who are just looking for a product or service that suits their simple needs). All of this obliges providers to be present and take part in the places where sentiment is published and discussed: collecting marketing information and giving feedback and help has become an important service that improves relations with customers and increases the chance of succeeding in business.
Opinions and reviews on the Internet are subject to the same rules as any other kind of content: there is a lot of information and it is difficult to find the piece most suitable for a given query. Moreover, Internet content changes constantly, so a review noticed today can suddenly be removed, with several others appearing in its place. There is also no assurance that a review is reliable, and there is always a chance of falling victim to spoiled or malicious content. However, these drawbacks do not outweigh the advantages of Internet opinions and reviews, such as getting direct feedback from the customer, and providers try to access this knowledge in every possible way.
Web mining can be a method to improve the business mechanism: customer opinion -> producer -> positive change. Using the Internet as a source of feedback, the producer obtains knowledge about the functional features of his product and an opportunity to develop important domains, such as business intelligence, knowledge management or quality control. Thanks to data exploration, the data collection mechanism is automatic and tuned to the producer’s needs. The benefit for the customer is time saved and more suitable web search results.
Opinion classification is an automatic detection mechanism that recognizes the sentiment of a customer describing his experience with a product or service. The sentiment can be positive or negative, but a more detailed approach is also possible, for example whether the product is recommended or not. In general, the problem is a part of natural language processing. Customer sentiment research starts with data gathering and preparation, followed by a three-step process: tagging fragments of text (phrases of at least two words, used for classification), detecting semantic orientation (with a certain probability) and calculating the probable sentiment. Example techniques for user sentiment classification are Support Vector Machines, Naive Bayes and EM.
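Of the techniques mentioned, Naive Bayes is the easiest to sketch in a few lines. The example below is a simplification: it scores single words instead of the two-word phrases described above, uses add-one smoothing, and is trained on four invented reviews, so it only illustrates the mechanics of calculating the probable sentiment.

```python
import math
from collections import Counter


def train(examples):
    """examples: list of (text, label). Returns per-label word and document counts."""
    word_counts = {}
    doc_counts = Counter()
    for text, label in examples:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocabulary


def classify(text, model):
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in doc_counts:
        total_words = sum(word_counts[label].values())
        # Log prior plus log likelihood of each known word (add-one smoothing).
        score = math.log(doc_counts[label] / total_docs)
        for word in text.lower().split():
            if word in vocabulary:
                score += math.log(
                    (word_counts[label][word] + 1) / (total_words + len(vocabulary))
                )
        scores[label] = score
    return max(scores, key=scores.get)


# Invented training reviews, for illustration only.
reviews = [
    ("great battery and great screen", "positive"),
    ("works well and looks good", "positive"),
    ("poor battery and broken screen", "negative"),
    ("stopped working bad support", "negative"),
]
model = train(reviews)
print(classify("great screen", model))
print(classify("broken battery bad", model))
```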
Customer review mining matters not only because reviews contain users’ sentiment, but also because they contain details about product usage or service features and their importance for the final evaluation. This makes review research a more difficult task, but its effects can meaningfully improve the quality of a product or service. Thanks to a detailed description, the producer or provider can place himself “in the customer’s shoes” and, as a result of this feedback, improve. The process goes in steps: extracting the described features, extracting the customer’s sentiment regarding each described feature, and summarizing both.
In conclusion, automating customer sentiment and opinion research with web mining mechanisms is potentially a very profitable business. Describing product usage or simply summarizing a transaction is a very popular way of giving customer feedback. Widespread products and services have collected many descriptions on the Internet, which makes checking all the opinions time- and resource-consuming. An automatic mechanism to collect and summarize such feedback would be very useful for producers, providers and customers alike.
Knowledge synthesis is the process of building a knowledge base, from initially separate elements to a system. Following the paradigm of information search on the Internet, a search engine given keywords as input should generate a sorted list of the pages that match the keywords better than other pages. The user then chooses the most suitable result. This is a useful and effective method when the user looks for specific information or a definition, but less so when it comes to open questions or more complicated queries.
Clustering of web search results improved the Internet search process. Going a step further, the question arises whether it is possible to obtain full information from search engines, or at least an aggregate providing a consistent picture of the researched topic.
In traditional knowledge synthesis, the result of the work is usually presented as a systematic review or meta-analysis. The issue is described on the basis of scientific evidence and according to a methodology: the problem has to be defined, as do the sources (literature); the data has to be understood and extracted, then examined. The results have to be summarized, critically evaluated and finally concluded. Such a review is an aggregation of knowledge. A more detailed synthesis draws not only on scientific literature, but also on conferences, academic courses, lecture notes, etc. In some domains an important part of the knowledge doesn’t exist in written form, which seriously complicates the research process. In Internet knowledge synthesis, the situation is similar with the “deep web” or “dark web”. Automatic knowledge synthesis sometimes requires combining several domains, such as artificial intelligence, semantic networks and data mining, while still keeping a scientific and analytical approach.
Automatic knowledge synthesis research provides:
- an automatically generated answer to a query on the requested topic, in a consistent and comprehensive form, as a summary of the most important publications on the Internet or in another source of data,
- the possibility of getting an answer to a complex question asked directly in the web browser,
- automation of several parts of live processes, where decisions would be made automatically by the machine on the basis of knowledge extracted from a systematic review generated from the Web or from databases.
Recently I decided to set goals for the development of the WebContentMining.com blog. Here are the directions I want to follow.
There are three main categories I want to develop:
- Web content mining core category
- Hunting content creators project
- Around web mining and data mining topics.
The server location has also changed and I switched to the default theme, only temporarily.
Rgds.
Web content mining is the part of the data mining domain that is closest to the classic definition of DM. The aspects of web content mining are related to similar domains in classic data mining:
- automatic content extraction from web pages
- integration of the information
- opinion and reviews extraction
- knowledge synthesis
- noise detection and segmentation
Briefly, the web content mining aspects listed above are solutions to more or less complicated problems connected to the automation of web usage, which lead to improvements in several aspects of daily Internet life, both technical and non-technical.
The Internet is probably the world’s biggest database. Moreover, the data is available using easily accessible techniques. Often it is important and detailed data that lets people achieve goals or can be used in various realms. Data is held in various forms: text, multimedia, databases. Web pages follow the HTML standard (or another member of the ML family), which gives them a kind of structural form, but not one sufficient to use them easily in data mining. A typical website contains, in addition to the main content and links, various extras like ads or navigation items. It is also widely known that most of the data on the Internet is redundant: a lot of information appears on different sites in more or less similar form.
Deep web (hidden web, invisible web, invisible Internet) refers to the lower level of the global network. It doesn’t appear in search engine results, and search tools don’t index or list this area. It is said that a great part of the global web belongs to the deep web and stays hidden until a specific enquiry, targeted at the right interface, triggers the content to appear. This statement also reveals some of the barriers that keep the data hidden, such as a specific interface, the need for specific knowledge about the data, high security (passwords) or simply a lack of linkage. It is also possible to block a range of IP addresses, to guard interfaces (e.g. using CAPTCHA) or just to keep data in a non-standard format. The reasons mentioned above form a natural barrier for crawlers and web robots, keeping some part of the web out of the linked web.
Looking for a definition of Internet exploration, the easiest way is to treat it as the part of data mining where web resources are explored. It is commonly divided into three types:
- web content mining is the closest one to “classic” data mining, as WCM mostly operates on text and text is the most common way to put information on the Internet,
- web linkage mining aims to use the nature of the Internet, its connection structure, as it is a collection of documents connected by links,
- web usage mining looks for useful patterns in logs and documents containing the history of users’ activity.
These three are also what distinguishes web mining from data mining, because the subject of the research is not only data, but structure and flow as well. Additionally, web mining takes data “as it is” (and the imagination of Internet content creators is wide when it comes to inventing new forms), while data mining operates rather on structured data.
Finally, the general application of web mining goes beyond tweaking websites or data analysis. It could be used as a tool for upgrading tasks, projects and processes in companies and institutions, or as a method providing aid in solving technical or analytical problems. Web mining is currently used in ranking web pages, electronic trade, Internet advertising, reliability evaluation, recommendation systems, personalization of web services and more.
PageRank, Larry Page’s algorithm, is probably the most popular and well-known use of web linkage mining. This non-contextual approach is simply a popularity contest, where the importance of a ‘vote’ is measured by the importance of the originating site itself. The better the site linking to my page is, the bigger the gain in rating I get. Looking inside, the importance of a site is measured by the probability of visiting the site; the way to get the actual numbers is Google’s secret, obviously (I bet naive Bayes is used somewhere there;).
What about reality? PageRank is vulnerable to spamming and a lot of people cheat PR for a living. In short, a farm of sites (servers) is created and its coordinated work pulls the target site up in the ranking. How to deal with ambiguous keywords is also a language problem. Then there is a technical problem, solved more or less well by a taxation mechanism, with pages that have no further linkage (PR value thieves, as the PR popularity flows there and stays forever). Random jumping also helps with dead-end sites. Prediction mechanisms are also worth mentioning, as is using local resources to save some time and computing power, e.g. processing data for a whole domain or server.
There are some modifications of the PageRank algorithm. An interesting one is topic-sensitive PageRank by T. Haveliwala. Contexts were added (topic-specific groups, like those in DMOZ) and the idea is to keep results close to a previously specified topic. The big advantage of this approach is that personalization of the search process can be easily applied (a user-specific popularity ranking instead of the general one).
AdaBoost (Adaptive Boosting) is a meta-algorithm used to improve classification results. The concept is to make a lot of weak classifiers cooperate to boost results. Adaptive means in this case that detecting a wrong classification makes the algorithm do more work on it (by changing the weights so that more effort goes where it failed).
AdaBoost is sensitive to noisy data and outliers.
[http://www.cs.princeton.edu/~schapire/boost.html; Wu et al., "Top 10 algorithms in data mining", Springer 2008]
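The reweighting idea can be sketched with one-dimensional threshold classifiers ("stumps") as the weak learners. This is a minimal illustration of discrete AdaBoost for labels +1 and -1. The data set, the number of rounds and the function names are assumptions chosen so that no single stump can classify everything correctly.

```python
import math


def best_stump(xs, ys, weights):
    """Pick the threshold and polarity with the lowest weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            preds = [polarity if x >= threshold else -polarity for x in xs]
            error = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best


def adaboost(xs, ys, rounds=3):
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        error, threshold, polarity = best_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # avoid division by zero
        alpha = 0.5 * math.log((1 - error) / error)  # vote of this weak classifier
        ensemble.append((alpha, threshold, polarity))
        # Misclassified examples gain weight, correctly classified ones lose it.
        for i, x in enumerate(xs):
            pred = polarity if x >= threshold else -polarity
            weights[i] *= math.exp(-alpha * ys[i] * pred)
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    score = sum(a * (p if x >= t else -p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1


# Toy data: the labels change sign three times, so one threshold is not enough.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble = adaboost(xs, ys, rounds=3)
print([predict(ensemble, x) for x in xs])
```

The best single stump still misclassifies three of the ten points, while the weighted vote of three stumps fits this toy set completely.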
Classification and regression trees
Decision trees are one of the classification methods: structures consisting of nodes connected by branches. Unlike in nature, the root appears at the top of the structure and branches go down, ending with leaves or leading to another node.
The main goal of the algorithm is to select attributes (both which ones and in what order matter) to obtain the highest confidence level. Decision trees fall under the supervised learning category.
It is possible to employ classification trees when:
- a training set with a defined target variable exists,
- the training set provides the algorithm with a representative group of records (enough examples),
- the target variable is discrete.
Bibl. [Daniel Larose, "Odkrywanie wiedzy z danych", PWN 2006, 109, 111]
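Attribute selection can be illustrated with the Gini impurity measure used by CART: the attribute whose split leaves the purest groups is tested first. The sketch below is a simplification that splits on every distinct attribute value and picks only the root attribute; the records and attribute names are invented.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0.0 for a pure group, higher for mixed groups."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def split_impurity(rows, attribute, target):
    """Weighted Gini impurity after splitting rows by the attribute's values."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    return sum(len(g) / len(rows) * gini(g) for g in groups.values())


def best_attribute(rows, attributes, target):
    """The attribute whose split gives the lowest impurity goes to the root."""
    return min(attributes, key=lambda a: split_impurity(rows, a, target))


# Invented training set with a discrete target variable "risk".
records = [
    {"income": "high", "owns_home": "yes", "risk": "low"},
    {"income": "high", "owns_home": "no", "risk": "low"},
    {"income": "low", "owns_home": "yes", "risk": "high"},
    {"income": "low", "owns_home": "no", "risk": "high"},
]
root = best_attribute(records, ["income", "owns_home"], "risk")
print(root)
```

Here "income" separates the two risk classes perfectly, while "owns_home" leaves both groups evenly mixed, so "income" is chosen for the root node.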