web mining, web content mining

Web mining sources – kdnuggets.com

Posted on the June 10th, 2011 under business,data mining,web mining by

Today I’d like to introduce one good online source of data mining and web mining knowledge – kdnuggets.com. As they advertise, it is a professional newsletter with over 12k subscribers. KD stands for Knowledge Discovery, a field closely tied to data mining.

The site offers a lot of useful information from the industry: news, modern data mining software, datasets and plenty of educational materials. There is even a board with job offers from the knowledge management business. I strongly recommend visiting the KDnuggets website as a great online resource on web mining and data mining.

hunting content creators (4) – data understanding and preparation

Posted on the February 28th, 2011 under business,data mining,hunting content creators,social networks,web mining by

Here – part 3 – you will find the previous part of the Hunting content creators series.

In the initial phase of its operation, a social networking site has to be provided with fuel – content. It is the same as with a car engine – to work properly it needs a regular delivery of good quality fuel. Usually, users are treated equally when it comes to content creation. Sometimes it is possible to become a moderator or admin, but normally a user is the first and final stage of the ‘site career’.

Tracking users who are content creators and enabling them to work more efficiently and effectively would let the site develop much quicker, which is good for both the owner and the users. The main goal of the research is to check whether web mining algorithms are useful in extracting knowledge about users – potential content creators on social networking sites. The research has been done using WEKA ver. 3.6.2 in a Linux Ubuntu 8.10 (2.6.27-15-generic kernel) environment, on a database from an internet board.

The researched internet board is built on PhpBB by Przemo technology and has 3200 users and 118,000 messages (roughly 120k). Data is kept in a MySQL database, in a table named “users”, containing information about users.

20 parameters describing users have been chosen:

  • user ID (ascending by registration date),
  • account status (activated / not activated / deactivated, i.e. used but no longer active for some reason),
  • username,
  • session time,
  • last visit,
  • registration date,
  • number of posts,
  • interface language (Polish, English),
  • avatar type (avatar from board, external source, none),
  • e-mail,
  • user webpage,
  • where the user is from (normally a city/town, but the creativity of the users is infinite),
  • signature,
  • instant communication number,
  • interests,
  • birth date,
  • sex,
  • total time spent on the forum,
  • number of visits.

Some parameters were removed from the research:

  • user level,
  • private messages details,
  • other functions of forum script,
  • additional profile fields
  • data about restrictions, etc.

During preparation, the data was exported to CSV format, which is supported by Weka. Some difficulties were encountered during the process:

  • the CSV had to be edited using an external tool,
  • EOL characters were not accepted in the signature / from fields,
  • the ‘ character was not accepted in the signature / from fields.
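
The cleanup described above can be sketched in a few lines of Python. This is only an illustration, not the tool used in the research: the column names and the sample row are made up, and the choice to replace line breaks with spaces and to drop apostrophes entirely is an assumption.

```python
import csv
import io

def clean_field(value: str) -> str:
    """Collapse line breaks to spaces and drop apostrophes from one field."""
    value = value.replace("\r\n", " ").replace("\n", " ").replace("\r", " ")
    for quote in ("'", "\u2018", "\u2019"):  # straight and curly apostrophes
        value = value.replace(quote, "")
    return value.strip()

def clean_csv(src, dst) -> int:
    """Copy CSV rows from src to dst, cleaning every field. Returns the row count."""
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    rows = 0
    for row in reader:
        writer.writerow([clean_field(cell) for cell in row])
        rows += 1
    return rows

# Hypothetical export: the signature field holds an apostrophe and a line break.
raw = 'user_id,username,signature,from\n1,anna,"It\'s me\nAnna",Warsaw\n'
out = io.StringIO()
row_count = clean_csv(io.StringIO(raw, newline=""), out)
cleaned = out.getvalue()
print(cleaned)
```

With real files, the same function can be called on two handles opened with `newline=""`, one for reading the export and one for writing the cleaned copy.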

Data understanding and preparation is a very important process in business data mining. In an iterative management model it has to be repeated multiple times to get the best results.

Web linkage mining

Posted on the January 19th, 2011 under definitions,general,web mining by

Web linkage mining – one of the web exploration types, the exploration of interconnections in the web – is simply looking for useful information within connections between objects on the Internet. There are various kinds of connections, from locators (e.g. URLs), through links between sites, to connections between tables in databases. The main task of web linkage mining is discovering and exploiting information that can be used to understand the data better. Usually the label of a link is connected to both the origin (source page) and the destination (the address the link is pointing to).


The basic practical use of web linkage mining is rating search results (rating websites). The best-known rating mechanisms are PageRank (by Google), contextual PageRank and Hubs & Authorities.

Rating an object by the number of links pointing to it isn’t a new idea. There is a document stating that in 1970 there was an attempt to publish a rating of scientific articles based on the number of citations in other documents.

There are three approaches:

1. non-contextual (e.g. PageRank)

PageRank is the textbook example. It works as a “popularity contest”; the difference compared to a plain link count is that the quality of a backlink is measured by the quality of the source page itself. The better the linking page, the more it adds to the rating. Inside the algorithm, the importance of a page is defined by the probability of visiting the page – but the full specification of the algorithm is not revealed by Google.

Regarding disadvantages – the non-contextual algorithm is vulnerable to spamming. Server farms are an easy mechanism for coordinating efforts that push pages to the top of the rating in an artificial way. A good solution is also still awaited for keywords with several meanings. Another serious problem is pages that collect rating – those with no further links (this one is solved by a special “page tax”).

On the other hand, the non-contextual PageRank algorithm has a prediction mechanism that accelerates execution and makes use of local resources, which reduces the number of calculations needed to generate the rating.
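
The idea that a page’s importance is the probability of visiting it can be illustrated with a small power-iteration sketch. It follows the published textbook formulation, not Google’s production system. The damping factor of 0.85, the toy graph and the way rank held by pages without outgoing links is spread evenly over all pages are assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank sitting on pages with no outgoing links is spread over all pages,
        # so such pages cannot accumulate rating forever.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for page, targets in links.items():
            if targets:
                # A page passes its rank on in equal shares to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical four-page web: C is linked from A, B and D.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(graph)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph A ends up rated well above B even though each has a single incoming link, because A’s link comes from the highly rated page C. This is the “quality of the source page” effect described above.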

2. contextual (e.g. topic-specific PageRank)

Topic-specific PageRank is a contextual method of rating web search results, proposed by T. Haveliwala. It is a modification of the original PageRank algorithm – thematic categories (contexts) are added, e.g. from the open directory dmoz, and the algorithm is taught to give priority to pages that are close to the source documents in the directory. The main idea is to keep results close to the given topic.

The advantage of the contextual approach is the – at least theoretical – possibility of personalizing the rating. The requirements of a user who described his features and priorities before the search could be met better. For example, a declared 70% interest in sport and 30% interest in art, combined with contextual search in a directory with consistent content segregation, makes it possible to generate results that match the user profile.
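
A rough sketch of the 70% sport / 30% art example: one rank vector is computed per topic, with the random jump restricted to that topic’s source pages, and the two vectors are then blended with the user’s weights. The page names, the seed pages per topic and the damping factor are all hypothetical, and this is a simplified illustration of the idea rather than the exact published procedure.

```python
def topic_pagerank(links, seeds, damping=0.85, iterations=50):
    """PageRank whose random jump lands only on the topic's seed pages."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    total = sum(seeds.get(p, 0.0) for p in pages)
    jump = {p: seeds.get(p, 0.0) / total for p in pages}
    rank = dict(jump)
    for _ in range(iterations):
        dangling = sum(rank[p] for p in pages if not links.get(p))
        # Both the random jump and the rank of dead-end pages return to the seeds.
        new_rank = {p: (1.0 - damping + damping * dangling) * jump[p] for p in pages}
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank

def blend(rank_a, rank_b, weight_a):
    """Weighted mix of two topic rank vectors, e.g. 0.7 sport + 0.3 art."""
    return {p: weight_a * rank_a[p] + (1.0 - weight_a) * rank_b[p] for p in rank_a}

# Hypothetical link graph and directory seeds.
links = {
    "sport_hub": ["match_report", "club_page"],
    "match_report": ["club_page"],
    "club_page": ["sport_hub"],
    "art_hub": ["gallery"],
    "gallery": ["art_hub", "club_page"],
}
sport_rank = topic_pagerank(links, {"sport_hub": 1.0})
art_rank = topic_pagerank(links, {"art_hub": 1.0})
profile_rank = blend(sport_rank, art_rank, weight_a=0.7)
for page in sorted(profile_rank, key=profile_rank.get, reverse=True):
    print(page, round(profile_rank[page], 3))
```

The per-topic vectors can be computed once and stored, so only the cheap blending step depends on the individual user profile.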

3. detailed (Hubs and authorities algorithm)

The H&A algorithm (proper name: HITS – Hyperlink-Induced Topic Search) has two modules: one responsible for collecting pages that match the pattern queried in the search engine, and one calculating the probability of classifying a document into the types described below:

  • authorities – documents with important content from the search engine’s point of view (e.g. definitions, information about the topic, etc.),
  • hubs – documents containing important links or anchors.

According to the algorithm’s logic, a valuable page – an authority – is linked from several hubs, while a good hub contains links to several authorities.
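
The mutual reinforcement between hubs and authorities can be sketched as a short iteration. The graph below is a hypothetical result set for one query. The number of iterations and the normalization by Euclidean length are common choices and are assumed here, not taken from the text above.

```python
import math

def hits(links, iterations=50):
    """Return (hub, authority) scores for a small link graph."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    hub = {p: 1.0 for p in pages}
    authority = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A page is a good authority if good hubs link to it.
        authority = {p: 0.0 for p in pages}
        for page, targets in links.items():
            for target in targets:
                authority[target] += hub[page]
        # A page is a good hub if it links to good authorities.
        hub = {p: sum(authority[t] for t in links.get(p, [])) for p in pages}
        # Normalize so the scores stay bounded between iterations.
        for scores in (authority, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, authority

# Hypothetical pages collected for one query.
links = {
    "portal": ["definition", "tutorial"],
    "directory": ["definition", "tutorial", "forum"],
    "forum": ["definition"],
}
hub, authority = hits(links)
print("best hub:", max(hub, key=hub.get))
print("best authority:", max(authority, key=authority.get))
```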

hunting content creators (3) – business point

Posted on the December 2nd, 2010 under business,hunting content creators,projects,social networks,web mining by

This post is the continuation of the Hunting content creators thread. You will find the previous part here.

Analysis of the content generated by users of social networking sites is a process aimed at improvement. Past data makes present solutions better and keeps the development of the Internet progressing: more users, more content, and business matters such as more profit, happy customers or a synergy effect.

Data generated by members of the Internet society is a valuable source of knowledge about several aspects of their network activity. Improving the organization of network reality by using logical rules is intuitive, but in business matters it is vital to have an advantage over competitors. An economy based on information demands constant improvement in the production process, management and marketing. This has been happening for decades, so it doesn’t leave much room for a serious advantage, and companies explore various areas to look for one. Data mining is used to discover such knowledge and to apply it in the process of development and gaining an advantage.

Problem identification is the first step in the data mining process. Because data exploration is a tool not only for improving existing solutions but also for creating new ones, the set of initial conditions is much wider.

Improving the functioning of social networking sites:

  • faster content creation

It is proven that only a small part of all social networking site users actively contributes to and participates in content creation. The majority are only passive consumers. It is very important to deliver “fresh” content, and taking care of that small group of creators seems very natural and positive. The question is how – in an environment as dynamic as a social network – to detect content creators and how to “feed” them. On the other hand, the mechanism created would also be useful for detecting potentially negative users such as spammers, trolls or robots.

  • content of higher quality

When the nature of the Internet in the Web 2.0 era was explained, where every user is a potential content creator and contributor, it was also mentioned that this has some negative aspects. Content duplication, redundant information, spam, inappropriate content and improper categorization are consequences of the unrestricted activity of Internet users. Data mining algorithms can be used to search for and integrate inappropriate or redundant content, to correct categorization, to look for errors or even to prevent some user behavior – for example by detecting malicious users. Moreover, a potential advantage is knowledge extraction: collecting know-how around some topic. Internet boards are places where people share their experience and practice, which helps in solving problems and creating new ways to approach sometimes complicated problems.

  • conveniences for users

Some of the conveniences that web mining techniques give users are recommendations of topics or posts according to their preferences (defined previously in the profile or detected on the spot from history or visited content). The same applies to contacts or groups, which are matched thanks to feature extrapolation – and not only basic features, such as location or similar activities, but also more complicated ones that result from the algorithm.

Business points

  • contextual advertising and dedicated business offer – knowledge synthesis to generate content (knowledge) aggregates

Knowledge extraction is a tool to enhance awareness of users’ preferences and behavior – the system isn’t limited to the information that the user decides to share. Content and traffic analysis allows the system to discover preferences, interests and many more – sometimes surprising – features of community members. All of the above could serve to identify needs and to make the marketing process more effective.

  • opinion and review extraction

As mentioned above, it is worthwhile to extract knowledge from internet boards or other places where discussion arises, to gather know-how in various domains. On social networking sites it is also possible to gather opinions regarding products or services, which are valuable from the producer’s point of view. Knowledge gathered using extrapolation could be the basis for building an advantage over business competitors. The same applies to blogs – concerning both posts and comments.

In summary – the business issues of Hunting content creators revolve around detecting the most contributing users, to increase the tempo of network development. The experimental site is an internet board, where the basic indicator of a user’s activity is the number of published posts. A big number of visits with a small number of posts is a sign that the user is not a contributor to the community, but rather a consumer. Detecting contributors and providing them with an improved environment results in a chain reaction: more content -> more users -> faster community growth -> more content… To accomplish these goals, a couple of web mining algorithms were tested. More in the next part.
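
Before any algorithm is involved, the “many visits, few posts” indicator can be expressed as a simple posts-per-visit ratio. The sketch below is only a baseline illustration: the usernames, the numbers and both thresholds are invented, and this is not the method tested in the research.

```python
from dataclasses import dataclass

@dataclass
class User:
    username: str
    posts: int
    visits: int

def posts_per_visit(user: User) -> float:
    """Average number of posts written per visit (0 for users who never visited)."""
    return user.posts / user.visits if user.visits else 0.0

def split_users(users, min_ratio=0.5, min_posts=10):
    """Label users as creators or consumers using two assumed thresholds."""
    creators, consumers = [], []
    for user in users:
        is_creator = user.posts >= min_posts and posts_per_visit(user) >= min_ratio
        (creators if is_creator else consumers).append(user.username)
    return creators, consumers

# Hypothetical rows from the "users" table.
users = [
    User("anna", posts=420, visits=300),   # writes more than once per visit
    User("marek", posts=3, visits=950),    # visits a lot, almost never posts
    User("ola", posts=40, visits=60),
    User("new_guy", posts=0, visits=0),
]
creators, consumers = split_users(users)
print("creators:", creators)
print("consumers:", consumers)
```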

Web usage mining

Posted on the November 16th, 2010 under definitions,general,web mining by

Exploration of internet usage data is carried out to discover general patterns of users’ behavior. The results of the research are patterns of website access, which could be used to improve how the page functions or the quality of service. Navigation on the webpage, server structure or the ad presentation mechanism (e.g. the potentially best place for an ad) can all be improved this way.

Log exploration could be used for:

  • discovering characteristics of users,
  • discovering association rules and dependencies (e.g. between groups of users or users’ behaviors when exploring the web),
  • extrapolation (e.g. of the navigation path of a user exploring the web),
  • user classification,
  • analysis of the timing of web access,
  • web traffic analysis, leading to the discovery of navigation paths.

The most popular approach is discovering the navigation paths frequently taken by web users. Server logs are examined and, using a sequential pattern discovery algorithm (e.g. GSP, PrefixSpan), the most frequent ways in which users explore the website are found.
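
As a very small stand-in for such algorithms, the sketch below counts contiguous page sequences of a fixed length and keeps those that occur in enough sessions. It is not GSP or PrefixSpan (those also handle gaps and grow patterns of varying length). The sessions, the path length and the support threshold are made up for illustration.

```python
from collections import Counter

def frequent_paths(sessions, length=2, min_support=2):
    """Count contiguous page sequences; support = number of sessions containing them."""
    support = Counter()
    for pages in sessions:
        # A set, so a path repeated inside one session is counted only once.
        seen = {tuple(pages[i:i + length]) for i in range(len(pages) - length + 1)}
        support.update(seen)
    return {path: count for path, count in support.items() if count >= min_support}

# Hypothetical sessions, each an ordered list of visited URLs.
sessions = [
    ["/", "/forum", "/topic/1"],
    ["/", "/forum", "/profile"],
    ["/search", "/topic/1"],
    ["/", "/forum", "/topic/1"],
]
paths = frequent_paths(sessions)
for path, count in sorted(paths.items(), key=lambda item: -item[1]):
    print(" -> ".join(path), count)
```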

The basic difficulty of the web usage mining process is identifying the user session, because a user may carry out several activities during a single stay on a website. Another problem is the limited amount of information in a standard usage log, which decreases the potential profit from the analysis.
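
One common heuristic for session identification is to cut a user’s request stream whenever the gap between two requests exceeds a timeout. The sketch below assumes a 30-minute timeout and identifies users by IP address. Both are assumptions, and the log entries are invented.

```python
from collections import defaultdict

def sessionize(requests, timeout=1800):
    """Group (user, unix_timestamp, url) requests into sessions split by inactivity."""
    by_user = defaultdict(list)
    for user, timestamp, url in requests:
        by_user[user].append((timestamp, url))
    sessions = []
    for user in sorted(by_user):
        events = sorted(by_user[user])
        current = [events[0][1]]
        for (previous_ts, _), (timestamp, url) in zip(events, events[1:]):
            if timestamp - previous_ts > timeout:
                # Too long a pause: close the session and start a new one.
                sessions.append((user, current))
                current = []
            current.append(url)
        sessions.append((user, current))
    return sessions

# Hypothetical log entries: (IP address, seconds since start, requested URL).
log = [
    ("10.0.0.1", 0, "/"),
    ("10.0.0.1", 60, "/forum"),
    ("10.0.0.2", 100, "/"),
    ("10.0.0.1", 4000, "/topic/1"),
]
user_sessions = sessionize(log)
for user, pages in user_sessions:
    print(user, pages)
```

Identifying users by IP address alone is fragile (shared addresses, proxies), which is part of the “limited information” problem mentioned above.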

Sentiment analysis – key to opinion and overview extraction

Posted on the November 2nd, 2010 under business,data mining,general,web content mining,web mining by

The popularity of the Internet has influenced the ways of giving feedback on products and services to their providers. The main consequence was a turn towards the individual client, because every single product or service user who is also an Internet user can publish his sentiment worldwide. Moreover, an opinion or overview is accessible to a number of potential or actual customers worldwide, which could work as either promotion or discouragement. Awareness of this fact makes providers and producers look for better ways to collect users’ opinions, usually to improve quality of service and relations with customers. Such opinions can appear in the global network in various forms. Comments directly on producers’ websites are quite popular, but so are blogs (both producers’ and independent) and discussion boards (where professional testers meet regular users, who just seek a product or service suitable for their simple needs). All of this obliges providers to be present and participate in places where sentiment is published and discussed – collecting marketing information and giving feedback and help has become an important service, improving relations with customers and increasing the chance of succeeding in business.

Opinions and overviews on the Internet are subject to the same rules as other kinds of content – there is a lot of information and it is difficult to find the most suitable for a given query; moreover, Internet content changes constantly – an overview noticed today can suddenly be removed, with several others found instead. There is also no assurance that an overview is reliable, and there is always a chance of falling victim to spoiled or malicious content. However, the drawbacks do not outweigh the advantages of using Internet opinions and overviews, for example getting direct feedback from customers, and providers try to access this knowledge in every possible way.

Web mining could be a method to improve the business mechanism: customer opinion -> producer -> positive change. Using the Internet as a source of feedback, the producer obtains knowledge about the functional features of his product and an opportunity to develop important domains, such as business intelligence, knowledge management or quality control. Thanks to data exploration, the data collection mechanism is automatic and tuned to the producer’s needs. The customer’s gain is time savings and more suitable web search results.

Opinion classification is an automatic detection mechanism that recognizes the sentiment of a customer describing experience with a product or service. Sentiment can be either positive or negative, but a more detailed approach is also possible – whether the product is recommended or not recommended. In general the problem is part of natural language processing. Customer sentiment research starts with data gathering and preparation, followed by a three-step process: tagging fragments of text (phrases of at least two words, used for classification), detecting semantic orientation (with a certain probability) and calculating the probable sentiment. Example techniques for user sentiment classification are Support Vector Machines, Naive Bayes or EM.
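
To make the Naive Bayes option concrete, here is a minimal word-level classifier with add-one smoothing. It is a simplified sketch: it works on single words rather than the two-word phrases mentioned above, the tokenizer is naive, and the handful of training reviews is invented.

```python
import math
from collections import Counter

class NaiveBayesSentiment:
    """Multinomial Naive Bayes over single words, with add-one smoothing."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of words seen in that class
        self.doc_counts = Counter()  # label -> number of training texts
        self.vocab = set()

    @staticmethod
    def tokenize(text):
        words = (w.strip(".,!?;:").lower() for w in text.split())
        return [w for w in words if w]

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            words = self.tokenize(text)
            self.word_counts.setdefault(label, Counter()).update(words)
            self.doc_counts[label] += 1
            self.vocab.update(words)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        words = [w for w in self.tokenize(text) if w in self.vocab]
        best_label, best_score = None, -math.inf
        for label in sorted(self.doc_counts):
            counts = self.word_counts[label]
            denominator = sum(counts.values()) + len(self.vocab)
            # log P(label) + sum of log P(word | label)
            score = math.log(self.doc_counts[label] / total_docs)
            score += sum(math.log((counts[w] + 1) / denominator) for w in words)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Tiny hypothetical training set of product reviews.
texts = [
    "great phone, battery lasts long",
    "excellent screen and great camera",
    "I recommend this product",
    "terrible battery, broke after a week",
    "poor screen, do not recommend",
    "bad product, waste of money",
]
labels = ["positive"] * 3 + ["negative"] * 3
model = NaiveBayesSentiment().fit(texts, labels)
print(model.predict("great camera, I recommend it"))
print(model.predict("terrible screen, waste of money"))
```

A word-level model like this cannot tell “recommend” from “do not recommend”, which is one reason for tagging phrases of at least two words.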

Customer overview mining matters not only because an overview contains the user’s sentiment, but also because it contains details about product usage or service features and their importance to the final evaluation. Because of this, overview research is a more difficult task, but its effects could meaningfully improve the quality of a product or service. Thanks to the detailed description, the producer or provider can place himself “in the customer’s shoes” and – as a result of this feedback – improve. The process goes in steps: extracting the described features, extracting the customer’s sentiment regarding each described feature, and summing up the previous two.

In conclusion, automating customer sentiment and opinion research with web mining mechanisms is potentially a very profitable business issue. A description of product usage or a simple summary of a transaction is a very popular way of giving customer feedback. Widespread products and services have collected many descriptions on the Internet, which makes checking all the opinions time and resource consuming. An automatic mechanism to collect and summarize such feedback would be very useful for producers, providers and customers alike.

Knowledge synthesis

Posted on the October 15th, 2010 under business,definitions,general,projects,web mining by

Knowledge synthesis is the process of building a knowledge database, from initially separate elements to a system. Following the paradigm of Internet information search methods, a search engine given keywords as input should generate a sorted aggregation of the pages that match the keywords better than other pages. Then the most suitable result is chosen by the user. It is a useful and effective method when the user looks for specific information or a definition, but it falls short when it comes to open questions or more complicated queries.

Implementing clustering of web search results improved the Internet search process. Going a step further, the question appears whether it is possible to obtain full information from search engines, or at least an aggregate providing a consistent picture of the researched topic.

In traditional knowledge synthesis, the effect of the work is usually presented as a systematic overview or meta-analysis. The description of the issue is based on scientific evidence and follows a methodology: the problem has to be defined as well as the sources (literature), and the data has to be understood, extracted and then researched. Results have to be summed up, critically evaluated and finally concluded. Such an overview is a knowledge aggregation. A more detailed synthesis derives not only from scientific literature, but also from conferences, academic courses or scripts, etc. In some domains, an important part of knowledge doesn’t exist in written form, which is a serious complication in the research process. In Internet knowledge synthesis, the situation is similar with the “deep web” or “dark web”. Automatic knowledge synthesis sometimes requires connecting several domains, such as artificial intelligence, semantic networks and data mining, while still keeping a scientific and analytical approach.

Automatic knowledge synthesis research provides:

  • an automatically generated result for a query on the demanded topic, in a consistent and comprehensive form, as a summary of the most important publications on the Internet or in another source of data,
  • the possibility of getting an answer to a complex question, asked directly in the web browser,
  • automation of several parts of live processes, where decisions would be made automatically by the machine, on the basis of knowledge extracted from a systematic overview generated from the Web or databases.

Sustainable development

Posted on the September 30th, 2010 under general,projects by

Recently I decided to set goals for the development of the WebContentMining.com blog. Here are the directions I want to follow.

There are three main categories I want to develop:

  1. Web content mining core category
  2. Hunting content creators project
  3. Around web mining and data mining topics.

The server location has also changed and I switched to the default theme, only temporarily.

Rgds.

web content mining – introduction

Posted on the August 3rd, 2010 under definitions,general,web mining by

Web content mining is the part of the data mining domain that is closest to the classic definition of DM. Its aspects are related to similar domains in classic data mining:

  • automatic content extraction from web pages
  • integration of the information
  • opinion and reviews extraction
  • knowledge synthesis
  • noise detection and segmentation

Briefly, the web content mining tasks listed above are solutions to more or less complicated problems connected with the automation of web usage, which lead to improvement in several aspects of daily Internet life, in both technical and non-technical matters.

web mining – what do we research?

Posted on the May 30th, 2010 under general,web mining by

The Internet is probably the world’s biggest database. Moreover, the data is available using easily accessible techniques. Often it is important and detailed data that lets people achieve goals or can be used in various realms. Data is held in various forms: text, multimedia, databases. Web pages follow the HTML standard (or another member of the ML family), which gives them a kind of structural form, but not one sufficient to easily use them in data mining. A typical website contains, in addition to the main content and links, various extras like ads or navigation items. It is also widely known that most of the data on the Internet is redundant – a lot of information appears on different sites, in a more or less similar form.

Deep web (hidden web, invisible web, invisible Internet) refers to the lower level of the global network. It doesn’t appear in search engine results, and search tools don’t index or list this area. It is said that a great part of the global web belongs to the deep web and stays hidden until a specific enquiry, targeted at the right interface, triggers the content to appear. This sentence also reveals some of the barriers that keep the data hidden, like a specific interface, a requirement to have specific knowledge about the data, high security (passwords) or simply a lack of linkage. It is also possible to block a range of IP addresses, to block interfaces (e.g. using CAPTCHA) or just to keep data in a non-standard format. The reasons mentioned above are a natural barrier for crawlers and web robots, keeping some part of the web out of the linked web.

Looking for a definition of Internet exploration, the easiest way is to describe it as the part of data mining in which web resources are explored. It is commonly divided into three:

  1. web content mining is the closest one to “classic” data mining, as WCM mostly operates on text, and text is the most common way to put information on the Internet,
  2. web linkage mining aims to use the nature of the Internet – its connection structure – as it is a bunch of documents connected with links,
  3. web usage mining looks for useful patterns in logs and documents containing the history of users’ activity.

These three are also the factors that distinguish web mining from data mining, because the subject of the research is not only data, but structure and flow as well. Additionally, web mining takes data “as it is” – and the imagination of internet content creators is wide when it comes to inventing new forms – while data mining operates rather on structured data.

Finally – the general application of web mining goes beyond tweaking websites or data analysis. It could be used as a tool for upgrading tasks, projects and processes in companies and institutions, or as a method providing aid while solving technical or analytical problems. Web mining is currently used in ranking web pages, electronic trade, internet advertising, reliability evaluation, recommendation systems, personalization of web services and more.