Archive for the ‘business’ Category

Web mining sources – kdnuggets.com

Posted on the June 10th, 2011 under business,data mining,web mining by

Today I’d like to introduce a good online source of data mining and web mining knowledge – kdnuggets.com. As they advertise, it is a professional newsletter with over 12k subscribers. KD stands for Knowledge Discovery, which goes hand in hand with data mining.

A lot of useful information from the industry can be found there: news, modern data mining software, datasets and plenty of educational materials. There is even a board with job offers from the knowledge management business. I strongly recommend visiting the KDNuggets website as a great online resource on web mining and data mining.

hunting content creators (4) – data understanding and preparation

Posted on the February 28th, 2011 under business,data mining,hunting content creators,social networks,web mining by

Here – part 3 – you will find the previous part of the Hunting content creators series.

In the initial phase of a social networking site’s life, it has to be supplied with fuel – content. Just like a car engine, it needs a regular delivery of good-quality fuel to work properly. Usually, users are treated equally when it comes to content creation. Sometimes it is possible to become a moderator or an admin, but normally being a user is the first and final stage of the ‘site career’.

Tracking users who are content creators and enabling them to work more efficiently and effectively would let the website develop much quicker, which is good for both the owner and the users. The main goal of the research is to check whether web mining algorithms are useful in extracting knowledge about users – potential content creators on social networking sites. The research was done using WEKA ver. 3.6.2 in a Linux Ubuntu 8.10 (2.6.27-15-generic kernel) environment, on a database from an internet board.

The researched internet board runs on PhpBB by Przemo and has 3200 users and about 118,000 messages. The data is kept in a MySQL database, in a table named “users”, which contains information about the users.

20 parameters describing users have been chosen:

  • user ID (ascending with registration date),
  • account status (activated / not activated / deactivated, i.e. used but no longer active for some reason),
  • username,
  • session time,
  • last visit,
  • registration date,
  • number of posts,
  • interface language (Polish, English),
  • avatar type (avatar from board, external source, none),
  • e-mail,
  • user webpage,
  • where the user is from (normally a city/town, but the creativity of users is infinite),
  • signature,
  • instant communication number,
  • interests,
  • birth date,
  • sex,
  • total time spent on the forum,
  • number of visits.

Some parameters were removed from the research:

  • user level,
  • private messages details,
  • other functions of forum script,
  • additional profile fields,
  • data about restrictions, etc.

During preparation, the data was exported to CSV format, which is supported by Weka. Some difficulties were encountered during the process:

  • the CSV had to be edited using an external tool,
  • EOL characters were not accepted in the signature / from fields,
  • the ‘ character was not accepted in the signature / from fields.

Data understanding and preparation is a very important process in business data mining. In an iterative management model it has to be repeated multiple times to get the best results.
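The two field-level problems from the list above can be handled before the export reaches Weka. The sketch below is only an illustration in Python, not the tool that was used in the research: the column names `from` and `signature` are assumptions, and dropping the apostrophe (rather than escaping it) is just one possible choice.

```python
import csv
import io

def clean_field(value: str) -> str:
    """Collapse line breaks and drop apostrophes from a free-text field."""
    # Any run of whitespace, including \r and \n, becomes a single space.
    flattened = " ".join(value.split())
    # Remove straight and typographic apostrophes (assumed to be the problem characters).
    for quote in ("'", "\u2018", "\u2019"):
        flattened = flattened.replace(quote, "")
    return flattened

def clean_rows(rows, text_columns):
    """Return copies of the rows with the given free-text columns cleaned."""
    cleaned = []
    for row in rows:
        new_row = dict(row)
        for column in text_columns:
            new_row[column] = clean_field(new_row.get(column) or "")
        cleaned.append(new_row)
    return cleaned

def to_csv(rows, fieldnames) -> str:
    """Serialize rows to CSV text with one record per line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Hypothetical record; the column names are illustrative only.
sample = [{"user_id": "1", "from": "Krakow,\nPoland", "signature": "Don't panic\r\never"}]
cleaned = clean_rows(sample, ["from", "signature"])
csv_text = to_csv(cleaned, ["user_id", "from", "signature"])
```

After cleaning, every record occupies exactly one line of the CSV file, which is what a line-oriented loader expects.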

hunting content creators (3) – business point

Posted on the December 2nd, 2010 under business,hunting content creators,projects,social networks,web mining by

This post is the continuation of the Hunting content creators thread. You will find the previous part here.

Analysis of the content generated by users of social networking sites is a process that aims at improvement. Past data makes present solutions better and keeps the development of the Internet progressing: more users, more content, and business benefits such as more profit, happy customers or a synergy effect.

Data generated by members of the Internet society is a valuable source of knowledge about several aspects of their network activity. Improving the organization of the network reality by using logical rules is intuitive, but in business it is vital to have an advantage over competitors. An economy based on information demands constant improvement in the production process, management and marketing. This has been happening for decades, so it leaves little room for a serious advantage, and companies explore various areas to look for one. Data mining is used to discover such knowledge and to apply it in development and in gaining an advantage.

Problem identification is the first step in the data mining process. Because data exploration is a tool not only for improving existing solutions but also for creating new ones, the set of initial conditions is much wider.

Improvement of social networking site functioning:

  • faster content creation

It is proven that only a small part of all social networking site users actively contributes to and participates in content creation. The majority are only passive consumers. It is very important to deliver “fresh” content, and taking care of that small group of creators seems very natural and positive. The question is how to detect content creators in an environment as dynamic as a social network, and how to “feed” them. On the other hand, the mechanism created would also be useful for detecting potentially harmful users such as spammers, trolls or robots.

  • content of higher quality

When explaining the nature of the Internet in the time of Web 2.0, where every user is a potential content creator and contributor, it was also mentioned that this has some negative aspects. Content duplication, redundant information, spam, inappropriate content and improper categorization are consequences of the unrestricted activity of Internet users. Data mining algorithms can be used to find inappropriate content, to find and merge redundant content, to correct categorization, to look for errors or even to prevent some user behavior, for example by detecting malicious users. Moreover, a potential advantage is knowledge extraction: collecting know-how around some topic. Internet boards are places where people share their experience and practice, which helps to solve problems and to create new ways of approaching problems that are sometimes complicated.

  • conveniences for users

Some of the conveniences that web mining techniques give users are recommendations of topics or posts according to their preferences (defined previously in the profile or detected on the spot from history or visited content). The same goes for contacts or groups, which are matched thanks to feature extrapolation – and not only on basic features, such as location or similar activities, but also on more complicated ones that result from the algorithm’s work.

Business points

  • contextual advertising and dedicated business offer – knowledge synthesis to generate content (knowledge) aggregates

Knowledge extraction is a tool to enhance awareness of users’ preferences and behavior – the system isn’t limited to the information that the user decides to share. Content and traffic analysis allows the system to discover preferences, interests and many more – sometimes surprising – features of community members. All of the above could serve to identify needs and to make the marketing process more effective.

  • opinion and review extraction

As mentioned above, extracting knowledge from internet boards or other places where discussion arises, in order to gather know-how in various domains, is a worthwhile effort. On a social networking site it is also possible to gather opinions regarding products or services, which are valuable from the producer’s point of view. Knowledge gathered using extrapolation could be the basis for building an advantage over business competitors. The same applies to blogs – concerning both posts and comments.

In summary – the business side of Hunting content creators is about detecting the most contributing users in order to increase the pace of network development. The experimental site is an internet board, where the basic indicator of a user’s activity is the number of published posts. A large number of visits with a small number of posts is a sign that the user is not a contributor to the community, but rather a consumer. Detecting contributors and providing them with an improved environment results in a chain reaction: more content -> more users -> faster community growth -> more content… To accomplish these goals, a couple of web mining algorithms were tested. More in the next part.
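The posts-versus-visits indicator described above can be written down as a very simple baseline. This is only a hand-made heuristic sketch in Python, not one of the web mining algorithms tested in the research; the `User` fields, the sample users and the threshold of 0.5 posts per visit are all assumptions chosen for illustration.

```python
from dataclasses import dataclass

@dataclass
class User:
    name: str
    posts: int
    visits: int

def posts_per_visit(user: User) -> float:
    """Average number of posts published per visit (0.0 for users who never visited)."""
    if user.visits == 0:
        return 0.0
    return user.posts / user.visits

def label_user(user: User, threshold: float = 0.5) -> str:
    """Label a user as 'contributor' or 'consumer' using an assumed threshold."""
    return "contributor" if posts_per_visit(user) >= threshold else "consumer"

# Hypothetical users, not taken from the researched board.
users = [User("alice", 420, 300), User("bob", 3, 950), User("carol", 0, 0)]
labels = {user.name: label_user(user) for user in users}
```

A user like the hypothetical `bob`, with many visits and almost no posts, ends up labelled as a consumer, which matches the rule of thumb from the summary.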

Sentiment analysis – key to opinion and review extraction

Posted on the November 2nd, 2010 under business,data mining,general,web content mining,web mining by

The popularity of the Internet has changed the way feedback on products and services reaches their providers. The main consequence was a turn towards the individual client, because every single user of a product or service who is also an Internet user can publish his sentiment worldwide. Moreover, an opinion or review is accessible to a number of potential or actual customers worldwide, which can work as either promotion or discouragement. Awareness of this fact makes providers and producers look for better ways to collect users’ opinions, usually to improve the quality of service and relations with customers. Such opinions can appear on the global network in various forms. Comments posted directly on producers’ websites are quite popular, but so are blogs (both producers’ and independent ones) and discussion boards (where professional testers meet regular users who just seek a product or service suitable for their simple needs). All of this obliges providers to be present and to participate in the places where sentiment is published and discussed. Collecting marketing information and giving feedback and help has become an important service that improves relations with customers and increases the chance of succeeding in business.

Opinions and reviews on the Internet are subject to the same rules as other kinds of content: there is a lot of information and it is difficult to find the piece most suitable for a given query. Moreover, Internet content changes constantly – a review noticed today can suddenly be removed, with several others found in its place. There is also no assurance that a review is reliable, and there is always a chance of falling victim to spoiled or malicious content. However, the drawbacks do not outweigh the advantages of using Internet opinions and reviews, such as getting direct feedback from the customer, and providers try to access this knowledge in every possible way.

Web mining could be a method to improve the business mechanism: customer opinion -> producer -> positive change. Using the Internet as a source of feedback, the producer obtains knowledge about the functional features of his product and an opportunity to develop important domains, such as business intelligence, knowledge management or quality control. Thanks to data exploration, the data collection mechanism is automatic and tuned to the producer’s needs. The customer’s gain is time savings and more suitable web search results.

Opinion classification is an automatic detection mechanism that recognizes the sentiment of a customer describing an experience with a product or service. Sentiment can be positive or negative, but a more detailed approach is also possible – whether the product is recommended or not recommended. In general, the problem is part of natural language processing. Customer sentiment research starts with data gathering and preparation, followed by a three-step process: tagging fragments of text (phrases of at least two words, used for classification), detecting semantic orientation (with a certain probability) and calculating the probable sentiment. Example techniques for user sentiment classification are Support Vector Machines, Naive Bayes or EM.

Customer review mining matters not only because reviews contain users’ sentiment, but also because they contain details about product usage or service features and their importance to the final evaluation. Because of this, review research is a more difficult task, but its effects could meaningfully improve the quality of a product or service. Thanks to a detailed description, the producer or provider can put himself “in the customer’s shoes” and – as a result of this feedback – improve. The process goes in steps: extracting the described features, extracting the customer’s sentiment regarding each described feature, and summarizing the previous two.

In conclusion, automating customer sentiment and opinion research with web mining mechanisms is potentially a very profitable business. A description of product usage or a simple summary of a transaction is a very popular way of giving customer feedback. Widespread products and services have collected many descriptions on the Internet, which makes checking all the opinions time- and resource-consuming. An automatic mechanism to collect and summarize such feedback would be very useful for producers, providers and customers alike.
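Of the techniques named above, Naive Bayes is the easiest to show in a few lines. The sketch below is a minimal multinomial Naive Bayes with Laplace smoothing, written in plain Python for illustration only. It works on single words rather than the two-word phrases mentioned above, and the class name and the four training sentences are made up for this example.

```python
import math
from collections import Counter

class NaiveBayesSentiment:
    """Minimal multinomial Naive Bayes over whitespace-separated words."""

    def __init__(self):
        self.word_counts = {}         # label -> Counter of word occurrences
        self.doc_counts = Counter()   # label -> number of training texts
        self.vocabulary = set()

    def train(self, texts_with_labels):
        for text, label in texts_with_labels:
            words = text.lower().split()
            self.doc_counts[label] += 1
            self.word_counts.setdefault(label, Counter()).update(words)
            self.vocabulary.update(words)

    def log_scores(self, text):
        """Return the log of prior times likelihood for every known label."""
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, counts in self.word_counts.items():
            score = math.log(self.doc_counts[label] / total_docs)  # log prior
            total_words = sum(counts.values())
            for word in words:
                # Laplace (add-one) smoothing keeps unseen words from zeroing the score.
                score += math.log((counts[word] + 1) / (total_words + vocab_size))
            scores[label] = score
        return scores

    def classify(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)

# Made-up training opinions, for illustration only.
training = [
    ("great phone works perfectly", "positive"),
    ("excellent battery and great screen", "positive"),
    ("terrible battery broke quickly", "negative"),
    ("poor screen and terrible support", "negative"),
]
model = NaiveBayesSentiment()
model.train(training)
prediction = model.classify("great screen")
```

With real data the training set would of course be far larger, and the text would need proper tokenization instead of a plain `split()`.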
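The three steps listed above (features, sentiment per feature, summary) can be sketched with a toy lexicon approach. This Python example is a deliberately naive illustration under strong assumptions: the feature list and the opinion word list are hypothetical and fixed by hand, and the polarity of a whole sentence is assigned to every feature mentioned in it, so sentences with mixed opinions are simply skipped.

```python
from collections import defaultdict

# Hypothetical, hand-made lists; a real system would have to extract these from data.
FEATURES = {"battery", "screen", "price"}
OPINION_WORDS = {"great": 1, "good": 1, "sharp": 1, "poor": -1, "weak": -1, "bad": -1}

def split_sentences(review):
    """Very rough sentence splitter based on '.' and '!'."""
    return [s.strip() for s in review.replace("!", ".").split(".") if s.strip()]

def feature_sentiment(reviews):
    """Count positive and negative sentences per mentioned feature."""
    summary = defaultdict(lambda: {"positive": 0, "negative": 0})
    for review in reviews:
        for sentence in split_sentences(review):
            words = [w.strip(",;:").lower() for w in sentence.split()]
            # Step 2: sentiment of the sentence as the sum of opinion word scores.
            score = sum(OPINION_WORDS.get(w, 0) for w in words)
            if score == 0:
                continue  # neutral or mixed sentence, skipped in this sketch
            polarity = "positive" if score > 0 else "negative"
            # Step 1: features are found by simple word matching.
            for feature in FEATURES.intersection(words):
                summary[feature][polarity] += 1
    # Step 3: the per-feature counts are the summary.
    return dict(summary)

reviews = [
    "The screen is sharp. The battery is weak.",
    "Great screen. Poor battery. The price is good!",
]
summary = feature_sentiment(reviews)
```

For the two sample reviews, the summary counts two positive mentions of the screen, two negative mentions of the battery and one positive mention of the price.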

Knowledge synthesis

Posted on the October 15th, 2010 under business,definitions,general,projects,web mining by

Knowledge synthesis is the process of building a knowledge database, from initially separate elements to a system. Following the paradigm of Internet information search, search engines, given keywords as input, should generate a sorted aggregation of the pages that match the keywords better than other pages. The most suitable result is then chosen by the user. It is a useful and effective method when the user looks for specific information or a definition, but less so when it comes to open questions or more complicated queries.

Clustering of web search results improved the mechanisms of the Internet search process. Going a step further, the question arises whether it is possible to obtain full information from search engines, or at least an aggregate providing a consistent picture of the researched topic.

In traditional knowledge synthesis, the outcome of the work is usually presented as a systematic review or meta-analysis. The description of the issue is based on scientific evidence and follows a methodology: the problem has to be defined, as well as the sources (literature); data has to be understood and extracted, then researched. Results have to be summarized, critically evaluated and finally – concluded. Such a review is a knowledge aggregation. A more detailed synthesis draws not only on scientific literature, but also on conferences, academic courses or lecture notes, etc. In some domains, an important part of knowledge doesn’t exist in written form, which is a serious complication in the research process. In Internet knowledge synthesis, the situation is similar with the “deep web” or “dark web”. Automatic knowledge synthesis sometimes requires interconnection between several domains, such as artificial intelligence, semantic networks and data mining, while still keeping a scientific and analytical approach.
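The keyword search paradigm described above can be reduced to a toy example. The Python sketch below ranks pages by how often the query keywords occur in their text. This is only an assumption-laden illustration (real search engines use far richer ranking signals), and the URLs and page texts are invented.

```python
def keyword_score(text, keywords):
    """Count how many times the keywords occur in the text (case-insensitive)."""
    words = text.lower().split()
    return sum(words.count(keyword.lower()) for keyword in keywords)

def rank_pages(pages, keywords):
    """Return URLs of matching pages, best match first (ties broken by URL)."""
    scored = [(keyword_score(text, keywords), url) for url, text in pages.items()]
    matching = [(score, url) for score, url in scored if score > 0]
    matching.sort(key=lambda item: (-item[0], item[1]))
    return [url for _, url in matching]

# Invented pages, for illustration only.
pages = {
    "example.org/a": "data mining tools and data mining software",
    "example.org/b": "web mining basics",
    "example.org/c": "cooking recipes",
}
ranking = rank_pages(pages, ["data", "mining"])
```

The result is exactly the "sorted aggregation of pages" from the paragraph above: a list the user still has to read and pick from, which is why it works for lookups but not for open questions.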

Automatic knowledge synthesis research provides:

  • an automatically generated result for a query on the requested topic, in a consistent and comprehensive form, as a summary of the most important publications on the Internet or in another source of data,
  • the possibility of getting an answer to a complex question asked directly in the web browser,
  • automation of several parts of live processes, where decisions would be made automatically by the machine, on the basis of knowledge extracted from a systematic review generated from the Web or databases.

CRISP-DM

Posted on the April 2nd, 2010 under business,projects,web mining by

CRISP-DM stands for CRoss Industry Standard Process for Data Mining. It is a methodology used in running data mining projects, since data exploration, like other business processing techniques, demands a general guide to follow.

The basic methodology is split into four parts:

  1. problem identification
  2. data preprocessing (turning data into information, whatever that means)
  3. data exploration
  4. evaluation (result examination)

Data mining is in general a mechanism that lets us make better decisions in the future by analysing (in a very fancy way) past data. There are two moments in the data mining process where we have to be careful: when we discover a pattern that is false, and when the pattern is true but useless. The first is a straightforward danger, because business decisions made on a false basis simply cost money (sometimes an awful lot of money). The second one has an additional, hidden trap, because it becomes clear that the rule is useless only after implementing it – the system simply doesn’t pass the reality check. Keeping to the methodology provides us with a mechanism to minimize the probability of making such a mistake.

According to crisp-dm.org, it is an open methodology that keeps the industrial data mining process close to a general strategy for solving business and research problems. The process is divided into 6 steps:

  1. business problem and condition understanding
  2. data understanding
  3. data preparation
  4. modelling
  5. evaluation
  6. implementation

It is very important to notice that each step is strictly connected with the results of the previous one, and that it is necessary to jump several times between levels (not only in the order presented above!). It is also natural that the result of one step causes a return to the starting point of the project and a re-evaluation of some opinions or preliminary designs.

[M. Berry, G. Linoff, “Data Mining Techniques”, Wiley 2004.]

[Daniel Larose, “Odkrywanie wiedzy z danych”, PWN 2006, 5]