Archive for the ‘projects’ Category

hunting content creators (3) – business point

Posted on December 2nd, 2010 under business, hunting content creators, projects, social networks, web mining

This post continues the Hunting content creators thread. You will find the previous part here.

Analysis of the content generated by users of social networking sites is a process aimed at improvement. Past data makes present solutions better and keeps the development of the Internet progressing: more users, more content, and business benefits such as more profit, happy customers or synergy effects.

Data generated by members of the Internet society is a valuable source of knowledge about several aspects of their network activity. Improving the organization of network reality with logical rules is intuitive, but in business it is vital to have an advantage over competitors. An economy based on information demands constant improvement in production, management and marketing. This has been happening for decades, so little room is left for a serious advantage, and companies explore various areas in search of one. Data mining is used to discover such knowledge and to feed it into development and into the process of gaining an advantage.

Problem identification is the first step in the data mining process. Because data exploration is a tool not only for improving existing solutions but also for creating new ones, the set of initial conditions is much wider.

Improving the functioning of social networking sites:

  • faster content creation

It is proven that only a small part of all social networking site users actively contributes to and participates in content creation. The majority are only passive consumers. It is very important to deliver “fresh” content, so taking care of that small group of creators seems natural and positive. The question is how to detect content creators in an environment as dynamic as a social network, and how to “feed” them. On the other hand, the resulting mechanism would also be useful for detecting potentially harmful users such as spammers, trolls or robots.

  • content of higher quality

When the nature of the Internet in the time of Web 2.0 was explained, where every user is a potential content creator and contributor, it was also mentioned that this has some negative aspects. Content duplication, redundant information, spam, inappropriate content and improper categorization are consequences of the unconstrained activity of Internet users. Data mining algorithms can be used to find inappropriate content, to find and integrate redundant content, to correct categorization, to look for errors, or even to prevent certain user behavior, for example by detecting malicious users. Moreover, a potential advantage is knowledge extraction: collecting know-how around some topic. Internet boards are places where people share their experience and practice, which helps in solving problems and in creating new approaches to sometimes complicated problems.

  • conveniences for users

Some of the conveniences that web mining techniques give users are recommendations of topics or posts according to their preferences (defined previously in the profile or detected on the spot from history or visited content). The same goes for contacts or groups, which are matched by extrapolating features: not only basic features, such as location or similar activities, but also more complicated ones that result from the algorithm.

Business points

  • contextual advertising and dedicated business offer – knowledge synthesis to generate content (knowledge) aggregates

Knowledge extraction is a tool to enhance awareness of users’ preferences and behavior: the system isn’t limited to the information that the user decides to share. Content and traffic analysis allows the system to discover preferences, interests and many more, sometimes surprising, features of community members. All of the above could serve to identify needs and to make the marketing process more effective.

  • opinion and review extraction

As mentioned above, it is worthwhile to extract knowledge from Internet boards or other places where discussion arises, in order to gather know-how in various domains. In social networking sites it is also possible to gather opinions regarding products or services, which are valuable from the producer’s point of view. Knowledge gathered using extrapolation could be the basis for building an advantage over business competitors. The same applies to blogs, concerning both posts and comments.

In summary, the business issues of Hunting content creators revolve around detecting the most contributing users in order to increase the tempo of network development. The experimental site is an Internet board, where the basic indicator of a user’s activity is the number of published posts. A large number of visits combined with a small number of posts is a sign that the user is not a contributor to the community but rather a consumer. Detecting contributors and providing them with an improved environment results in a chain reaction: more content -> more users -> faster community growth -> more content… To accomplish these goals, a couple of web mining algorithms were tested. More in the next part.
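The visits-versus-posts indicator described above can be sketched in a few lines of Python. This is only an illustration: the UserActivity record, the label_user function and both thresholds are assumptions made for the example, not values or names taken from the project.

```python
from dataclasses import dataclass


@dataclass
class UserActivity:
    name: str
    visits: int  # number of visits to the board
    posts: int   # number of published posts


def label_user(user, min_posts=10, min_ratio=0.1):
    """Label a user as 'contributor' or 'consumer'.

    Both thresholds are illustrative assumptions, not project values.
    """
    if user.visits == 0:
        return "consumer"
    ratio = user.posts / user.visits
    if user.posts >= min_posts and ratio >= min_ratio:
        return "contributor"
    return "consumer"


# Made-up sample data.
users = [
    UserActivity("alice", visits=200, posts=150),
    UserActivity("bob", visits=900, posts=4),
    UserActivity("carol", visits=50, posts=12),
]
labels = {u.name: label_user(u) for u in users}
print(labels)
```

In this sample, bob visits often but rarely posts, so he is labelled a consumer. A fixed threshold like this is only a starting point; the algorithms mentioned in the post would replace it.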

Knowledge synthesis

Posted on October 15th, 2010 under business, definitions, general, projects, web mining

Knowledge synthesis is the process of building a knowledge database, from initially separate elements into a system. Following the paradigm of information search methods on the Internet, a search engine given keywords as input should generate a sorted aggregation of the pages that match the keywords better than other pages. The user then chooses the most suitable result. It is a useful and effective method when the user looks for specific information or a definition, but it falls short when it comes to open questions or more complicated queries.
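As a toy illustration of the keyword paradigm described above, the sketch below sorts pages by how many query keywords they contain. The rank_pages function and the sample pages are hypothetical; real search engines use far richer ranking signals.

```python
def rank_pages(pages, keywords):
    """Return page titles sorted by the number of query keywords they contain.

    `pages` maps a title to its text. Pages matching no keyword are dropped.
    """
    wanted = {k.lower() for k in keywords}
    scored = []
    for title, text in pages.items():
        words = set(text.lower().split())
        score = len(wanted & words)
        if score:
            scored.append((score, title))
    # Highest score first; ties are broken alphabetically by title.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [title for _, title in scored]


# Made-up sample pages.
pages = {
    "weka-intro": "weka is a data mining tool",
    "bird-stats": "data about birds",
    "recipes": "cooking at home",
}
ranking = rank_pages(pages, ["data", "mining"])
print(ranking)
```

The user still has to pick one page from the sorted list, which is the limitation the paragraph above points at.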

Clustering of web search results improved the mechanisms of the Internet search process. Going a step further, the question arises whether it is possible to obtain full information from search engines, or at least an aggregate providing a consistent picture of the researched topic.

In traditional knowledge synthesis, the result of the work is usually presented as a systematic overview or meta-analysis. The description of the issue is based on scientific evidence and follows a methodology: the problem has to be defined, as well as the sources (literature); data has to be understood and extracted, then researched. Results have to be summarized, critically evaluated and finally concluded. Such an overview is a knowledge aggregation. A more detailed synthesis derives not only from scientific literature, but also from conferences, academic courses or scripts, etc. In some domains, an important part of knowledge doesn’t exist in written form, which is a serious complication in the research process. In Internet knowledge synthesis, the situation is similar with the “deep web” or “dark web”. Automatic knowledge synthesis sometimes requires connecting several domains, such as artificial intelligence, semantic networks and data mining, while still keeping a scientific and analytical approach.

Automatic knowledge synthesis research provides:

  • an automatically generated answer to a query on a requested topic, in a consistent and comprehensive form, as a summary of the most important publications on the Internet or in another data source,
  • the possibility of getting an answer to a complex question asked directly in the web browser,
  • automation of several parts of live processes, where decisions would be made automatically by the machine on the basis of knowledge extracted from a systematic overview generated from the Web or databases.

Sustainable development

Posted on September 30th, 2010 under general, projects

Recently I decided to set goals for the development of the WebContentMining.com blog; here are the directions I want to follow.

There are three main categories I want to develop:

  1. Web content mining core category
  2. Hunting content creators project
  3. Around web mining and data mining topics.

The server location has also changed, and I have switched to the default theme, only temporarily.

Rgds.

hunting content creators (2) – idea

Posted on May 27th, 2010 under hunting content creators, projects, social networks, web mining

As I’ve written in the previous part (content creators part 1), discovering ubercreators and exploiting this knowledge should be an important part of the development of every social-networking site.

My project (idea) is to set up a system to find content creators in a functioning Internet board, using data mining algorithms. Some details:

  • a database (MySQL) with over 3k users and about 70 parameters describing them,
  • the parameters describing users must be selected (manually; technically it comes down to selecting tables in the database, and the process could be automated if necessary),
  • Weka is used as a set of classifiers and clustering algorithms (the data must be prepared for both the program and the algorithm).

Content creation on a discussion board is not a really complex issue. Although it is difficult to evaluate the value of the messages, in most cases it is not even necessary. It is enough to eliminate obvious cases of spamming and just let the snowball roll down the hill.

At a certain moment, discovering users with hidden potential to create valuable content can give an evolving society a serious boost. Given a set of users with parameters, with an emphasis on those parameters describing activity and “creative spirit”, the algorithm does the rest of the job, clustering users into groups with a high level of similarity. The point is to use the results of the classification to give positive feedback to possible creators, to exploit their potential.
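To make the clustering step more concrete, here is a minimal k-means sketch in plain Python. It stands in for whichever Weka clusterer is actually used in the project, which this post does not name. The two features (posts per month, threads started), the sample values and the choice of the first k points as starting centroids are all assumptions made for the example.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal length."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=20):
    """Plain k-means; the first k points serve as the initial centroids."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        for i, p in enumerate(points):
            assignment[i] = min(
                range(k), key=lambda c: squared_distance(p, centroids[c])
            )
        # Move every centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


# Hypothetical users described by (posts per month, threads started).
users = [(1, 0), (2, 1), (1, 1), (40, 12), (38, 10), (45, 15)]
assignment, centroids = kmeans(users, k=2)
print(assignment)
```

With this sample data the three quiet users and the three active users end up in separate groups. With about 70 real parameters the features would need normalizing first, so that one large-valued attribute does not dominate the distance.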

The most reliable way to measure results is to implement the model in a real-life system. However, it is also necessary to try some modelling first, because walking in the dark without even a prediction (a flashlight) of whether it is going to succeed is unacceptable in every business. Success in this case means quick development of the network society with visible growth of valuable content and SEO parameters.

Content creators in social-networking sites part 1

Next chapters cover issues of the chosen parameters, algorithm and modelling.

Weka

Posted on May 20th, 2010 under data mining, definitions, hunting content creators, projects, web mining

Weka is a collection of algorithms commonly used in data mining. It has both a graphical and a command-line interface; the latter is probably useful for more complicated projects, while for mine the simple Explorer was enough. Moreover, one can use one’s own Java code. Weka contains tools for data preparation (normalization, discretization and a bunch of others), classification, clustering, regression and association rules, not to mention well-developed visualization.

Weka - explorer window

I enjoyed working with Weka very much. After some struggling with the input data format (I used CSV) and a little practice, a wide choice of possibilities appeared. I used Weka in a Unix environment, Ubuntu 8.1.

The basic data format for Weka is ARFF: Attribute-Relation File Format, an ASCII file format. It describes instances that share attributes. You can choose another file format (.names, .data (C4.5), .csv, .libsvm, .dat, .bsi, .xrff), which happens most of the time, at least at the beginning of projects, when you have a lot of data from external sources such as MySQL databases or Excel.
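For readers who have not seen an ARFF file, the sketch below turns a small CSV text into ARFF using only the Python standard library. The csv_to_arff helper, the relation name and the sample columns are made up for illustration. Weka ships its own converters, so this only shows what the format looks like; it does not quote or escape values.

```python
import csv
import io


def csv_to_arff(csv_text, relation, nominal=()):
    """Convert CSV text with a header row into ARFF text.

    Columns listed in `nominal` become nominal attributes; all other
    columns are declared numeric. Values are written as-is, so this
    only suits simple values without spaces or commas.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    lines = [f"@relation {relation}", ""]
    for index, name in enumerate(header):
        if name in nominal:
            values = sorted({row[index] for row in data})
            lines.append(f"@attribute {name} {{{','.join(values)}}}")
        else:
            lines.append(f"@attribute {name} numeric")
    lines += ["", "@data"]
    lines += [",".join(row) for row in data]
    return "\n".join(lines) + "\n"


# Made-up sample export: two numeric columns and one nominal column.
sample = "posts,visits,role\n150,200,member\n4,900,member\n12,50,moderator\n"
arff = csv_to_arff(sample, "board_users", nominal={"role"})
print(arff)
```

The header section declares the relation and one attribute per column, and every line after @data is one instance.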

There are some functions worth mentioning, like various kinds of filtering (supervised or unsupervised), adding jitter or other kinds of random “pollution”, randomization, sampling and standardization. It is possible to use Perl commands or visualize datasets in many ways. At any moment you can check the log to find out what is happening inside, or check memory usage; logging also takes place in the Ubuntu console where the program was started.

I have to mention the “dancing” Kiwi shown while an algorithm is working. It is a strange feeling when you have to watch it for a couple of hours.

hunting content creators (1) – introduction

Posted on May 14th, 2010 under hunting content creators, projects, social networks, web mining

It is a rather obvious statement that content creators are the motor of every social-networking site. Each owner of a social-networking site knows that he provides only the machine, leaving the “stream of life” in the hands (and keyboards) of the most active users. Nothing says more than numbers: my research shows that only 0.5% of all users of my S-N site are responsible for 38% of the content created!
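A concentration figure like the one above can be computed from per-user post counts. The sketch below shows one way to do it; the top_share function and the sample counts are invented for illustration and are not the data behind the 0.5% / 38% result.

```python
import math


def top_share(post_counts, fraction):
    """Return the share of all posts written by the top `fraction` of users."""
    total = sum(post_counts)
    if total == 0:
        return 0.0
    ranked = sorted(post_counts, reverse=True)
    # Always count at least one user, even for very small fractions.
    top_n = max(1, math.ceil(len(ranked) * fraction))
    return sum(ranked[:top_n]) / total


# Made-up post counts for eight users.
counts = [120, 40, 20, 10, 5, 3, 1, 1]
share = top_share(counts, 0.25)
print(f"Top 25% of users wrote {share:.0%} of the posts")
```

In this toy sample the two most active users out of eight account for 160 of the 200 posts.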

From the business point of view it is critical to have such users. When everybody wants to eat but there is nobody to plant crops, the result is starvation for most of the society. It is also said that valuable content has magnetism within, attracting both users and search engines.

Hunting content creators should be high on the TO DO list after starting a UCC website. Connecting the dots between content creation in S-N sites and my interest in data mining resulted in an idea: use data mining to discover users who might be better-than-average content suppliers.

How to do it, having an 8-year-old Internet board database full of profile information, with over 3200 users and over 115k posts? How will it affect the life of the society? What is the reliability of the research? And finally, what is the point (where is the money)?

As usual, there are a lot of questions, and the answers are given in terms of probability. The next part of the picture will be revealed in the following part.

CRISP-DM

Posted on April 2nd, 2010 under business, projects, web mining

CRISP-DM stands for CRoss Industry Standard Process for Data Mining. It is a methodology used in running data mining projects, since data exploration, like other business processes, demands a general guide to follow.

Basic methodology is split into four parts:

  1. problem identification
  2. data preprocessing (turn data into information, whatever it means)
  3. data exploration
  4. evaluation (result examination)

Data mining is in general a mechanism that lets us make better decisions in the future by analysing (in a very fancy way) past data. There are two points in the data mining process where we have to be careful: when we discover a pattern that is false, and when the pattern is true but useless. The first is a direct danger, because business decisions made on a false basis simply cost money (sometimes an awful lot of money). The second has an additional, hidden trap, because it becomes clear that the rule is useless only after implementing it: the system simply doesn’t pass the reality check. Following the methodology provides us with a mechanism to minimize the probability of making such a mistake.

According to crisp-dm.org, it is an open methodology meant to keep the industrial data mining process close to a general strategy for solving business and research problems. The process is divided into 6 steps:

  1. business problem and condition understanding
  2. data understanding
  3. data preparation
  4. modelling
  5. evaluation
  6. implementation

It is very important to notice that each step is strictly connected with the results of the previous one, and it is necessary to jump several times between levels (not only in the order presented above!). It is also natural that the result of one step causes a return to the starting point of the project and a re-evaluation of some opinions or initial designs.

[M. Berry, G. Linoff, “Data Mining Techniques”, Wiley 2004.]

[Daniel Larose, “Odkrywanie wiedzy z danych”, PWN 2006, 5]