This is part 3 of the Hunting content creators series. You will find the previous part here.
In its initial phase, a social networking site has to be supplied with fuel: content. Like a car engine, it needs a regular delivery of good-quality fuel to work properly. Usually, all users are treated equally when it comes to creating content. Sometimes it is possible to become a moderator or an admin, but normally "user" is both the first and the final stage of the ‘site career’.
Tracking the users who create content and enabling them to work more efficiently and effectively would let the site develop much more quickly, which is good for both the owner and the users. The main goal of the research is to check whether web mining algorithms are useful for extracting knowledge about users who are potential content creators on social networking sites. The research was done using WEKA ver. 3.6.2 on Ubuntu Linux 8.10 (2.6.27-15-generic kernel), with a database from an Internet board: 3200 users, 120k messages, PhpBB by Przemo technology.
The researched Internet board is built on PhpBB by Przemo, with 3200 users and 118,000 messages. The data is stored in a MySQL database, in a table named “users” that contains information about the users.
20 parameters describing users have been chosen:
- user ID (ascending by registration date),
- account status (activated / not activated / deactivated, i.e. used but no longer active for some reason),
- username,
- session time,
- last visit,
- registration date,
- number of posts,
- interface language (Polish, English),
- avatar type (avatar from board, external source, none),
- e-mail,
- user webpage,
- where the user is from (normally a city or town, but the creativity of users is infinite),
- signature,
- instant communication number,
- interests,
- birth date,
- sex,
- total time spent on the forum,
- number of visits.
Some parameters were removed from the research:
- user level,
- private messages details,
- other functions of forum script,
- additional profile fields,
- data about restrictions, etc.
During preparation, the data was exported to CSV format, which is supported by Weka. Some difficulties were encountered during the process:
- the CSV file had to be edited with an external tool,
- EOL characters were not accepted in the signature / from fields,
- the ‘ character was not accepted in the signature / from fields.
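The two field problems listed above can be handled before the file ever reaches Weka. Below is a minimal sketch, assuming the rows have already been read from the database into Python dictionaries. The column names `user_from` and `user_sig` and the helper names are assumptions made for illustration, not taken from the original export.

```python
import csv
import io


def clean_field(value):
    """Collapse line breaks and drop apostrophes from a free-text field."""
    if value is None:
        return ""
    text = str(value)
    # Replace every kind of line break with a space.
    text = text.replace("\r\n", " ").replace("\r", " ").replace("\n", " ")
    # Drop straight and typographic apostrophes.
    for quote in ("'", "\u2018", "\u2019"):
        text = text.replace(quote, "")
    # Collapse runs of whitespace left behind by the replacements.
    return " ".join(text.split())


def rows_to_csv(rows, fieldnames):
    """Write cleaned rows to a CSV string with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow({name: clean_field(row.get(name)) for name in fieldnames})
    return buffer.getvalue()


# Hypothetical sample row, for illustration only.
sample = [{"user_id": 1, "user_from": "Here,\nthere", "user_sig": "It's me"}]
print(rows_to_csv(sample, ["user_id", "user_from", "user_sig"]))
```

Cleaning at export time means the step can be repeated on every iteration of data preparation instead of editing the file by hand.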
Data understanding and preparation is a very important process in business data mining. In an iterative management model it has to be repeated multiple times to get the best results.
This post is a continuation of the Hunting content creators thread. You will find the previous part here.
Analysis of the content generated by users of social networking sites is a process aimed at improvement. Past data makes present solutions better and keeps the development of the Internet progressing: more users, more content, and business gains such as more profit, happier customers or a synergy effect.
Data generated by members of the Internet society is a valuable source of knowledge about several aspects of their network activity. Improving the organization of network reality by using logical rules is intuitive, but in business it is vital to have an advantage over competitors. An economy based on information demands constant improvement in production, management and marketing. This has been happening for decades, so little room is left for a serious advantage there, and companies explore various other areas to look for one. Data mining is used to discover such knowledge and to apply it in development and in gaining an advantage.
Problem identification is the first step in the data mining process. Because data exploration is a tool not only for improving existing solutions but also for creating new ones, the set of initial conditions is much wider.
Improvement of social networking sites functioning:
It has been shown that only a small part of all social networking site users actively contributes to and participates in content creation. The majority are only passive consumers. It is very important to deliver “fresh” content, so taking care of that small group of creators seems very natural and positive. The question is how to detect content creators in an environment as dynamic as a social network, and how to “feed” them. On the other hand, the mechanism created would also be useful for detecting potentially negative users such as spammers, trolls or robots.
- content of higher quality
In explaining the nature of the Internet in the time of Web 2.0, where every user is a potential content creator and contributor, it was also mentioned that there are some negative aspects. Content duplication, redundant information, spam, inappropriate content and improper categorization are consequences of the unrestricted activity of Internet users. Data mining algorithms can be used to find inappropriate content, to consolidate redundant content, to correct categorization, to look for errors, or even to prevent some user behavior, for example by detecting malicious users. Moreover, a potential advantage is knowledge extraction: collecting know-how around some topic. Internet boards are places where people share their experience and practice, which helps in solving problems and in creating new approaches to sometimes complicated problems.
Some of the conveniences that web mining techniques give users are recommendations of topics or posts according to their preferences (defined previously in the profile or detected on the spot from history or visited content). The same applies to contacts or groups, which are matched by extrapolating features: not only basic features, such as location or similar activities, but also more complicated ones produced by the algorithm.
Business points
- contextual advertising and dedicated business offer – knowledge synthesis to generate content (knowledge) aggregates
Knowledge extraction is a tool to increase awareness of users’ preferences and behavior: the system isn’t limited to the information that the user decides to share. Content and traffic analysis allows the system to discover preferences, interests and many more, sometimes surprising, features of community members. All of the above could serve to identify needs and to make the marketing process more effective.
- opinion and review extraction
As mentioned above, extracting knowledge from Internet boards or other places where discussion arises is a worthwhile way to gather know-how in various domains. On social networking sites it is also possible to gather opinions about products or services, which are valuable from the producer’s point of view. Knowledge gathered using extrapolation could be a basis for building an advantage over business competitors. The same applies to blogs, concerning both posts and comments.
In summary, the business issues of Hunting content creators revolve around detecting the most contributing users in order to increase the tempo of network development. The experimental site is an Internet board, where the basic indicator of a user’s activity is the number of published posts. A large number of visits with a small number of posts is a sign that the user is not a contributor to the community, but rather a consumer. Detecting contributors and providing them with an improved environment results in a chain reaction: more content -> more users -> faster community growth -> more content… To accomplish these goals, a couple of web mining algorithms were tested. More in the next part.
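The visits-versus-posts indicator described above can be written down as a simple rule. This is only a sketch: the function name, the minimum number of visits and the ratio threshold are assumptions chosen for illustration, not values from the research.

```python
def classify_activity(posts, visits, min_visits=10, ratio_threshold=0.1):
    """Label a user by the ratio of published posts to visits.

    min_visits and ratio_threshold are illustrative assumptions.
    """
    if visits < min_visits:
        # Too little history to say anything about this user.
        return "unknown"
    ratio = posts / visits
    return "contributor" if ratio >= ratio_threshold else "consumer"


# Hypothetical users: (posts, visits)
for posts, visits in [(50, 100), (2, 500), (0, 3)]:
    print(posts, visits, classify_activity(posts, visits))
```

A fixed threshold like this is crude. It is meant as a baseline against which the results of the web mining algorithms can be compared.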
As I wrote in the previous part, content creators part 1, discovering ubercreators and exploiting this knowledge should be an important part of the development of every social-networking site.
My project (idea) is to set up a system that finds content creators in a functioning Internet board using data mining algorithms. Some details:
- a database (MySQL) with over 3k users and about 70 describing parameters,
- the parameters describing users must be selected (manually; technically it comes down to selecting the tables in the database, and the process could be automated if necessary),
- Weka is used as a set of classifiers and clustering algorithms (the data has to be prepared for both the program and the algorithm).
Creating content on a discussion board is not a really complex issue. Although it is difficult to evaluate the value of messages, in most cases it is not even necessary. It is enough to eliminate obvious cases of spamming and just let the snowball roll down the hill.
At a certain moment, discovering users with hidden potential to create valuable content can give an evolving society a serious boost. Given a set of users with parameters, with an emphasis on the parameters describing activity and “creative spirit”, the algorithm does the rest of the job, clustering users into groups with a high level of similarity. The point is to use the results of classification to give positive feedback to possible creators and to exploit their potential.
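To make the clustering step concrete, here is a small sketch of plain k-means over two activity parameters (posts, visits). K-means is used only as an illustration: the text above does not say which clustering algorithm was run in Weka. The sample users and the deterministic choice of starting centroids are assumptions.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain k-means on a list of equal-length numeric tuples."""
    # Deterministic start (an assumption): the first k points are the centroids.
    centroids = [tuple(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        for i, p in enumerate(points):
            assignment[i] = min(range(k), key=lambda c: math.dist(p, centroids[c]))
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centroids


# Hypothetical users described by (number of posts, number of visits).
users = [(1, 200), (900, 950), (2, 180), (850, 1000), (0, 220)]
groups, centers = kmeans(users, k=2)
print(groups)
print(centers)
```

In practice the parameters would need to be scaled to comparable ranges first, otherwise the one with the largest values dominates the distance.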
The most reliable way to measure results is to implement the model in a real-life system. However, it is also necessary to try some modelling first, because walking in the dark without a flashlight, that is, without even predicting whether it is going to succeed, is unacceptable in any business. Success in this case means quick development of the network society with visible growth of valuable content and SEO parameters.
Content creators in social-networking sites part 1
The next chapters cover the chosen parameters, algorithms and modelling.
It is a rather obvious statement that content creators are the motor of every social-networking site. Every owner of a social-networking site knows that what he provides is only a machine, leaving the “stream of life” in the hands (and keyboards) of the most active users. Nothing says more than numbers: my research shows that only 0.5% of all users of my S-N site are responsible for 38% of the content created!
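A concentration figure like the one above can be computed directly from the per-user post counts. The sketch below shows one way to do it. The function name and the sample counts are made up for illustration and are not the board's real data.

```python
import math


def top_share(post_counts, fraction):
    """Share of all posts written by the top `fraction` of users."""
    total = sum(post_counts)
    if total == 0:
        return 0.0
    # At least one user always counts as "the top".
    top_n = max(1, math.ceil(len(post_counts) * fraction))
    ranked = sorted(post_counts, reverse=True)
    return sum(ranked[:top_n]) / total


# Hypothetical post counts for five users.
counts = [50, 30, 10, 5, 5]
print(top_share(counts, 0.2))
```

On the real table, `post_counts` would be the "number of posts" column and `fraction` would be 0.005.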
From the business point of view it is critical to have such users. When everybody wants to eat but there is nobody to plant crops, the result is starvation for most of the society. It is also said that valuable content has a magnetism of its own, attracting both users and search engines.
Hunting content creators should be high on the TO DO list after starting a UCC website. Connecting the dots between content creation in S-N sites and my interest in data mining resulted in an idea: use data mining to discover users who might be better-than-average content suppliers.
How to do it, having an 8-year-old Internet board database full of profile information, with over 3200 users and over 115k posts? How will it affect the life of the society? What is the reliability of the research? And finally, what is the point (where is the money)?
As usual, there are a lot of questions, and the answers come as probabilities. The next piece of the picture will be revealed in the following part.