This is part 3 of the Hunting content creators series; the previous part can be found here.
In its initial phase, a social networking site has to be supplied with fuel: content. Like a car engine, it needs a regular delivery of good-quality fuel to work properly. Usually, users are treated equally when it comes to creating content. Sometimes it is possible to become a moderator or admin, but normally "user" is the first and final stage of the ‘site career’.
Tracking the users who are content creators and enabling them to work more efficiently and effectively would let the site develop much more quickly, which is good for both the owner and the users. The main goal of the research is to check whether web mining algorithms are useful for extracting knowledge about users who are potential content creators on social networking sites. The research was done using WEKA ver. 3.6.2 in a Linux Ubuntu 8.1 (2.6.27-15-generic kernel) environment, on a database from an Internet board: 3200 users, 120k messages, PhpBB by Przemo technology.
The researched Internet board is built on PhpBB by Przemo, with 3200 users and 118,000 messages. Data is kept in a MySQL database, in a table named “users”, which contains information about users.
20 parameters describing users have been chosen:
- user ID (ascending by registration date),
- account status (activated / not activated / deactivated, i.e. used but no longer active for some reason),
- username,
- session time,
- last visit,
- registration date,
- number of posts,
- interface language (Polish, English),
- avatar type (avatar from board, external source, none),
- e-mail,
- user webpage,
- where the user is from (normally a city or town, but the creativity of users is infinite),
- signature,
- instant communication number,
- interests,
- birth date,
- sex,
- total time spent on the forum,
- number of visits.
Some parameters were removed from the research:
- user level,
- private messages details,
- other functions of forum script,
- additional profile fields,
- data about restrictions, etc.
During preparation, the data was exported to CSV format, which is supported by Weka. Some difficulties were encountered during the process:
- the CSV had to be edited using an external tool,
- EOL characters were not accepted in the signature / from fields,
- the ‘ character was not accepted in the signature / from fields.
Data understanding and preparation is a very important process in business data mining. In an iterative management model it has to be repeated multiple times to get the best results.
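The two character problems above can be handled before the file ever reaches Weka. The following is only a minimal sketch of one possible cleanup step, not the procedure used in the research: the column names `user_from` and `user_sig` and the sample rows are assumptions made for illustration, and dropping apostrophes (rather than escaping them) is just the simplest choice.

```python
import csv
import io

def clean_field(value):
    """Flatten line breaks and drop apostrophes in a free-text field."""
    if value is None:
        return ""
    # Collapse any run of whitespace (including \r and \n) into one space.
    flat = " ".join(str(value).split())
    # Drop straight and typographic apostrophes.
    for quote in ("'", "\u2018", "\u2019"):
        flat = flat.replace(quote, "")
    return flat

def rows_to_csv(rows, columns, text_columns):
    """Serialize dict rows to CSV text, cleaning the named free-text columns."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns)
    for row in rows:
        writer.writerow([
            clean_field(row.get(col)) if col in text_columns else row.get(col, "")
            for col in columns
        ])
    return buffer.getvalue()

# Hypothetical sample rows; column names are assumed, not taken from the board.
sample = [
    {"user_id": 1, "user_from": "Krak\u00f3w,\nPoland", "user_sig": "It's me\r\n:)"},
    {"user_id": 2, "user_from": None, "user_sig": "no sig"},
]
csv_text = rows_to_csv(
    sample,
    ["user_id", "user_from", "user_sig"],
    text_columns={"user_from", "user_sig"},
)
print(csv_text)
```

Fields that still contain a comma are quoted by the `csv` module, so only the line breaks and apostrophes need explicit handling.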
As I wrote in the previous part (content creators part 1), discovering ubercreators and exploiting this knowledge should be an important part of the development of every social-networking site.
My project (idea) is to set up a system that finds content creators on a functioning Internet board, using data mining algorithms. Some details:
- a database (MySQL) with over 3k users and about 70 parameters describing them,
- the parameters describing users must be selected (manually; technically it comes down to selecting the tables in the database, and the process could be automated if necessary),
- Weka is used as a set of classifiers and clustering algorithms (the data has to be prepared for both the program and the algorithm).
Creating content on a discussion board is not a really complex issue. Although it is difficult to evaluate the value of messages, in most cases it is not even necessary. It is enough to eliminate obvious cases of spamming and just let the snowball roll down the hill.
At a certain moment, discovering users with hidden potential to create valuable content can give an evolving community a serious boost. Given a set of users with parameters, with an emphasis on those parameters describing activity and “creative spirit”, the algorithm does the rest of the job, clustering users into groups with a high level of similarity. The point is to use the results of classification to give positive feedback to possible creators and so exploit their potential.
The most reliable way to measure results is to implement the model in a real-life system. However, it is also necessary to try some modelling first, because walking in the dark without even predicting (a flashlight) whether it is going to succeed is unacceptable in any business. Success in this case means quick development of the network community with visible growth of valuable content and SEO parameters.
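To make the clustering idea concrete, here is a small sketch in Python. It is not the Weka setup used in the project: it swaps in a plain k-means written from scratch, the user names and numbers are invented, and the three features (posts, visits, hours on the forum) are only an assumed subset of the activity parameters.

```python
import math

def normalize(rows):
    """Rescale every column to [0, 1] so no single parameter dominates."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [(max(c) - min(c)) or 1.0 for c in cols]
    return [[(v - lo) / sp for v, lo, sp in zip(r, lows, spans)] for r in rows]

def kmeans(points, k, iterations=50):
    """Plain k-means; centroids start at the first k points (deterministic)."""
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Invented users: (name, posts, visits, hours on the forum).
users = [
    ("anna", 3, 10, 2.0),
    ("bob", 950, 800, 400.0),
    ("cezary", 5, 20, 3.5),
    ("dora", 1, 4, 0.5),
    ("ewa", 1200, 900, 520.0),
    ("filip", 8, 30, 6.0),
]
points = normalize([list(u[1:]) for u in users])
labels, centroids = kmeans(points, k=2)

# Treat the cluster with the highest overall activity as the candidate creators.
active_cluster = max(range(len(centroids)), key=lambda c: sum(centroids[c]))
active = [u[0] for u, lab in zip(users, labels) if lab == active_cluster]
print(active)
```

Normalizing first matters here because post counts and hours live on very different scales; without it the largest column would decide the clusters almost alone.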
Content creators in social-networking sites part 1
The next chapters cover the chosen parameters, algorithm and modelling.
It is a fairly obvious statement that the motor of every social-networking site is its content creators. Every owner of a social-networking site knows that what he provides is only a machine, leaving the “stream of life” in the hands (and keyboards) of the most active users. Nothing says more than numbers: my research shows that only 0.5% of all users of my S-N site are responsible for 38% of the content created!
From the business point of view it is critical to have such users. When everybody wants to eat but there is nobody to plant crops, the result is starvation for most of the society. It is also said that valuable content has a magnetism of its own, attracting both users and search engines.
Hunting content creators should be high on the TO DO list after starting a UCC website. Connecting the dots between content creation in S-N sites and my interest in data mining resulted in an idea: use data mining to discover users who might be better-than-average content suppliers.
How to do it, having an 8-year-old Internet board database, full of profile information, with over 3200 users and over 115k posts? How will it affect the life of the community? What is the reliability of the research? And finally, what is the point (where is the money)?
As usual, there are a lot of questions, and the answers are given in terms of probability. The next piece of the picture is revealed in the following part.
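A figure like "0.5% of users write 38% of the content" comes down to a simple calculation over per-user post counts. The sketch below shows one way it could be computed; the function name `top_share` and the post counts are made up for illustration and are not the data from my board.

```python
import math

def top_share(post_counts, fraction):
    """Share of all posts written by the most active `fraction` of users."""
    if not post_counts:
        return 0.0
    ranked = sorted(post_counts, reverse=True)
    # Round the group size up and always include at least one user.
    top_n = max(1, math.ceil(len(ranked) * fraction))
    total = sum(ranked)
    return sum(ranked[:top_n]) / total if total else 0.0

# 200 made-up users: two heavy posters, a few regulars, many near-silent accounts.
counts = [400, 250] + [10] * 18 + [1] * 180
share = top_share(counts, 0.01)  # the top 1% here is 2 users
print(f"{share:.1%}")
```

Running the same function for several fractions (0.5%, 1%, 5%, and so on) gives a quick picture of how concentrated content creation is.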