hunting content creators (4) – data understanding and preparation
The previous part of the Hunting content creators series, part 3, can be found here.
In its initial phase, a social networking site has to be supplied with fuel: content. Like a car engine, it needs a regular delivery of good-quality fuel to work properly. Usually, users are treated equally when it comes to creating content. Sometimes it is possible to become a moderator or admin, but normally "user" is the first and final stage of the ‘site career’.
Identifying the users who are content creators and enabling them to work more efficiently and effectively would let the site develop much more quickly, which is good for both the owner and the users. The main goal of the research is to check whether web mining algorithms are useful for extracting knowledge about users who are potential content creators on social networking sites. The research was done using WEKA ver. 3.6.2 on Ubuntu Linux 8.10 (2.6.27-15-generic kernel), with a database from an internet board: 3200 users, 120k messages, PhpBB by Przemo technology.
The board under study runs on PhpBB by Przemo and has 3200 users and 118,000 messages. The data is stored in a MySQL database, in a table named “users” that holds information about users.
20 parameters describing users have been chosen:
- user ID (ascending with registration date),
- account status (activated / not activated / deactivated, i.e. used but no longer active for some reason),
- username,
- session time,
- last visit,
- registration date,
- number of posts,
- interface language (Polish, English),
- avatar type (avatar from board, external source, none),
- e-mail,
- user webpage,
- user location (normally a city or town, but the creativity of users is infinite),
- signature,
- instant messenger number,
- interests,
- birth date,
- sex,
- total time spent on the forum,
- number of visits.
Some parameters were removed from the research:
- user level,
- private messages details,
- other functions of forum script,
- additional profile fields,
- data about restrictions, etc.
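The selection described above amounts to keeping a fixed set of columns and dropping the rest. A minimal sketch in Python, assuming the rows of the “users” table have already been read as dictionaries. The column names (`user_id`, `user_posts`, `user_level` and so on) and the sample row are hypothetical and only illustrate the idea; they are not taken from the actual board:

```python
# Hypothetical column names; the real "users" table may use different ones.
KEPT_COLUMNS = [
    "user_id", "username", "user_regdate",
    "user_posts", "user_from", "user_sig",
]

def select_columns(rows, kept=KEPT_COLUMNS):
    """Return copies of the rows that contain only the kept columns, in a fixed order."""
    return [{name: row.get(name, "") for name in kept} for row in rows]

# Made-up sample row; "user_level" stands for a parameter removed from the research.
sample = [{
    "user_id": 1, "username": "anna", "user_regdate": 1199145600,
    "user_posts": 42, "user_from": "Gdansk", "user_sig": "hello",
    "user_level": 0,
}]
selected = select_columns(sample)
```

Keeping the list of columns in one place makes it easy to rerun the export with a different selection in a later iteration.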
During preparation, the data was exported to CSV format, which is supported by WEKA. Some difficulties were encountered during the process:
- the CSV file had to be edited with an external tool,
- EOL characters were not accepted in the signature / from fields,
- the ‘ character was not accepted in the signature / from fields.
Data understanding and preparation is a very important process in business data mining. Under an iterative management model, it has to be repeated multiple times to get the best results.
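One possible way to work around these problems is to clean the free-text fields before writing the CSV file. The sketch below is an assumption, not the procedure actually used in this research: it replaces line breaks with spaces and drops apostrophes (the straight one and both typographic variants). The helper names `clean_text` and `write_csv`, the column names and the sample row are all hypothetical:

```python
import csv
import io
import re

def clean_text(value):
    """Make a free-text field safe for a one-record-per-line CSV file."""
    # Collapse any run of whitespace, including CR/LF, into a single space.
    value = re.sub(r"\s+", " ", value).strip()
    # Drop the straight apostrophe and both typographic single quotes.
    for quote in ("'", "\u2018", "\u2019"):
        value = value.replace(quote, "")
    return value

def write_csv(rows, fieldnames, text_fields):
    """Return CSV text for the rows, cleaning the listed free-text fields first."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        cleaned = dict(row)
        for name in text_fields:
            cleaned[name] = clean_text(str(cleaned.get(name, "")))
        writer.writerow(cleaned)
    return buffer.getvalue()

# Made-up row with the two kinds of problematic characters.
rows = [{"user_id": 7,
         "user_from": "Krak'ow\r\nsomewhere",
         "user_sig": "line one\nline two"}]
csv_text = write_csv(rows, ["user_id", "user_from", "user_sig"],
                     ["user_from", "user_sig"])
```

Dropping characters loses a little information from the signature and location fields, so whether this is acceptable depends on how those fields are used later in the analysis.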