web mining, web content mining

Weka

Posted on May 20th, 2010 under data mining, definitions, hunting content creators, projects, web mining by wash

Weka is a collection of algorithms commonly used in data mining. It offers both a graphical and a command-line interface; the latter is probably more useful for complicated projects, but for mine the simple Explorer was enough. You can also use it from your own Java code. Weka contains tools for data preparation (normalization, discretization and a bunch of others), classification, clustering, regression and association rules, not to mention well-developed visualization.
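To make the two preparation steps named above a bit more concrete, here is a minimal Python sketch of what normalization and discretization do to a numeric column. It is only an illustration under my own assumptions: the function names `normalize` and `discretize` are made up for this post, the binning is plain equal-width binning, and none of this is Weka's actual code.

```python
def normalize(values):
    """Min-max scale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def discretize(values, bins=3):
    """Equal-width binning: map each number to a bin index 0..bins-1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / bins
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), bins - 1) for v in values]


ages = [18, 25, 40, 62, 80]  # hypothetical example column
print(normalize(ages))
print(discretize(ages, bins=3))
```

In the Explorer the same kind of transformation is applied by picking a filter in the preprocessing step, so no code is needed there; the sketch is just to show the idea.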


Weka - explorer window

I very much enjoyed working with Weka. After some struggling with the input data format (I used CSV) and a little practice, a wide range of possibilities opened up. I used Weka in a Unix environment, Ubuntu 8.1.

The basic data format for Weka is ARFF (Attribute-Relation File Format), a plain ASCII text format. It describes instances that share a set of attributes. You can also choose another file format (.names, .data (C4.5), .csv, .libsvm, .dat, .bsi, .xrff), which happens most of the time, at least at the beginning of a project, when you have a lot of data from external sources such as MySQL databases or Excel.
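Since the input format was my main struggle, here is a rough Python sketch of how a small CSV file maps onto ARFF: a `@relation` line, one `@attribute` line per column, and the rows after `@data`. The helper `csv_to_arff` and the relation name are my own inventions for illustration. The sketch assumes a header row, no missing values, and names and values without spaces or commas; a column is treated as numeric only if every value parses as a number, otherwise as nominal.

```python
import csv
import io


def is_number(text):
    """Return True if the text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def csv_to_arff(csv_text, relation="data"):
    """Convert CSV text with a header row into ARFF text."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    lines = ["@relation " + relation, ""]
    for i, name in enumerate(header):
        column = [row[i] for row in data]
        if all(is_number(v) for v in column):
            lines.append("@attribute %s numeric" % name)
        else:
            # Nominal attribute: list the distinct values in sorted order.
            values = ",".join(sorted(set(column)))
            lines.append("@attribute %s {%s}" % (name, values))
    lines += ["", "@data"]
    lines += [",".join(row) for row in data]
    return "\n".join(lines) + "\n"


example = "age,label\n30,yes\n45,no\n"  # hypothetical data
print(csv_to_arff(example, relation="people"))
```

Weka can load CSV directly as well, so a converter like this is only worth it when you want control over the attribute types.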

There are some functions worth mentioning, such as the various kinds of filters, supervised or unsupervised: adding jitter or other kinds of random “pollution”, randomization, sampling and standardization. It is possible to use Perl commands or to visualize datasets in many ways. At any moment you can check the log to find out what is happening inside, or check memory use; log output also appears in the Ubuntu console from which the program was started.
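For readers who have not met these filters before, the following Python sketch shows roughly what standardization, randomization and sampling mean for a dataset. The function names and the fixed seed are assumptions of mine, and the standardization here divides by the population standard deviation; Weka's own filters may differ in such details.

```python
import random
import statistics


def standardize(values):
    """Rescale numbers to zero mean and unit (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def randomize(rows, seed=42):
    """Return the rows in a shuffled order; the same seed gives the same order."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def sample(rows, fraction, seed=42):
    """Draw a random subset of the rows without replacement."""
    k = round(len(rows) * fraction)
    return random.Random(seed).sample(list(rows), k)


rows = list(range(10))  # stand-in for dataset instances
print(standardize([1, 2, 3]))
print(randomize(rows))
print(sample(rows, 0.3))
```

Fixing the seed keeps an experiment repeatable, which matters when you compare classifiers on the same shuffled data.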

I have to mention the “dancing” Kiwi that appears while an algorithm is running. It is a strange feeling when you have to watch it for a couple of hours.

Dancing Kiwi
