I guess a quick recap is in order, then. For the most part, the program is functional and can do basic classification. It uses a two-step process. Well, three, I guess. First, it parses a document and keeps counts of all the terms that appear in it, but that's not as interesting as the other parts. Once it has the document information, it calculates weights for the terms to generate scores for the document (the second part). It can then use those scores to classify it (the last part).
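The first step is simple enough to sketch. Something like the following (a minimal illustration, not the actual code -- real tokenization would handle punctuation, casing, and so on more carefully):

```python
from collections import Counter

def count_terms(text):
    """Step 1: parse a document and count how often each term appears.

    Hypothetical sketch: naive whitespace tokenization after lowercasing.
    """
    return Counter(text.lower().split())
```

The `Counter` for each document is what the later weighting and classification steps consume.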
At the moment, though, the classification is very basic: just a max check over a weighted sum of scores. Ideally, it should do something more useful, but the program will need a few changes to the weighting section first.
Most notably, it currently generates a weight for each class per term; that is, each term has a weight for every class. This makes intuitive sense (classes described as relations between the terms they contain), but I think it needs to be inverted: each class should have a weight for every term. It's the same information, but storing it that way essentially makes each class an N-dimensional object (N being the number of relevant terms), which will let the program do some clustering on the classes. That should make it far more powerful (it could find classes that relate to each other, or create its own classes from scratch, for example) and allow more useful classification methods.
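To make the idea concrete, here is a rough sketch of what the inverted layout enables (again, illustrative names only). Each class becomes a point in term-space, and then any standard vector comparison, such as cosine similarity, can say how related two classes are:

```python
import math

def class_vector(class_weights, vocab):
    """Turn one class's {term: weight} map into an N-dimensional vector,
    with one component per relevant term (vocab fixes the ordering)."""
    return [class_weights.get(t, 0.0) for t in vocab]

def cosine(u, v):
    """Cosine similarity between two class vectors: 1.0 means identical
    direction, 0.0 means no shared terms."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0
```

Once classes are vectors, clustering algorithms (k-means, hierarchical, and so on) apply directly, which is what opens the door to discovering related classes or inventing new ones.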
Once that's taken care of, the bulk of the work will just be testing. The clustering and classification parts could use many different algorithms, each with adjustable parameters. The program itself also has a few parameters that dictate what constitutes a 'relevant' term (currently based on global frequency across all documents and IDF), and those will need adjusting. Fortunately, Google Books has kindly provided a large database of books (with OCR data), so we have a good foundation for training and testing the various methods. Our thanks to them for such a great resource.
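The relevance cut-off might look something like this (parameter names and thresholds are mine, not the program's actual ones): drop terms that appear in too few documents, and optionally cap the IDF so vanishingly rare terms don't dominate.

```python
import math

def relevant_terms(doc_freq, n_docs, min_df=2, max_idf=None):
    """Hypothetical relevance filter.

    doc_freq: {term: number of documents containing the term}.
    Keep terms seen in at least min_df documents; if max_idf is set,
    also drop terms whose IDF (log of n_docs / doc frequency) exceeds it.
    """
    kept = set()
    for term, df in doc_freq.items():
        if df < min_df:
            continue
        idf = math.log(n_docs / df)
        if max_idf is not None and idf > max_idf:
            continue
        kept.add(term)
    return kept
```

These are exactly the knobs that a large corpus like the Google Books data makes it practical to tune empirically.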
Let's hope I can provide more frequent updates in the future.