Monday, May 25, 2009

Ruts

I've been hammering out issues with the simple classification schemes over the past couple of weeks. There were a few logic errors and odd implementation choices here and there that had to be straightened out, but those are pretty simple to catch and fix. What's trickier is gauging the parameters and the information the terms actually carry.

It's proven surprisingly easy for the current algorithm to get into 'ruts': a few classes come to dominate the results it sorts things into. I've been playing with ways of reducing this effect, such as scaling based on the total number of documents seen (to adjust the cutoffs for irrelevant terms) and on the number of documents sorted into each class (if one class has 500 documents and another has three, the first will have much higher term counts, so that has to be accounted for), and so forth.
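The class-size adjustment above can be sketched roughly like this. The function and data names here are hypothetical stand-ins, not the program's actual code; the idea is just to divide each class's raw term counts by the number of documents in that class before comparing.

```python
def normalized_term_weights(class_term_counts, class_doc_counts):
    """Scale raw term counts by class size so a class holding 500 documents
    doesn't swamp one holding three. All names here are illustrative."""
    weights = {}
    for cls, term_counts in class_term_counts.items():
        n_docs = max(class_doc_counts[cls], 1)
        # average occurrences per document, rather than raw totals
        weights[cls] = {t: c / n_docs for t, c in term_counts.items()}
    return weights

counts = {"1812": {"war": 500, "ship": 120}, "1880": {"war": 3, "rail": 5}}
docs = {"1812": 500, "1880": 3}
w = normalized_term_weights(counts, docs)
# "war" now compares per-document: 1.0 for both classes despite the
# raw counts differing by two orders of magnitude
```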

It can be pretty hard to tell what is and isn't working, though, especially when dealing with a 'class' that's mostly continuous (date). Dates within a few years (or possibly decades) of each other should be similar and grouped together. So if it's picking, say, 1829 for almost everything from 1812 to 1880, that might actually be what we want, especially when it doesn't actually know about every individual date in that range as a separate potential class. Or maybe it's because there's a bug. Or a poor selection of the training set. Or the specific weights and global parameters are off. Or combinations of the above. 

So I'm left wondering whether I should really be focusing on figuring these out, or whether I should move on to something else. Fixing up the clustering so that the program can define its own super-classes that group 'similar' classes together, and make new ones for outliers, could really help. But if there are other issues, adding more complexity might just make things worse down the road.

Speed

Originally, comparisons between documents and other documents and/or classes were intended to be term-by-term: go through all the terms, do some comparisons and calculations on each one in other classes, and get a result. So it made sense to use something like a list to store the terms seen in a given document, since the program was just going to iterate through them anyway. In hindsight, that was a terrible idea.

A few weeks ago, while trying to improve the efficiency of the algorithm, we decided to try placing a hard cap on the terms used in comparisons. While the algorithm may know a lot of terms, any given class only cares about the few terms "characteristic" of it. Keep a relatively small number of only the most discriminatory terms for each class, and compare only between those. This meant we needed quick random access into each document's term list, so we switched over to a hash map.
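The hard cap might look something like this sketch. The scoring scheme and names are assumptions for illustration (the post doesn't say how "discriminatory" is measured); the point is keeping only the top k terms per class in a dict so lookups during comparison stay constant-time.

```python
import heapq

def top_discriminative_terms(term_scores, k=50):
    """Keep only the k highest-scoring terms for a class. term_scores is a
    dict (hash map), so membership checks and lookups during later
    comparisons are O(1) instead of a linear scan."""
    return dict(heapq.nlargest(k, term_scores.items(), key=lambda kv: kv[1]))

# Hypothetical discriminativeness scores for one class's known terms.
scores = {"whale": 9.2, "the": 0.1, "harpoon": 7.5, "and": 0.05}
top = top_discriminative_terms(scores, k=2)
# Comparisons now iterate only over this small dict.
```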

Of course, this sped up everything else, too. Previously, whenever it read a term and wanted to update its count, it had to find the term in the list first, an O(n) operation. Now that it can reach terms much more efficiently, reading and parsing each document is extremely fast. Something like 95% of the running time is now spent in the actual classification instead of wasted in parsing.
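The difference between the two approaches can be shown in a few lines. This is a toy comparison, not the original parser: the list version pays an O(n) scan per token, while the hash-map version updates in expected O(1).

```python
from collections import Counter

def count_terms_list(tokens):
    """List-based counting: find the term before updating, O(n) per token."""
    terms, counts = [], []
    for tok in tokens:
        if tok in terms:                      # linear scan every time
            counts[terms.index(tok)] += 1     # and a second scan here
        else:
            terms.append(tok)
            counts.append(1)
    return dict(zip(terms, counts))

def count_terms_map(tokens):
    """Hash-map counting: expected O(1) per update."""
    return dict(Counter(tokens))

tokens = "the quick the lazy the".split()
assert count_terms_list(tokens) == count_terms_map(tokens)
```

Both produce the same counts; only the cost per update differs, which is exactly why parsing stopped dominating the running time.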

The bottom line is that this makes things much more practical, and makes the algorithms easier to test and tweak, since it no longer takes hours to go through a few dozen documents. We've been able to focus on the methods instead of twiddling our thumbs waiting for the results of the latest tweaks. And, of course, it means the algorithms will be much more practical for large data sets.

I'll see if I can write up a post on some of the results and issues in the tweaks in a little while.

Monday, April 6, 2009

There's been quite a bit of progress since the last post, though unfortunately I haven't kept up with the updates here.

I guess a quick recap is in order, then. For the most part, the program is functional and can do basic classification. It uses a two-step process. Well, three, I guess. It first parses a document and keeps counts of all the terms that appear in it, but that's not as interesting as the other parts. Once it has the document information, it calculates weights for the terms to generate scores for the document (the second step). It can then use those scores to classify it (the last step).

At the moment, though, the classification is very basic: just a simple max check over a weighted sum of scores. Ideally, it needs to do something more useful, but the program will need a few changes to the weighting section first.
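The "simple max check over a weighted sum" can be sketched as follows. The weights and document data are made up for illustration; the structure (score each class as a weighted sum of term counts, then take the argmax) is what the post describes.

```python
def classify(doc_counts, class_weights):
    """Score a document against each class as a weighted sum of its term
    counts, then pick the class with the highest score."""
    def score(weights):
        return sum(weights.get(t, 0.0) * c for t, c in doc_counts.items())
    return max(class_weights, key=lambda cls: score(class_weights[cls]))

# Hypothetical per-class term weights and a parsed document's term counts.
weights = {"1812": {"war": 2.0, "ship": 1.5}, "1880": {"rail": 3.0}}
doc = {"war": 4, "rail": 1}
# "1812" scores 2.0*4 = 8.0; "1880" scores 3.0*1 = 3.0
print(classify(doc, weights))  # 1812
```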

Most notably, it currently generates a weight for each class per term. That is, each term has a weight for every class. This makes intuitive sense (classes described as relations between the terms they contain), but I think it needs to be the inverse of that -- each class has a weight for every term. Same information, but storing it that way will essentially make each class an N-dimensional object (N being the number of relevant terms), which will let the program do some clustering on the classes. That should make it far more powerful (it would be able to find classes that relate to each other, or create its own classes from scratch, for example), and allow more useful classification methods. 
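To make the N-dimensional picture concrete, here is a small sketch under the inverted storage scheme: each class becomes a vector over the relevant vocabulary, and a standard similarity measure (cosine here, which the post doesn't specify and is my assumption) can then group related classes.

```python
import math

def class_vector(weights, vocab):
    """Represent a class as a point in N-dimensional term space
    (N = len(vocab)); terms the class has no weight for get 0."""
    return [weights.get(t, 0.0) for t in vocab]

def cosine(a, b):
    """Cosine similarity between two class vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

vocab = ["war", "ship", "rail"]
v1812 = class_vector({"war": 2.0, "ship": 1.5}, vocab)
v1815 = class_vector({"war": 1.8, "ship": 1.2}, vocab)
v1880 = class_vector({"rail": 3.0}, vocab)
# Nearby classes end up close in term space; outliers stand apart,
# which is what would let the program form super-classes.
assert cosine(v1812, v1815) > cosine(v1812, v1880)
```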

Once that's taken care of, the bulk of the work will just be testing. The clustering and classification parts could use many different algorithms, as well as having adjustable parameters. Indeed, the program itself has a few that dictate what constitutes a 'relevant' term (currently has to do with global frequency over all documents and IDF) which will need to be adjusted. Fortunately, Google Books has kindly provided a large database of books (with OCR data), so we have a good foundation for training and testing the various methods. Our thanks to them for such a great resource. 
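A relevance filter along the lines described (global frequency plus IDF) might look like this. The thresholds are made-up stand-ins for the program's adjustable parameters, and the exact cutoff rules are assumptions.

```python
import math

def relevant_terms(doc_freq, n_docs, min_df=3, max_df_frac=0.5):
    """Keep only 'relevant' terms, judged by global document frequency,
    and weight the survivors by IDF. Thresholds are illustrative."""
    relevant = {}
    for term, df in doc_freq.items():
        if df < min_df or df / n_docs > max_df_frac:
            continue  # too rare or too common to discriminate between classes
        relevant[term] = math.log(n_docs / df)  # IDF weight
    return relevant

# Hypothetical document frequencies over a 1000-document collection.
df = {"the": 1000, "whale": 40, "xyzzy": 1}
idf = relevant_terms(df, n_docs=1000)
# "the" is dropped (appears everywhere), "xyzzy" is dropped (too rare)
```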

Let's hope I can provide updates with more frequency in the future.