Monday, May 25, 2009

Ruts

I've been hammering out issues with the simple classification schemes the past couple weeks. There were a few logic errors and odd implementation choices here and there that had to be straightened out, but those are pretty simple to catch and fix. What's trickier is gauging parameters and information.

It's proven to be surprisingly easy for the current algorithm to get into 'ruts': a few classes come to dominate the results, absorbing most of what it sorts. I've been playing with ways of reducing this impact, such as scaling based on the total number of documents seen (to adjust the cutoffs for irrelevant terms), scaling based on the number of documents sorted into each class (if one class has 500 documents, its term counts will be much higher than a class with three, so that has to be accounted for), and so forth.
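To make that class-size adjustment concrete, here's a minimal sketch in Python (the project itself appears to be C++; the class names and counts here are made up for illustration) of normalizing per-class term counts by the number of documents in each class:

```python
# Hypothetical data: documents seen per class, and raw term counts per class.
class_doc_counts = {"1829": 500, "1850": 3}
class_term_counts = {
    "1829": {"steam": 900, "railway": 450},
    "1850": {"steam": 5, "telegraph": 4},
}

def normalized_counts(cls):
    """Average occurrences of each term per document of the class,
    so a class with 500 documents doesn't swamp one with 3."""
    n_docs = class_doc_counts[cls]
    return {term: count / n_docs
            for term, count in class_term_counts[cls].items()}
```

After normalization, "steam" averages 1.8 occurrences per document for the large class and about 1.67 for the small one, putting the two on a comparable scale.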

It can be pretty hard to tell what is and isn't working, though, especially when dealing with a 'class' that's mostly continuous (date). Dates within a few years (or possibly decades) of each other should be similar and grouped together. So if it's picking, say, 1829 for almost everything from 1812 to 1880, that might actually be what we want, especially when it doesn't actually know about every individual date in that range as a separate potential class. Or maybe it's because there's a bug. Or a poor selection of the training set. Or the specific weights and global parameters are off. Or combinations of the above. 

So I'm left wondering if I should really be focusing on figuring these out, or if I should move on to something else. Fixing up the clustering so that the program can define its own super-classes that group 'similar' classes together and make new ones for outliers could really help. But if there are other issues, adding more complexity might just make things worse down the road.

Speed

Originally, it was intended that comparisons between documents and other documents and/or classes would be a term-by-term thing. Go through all of them, do some comparisons and calculations on that term in other classes, and get a result. So it made sense to use something like a list to store the terms seen in a given document if it was just going to iterate through them anyway. In hindsight, that was a terrible idea.

A few weeks ago, while trying to get some improvements in the efficiency of the algorithm, we decided to try placing a hard cap on terms used in comparisons. While the algorithm may know a lot of terms, any given class only cares about a few terms "characteristic" of it. Keep a relatively small number of only the most discriminatory terms for each class, and only compare between those. This means that quick random access of the term list in each document was needed. So we switched over to a hash map. 

Of course, this sped up everything else, too. Previously, whenever it read a term and wanted to update its count, it first had to find the term in the list, an O(n) operation. Now that it can reach terms much more efficiently, reading and parsing each document is extremely fast. Something like 95% of the running time is now spent in the actual classification instead of being wasted on parsing.
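The difference is easy to see in a sketch (Python here for brevity; the actual implementation sounds like C++). The list version scans for the term on every update, while the hash-map version looks it up in expected constant time:

```python
from collections import Counter

def count_terms_list(tokens):
    """Original approach: terms in a list of (term, count) pairs.
    Each update is an O(n) scan to find the term."""
    counts = []
    for tok in tokens:
        for i, (term, c) in enumerate(counts):
            if term == tok:
                counts[i] = (term, c + 1)
                break
        else:
            counts.append((tok, 1))
    return dict(counts)

def count_terms_hash(tokens):
    """Hash-map approach: each update is an expected O(1) lookup."""
    return dict(Counter(tokens))
```

Both produce the same counts; only the cost per update differs, which adds up quickly over thousands of terms per document.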

The bottom line is that this makes things much more practical, and makes the algorithms easier to test and tweak, since it doesn't take hours to go through a few dozen documents anymore. We've been able to focus on the methods themselves instead of twiddling our thumbs waiting for the results of the latest tweaks. And, of course, it means the algorithms will be far more practical for large data sets.

I'll see if I can write up a post on some of the results and issues in the tweaks in a little while.

Monday, April 6, 2009

There's been quite a bit of progress since the last post, though unfortunately I haven't kept up with the updates here.

I guess a quick recap is in order, then. For the most part, the program is functional and can do basic classification. It uses a two-step process. Well, three, I guess. It first parses a document and keeps counts of all the terms that appear in it, but that's not as interesting as the other parts. Once it has the document information, it calculates weights for the terms to generate scores for the document (the second part). It can then use these to classify it (the last part).

At the moment, though, the classification is very basic; just a simple max check over a weighted sum of scores. Ideally, this needs to do something more useful, but the program will need a few changes to the weighting section first.
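In sketch form (Python for illustration, with hypothetical names), the current classification is nothing more than:

```python
def classify(doc_scores, class_weights):
    """Score each class as a weighted sum of the document's term
    scores, then take a simple max over the totals."""
    totals = {cls: sum(weights.get(term, 0.0) * score
                       for term, score in doc_scores.items())
              for cls, weights in class_weights.items()}
    return max(totals, key=totals.get)
```

Any term a class has no weight for simply contributes nothing to that class's total.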

Most notably, it currently generates a weight for each class per term. That is, each term has a weight for every class. This makes intuitive sense (classes described as relations between the terms they contain), but I think it needs to be the inverse of that -- each class has a weight for every term. Same information, but storing it that way will essentially make each class an N-dimensional object (N being the number of relevant terms), which will let the program do some clustering on the classes. That should make it far more powerful (it would be able to find classes that relate to each other, or create its own classes from scratch, for example), and allow more useful classification methods. 
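With each class stored as a term-to-weight vector, comparing classes becomes an ordinary vector operation. A sketch (Python, with made-up classes and weights) using cosine similarity as one possible distance measure:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term->weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Each class is a point in N-dimensional term-space.
classes = {
    "1820s": {"steam": 0.9, "coach": 0.4},
    "1830s": {"steam": 0.8, "railway": 0.5},
    "1900s": {"motor": 0.9, "telephone": 0.6},
}
```

Classes that score highly similar could be merged into a super-class, while a document far from every existing class could seed a new one.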

Once that's taken care of, the bulk of the work will just be testing. The clustering and classification parts could use many different algorithms, as well as having adjustable parameters. Indeed, the program itself has a few that dictate what constitutes a 'relevant' term (currently has to do with global frequency over all documents and IDF) which will need to be adjusted. Fortunately, Google Books has kindly provided a large database of books (with OCR data), so we have a good foundation for training and testing the various methods. Our thanks to them for such a great resource. 

Let's hope I can provide updates with more frequency in the future.

Thursday, November 27, 2008

It's alive... mostly

Though this is the first blog post, I'm just going to jump in on the latest progress of the project, and try and fill in the general overview of what we're trying to do later. It might be confusing in the future, but let's see how it goes.

I guess the first thing to say is that the issue with zero term weights was caused by the inverse index not functioning properly. An annoying issue with local variables getting destroyed, it seemed, but switching it to hold document indices and retrieving them that way didn't work either. I'm not entirely sure what the deal is there, but for the moment I've worked around it -- instead of computing weights only using documents the term appears in, it uses all of them. Not quite as efficient (some zero weights are now expected), but it gets some leniency for actually working.

It did, however, highlight another issue with the implementation -- incremental updates fail on zero weights, specifically when the IDF is zero, since the current weighting function is multiplied by the IDF. All it stores is the weight and the IDF, so once the IDF hits zero (i.e., the term appears in every document seen so far), there's absolutely no way to correctly update the weight if the term isn't also in the next document. A simple solution was to have it store the total number of occurrences of the term in documents of that class, rather than the weight. When needed, the weight is a simple count * IDF computation, and this can be properly updated as documents come in. Later on, more complicated weighting and classification schemes might require more information than just the class frequency count, but as long as it doesn't need too much, a struct should be fine.
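The fix can be sketched like this (Python for illustration; the real code presumably uses the C++ struct mentioned above, and the exact IDF formula here is an assumption):

```python
import math

class TermStats:
    """Store the raw per-class occurrence count instead of the weight,
    so the weight can always be recomputed as count * IDF."""
    def __init__(self):
        self.count = 0  # occurrences in documents of this class

    def update(self, occurrences):
        self.count += occurrences  # safe to update even while IDF is 0

def idf(docs_with_term, total_docs):
    return math.log(total_docs / docs_with_term)

def weight(stats, docs_with_term, total_docs):
    return stats.count * idf(docs_with_term, total_docs)
```

A term that has appeared in every document so far has IDF 0 and thus weight 0, but its count keeps accumulating; as soon as a document arrives without it, the IDF (and hence the weight) becomes nonzero again with no information lost.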

With those taken care of, the program can compute classification information in a 'learning' mode (with provided classes), save the smaller amount of information it will need to a file, then load that to perform more learning or actual classification. Since the updates and classification are incremental, the order of the documents is important, but that's largely true for people as well.

It doesn't classify the basic test case I've been checking correctly, but since the current goal is getting the program working, that's a minor issue. I'll add bound checks on IDF, counts, and maybe weights to narrow the terms that actually contribute to the classification to a more useful band, but that will probably be the end for this semester, as far as the algorithm goes -- the program could really use more usability enhancements before we start testing algorithms and running it on larger data sets.

So that makes the next step adding those in. Currently, it takes all the documents (and provided classes) on the command line. Allowing it to read those from a file (or stdin which can be piped from a file) or taking a directory and running on everything in it are pretty much necessary in order for it to be useful on databases of hundreds or thousands of books. Similarly, some tweaks regarding the knowledge database it keeps would be nice. Getting this done before the end of the semester shouldn't be an issue, so the next semester can be focused on the weighting and classification algorithm.

That's all for now.