I've been hammering out issues with the simple classification schemes over the past couple of weeks. There were a few logic errors and odd implementation choices here and there that had to be straightened out, but those are pretty simple to catch and fix. What's trickier is gauging parameters and information.
It's proven surprisingly easy for the current algorithm to get into 'ruts', where a few classes dominate the results of whatever it sorts. I've been playing with ways of reducing this impact, such as scaling based on the total number of documents seen (to adjust the cutoffs for irrelevant terms), scaling based on the number of documents sorted into each class (if one class has 500 documents, its term counts will be much higher than a class with three, so that has to be accounted for), and so forth.
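To make the class-size scaling concrete, here's a minimal sketch of the idea. The class names, counts, and smoothing constant are all made up for illustration; the point is just that dividing a term's count by the size of its class (with a little additive smoothing) keeps a 500-document class from swamping a 3-document one:

```python
from collections import Counter

# Hypothetical data: documents sorted into each class so far, and raw
# per-class term counts. Both are assumptions, not real project numbers.
class_doc_counts = {"treaty": 500, "letter": 3}
class_term_counts = {
    "treaty": Counter({"whereas": 900, "ship": 40}),
    "letter": Counter({"whereas": 2, "ship": 5}),
}

def normalized_weight(term: str, cls: str, smoothing: float = 1.0) -> float:
    """Raw term count divided by class size, with additive smoothing
    so unseen terms don't go to zero."""
    count = class_term_counts[cls][term]
    return (count + smoothing) / (class_doc_counts[cls] + smoothing)

# Raw counts favor "treaty" 40-to-5; normalized weights tell the
# opposite (and more honest) story about which class "ship" signals.
print(normalized_weight("ship", "treaty"))  # ≈ 0.082
print(normalized_weight("ship", "letter"))  # 1.5
```

The same normalization can feed into the irrelevant-term cutoffs: a cutoff on normalized frequency moves with the class size instead of being drowned out by it.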
It can be pretty hard to tell what is and isn't working, though, especially when dealing with a 'class' that's mostly continuous (date). Dates within a few years (or possibly decades) of each other should be similar and grouped together. So if it's picking, say, 1829 for almost everything from 1812 to 1880, that might actually be what we want, especially since it doesn't know about every individual date in that range as a separate potential class. Or maybe it's a bug. Or a poor selection of the training set. Or the specific weights and global parameters are off. Or some combination of the above.
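One way to make "1829 for everything" measurable rather than a judgment call is to score date predictions by distance instead of exact match. This is only a sketch of that idea, with bin width and tolerance picked arbitrarily, not anything the current code does:

```python
# Hypothetical helpers: coarsen years into decade bins so nearby dates
# count as the same class, and score predictions as "close enough"
# rather than exactly right. Width and tolerance are arbitrary choices.

def decade_bin(year: int, width: int = 10) -> int:
    """Map a year to the start of its bin, e.g. 1829 -> 1820."""
    return year - (year % width)

def within_tolerance(predicted: int, actual: int, tolerance: int = 20) -> bool:
    """A date guess 'counts' if it lands within the tolerance window."""
    return abs(predicted - actual) <= tolerance

print(decade_bin(1829))              # 1820
print(within_tolerance(1829, 1812))  # True  (17 years off)
print(within_tolerance(1829, 1880))  # False (51 years off)
```

Tracking mean absolute error in years alongside the tolerance score would also separate "the classifier is in a rut" from "the classifier is reasonably coarse": a rut shows up as low error inside one window and huge error outside it.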
So I'm left wondering whether I should really be focusing on figuring these out, or whether I should move on to something else. Fixing up the clustering so that the program can define its own super-classes, grouping 'similar' classes together and making new ones for outliers, could really help. But if there are other issues, adding more complexity might just make things worse down the road.
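For what the super-class idea might look like, here's a rough sketch under my own assumptions (the class names, profiles, threshold, and greedy strategy are all invented for illustration): compare each class's term-count profile against existing groups by cosine similarity, join the first group that's similar enough, and otherwise start a new group, which is where outliers end up:

```python
from math import sqrt

def cosine(a: dict, b: dict) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    shared = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in shared)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def super_classes(profiles: dict, threshold: float = 0.8) -> list:
    """Greedy single-pass grouping: attach each class to the first
    super-class whose representative (first member) is similar enough,
    else start a new super-class for it."""
    groups: list[list[str]] = []
    for name, vec in profiles.items():
        for group in groups:
            if cosine(profiles[group[0]], vec) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups

# Invented profiles: two land-record-like classes and one outlier.
profiles = {
    "deed": {"grantor": 10, "parcel": 8},
    "land_grant": {"grantor": 9, "parcel": 7, "acre": 2},
    "ballad": {"chorus": 12, "verse": 9},
}
print(super_classes(profiles))  # [['deed', 'land_grant'], ['ballad']]
```

Even a crude version like this would at least show whether the classes the program confuses are the ones a human would call similar, which bears directly on the bug-versus-expected-coarseness question above.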