Though this is the first blog post, I'm just going to jump straight into the latest progress of the project and try to fill in the general overview of what we're trying to do later. It might be confusing in the future, but let's see how it goes.
I guess the first thing to say is that the issue with zero term weights was caused by the inverse index not functioning properly. It seemed to be an annoying issue with local variables getting destroyed, but switching the index to hold document indices and retrieving documents that way didn't work either. I'm not entirely sure what the deal is there, but for the moment I've worked around it -- instead of computing weights using only the documents a term appears in, it uses all of them. Not quite as efficient (some zero weights are now expected), but it gets some leniency for actually working.
It did, however, highlight another issue with the implementation -- incremental updates fail on zero weights, specifically when the IDF is zero, since the current weighting function multiplies by the IDF. Because all the program stores is the weight and the IDF, if the IDF becomes zero (i.e., the term appears in all currently seen documents), there's no way to correctly update the weight if the term isn't also in the next document. A simple solution was to have it store the total number of occurrences of the term for documents of that class, rather than the weight. When needed, the weight is a simple count * IDF computation, and the count can be properly updated as documents come in. Later on, more complicated weighting and classification schemes might require more information than just the class frequency count, but as long as it doesn't need too much, a struct should be okay.
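The count-instead-of-weight idea can be sketched roughly like this (all names here are hypothetical, not the project's actual identifiers; the weighting is a bare count * IDF as described above, under the assumption that IDF is computed over all documents seen so far):

```cpp
#include <cmath>
#include <string>
#include <unordered_map>

// Hypothetical sketch: per-term statistics store the raw per-class
// occurrence count instead of a precomputed weight, so a zero IDF
// never destroys the information needed for later updates.
struct TermStats {
    std::unordered_map<std::string, int> classCounts; // term occurrences per class
    int docsContaining = 0;                           // documents the term appears in
};

// IDF over all documents seen so far; zero when the term has
// appeared in every document.
double idf(const TermStats &t, int totalDocs) {
    if (t.docsContaining == 0) return 0.0;
    return std::log(static_cast<double>(totalDocs) / t.docsContaining);
}

// The weight is recomputed on demand as count * IDF, so updates
// only ever touch the counts.
double weight(const TermStats &t, const std::string &cls, int totalDocs) {
    auto it = t.classCounts.find(cls);
    if (it == t.classCounts.end()) return 0.0;
    return it->second * idf(t, totalDocs);
}

// Incremental update for one incoming document of a known class.
void update(TermStats &t, const std::string &cls, int occurrences) {
    t.classCounts[cls] += occurrences;
    t.docsContaining += 1;
}
```

Even if the term shows up in every document so far (IDF of zero), the counts survive, so a later document without the term revives a nonzero weight correctly.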
With those taken care of, the program can compute classification information in a 'learning' mode (with provided classes), save the smaller amount of information it needs to a file, then load that file to perform more learning or actual classification. Since the updates and classification are incremental, the order of the documents matters, but that's largely true for people as well.
For the basic test case I've been checking it on, though, it doesn't classify the document correctly; since the current goal is just getting the program working, that's a minor issue. I'll add bounds checking on IDF, counts, and maybe weights to narrow the set of terms that actually contribute to the classification to a more useful band, but that will probably be the end for this semester as far as the algorithm goes -- the program could really use more usability enhancements before we start testing algorithms and running it on larger data sets.
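The bounds checking could be as simple as a predicate gating which terms contribute. This is a minimal sketch with made-up threshold values -- the actual useful band would have to come out of testing:

```cpp
// Hypothetical filter: a term only contributes to classification if
// its IDF and count fall inside a useful band. The thresholds here
// are illustrative placeholders, not tuned values.
struct Bounds {
    double minIdf = 0.1; // drop terms appearing in nearly every document
    double maxIdf = 5.0; // drop extremely rare, likely-noise terms
    int minCount = 2;    // drop one-off occurrences
};

bool contributes(double idf, int count, const Bounds &b) {
    return idf >= b.minIdf && idf <= b.maxIdf && count >= b.minCount;
}
```

The nice side effect is that a zero-IDF term (one in every document) is excluded automatically, which fits with the zero-weight issues above.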
So that makes the next step adding those in. Currently, the program takes all the documents (and provided classes) on the command line. Allowing it to read those from a file (or from stdin, which can be piped from a file), or to take a directory and run on everything in it, is pretty much necessary for it to be useful on databases of hundreds or thousands of books. Similarly, some tweaks to the knowledge database it keeps would be nice. Getting this done before the end of the semester shouldn't be an issue, so next semester can focus on the weighting and classification algorithm.
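Reading the document list from a stream covers both the file and the piped-stdin cases with the same code. A rough sketch, assuming a hypothetical one-entry-per-line format of "path class", with the class optional so the same input works for plain classification:

```cpp
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical input format: one "path class" pair per line, with the
// class omitted when the document is to be classified rather than
// learned from. Taking a std::istream lets the caller pass either an
// opened std::ifstream or std::cin.
struct DocEntry {
    std::string path;
    std::string cls; // empty when no class was provided
};

std::vector<DocEntry> readDocList(std::istream &in) {
    std::vector<DocEntry> docs;
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty()) continue;
        std::istringstream fields(line);
        DocEntry e;
        fields >> e.path >> e.cls; // cls stays empty if the field is absent
        docs.push_back(e);
    }
    return docs;
}
```

Usage would then be something like `readDocList(std::cin)` when input is piped, or constructing an `std::ifstream` from a command-line filename and passing that instead.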
That's all for now.