Monday, May 25, 2009

Speed

Originally, comparisons between documents and other documents and/or classes were intended to be term-by-term: iterate over every term, run some comparisons and calculations against that term in other classes, and accumulate a result. Since the terms seen in a given document were only ever going to be iterated over anyway, storing them in a list seemed reasonable. In hindsight, that was a terrible idea.

A few weeks ago, while trying to improve the efficiency of the algorithm, we decided to try placing a hard cap on the number of terms used in comparisons. While the algorithm may know a lot of terms, any given class only cares about the few terms "characteristic" of it. Keep a relatively small number of only the most discriminative terms for each class, and only compare those. This requires quick random access to the terms of each document, so we switched the term list over to a hash map.
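The per-class cap described above can be sketched roughly as follows. This is a hypothetical illustration, not the project's actual code: the function name and the use of raw frequency as the score are my assumptions, since the post only says the "most discriminatory" terms are kept, without specifying the scoring function.

```python
from collections import Counter

def top_terms(class_term_counts, k=50):
    """Keep only the k highest-scoring terms for one class.

    class_term_counts: Counter mapping term -> count for the class.
    Scoring here is plain frequency for simplicity; the real algorithm
    would rank by some measure of how discriminative each term is.
    """
    return dict(class_term_counts.most_common(k))

counts = Counter({"dog": 10, "cat": 7, "the": 90, "walk": 3})
kept = top_terms(counts, k=2)  # only the top 2 terms survive the cap
```

Comparisons between a document and a class then only touch the capped set, rather than every term the algorithm has ever seen.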

Of course, this sped up everything else, too. Previously, whenever the parser read a term and wanted to update its count, it had to find the term in the list first, an O(n) operation. Now that terms can be reached in O(1) on average, reading and parsing each document became extremely fast. Something like 95% of the running time is now spent in the actual classification instead of being wasted in parsing.
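The difference between the two counting strategies can be shown in a few lines. This is a generic sketch of the data-structure change, not the project's source; the function names are made up for illustration.

```python
# Before: a list of [term, count] pairs. Every update scans the
# list looking for the term, so each update is O(n).
def count_terms_list(tokens):
    counts = []
    for tok in tokens:
        for pair in counts:
            if pair[0] == tok:
                pair[1] += 1
                break
        else:
            counts.append([tok, 1])
    return counts

# After: a hash map. Each update is an O(1) lookup on average,
# so parsing a document is linear in its length.
def count_terms_dict(tokens):
    counts = {}
    for tok in tokens:
        counts[tok] = counts.get(tok, 0) + 1
    return counts

tokens = "the cat sat on the mat the cat".split()
```

Both produce the same counts; only the cost per update changes, which is why the switch sped up parsing across the board.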

The bottom line is that this makes things much more practical, and much easier to test and tweak, since it no longer takes hours to go through a few dozen documents. We've been able to focus on the methods themselves instead of twiddling our thumbs waiting for the results of the latest tweak. And, of course, it means the algorithms will scale far better to large data sets.

I'll see if I can write up a post on some of the results and issues in the tweaks in a little while.
