Hacker News
Ask/Poll News.YC: What is a good open source Bayes classifier?
35 points by fiaz on Feb 26, 2008 | hide | past | favorite | 36 comments
I'm just doing a little research for a project, and I thought everybody could benefit from the fact that we happen to have an expert on the subject as a member.

I would also like to learn what other people have used, so any input others could give would be helpful.

------------

UPDATE: I started a poll thread below so that we can submit a project link and up-vote a link if we have experience or knowledge over one project versus another.

Thank you to everybody who replied, you are great!!!



Whichever one you use, you'll probably need to edit the source to change the probability formula from (n_hits / n_total) to ((n_hits + 1) / (n_total + 2)). That's the correct formula based on an even distribution of probabilities (which is close enough to the actual distribution in most situations for this to be a huge improvement). I can never find a reference for this when I search for one, but you can verify it experimentally with a short program in your favorite dynamically-typed language, or a long program.

For example, if you live in a world with only black and white birds, but you don't know the percentage of each and have no reason to believe it's more likely 2% black than 70% black, or any other percentage, if you see two black birds fly by, that doesn't mean the next bird you see has a 100% probability of being black, but that's exactly the assumption most widely-used naive Bayesian classifiers make.

I modified SpamProbe to use (spam+1)/(total+2), and the results have been good.
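For the skeptical, here's a rough Monte Carlo sketch of the verification in Python (my own function names; it assumes a uniform prior over the unknown proportion, which is the "even distribution of probabilities" above):

```python
import random

def next_success_rate(n=2, k=2, trials=200_000, seed=0):
    """Monte Carlo check of the (k + 1) / (n + 2) rule.

    Draw a true success probability p uniformly at random, simulate n
    observations, and among runs that produced exactly k successes,
    measure how often observation n + 1 is also a success."""
    rng = random.Random(seed)
    matched = successes_after = 0
    for _ in range(trials):
        p = rng.random()  # uniform prior over the unknown proportion
        k_observed = sum(rng.random() < p for _ in range(n))
        if k_observed == k:
            matched += 1
            if rng.random() < p:  # the next observation
                successes_after += 1
    return successes_after / matched

# Two black birds out of two: the rule predicts (2 + 1) / (2 + 2) = 0.75,
# not 1.0, and the simulation lands close to that.
rate = next_success_rate(n=2, k=2)
```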


Sounds to me like you are describing Laplace smoothing in a naive Bayes classifier. This is a pretty standard technique for avoiding the problem you're seeing, where the probability comes out as 100% or 0% because of a lack of information in the model.


More generally, it's equivalent to applying a Dirichlet prior distribution, with uninformative parameters (i.e. all of the parameters on the Dirichlet are equal).

This is important, because while adding a single pseudocount to each column will prevent zero divisions, it's probably not reflective of the true distribution of values. If instead, you add pseudocounts using a Dirichlet where the parameters are set based on some prior knowledge, you can often improve the performance of the classifier (especially in low-count situations), without biasing the results unfairly.
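As a concrete sketch (Python, with my own names): the general pseudocount estimate is (count_i + alpha_i) / (total count + total alpha), and setting every alpha to 1 recovers the Laplace case above.

```python
from collections import Counter

def dirichlet_smoothed(counts, alphas):
    """Posterior-mean estimates under a Dirichlet prior:
    (count_i + alpha_i) / (total_count + total_alpha).
    Equal alphas give an uninformative prior; alpha_i = 1 for every
    category is exactly Laplace smoothing."""
    denom = sum(counts.values()) + sum(alphas.values())
    return {k: (counts.get(k, 0) + a) / denom for k, a in alphas.items()}

counts = Counter({"black": 2})  # two black birds seen, no white ones
probs = dirichlet_smoothed(counts, {"black": 1, "white": 1})
# probs["black"] == 0.75, probs["white"] == 0.25 -- no zero probabilities
```

Unequal alphas are where the prior knowledge comes in: if you had reason to believe black birds are rarer, you'd give "black" a smaller alpha.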


Thanks, yes, that's what it's called.

All due credit to Laplace for the technique, but the word "smoothing" makes me wince, because it makes it sound as though this is some artificial approximation. For the assumption of an even distribution of probabilities, (n+1)/(m+2) really _is_ the exact probability of the event repeating. Like I said, you can confirm this experimentally with a quick program.


There are other smoothing techniques, more prevalent in NLP, that I'm learning about in my NLP class, which distribute probability mass more evenly but I don't think will really help in a classifier. Witten-Bell and Good-Turing are the ones I can think of off the top of my head.


What sort of classification are you trying to do? Text, I assume. If it's text and you need something open source, or just a pointer to how to write one yourself, you could read the article I wrote for Dr Dobbs on this subject: http://www.ddj.com/development-tools/184406064

There are quite a lot of toolkits out there that do Bayesian things (take a look at libbow or Weka).


I'm looking for something open source that I can start using right away with minimal configuration... I'm not smart enough to create my own, but clever enough to take somebody else's work and tailor it to what I'm trying to do.

It's only for text processing.


http://www.codeproject.com/KB/cs/BayesClassifier.aspx

This is a simple one built in C#/.NET. I fixed a significant bug and raised classification accuracy from 74% to 96% on my noisy dataset (automobile accident claims). I emailed the author with a bunch of improvements (such as histograms) and some other tweaks, but never heard back. Anyway, the bugfix is simple: take a look at category.cs. In TeachPhrase(), move m_TotalWords++ inside the test for "if (!m_Phrases.TryGetValue(phrase, out pc))".

What you want here is to count the number of unique words. The original code was counting the total number of times all words appear. This one change reduced classification errors by 3x.

Cheers, --Jack
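The same fix, sketched in Python rather than the original C# (hypothetical names; the point is just where the unique-word counter gets incremented):

```python
def teach_phrase_counts(words):
    """Count occurrences per word plus the number of *unique* words.
    The bug described above incremented the total on every occurrence,
    instead of only the first time a word is seen."""
    phrase_counts = {}
    total_unique = 0
    for w in words:
        if w not in phrase_counts:
            phrase_counts[w] = 0
            total_unique += 1  # increment only when the word is new
        phrase_counts[w] += 1
    return phrase_counts, total_unique

counts, unique = teach_phrase_counts(["crash", "claim", "crash"])
# unique == 2 (not 3), counts["crash"] == 2
```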


Oh, forgot to mention that you'll want to compute probabilities using logarithms to avoid underflow precision loss (which will introduce classification errors). Example: wordValue = System.Math.Log((double)count / (double)cat.TotalWords);
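The same trick in Python, for illustration: sum log-probabilities instead of multiplying the raw values, since the product of many small floats underflows to zero.

```python
import math

word_probs = [1e-5] * 80  # 80 words, each with a small probability

naive_product = math.prod(word_probs)             # underflows to 0.0
log_score = sum(math.log(p) for p in word_probs)  # finite, no underflow
```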


The Cadillac of Bayes classifiers is CRM114: it can use classifiers that are far more advanced than naive Bayes, such as clustering or hidden Markov modeling.


I can't recommend crm114 enough. I've been using it to classify some database entries, and its accuracy is second to none. Its custom language makes working with strangely-stored data (like database entries) easy, once you learn the strange language.


Is there any documentation that stands out for learning the alien language?

No doubt, I'll be combing through all of the CRM114 information on the website. Is there anything that is not referenced there that will be of use?


If you are doing a ham/spam type classification, then you won't need the alien language. I am almost a total tech novice and I was able to do well with just some bash scripts. Of course the docs will teach you about better ways to train the system, if you are interested in going from 98% correct classification to 99.5% correct.

learn ham.css < file_to_learn.txt

learn spam.css < file_to_learn.txt

classify < file_to_classify.txt


I am NOT doing ham/spam type classification. I need to define some classifications for specific types of content.


Then substitute ham/spam for whatever those types of content are.


Thanks for the affirmation! It helps when jumping into territory with which I have no previous experience.


Beware, the source code to CRM114 ain't no freakin' Cadillac.


The one with the simplest interface I've seen is http://divmod.org/trac/wiki/DivmodReverend . Looks very easy to get started with.

If you want to get more serious, use Weka or Bow or YALE or something implemented in a reasonably fast language.


I use Weka in my project. Btw, there is a book about Weka: http://www.cs.waikato.ac.nz/~ml/weka/book.html


+1 for Weka.

It's not just a Bayes classifier. It's got every machine learning algorithm you can think of. Once you get the file format right, you can drop in any algorithm you want. It's amazing.


How does Weka differ from CRM114?


I'm not familiar with CRM114, but upon cursory inspection it seems to be written in C and targeted mainly at document classification.

Weka is written in Java and implements all kinds of machine learning/data mining tools.


Andrew McCallum's Bow library (CMU) is a very robust toolkit. http://www.cs.cmu.edu/~mccallum/bow/

Very high quality. Multiple people have used it for serious research.


If Perl is ok, you can try these: http://search.cpan.org/search?query=Bayes&mode=all


I'm not sure about this one but you should definitely look into Orange if you're considering Python. http://magix.fri.uni-lj.si/orange/


You could try Orange: http://www.ailab.si/orange

It is written in a combination of Python and C, and can be used as a Python module.


The one in BioPython is pretty good.


Use CRF for classifying/labeling.

http://crf.sourceforge.net/

It's extremely well written, easy to define features in, etc. Reasonably good support too.


crm114



pg has spoken, and it is good...

Seriously though, this is exactly what I need to accomplish my tasks (which do not involve spam filtering). I briefly looked at all of the other alternatives before diving deeper into the "CRM114 Revealed" book. I really wish I had known about this a few years earlier!!


Write it yourself. I will say that teasing out the features you want to score can be some work; does anyone here know if the CRM114 package helps with that?


PROJECT LINKS/POLL

reply only to this message and vote on links below



Ruby :: Classifier gem

http://classifier.rubyforge.org/


Java :: jBNC Toolkit

http://jbnc.sourceforge.net/



