Hacker News
Ask/Poll News.YC: What is a good open source Bayes classifier?
35 points by fiaz on Feb 26, 2008 | hide | past | favorite | 36 comments
I'm just doing a little research for a project, and I thought everybody could benefit from the fact that we happen to have an expert on the subject as a member.

I would also like to learn what other people have used, so any input others could give would be helpful.

------------

UPDATE: I started a poll thread below so that we can submit a project link and up-vote a link if we have experience or knowledge over one project versus another.

Thank you to everybody who replied, you are great!!!



Whichever one you use, you'll probably need to edit the source to change the probability formula from (n_hits / n_total) to ((n_hits + 1) / (n_total + 2)). That's the correct formula based on an even distribution of probabilities (which is close enough to the actual distribution in most situations for this to be a huge improvement). I can never find a reference for this when I search for one, but you can verify it experimentally with a short program in your favorite dynamically-typed language, or a long program.

For example, if you live in a world with only black and white birds, but you don't know the percentage of each and have no reason to believe it's more likely 2% black than 70% black, or any other percentage, if you see two black birds fly by, that doesn't mean the next bird you see has a 100% probability of being black, but that's exactly the assumption most widely-used naive Bayesian classifiers make.

I modified SpamProbe to use (spam+1)/(total+2), and the results have been good.
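For the skeptical, here's a rough Monte Carlo sketch of the verification in Python (my own function names; it assumes a uniform prior over the unknown proportion, which is the "even distribution of probabilities" above):

```python
import random

def next_success_rate(n=2, k=2, trials=200_000, seed=0):
    """Monte Carlo check of the (k + 1) / (n + 2) rule.

    Draw a true success probability p uniformly at random, simulate n
    observations, and among runs that produced exactly k successes,
    measure how often observation n + 1 is also a success."""
    rng = random.Random(seed)
    matched = successes_after = 0
    for _ in range(trials):
        p = rng.random()  # uniform prior over the unknown proportion
        k_observed = sum(rng.random() < p for _ in range(n))
        if k_observed == k:
            matched += 1
            if rng.random() < p:  # the next observation
                successes_after += 1
    return successes_after / matched

# Two black birds out of two: the rule predicts (2 + 1) / (2 + 2) = 0.75,
# not 1.0, and the simulation lands close to that.
rate = next_success_rate(n=2, k=2)
```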


Sounds to me like you are describing Laplace smoothing in a naive Bayes classifier. This is a pretty standard technique for avoiding the problem you're seeing, where the probability comes out as 100% or 0% because of a lack of information in the model.


More generally, it's equivalent to applying a Dirichlet prior distribution, with uninformative parameters (i.e. all of the parameters on the Dirichlet are equal).

This is important, because while adding a single pseudocount to each column will prevent zero divisions, it's probably not reflective of the true distribution of values. If instead, you add pseudocounts using a Dirichlet where the parameters are set based on some prior knowledge, you can often improve the performance of the classifier (especially in low-count situations), without biasing the results unfairly.
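As a concrete sketch (Python, with my own names): the general pseudocount estimate is (count_i + alpha_i) / (total count + total alpha), and setting every alpha to 1 recovers the Laplace case above.

```python
from collections import Counter

def dirichlet_smoothed(counts, alphas):
    """Posterior-mean estimates under a Dirichlet prior:
    (count_i + alpha_i) / (total_count + total_alpha).
    Equal alphas give an uninformative prior; alpha_i = 1 for every
    category is exactly Laplace smoothing."""
    denom = sum(counts.values()) + sum(alphas.values())
    return {k: (counts.get(k, 0) + a) / denom for k, a in alphas.items()}

counts = Counter({"black": 2})  # two black birds seen, no white ones
probs = dirichlet_smoothed(counts, {"black": 1, "white": 1})
# probs["black"] == 0.75, probs["white"] == 0.25 -- no zero probabilities
```

Unequal alphas are where the prior knowledge comes in: if you had reason to believe black birds are rarer, you'd give "black" a smaller alpha.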


Thanks, yes, that's what it's called.

All due credit to Laplace for the technique, but the word "smoothing" makes me wince, because it makes it sound as though this is some artificial approximation. For the assumption of an even distribution of probabilities, (n+1)/(m+2) really _is_ the exact probability of the event repeating. Like I said, you can confirm this experimentally with a quick program.


There are other smoothing techniques, more prevalent in NLP, that I'm learning about in my NLP class, which distribute probability mass more evenly but I don't think will really help in a classifier. Witten-Bell and Good-Turing are the ones I can think of off the top of my head.


What sort of classification are you trying to do? Text, I assume. If it's text and you need something open source, or just a pointer to how to write one yourself, you could read the article I wrote for Dr Dobbs on this subject: http://www.ddj.com/development-tools/184406064

There are quite a lot of toolkits out there that do Bayesian things (take a look at libbow or Weka).


I'm looking for something open source that I can start using right away with minimal configuration... I'm not smart enough to create my own, but clever enough to take somebody else's work and tailor it to what I'm trying to do.

It's only for text processing.


http://www.codeproject.com/KB/cs/BayesClassifier.aspx

This is a simple one built in C#/.NET. I fixed a significant bug and raised classification accuracy from 74% to 96% on my noisy dataset (automobile accident claims). I emailed the author with a bunch of improvements (such as histograms) and some other tweaks, but never heard back. Anyway, the bugfix is simple: take a look at category.cs. In TeachPhrase(), move m_TotalWords++ inside the test for "if (!m_Phrases.TryGetValue(phrase, out pc))".

What you want here is to count the number of unique words. The original code was counting the total number of times all words appear. This one change reduced classification errors by 3x.

Cheers, --Jack
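The same fix, sketched in Python rather than the original C# (hypothetical names; the point is just where the unique-word counter gets incremented):

```python
def teach_phrase_counts(words):
    """Count occurrences per word plus the number of *unique* words.
    The bug described above incremented the total on every occurrence,
    instead of only the first time a word is seen."""
    phrase_counts = {}
    total_unique = 0
    for w in words:
        if w not in phrase_counts:
            phrase_counts[w] = 0
            total_unique += 1  # increment only when the word is new
        phrase_counts[w] += 1
    return phrase_counts, total_unique

counts, unique = teach_phrase_counts(["crash", "claim", "crash"])
# unique == 2 (not 3), counts["crash"] == 2
```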


Oh, forgot to mention that you'll want to compute probabilities using logarithms to avoid underflow precision loss (which will introduce classification errors). Example: wordValue = System.Math.Log((double)count / (double)cat.TotalWords);
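The same trick in Python, for illustration: sum log-probabilities instead of multiplying the raw values, since the product of many small floats underflows to zero.

```python
import math

word_probs = [1e-5] * 80  # 80 words, each with a small probability

naive_product = math.prod(word_probs)             # underflows to 0.0
log_score = sum(math.log(p) for p in word_probs)  # finite, no underflow
```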


The Cadillac of Bayes classifiers is CRM114: it can use classifiers that are far more advanced than naive Bayes, such as clustering or hidden Markov modeling.


I can't recommend crm114 enough. I've been using it to classify some database entries, and its accuracy is second to none. Its custom language makes working with strangely-stored data (like database entries) easy, once you learn the strange language.


Is there any documentation that stands out for learning the alien language?

No doubt, I'll be combing through all of the CRM114 information on the website. Is there anything that is not referenced there that will be of use?


If you are doing a ham/spam type classification, then you won't need the alien language. I am almost a total tech novice and I was able to do well with just some bash scripts. Of course the docs will teach you about better ways to train the system, if you are interested in going from 98% correct classification to 99.5% correct.

learn ham.css < file_to_learn.txt

learn spam.css < file_to_learn.txt

classify < file_to_classify.txt


I am NOT doing ham/spam type classification. I need to define some classifications for specific types of content.


Then substitute ham/spam for whatever those types of content are.


Thanks for the affirmation! It helps when jumping into territory with which I have no previous experience.


Beware, the source code to CRM114 ain't no freakin' Cadillac.


The one with the simplest interface I've seen is http://divmod.org/trac/wiki/DivmodReverend . Looks very easy to get started with.

If you want to get more serious, use Weka or Bow or YALE or something implemented in a reasonably fast language.


I use Weka in my project. Btw, there is a book about Weka: http://www.cs.waikato.ac.nz/~ml/weka/book.html


+1 for Weka.

It's not just a Bayes classifier. It's got every machine learning algorithm you can think of. Once you get the file format right, you can drop in any algorithm you want. It's amazing.


How does Weka differ from CRM114?


I'm not familiar with CRM114, but upon cursory inspection it seems to be written in C and targeted mainly at document classification.

Weka is written in Java and implements all kinds of machine learning/data mining tools.


Andrew McCallum's Bow library (CMU) is a very robust toolkit. http://www.cs.cmu.edu/~mccallum/bow/

Very high quality. Multiple people have used it for serious research.


If Perl is ok, you can try these: http://search.cpan.org/search?query=Bayes&mode=all


I'm not sure about this one but you should definitely look into Orange if you're considering Python. http://magix.fri.uni-lj.si/orange/


You could try Orange: http://www.ailab.si/orange

It is written in a combination of Python and C, and can be used as a Python module.


The one in BioPython is pretty good.


Use CRF for classifying/labeling.

http://crf.sourceforge.net/

It's extremely well written, easy to define features in, etc. Reasonably good support too.


crm114



pg has spoken, and it is good...

Seriously though, this is exactly what I need to accomplish my tasks (which do not involve spam filtering). I briefly looked at all of the other alternatives before diving deeper into the "CRM114 Revealed" book. I really wish I had known about this a few years earlier!!


Write it yourself. I will say that teasing out the features you want to score can be some work; does anyone here know if the CRM114 package helps with that?


PROJECT LINKS/POLL

reply only to this message and vote on links below



Ruby :: Classifier gem

http://classifier.rubyforge.org/


Java :: jBNC Toolkit

http://jbnc.sourceforge.net/



