
I'd be interested to learn how to add missing value support into an existing language. It's pervasive in R, so important for statistics, and it seems like it would be hard to patch onto an existing language.


How important is the distinction between NaN versus NA? I agree this is a subtle issue.


It is important: otherwise you cannot tell whether a NaN came out of a computation or really stands for a missing value. In NumPy, we have the MaskedArray implementation to make that distinction.
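To make the distinction concrete, here is a minimal sketch (the data and variable names are made up for illustration): a NaN produced by 0/0 stays in the data, while a value that was never observed is recorded in the mask instead.

```python
import numpy as np

# A NaN that comes from arithmetic: the middle element is 0/0.
with np.errstate(invalid="ignore"):
    computed = np.array([1.0, 0.0, 4.0]) / np.array([2.0, 0.0, 2.0])

# Hypothetical fourth slot that was never observed; the stored 0.0 is a
# placeholder and is hidden by the mask.
data = np.append(computed, 0.0)
x = np.ma.masked_array(data, mask=[False, False, False, True])

is_nan = np.isnan(x.data)            # NaN from the computation
is_missing = np.ma.getmaskarray(x)   # value flagged as missing

print(is_nan)
print(is_missing)
```

The two boolean arrays differ, which is exactly the information you lose if NaN is used for both purposes.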


What does a NA value become when you extract it to a float? i.e. What is the behavior of X[0]?

It is somewhat confusing that Python's base types and NumPy differ in behavior, for instance when dealing with inf or divide-by-zero exceptions. I think this gets to hadley's point that it will be hard to bolt R's behavior onto an existing language.


If X[0] is masked, it will return the value masked (numpy.ma.masked), which is of type MaskedConstant.
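A small sketch of that behavior, assuming a one-dimensional masked array of floats (the array contents are invented for the example). Using filled() is one way to get plain floats back, with a fill value you pick yourself:

```python
import numpy as np

X = np.ma.masked_array([1.0, 2.0, 3.0], mask=[True, False, False])

first = X[0]
is_masked = first is np.ma.masked      # identity check against the constant
type_name = type(first).__name__

second = X[1]                          # an unmasked element is an ordinary float scalar

# Replace masked entries with NaN to obtain a plain ndarray of floats.
filled = X.filled(np.nan)

print(is_masked, type_name, second, filled)
```

Whether NaN is the right fill value depends on the analysis; it reintroduces the NaN/NA ambiguity discussed above.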

As for Python vs NumPy differences: yes, those can be confusing, and that is inherent in using a "real" language with a library on top of it instead of a language designed around the domain. If you want to do numerical computation, you do want NumPy's behavior in most cases, I think. There is a tension between "easiness" and what scientists need. You regularly see people complaining about various float issues, and people with little knowledge of numerical computation advising them to use decimal and so on, unaware of the issues that brings. Also, Python tends to "hide" the complexities, whereas NumPy is much less forgiving.

As for the special case of divide by 0 or inf, note that you can get a behavior quite similar to python float. You can control how FPU exceptions are raised with numpy.seterr:

    import numpy as np

    a = np.random.randn(4)
    a / 0  # emits a warning, gives an array of +/- inf

    np.seterr(divide="raise")
    a / 0  # raises FloatingPointError (divide by zero)
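If you would rather not change the setting globally, numpy.errstate applies the same options as numpy.seterr but only inside a with block. A minimal sketch, using a fixed array instead of random data so the result is predictable:

```python
import numpy as np

a = np.array([1.0, -2.0, 3.0])

# Silence the divide-by-zero warning for this block only.
with np.errstate(divide="ignore"):
    quiet = a / 0  # array of +/- inf, no warning

# Turn the same condition into an exception, again only for this block.
raised = False
with np.errstate(divide="raise"):
    try:
        a / 0
    except FloatingPointError:
        raised = True

print(quiet, raised)
```

Outside the with blocks the previous error-handling settings are restored.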


I agree. It's pretty important. In some languages missing values get coded as funny things like minus infinity. So if you want to assign a subset of a vector to a new group (say, the people who have had an event, e.g. "heart attack") and you select with the condition x <= 1, the "missing values" would be incorrectly categorized as "heart attack". You can see how this could really affect your analysis. With R, the usefulness of NA is that if you're careful when you import data, this never happens.

Of course, if your NA values are stored as numbers in another language, you can still avoid this error by always being careful. But R's approach to NA makes it easy.
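Here is a rough illustration of that failure mode in NumPy. The coding scheme (minus infinity for "not recorded") and the numbers are hypothetical, not taken from any real study:

```python
import numpy as np

# Hypothetical per-patient values; -inf is the sentinel for "not recorded".
x = np.array([0.0, 2.0, -np.inf, 1.0, 3.0])

# Naive selection: the sentinel satisfies x <= 1 and lands in the group.
naive_count = int((x <= 1).sum())

# Masking the sentinel first keeps the missing patient out of the group.
masked = np.ma.masked_invalid(x)  # masks NaN and +/- inf
safe_count = int((masked <= 1).filled(False).sum())

print(naive_count, safe_count)
```

The naive count includes the patient with no recorded value; the masked version does not. Note that masked_invalid also hides genuine infinities, so this only works if inf is never a legitimate value in the data.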


And this isn't just a hypothetical example - it has happened and has affected the results of important medical studies.



