Don't run 'strings' on untrusted files (lcamtuf.blogspot.com)
392 points by robinhouston on Oct 25, 2014 | 214 comments


Am I the only person who thinks there's something fundamentally wrong with computing if running "strings" could let someone take over your computer? (I'm not being snarky; I seriously think the whole approach to security needs to be redone somehow.)


I think this is an indicator of how fundamentally over-engineered all of the GNU tools are. strings was supposed to be a simple tool that finds bits of data that look like human-readable strings. It wasn't meant to parse ELF binaries and suddenly become a security risk, especially since it's one of the first tools you would use in computer forensics.


The goal of the tool "strings" has always been to dump the strings table of a binary object. It happens to also have a mode that lets you try to find random string-like content in any file. It happens to default to this mode if it can't parse the file as an executable. People have thereby gotten somewhat used to having this functionality at hand, and use the tool a lot for this purpose.
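The fallback mode described above (scan anything, print the printable runs) can be sketched in a few lines of C. This is a hypothetical minimal version for illustration, not the actual GNU implementation:

```c
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical minimal version of the "dumb" strings mode: scan a raw
 * byte buffer and print every run of 4 or more printable characters,
 * with no file-format parsing at all. Returns the number of strings
 * found. */
static size_t dump_strings(const unsigned char *data, size_t n, FILE *out)
{
    size_t count = 0, start = 0;
    for (size_t i = 0; i <= n; i++) {
        int printable = (i < n) && isprint(data[i]);
        if (!printable) {
            if (i - start >= 4) {           /* long enough run: emit it */
                fwrite(data + start, 1, i - start, out);
                fputc('\n', out);
                count++;
            }
            start = i + 1;                  /* restart after this byte */
        }
    }
    return count;
}
```

A caller would read the file (or a sliding window of it) into memory and call `dump_strings(buf, len, stdout)`. Note there is no attacker-controllable parsing here beyond a printability test.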

This is not, however, the actual goal or purpose of this tool. The fact that many people use Perl as nothing more than a slightly better version of sed doesn't mean that Perl's ability to write complex object-oriented software is "over-engineering". You just don't know what this tool is actually for, which is OK, but means you can't judge whether or not it is "over-engineered".

BSD, apparently going back to at least BSD 4.3, also had a strings tool, and it did the exact same thing: it parsed binary files to dump their strings table. Apple's strings tool has no code heritage from the GNU version, instead being a vague descendant of the one from BSD 4.3. This is how this tool has always worked: stop being part of the noise trying to turn this into a GNU-bash fest :/.


> The fact that many people use Perl as nothing more than a slightly better version of sed doesn't mean that Perl's ability to write complex object-oriented software is "over-engineering".

Not sure I agree with you on Perl in particular, but other than that I agree with what you're saying ;-) (That is, I don't think GNU being over-engineered is the (or perhaps even "a") problem here.)

>> Am I the only person who thinks there's something fundamentally wrong with computing if running "strings" could let someone take over your computer?

No, the underlying library (libbfd) is an example of something that should've been fixed a long time ago. Maybe not quite the horror that was/is openssl -- but clearly an example of "old C code that sort of works" -- perhaps in some ways like bash was/is. It's old, it works, but it could use some clean-up (as evidenced by a number of related buffer under/over-flows and whatnot).

Note that parsing arbitrary binary (or otherwise) input safely is a pretty hard problem. There was recently a (resource DoS) bug in libxml2, which has been under quite a lot of scrutiny lately (by virtue of being a brilliant injection vector for malicious code, if a bug can be found).

I read this as a two-part bug: one, a lot of people didn't know that strings did more complex parsing than a hexdump and a filter for printable strings (me included) -- and it turns out that the "smart" library isn't terribly robust (in other words: it's typical C code).

While it is possible (at least in theory) to write small C utilities that are safe, once you want them to be (widely) portable, with sane handling of various kinds of encoded strings and encoded data, along with different endianness -- apparently most people screw up.
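As a small illustration of the endianness point, here is the standard safe idiom for reading a 32-bit little-endian field out of an untrusted buffer -- a sketch of the technique, not code from libbfd:

```c
#include <stdint.h>

/* Assemble a 32-bit little-endian value byte by byte. Unlike
 * *(uint32_t *)p, this works regardless of host endianness and
 * alignment restrictions, which is exactly the kind of portability
 * detail the comment above says people screw up. */
static uint32_t read_le32(const unsigned char *p)
{
    return (uint32_t)p[0]
         | (uint32_t)p[1] << 8
         | (uint32_t)p[2] << 16
         | (uint32_t)p[3] << 24;
}
```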

I think there are two basic camps wrt what should be done: those who think we need something like Rust, so that we can have safety without much of a slowdown, and those who say screw it, we're no longer running on 5 MHz (or 50 MHz) CPUs, we can take anything up to a 100x (10x) slowdown without it really being an issue -- security/stability/predictability is more important.

Those who can't decide between the two continue writing C like it's still 1989, and we get lots of stuff like this.

I'm not sure if it's usually the mix of a "smart" C programmer writing a program that is patched by a "hobby" C programmer, or the fact that getting C right is just too hard -- or that people don't -Wall and don't run fuzzers and static checkers -- but whatever the reason, we keep seeing serious bugs in C programs.

I'd like to think some of it could be avoided if people wrote more "bloated" C with copious use of functions, more call-by-value, smaller loops, perhaps more computer generated code -- and other "slow" things (while still being C). But I'm probably hopelessly naive.

While I'm definitely not sold on C++ (at least not as a viable "better" C for systems programming), I think the old 1998 article[1] by Stroustrup on "simple" C and C++ programs illustrates quite well how hard C can be to get reasonably right, even for simple problems. Perhaps rather than waiting for Rust, a reimplementation of large parts of the backbone of our OSs/GNU in Guile, Lua, or some other higher-than-C-level language could be worthwhile.

As a side note -- does anyone know of any follow up on Stroustrup's article?

[1] http://www.stroustrup.com/new_learning.pdf


> I'm not sure if it's usually the mix of a "smart" C programmer writing a program that is patched by a "hobby" C programmer, or the fact that getting C right is just too hard -- or that people don't -Wall and don't run fuzzers and static checkers -- but whatever the reason, we keep seeing serious bugs in C programs.

I think the problem is that people don't take programming seriously. The basic flow of development seems to be to write the first thing that comes to mind as it comes to mind, and then incrementally patch things up until the code seems to work on every test case you've thought up.

This approach is at best inefficient when working with "safe" languages. With C or C++, it's nothing short of irresponsible. Part of the problem, I think, is that programmers are never taught to reason formally about their code: to catalogue their preconditions and postconditions and verify with some semblance of rigour that the code they write respects them. At best, they might be treated to a passing reference to object invariants if they happen to take a class or (God forbid) read a book on "OOP."
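A minimal sketch of that contract style in C, using a hypothetical `copy_bounded` helper whose pre- and postconditions are written down next to the code and checked with `assert()` so that violations fail loudly in debug builds:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Precondition:  dst != NULL, src != NULL, dstsz > 0.
 * Postcondition: dst is NUL-terminated and holds at most dstsz - 1
 *                bytes of src. Returns the number of bytes copied. */
static size_t copy_bounded(char *dst, size_t dstsz, const char *src)
{
    assert(dst != NULL && src != NULL && dstsz > 0);   /* precondition */

    size_t n = strlen(src);
    if (n >= dstsz)
        n = dstsz - 1;              /* truncate rather than overflow */
    memcpy(dst, src, n);
    dst[n] = '\0';

    assert(strlen(dst) < dstsz);                      /* postcondition */
    return n;
}
```

The point is not the helper itself but the habit: every function states what it requires and what it guarantees, and the checks make the reasoning mechanical rather than wishful.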

It's perfectly possible to write good, safe code in C and C++, but not if you're hung up on a "smart" or "hobbyist" programmer mindset, and not if you're not willing to put in a fair amount of effort into thinking before you write.


I don't think that's the root of the problem. While you could make a simpler strings(1), which would help people who only use that one, more complex stuff like objdump(1) really does need to parse binaries. And that should be possible to do without worrying about security problems: you're just reading a file and extracting some information, which even in the worst case should be possible to do without accidentally executing arbitrary code. It's just that libbfd seems to have a lot of bugs, and because it's written in an unsafe language, such bugs can not only cause incorrect information or crashes, but sometimes attacker-controlled code execution. But if you don't do it via libbfd, you're going to need some library that can parse binaries, since many utilities end up needing to do it, and it shouldn't be impossible to safely do so.

An alternative is to implement a subset of full parsing specifically tailored to each utility. In the case of strings(1) that's very simple; in the case of some other utilities it's of intermediate complexity; all the way up to some that need to parse every corner of ELF. Whether that produces a bigger or smaller attack surface depends on a lot of factors: each parser might be simpler, but there are many more of them. FreeBSD was contemplating centralizing more of that into a common libelf, so I don't think it's only GNU who think that's a good idea in principle: https://wiki.freebsd.org/LibElf


The problem is not additional features, the problem is unsafe parsers. Adding features is natural and healthy. The damage is that we've been so afraid of parsers for so long that we associate "let's understand this bytestream better so we can be more useful" with implicit danger.

Some friends of mine are tackling this problem. You should help them. http://langsec.org


Although perhaps not in this case, it's generally true that one person's bloat is another person's functionality.

BSD tools are simpler. They're also generally less useful.


There's something fundamentally wrong with still using unsafe languages for system software.

C should have been buried decades ago, it's a toy language for small pieces of code on isolated systems (and yes, I've written larger programs in C myself, 20 years ago).

Nowadays, C should not even be considered safe for implementing interpreters and runtime systems for other languages. There are plenty of reasonably portable choices available (and no, don't use Java / anything JVM based).

It can't be terribly hard to reimplement "strings" and similar software in a modern language without such deficiencies (i.e. plenty of ways to shoot oneself in the foot and overwrite the stack etc.).


The problem is, until now there has not been a very good systems programming language that allowed you to stay lean on memory without introducing any performance overhead. Rust is certainly a contender, thankfully.

So, there is a perfectly good reason C is still prevalent to this day, even if there are many security implications in doing so.


> The problem is, until now there has not been a very good systems programming language that allowed you to stay lean on memory without introducing any performance overhead. Rust is certainly a contender, thankfully.

We've only had Pascal and Modula-2 for what, 40-45 years?

>perfectly good reason C is still prevalent to this day

... it's not that and it's not good. It's just that security implications were largely ignored, people were lazy, and innovation and making sound decisions (rather than popular ones) have never been strong in the OSS/Linux community (apart from the kernel itself).


http://www.lysator.liu.se/c/bwk-on-pascal.html

It's a hobby horse by this point, but it makes it clear that the Pascal of 30 years ago is not the Pascal of today. The same can be said of C.

These languages have evolved to be where they are today, mostly because hardware has evolved to be where it is today. Castigating past decisions as laziness really seems to be ignorant of this process, what it involved, and why it was necessary to make the decisions that we have up until now.

Engineering, on the whole, is the art of compromise.


Reading through some gstreamer code this weekend, I contend that "C of 30 years ago is not the C of today".

It is still all the most horribly insecure and obtuse passing of raw pointer manipulation and bitwise logic ever. The preprocessor is still hell on software. Every variable is named something like gst_hello_world_parser_box because when you write complex software in it you always get huge name conflicts.

It still thinks the best way to manage memory is to not manage it at all. People complained for so long about having to put your deletes in destructors in C++, too bad you have no such thing in C at all - you don't have classes, after all. Guess it has to go in the procedural function logic. Like 30 years ago.


> These languages have evolved to be where they are today, mostly because hardware has evolved to be where it is today.

No. Pascal and Modula-2 were perfectly usable, safe systems programming capable languages 20 years ago. Pascal was widely used in commercial product development (DOS/Windows), Modula-2 was scarce, but taught at universities with high quality development environments available for many platforms and the Oberon OS was based on the language Oberon, which was heavily influenced by/derived from Modula-2.

Everything necessary to produce safe, maintainable, fast software was available back then, but lazy/uneducated/stubborn people used C instead and wrote crappy software we still have to use today. It's a shame really.


I'm sure Pascal was really the silver bullet and its failure had nothing to do with Pascal derivatives being academic, non-portable, not very expressive, and having numerous other issues.

I think your ad hominem comments about people who implemented in C are in poor taste and not based in reality.


I don't understand why we talk about Pascal at all, since Pascal is just as memory-unsafe as C.


No, because Pascal offers proper strings, bounds-checked arrays, open arrays, real enumerations, numeric ranges and reference parameters.


In one important practical aspect it isn't: strings are length+content in Pascal, not null-delimited as in C. It's a very important difference.
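The difference can be sketched in C terms (the `pstring` struct here is a hypothetical illustration, not Pascal's actual string representation): a length-prefixed string carries its size with it, so operations can check bounds, whereas a C string's length is implicit in a NUL byte that hostile data can simply omit.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical Pascal-style length-prefixed string with a fixed
 * capacity, for the sake of the sketch. */
struct pstring {
    size_t len;
    char   data[64];
};

/* Append n bytes of src to dst. Returns 0 on success, -1 if the
 * result would not fit -- reject, don't scribble past the end. */
static int pstr_append(struct pstring *dst, const char *src, size_t n)
{
    if (n > sizeof dst->data - dst->len)
        return -1;
    memcpy(dst->data + dst->len, src, n);
    dst->len += n;
    return 0;
}
```

Because the length travels with the data, the bounds check is a cheap comparison rather than something each caller must remember to do.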


You read into buffers, not strings. You can easily have buffer overflows. Pascal arrays are bounds checked, but not when it matters: http://wiki.lazarus.freepascal.org/Secure_programming#Buffer....


The problems mentioned there (assigning to array elements with out-of-bounds indices) can be avoided by using range checks (the -Cr option for fpc). It's strange (and silly) that this is apparently off by default for fpc, but it's not a language feature of Pascal.


> April 2, 1981

UCSD Pascal was created in 1978, Mesa in the mid-70s and Algol in 1968.

Most of the complaints don't apply to them.


> UCSD Pascal

bwk's point, then as now, was that once you deviated from Standard Pascal, you had to either bet on a specific horse in the non-standard Pascal race, or develop your own Pascal-like language with its own inherent defects, similar to but different from the defects of Standard Pascal and other Pascal-like languages. Your code would likely never interoperate with anyone else's Pascal variant.

C, which had one de facto standard implementation and less genetic drift (because the standards, first K&R and then ANSI and ISO, were never as terrible as Standard Pascal's definition), didn't suffer from this nearly as much.

Mesa didn't, either, but only because it was fairly obscure, and there sure were a lot of Foogols running around for a while, of which Pascal and its semi-clones formed only one family.


None of this was relevant in 1981.

The first K&R was published in 1978, ANSI/ISO C in 1989.

In 1981 C only mattered if you had access to a UNIX system, there were hardly any K&R compliant compilers.

Anyone using C outside a UNIX system was also betting on a specific horse.

Which is one of the reasons why ANSI C has so many undefined and unspecified behaviours, as the committee didn't want to rule out any implementation.


I wonder how big of a project it would be to port all of GNU/Linux to Rust.

Here's a start: https://news.ycombinator.com/item?id=7882211 https://github.com/uutils/coreutils/tree/master/src


I'd rather see it ported to Scheme.


And then see it fall into disrepair and disuse because it becomes unmaintainable.

LISP is 50 years old. If it was going to become useful for high-performance software built by large, distributed development teams, it would have happened already.


Wow, what an extraordinarily ignorant and obtuse remark. Way to go HN!


Please kindly explain where it falls down specifically. Very happy to be wrong should I be missing something.


No no, you are mistaken. His reply is a special one. See, it not only refers to your reply, but it refers to itself as well! It's amazingly efficient, no need to respond to it any further!


I hope fundamental pieces of the GNU/Linux system can start migrating to Rust without the FSF objecting to rustc using LLVM.


I wonder if there'd be less friction to porting a BSD - FreeBSD already uses clang to build the kernel+userland, for instance.


You mean the myth that C was the first systems programming language, ignoring the fact that its designers just decided to ignore what was being done since the early 60's?


OK, how about this: C was the first portable, non-assembly language used for writing OS kernels. Is that true? And I mean portable in practice, not just in theory, and portable beyond one family of computers.


No. OS were already being written in BCPL, Algol, Algol W, Algol 68, PL/I, PL/M and many other languages, battling for a place in the podium of systems programming languages.

C was tied to UNIX, just as the others were tied to the OS of their vendor.

C only became portable after UNIX was available in a few American universities outside AT&T, some students decided to create workstations based on UNIX, while others started to develop C compilers, so that they could continue their work on other systems.


Didn't BCPL and Algol heavily influence C though? What features of BCPL or Algol or other, older languages that were ignored could have made C better?


BCPL not much. It was barely a high-level assembler.

Now Algol, it had:

- Bounds checked arrays

- Reference parameters to functions/procedures

- Real strings (not the first version in 1960 though)

- Explicit conversions

Quote from Tony Hoare's ACM award article[1]:

"A consequence of this principle is that every occurrence of every subscript of every subscripted variable was on every occasion checked at run time against both the upper and the lower declared bounds of the array. Many years later we asked our customers whether they wished us to provide an option to switch off these checks in the interests of efficiency on production runs. Unanimously, they urged us not to--they already knew how frequently subscript errors occur on production runs where failure to detect them could be disastrous. I note with fear and horror that even in 1980 language designers and users have not learned this lesson. In any respectable branch of engineering, failure to observe such elementary precautions would have long been against the law."

[1]http://www.labouseur.com/projects/codeReckon/papers/The-Empe...


Multics was mostly written in PL/I. PL/I is certainly portable, though I don't think Multics itself was particularly portable.

BCPL was used to write several OS's that were ported, though while BCPL predates C, I'm not sure if any of the OS's written using it predates Unix.


I think you are overlooking OCaml & Haskell (which can both be used skillfully to approach performance of C).


Are there examples of either OCaml or Haskell being used for systems programming? I know that algorithmically they can achieve pretty close performance, but I don't think support is that great for low-level operations.


"Ivory" is a Haskell embedded domain specific language for writing safe C programs. One (very loose) way of thinking of it is that it compiles a restricted subset of Haskell to low level C.

http://ivorylang.org/


There is also 'Atom' which is an embedded DSL in Haskell for writing real-time embedded systems: http://en.wikipedia.org/wiki/Atom_(programming_language)


You might consider Mirage OS (http://www.openmirage.org/) to be systems programming.


It's hard to tell what you mean by "low-level operations." A lot of systems programming, particularly the GNU suite, just deals with manipulating files, sockets, and so on, and Haskell can do that just fine.

Example: http://www.haskell.org/haskellwiki/Simple_Unix_tools


Maybe you could have a look at "Unix system programming in OCaml"?

http://ocamlunix.forge.ocamlcore.org/


Haskell is great, but GHC's runtime system is 50000 lines of C.


Which is relatively inconsequential as long as there's a clear boundary between things that can be manipulated by a user and the C code.

It'd be better if it wasn't C at all, sure, but from a security perspective I'd take a language with a C runtime over a language where code has to be carefully written in a memory-safe manner any day.


GCC is written in C/C++. I don't think any of the compilers are dogfooding their language, because it's supposed to be reusing infrastructure.

I'd hope an LLVM compiler for a language would eventually be dogfooded, though.


http://en.wikipedia.org/wiki/Bootstrapping_(compilers)

This is pretty much the first thing that everyone writes in their new programming language.


Whatever. One can always write unsafe code. No matter in what language. A really secure language is almost worthless because there's so much it can't do.

No, the problem comes from the other end. Strings should not be able to compromise a system. It only needs to read a user supplied file, and write to stdout. There's no reason for it to be allowed to exec a piece of code.


You seem to be arguing against yourself. If 'strings' were written in a safe language, compromising the system would be extremely unlikely. Or are you seriously arguing that languages that are safer than C aren't Turing complete?

There are plenty of memory-safe languages in which you can do nearly everything you'd do in C, and much, much safer. There's no reason whatsoever a program like 'strings' couldn't be written in a memory-safe language.


> If 'strings' were written in a safe language, compromising the system would be extremely unlikely.

Shellshock was a parsing bug; memory safety would not have helped bash at all.

Bugs are bugs. When you get an out of bounds exception that leaves your program in an inconsistent state somewhere halfway up the call stack in a code path with poor test coverage, "safe" is not the correct word.


The premise of this conversation is flawed: A language isn't "insecure" or "secure." It's placed at a certain point in a safety spectrum ranging from "Do anything with no safety" to "Safely heat the room and achieve nothing."

What I'm saying is not that you can't write software that can be abused in memory-safe languages. I'm saying that you're much less likely to have extremely serious code execution vulnerabilities if you write a program in Go instead of C.


> What I'm saying is not that you can't write software that can be abused in memory-safe languages. I'm saying that you're much less likely to have extremely serious code execution vulnerabilities if you write a program in Go instead of C.

What I'm saying is that "much" less likely is overstating the difference. Most bugs in C programs are not buffer overruns, and even the ones that are would still be bugs in Go or Rust, they would just be a different kind of bug which is still plausibly exploitable under real conditions.

This is not a silver bullet kind of situation. "Rewrite everything in Go" is not actually a fix -- it's replacing code with 30 years worth of bug reports and vulnerability testing with completely untested new code, at the cost of significant resources that could better be used to fix the remaining vulnerabilities.

I'm not even saying that all the existing code is perfect. Replacing OpenSSL with entirely new code would probably do more good than harm just because the existing code is so ugly. But that's the exception rather than the rule.


There are a ridiculous number of serious vulnerabilities in C code related to lack of bounds checking and manual memory management. This isn't really opinion so much as fact. I understand what you're saying, but you're also downplaying the seriousness of C bugs relative to <most other languages> bugs.

> Most bugs in C programs are not buffer overruns, and even the ones that are would still be bugs in Go or Rust, they would just be a different kind of bug which is still plausibly exploitable under real conditions.

I don't follow. What's the equivalent of forgetting to check the length of an input string before chugging it into a too-small array in Go?

> "Rewrite everything in Go" is not actually a fix

I never said that, though. I'm not suggesting we rewrite the GNU tools in Go. But if you were to write the GNU tools from scratch today, C would be a bad choice simply because it's so easy to slip up with devastating effects, and there aren't many advantages to using it for simple tools like 'strings'.


> There are a ridiculous number of serious vulnerabilities in C code related to lack of bounds checking and manual memory management. This isn't really opinion so much as fact. I understand what you're saying, but you're also downplaying the seriousness of C bugs relative to <most other languages> bugs.

I feel like "<most other languages> bugs" tend to get ignored because it's not popular to blame <language> for bugs unless <language> is C. For example an enormous number of high severity CVEs are SQL injection but I never hear anybody saying we should replace SQL with a binary interface that clearly distinguishes statements from data, even though that would make more difference in practice than replacing C with something else.

> I don't follow. What's the equivalent of forgetting to check the length of an input string before chugging it into a too-small array in Go?

In Go the program terminates, which is at best a denial of service vulnerability. If the program is anything in the nature of Fail2ban then just causing it to die is a serious problem. Meanwhile when it restarts it will have to somehow deal with whatever corrupted state the crash left behind, which depending on the context can provide the attacker with opportunities to do arbitrarily bad things by manipulating the state to be something the programmer never anticipated. Being able to induce a restart is a huge increase in attack surface.

Immediate program termination is the "take cyanide capsule" solution to serious bugs. It may be better than some of the alternatives but it's still very bad.


> For example an enormous number of high severity CVEs are SQL injection but I never hear anybody saying we should replace SQL with a binary interface that clearly distinguishes statements from data, even though that would make more difference in practice than replacing C with something else.

Using bind parameters instead of putting data directly in queries + escaping has been standard for a long time now. That is, instead of saying "make sure to escape everything", like "make sure to avoid any bugs in C code", we indeed prefer to switch to a technique which doesn't act pathologically in the presence of small errors. In SQL's case it doesn't require replacing the whole language. It technically doesn't have to in C, either - you could have bounds-checked C - but I guess once you give up the absolute-maximum-performance goal, people prefer to use different languages.

edit: Also, while program termination is not ideal, in many cases, such as this one (strings), it is basically a non-problem, and at worst, denial of service is still loads better than arbitrary code execution.


> I feel like "<most other languages> bugs" tend to get ignored because it's not popular to blame <language> for bugs unless <language> is C. For example an enormous number of high severity CVEs are SQL injection but I never hear anybody saying we should replace SQL with a binary interface that clearly distinguishes statements from data, even though that would make more difference in practice than replacing C with something else.

I don't even use SQL, but I've heard a similar mantra plenty enough times to internalize it.

"but one thing is clear: you should never, ever use concatenated SQL strings in your applications. Give me parameterized SQL, or give me death." ( http://blog.codinghorror.com/give-me-parameterized-sql-or-gi... )

This is an API level change, not a protocol level change - but an API change is the correct answer anyways. If you change the protocol and slap an API like the SQL injection prone ones on top of it, you'll have the same vulnerabilities no matter what the protocol.

For bonus points, use static analysis to catch and forbid query strings that can't be trivially proven static.

> For example an enormous number of high severity CVEs are SQL injection but I never hear anybody saying we should replace SQL with a binary interface that clearly distinguishes statements from data, even though that would make more difference in practice than replacing C with something else.

I see 3x more memory corruption CVEs alone, than SQLI CVEs. I'm not convinced that replacing SQL (or SQL APIs) would be higher impact than replacing C and C++. And since I actually use C and C++ in my daily bread and butter, they're significantly more relevant to me.

But SQLI may be significantly more relevant to you. If that's the case, by all means, focus on them more.


> If the program is anything in the nature of Fail2ban then just causing it to die is a serious problem

If that causes a serious problem, then the architecture is bad and unsuitable for solving that problem (remember: programs can also die because of OOM conditions, flaky hardware, admin errors. If this leaves a gaping security hole open for an attacker, or leads to DoS, then you need a better approach; fail2ban functionality should be in the process handling logins, for example, not a separate entity watching logfiles, which might be broken too).

> Being able to induce a restart is a huge increase in attack surface.

Compared to what? Certainly not C code, where very common issues are easily exploitable for arbitrary code execution.


Program termination/exception is clearly preferable to arbitrary code execution.


Does Go force you to sanitize all user input?

Because I am sure that if you had ported bash to Go, you would still have the same issue with the broken parser. I am not sure if you would still have had Heartbleed, but I know you wouldn't have had Heartbleed if the OpenSSL people had used the platform libc instead of rolling their own, so I wouldn't consider Heartbleed an issue with C, but an issue with the programmer.


Multiple factors are to blame. I agree this is dumb behavior here by strings, but if the ELF parsing code were to be written in a language like Rust, it's far less likely it would have a bug of this nature.


Rust... or pretty much any other language than C.

C is scary from a security perspective because it is both incredibly easy to write code that has subtle but very serious bugs leading to e.g. arbitrary code execution, and it is easy to exploit those vulnerabilities.


So what you're saying is that the bugs from a piece of software written 20 years ago (roughly) can be solved by writing the program in a language that hasn't had a stable release yet? Sure there are better alternatives now, but what widespread, well known, portable language was available in the mid 90's that this could have been written in?


Despite the volume of messages on this thread about securing C or replacing it with a safer language for systems implementation, I think your answer is the only practical one.

If you haven't seen it already, you might want to check out HiStar:

http://www.scs.stanford.edu/histar/


It's not 'allowed' to exec a piece of code. It parses code and the parser contains a bug that can be exploited to execute code.

Code that is unsafe in this way is impossible to write in most languages.


ATS is a definite contender too, especially once the high entry barrier issues are worked on a bit.


Counterpoint: All the rails security exploits, python pickling and sql injection issues.

It is common to blame C for being unsafe but the real issue seems to be trusting user input, without sanitising it first. You could use a typed language to enforce sanitising by having a special type for user input (whether read from a file or received from the system) and have the converters sanitise it.
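A rough sketch of that taint-type idea in C (the `raw_input`/`safe_input` types and `sanitize` function are hypothetical): giving raw and sanitized input distinct types means the compiler rejects unsanitized data at any call site expecting the safe type.

```c
#include <ctype.h>
#include <stddef.h>

/* Distinct types for untrusted and sanitized input. A function that
 * declares a struct safe_input parameter cannot be handed a
 * struct raw_input by mistake -- that's a compile-time type error. */
struct raw_input  { char text[256]; };
struct safe_input { char text[256]; };

/* The only way to produce a safe_input is to go through the
 * sanitizer; here it simply strips non-printable bytes. */
static struct safe_input sanitize(const struct raw_input *in)
{
    struct safe_input out;
    size_t j = 0;
    for (size_t i = 0; in->text[i] != '\0' && j < sizeof out.text - 1; i++)
        if (isprint((unsigned char)in->text[i]))
            out.text[j++] = in->text[i];
    out.text[j] = '\0';
    return out;
}
```

C's type system makes this clunkier than it would be in, say, Haskell or Rust, but even this much forces every path from input to use through one auditable chokepoint.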


Sure. I re-wrote strings a few years ago in C#, worked great. That was at Microsoft, and sadly the code can't be made public.

I've done most of my OS-level hacking in C and a limited subset of C++. I think it works well down there, where resources can't be "magic" and it's really important to know what's going on (and you often have to tell the optimizer "hands off").

At higher levels, it's lunacy to be using C. I sure wouldn't write a compiler in one, for instance.


> I re-wrote strings a few years ago in C#, worked great. That was at Microsoft, and sadly the code can't be made public.

You sure about that? If it was a re-implementation in non-cleanroom conditions, Microsoft likely don't have that right.

Even if that's not the case, I'm curious what commercial gain Microsoft believe they would have in keeping it private.

One wonders how much other code of public interest is hidden behind closed doors (and not just Microsoft's).


They have every right - even if they translated the source code to "strings" to C# line by line, they aren't distributing the result, and hence have no GPL obligations whatsoever. By the same token, if they did distribute it under something other than the GPL (which they would), and it were found to be a "derived work", they would be infringing copyright. However far-fetched, it's no surprise they play it safe.


It's very common for places to claim ownership of all code you write as an employee, both in and out of work. I don't know whether this would stand up in court, but nobody I've spoken to has tried to find out.

As for the provenance of the code, it's quite probable it was a clean-room reimplementation. Writing such tools isn't exactly rocket science. I wrote a PDB-oriented addr2line for VC++ a few years ago, and it didn't have any of the original addr2line code in it at all. In fact, I've never even looked at the addr2line code. I just ran addr2line --help and copied the command line options I saw there. I suspect a reimplementation of strings would be just as straightforward.


Yes, it was clean-room, along the lines of "I need a program to extract strings from binary files." I don't think I even looked at a man page for an existing implementation, so the options and functionality are almost certainly different from other implementations of strings.

This was driven by Microsoft's hostile policies towards running anything open source inside the company (you have to get special dispensation to install Linux, for instance). Actually looking at outside source code is a big no-no.

There has been at least one case of a contractor including GPL3 code in a project, and MS responded by (a) letting the contractor go, and (b) releasing the source for the product in question. [And no, it wasn't Windows]


Pretty much anything written inside a company is born secret. There are exceptions to this, but shipping "free" code (especially from a bureaucracy that is ultimately controlled by people who want to make money, and that includes not giving competitors any advantage at all) is a big political deal.


Compile times are important, though, and using a low-level language such as C can make a big difference.


I'll happily lose 5% compile performance in exchange for a safe compiler. Unfortunately, writing a compiler in a safe language isn't enough to make it a safe compiler.


Are you claiming that C is fast to compile? If compile speed mattered that much we'd all still be using Turbo Pascal/Borland Delphi, or perhaps more recently Go.


No, what I meant is that a compiler written in C is likely to be faster than a compiler written in another language, which is a good reason to write a compiler in C.


Nah. What you usually want from a compiler is maintainability and correctness. There's also a grand tradition for dogfooding by writing the compiler for language X in language X. If you don't, then how do you know you're actually making progress?


Out of curiosity, what's the problem with JVM-based languages as a safe choice? (I'm legitimately curious here.)

It's certainly not the right solution for everything, but there are plenty of cases where I'd think it could be an excellent choice. Most of Android (except the kernel) is written in Java, as an example.


The most popular implementation of the JVM itself is written in C++. But more importantly, Java has experienced a multitude of serious security bugs over the years.


I'm not a huge JVM fan, but the majority of the Java vulnerabilities are related to a very different threat model that many languages don't even purport to guard against: safely running code that you know may be directly written by an adversary. The Java SecurityManager is supposed to allow that by sandboxing code with limited permissions, but has had a number of bugs, some of which can be exploited to let malicious apps break out of the sandbox. That's bad for cases where you are actually relying on running potentially malicious code, like applets in browsers. But not too relevant to the case where you'd be considering whether to choose Java, C, or C++ for a desktop or server app. In that case, treating the app itself as potentially malicious is not a common threat model: people don't normally run C/C++ apps in a sandbox. Although I suppose with the rise of Docker that might start becoming plausible for certain apps, relying on some OS facilities rather than a VM to do the sandboxing.


There have already been multiple security vulnerabilities in Docker, and likely there will continue to be more in the future. Virtualization and sandboxing are hard!


>The most popular implementation of the JVM itself is written in C++.

The amount of C++ code decreases with each release.

For example, version 8 of OpenJDK had quite a bit of code rewritten in Java, thanks to the work introduced with invokedynamic.

There are plans to eventually replace Hotspot with Graal and SubstrateVM, in some future version of the OpenJDK, thus reducing even more the C++ surface.

This is why project Sumatra is using Graal and now jRuby is also playing with it.


The vast majority of which were actually in parts of the C++/C code used to implement the most popular implementation of the JVM. Which is in C++ for historical reasons, not because C++ is needed to implement a JVM.

And as other commenters note, the amount of C++ decreases each release because it's less and less useful to have any code in the JVM written in C++.

In Graal, calls to the OS are going to be more limited in surface area than in HotSpot's C2.

For all these reasons, removing a class of errors (e.g. buffer overflows leading to code execution) is going to give more secure code.

Not that I think Java is a silver bullet that will solve all problems. I do think that a language like Java or Rust is a major improvement over C with minimal performance issues.


Garbage collection is a deal breaker for some systems software, you can't use the "fork and start a new process" model for utilities, and memory usage on small systems is going to be an issue too.


The problem is that "safer" languages just encourages programmers that don't really think carefully about what they're doing because of the "the language is safe, it'll protect me from everything" effect. "Simple" bugs get hidden, programmers are encouraged to create more complex systems as a result, and the bugs thus created become even more subtle and difficult to find.

Is it really so bloody hard to ensure that e.g. the fields of the executable header, if they're offsets, are actually valid values? I've written tools that work with PEs, many of them reading the entire file into a buffer first, and "make sure you're inside the file" was one of the points I always kept in mind.

I say we need to fix how programmers think, not the language, because the same mindset that leads to bugs like these in "unsafe" languages will also lead to (maybe less severe, maybe more severe but also more subtle) bugs in "safe" languages too.
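The "make sure you're inside the file" check described above really is only a few lines in any language; a sketch (in Rust, with hypothetical header fields, using checked arithmetic so the sum itself can't overflow):

```rust
// True iff a section described by (offset, size) in a header lies
// entirely within a file of `file_len` bytes. `checked_add` guards
// against the classic trick where offset + size wraps around and
// slips past a naive `end <= file_len` comparison.
fn section_in_bounds(offset: u64, size: u64, file_len: u64) -> bool {
    match offset.checked_add(size) {
        Some(end) => end <= file_len,
        None => false, // offset + size overflowed: definitely bogus
    }
}
```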


> The problem is that "safer" languages just encourages programmers that don't really think carefully about what they're doing because of the "the language is safe, it'll protect me from everything" effect. "Simple" bugs get hidden, programmers are encouraged to create more complex systems as a result, and the bugs thus created become even more subtle and difficult to find.

Citation needed. The plethora of overcomplicated C code is a strong argument against this claim.


The plethora of overcomplicated C code is a strong argument against this claim.

There certainly is plenty of overly complicated C code, but I doubt there is more of it than there is overly complicated code written in some of the other popular "safe" languages: Java and JavaScript.


These unsubstantiated claims are pointless.


> I say we need to fix how programmers think, not the language

I agree, but unfortunately, changing the language or replacing it, is a lot easier than changing the way programmers think.


People create more complex systems because they no longer have to think about tedious bugs that will be caught by their compiler?


> (and no, don't use Java / anything JVM based).

Why not? There are meta-circular implementations of Java, with native code generation on their toolchains.

Even OpenJDK is going to get an AOT compiler, as discussed at the JVM Language Summit 2014.


What are your arguments against Java?

If you are disregarding Java when you are looking for portable options, then you are IMHO limiting yourself quite a bit and could be missing out on a good solution to your problem.


For software like "strings"? Startup times and memory footprint/performance due to JIT compilation / the JVM, as well as the notorious complexity and bugginess of the runtime. Just consider the frequent updates and incompatibilities you'd get with it - for example, we had plenty of Java-based remote management GUIs which wouldn't run on newer Java runtime versions. Maybe with the (limiting) static compilation options available for Java it'd make more sense.


http://www.erights.org/talks/no-sep/ agrees that there's a basic problem with having a program run as the user. "Treating security as a separate concern has not succeeded in bridging the gap between principle and practice, because it operates without knowledge of what constitutes least authority. Only when requests are made -- whether by humans acting through a user interface, or by one object invoking another -- can we determine how much authority is adequate. Without this knowledge, we must provide programs with enough authority to do anything they might be requested to do."

It's also a basic problem that so much complicated systems software is written in C, but that's better known.


I absolutely agree. It should be trivial to define the sandbox around strings - read an input file, output to STDOUT, nothing else.

Knowing which programmer signed the code is nice, writing stuff in a "more secure" language is nice, but sandboxing beats it all.


I wonder if you could sandbox everything by default.


You mean Qubes OS: https://qubes-os.org/ ?


heh, even parsing a file header can be malicious https://www.youtube.com/watch?v=3kEfedtQVOY


No, lots of people think that. Those people do not use GNU software for this reason.


What do they use then?


The MINIX one looks OK, for example: https://github.com/jncraton/minix3/blob/master/commands/simp...

It does attempt to parse a.out format, but any weird values in the header won't cause reading/writing beyond the end of a buffer.


Having a trivial binary format is not particularly great, nor is supporting one binary format and not another. Yet when you get to the complexity of ELF parsing, while I'm sure it's possible to code defensively and end up with fewer bugs than something like libbfd, parser bugs in general are the bread and butter of C.

Solution? In my opinion, either don't bother parsing any binary formats (who actually needs that functionality?), or use a safe language.


Even without such vulnerabilities, I would be wary of printing out stuff from any untrusted files in a terminal. Most terminal emulators have been vulnerable to escape character attacks at some point.

http://marc.info/?l=bugtraq&m=104612710031920 http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2008-2383


But I don't think that strings will print escape characters. It's the point of strings to extract printable ASCII characters. Or am I mistaken?


No, you're right, I think the parent comment was referring to using cat or grep on binaries.


even on 'valid' binaries, it still tends to mess up your terminal. I noticed that pretty quickly when I started working with Linux. Are there really people that work with cat and grep on binary files?


Perhaps not intentionally, but cat has valid uses for concatenating binary files and sometimes they end up going to the terminal just by accident. As far as grep goes, the answer to your question is "yes": http://stackoverflow.com/questions/9988379/how-to-grep-a-tex...


Of course you can use grep on binaries perfectly safely, if you just don't print all the results to the terminal. Use `grep -lr <pattern> <dir>` to find binaries that contain a certain pattern, or `grep --byte-offset --only-matching --text <pattern> <file>` to find the offsets in a file.


Grep even explicitly supports it with the -a (--text) option that forces it to treat a binary file as text.


It reminds me of a bug I published in Bugtraq back in 1999: http://archives.neohapsis.com/archives/bugtraq/1999-q3/1113....

I wrote a buffer overflow exploit at that time.


I'm a regular user of this utility and this came as a complete surprise to me. So much so that I checked the source myself before believing the article.

I thought "strings" was just a dumb scan over the file. Does this mean that with a properly crafted binary it is also possible to hide strings from a quick check with "strings"?


Yes, though a properly crafted binary has always been able to hide from strings with even a minimal amount of obfuscation or encryption. There's no way to know all the strings a program can output without running it.


Of course, a properly crafted program can always arbitrarily obfuscate strings.

But if you manage to trick libbfd into thinking it's looking at a particular format, you can hide plain text from a simple "strings" invocation even in files that are not executable. I've been using "strings" on all kinds of files, not only executables, and assumed that it will always display all sequences of printable characters present in the file.
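Even a one-byte XOR is enough to defeat a plain printable-run scan; a toy sketch (the scheme here is invented for illustration):

```rust
// XOR every byte with a key >= 0x80 so no printable ASCII run
// survives in the stored bytes; the program can still recover the
// text at runtime by applying the same XOR again.
fn obfuscate(s: &str, key: u8) -> Vec<u8> {
    s.bytes().map(|b| b ^ key).collect()
}

fn deobfuscate(v: &[u8], key: u8) -> String {
    v.iter().map(|&b| (b ^ key) as char).collect()
}
```

A strings-style scan over the obfuscated bytes finds no printable run at all, yet the original text is trivially recoverable.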


Ah, I see. Fair enough, but I don't think it was ever a great assumption that "strings" would uncover all the text in a file. There are so many ways to screw with a file at the byte level that could confuse "strings" but still appear fine when read by an application.


If you solve the problem of determining all output strings of a given program without running it, please leave a message ;)



Sure, you can reduce it down to the Halting Problem, but it's much simpler just to invoke Rice's Theorem.

http://en.wikipedia.org/wiki/Rice's_theorem


This tool has always, going way back to at least BSD 4.3 (if not earlier), been a tool for dumping the strings table of a binary object, which just so happened to also have a fallback mode for things it didn't know how to parse as an object file.


Based on the man page, I'm going to guess yes:

       -a
       --all
       -
              Do not scan only the initialized and loaded sections of object files; scan the whole files.

Both the name and description imply that by default it won't scan the whole thing.


It's time to start converting the low-level Linux/UNIX utilities to a language with subscript checking. Go, or Rust (if and when it's finished), or D, or something. We have some good options now.


The Linux kernel and the core utilities needed for a user-friendly OS add up to a mind-boggling amount of code, written by thousands of hobbyists over the course of decades. That code base has the benefit of actually existing, being familiar to a lot of people, and (mostly) behaving in predictable ways that are consistent from one un*x-like system to another.

So in that sense, it's somewhat counterproductive to just say "somebody oughta rewrite this stuff", unless (like RMS) you're willing to dedicate a good chunk of your life to that mission - or think that your post will inspire somebody else to do the same.


As I wrote some time ago,

http://www.drdobbs.com/architecture-and-design/cs-biggest-mi...

it is easy to add bounds checking arrays to the C language. The trouble is, nobody is interested in doing it.

It'd be a heckuva lot easier than changing languages.


(Bright, of course, is the creator of D.)

I agree completely about C. I've been saying this for years. There are three big problems that cause crashes in C programs: "How big is it?", "Who owns it?", and "Who locks it?". The result is over three decades of segfaults and buffer overflows.

There have been three or four variants on C which address some of those issues. I've proposed one myself. None got any traction. The only thing that might work is if someone developed a safe variant of C which could be machine-generated from existing C code, and didn't add significant overhead. GCC already has a fat-pointer subscript checking option, but nobody uses it. That approach is usually slow, with a subscript check on every reference. If you do it right, most subscript checks get hoisted out of loops. Go does that for many FOR statements.
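To make the hoisting point concrete, here is the shape of the transformation in Rust, where the iterator form lets the compiler establish the bounds once for the whole loop instead of conceptually checking every access:

```rust
// Indexed form: conceptually one bounds check per element.
fn sum_indexed(v: &[u64]) -> u64 {
    let mut total = 0;
    for i in 0..v.len() {
        total += v[i]; // checks i < v.len() on each iteration
    }
    total
}

// Iterator form: in-bounds is established once up front, so the
// per-element checks can usually be removed entirely.
fn sum_iter(v: &[u64]) -> u64 {
    v.iter().sum()
}
```

(In practice the optimizer often hoists the checks out of the indexed form too; the iterator form just makes the proof trivial.)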

Rust is one of the very few languages which addresses all three of those issues without resorting to garbage collection. I really hope the Rust crowd doesn't screw it up.


> GCC already has a fat-pointer subscript checking option, but nobody uses it

My experience with adding extensions to C++ is that nobody will use them, not even the people who proposed the extension, unless it is adopted by the Standard. The same goes for C.

The feature I proposed for C has been in D since the beginning, and has a very strong track record of success - both in user acceptance and in eliminating bugs. Whether the runtime bounds checking is actually done or not is controlled by a compiler switch - but most users choose to leave it on.


There are a lot of GCC extensions that are not in the standard. https://gcc.gnu.org/onlinedocs/gcc/C-Extensions.html

Things that are definitely used are __attribute__'s and labels-as-values.


How would that fix the problem at hand? The code in question isn't using array notation. At the end of the day, it's not a case of using C when another language would be better, it's a case of crappy coding.


It helps because instead of rewriting the whole app, just the function parameter types are redone where pointers to data are changed to bounds checked arrays, on a case by case basis.

It isn't a magic bullet, but as buffer overflows are (I presume) the most common cause of C security exploits, this would help a lot.
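In slice-based languages the change of signature looks roughly like this (Rust slices shown, since they are the same fat-pointer idea as D's arrays; the function itself is just an illustration):

```rust
// Instead of the C-style (const char *p, size_t n) pair, the length
// travels with the pointer, and every index is bounds-checked.
fn first_nul(buf: &[u8]) -> Option<usize> {
    buf.iter().position(|&b| b == 0)
}
```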


Maybe, but my fear would be that the act of rewriting it to use array notation will introduce yet more bugs. Some of the comments point out that the problem exists because nobody cared enough to fix it for several years. Having seen that nobody cared enough to fix it, it's not clear that anyone will care enough to fix it properly.


>to a language with subscript checking

Or, more generally, memory safety.

Of the three you described, Rust makes the strongest memory safety guarantees. Neither D nor Rust require a complicated runtime. I'd say this isn't really Go's intended area of usage.

Also, Go is missing things like ASLR and DEP (last time I checked), which means that if you link to any vulnerable non-Go code using cgo (which is almost inevitable when writing core utils), or if you find a good bug in the Go runtime, it's trivial to get code execution.


Address space randomization is security theater. It slows down attackers, but doesn't stop them. See "stack spraying". What it does prevent is replication of security bugs, which allows vendors to ignore them.


If it slows them down and makes their lives harder, then it is a worthy mitigation. That's what a mitigation is, after all: it reduces severity and effectiveness. It's not a full stopgap. Even still, ASLR implementation quality has left much to be desired throughout the years.


Counter-ASLR measures are not always viable. Regardless, Go also lacks DEP, which is the more important of the two.


No it isn't. If trying to exploit some daemon requires thousands of tries instead of one try, I am thousands of times more secure, because they will almost certainly fail the first time, causing the daemon to crash, and then have no more tries. It is only security theater if you live in the land of "let's use broken software that crashes all the time and then run a monitoring tool to automatically restart it". That is the problem.


Or, the authors of Linux/UNIX utilities should have implemented and should be implementing bounds checking. It seems excessive to switch languages rather than encourage stronger programming habits.


Are there any examples of that being successfully done? Even djb's software, intentionally written to be minimalist and as secure as possible, has had exploitable overflows (both qmail and djbdns have suffered from this). Every Linux and BSD distribution (even OpenBSD) has suffered buffer overflows so severe that arbitrary internet users could get remote root access. Etc.


> Even djb's software, intentionally written to be minimalist and as secure as possible, has had exploitable overflows (both qmail and djbdns have suffered from this).

Which of these are you describing as a buffer overflow?

http://www.cvedetails.com/vulnerability-list/vendor_id-9069/...


The Cyclone dialect of the C language comes to mind: https://en.wikipedia.org/wiki/Cyclone_%28programming_languag...

Unfortunately, it's a dead project, and as far as I recall, never compiled on 64-bit architectures.


That's one approach, yeah. I used to think it was a likely one, but I now think three others are more likely:

1. A language that is low-level and safe but also gives you enough interesting & new to build some buzz/interest, rather than "just" safety. Rust is a candidate here, perhaps.

2. Static analyzers in C advance to the point where a subset of C large enough to be useful can be routinely checked for common types of errors. And it then becomes socially expected that at least core OS stuff will be written in that "checkable" subset of C, treating "unable to prove safety" warnings as errors, or at the very least as suspicious.

3. Mitigate it at the OS level with finer-grained access controls. Utilities like strings(1) or objdump(1) are the easy case here: they do not need to actually have permissions other than "read a file" and "print to output". Even in the worst case, arbitrary code execution in objdump(1) should not be able to delete your home directory, join a botnet, or email your ssh key somewhere, because objdump(1) does not need those permissions. FreeBSD's libcapsicum looks promising, in the sense that it is actually being implemented in the base system, rather than just being yet another ACL proposal going nowhere (Solaris/Illumos also has an actually-shipped privileges system, but I don't know how extensively the base install itself uses it).


1. "Just" safety is hardly peanuts given the status quo. In fact, "just" safety would be much more practical. It'd be much easier to port all the existing code to a C dialect (like Cyclone) than rewrite from scratch in something like Rust.

2. I find compiler instrumentation (think AddressSanitizer and Mudflap) to be more promising than static analysis. Much of the latter is still stuck in the lint era and produces too much noise. That said, tools like Coverity have come a long way and I know a lot of FOSS projects use them frequently. I personally haven't.

3. Capsicum is quite promising, indeed. I like that it extends the existing file descriptor metaphor and offers sandboxing based on namespaces instead of system calls (unlike seccomp), as opposed to the crufty POSIX 1003.1e capabilities which are underdeveloped and still limited to executable processes, AFAIK. That said, we shouldn't just rely on sandboxing, jailing and capability-based security. We need to fix underlying application bugs, as well (the applications that implement the capabilities and sandboxing themselves, particularly so!)


No. Three decades of C have clearly demonstrated that trying to make programmers be "very careful" will not work.

(This evening, I'm struggling with a broken "gedit" on Ubuntu. It turns out that editing a sufficiently large file will break "gedit". Not just crash it once; it messes its configuration up so badly that future uses of "gedit" hang the entire GUI.

Known bug since 2012. https://bugs.launchpad.net/ubuntu/+source/gedit/+bug/1021720 Status: unassigned.

If the program terminated with "subscript out of range at line 1354 of 'editmain.c'", this would have been fixed by now.)


Humans are unreliable, especially when doing uncreative work. Anything that can be automated should be.


I agree, but I think by now we should have accepted that the programmer is their own worst enemy. Laziness is a virtue and all that : )


It's not the utilities themselves that are the bigger problem - rather, a library the utility uses: libbfd in this case. The problem with writing libraries in anything other than C, of course, is that they are not very usable by a large variety of software.


It's pretty easy to call into a Rust library from C. The lack of a garbage collector makes embedding Rust quite nice.

Lately I've even been playing with a Ruby gem that's a C extension that calls into a Rust .a that exposes itself via extern C.
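The Rust side of such a bridge is small; a minimal sketch (the function and its name are invented for illustration):

```rust
// Exported with an unmangled symbol and the C calling convention, so
// a C (or Ruby, Go, ...) caller links against it like any C function:
//     size_t count_printable(const unsigned char *buf, size_t len);
#[no_mangle]
pub extern "C" fn count_printable(buf: *const u8, len: usize) -> usize {
    // Rebuild a slice from the raw pointer; the caller promises the
    // buffer is valid for `len` bytes.
    let bytes = unsafe { std::slice::from_raw_parts(buf, len) };
    bytes.iter().filter(|b| b.is_ascii_graphic()).count()
}
```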


I just hope the Rust developers don't screw up. We badly need a safe, fast, low-level language to replace C.

Once Rust settles down, the next step is academic work to automatically translate existing C into Rust.


Ada, Modula-2, Object Pascal, Oberon, Modula-3, ...

Sadly, the list of better system programming languages that the industry decided to ignore is quite big.


All but one of these are Wirthian languages. They all had overhead for their time period, and many of them were academic rather than pragmatic (consequently becoming influential for language designers).

Ada? You're trying too hard. It used to be a government standard and wasn't ignored in the least (see: http://www.seas.gwu.edu/~mfeldman/ada-project-summary.html)... much to a lot of people's chagrin. It's widely regarded as an example of a monster language, but it is still absolutely used where the safety is critical.


> They all had overhead for their time period, and many of them were academic rather than pragmatic

What overhead? The one spread by the C crowd without experience in said languages?

I used a few commercial compilers for those languages. They were quite comparable in terms of generated code quality to C compilers of the same generation, back in the day.

C was also a research language until AT&T made the code available.

> It's widely regarded as an example of a monster language, but it is still absolutely used where the safety is critical.

Actually, I would dare to say that Ada 2012 is smaller than C++14.

Its use has increased in Europe thanks to what is being discussed here, namely the amount of money lost to security issues caused by the industry's adoption of C due to its relation to UNIX.

I included Ada in the list, because most developers aren't aware that it still exists and is being used. Or that GNAT is just one of many compilers that are still available.


Ada (and I guess the other ones) are unlikely to be memory safe in the presence of arbitrary pointers without a GC. That is, having a pointer/reference deep into a vector or some nested element of a tree requires a garbage collector or is unsafe (runs the risk of becoming dangling). So, while they are possibly better than C in many ways there are others in which they are lacking.

This lack of references means code is forced to do more copies or data-structure lookups.

(I don't actually know Ada or those other languages, but I did do some research a while ago and discussed this publicly here and on /r/programming a few times, and have never been corrected, so I guess it is close to correct.)


having a pointer/reference deep into a vector or some nested element of a tree requires a garbage collector or is unsafe (runs the risk of becoming dangling)

Yes, this is a problem, certain data structures need to be intrusive in order to be fast. Rust for example is a highly memory-safe language but its "unsafe" mechanism needs to be used in order to implement performant data structures for this reason.

http://www.reddit.com/r/rust/comments/2jec05/problem_with_im...


That is not what I'm talking about. e.g. (in Rust)

  let map: TreeMap<uint, SomeHugeThing> = make_map();
  let value: Option<&SomeHugeThing> = map.find(&123);

`map` is a (recursive) tree that associates a `uint` key with a value of type `SomeHugeThing`; imagine that it is, e.g., 1KB, or otherwise expensive to copy. The `value` is a direct pointer to the memory of the value associated with the key 123 (if it exists), that is, it is extremely cheap to manipulate `value` because it is basically a machine pointer directly into the memory of the TreeMap. Rust gives you the power for that to be perfectly safe without a GC: there's no risk that changes to the `map` structure will cause the `value` pointer to be invalidated.

As far as I know, Ada etc. do not allow for this without a GC. That is, there's no way to have a safe pointer directly into the memory controlled by some dynamic data structure. Instead, one would have to search the `map` each time the value is wanted, or turn on GC.

> its "unsafe" mechanism needs to be used in order to implement performant data structures for this reason.

Implement some performant data structures. There's a lot of data structures that can be implemented performantly without being intrusive (sure, one might want to use `unsafe` occasionally to optimise them fully, but the `unsafe` is almost always not being used to make them intrusive).

Also, I don't understand the relevance of that link, the top comment (which is mine, btw) clearly demonstrates that `unsafe` is not necessary.


Ada specification allows for a GC, but few implementations provide one.

Oberon and Modula-3 have GC and had real usable OS implemented in them, not just some kind of concept OS.

However those languages already offer the following in terms of security over C:

- String data type

- Open arrays (aka slices nowadays)

- Reference parameters (no need to pass pointers to functions/procedures)

- Bounds-checked arrays (you can turn bounds checking off, if you really need to)

- Pointer arithmetic is explicit operation

- Casting between types is explicit

- Enumerations are their own types, there is no implicit conversion to/from ints.

While they still don't cover all use cases in terms of memory safety, they already cover quite a few scenarios that in C just lead to unsafe code without the help of a static analyser.


Things get uglier when I want to then call the whole shebang from - let's say - Go! Go -> C -> Rust. Now I am paying the penalty for 2 runtimes!


Nope, you can use Rust without the runtime, so you're not _forced_ into paying the overhead.


So the Rust library you write is fully self-sufficient without calling into any Rust stdlib functions or using any runtime initialization mechanisms like ctors and dtors? That sounds too good to be true - if I am calling Rust then I am getting the Rust runtime, just like if I call strlen I am getting the C library.


You can use a significant portion of Rust with only libc (i.e. no different to C) and this will be increasing. Things like "constructors" and destructors do not need the support of a runtime at all, they're entirely just conventional static function calls.

That is to say, the only overhead/problem with using C vs. using C and Rust is the increased binary size of having additional code included. There's no loss of flexibility (as long as you're not trying to run on a tiny microcontroller).


That does sound great overall - and I will take a deeper look at Rust - but this thread is about rewriting the core Linux utilities in Rust - if I had to do that with full-featured Rust it would be hard enough. But if I had to restrict myself to a subset to avoid the "runtime penalty" for programs embedding my Rust shared library implementation of libbfd - then the rewrite just went from nearly impractical to completely impractical in terms of the effort required.


> But if I had to restrict myself to a subset to avoid the "runtime penalty" for programs embedding my Rust shared library implementation of libbfd

The point is there is no 'runtime' penalty with Rust†. It doesn't have a compulsory garbage collector, it doesn't have a compulsory complicated IO manager, it doesn't have a compulsory multiplexed threading system. In some circumstances lacking any of those will be a downside, of course, but not having it built-in and compulsory means Rust gains flexibility. In several ways, there's actually lower overhead with Rust, because it is a modern language designed around the advances and research that has happened, e.g. dynamic (de)allocation can be faster due to putting some sizedness requirements on the very lowest level APIs (which one rarely calls directly, so the programmer will never notice the restrictions, just the improved speed).

My comment was trying to address the Go <-> C <-> Rust bridge, pointing out that C and C <-> Rust are essentially the same, other than the extra code one must have from writing in two languages rather than one. If one was to forgo the C (which is possible, Rust can easily be used to write a library directly exposing a C ABI), there won't be much difference at all between C and Rust.
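To make that concrete, a minimal sketch of what "directly exposing a C ABI" from Rust looks like; the function name here is made up:

```rust
// Sketch (mine, hypothetical name): a Rust function exported with a
// C ABI and an unmangled symbol name. A C caller would declare it as:
//   uint32_t rust_add(uint32_t a, uint32_t b);

#[no_mangle]
pub extern "C" fn rust_add(a: u32, b: u32) -> u32 {
    a.wrapping_add(b) // C-style wrapping semantics, no panic on overflow
}

fn main() {
    // Callable from Rust too, of course.
    assert_eq!(rust_add(2, 40), 42);
}
```

Compiled as a cdylib/staticlib, the symbol is linkable from C (or from Go via cgo) like any other C function.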

In summary, in future, any restrictions required would not be very significant, and, even now before Rust's runtime has been excised, the restrictions will be things like "no network IO", one still gets access to all sorts of fancy iterators and data structures without requiring any runtime.

†Strictly speaking, this isn't quite true right now, but there is a concrete plan currently being executed to make it true.


Wow, in that case Rust is looking like it could be hell of a lot better C replacement! If I could just write all my libs in Rust using the full power of the language and then just call them easily from any language without ungodly overhead or complexity like in case of say Java/JNI, that's nothing short of fascinating!

Thanks for taking the time to explain.


It's not all brilliant though, there's still a lot of useful tooling that Rust needs to develop to make it really nice, e.g. there's no nice way to create a C header file describing a Rust library (since C headers are the lingua-franca of FFI libraries, in many ways), https://github.com/rust-lang/rust/issues/10530 .

And, being a young language, it's not too hard to be doing something no-one else has ever done, especially with this low-level stuff. :) (Meaning compiler bugs and some "I don't know" answers.)


By the way, you may be interested in https://github.com/uutils/coreutils


You can use the core library to get standard functions with no runtime. I'm not sure what you're saying about constructors and destructors. There's nothing special about calling strlen that adds overhead.


C is fully self sufficient without calling into libc. Otherwise all the libc alternatives couldn't exist.

Sure, you don't get to call strlen (actually, you do if the compiler treats it as an intrinsic), but that's no big deal.


So we would need to write Rust libraries without using Rust stdlib? Not exactly a great advantage in that case. Especially so when you are rewriting a ton of C code.


So what are you complaining about? Do you want to use runtime libs or not? It sounds like you're dissatisfied either way.


It's about the context - rewriting the ton of code that is in C libraries right now. If that had to be done using Rust without its runtime libs, it would be a huge disadvantage for solving the issue at hand.


How high is that penalty, relative to, say, loading a variety of C shared libraries? At some point safety and security trumps saving a few kilobytes of memory...in a server environment, that point probably has already been passed a decade ago. Of course, if the penalty is hundreds of megabytes, then the equation might add up differently.

But, we already have a huge number of system tools and utilities running in a huge number of languages. Python has become the lingua franca of Linux management tools; Perl was in that role in the past, and still exists in a lot of places; bash and shell are integral. All have separate runtimes, and sometimes interact with system-level C libraries, either directly or through command line interfaces.

Is it really that big of a deal to have a Rust or Go shared library in a system that already has a half dozen different languages and a variety of shared and unshared libraries existing at once? I suspect it would be unnoticeable on a modern system. I'd choose safety over shaving a few kilobytes of memory used.

There would be a cost in having that interface friction for developers...and we'd be paying that cost for years. But, it seems like both Rust and Go have planned for the languages to be integrated with C libraries from the beginning, so it seems less of a problem. If I were going to attempt such a thing, I'd probably start with the front end utilities that use the libraries and then work my way down to converting the libraries (even though the libraries, in this case, are where the problems lie). But, maybe it's possible to make Go or Rust code that provide the same C interface as the existing libraries...I don't know enough to even guess.


Arguably 3.


Most languages have some type of "extern C" as part of their FFI. It should be possible to use C as a lingua franca between two non-C languages.


> It should be possible to use C

Ugh, wouldn't that fall in the "just because you can do something doesn't mean you actually should" category? :) Suppose I wrote the library in Go and exported some type of C wrapper and then I loaded that in Rust using the Rust->C mechanism - that will load the Go runtime into Rust! And you've still got non-trivial C code to deal with anyway!


If you want to interface two languages, then I do not see a way around involving the runtime of both languages (in the same way that linking to C code still uses the C runtime). However, using my method does not actually involve any C code. "extern C" makes the compiler produce object code that can be called as if it was produced by C code; it does not produce any C code that would compile to the given object code.

Of course, it will also produce C headers to give the type information of the exported symbols, but I wouldn't call that code.


There was a project to rewrite common binaries in Perl a few years back, the Perl Power Tools project.

https://metacpan.org/pod/PerlPowerTools

I'd rather see that than a language such as Rust, D, or Go.


Crap, so if `objdump` is likely vulnerable to overflows, and `ldd` is a simple bash script ripe for abuse¹, is there a safe and easy way to determine dynamic library dependencies in an executable?

¹ http://www.catonmat.net/blog/ldd-arbitrary-code-execution/


There's also readelf:

$ readelf -d /bin/ls

    [...]
     Tag                Type      Name/Value
     0x0000000000000001 (NEEDED)  Shared library: [libselinux.so.1]
     0x0000000000000001 (NEEDED)  Shared library: [libcap.so.2]
     0x0000000000000001 (NEEDED)  Shared library: [libacl.so.1]
     0x0000000000000001 (NEEDED)  Shared library: [libc.so.6]
    [...]

I don't know how careful readelf is with its input validation.


The article mentions readelf as using the same buggy library :(


readelf specifically doesn't use BFD - one of its major reasons for existence is to validate libbfd.


Hmm, my mistake. You're completely correct.

From the readelf man page:

      This program performs a similar function to objdump but it goes into
      more detail and it exists independently of the BFD library,
      so if there is a bug in BFD then readelf will not be affected.


For an executable you don't trust? Probably running objdump as an unprivileged user inside a VM, and working on future hardening of libbfd.


You can run ldd safely as a normal user.


ldd actually executes the program being probed; it is not a static analyzer. It's only safe if you trust the program in the first place.

From my /usr/bin/ldd:

    try_trace() (
      output=$(eval $add_env '"$@"' 2>&1; rc=$?; printf 'x'; exit $rc)
      rc=$?
      printf '%s' "${output%x}"
      return $rc
    )
    …
    # If the program exits with exit code 5, it means the process has been
    # invoked with __libc_enable_secure.  Fall back to running it through
    # the dynamic linker.
    try_trace "$file"


It depends, but in general, ldd(1) should be considered unsafe. See under Security: http://man7.org/linux/man-pages/man1/ldd.1.html

This is an insightful bug report: https://bugzilla.redhat.com/show_bug.cgi?id=531160


> ldd actually executes the program being probed;

It's insecure, but it's actually a bit more nuanced. What I'm describing applies to up-to-date glibc, it might have been even more insecure in the past.

ldd first calls the dynamic linker with --verify, i.e.:

    /lib/ld-linux.so.2 --verify $file

This is the system's dynamic linker, which should be considered safe. The --verify parameter tells the dynamic linker to check if the file is a valid dynamically linked file. Assuming that the file exists and it is an ELF file, the dynamic linker exits with status 1 if the file doesn't have a DYNAMIC segment, with 2 if the file doesn't have an INTERP segment, or with 0 if it has both (the typical case). Execution is never passed to the file with this flag.

Depending on the exit status of the loader, ldd does different things:

* If it was 0, it calls try_trace "$file", meaning that the file is invoked directly. However, remember that for the status to be 0, the file must have an INTERP segment, meaning that the kernel will call the interpreter, typically the dynamic linker, i.e. /lib/ld-linux.so.2. A well behaved linker doesn't pass control to the file if LD_TRACE_LOADED_OBJECTS is set.

This is the insecure case. The interpreter can be set to any other application (it must be statically linked or it must dynamically link itself). However, if the attacker doesn't control the file system, there typically aren't any files which match this condition and either pass control to the application or do anything harmful when invoked with no arguments. The alternative is to set the interpreter field to be the file itself. The problem with that is the full path to the ELF file must be known by the attacker (or it must be known that the victim will call ldd from the directory in which the file resides). /proc/self/exe will not work because at the time the kernel reads the entry, it will still point to the executable of the calling process, usually bash in this case.

* If the exit status was 1, ldd quits (because it's an invalid file).

* If the exit status was 2 (no INTERP segment), the dynamic loader is invoked directly, i.e.

    try_trace "$RTLD" "$file"

These two latter cases are safe.

A bug which can significantly simplify the attack would be a mismatch between the dynamic loader and the kernel, where the dynamic linker thinks that the ELF file has an INTERP field and returns 0 for --verify, and then the kernel doesn't find the INTERP field and passes control directly to the application. I've poked a bit around and as far as I can tell, there's no such bug in the latest kernel and glibc releases. In addition, the kernel aborts the loading process for any errors related to loading the interpreter.

This 30C3 talk[0] discusses potential security issues arising because of mismatches between the various ELF parsers.

[0] http://www.youtube.com/watch?v=1-tUo6RUzBU


> the Linux version of strings is an integral part of GNU binutils...

I think almost everyone ships that version of strings and objdump, fwiw. FreeBSD and NetBSD ship an almost verbatim GNU binutils; OpenBSD's seems to have more local changes (partly b/c it's based on an older binutils they've diverged from), but its 'strings' still uses libbfd.

The only exception I ran across in some quick digging is that Illumos ships the Solaris version of 'strings', and doesn't ship an 'objdump'. This 'strings' seems to have come via Sun via Microsoft via AT&T via UC Berkeley: https://github.com/illumos/illumos-gate/blob/master/usr/src/.... Whether it's safer I haven't investigated; it also parses ELF files, but via its own libelf rather than GNU libbfd.


OpenBSD's strings outputs this: BFD: strings-bfd-badptr: invalid string offset 1179403647 >= 0 for section `'


Apple also does not ship this tool from binutils (though as they aren't dealing with ELF this may be out of the scope of your analysis). They have used their own version of strings which also descended from BSD (like the one you point to from Illumos), though more indirectly (they used some of BSD's code, but I think mostly rewrote it while they were NeXT).


I'm probably going to sound painfully naive now... but why is that a security risk? So libbfd reads past the end of a buffer and segfaults.. so what? It's not writing or executing anything untoward, so who cares?


The security risk is that input from the file that strings is run on ends up in places it shouldn't. One classic place is on the stack; after the random data your function puts on the stack is a bunch of bookkeeping information the compiler puts there, including the address to jump back to when the function returns. If you read user data onto the stack carelessly, you can overwrite the instruction pointer with attacker-controlled data, and the program will return from the function, jump there, and begin executing code.

In this case, he was just fuzzing with "AAAAAAAAAAA" and the program ended up at 0x41414141 ("AAAA"), which is alarming. The next step would be to figure out where the input file is in memory, replace AAAA with that address, and replace the data there with code. Now "strings" is executing CPU instructions that were in the input file. That's bad.

Some compilers do workarounds, like putting a canary value between the user data and the compiler data, and checking the canary before returning. Some compilers also randomize the location where things go in memory, so it's harder for an attacker to predict that address. Some OSes set certain pages to "not executable", so even if an attacker can jump to that memory address, the CPU won't run code from there. None of these are fixes; just barriers for the determined attacker. (W^X is easy to get around: just jump to code that's already legitimately in the binary, like execve()!)

The fix is to only write to memory you've actually allocated; something the C compiler will not help you with.
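To make the contrast concrete, a sketch (mine, not from the article) of how a bounds-checked language turns the same out-of-range write into an explicit error instead of silent stack corruption; the helper name is hypothetical:

```rust
// Sketch: a copy routine that refuses to write past the end of the
// destination, rather than trusting the caller the way C's strcpy does.
fn copy_into(dst: &mut [u8], src: &[u8]) -> Result<(), &'static str> {
    if src.len() > dst.len() {
        return Err("source larger than destination buffer");
    }
    dst[..src.len()].copy_from_slice(src); // lengths match, cannot overrun
    Ok(())
}

fn main() {
    let mut buf = [0u8; 4];
    assert!(copy_into(&mut buf, b"AAAA").is_ok());
    // 8 bytes into a 4-byte buffer: rejected, nothing is overwritten.
    assert!(copy_into(&mut buf, b"AAAAAAAA").is_err());
}
```

Even if the length check were forgotten, an out-of-range slice index panics rather than overwriting a saved return address.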


> That's bad.

To emphasize to the casual reader: 'bad' is a bit of an understatement: it's 'game over'.


Segfaults frequently indicate exploitable security holes, and in particular, the fuzz test shown in the article shows a segfault at a user-controlled address (0x41414141 is AAAA).
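To spell out that correspondence: the crash address is just the four input bytes reinterpreted as a little-endian pointer (a quick sketch, mine):

```rust
// 'A' is ASCII 0x41, so the input bytes "AAAA" read back as a 32-bit
// little-endian pointer give exactly the address in the crash report.
fn main() {
    let addr = u32::from_le_bytes(*b"AAAA");
    assert_eq!(addr, 0x4141_4141);
}
```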


Consider the case of a buffer overrun, a vulnerability that can often cause segfaults. A clever attacker could write a bunch of nasty code to that buffer, then use the buffer overrun to insert a jump statement back to that malicious code. Maybe they added the equivalent of `rm -rf $HOME` or something. Or maybe they decided to dial home with a reverse shell on your system. You have no idea, since you have no control over what code is being executed.


How will they do this if it's a read buffer overflow, not a write overflow?


A "read buffer" overflow is still nothing but a program reading some form of input and then overwriting the buffer. The target program has to read to somewhere -- namely to the buffer -- and thus it has to write over the boundaries of the buffer.

Because the buffer is local to the function, and because the function return address happens to reside at a higher address than the buffer itself, you probably get to overwrite the return address. Thus, after the function is executed, the execution doesn't return to the call-site but to the address you specified. Place your payload code in the buffer you provided, and overwrite the function return address to be the address of your buffer and you might do all sorts of fun things. Spawn a root shell (if suid binary), spawn a reverse shell, execute a kernel exploit to get root etc.


> A "read buffer" overflow is still nothing but a program reading some form of input and then overwriting the buffer. The target program has to read to somewhere -- namely to the buffer -- and thus it has to write over the boundaries of the buffer.

Who says that? That would be a write buffer overflow. The place where they write to might be properly allocated, so no memory that shouldn't be written is ever written. At least that is how I read it. The OpenSSL bug (heart bleed) was a read overflow. You couldn't use it to inject code, but you could use it to read out private keys.


Because you control a bunch of memory around that buffer anyway. It's reading a file that you have complete control over, after all. The fact that it crashed at address "41414141" strongly suggests it's exploitable. That value is like the "hello world" of testing vulnerabilities.


If something that shouldn't be written is written then yes. But from all the reports it sounds like it just reads something that shouldn't be read and produces a segmentation fault because it reads an unallocated page. I could be wrong about what happens, I haven't looked at it in detail. That is how the news reports sound to me, so I wonder how one could use that to execute code?


To quote the linked article: "The 0x41414141 pointer being read and written by the code comes directly from that proof-of-concept file and can be freely modified by the attacker to try overwriting program control structures."


I seem to have overlooked the "and written".


Having any segfault is historically a strong indicator that there are worse things you have yet to discover.


Segfault: interesting. Segfault at 0x41414141 when your input is "AAAA": alarming.


I don't see the vulnerability here either. The current responses to your post don't actually point to any exploitable vulnerability either.


The binutils maintainers aren't exactly responsive when it comes to following up on security-impacting bug reports: https://sourceware.org/bugzilla/show_bug.cgi?id=16825


Is 'cat foo|strings' immune to the problems of libbfd?


I wouldn't think so. The data being processed are the same.


Using cat and piping it to strings actually works without crashing. It seems that libbfd-style parsing is disabled on stdin.


I thought that reading from stdin might bypass the file type analysis.


Haven't checked but I'm pretty sure the binutils just look for "magic numbers" (ie are the first four bytes "0x7F 'E' 'L' 'F'") and don't look at any filesystem information.
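The format-agnostic mode everyone actually expects from strings is genuinely tiny; a rough Rust sketch (mine, approximating the default minimum run length of 4) that scans for printable runs and never parses ELF at all:

```rust
// Sketch of the "dumb" mode of strings(1): scan raw bytes for runs of
// min_len+ printable ASCII characters (plus tab), ignoring file format.
fn find_strings(data: &[u8], min_len: usize) -> Vec<String> {
    let mut out = Vec::new();
    let mut run: Vec<u8> = Vec::new();
    // A trailing zero byte flushes any run at end-of-input.
    for &b in data.iter().chain(std::iter::once(&0u8)) {
        if (0x20..0x7f).contains(&b) || b == b'\t' {
            run.push(b);
        } else {
            if run.len() >= min_len {
                out.push(String::from_utf8(run.clone()).unwrap());
            }
            run.clear();
        }
    }
    out
}

fn main() {
    let blob = b"\x7fELF\x00\x00hello world\x00\x01hi\x02usr/lib\x03";
    assert_eq!(find_strings(blob, 4), vec!["hello world", "usr/lib"]);
}
```

Note the ELF magic at the start is just more bytes here; "ELF" is dropped for being under the minimum length, not because the scanner recognizes it.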


Despite what the blog comments suggest, gdb does not have to link to libbfd.

But what about objcopy? Much more important utility than strings(1), in my opinion. I admit I rely on it and do not a have a substitute at the ready.

At some stage we need a BSD alternative to the GNU binutils (aside from gcc alternatives). I have seen it discussed several times over the years, but as far as I know it does not exist?


I wonder if this applies to 'file' as well.


Yes, for example this bug affects 'file': https://sourceware.org/bugzilla/show_bug.cgi?id=16825

The code in 'file' tries to parse the given file with every built-in format loader, so there are likely many more vulnerabilities like this one.


I don't think so. "file" only needs to read the first few bytes at the beginning of a file to guess what type of file it is, so it isn't likely to have any buffer overflow problems.


You might think so, but 'file' has its share of buffer overflows, integer overflows, attacker-inducible infinite loops, etc. It does some more extensive parsing for some kinds of files, and some of those end up with edge cases. Here's one from two days ago, a buffer overrun in parsing ELF files: https://access.redhat.com/security/cve/CVE-2014-3710. Others: https://security-tracker.debian.org/tracker/source-package/f...


So, can anyone suggest a safer alternative? Or should I use this as an excuse to pick up a new language?


Can't this bug be fixed?



