Am I the only person who thinks there's something fundamentally wrong with computing if running "strings" could let someone take over your computer? (I'm not being snarky; I seriously think the whole approach to security needs to be redone somehow.)
I think this is an indicator of how fundamentally over-engineered all of the GNU tools are. strings was supposed to be a simple tool that finds bits of data that look like human-readable strings. It wasn't meant to parse ELF binaries and suddenly be a security risk, especially since it's one of the first tools you would use in computer forensics.
The goal of the tool "strings" has always been to dump the string table of a binary object. It happens to also have a mode that lets you find random string-like content in any file, and it defaults to this mode if it can't parse the file as an executable. People have thereby gotten used to having this functionality at hand, and use the tool a lot for this purpose.
This is not, however, the actual goal or purpose of this tool. The fact that many people use Perl as nothing more than a slightly better version of sed doesn't mean that Perl's ability to write complex object-oriented software is "over-engineering". You just don't know what this tool is actually for, which is OK, but means you can't judge whether or not it is "over-engineered".
BSD, apparently going back to at least BSD 4.3, also had a strings tool, and it did the exact same thing: it parsed binary files to dump their string table. Apple's strings tool has no code heritage from the GNU version, instead being a vague descendant of the one from BSD 4.3. This is how this tool has always worked: stop being part of the noise trying to turn this into a GNU-bashing fest :/.
> The fact that many people use Perl as nothing more than a slightly better version of sed doesn't mean that Perl's ability to write complex object-oriented software is "over-engineering".
Not sure I agree with you on Perl in particular, but other than that I agree with what you're saying ;-) (That is, I don't think GNU being over-engineered is the (or perhaps even "a") problem here.)
>> Am I the only person who thinks there's something fundamentally wrong with computing if running "strings" could let someone take over your computer?
No, the underlying library (libbfd) is an example of something that should've been fixed a long time ago. Maybe not quite the horror that was/is openssl -- but clearly an example of "old c code that sort of works" -- perhaps in some ways like bash was/is. It's old, it works, but it could use some clean-up (as evidenced by a number of related buffer under/over-flows and whatnot).
Note that parsing arbitrary binary (or otherwise) input safely is a pretty hard problem. There was recently a (resource DoS) bug in libxml2, which has been under quite a lot of scrutiny lately (by virtue of being a brilliant injection vector for malicious code, if a bug can be found).
I read this as a two-part bug: one, a lot of people didn't know that strings did more complex parsing than a hexdump and a filter for printable strings (me included) -- and it turns out that the "smart" library isn't terribly robust (in other words: it's typical C code).
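That "hexdump and a filter for printable strings" behaviour really is only a few lines in a memory-safe language. A minimal sketch in Python (the four-byte minimum mirrors GNU strings' default; everything else is simplified):

```python
import string

# Bytes we treat as printable: ASCII text plus space, minus other whitespace.
PRINTABLE = set(string.printable.encode()) - set(b"\t\n\r\x0b\x0c")

def naive_strings(data: bytes, min_len: int = 4):
    """Yield every run of at least min_len printable bytes, like the dumb
    scan-the-whole-file mode of strings(1). Four is GNU strings' default
    minimum run length; no container format is parsed at all."""
    run = bytearray()
    for byte in data:
        if byte in PRINTABLE:
            run.append(byte)
            continue
        if len(run) >= min_len:
            yield run.decode("ascii")
        run.clear()
    if len(run) >= min_len:
        yield run.decode("ascii")
```

Run over an arbitrary file, this never interprets a single header field, which is exactly why it cannot be confused by a malformed one.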
While it is possible (at least in theory) to write small C utilities that are safe, once you want them to be (wildly) portable, with sane handling of various kinds of encoded strings and data, along with different endianness -- apparently most people screw up.
I think there are two basic camps wrt what should be done: those who think we need something like Rust, so that we can have safety without much of a slowdown, and those who say screw it, we're no longer running on 5 MHz (or 50 MHz) CPUs, we can take anything up to a 100x (10x) slowdown without it really being an issue -- security/stability/predictability is more important.
Those that can't decide between the two, continue writing C like it was still 1989, and we get lots of stuff like this.
I'm not sure if it's usually the mix of a "smart" C programmer writing a program that is patched by a "hobby" C programmer, or the fact that getting C right is just too hard -- or that people don't -Wall and don't run fuzzers and static checkers -- but whatever the reason, we keep seeing serious bugs in C programs.
I'd like to think some of it could be avoided if people wrote more "bloated" C with copious use of functions, more call-by-value, smaller loops, perhaps more computer generated code -- and other "slow" things (while still being C). But I'm probably hopelessly naive.
While I'm definitely not sold on C++ (at least not as a viable "better" C for systems programming), I think the old 1998 article[1] by Stroustrup on "simple" C and C++ programs illustrates quite well how hard C can be to get reasonably right, even for simple problems. Perhaps rather than waiting for Rust, a reimplementation of large parts of the backbone of our OSs/GNU in Guile, Lua, or some other higher-than-C-level language could be worthwhile.
As a side note -- does anyone know of any follow up on Stroustrup's article?
> I'm not sure if it's usually the mix of a "smart" C programmer writing a program that is patched by a "hobby" C programmer, or the fact that getting C right is just too hard -- or that people don't -Wall and don't run fuzzers and static checkers -- but whatever the reason, we keep seeing serious bugs in C programs.
I think the problem is that people don't take programming seriously. The basic flow of development seems to be to write the first thing that comes to mind as it comes to mind, and then incrementally patch things up until the code seems to work on every test case you've thought up.
This approach is at best inefficient when working with "safe" languages. With C or C++, it's nothing short of irresponsible. Part of the problem, I think, is that programmers are never taught to reason formally about their code: to catalogue their preconditions and postconditions and verify with some semblance of rigour that the code they write respects these. At best, they might be treated to a passing reference to object invariants if they happen to take a class or (God forbid) read a book on "OOP."
It's perfectly possible to write good, safe code in C and C++, but not if you're hung up on a "smart" or "hobbyist" programmer mindset, and not if you're not willing to put a fair amount of effort into thinking before you write.
I don't think that's the root of the problem. While you could make a simpler strings(1), which would help people who only use that one, more complex stuff like objdump(1) really does need to parse binaries. And that should be possible to do without worrying about security problems: you're just reading a file and extracting some information, which even in the worst case should be possible to do without accidentally executing arbitrary code. It's just that libbfd seems to have a lot of bugs, and because it's written in an unsafe language, such bugs can not only cause incorrect information or crashes, but sometimes attacker-controlled code execution. But if you don't do it via libbfd, you're going to need some library that can parse binaries, since many utilities end up needing to do it, and it shouldn't be impossible to safely do so.
An alternative is to implement a subset of full parsing specifically tailored to each utility. In the case of strings(1) that's very simple; in the case of some other utilities it's of intermediate complexity; all the way up to some that need to parse every corner of ELF. Whether that produces a bigger or smaller attack surface depends on a lot of factors: each parser might be simpler, but there are many more of them. FreeBSD was contemplating centralizing more of that into a common libelf, so I don't think it's only GNU who think that's a good idea in principle: https://wiki.freebsd.org/LibElf
The problem is not additional features, the problem is unsafe parsers. Adding features is natural and healthy. The damage is that we've been so afraid of parsers for so long that we associate "let's understand this bytestream better so we can be more useful" with implicit danger.
Some friends of mine are tackling this problem. You should help them. http://langsec.org
There's something fundamentally wrong with still using unsafe languages for system software.
C should have been buried decades ago, it's a toy language for small pieces of code on isolated systems (and yes, I've written larger programs in C myself, 20 years ago).
Nowadays, C should not even be considered safe for implementing interpreters and runtime systems for other languages. There are plenty of reasonably portable choices available (and no, don't use Java / anything JVM-based).
It can't be terribly hard to reimplement "strings" and similar software in a modern language without such deficiencies (i.e. plenty of ways to shoot oneself in the foot and overwrite the stack etc.).
The problem is, until now there has not been a very good systems programming language that allowed you to stay lean on memory without introducing any performance overhead. Rust is certainly a contender, thankfully.
So, there is a perfectly good reason C is still prevalent to this day, even if there are many security implications in doing so.
> The problem is, until now there has not been a very good systems programming language that allowed you to stay lean on memory without introducing any performance overhead. Rust is certainly a contender, thankfully.
We've only had Pascal and Modula-2 for what, 40-45 years?
>perfectly good reason C is still prevalent to this day
... it's not that and it's not good. It's just that security implications were largely ignored, people were lazy, and innovation and making sound decisions (rather than popular ones) have never been strong points of the OSS/Linux community (apart from the kernel itself).
It's a hobby horse by this point, but it makes it clear that the Pascal of 30 years ago is not the Pascal of today. The same can be said of C.
These languages have evolved to be where they are today, mostly because hardware has evolved to be where it is today. Castigating past decisions as laziness really seems to be ignorant of this process, what it involved, and why it was necessary to make the decisions that we have up until now.
Engineering, on the whole, is the art of compromise.
Reading through some gstreamer code this weekend, I contend that "C of 30 years ago is not the C of today".
It is still all the most horribly insecure and obtuse raw pointer manipulation and bitwise logic ever. The preprocessor is still hell on software. Every variable is named something like gst_hello_world_parser_box because when you write complex software in it you always get huge name conflicts.
It still thinks the best way to manage memory is to not manage it at all. People complained for so long about having to put your deletes in destructors in C++, too bad you have no such thing in C at all - you don't have classes, after all. Guess it has to go in the procedural function logic. Like 30 years ago.
> These languages have evolved to be where they are today, mostly because hardware has evolved to be where it is today.*
No. Pascal and Modula-2 were perfectly usable, safe, systems-programming-capable languages 20 years ago. Pascal was widely used in commercial product development (DOS/Windows); Modula-2 was scarce, but taught at universities, with high-quality development environments available for many platforms; and the Oberon OS was based on the language Oberon, which was heavily influenced by/derived from Modula-2.
Everything necessary to produce safe, maintainable, fast software was available back then, but lazy/uneducated/stubborn people used C instead and wrote crappy software we still have to use today. It's a shame really.
I'm sure Pascal was really the silver bullet and its failure had nothing to do with Pascal derivatives being academic, non-portable, not very expressive, and having numerous other issues.
I think your ad hominem comments about people who implemented in C are in poor taste and not based in reality.
The problems mentioned there (assigning to array elements with out-of-bounds indices) can be avoided by using range checks (-CR option for fpc). It's strange (and silly) that this is apparently off by default for fpc, but it's not a language feature of Pascal.
bwk's point, then as now, was that once you deviated from Standard Pascal, you had to either bet on a specific horse in the non-standard Pascal race, or develop your own Pascal-like language with its own inherent defects, similar to but different from the defects of Standard Pascal and other Pascal-like languages. Your code would likely never interoperate with anyone else's Pascal variant.
C, which had one de facto standard implementation and less genetic drift (because the standards, first K&R and then ANSI and ISO, were never as terrible as Standard Pascal's definition), didn't suffer from this nearly as much.
Mesa didn't, either, but only because it was fairly obscure, and there sure were a lot of Foogols running around for a while, of which Pascal and its semi-clones formed only one family.
The first K&R was published in 1978, ANSI/ISO C in 1989.
In 1981 C only mattered if you had access to a UNIX system, there were hardly any K&R compliant compilers.
Anyone using C outside a UNIX system was also betting on a specific horse.
Which is one of the reasons why ANSI C has so many undefined and unspecified behaviours, as the committee didn't want to rule out any implementation.
And then see it fall into disrepair and disuse because it becomes unmaintainable.
LISP is 50 years old. If it were going to become useful for high-performance software built by large-scale distributed development teams, it would have happened already.
No no, you are mistaken. His reply is a special one. See, it not only refers to your reply, but it refers to itself as well! It's amazingly efficient, no need to respond to it any further!
You mean the myth that C was the first systems programming language, ignoring the fact that its designers just decided to ignore what was being done since the early 60's?
OK, how about this: C was the first portable, non-assembly language used for writing OS kernels. Is that true? And I mean portable in practice, not just in theory, and portable beyond one family of computers.
No. OS were already being written in BCPL, Algol, Algol W, Algol 68, PL/I, PL/M and many other languages, battling for a place in the podium of systems programming languages.
C was tied to UNIX, just as the others were tied to the OS of their vendor.
C only became portable after UNIX was available in a few American universities outside AT&T, some students decided to create workstations based on UNIX, while others started to develop C compilers, so that they could continue their work on other systems.
BCPL, not so much. It was barely a high-level assembler.
Now Algol, it had:
- Bounds checked arrays
- Reference parameters to functions/procedures
- Real strings (not the first version in 1960 though)
- Explicit conversions
Quote from Tony Hoare's ACM Turing Award lecture[1]:
"A consequence of this principle is that every occurrence of every subscript of every subscripted variable was on every occasion checked at run time against both the upper and the lower declared bounds of the array. Many years later we asked our customers whether they wished us to provide an option to switch off these checks in the interests of efficiency on production runs. Unanimously, they urged us not to--they already knew how frequently subscript errors occur on production runs where failure to detect them could be disastrous. I note with fear and horror that even in 1980 language designers and users have not learned this lesson. In any respectable branch of engineering, failure to observe such elementary precautions would have long been against the law."
Are there examples of either OCaml or Haskell being used for systems programming? I know that algorithmically they can achieve pretty close performance, but I don't think support is that great for low-level operations.
"Ivory" is a Haskell embedded domain specific language for writing safe C programs. One (very loose) way of thinking of it is that it compiles a restricted subset of Haskell to low level C.
It's hard to tell what you mean by "low-level operations." A lot of systems programming, particularly the GNU suite, just deals with manipulating files, sockets, and so on, and Haskell can do that just fine.
Which is relatively inconsequential as long as there's a clear boundary between things that can be manipulated by a user and the C code.
It'd be better if it wasn't C at all, sure, but, from a security perspective I'd take a language with a C runtime over a language where code has to be carefully written in a memory-safe manner any day.
Whatever. One can always write unsafe code. No matter in what language. A really secure language is almost worthless because there's so much it can't do.
No, the problem comes from the other end. Strings should not be able to compromise a system. It only needs to read a user supplied file, and write to stdout. There's no reason for it to be allowed to exec a piece of code.
You seem to be arguing against yourself. If 'strings' were written in a safe language, compromising the system would be extremely unlikely. Or are you seriously arguing that languages that are safer than C aren't Turing complete?
There are plenty of memory-safe languages in which you can do nearly everything you'd do in C, and much, much safer. There's no reason whatsoever a program like 'strings' couldn't be written in a memory-safe language.
> If 'strings' were written in a safe language, compromising the system would be extremely unlikely.
Shellshock was a parsing bug; memory safety would not have helped there.
Bugs are bugs. When you get an out of bounds exception that leaves your program in an inconsistent state somewhere halfway up the call stack in a code path with poor test coverage, "safe" is not the correct word.
The premise of this conversation is flawed: A language isn't "insecure" or "secure." It's placed at a certain point in a safety spectrum ranging from "Do anything with no safety" to "Safely heat the room and achieve nothing."
What I'm saying is not that you can't write software that can be abused in memory-safe languages. I'm saying that you're much less likely to have extremely serious code execution vulnerabilities if you write a program in Go instead of C.
> What I'm saying is not that you can't write software that can be abused in memory-safe languages. I'm saying that you're much less likely to have extremely serious code execution vulnerabilities if you write a program in Go instead of C.
What I'm saying is that "much" less likely is overstating the difference. Most bugs in C programs are not buffer overruns, and even the ones that are would still be bugs in Go or Rust, they would just be a different kind of bug which is still plausibly exploitable under real conditions.
This is not a silver bullet kind of situation. "Rewrite everything in Go" is not actually a fix -- it's replacing code with 30 years worth of bug reports and vulnerability testing with completely untested new code, at the cost of significant resources that could better be used to fix the remaining vulnerabilities.
I'm not even saying that all the existing code is perfect. Replacing OpenSSL with entirely new code would probably do more good than harm just because the existing code is so ugly. But that's the exception rather than the rule.
There are a ridiculous number of serious vulnerabilities in C code related to lack of bounds checking and manual memory management. This isn't really opinion so much as fact. I understand what you're saying, but you're also downplaying the seriousness of C bugs relative to <most other languages> bugs.
> Most bugs in C programs are not buffer overruns, and even the ones that are would still be bugs in Go or Rust, they would just be a different kind of bug which is still plausibly exploitable under real conditions.
I don't follow. What's the equivalent of forgetting to check the length of an input string before chugging it into a too-small array in Go?
> "Rewrite everything in Go" is not actually a fix
I never said that, though. I'm not suggesting we rewrite the GNU tools in Go. But if you were to write the GNU tools from scratch today, C would be a bad choice simply because it's so easy to slip up with devastating effects, and there aren't many advantages to using it for simple tools like 'strings'.
> There are a ridiculous number of serious vulnerabilities in C code related to lack of bounds checking and manual memory management. This isn't really opinion so much as fact. I understand what you're saying, but you're also downplaying the seriousness of C bugs relative to <most other languages> bugs.
I feel like "<most other languages> bugs" tend to get ignored because it's not popular to blame <language> for bugs unless <language> is C. For example an enormous number of high severity CVEs are SQL injection but I never hear anybody saying we should replace SQL with a binary interface that clearly distinguishes statements from data, even though that would make more difference in practice than replacing C with something else.
> I don't follow. What's the equivalent of forgetting to check the length of an input string before chugging it into a too-small array in Go?
In Go the program terminates, which is at best a denial of service vulnerability. If the program is anything in the nature of Fail2ban then just causing it to die is a serious problem. Meanwhile when it restarts it will have to somehow deal with whatever corrupted state the crash left behind, which depending on the context can provide the attacker with opportunities to do arbitrarily bad things by manipulating the state to be something the programmer never anticipated. Being able to induce a restart is a huge increase in attack surface.
Immediate program termination is the "take cyanide capsule" solution to serious bugs. It may be better than some of the alternatives but it's still very bad.
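The contrast can be shown concretely. Go is the language under discussion, but the same distinction holds in any memory-safe language; here is a Python sketch, where the out-of-range access raises a well-defined error rather than touching adjacent memory:

```python
def read_field(buf, index):
    """Fetch buf[index], turning an out-of-range access into a recoverable
    error. In C the same mistake silently reads (or clobbers) whatever
    happens to sit next to the array."""
    try:
        return buf[index]
    except IndexError:
        # Letting the exception propagate instead would terminate the
        # program -- the "cyanide capsule" default; a long-running service
        # handles it here and carries on in a known state.
        return None
```

Either way the failure is loud and bounded: at worst a denial of service, never attacker-controlled execution.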
> For example an enormous number of high severity CVEs are SQL injection but I never hear anybody saying we should replace SQL with a binary interface that clearly distinguishes statements from data, even though that would make more difference in practice than replacing C with something else.
Using bind parameters, instead of putting data directly in queries plus escaping, has been standard for a long time now. That is, instead of saying "make sure to escape everything", like "make sure to avoid any bugs in C code", we indeed prefer to switch to a technique which doesn't act pathologically in the presence of small errors. In SQL's case it doesn't require replacing the whole language. It technically doesn't have to in C, either -- you could have bounds-checked C -- but I guess once you give up the absolute-maximum-performance goal, people prefer to use different languages.
edit: Also, while program termination is not ideal, in many cases, such as this one (strings), it is basically a non-problem, and at worst, denial of service is still loads better than arbitrary code execution.
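The difference is easy to demonstrate with the sqlite3 module from Python's standard library (the table and payload below are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'admin')")

hostile = "alice' OR '1'='1"  # classic injection payload

# Pathological: splicing user input into the statement text -- one missed
# escape and the payload becomes part of the query itself.
spliced = "SELECT role FROM users WHERE name = '%s'" % hostile
leaked = conn.execute(spliced).fetchall()   # injection succeeds, row leaks

# Robust: a bind parameter keeps the payload on the data side, so there
# is nothing to escape and nothing to forget.
bound = conn.execute(
    "SELECT role FROM users WHERE name = ?", (hostile,)
).fetchall()                                # no such user: no rows
```

The bound version stays correct even when the input is actively malicious, which is exactly the property escaping-by-convention lacks.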
> I feel like "<most other languages> bugs" tend to get ignored because it's not popular to blame <language> for bugs unless <language> is C. For example an enormous number of high severity CVEs are SQL injection but I never hear anybody saying we should replace SQL with a binary interface that clearly distinguishes statements from data, even though that would make more difference in practice than replacing C with something else.
I don't even use SQL, but I've heard a similar mantra plenty enough times to internalize it.
This is an API level change, not a protocol level change - but an API change is the correct answer anyways. If you change the protocol and slap an API like the SQL injection prone ones on top of it, you'll have the same vulnerabilities no matter what the protocol.
For bonus points, use static analysis to catch and forbid query strings that can't be trivially proven static.
> For example an enormous number of high severity CVEs are SQL injection but I never hear anybody saying we should replace SQL with a binary interface that clearly distinguishes statements from data, even though that would make more difference in practice than replacing C with something else.
I see 3x more memory corruption CVEs alone, than SQLI CVEs. I'm not convinced that replacing SQL (or SQL APIs) would be higher impact than replacing C and C++. And since I actually use C and C++ in my daily bread and butter, they're significantly more relevant to me.
But SQLI may be significantly more relevant to you. If that's the case, by all means, focus on them more.
> If the program is anything in the nature of Fail2ban then just causing it to die is a serious problem
If that causes a serious problem, then the architecture is bad and unsuitable for solving that problem (remember: programs can also die because of OOM conditions, flaky hardware, admin errors. If this leaves a gaping security hole open for an attacker, or leads to DoS, then you need a better approach; fail2ban functionality should be in the process handling logins, for example, not in a separate entity watching logfiles, which might be broken too).
> Being able to induce a restart is a huge increase in attack surface.
Compared to what? Certainly not C code, where very common issues are easily exploitable for arbitrary code execution.
Because I am sure that if you had ported bash to Go, you would still have the same issue with the broken parser. I am not sure if you would still have had Heartbleed, but I know you wouldn't have had Heartbleed if the OpenSSL people had used the platform libc instead of rolling their own, so I wouldn't consider Heartbleed an issue with C, but an issue with the programmers.
Multiple factors are to blame. I agree this is dumb behavior here by strings, but if the ELF parsing code were to be written in a language like Rust, it's far less likely it would have a bug of this nature.
C is scary from a security perspective because it is both incredibly easy to write code that has subtle but very serious bugs leading to e.g. arbitrary code execution, and it is easy to exploit those vulnerabilities.
So what you're saying is that the bugs from a piece of software written 20 years ago (roughly) can be solved by writing the program in a language that hasn't had a stable release yet? Sure there are better alternatives now, but what widespread, well known, portable language was available in the mid 90's that this could have been written in?
Despite the volume of messages on this thread about securing C or replacing it with a safer language for systems implementation, I think your answer is the only practical one.
If you haven't seen it already, you might want to check out HiStar:
Counterpoint: All the rails security exploits, python pickling and sql injection issues.
It is common to blame C for being unsafe but the real issue seems to be trusting user input, without sanitising it first. You could use a typed language to enforce sanitising by having a special type for user input (whether read from a file or received from the system) and have the converters sanitise it.
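A dynamic sketch of that idea in Python (a statically typed language could reject the unsanitised use at compile time; the `Tainted` wrapper and its whitelist are hypothetical names for illustration):

```python
class Tainted:
    """Wrapper for untrusted input that refuses to act like a plain
    string until it has been sanitised."""

    def __init__(self, raw: str):
        self._raw = raw

    def __str__(self):
        # Fail loudly if raw input leaks into a query, path, or command.
        raise TypeError("untrusted input used without sanitising")

    def sanitised(self, allowed: str = "abcdefghijklmnopqrstuvwxyz0123456789_") -> str:
        # Whitelist filtering: keep only characters known to be harmless.
        return "".join(ch for ch in self._raw if ch.lower() in allowed)
```

The point of the design is that the safe path (`sanitised()`) is the only path: forgetting to sanitise is a type error, not a silent vulnerability.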
Sure. I re-wrote strings a few years ago in C#, worked great. That was at Microsoft, and sadly the code can't be made public.
I've done most of my OS-level hacking in C and a limited subset of C++. I think it works well down there, where resources can't be "magic" and it's really important to know what's going on (and you often have to tell the optimizer "hands off").
At higher levels, it's lunacy to be using C. I sure wouldn't write a compiler in one, for instance.
They have every right - even if they translated the source code to "strings" to C# line by line, they aren't distributing the result, and hence have no GPL obligations whatsoever. By the same token, if they did distribute it under something other than the GPL (which they would), and it were found to be a "derived work", they would be infringing copyright. However far-fetched, it's no surprise they play it safe.
It's very common for places to claim ownership of all code you write as an employee, both in and out of work. I don't know whether this would stand up in court, but nobody I've spoken to has tried to find out.
As for the provenance of the code, it's quite probable it was a clean-room reimplementation. Writing such tools isn't exactly rocket science. I wrote a PDB-oriented addr2line for VC++ a few years ago, and it didn't have any of the original addr2line code in it at all. In fact, I've never even looked at the addr2line code. I just ran addr2line --help and copied the command line options I saw there. I suspect a reimplementation of strings would be just as straightforward.
Yes, it was clean-room, along the lines of "I need a program to extract strings from binary files." I don't think I even looked at a man page for an existing implementation, so the options and functionality are almost certainly different from other implementations of strings.
This was driven by Microsoft's hostile policies towards running anything open source inside the company (you have to get special dispensation to install Linux, for instance). Actually looking at outside source code is a big no-no.
There has been at least one case of a contractor including GPL3 code in a project, and MS responded by (a) letting the contractor go, and (b) releasing the source for the product in question. [And no, it wasn't Windows]
Pretty much anything written inside a company is born secret. There are exceptions to this, but shipping "free" code (especially from a bureaucracy that is ultimately controlled by people who want to make money, and that includes not giving competitors any advantage at all) is a big political deal.
I'll happily lose 5% compile performance in exchange for a safe compiler. Unfortunately, writing a compiler in a safe language isn't enough to make it a safe compiler.
Are you claiming that C is fast to compile? If compile speed mattered that much we'd all still be using Turbo Pascal/Borland Delphi, or perhaps more recently Go.
No, what I meant is that a compiler written in C is likely to be faster than a compiler written in another language, which is a good reason to write a compiler in C.
Nah. What you usually want from a compiler is maintainability and correctness. There's also a grand tradition for dogfooding by writing the compiler for language X in language X. If you don't, then how do you know you're actually making progress?
Out of curiosity, what's the problem JVM-based languages as a safe choice? (I'm legitimately curious here.)
It's certainly not the right solution for everything, but there are plenty of cases where I'd think it could be an excellent choice. Most of Android (except the kernel) is written in Java, as an example.
The most popular implementation of the JVM itself is written in C++. But more importantly, Java has experienced a multitude of serious security bugs over the years.
I'm not a huge JVM fan, but the majority of the Java vulnerabilities are related to a very different threat model that many languages don't even purport to guard against: safely running code that you know may be directly written by an adversary. The Java SecurityManager is supposed to allow that by sandboxing code with limited permissions, but has had a number of bugs, some of which can be exploited to let malicious apps break out of the sandbox. That's bad for cases where you are actually relying on running potentially malicious code, like applets in browsers. But not too relevant to the case where you'd be considering whether to choose Java, C, or C++ for a desktop or server app. In that case, treating the app itself as potentially malicious is not a common threat model: people don't normally run C/C++ apps in a sandbox. Although I suppose with the rise of Docker that might start becoming plausible for certain apps, relying on some OS facilities rather than a VM to do the sandboxing.
There have already been multiple security vulnerabilities in Docker, and likely there will continue to be more in the future. Virtualization and sandboxing are hard!
> The most popular implementation of the JVM itself is written in C++.
The amount of C++ code decreases with each release.
For example, version 8 of OpenJDK had quite a bit of code rewritten in Java, thanks to the work introduced with invokedynamic.
There are plans to eventually replace Hotspot with Graal and SubstrateVM, in some future version of the OpenJDK, thus reducing even more the C++ surface.
This is why project Sumatra is using Graal, and now JRuby is also playing with it.
The vast majority of which were actually in the C/C++ code used to implement the most popular JVM. It is in C++ for historical reasons, not because C++ is needed to implement a JVM.
And, as other commenters note, the amount of C++ decreases each release because it's less and less useful to keep any of the JVM's code in C++.
In Graal, calls to the OS are going to be more limited in surface area than in HotSpot's C2.
For all these reasons, removing an entire class of errors (e.g. a buffer overflow leading to code execution) is going to give more secure code.
Not that I think Java is a silver bullet that will solve all problems. I do think that a language like Java or Rust is a major improvement over C with minimal performance issues.
Garbage collection is a deal breaker for some systems software, you can't use the "fork and start a new process" model for utilities, and memory usage on small systems is going to be an issue too.
The problem is that "safer" languages just encourages programmers that don't really think carefully about what they're doing because of the "the language is safe, it'll protect me from everything" effect. "Simple" bugs get hidden, programmers are encouraged to create more complex systems as a result, and the bugs thus created become even more subtle and difficult to find.
Is it really so bloody hard to ensure that e.g. the fields of the executable header, if they're offsets, are actually valid values? I've written tools that work with PEs, many of them reading the entire file into a buffer first, and "make sure you're inside the file" was one of the points I always kept in mind.
I say we need to fix how programmers think, not the language, because the same mindset that leads to bugs like these in "unsafe" languages will also lead to (maybe less severe, maybe more severe but also more subtle) bugs in "safe" languages too.
> The problem is that "safer" languages just encourages programmers that don't really think carefully about what they're doing because of the "the language is safe, it'll protect me from everything" effect. "Simple" bugs get hidden, programmers are encouraged to create more complex systems as a result, and the bugs thus created become even more subtle and difficult to find.
Citation needed. The plethora of overly complicated C code is a strong argument against this claim.
> The plethora of overly complicated C code is a strong argument against this claim.
There certainly is plenty of overly complicated C code, but I doubt there is more of it than there is the amount of overly complicated code written in some of the other popular "safe" languages: Java and JavaScript.
If you are disregarding Java when you are looking for portable options, then you are IMHO limiting yourself quite a bit and could be missing out on a good solution to your problem.
For software like "strings"? Startup times and memory footprint/performance due to JIT compilation / the JVM, as well as the notorious complexity and bugginess of the runtime. Just consider the frequent updates and incompatibilities you'd get with it - for example, we had plenty of Java-based remote management GUIs which wouldn't run on newer Java runtime versions. Maybe with the (limiting) static compilation options available for Java it'd make more sense.
http://www.erights.org/talks/no-sep/ agrees that there's a basic problem with having a program run as the user. "Treating security as a separate concern has not succeeded in bridging the gap between principle and practice, because it operates without knowledge of what constitutes least authority. Only when requests are made -- whether by humans acting through a user interface, or by one object invoking another -- can we determine how much authority is adequate. Without this knowledge, we must provide programs with enough authority to do anything they might be requested to do."
It's also a basic problem that so much complicated systems software is written in C, but that's better known.
Having a trivial binary format is not particularly great, nor is supporting one binary format and not another. Yet when you get to the complexity of ELF parsing, while I'm sure it's possible to code defensively and end up with fewer bugs than something like libbfd, parser bugs in general are the bread and butter of C.
Solution? In my opinion, either don't bother parsing any binary formats (who actually needs that functionality?), or use a safe language.
Even without such vulnerabilities, I would be wary of printing out stuff from any untrusted files in a terminal. Most terminal emulators have been vulnerable to escape character attacks at some point.
Even on 'valid' binaries, it still tends to mess up your terminal. I noticed that pretty quickly when I started working with Linux. Are there really people that work with cat and grep on binary files?
Perhaps not intentionally, but cat has valid uses for concatenating binary files and sometimes they end up going to the terminal just by accident. As far as grep goes, the answer to your question is "yes":
http://stackoverflow.com/questions/9988379/how-to-grep-a-tex...
Of course you can use grep on binaries perfectly safely, if you just don't print all the results to the terminal. Use `grep -lr <pattern> <dir>` to find binaries that contain a certain pattern, and `grep --byte-offset --only-matching --text <pattern> <file>` to find the offsets in a file.
I'm a regular user of this utility and this came as a complete surprise to me. So much so that I checked the source myself before believing the article.
I thought "strings" was just a dumb scan over the file. Does this mean that with a properly crafted binary it is also possible to hide strings from a quick check with "strings"?
Yes, though a properly crafted binary has always been able to hide from strings with even a minimal amount of obfuscation or encryption. There's no way to know all the strings a program can output without running it.
Of course, a properly crafted program can always arbitrarily obfuscate strings.
But if you manage to trick libbfd into thinking it's looking at a particular format, you can hide plain text from a simple "strings" invocation even in files that are not executable. I've been using "strings" on all kinds of files, not only executables, and assumed that it will always display all sequences of printable characters present in the file.
Ah, I see. Fair enough, but I don't think it was ever a great assumption that "strings" would uncover all the text in a file. There are so many ways to screw with a file at the byte level that could confuse "strings" but still appear fine when read by an application.
This tool has always, going way back to at least BSD 4.3 (if not earlier), been a tool for dumping the strings table of a binary object, which just so happened to also have a fallback mode for things it didn't know how to parse as an object file.
It's time to start converting the low-level Linux/UNIX utilities to a language with subscript checking. Go, or Rust (if and when it's finished), or D, or something. We have some good options now.
The Linux kernel and the core utilities needed for a user-friendly OS add up to a mind-boggling amount of code, written by thousands of hobbyists over the course of decades. That code base has the benefit of actually existing, being familiar to a lot of people, and (mostly) behaving in predictable ways that are consistent from one un*x-like system to another.
So in that sense, it's somewhat counterproductive to just say "somebody oughta rewrite this stuff", unless (like RMS) you're willing to dedicate a good chunk of your life to that mission - or think that your post will inspire somebody else to do the same.
I agree completely about C. I've been saying this for years. There are three big problems that cause crashes in C programs: "How big is it?", "Who owns it?", and "Who locks it?". The result is over three decades of segfaults and buffer overflows.
There have been three or four variants on C which address some of those issues. I've proposed one myself. None got any traction. The only thing that might work is if someone developed a safe variant of C which could be machine-generated from existing C code, and didn't add significant overhead. GCC already has a fat-pointer subscript checking option, but nobody uses it. That approach is usually slow, with a subscript check on every reference. If you do it right, most subscript checks get hoisted out of loops. Go does that for many FOR statements.
Rust is one of the very few languages which addresses all three of those issues without resorting to garbage collection. I really hope the Rust crowd doesn't screw it up.
> GCC already has a fat-pointer subscript checking option, but nobody uses it
My experience with adding extensions to C++ is that nobody will use them, not even the people who proposed the extension, unless it is adopted by the Standard. The same goes for C.
The feature I proposed for C has been in D since the beginning, and has a very strong track record of success - both in user acceptance and in eliminating bugs. Whether the runtime bounds checking is actually done or not is controlled by a compiler switch - but most users choose to leave it on.
How would that fix the problem at hand? The code in question isn't using array notation. At the end of the day, it's not a case of using C when another language would be better, it's a case of crappy coding.
It helps because instead of rewriting the whole app, just the function parameter types are redone where pointers to data are changed to bounds checked arrays, on a case by case basis.
It isn't a magic bullet, but as buffer overflows are (I presume) the most common cause of C security exploits, this would help a lot.
Maybe, but my fear would be that the act of rewriting it to use array notation will introduce yet more bugs. Some of the comments point out that the problem exists because nobody cared enough to fix it for several years. Having seen that nobody cared enough to fix it, it's not clear that anyone will care enough to fix it properly.
Of the three you described, Rust makes the strongest memory safety guarantees. Neither D nor Rust require a complicated runtime. I'd say this isn't really Go's intended area of usage.
Also, Go is missing things like ASLR and DEP (last time I checked), which means that if you link to any vulnerable non-Go code using cgo (which is almost inevitable when writing core utils), or if you find a good bug in the Go runtime, it's trivial to get code execution.
Address space randomization is security theater. It slows down attackers, but doesn't stop them. See "stack spraying". What it does prevent is replication of security bugs, which allows vendors to ignore them.
If it slows them down and makes their lives harder, then it is a worthy mitigation. That's what a mitigation is, after all: it reduces severity and effectiveness. It's not a full stopgap. Even still, ASLR implementation quality has left much to be desired throughout the years.
No it isn't. If trying to exploit some daemon requires thousands of tries instead of one try, I am thousands of times more secure, because they will almost certainly fail the first time, causing the daemon to crash, and then have no more tries. It is only security theater if you live in the land of "let's use broken software that crashes all the time and then run a monitoring tool to automatically restart it". That is the problem.
Or, the authors of Linux/UNIX utilities should have implemented, and should be implementing, bounds checking. It seems excessive to switch languages rather than encourage stronger programming habits.
Are there any examples of that being successfully done? Even djb's software, intentionally written to be minimalist and as secure as possible, has had exploitable overflows (both qmail and djbdns have suffered from this). Every Linux and BSD distribution (even OpenBSD) has suffered buffer overflows so severe that arbitrary internet users could get remote root access. Etc.
> Even djb's software, intentionally written to be minimalist and as secure as possible, has had exploitable overflows (both qmail and djbdns have suffered from this).
Which of these are you describing as a buffer overflow?
That's one approach, yeah. I used to think it was a likely one, but I now think three others are more likely:
1. A language that is low-level and safe but also gives you enough interesting & new to build some buzz/interest, rather than "just" safety. Rust is a candidate here, perhaps.
2. Static analyzers in C advance to the point where a subset of C large enough to be useful can be routinely checked for common types of errors. And it then becomes socially expected that at least core OS stuff will be written in that "checkable" subset of C, treating "unable to prove safety" warnings as errors, or at the very least as suspicious.
3. Mitigate it at the OS level with finer-grained access controls. Utilities like strings(1) or objdump(1) are the easy case here: they do not need to actually have permissions other than "read a file" and "print to output". Even in the worst case, arbitrary code execution in objdump(1) should not be able to delete your home directory, join a botnet, or email your ssh key somewhere, because objdump(1) does not need those permissions. FreeBSD's libcapsicum looks promising, in the sense that it is actually being implemented in the base system, rather than just being yet another ACL proposal going nowhere (Solaris/Illumos also has an actually-shipped privileges system, but I don't know how extensively the base install itself uses it).
1. "Just" safety is hardly peanuts given the status quo. In fact, "just" safety would be much more practical. It'd be much easier to port all the existing code to a C dialect (like Cyclone) than rewrite from scratch in something like Rust.
2. I find compiler instrumentation (think AddressSanitizer and Mudflap) to be more promising than static analysis. Much of the latter is still stuck in the lint era and give out too much noise. That said, tools like Coverity have come a long way and I know a lot of FOSS projects use them frequently. I personally haven't.
3. Capsicum is quite promising, indeed. I like that it extends the existing file descriptor metaphor and offers sandboxing based on namespaces instead of system calls (unlike seccomp), as opposed to the crufty POSIX 1003.1e capabilities which are underdeveloped and still limited to executable processes, AFAIK. That said, we shouldn't just rely on sandboxing, jailing and capability-based security. We need to fix underlying application bugs, as well (the applications that implement the capabilities and sandboxing themselves, particularly so!)
No. Three decades of C have clearly demonstrated that trying to make programmers be "very careful" will not work.
(This evening, I'm struggling with a broken "gedit" on Ubuntu. It turns out that editing a sufficiently large file will break "gedit": not just crash it once, but mess its configuration up so badly that future uses of "gedit" hang the entire GUI.)
It's not the utilities themselves that are the bigger problem, but rather a library the utility uses: libbfd, in this case. The problem with writing libraries in anything other than C, of course, is that they are not easily usable from a large variety of software.
All but one of these are Wirthian languages. They all had overhead for their time period, and many of them were academic rather than pragmatic (consequently becoming influential for language designers).
Ada? You're trying too hard. It used to be a government standard and wasn't ignored in the least (see: http://www.seas.gwu.edu/~mfeldman/ada-project-summary.html)... much to a lot of people's chagrin. It's widely regarded as an example of a monster language, but it is still absolutely used where the safety is critical.
> They all had overhead for their time period, and many of them were academic rather than pragmatic
What overhead? The one claimed by the C crowd, without any experience in said languages?
I used a few commercial compilers for those languages. They were quite comparable in terms of generated code quality to C compilers of the same generation, back in the day.
C was also a research language until AT&T made the code available.
> It's widely regarded as an example of a monster language, but it is still absolutely used where the safety is critical.
Actually, I would dare to say that Ada 2012 is smaller than C++14.
Its use has increased in Europe thanks to exactly what is being discussed here: the amount of money lost to security issues caused by the industry's adoption of C through its relation to UNIX.
I included Ada in the list, because most developers aren't aware that it still exists and is being used. Or that GNAT is just one of many compilers that are still available.
Ada (and I guess the other ones) are unlikely to be memory safe in the presence of arbitrary pointers without a GC. That is, having a pointer/reference deep into a vector or some nested element of a tree requires a garbage collector or is unsafe (runs the risk of becoming dangling). So, while they are possibly better than C in many ways there are others in which they are lacking.
This lack of references means code is forced to do more copies or data-structure lookups.
(I don't actually know Ada or those other languages, but I did do some research a while ago and discussed this publicly here and on /r/programming a few times, and have never been corrected, so I guess it is close to correct.)
> having a pointer/reference deep into a vector or some nested element of a tree requires a garbage collector or is unsafe (runs the risk of becoming dangling)
Yes, this is a problem, certain data structures need to be intrusive in order to be fast. Rust for example is a highly memory-safe language but its "unsafe" mechanism needs to be used in order to implement performant data structures for this reason.
That is not what I'm talking about. e.g. (in Rust)
let map: BTreeMap<u32, SomeHugeThing> = make_map();
let value: Option<&SomeHugeThing> = map.get(&123);
`map` is a (recursive) tree that associates a `u32` key with a value of type `SomeHugeThing`; imagine that it is, e.g., 1 KB, or otherwise expensive to copy. The `value` is a direct pointer to the memory of the value associated with the key 123 (if it exists); that is, it is extremely cheap to manipulate `value` because it is basically a machine pointer directly into the memory of the `BTreeMap`. Rust gives you the power for that to be perfectly safe without a GC: there's no risk that changes to the `map` structure will cause the `value` pointer to be invalidated.
As far as I know, Ada etc. do not allow for this without a GC. That is, there's no way to have a safe pointer directly into the memory controlled by some dynamic data structure. Instead, one would have to search the `map` each time the value is wanted, or turn on GC.
> its "unsafe" mechanism needs to be used in order to implement performant data structures for this reason.
Implement some performant data structures. There are a lot of data structures that can be implemented performantly without being intrusive (sure, one might want to use `unsafe` occasionally to optimise them fully, but the `unsafe` is almost always not being used to make them intrusive).
Also, I don't understand the relevance of that link, the top comment (which is mine, btw) clearly demonstrates that `unsafe` is not necessary.
Ada specification allows for a GC, but few implementations provide one.
Oberon and Modula-3 have GC and had real usable OS implemented in them, not just some kind of concept OS.
However those languages already offer the following in terms of security over C:
- String data type
- Open arrays (aka slices nowadays)
- Reference parameters (no need to pass pointers to functions/procedures)
- Bounds-checked arrays (you can turn bounds checking off, if you really need it)
- Pointer arithmetic is explicit operation
- Casting between types is explicit
- Enumerations are their own types, there is no implicit conversion to/from ints.
While they still don't cover all use cases in terms of memory safety, they already cover quite a few scenarios that in C just lead to unsafe code without the help of a static analyser.
So the Rust library you write is fully self-sufficient, without calling into any Rust stdlib functions or using any runtime initialization mechanisms like ctors and dtors? That sounds too good to be true: if I am calling Rust then I am getting the Rust runtime, just like if I call strlen I am getting the C library.
You can use a significant portion of Rust with only libc (i.e. no different to C) and this will be increasing. Things like "constructors" and destructors do not need the support of a runtime at all, they're entirely just conventional static function calls.
That is to say, the only overhead/problem with calling using C vs. using C and Rust is the increased binary size of having additional code included. There's no loss of flexibility (as long as you're not trying to run on a tiny microcontroller).
That does sound great overall, and I will take a deeper look at Rust. But this thread is about rewriting the core Linux utilities in Rust; if I had to do that with full-featured Rust it would be hard enough. If I had to restrict myself to a subset to avoid the "runtime penalty" for programs embedding my Rust shared library implementation of libbfd, then the rewrite just went from nearly impractical to completely impractical in terms of the effort required.
> But if I had to restrict myself to a subset to avoid the "runtime penalty" for programs embedding my Rust shared library implementation of libbfd
The point is there is no 'runtime' penalty with Rust†. It doesn't have a compulsory garbage collector, it doesn't have a compulsory complicated IO manager, it doesn't have a compulsory multiplexed threading system. In some circumstances lacking any of those will be a downside, of course, but not having it built-in and compulsory means Rust gains flexibility. In several ways, there's actually lower overhead with Rust, because it is a modern language designed around the advances and research that has happened, e.g. dynamic (de)allocation can be faster due to putting some sizedness requirements on the very lowest level APIs (which one rarely calls directly, so the programmer will never notice the restrictions, just the improved speed).
My comment was trying to address the Go <-> C <-> Rust bridge, pointing out that C and C <-> Rust are essentially the same, other than the extra code one must have from writing in two languages rather than one. If one was to forgo the C (which is possible, Rust can easily be used to write a library directly exposing a C ABI), there won't be much difference at all between C and Rust.
In summary, in future, any restrictions required would not be very significant, and, even now before Rust's runtime has been excised, the restrictions will be things like "no network IO", one still gets access to all sorts of fancy iterators and data structures without requiring any runtime.
†Strictly speaking, this isn't quite true right now, but there is a concrete plan currently being executed to make it true.
Wow, in that case Rust is looking like it could be hell of a lot better C replacement! If I could just write all my libs in Rust using the full power of the language and then just call them easily from any language without ungodly overhead or complexity like in case of say Java/JNI, that's nothing short of fascinating!
It's not all brilliant though, there's still a lot of useful tooling that Rust needs to develop to make it really nice, e.g. there's no nice way to create a C header file describing a Rust library (since C headers are the lingua-franca of FFI libraries, in many ways), https://github.com/rust-lang/rust/issues/10530 .
And, being a young language, it's not too hard to be doing something no-one else has ever done, especially with this low-level stuff. :) (Meaning compiler bugs and some "I don't know" answers.)
You can use the core library to get standard functions with no runtime. I'm not sure what you're saying about constructors and destructors. There's nothing special about calling strlen that adds overhead.
So we would need to write Rust libraries without using Rust stdlib? Not exactly a great advantage in that case. Especially so when you are rewriting a ton of C code.
It's about the context - rewriting a ton of code that is in C libraries right now. If that had to be done using Rust without its runtime libs, it is a huge disadvantage for solving the issue at hand.
How high is that penalty, relative to, say, loading a variety of C shared libraries? At some point safety and security trumps saving a few kilobytes of memory...in a server environment, that point probably has already been passed a decade ago. Of course, if the penalty is hundreds of megabytes, then the equation might add up differently.
But, we already have a huge number of system tools and utilities running in a huge number of languages. Python has become the lingua franca of Linux management tools; Perl was in that role in the past, and still exists in a lot of places; bash and shell are integral. All have separate runtimes, and sometimes interact with system-level C libraries, either directly or through command line interfaces.
Is it really that big of a deal to have a Rust or Go shared library in a system that already has a half dozen different languages and a variety of shared and unshared libraries existing at once? I suspect it would be unnoticeable on a modern system. I'd choose safety over shaving a few kilobytes of memory used.
There would be a cost in having that interface friction for developers...and we'd be paying that cost for years. But, it seems like both Rust and Go have planned for the languages to be integrated with C libraries from the beginning, so it seems less of a problem. If I were going to attempt such a thing, I'd probably start with the front end utilities that use the libraries and then work my way down to converting the libraries (even though the libraries, in this case, are where the problems lie). But, maybe it's possible to make Go or Rust code that provide the same C interface as the existing libraries...I don't know enough to even guess.
Ugh, wouldn't that fall in the "just because you can do something doesn't mean you actually should" category? :) Suppose I wrote the library in Go and exported some type of C wrapper, and then I loaded that in Rust using the Rust->C mechanism - that will load the Go runtime into Rust! And you've still got non-trivial C code to deal with anyway!
If you want to interface two languages, then I do not see a way around involving the runtime of both languages (in the same way that linking to C code still uses the C runtime). However, using my method does not actually involve any C code. "extern C" makes the compiler produce object code that can be called as if it was produced by C code; it does not produce any C code that would compile to the given object code.
Of course, it will also produce C headers to give the type information of the exported symbols, but I wouldn't call that code.
Crap, so if `objdump` is likely vulnerable to overflows, and `ldd` is a simple bash script ripe for abuse¹, is there a safe and easy way to determine dynamic library dependencies in an executable?
From the readelf man page:
> This program performs a similar function to objdump but it goes into more detail and it exists independently of the BFD library, so if there is a bug in BFD then readelf will not be affected.
ldd actually executes the program being probed; it is not a static analyzer. It's only safe if you trust the program in the first place.
From my /usr/bin/ldd:
try_trace() (
output=$(eval $add_env '"$@"' 2>&1; rc=$?; printf 'x'; exit $rc)
rc=$?
printf '%s' "${output%x}"
return $rc
)
…
# If the program exits with exit code 5, it means the process has been
# invoked with __libc_enable_secure. Fall back to running it through
# the dynamic linker.
try_trace "$file"
It's insecure, but it's actually a bit more nuanced. What I'm describing applies to up-to-date glibc, it might have been even more insecure in the past.
ldd first calls the dynamic linker with --verify, i.e.:
/lib/ld-linux.so.2 --verify $file
This is the system's dynamic linker, which should be considered safe. The --verify parameter tells the dynamic linker to check if the file is a valid dynamically linked file. Assuming that the file exists and is an ELF file, the dynamic linker exits with status 1 if the file doesn't have a DYNAMIC segment, with 2 if the file doesn't have an INTERP segment, or with 0 if it has both (the typical case). Execution is never passed to the file with this flag.
Depending on the exit status of the loader, ldd does different things:
* If it was 0, it calls try_trace "$file", meaning that the file is invoked directly. However, remember that for the status to be 0, the file must have an INTERP segment, meaning that the kernel will call the interpreter, typically the dynamic linker, i.e. /lib/ld-linux.so.2. A well behaved linker doesn't pass control to the file if LD_TRACE_LOADED_OBJECTS is set.
This is the insecure case. The interpreter can be set to any other application (it must be statically linked or it must dynamically link itself). However, if the attacker doesn't control the file system, there typically aren't any files which match this condition and either pass control to the application or do anything harmful when invoked with no arguments. The alternative is to set the interpreter field to be the file itself. The problem with that is the full path to the ELF file must be known by the attacker (or it must be known that the victim will call ldd from the directory in which the file resides). /proc/self/exe will not work because at the time the kernel reads the entry, it will still point to the executable of the calling process, usually bash in this case.
* If the exit status was 1, ldd quits (because it's an invalid file).
* If the exit status was 2 (no INTERP segment), the dynamic loader is invoked directly, i.e.
try_trace "$RTLD" "$file"
These two latter cases are safe.
A bug which could significantly simplify the attack would be a mismatch between the dynamic loader and the kernel, where the dynamic linker thinks that the ELF file has an INTERP field and returns 0 for --verify, and then the kernel doesn't find the INTERP field and passes control directly to the application. I've poked around a bit and, as far as I can tell, there's no such bug in the latest kernel and glibc releases. In addition, the kernel aborts the loading process for any errors related to loading the interpreter.
This 30C3 talk[0] discusses potential security issues arising because of mismatches between the various ELF parsers.
> the Linux version of strings is an integral part of GNU binutils...
I think almost everyone ships that version of strings and objdump, fwiw. FreeBSD and NetBSD ship an almost verbatim GNU binutils; OpenBSD's seems to have more local changes (partly b/c it's based on an older binutils they've diverged from), but its 'strings' still uses libbfd.
The only exception I ran across in some quick digging is that Illumos ships the Solaris version of 'strings', and doesn't ship an 'objdump'. This 'strings' seems to have come via Sun via Microsoft via AT&T via UC Berkeley: https://github.com/illumos/illumos-gate/blob/master/usr/src/.... Whether it's safer I haven't investigated; it also parses ELF files, but via its own libelf rather than GNU libbfd.
Apple also does not ship this tool from binutils (though as they aren't dealing with ELF this may be out of the scope of your analysis). They have used their own version of strings, which also descended from BSD (like the one you point to from Illumos), though more indirectly (they used some of BSD's code, but I think mostly rewrote it back when they were NeXT).
I'm probably going to sound painfully naive now... but why is that a security risk? So libbfd reads past the end of a buffer and segfaults.. so what? It's not writing or executing anything untoward, so who cares?
The security risk is that input from the file that strings is run on ends up in places it shouldn't. One classic place is on the stack; after the random data your function puts on the stack is a bunch of bookkeeping information the compiler puts there, including the address to jump back to when the function returns. If you read user data onto the stack carelessly, you can overwrite the instruction pointer with attacker-controlled data, and the program will return from the function, jump there, and begin executing code.
In this case, he was just fuzzing with "AAAAAAAAAAA" and the program ended up at 0x41414141 ("AAAA"), which is alarming. The next step would be to figure out where the input file is in memory, replace AAAA with that address, and replace the data there with code. Now "strings" is executing CPU instructions that were in the input file. That's bad.
Some compilers add workarounds, like putting a canary value between the user data and the compiler's bookkeeping data and checking the canary before returning. Some compilers also randomize the location where things go in memory, so it's harder for an attacker to predict that address. Some OSes mark certain pages as non-executable, so even if an attacker can jump to that memory address, the CPU won't run code from there. None of these are fixes; they're just barriers for the determined attacker. (W^X is easy to get around: just jump to code that's already legitimately in the binary, like execve()!)
The fix is to only write to memory you've actually allocated; something the C compiler will not help you with.
Segfaults frequently indicate exploitable security holes, and in particular, the fuzz test shown in the article shows a segfault at a user-controlled address (0x41414141 is AAAA).
Consider the case of a buffer overrun, a vulnerability that can often cause segfaults. A clever attacker could write a bunch of nasty code to that buffer, then use the overrun to redirect execution back to that malicious code. Maybe they added the equivalent of "rm -rf $HOME" or something. Or maybe they decided to dial home with a reverse shell on your system. You have no idea, since you have no control over what code is being executed.
A "read buffer" overflow is still nothing but a program reading some form of input and then overwriting the buffer. The target program has to read to somewhere -- namely to the buffer -- and thus it has to write over the boundaries of the buffer.
Because the buffer is local to the function, and because the function return address happens to reside at a higher address than the buffer itself, you probably get to overwrite the return address. Thus, after the function is executed, the execution doesn't return to the call-site but to the address you specified. Place your payload code in the buffer you provided, and overwrite the function return address to be the address of your buffer and you might do all sorts of fun things. Spawn a root shell (if suid binary), spawn a reverse shell, execute a kernel exploit to get root etc.
> A "read buffer" overflow is still nothing but a program reading some form of input and then overwriting the buffer. The target program has to read to somewhere -- namely to the buffer -- and thus it has to write over the boundaries of the buffer.
Who says that? That would be a write buffer overflow. The place where they write to might be properly allocated, so no memory that shouldn't be written is ever written. At least that is how I read it. The OpenSSL bug (Heartbleed) was a read overflow. You couldn't use it to inject code, but you could use it to read out private keys.
Because you control a bunch of memory around that buffer anyway. It's reading a file that you have complete control over, after all. The fact that it crashed at address "41414141" strongly suggests it's exploitable. That value is like the "hello world" of vulnerability testing.
If something that shouldn't be written is written, then yes. But from all the reports it sounds like it just reads something that shouldn't be read and produces a segmentation fault because it reads an unallocated page. I could be wrong about what happens, I haven't looked at it in detail. That is how the reports sound to me, so I wonder: how could one use that to execute code?
To quote the linked article: "The 0x41414141 pointer being read and written by the code comes directly from that proof-of-concept file and can be freely modified by the attacker to try overwriting program control structures."
Haven't checked, but I'm pretty sure the binutils just look for "magic numbers" (i.e., are the first four bytes 0x7F 'E' 'L' 'F'?) and don't look at any filesystem information.
Despite what the blog comments suggest, gdb does not have to link to libbfd.
But what about objcopy? A much more important utility than strings(1), in my opinion. I admit I rely on it and do not have a substitute at the ready.
At some stage we need a BSD alternative to the GNU binutils (aside from gcc alternatives). I have seen it discussed several times over the years, but as far as I know it does not exist?
I don't think so. "file" only needs to read the first few bytes at the beginning of a file to guess what type of file it is, so it isn't likely to have any buffer overflow problems.