It's called the short string optimization. It has been used in real implementations before, but I'm not sure whether it still is. C++11 move semantics may help a lot with string churning...
As others have mentioned, some libraries do use it. It has tradeoffs, though. In a "traditional" implementation sizeof(std::string) == sizeof(char*) -- it keeps a pointer to the first byte of the text, with the metadata (minimally: size and capacity) stored just before it in memory. c_str() is just "return p_;" and size() is something like "return reinterpret_cast<const size_t*>(p_)[-1];"
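Roughly what that looks like as a sketch -- the class/field names, exact header layout, and lack of error handling are mine, not from any particular library:

    #include <cstddef>
    #include <cstdlib>
    #include <cstring>

    // "Traditional" layout: the object is a single pointer, and the metadata
    // lives in the heap block immediately before the characters.
    class SimpleString {
        struct Header { std::size_t capacity; std::size_t size; };
        char* p_;  // first character; the Header sits right before it

    public:
        explicit SimpleString(const char* s) {
            std::size_t n = std::strlen(s);
            auto* h = static_cast<Header*>(std::malloc(sizeof(Header) + n + 1));
            h->capacity = n;
            h->size = n;
            p_ = reinterpret_cast<char*>(h + 1);
            std::memcpy(p_, s, n + 1);  // copies the '\0' too
        }
        ~SimpleString() { std::free(reinterpret_cast<Header*>(p_) - 1); }
        SimpleString(const SimpleString&) = delete;
        SimpleString& operator=(const SimpleString&) = delete;

        const char* c_str() const { return p_; }  // literally "return p_;"
        std::size_t size() const {
            // read the metadata sitting just before the text, no branches
            return reinterpret_cast<const Header*>(p_)[-1].size;
        }
    };

sizeof(SimpleString) is sizeof(char*), and neither accessor needs a branch.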
Now to add the short-string optimization you need to do one of two things:
1. Add a small buffer inside the std::string itself that "p_" can point to (see the sketch after this list). However, strings are very common inside other structures, so this bloats objects throughout your program.
2. If you're clever you can use the lowest bit of the pointer to indicate that it's an interned string, which would let you store strings of up to 6 bytes inside the pointer itself on a 64-bit machine (remember you still need the '\0' for c_str()'s benefit). However, now c_str(), size(), etc. all need an extra branch instruction. These are normally inlined methods, so now you've bloated/slowed the code instead.
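Option 1 looks roughly like this as a sketch (the 16-byte buffer size and all names are mine):

    #include <cstddef>
    #include <cstring>

    // Option 1: carry a small in-object buffer and point p_ either at it or at
    // a heap block. c_str()/size() stay branch-free; the price is object size.
    class SsoString {
        char* p_;             // always valid: points at buf_ or at heap memory
        std::size_t size_;
        std::size_t capacity_;
        char buf_[16];        // the in-object buffer that bloats every instance

    public:
        explicit SsoString(const char* s) {
            size_ = std::strlen(s);
            if (size_ < sizeof(buf_)) {
                p_ = buf_;
                capacity_ = sizeof(buf_) - 1;
            } else {
                p_ = new char[size_ + 1];
                capacity_ = size_;
            }
            std::memcpy(p_, s, size_ + 1);
        }
        ~SsoString() { if (p_ != buf_) delete[] p_; }
        SsoString(const SsoString&) = delete;
        SsoString& operator=(const SsoString&) = delete;

        const char* c_str() const { return p_; }
        std::size_t size() const { return size_; }
    };

With a 16-byte buffer the object is around 40 bytes on a 64-bit machine instead of 8, which is exactly the bloat the first option warns about (real implementations usually overlap the buffer with the size/capacity fields to claw some of that back).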
Personally I prefer a keep-it-simple string implementation for most things.
2. You can actually go up to 7 bytes as long as you're sure you get word-aligned pointers back from malloc (usually a good bet). Make the tag the last byte, indicate a short string by "(tag_byte & 0x07) != 7", and then store the length as "7 - tag", reserving tag 7 for pointers. If it's a 7-byte string, then the tag byte itself will be 0, serving as the null terminator. If it's 6 bytes or fewer, the null terminator gets stored in one of the earlier bytes. If it's a pointer, the pointer will be word-aligned, which on a 64-bit machine means that you can zero out the low-order 3 bits (exactly what you need to store the tag) to extract the actual pointer. If it's a zero-length string (the one case we couldn't represent, since we use 0x07 as the tag for pointers), store it with a null first byte, which also gives compatibility with C.
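A sketch of that encoding -- all names are mine, and, as the replies below point out, the byte positions only line up like this on a big-endian machine, so that's what this assumes (plus an allocator that returns 8-byte-aligned pointers):

    #include <cstddef>
    #include <cstdint>
    #include <cstring>

    // Eight bytes hold either up to 7 chars + '\0' + tag, or a tagged pointer.
    struct TaggedRep {
        char bytes[8];
    };

    // bytes[7] is the tag: values 0..6 mean "short string of length 7 - tag",
    // 7 marks a heap pointer. On big-endian the pointer's low-order byte (the
    // alignment bits) is stored last, so the tag overlays exactly those bits.
    inline bool is_short(const TaggedRep& r) {
        return (static_cast<unsigned char>(r.bytes[7]) & 0x07) != 0x07;
    }

    inline std::size_t short_size(const TaggedRep& r) {
        return 7u - (static_cast<unsigned char>(r.bytes[7]) & 0x07);
    }

    inline const char* heap_chars(const TaggedRep& r) {
        std::uintptr_t bits;
        std::memcpy(&bits, r.bytes, 8);
        return reinterpret_cast<const char*>(bits & ~std::uintptr_t{7});  // drop the tag
    }

    inline void store_short(TaggedRep& r, const char* s, std::size_t n) {
        // Requires 1 <= n <= 7 (the zero-length case is handled separately, as
        // above). For n == 7 the tag byte, being 0, doubles as the terminator.
        std::memset(r.bytes, 0, 8);
        std::memcpy(r.bytes, s, n);
        r.bytes[7] = static_cast<char>(7 - n);
    }

    inline void store_pointer(TaggedRep& r, const char* aligned_heap_chars) {
        std::uintptr_t bits = reinterpret_cast<std::uintptr_t>(aligned_heap_chars) | 7;
        std::memcpy(r.bytes, &bits, 8);
    }

c_str() is just r.bytes in the short case and heap_chars(r) otherwise, which is where the extra branch shows up.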
Whether these gymnastics are worth it is debatable. My intuition is that you lose more in bit-twiddling instructions on common operations than you gain by being able to store an extra byte in short strings, but I'd want to benchmark on real data before implementing.
Remember that on a little-endian machine (i.e. nearly everything now) the low-order byte of the pointer -- the one whose alignment bits you want to borrow for the tag -- is actually the first byte of the string in memory. The only way you can use the 8th byte of the string as your sentinel is if you can be sure that the allocator won't ever give you a pointer whose most significant byte is 0x00 -- and on x86-64, user-space pointers pretty much always have a 0x00 top byte, so that's a non-starter there.
You are right, though, that on big-endian machines you can smuggle a 7th byte into the pointer by sharing the "tag" and the '\0' terminator. You don't really have to worry about the 0-byte case, since in a traditional implementation there is a shared empty-string sentinel that the default constructor uses. So if you are mutating a short string and the result is 0 bytes, you can always just replace it with a pointer to the shared sentinel.
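Something like this, as a sketch (names and layout are mine, laid out to match the "metadata before the text" scheme from earlier so c_str()/size() keep working unchanged):

    #include <cstddef>

    // One static empty-string representation shared by every empty string.
    // A real implementation must also guarantee it is never written through
    // or freed (e.g. by checking against it in the destructor).
    struct EmptyRep {
        std::size_t capacity;  // 0
        std::size_t size;      // 0
        char nul;              // the '\0' where the characters would start
    };

    static const EmptyRep kEmptyRep = {0, 0, '\0'};

    // What the default constructor -- or a mutation that shrinks a short
    // string to zero length -- points p_ at:
    inline const char* shared_empty_chars() { return &kEmptyRep.nul; }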
I agree with your intuition about the costs. I think your program would have to be pretty heavily dominated by tiny strings for all of this optimization to help much. My guess is that it would microbenchmark well. However, all of those extra branches would add pressure on the I-cache and branch-predictor history, which would offset the gains in the real world.
folly's fbstring actually has 3 separate regimes (interned tiny strings, classic normal strings, threadsafe-COW large strings), so I guess they decided that the extra branches were worth it for them. I still prefer a simpler design where c_str()/size() don't require any branches, though.
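For illustration only, a generic three-regime skeleton -- this is not folly's actual layout (I haven't checked their representation here), it just shows where the category branches end up; all names are mine and everything but the accessors is omitted:

    #include <atomic>
    #include <cstddef>

    class ThreeRegimeString {
        enum class Category : unsigned char { Small, Medium, Large };

        struct MediumRep {                // uniquely owned heap buffer
            char* data;
            std::size_t size;
        };
        struct LargeRep {                 // shared, refcounted buffer for COW
            std::atomic<std::size_t> refs;
            std::size_t size;
            char data[1];                 // characters follow the header
        };

        union {
            char small_[24];              // in-situ characters for tiny strings
            MediumRep medium_;
            LargeRep* large_;
        };
        Category category_;
        unsigned char small_size_;        // only meaningful in the Small regime

    public:
        ThreeRegimeString() : small_{}, category_(Category::Small), small_size_(0) {}

        // The cost of the design: c_str() and size() each branch on the tag.
        const char* c_str() const {
            switch (category_) {
                case Category::Small:  return small_;
                case Category::Medium: return medium_.data;
                default:               return large_->data;
            }
        }
        std::size_t size() const {
            switch (category_) {
                case Category::Small:  return small_size_;
                case Category::Medium: return medium_.size;
                default:               return large_->size;
            }
        }
        // Construction from text, mutation, and the actual COW machinery are
        // omitted; that's where most of the real complexity lives.
    };

Every accessor now switches on the category tag, which is exactly the branch-free property the simpler design keeps.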