Occasionally, I have an irresistible urge to strangle everyone who uses Unicode and UTF-8 (or other UTF encodings) interchangeably.
UNICODE is a good thing because it provides a codepoint for every character that we care about, instead of having a 256-character subset for every group of languages and needing complicated software to puzzle out how to convert from one subset to the other. Unicode allows fantastic stuff such as upper/lowercasing text including all the weird letters that you previously had to special-case.
ASCII used to be a good thing because it allowed people to ship around basic English and Cobol code without any worries, but is actually pretty evil because people from Anglo-Saxon countries assume that every other bit of text is composed of English and Cobol.
Having a notion of ENCODINGS is useful if you occasionally get bits of text that are neither English nor Cobol. You still needed different encodings for different groups of languages, and arcane mechanisms to provide hints on which encoding is meant, at least if you got non-English bits of text. The very notion of an Encoding scares the people who used to think the world consists of English and Cobol.
UTF-8 is a very reasonable encoding that can be used to represent all of Unicode while being Ascii-compatible. Hence it is a sane choice as a default encoding for people who are scared of having to think about encodings. Because UTF-8 is not the only encoding out there, Unicode-compatible programs accept Unicode text in many other encodings, including those that cannot represent the full range of Unicode and are only a good choice for some people but not others.
tl;dr: non-UTF-8 text can (and should) still be read as Unicode codepoints. Ignoring the >=40% of texts out there or saying that they're "not Unicode" doesn't help anybody.
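A minimal Python sketch of that tl;dr (the sample strings are my own): bytes in a legacy encoding decode to exactly the same Unicode codepoints as the UTF-8 form of the same text.

    latin1_bytes = "café".encode("latin-1")      # b'caf\xe9'
    sjis_bytes = "カフェ".encode("shift_jis")      # legacy Japanese encoding, not UTF-8
    print(latin1_bytes.decode("latin-1"))        # 'café'  -> codepoints U+0063 ... U+00E9
    print(sjis_bytes.decode("shift_jis"))        # 'カフェ' -> the same codepoints you'd get from UTF-8 bytes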
And to a programmer, what matters is whether the encoding is present and correct.
From a security perspective, if you don't include the right encoding metadata, an attacker can include an XSS attack in UTF-7. Your server reads normal (but gibberish) ASCII, escapes it properly, then schleps it to the client, who thinks it's UTF-7 (because of all the UTF-7 chars), and suddenly they are running malicious JavaScript that you didn't escape.
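To see why, here is a hedged Python illustration (the payload is hypothetical): the UTF-7 form of a script tag is pure 7-bit ASCII, so an escaper that only looks for "<" and ">" never fires, while a client that decodes the page as UTF-7 gets the tag back.

    payload = "<script>alert(1)</script>"
    utf7 = payload.encode("utf-7")
    print(utf7)                           # something like b'+ADw-script+AD4-alert(1)+ADw-/script+AD4-'
    print(all(b < 0x80 for b in utf7))    # True: an ASCII-only filter sees harmless gibberish
    print(utf7.decode("utf-7"))           # a UTF-7-aware consumer sees <script> again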
If you're scraping sites, there's one thing more annoying than trying to guess the encoding, and that's dealing with multiple encodings in the same document.
Putting UTF-8 characters in a document doesn't mean you are using "Unicode". It means you are using some undeclared encoding, which is not a step forwards.
Not sure if this "counts," but it is still used to encode IMAP folder names. (Technically, the IMAP version is slightly different from standard UTF-7.)
I'm confused by that as well. I never really understood the motivation behind it - though I guess it's obsolescent, so hopefully it'll soon be just another one of those amusing anachronisms we see occasionally in computer science.
As I understand it, the motivation was to be able to transmit Unicode text over a channel that's only 7-bit safe (e.g. mail protocols) without having to do something silly like Base64-encoding the whole thing.
Are you saying that we should still use character encodings rather than UTF encodings, or are you saying that we shouldn't assume that raw text is ASCII, or are you saying something else?
Unicode is, essentially, nothing unless it is encoded. When you encode it you must decide whether to use one byte, two bytes, three bytes or four bytes.
Only UTF-8 and UTF-32 are really big enough to hold the world's characters. Everything else is a fudge.
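A quick Python check of the "you must decide" point, with a sample string of my own; the codepoints are identical, only the byte counts change (UTF-16 and UTF-32 also prepend a BOM here).

    s = "naïve café 日本語 💩"
    for enc in ("utf-8", "utf-16", "utf-32"):
        print(enc, len(s.encode(enc)))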
ASCII was never Anglo-Saxon: it was always American. COBOL is a red herring here, too. ASCII was all about teleprinters and was a clever use of 7 bits for its time. Bell Labs invented ASCII and Bell Labs invented UTF-8.
UTF-8 is much better than reasonable. It is a compact way to represent Unicode while preventing the western world from having to re-encode every text document. That's a lot of good news for the internet.
Are you saying that we shouldn't ignore other charsets because they are still valid Unicode? If so I agree up to a point, the point being that there is no longer any need for any Unicode encoding other than UTF-8. If you need to access your local characters as a byte array, choose your internal encoding and translate, do your magic, and then spit out UTF-8 again. Then we can all simply read the same documents without the need for over-complexity.
It makes sense to me. His point is that calling UTF-8 "Unicode" is wrong; it just confuses people. Windows programmers usually also call UTF-16 "Unicode", which is also wrong.
Unicode is not a text encoding, it's a standard that assigns a number to every character. Saying that a particular text is "Unicode" doesn't give any info on how to decode it; we should just say that a given text is UTF-7, 8, 16, 32, etc.
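In Python terms (the Euro sign is just an illustrative character): the codepoint is the number, the UTF is the byte layout.

    ch = "€"
    print(hex(ord(ch)))             # 0x20ac -- the Unicode codepoint U+20AC, no bytes implied
    print(ch.encode("utf-8"))       # b'\xe2\x82\xac'
    print(ch.encode("utf-16-be"))   # b' \xac'         (0x20 0xAC)
    print(ch.encode("utf-32-be"))   # b'\x00\x00 \xac' (0x00 0x00 0x20 0xAC)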
I'm guilty of using UTF-16 when I really mean 16-bit Unicode arrays. I see that now.
But it does raise the question: why would you use UTF-16? Yes, if you have more than 512 characters in your common script then OK, I can see it might make sense, but not much. UTF-8 will still average out in a reasonable way.
I wonder if we can now start to abandon non-UTF-8 charsets? Ignoring Windows, which is wilfully incompatible and broken, UTF-8 is used pretty much everywhere that matters, and I would argue that those who don't use it should. If there is something UTF-8 or Unicode can't do, let's fix that.
I wouldn't recommend dropping other Unicode encodings. I assume e.g. UTF-16 is more popular in non-ASCII-based locales since it never requires more than 2 bytes for characters in the BMP (basic multilingual plane). I guess you could pretend the 30% of the web that's in ASCII is UTF-8, but then you'd still be missing all the Chinese and Japanese sites that aren't using Unicode yet. That's almost 10% of the web as a whole and obviously if you're targeting those markets it would be a lot higher.
There are very few rational reasons to be using UTF-16 instead of UTF-8, and I would be so bold as to claim that the two aren't even close in popularity for non-ASCII texts; UTF-8 is clearly well above. Regarding memory consumption, Latin-script texts also predominantly use ASCII characters, with only the occasional non-ASCII glyph (cedillas, accents, tremas and other such characters). The difference between UTF-8 and UTF-16 is that the former uses 1 byte to represent characters in the ASCII range and 2 bytes for the occasional â, whereas the latter always uses 2 bytes, which would explain UTF-8's popularity in Latin-based alphabets. Also, because having ordinary characters whose code points are padded with zeros is just wasteful and inefficient, UTF-16 and its brethren UTF-32 introduced funky workarounds such as Byte Order Marks. Why would anyone want to deal with this stuff when writing an internationalized program?
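A back-of-the-envelope check of that claim in Python (the French sample sentence is mine):

    s = "Le cœur a ses raisons que la raison ne connaît point."
    print("utf-8 :", len(s.encode("utf-8")))      # roughly 1 byte per character, 2 for œ and î
    print("utf-16:", len(s.encode("utf-16-le")))  # 2 bytes per character across the board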
It doesn't really matter, since most document formats already use compression. No matter what encoding you use, the amount of entropy is the same, so UTF-8 usually compresses proportionally better than 2-byte encodings for CJK languages. This nearly compensates for the increased raw size.
Example: The Korean text of the Universal Declaration of Human Rights [1] is 8.1KB in EUC-KR and 11.2KB in UTF-8. When compressed with bzip2, it's only 3.1KB and 3.2KB, respectively. I assume Japanese would behave similarly.
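Anyone can re-run that experiment with a few lines of Python; korean_text below is a placeholder for any sizeable Korean document, not the actual UDHR text.

    import bz2

    korean_text = "..."   # placeholder: paste a large Korean document here
    for enc in ("euc-kr", "utf-8"):
        raw = korean_text.encode(enc)
        print(enc, "raw:", len(raw), "bzip2:", len(bz2.compress(raw)))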
Very astute point, and I would grant that it could be a consideration when writing a program that will never output content on the web. I personally believe that an extra byte per character is a rather small price to pay for all the good of UTF-8, and I wouldn't even give UTF-16 a second look.
Now, consider that if you decide to use utf-16 to publish your Japanese text as an HTML document (whose syntax is entirely ascii based), you might lose its "saving advantage" and could easily end up with text that is actually larger in utf-16 than in utf-8.
I'm curious whether there is more information per byte in UTF-8 hanzi / kanji compared to English, considering each character is about 3 bytes but can represent an entire word. A potential human-readable compression technique! ;)
The web consists of a lot of markup and javascript. That's all in ASCII compatible characters. So on the web UTF-8 usually wins even for Japanese or Chinese texts. For example take a random page of the Japanese Wikipedia and encode it in UTF-8 and in UTF-16. In all my tries UTF-8 was smaller.
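A rough sketch of that experiment in Python, assuming the page is served as UTF-8 (Wikipedia's are) and that the Special:Random page resolves as usual:

    from urllib.request import urlopen

    html = urlopen("https://ja.wikipedia.org/wiki/Special:Random").read().decode("utf-8")
    print("utf-8 :", len(html.encode("utf-8")), "bytes")
    print("utf-16:", len(html.encode("utf-16-le")), "bytes")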
What the Japanese use isn't even UTF-16, it's more often something like EUC-JP which predates Unicode and uses different encodings. Same with EUC-KR in Korea and several different Chinese encodings.
You might want to use UCS-4 as an internal format, or some other convenient format (for example ASCII if you know everything is 7/8 bit). But UTF-8 should still be the only thing seen externally.
Learned a fantastic new word from this article, thanks!
Mojibake (文字化け) (IPA: [modʑibake]; lit. "unintelligible sequence of characters"), from the Japanese 文字 (moji) "character" + 化け (bake) "change", is the occurrence of incorrect, unreadable characters shown when software fails to render text correctly according to its associated character encoding.
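It's easy to manufacture your own mojibake in Python: encode in UTF-8, then decode as the wrong charset.

    s = "文字化け"
    garbled = s.encode("utf-8").decode("latin-1")   # latin-1 never fails, it just lies
    print(garbled)                                  # an unreadable "æ..." soup instead of 文字化け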
That is an incredible increase in a short amount of time, especially to pass 60% with no sign yet of even a decreasing slope. The post offers no explanation for the sudden surge, and I am too ignorant on this topic to speculate. Was there a sudden change in the defaults of operating systems / languages / etc? (Without details that's the only thing that seems generally plausible to me)
Most of the rise started when blogging took off. A few hundred million WordPress and Blogger blogs would do it. The non-UTF-8 slice of the pie got smaller.
It's good to know that it's on the increase. Still, character encoding seems to be something that is understood by very few people... I wonder how many of these sites are just reporting UTF-8 or something because their web server defaults to it, and not actually encoding special characters properly?
It's certainly very easy to do. I had the problem a while back setting up a small web app where the web server and MySQL database were both UTF-8, but the DB connection was defaulting to ISO-8859-1 or something, causing all sorts of issues with curly quotes etc.
> I wonder how many of these sites are just reporting UTF-8 or something because their web server defaults to it, and not actually encoding special characters properly?
I think they are measuring the encoding that is picked before adding a page to the index. Which would be a combination of explicit information (headers, xml and html metadata) and heuristics (which may fail, but are useful when the explicit information is missing or — even though ignoring explicit metadata is bad — obviously incorrect).
It also seems they are looking for a subset encoding once they have the metadata; the posts describe explicitly labelling ASCII when the contents are within that subset.
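A rough sketch of that kind of pipeline in Python; the function name is mine, `declared` stands for whatever the headers or meta tags said, and chardet is one common heuristic fallback (a real crawler would also sanity-check declared values it distrusts).

    import chardet  # third-party: pip install chardet

    def pick_encoding(raw, declared=None):
        try:
            raw.decode("ascii")
            return "ascii"              # label the subset when the bytes fit it
        except UnicodeDecodeError:
            pass
        if declared:
            return declared             # trust explicit metadata first
        guess = chardet.detect(raw)     # heuristic guess, may be wrong
        return guess["encoding"] or "utf-8"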
The great thing is that one now only has to explain to people to use UTF-8 in all cases, which is preferable to having to fool around with a lot of language-specific encodings.
And it helps that almost all editors, these days, also default to UTF-8 encoding. So you can just copy/paste special characters and it works...
MySQL's character encoding is pretty hilarious. I love how its "latin1" is nothing of the sort (it's actually cp1252 with eight characters shuffled around).
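The confusion is easy to demonstrate in Python: byte 0x80 is the Euro sign in cp1252 but an invisible control character in real ISO-8859-1 (MySQL's "latin1" is close to cp1252 but, as noted above, not identical).

    print(b"\x80".decode("cp1252"))    # '€'
    print(b"\x80".decode("latin-1"))   # '\x80', a C1 control character that renders as nothing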
I have just been writing a lot of C code to support Unicode in a new project. It reads UTF-8, converts to UTF-32, does lots of clever stuff, and spits out UTF-8 once again.
Here's my take: UTF-8 should, rightly, be the only interchange text format of choice for the right-minded individual. The other UTFs should be internal formats used in memory or in an on-disk cache / database / etc.
Why? Simple: UTF-8 rocks!
It's a beautifully designed, backwardly compatible, nifty piece of back-of-a-napkin genius. Simple to code and decode (once you understand it), simple to check with a regular expression (once you realise it is essentially just a token with a set number of chars), and simple to add the wealth of the world's characters to your app with a reasonable amount of code. Plus, it's compact in the way Huffman coding is compact (at least from a western perspective).
Also, no-one should ever be using (char *) in the 21st century unless it is to temporarily hold a UTF-8 string before converting to wchar.
Why? Because you almost certainly don't have a (char *), you have a UTF-8 sequence (^^^see above). Sadly this makes your memory-mapped files slightly redundant. But don't fret, this is the future: convert them to 32 bits and release them. Be happy that you can now treat any character sequence, in common usage, in the whole of humanity like an array.
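In Python the same round trip is just decode, index codepoints, re-encode; a tiny sketch with a sample string of my own:

    raw = "naïve 日本語".encode("utf-8")   # what you actually receive: a UTF-8 byte sequence
    text = raw.decode("utf-8")            # now a sequence of codepoints you can index like an array
    print(len(raw), len(text))            # byte count and character count differ
    print(raw[2], text[2])                # raw[2] is half of 'ï'; text[2] is the whole character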
As for UTF-16, why bother? It's neither compact, clever, nor big enough to hold every character on the internet. 💩 needs more than 16 bits and everyone, now, needs to support a poop with eyes.
tl;dr: Share UTF-8 promiscuously, keep UTF-32 for private moments. Don't dally with UTF-16, she's an old tease and can't handle poop.
UTF-16 will handle poop just fine. UTF-16 handles higher characters in a manner just like UTF-8. Pile-of-poo is encoded as D83D DCA9 in UTF-16, the same size as it is in UTF-8.
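Checking that arithmetic in Python:

    poop = "\U0001F4A9"
    print(poop.encode("utf-16-be").hex())   # 'd83ddca9' -- the surrogate pair
    print(poop.encode("utf-8").hex())       # 'f09f92a9'
    print(len(poop.encode("utf-16-be")), len(poop.encode("utf-8")))   # 4 4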
There may be a distinct size advantage for some Asian cultures in using UTF-16 instead of UTF-8, as it allows more of the glyphs to be encoded without adding more overhead bits. How much this saves in reality I'm not sure.
I wouldn't use UTF-16 unless I had to work with a legacy system. A lot of software claiming to handle UTF-16 is broken and really only works with UCS-2. You have to worry about endianness and so on.
If you develop an application for the international market you should probably go with UTF-8 as well. Maybe if you develop only for the Asian market it's a bit different. But in my experience significant amounts of text usually come in some form of data or markup format (HTML, XML, JSON, etc.) and usually those markup formats are defined in the ASCII subset. So UTF-8 still wins. Just take a random page from the Japanese Wikipedia and encode it in UTF-8 and in UTF-16. You'll see that UTF-8 almost always wins.
The advantage of moving to one standard, plus the backwards compatibility for most of the internet will outweigh any small size advantage that UTF-16 will have on real file sizes for some countries / cultures.
A saving of ⅓ on a text file will barely be noticed in a world seemingly governed by Moore's law in nearly every future metric.
(also my bad for using UTF-16 where I meant 16-bit unicode character arrays :))
Can you find a website that uses a non-UTF-8 Unicode encoding? (Actually now that I think about it, UTF-16 probably makes sense for non-Latin languages and might well be common. Does anyone have any insights here?)
ISO-8859-n charsets were pretty common in Europe, because they were the default on Windows. I just checked the web pages of two major Danish newspapers: they use ISO-8859-1. UTF-8 is getting more widespread, though.
"Commonly used character encodings on the Web include ISO-8859-1 (also referred to as "Latin-1"; usable for most Western European languages), ISO-8859-5 ..."
All that's saying is that software developers have discovered that there are alphabets other than the Roman one, and people use them. But all it takes to be UTF-8 is to have your web server declare the charset in the Content-Type header or add a meta charset declaration to your page. If you stick to plain Roman characters (the 7-bit ASCII range), UTF-8 and the extended-ASCII charsets are byte-for-byte identical.
What I really want to know is the breakdown by alphabet. How many sites are in Cyrillic? Kanji? Arabic? Thai?
As a Finnish user, I have found that using UTF-8 as the default charset (in the browser) is not a practical option. There are so many sites out there using ISO-8859-1 which do not inform the browser about their charset choice at all. So when I'm using UTF-8 as the default, I get pages with mangled Scandinavian characters (öäåÖÄÅ) all the time.
For a moment, I parsed the headline as "You need to scan 60% of the web to find an instance of each and every Unicode glyph (being used organically)..."
Actually, outside of listings of all Unicode characters/code pages, I'm 98% sure there is a vast number of Unicode characters that are not used even once (organically) on the entire Internet.
I'd bet even money you can craft a two-character 'word' such that you would be the top Google result for it if you use it just once or twice in any context on any page Google indexes, just because you're the first person to use those characters organically, to say nothing of together. /s
> As you can see, Unicode has experienced an 800 percent increase in “market share” since 2006.
> The more documents that are in Unicode, the less likely you will see mangled characters (what Japanese call mojibake) when you’re surfing the web.
Hmm. Well, I know that's true in theory, but my personal experience is that the occurrence of mojibake has increased lately. I even wrote about it here on HN: http://news.ycombinator.com/item?id=2075010
I think a lot of new code tends to just assume UTF-8, and if you try and use something else that's your problem (I know I do this). Older code had to be able to detect encodings.
Actually, the whole web is Unicode today, AFAICT. Unicode is defined as the character repertoire of HTML and XML. In olden days we had different charsets (like ASCII, ISO-8859-n and so on), but these have simply been redefined in an HTML/XML context to be character encodings. So ASCII, ISO-8859-n etc. are considered Unicode encodings which happen to be only able to represent a subset of the full Unicode character repertoire.
HTML is defined to use Unicode as the document character set. But the characters can be represented as byte streams using different encodings, UTF-8 being one encoding, ISO-8859-1 being another.
> The "charset" parameter identifies a character encoding, which is a method of converting a sequence of bytes into a sequence of characters.
A lot of people seem to confuse Unicode with the UTF-encodings.
That document is badly-written; a more reasonable way to interpret it is to conclude that user agents will use some form of Unicode internally, after converting whatever character encoding the document they received used. Which is, indeed, a very reasonable way to design your software, but it doesn't make Latin-1 (for example) a Unicode encoding by any reasonable standard.