Unicode over 60 percent of the web (googleblog.blogspot.com)
107 points by robin_reala on Feb 6, 2012 | 72 comments


Occasionally, I have an irresistible urge to strangle everyone who uses Unicode and UTF-8 (or other UTF encodings) interchangeably.

UNICODE is a good thing because it provides a codepoint for every character that we care about, instead of having a 256-character subset for every group of languages and needing complicated software to puzzle out how to convert from one subset to another. Unicode allows fantastic stuff such as upper/lowercasing text including all the weird letters that you previously had to special-case.

ASCII used to be a good thing because it allowed people to ship around basic English and Cobol code without any worries, but it is actually pretty evil because people from Anglo-Saxon countries assume that every other bit of text is composed of English and Cobol.

Having a notion of ENCODINGS is useful if you occasionally get bits of text that are neither English nor Cobol. You still needed different encodings for different groups of languages, and arcane mechanisms to hint at which encoding was meant, at least if you got non-English bits of text. The very notion of an encoding scares the people who used to think the world consists of English and Cobol.

UTF-8 is a very reasonable encoding that can be used to represent all of Unicode while being Ascii-compatible. Hence it is a sane choice as a default encoding for people who are scared of having to think about encodings. Because UTF-8 is not the only encoding out there, Unicode-compatible programs accept Unicode text in many other encodings, including those that cannot represent the full range of Unicode and are only a good choice for some people but not others.

tl;dr: non-UTF-8 text can (and should) still be read as unicode codepoints. Ignoring the >=40% of texts out there or saying that they're "not Unicode" doesn't help anybody.
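
A minimal Python 3 sketch of that point (the example string is mine, not the poster's): the same text stored in two different encodings decodes to exactly the same Unicode code points once you know which encoding it is.

    text = "Grüße"                             # German greeting with non-ASCII letters
    latin1_bytes = text.encode("iso-8859-1")   # 5 bytes, one per character
    utf8_bytes = text.encode("utf-8")          # 7 bytes, ü and ß take two each
    assert latin1_bytes != utf8_bytes          # different bytes on the wire...
    assert latin1_bytes.decode("iso-8859-1") == utf8_bytes.decode("utf-8")
    # ...but identical code points after decoding, i.e. both are "Unicode" text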


And to a programmer, what matters is whether the encoding is present and correct.

From a security perspective, if you don't include the right encoding metadata, an attacker can include an XSS attack in UTF-7. Your server reads normal (but gibberish) ASCII, escapes it properly, then schleps it to the client, who thinks it's UTF-7 (because of all the UTF-7 chars), and suddenly they are running malicious JavaScript that you didn't escape.
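
For the curious, a small illustrative Python 3 sketch of that trick (the payload is the classic textbook example, not anything from this thread): bytes that look like harmless ASCII punctuation turn into markup when interpreted as UTF-7.

    payload = b"+ADw-script+AD4-alert(1)+ADw-/script+AD4-"
    print(payload.decode("ascii"))   # gibberish with no < or >, so naive escaping ignores it
    print(payload.decode("utf-7"))   # <script>alert(1)</script>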

If you're scraping sites, the one thing more annoying than trying to guess the encoding is dealing with multiple encodings in the same document.

Putting UTF-8 characters in a document doesn't mean you are using "Unicode". It means you are using some undeclared encoding, which is not a step forwards.


Re: UTF-7, is anyone in the world actually using that? I've only ever read about it in articles about security problems.


Certainly not on the web, it’s disallowed in the HTML5 spec:

User agents must not support the CESU-8, UTF-7, BOCU-1 and SCSU encodings.

and major browsers removed support, e.g.:

https://bugzilla.mozilla.org/show_bug.cgi?id=414064


Not sure if this "counts," but it is still used to encode IMAP folder names. (Technically, the IMAP version is slightly different from standard UTF-7.)


I'm confused by that as well. I never really understood the motivation behind it - though I guess it's obsolescent, so hopefully it'll soon be just another one of those amusing anachronisms we see occasionally in computer science.


As I understand it, the motivation was to be able to transmit Unicode text over a channel that's only 7-bit safe (e.g. mail protocols) without having to do something silly like Base64-encoding the whole thing.
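
A rough Python 3 sketch of that trade-off, using the standard library's utf-7 codec and base64 module (the sample string is arbitrary):

    import base64

    s = "Grüße aus München"                          # mostly ASCII with a few umlauts
    print(len(s.encode("utf-7")))                    # ASCII passes through; only the umlauts expand
    print(len(base64.b64encode(s.encode("utf-8"))))  # the whole string grows by ~4/3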


This post makes no sense to me.

Are you saying that we should still use character encodings rather than UTF encodings, or are you saying that we shouldn't assume that raw text is ASCII, or are you saying something else?

Unicode is, essentially, nothing unless it is encoded. When you encode it you must decide whether to use one byte, two bytes, three bytes or four bytes.

Only UTF-8 and UTF-32 are really big enough to hold the world's characters. Everything else is a fudge.

ASCII was never Anglo-Saxon: it was always American. COBOL is a red herring here, too. ASCII was all about teleprinters and was a clever use of 7 bits for its time. Bell Labs invented ASCII and Bell Labs invented UTF-8.

UTF-8 is much better than reasonable. It is a compact way to represent Unicode while preventing the western world from having to re-encode every text document. That's a lot of good news for the internet.

Are you saying that we shouldn't ignore other charsets, since they are still valid ways of encoding Unicode? If so I agree up to a point, the point being that there is no longer any need for any Unicode encoding other than UTF-8. If you need to access your local characters as a byte array, choose your internal encoding and translate, do your magic, and then spit out UTF-8 again; that way we can all simply read the same documents without the need for over-complexity.


It makes sense to me. His point is that calling UTF-8 "Unicode" is wrong; it just confuses people. Windows programmers usually also call UTF-16 "Unicode", which is also wrong.

Unicode is not a text encoding; it's a standard that assigns a number to every character. Saying that a particular text is "Unicode" doesn't give any info on how to decode it; we should just say that a given text is UTF-7, -8, -16, -32, etc.


> Only UTF-8 and UTF-32 are really big enough to hold the world's characters

So is GB18030.

Edit: and UTF-16, of course (just don't confuse it with UCS-2)


I'm guilty of using UTF-16 when I really mean 16-bit Unicode arrays. I see that now.

But it does raise the question: why would you use UTF-16? Yes, if you have more than 512 characters in your common script then OK, I can see it might make sense, but not much. UTF-8 will still average out in a reasonable way.


Because it's the standard in the Windows API.


I wonder if we can now start to abandon non-UTF-8 charsets? Ignoring Windows, which is wilfully incompatible and broken, UTF-8 is used pretty much everywhere that matters, and I would argue that those who don't use it should. If there is something UTF-8 or Unicode can't do, let's fix that.


I wouldn't recommend dropping other Unicode encodings. I assume e.g. UTF-16 is more popular in non-ASCII-based locales since it never requires more than 2 bytes for characters in the BMP (basic multilingual plane). I guess you could pretend the 30% of the web that's in ASCII is UTF-8, but then you'd still be missing all the Chinese and Japanese sites that aren't using Unicode yet. That's almost 10% of the web as a whole and obviously if you're targeting those markets it would be a lot higher.


There are very few rational reasons to be using UTF-16 instead of UTF-8, and I would be so bold as to claim that the two aren't even close in popularity for non-ASCII texts; UTF-8 is clearly well above. Regarding memory consumption, Latin-alphabet texts primarily use ASCII characters, with only the occasional non-ASCII glyph (cedillas, accents, tremas and other such characters). The difference between UTF-8 and UTF-16 is that the former uses 1 byte to represent characters in the ASCII range and 2 bytes for the occasional â, whereas the latter always uses at least 2 bytes, which would explain UTF-8's popularity for Latin-based alphabets. Also, ordinary characters whose code points are padded with zeros are just wasteful and inefficient, and because UTF-16 and its brother UTF-32 use multi-byte code units they drag in extra machinery such as Byte Order Marks. Why would anyone wanna deal with this stuff when writing an internationalized program?
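
A quick Python 3 illustration of that size difference for mostly-ASCII Latin-alphabet text (the sample sentence is mine):

    s = "Ça a commencé comme ça."      # French: mostly ASCII, a few accented letters
    print(len(s.encode("utf-8")))      # 1 byte per ASCII char, 2 for Ç/é/ç
    print(len(s.encode("utf-16-le")))  # 2 bytes for every character
    print(len(s.encode("utf-16")))     # 2 bytes more for the byte order mark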


Tell that to the Japanese, whose glyphs are always 3+ bytes in UTF-8!


It doesn't really matter since most document formats already use compression. No matter what encoding you use, the amount of entropy is the same. So UTF-8 usually compresses better than 2-byte encodings for CJK languages. This nearly compensates for the increased size.

Example: The Korean text of the Universal Declaration of Human Rights [1] is 8.1KB in EUC-KR and 11.2KB in UTF-8. When compressed with bzip2, it's only 3.1KB and 3.2KB, respectively. I assume Japanese would behave similarly.

[1] http://www.ohchr.org/EN/UDHR/Pages/Language.aspx?LangID=kkn
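
The same experiment can be sketched in a few lines of Python 3 (with a short repeated Korean sample rather than the full UDHR text, so the exact numbers will differ):

    import bz2

    s = "안녕하세요. 오늘은 날씨가 좋습니다. " * 50
    euc = s.encode("euc-kr")   # 2 bytes per Hangul syllable
    utf = s.encode("utf-8")    # 3 bytes per Hangul syllable
    print(len(euc), len(utf))
    print(len(bz2.compress(euc)), len(bz2.compress(utf)))   # compressed sizes end up close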


Very astute point and I would grant that it could be a consideration when writing a program for which you don't intend to ever output content on the web. I personally believe that an extra byte per character is a rather small price to pay for all the good of utf-8 and I would not even give a second look to utf-16.

Now, consider that if you decide to use utf-16 to publish your Japanese text as an HTML document (whose syntax is entirely ascii based), you might lose its "saving advantage" and could easily end up with text that is actually larger in utf-16 than in utf-8.

Also take a look at this: http://programmers.stackexchange.com/questions/102205/should...


I'm curious whether there is more information per byte in UTF-8 hanzi / kanji compared to English, considering each character is about 3 bytes but can represent an entire word. A potential human-readable compression technique! ;)


The web consists of a lot of markup and JavaScript, and that's all ASCII-compatible characters. So on the web UTF-8 usually wins even for Japanese or Chinese texts. For example, take a random page of the Japanese Wikipedia and encode it in UTF-8 and in UTF-16. In all my tries UTF-8 was smaller.
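
A toy Python 3 version of that experiment (the HTML snippet is made up, not a real Wikipedia page):

    html = '<p class="note"><a href="/wiki/日本語">日本語</a>のページです。</p>'
    print(len(html.encode("utf-8")))      # markup is 1 byte/char, kana/kanji are 3
    print(len(html.encode("utf-16-le")))  # everything is 2 bytes (4 for astral characters)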


What the Japanese use often isn't even UTF-16; it's more often something like EUC-JP, which predates Unicode entirely. Same with EUC-KR in Korea and several different Chinese encodings.


Afaik UTF-32 is relatively convenient as it is fixed-width, and thus many operations are faster to perform than in UTF-8.


You still have to worry about combining characters.


Exactly. One UTF-32 character (code-point) is not one displayed character (glyph), and one UTF-32 character is not one character to search.

So basically it's not fixed width in any meaningful way.
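
Concretely (Python 3; unicodedata is in the standard library):

    import unicodedata

    s = "e\u0301"                                  # 'e' + COMBINING ACUTE ACCENT
    print(s, len(s))                               # displays as é, but len() == 2 code points
    print(len(unicodedata.normalize("NFC", s)))    # 1 after composing
    print(s == "\u00e9")                           # False: same glyph, different code point sequences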


I largely share sqrt17's grief.

There is no such thing as "UTF-32 character".

Abstract character is not code point.

UTF-32 _is_ fixed width because it's defined on code points, not glyphs, not characters.


Fixed width in what way? It's not fixed width for display or search, so what's fixed about it?


You might want to use UCS-4 as an internal format, or some other convenient format (for example ASCII if you know everything is 7/8 bit). But UTF-8 should still be the only thing seen externally.


Learned a fantastic new word from this article, thanks!

Mojibake (文字化け) (IPA: [modʑibake]; lit. "unintelligible sequence of characters"), from the Japanese 文字 (moji) "character" + 化け (bake) "change", is the occurrence of incorrect, unreadable characters shown when software fails to render text correctly according to its associated character encoding.

http://en.wikipedia.org/wiki/Mojibake


The best part of that article: handwritten mojibake https://en.wikipedia.org/wiki/File:Letter_to_Russia_with_kro...


I’m amazed that the postal services decoded that!


Also крякозябры (the Russian word for the same thing).


Obligatory reference to Unicode and encoding - http://www.joelonsoftware.com/articles/Unicode.html


That is an incredible increase in a short amount of time, especially to pass 60% with no sign yet of even a decreasing slope. The post offers no explanation for the sudden surge, and I am too ignorant on this topic to speculate. Was there a sudden change in the defaults of operating systems / languages / etc? (Without details that's the only thing that seems generally plausible to me)


Most of the rise started when blogging took off. A few hundred million WordPress and Blogger blogs would do it. The non-UTF-8 slice of the pie got smaller.


It's good to know that it's on the increase. Still, character encoding seems to be something that is understood by very few people... I wonder how many of these sites are just reporting UTF-8 or something because their web server defaults to it, and not actually encoding special characters properly?

It's certainly very easy to do - I had the problem a little while back setting up a little web app where the web server and MySQL database were both UTF-8, but the db connection was defaulting to ISO-8859-1 or something, causing all sorts of issues with curly quotes etc.
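
A hedged sketch of that kind of fix, assuming a Python app talking to MySQL through the PyMySQL driver (parameter names are PyMySQL's; other drivers spell this differently):

    import pymysql

    conn = pymysql.connect(
        host="localhost",
        user="app",
        password="secret",
        database="blog",
        charset="utf8mb4",   # make the *connection* UTF-8 as well,
    )                        # instead of whatever the driver defaults to
    # The SQL-level equivalent is issuing "SET NAMES utf8mb4" after connecting.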


> I wonder how many of these sites are just reporting UTF-8 or something because their web server defaults to it, and not actually encoding special characters properly?

I think they are measuring the encoding that is picked before adding a page to the index. Which would be a combination of explicit information (headers, xml and html metadata) and heuristics (which may fail, but are useful when the explicit information is missing or — even though ignoring explicit metadata is bad — obviously incorrect).

It also seems they are looking for a subset encoding once they have the metadata; the posts describe explicitly labelling ASCII when the contents are within that subset.
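
For the heuristic part, a sketch of what such sniffing looks like, assuming the third-party chardet library (the file name is a placeholder):

    import chardet

    raw = open("page.html", "rb").read()            # some fetched page with no declared charset
    guess = chardet.detect(raw)                     # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73, ...}
    print(guess["encoding"], guess["confidence"])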

But their methodology has changed from the previous two posts: latin above ascii in 2008 didn't exist in http://googleblog.blogspot.com/2010/01/unicode-nearing-50-of... or http://googleblog.blogspot.com/2008/05/moving-to-unicode-51.... .


Yes, I wondered this. I would love to see figures for incorrectly labelled pages on that chart.


The great thing is that one now only has to explain to people to use UTF-8 in all cases, which is preferable to having to fool around with a lot of language-specific encodings.

And it helps that almost all editors, these days, also default to UTF-8 encoding. So you can just copy/paste special characters and it works...


Java is among the last holdouts for Latin-1 encoding - you still have to jump through hoops with Spring MVC to properly send UTF-8-encoded responses:

http://stackoverflow.com/questions/3616359

Also, the encoding for .properties files is specified as Latin-1:

http://en.wikipedia.org/wiki/.properties


Mysql's character encoding is pretty hilarious. I love how its "latin1" is nothing of the sort (it's actually cp1252 with eight characters shuffled around).


All browsers support UTF-8, and all web pages should be serving UTF-8: http://hsivonen.iki.fi/accept-charset/


I have just been writing a lot of C code to support Unicode in a new project. It reads UTF-8, converts to UTF-32, does lots of clever stuff and spits UTF-8 out once again.
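
Not the poster's C code, but the same pipeline sketched in Python 3, where str is already a sequence of code points so the "UTF-32" stage is implicit (file names are placeholders):

    raw = open("input.txt", "rb").read()
    text = raw.decode("utf-8")          # UTF-8 bytes -> code points
    text = text.upper()                 # stand-in for the clever stuff, done on code points
    with open("output.txt", "wb") as f:
        f.write(text.encode("utf-8"))   # code points -> UTF-8 again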

Here's my take: UTF-8 should, rightly, be the only interchange text format of choice for the right-minded individual. The other UTFs should be internal formats used in memory or in an on-disk cache / database / etc.

Why? Simple: UTF-8 rocks!

It's a beautifully designed, backwardly compatible, nifty piece of back-of-a-napkin genius. Simple to code and decode (once you understand it), simple to check with a regular expression (once you realise it is essentially just a token with a set number of chars), and simple to add the wealth of the world's characters to your app with a reasonable amount of code. Plus, it's compact in the way Huffman coding is compact (at least from a western perspective).
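
To back up the "simple to decode" claim, a minimal decoder sketch in Python 3 (illustrative only: it skips the overlong-form, surrogate and range checks a production decoder needs):

    def decode_one(buf, i):
        """Decode one code point from UTF-8 bytes buf at index i; return (codepoint, next_index)."""
        b = buf[i]
        if b < 0x80:                       # 0xxxxxxx - plain ASCII
            return b, i + 1
        elif b >> 5 == 0b110:              # 110xxxxx + 1 continuation byte
            extra, cp = 1, b & 0x1F
        elif b >> 4 == 0b1110:             # 1110xxxx + 2 continuation bytes
            extra, cp = 2, b & 0x0F
        elif b >> 3 == 0b11110:            # 11110xxx + 3 continuation bytes
            extra, cp = 3, b & 0x07
        else:
            raise ValueError("invalid leading byte")
        if len(buf) < i + 1 + extra:
            raise ValueError("truncated sequence")
        for c in buf[i + 1:i + 1 + extra]:
            if c >> 6 != 0b10:             # continuation bytes are 10xxxxxx
                raise ValueError("expected continuation byte")
            cp = (cp << 6) | (c & 0x3F)
        return cp, i + 1 + extra

    assert decode_one("€".encode("utf-8"), 0) == (0x20AC, 3)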

Also, no-one should ever be using (char ❄) in the 21st century unless it is to temporarily hold a UTF-8 string before converting to wchar.

Why? Because you almost certainly don't have a (char ❄), you have a UTF-8 sequence (^^^see above). Sadly this makes your memory mapped files slightly redundant. But don't fret, this is the future: convert them to 32-bits and release them. Be happy that you can now treat any character sequence, in common usage, in the whole of humanity like an array.

As for UTF-16, why bother? It's neither compact, clever, nor big enough to hold every character on the internet. 💩 needs more than 16 bits and everyone, now, needs to support a poop with eyes.

tl;dr: Share UTF-8 promiscuously, keep UTF-32 for private moments. Don't dally with UTF-16, she's an old tease and can't handle poop.

note: ❄ = asterisk :)


UTF-16 will handle poop just fine. UTF-16 handles higher characters in a manner just like UTF-8. Pile-of-poo is encoded as D83D DCA9 in UTF-16, the same size as it is in UTF-8.

There may be a distinct size advantage for some Asian cultures to using UTF-16 instead of UTF-8, as it will allow for encoding more of the glyphs without having to add more overhead bits. How much this saves in reality I'm not sure.
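
A quick Python 3 check of both encodings of U+1F4A9:

    poo = "\U0001F4A9"
    print(poo.encode("utf-16-be").hex())   # d83ddca9 - a surrogate pair, 4 bytes
    print(poo.encode("utf-8").hex())       # f09f92a9 - also 4 bytes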


I wouldn't use UTF-16 unless I had to work with a legacy system. A lot of software claiming to handle UTF-16 is broken and really only works with UCS-2. You have to worry about endianness and so on.

If you develop an application for the international market you should probably go with UTF-8 as well. Maybe if you develop only for the Asian market it's a bit different. But in my experience significant amounts of text usually come in some form of data or markup format (HTML, XML, JSON, etc.) and usually those markup formats are defined in the ASCII subset. So UTF-8 still wins. Just take a random page from the Japanese Wikipedia and encode it in UTF-8 and in UTF-16. You'll see that UTF-8 almost always wins.


True. True. And probably not true.

The advantage of moving to one standard, plus the backwards compatibility for most of the internet will outweigh any small size advantage that UTF-16 will have on real file sizes for some countries / cultures.

A saving of ⅓ on a text file will barely be noticed in a world seemingly governed by Moore's law in nearly every future metric.

(also my bad for using UTF-16 where I meant 16-bit unicode character arrays :))


In other words, UTF-16 (variable length) is different from UCS-2 (fixed length).


FWIW, UTF-8 was designed by Ken Thompson: http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt

So his inventions of Unix and UTF-8 now dominate both the back- and front-ends of the internet.


From the article: When you include ASCII, nearly 80 percent of web documents are in Unicode (UTF-8).

...which indicates that nearly all Unicode on the web is UTF-8.


Can you find a website that uses a non-UTF-8 Unicode encoding? (Actually now that I think about it, UTF-16 probably makes sense for non-Latin languages and might well be common. Does anyone have any insights here?)


> Actually now that I think about it, UTF-16 probably makes sense for non-Latin languages

It actually doesn't because of all the HTML tags being ASCII.



ISO-8859-n were pretty common in Europe, because they were the default charsets on Windows. Just checked the web pages of two major Danish newspapers: they use ISO-8859-1. UTF-8 is getting more widespread though.


They aren't the default Windows charsets. Windows code pages tend to be close, with differences here and there that are impossible to sniff for.


ISO-8859-n is not Unicode.


ISO-8859-n is not ASCII either, so if your software is not Unicode-aware (or encoding-aware) at all, things will break.


They are considered unicode encodings.


No, they are not.


http://www.w3.org/TR/html4/charset.html considers ISO-8859-n character encodings:

"Commonly used character encodings on the Web include ISO-8859-1 (also referred to as "Latin-1"; usable for most Western European languages), ISO-8859-5 ..."


ISO-8859-5 has NEVER been a common encoding on the Web.

KOI8-R was, then Windows-1251. Now it's often UTF-8.


But Latin-1 is not a Unicode encoding.


All that's saying is that software developers have discovered that there are alphabets other than roman, and people use them. But all it takes to be UTF-8 is to have your web server declare the charset in the Content-Type header or add a meta charset declaration to your page. If you stick to 7-bit ASCII characters, UTF-8 and ASCII are byte-for-byte the same.

What I really want to know is the breakdown by alphabet. How many sites are in Cyrillic? Kanji? Arabic? Thai?


As a Finnish user, I have found out that using UTF-8 as the default charset (in the browser) is not a practical option. There are so many sites out there using ISO-8859-1 which do not inform the browser about their charset choice at all. So when I'm using UTF-8 as the default, I get pages with mangled Scandinavian characters all the time. öäåÖÄÅ


One example: even the people at http://cert.fi don't know how to provide correct character encoding data: http://bayimg.com/EaMOjaADi


I thought only dinosaurs used something different than utf-8.


for a moment, I parsed the headline as "You need to scan 60% of the web to find an instance of each and every unicode glyph (being used organically)..."

Actually, outside of listings of all Unicode/code pages, I'm 98% sure there is a vast number of Unicode characters that are not used even once (organically) on the entire Internet.

I'd bet even money you can craft a two-character 'word' such that you would be the top Google result for it if you use it just once or twice in any context on any page Google indexes, just because you're the first person to use those characters organically, to say nothing of together. /s


> As you can see, Unicode has experienced an 800 percent increase in “market share” since 2006.

> The more documents that are in Unicode, the less likely you will see mangled characters (what Japanese call mojibake) when you’re surfing the web.

Hmm. Well, I know that's true in theory, but my personal experience is that the occurrence of mojibake has increased lately. I even wrote about it here on HN: http://news.ycombinator.com/item?id=2075010


I think a lot of new code tends to just assume UTF-8, and if you try and use something else that's your problem (I know I do this). Older code had to be able to detect encodings.


Unicode is the dollar of the internet?


Actually, the whole web is Unicode today, AFAICT. Unicode is defined as the character repertoire of HTML and XML. In olden days we had different charsets (like ASCII, ISO-8859-n and so on), but these have simply been redefined in an HTML/XML context to be character encodings. So ASCII, ISO-8859-n etc. are considered Unicode encodings which happen to only be able to represent a subset of the full Unicode character repertoire.


Absolutely none of that is correct.


A lot of people seem to agree with you, but I believe the HTML 4 spec supports my point: http://www.w3.org/TR/html4/charset.html

HTML is defined to use Unicode as the document character set. But the characters can be represented as byte streams using different encodings, UTF-8 being one encoding, ISO-8859-1 being another.

> The "charset" parameter identifies a character encoding, which is a method of converting a sequence of bytes into a sequence of characters.

A lot of people seem to confuse Unicode with the UTF-encodings.


> I believe the HTML 4 spec supports my point

That document is badly-written; a more reasonable way to interpret it is to conclude that user agents will use some form of Unicode internally, after converting whatever character encoding the document they received used. Which is, indeed, a very reasonable way to design your software, but it doesn't make Latin-1 (for example) a Unicode encoding by any reasonable standard.

> A lot of people seem to confuse Unicode with the UTF-encodings.

True. I do not.



