UTF-8 Woes on Icecast

You may have spotted some broken characters in server names and genres in the Icecast add-on, while in other names UTF-8 comes through perfectly OK. I’ve been fighting this issue for some time, but so far it is still a tie.

First, I checked the yp.xml file that the Icecast directory serves. I expected to see some 8-bit characters; instead, I found something that looks much more like double- and triple-encoded UTF-8.

How does double-encoded UTF-8 come to life? Let’s say you have a UTF-8 string. You pass it through some „channel“ which itself is UTF-8 (e.g., a print() to STDOUT), but you forget to „tell“ it that the string is already UTF-8 encoded. The „channel“ therefore assumes your string is in some 8-bit encoding and encodes each byte again into UTF-8. If you do it a second time, you get triple-encoded UTF-8, and so on.

The good news is that multiple-encoded UTF-8 can be reversed by converting from UTF-8 back to the 8-bit encoding – e.g., by calling iconv(). Each pass strips one „layer“ of UTF-8 until you get back to regular UTF-8. The bad news is that this does not work on the Icecast data, because their multiple-encoded UTF-8 is somehow broken.
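
In Python, stripping one layer amounts to decoding the bytes as UTF-8 and re-encoding the result as ISO-8859-1. The sketch below is only an illustration: the function names are made up, and it assumes the extra layers were added via ISO-8859-1 rather than some other 8-bit encoding such as Windows-1252.

```python
def strip_layer(data: bytes) -> bytes:
    """Undo one spurious pass: decode as UTF-8, re-encode as ISO-8859-1."""
    return data.decode("utf-8").encode("iso-8859-1")


def strip_all_layers(data: bytes, max_passes: int = 10) -> bytes:
    """Keep stripping layers as long as the result is still valid UTF-8."""
    for _ in range(max_passes):
        try:
            candidate = strip_layer(data)
            candidate.decode("utf-8")  # must still be valid UTF-8
        except UnicodeError:
            break  # one more pass would destroy the text, so stop here
        if candidate == data:
            break  # pure ASCII: nothing left to strip
        data = candidate
    return data


if __name__ == "__main__":
    triple = bytes.fromhex("C3 83 C2 83 C3 82 C2 BC")
    print(strip_all_layers(triple).hex())
```

Note that this is a heuristic: a string that legitimately contains text which looks like double-encoded UTF-8 would be collapsed as well.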

As an example, let’s take the German letter „ü“ (a letter „u“ with two dots overhead; Germans call it „umlaut“). In UTF-8 it is represented by two bytes, as are all characters from U+0080 to U+07FF:

C3 BC

We can simulate double encoding with „iconv -f iso-8859-1 -t utf-8“. If we pass it the letter „ü“, we get:

C3 83 C2 BC

We can now do a triple encoding by calling the same iconv command once more. We get:

C3 83 C2 83 C3 82 C2 BC
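
The same byte sequences can be reproduced without iconv. Here is a minimal Python sketch; add_layer and hexdump are hypothetical helper names, and ISO-8859-1 is assumed as the 8-bit encoding, as in the iconv call above.

```python
def add_layer(data: bytes) -> bytes:
    """Misread UTF-8 bytes as ISO-8859-1 and encode them to UTF-8 again."""
    return data.decode("iso-8859-1").encode("utf-8")


def hexdump(data: bytes) -> str:
    """Format bytes as space-separated upper-case hex pairs."""
    return " ".join(f"{b:02X}" for b in data)


single = "\u00fc".encode("utf-8")  # the letter u with umlaut
double = add_layer(single)
triple = add_layer(double)

print(hexdump(single))  # C3 BC
print(hexdump(double))  # C3 83 C2 BC
print(hexdump(triple))  # C3 83 C2 83 C3 82 C2 BC
```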

Let’s get back to yp.xml and see what we have there. By chance I know that one of the radio stations has the word „Türingens“ in its name, but the letter „ü“ comes out broken. Looking at yp.xml we find the following sequence:

C3 83 C3 83 C3 82 C2 BC

It is very close to the one we expected – with one difference: the third byte is C3 where we expected C2.

This is very weird, because what we have in yp.xml cannot be reversed into readable UTF-8. You can try reversing it with iconv or with any other, more sophisticated method (the Perl module „Encode“ has several extensions aimed at this, some of them going as far as auto-detecting the „path“ that led to a multiple-encoded string) – they will all fail.
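
To see the failure concretely, here is a small Python sketch that tries to peel two layers off both the expected sequence and the one found in yp.xml. The helper names are made up, and ISO-8859-1 is again assumed as the 8-bit encoding.

```python
expected = bytes.fromhex("C3 83 C2 83 C3 82 C2 BC")  # clean triple encoding
found = bytes.fromhex("C3 83 C3 83 C3 82 C2 BC")     # bytes seen in yp.xml


def strip_layer(data: bytes) -> bytes:
    """Undo one spurious pass: decode as UTF-8, re-encode as ISO-8859-1."""
    return data.decode("utf-8").encode("iso-8859-1")


def strip_layers(data: bytes, passes: int) -> bytes:
    """Apply strip_layer the given number of times."""
    for _ in range(passes):
        data = strip_layer(data)
    return data


# 0-based positions where the two sequences disagree.
diff = [i for i, (a, b) in enumerate(zip(expected, found)) if a != b]
print("differing positions:", diff)

# The clean sequence comes back to plain UTF-8 after two passes.
print(strip_layers(expected, 2) == "\u00fc".encode("utf-8"))

# The yp.xml sequence survives the first pass, but the result is no
# longer valid UTF-8, so the second pass raises an error.
try:
    strip_layers(found, 2)
except UnicodeDecodeError as exc:
    print("second pass failed:", exc.reason)
```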

Even more interesting, the same issue can be seen in many other partially broken strings in yp.xml – they all look almost like they should, but one byte is wrong. Reversing them also fails.

Exploring further on this broken UTF-8, I wrote a small Perl script to remove in a single pass everything between the first and last byte of the broken part, thus reverting the string back to its original UTF-8. Amazingly, this results in fully legible text. So, as a conclusion, we know so far that:

  • We do have multiple-encoded UTF-8 strings in yp.xml.
  • There are many cases, independent of each other, where this multiple-encoded UTF-8 is broken at the last pass of encoding.
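
The single-pass removal described above can be sketched in Python as follows. This is an illustration, not the original Perl script. What counts as a „broken part“ is an assumption here: a C3 byte, then at least two bytes from the set C2, C3, 82, 83, then one continuation byte. The names BROKEN_RUN and collapse are made up.

```python
import re

# Assumed shape of a damaged run:
#   C3, then 2+ filler bytes (C2/C3/82/83), then one continuation byte.
BROKEN_RUN = re.compile(rb"(\xC3)[\x82\x83\xC2\xC3]{2,}([\x80-\xBF])")


def collapse(data: bytes) -> bytes:
    """Keep only the first and last byte of every damaged run."""
    return BROKEN_RUN.sub(rb"\1\2", data)


if __name__ == "__main__":
    broken = b"T" + bytes.fromhex("C3 83 C3 83 C3 82 C2 BC") + b"ringens"
    print(collapse(broken).decode("utf-8"))
```

Keeping the first byte only works for characters whose UTF-8 form starts with C3 (U+00C0 to U+00FF), such as „ü“. Characters that start with C2, or two damaged characters directly next to each other, would need a more careful pattern.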

Getting broken UTF-8 may be the result of a software bug; since we also see it in server names (which, I presume, should not depend on the streaming clients used), we might narrow the possible places down to either the streaming server or the directory server. However, there are still multiple places where the issue may come from (e.g., a database used to store the server configuration, a web form used to enter and save it, etc.).

If this path of investigation is right, we’ll probably have to set up an Icecast server in order to look deeper into its configuration. We’ll even have to consider that this bug has already been found and killed, so a fully up-to-date Icecast server might not „feature“ it.
