To UTF-8 or UCS-2 (not UTF-16) for strings

For a long time I wrote my … code with ASCII strings in mind, sitting in my little box of the world and thinking that everyone would basically use English on their computers. Of course this is not the case, and it will be even less so in the future. While I was partially supporting UTF-8, I wanted to make sure I have a proper String class in place that correctly handles every UTF-8 subtlety.

I have checked what other game engines do. Some use UCS-2, with two bytes per character, covering the BMP (Basic Multilingual Plane), which is good enough for the majority of languages. The main issue with it is that I did not like having to put L"some text" or a macro like TEXT("some text") around ALL strings in the engine. Plus there is the cross-platform problem that wchar_t is not always 2 bytes (this could be fixed by using uint16_t instead). So I have chosen UTF-8 strings: they can hold either simple ASCII or any Unicode code point, transparently.

On Windows I need to convert from UTF-8 to wchar_t before passing strings to the file and other system APIs. On Linux I can pass them as they are, provided UTF-8 is the encoding in use. On OSX it is probably similar to Linux, but I do not know yet.
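To make the Windows conversion step concrete, here is a minimal portable sketch of turning UTF-8 into the 16-bit units that Windows wchar_t strings use. The function name utf8_to_utf16 is hypothetical and is not taken from the engine; in real Windows code the usual route is the Win32 call MultiByteToWideChar with CP_UTF8. The sketch replaces malformed bytes with U+FFFD and does not reject overlong encodings or encoded surrogates.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not the engine's code): decode UTF-8 bytes and
// re-encode them as 16-bit units, using surrogate pairs above the BMP.
std::u16string utf8_to_utf16(const std::string& in)
{
    const char16_t kReplacement = 0xFFFD; // emitted for malformed input
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;   // decoded code point
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else { out.push_back(kReplacement); ++i; continue; } // stray byte

        // All continuation bytes must exist and look like 10xxxxxx.
        bool ok = (i + extra < in.size());
        for (std::size_t k = 1; ok && k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) ok = false;
            else cp = (cp << 6) | (c & 0x3F);
        }
        if (!ok || cp > 0x10FFFF) { out.push_back(kReplacement); ++i; continue; }
        i += extra + 1;

        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            cp -= 0x10000; // split into a high and a low surrogate
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

Doing this only at the API boundary keeps the rest of the code on plain char strings, so the conversion cost is paid only when a path or similar string actually goes to the OS.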

String operations still use strstr, strcmp, etc. They work fine with UTF-8 string data as well as with plain one-byte ASCII, because they operate on bytes and UTF-8 never puts a zero byte or an ASCII byte value inside a multi-byte sequence. Indexing the string is done per code point, not per byte. Text files in UTF-8 are pretty standard, and I do not need a BOM header to detect the format of a text file: plain ASCII is itself a valid UTF-8 byte sequence, so both cases go through the same path.
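As a rough illustration of indexing by code point while still using the C string functions, here is a small sketch. The names is_continuation, utf8_length and utf8_at are made up for this example and are not the engine's String class API, and the code assumes the input is valid, NUL-terminated UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// A UTF-8 continuation byte has the bit pattern 10xxxxxx.
inline bool is_continuation(unsigned char b) { return (b & 0xC0) == 0x80; }

// Number of code points: every byte that is not a continuation byte
// starts a new code point. (Hypothetical helper, assumes valid UTF-8.)
std::size_t utf8_length(const char* s)
{
    assert(s != nullptr);
    std::size_t count = 0;
    for (; *s; ++s) {
        if (!is_continuation(static_cast<unsigned char>(*s))) ++count;
    }
    return count;
}

// Pointer to the first byte of the index-th code point, or to the
// terminating NUL when index is out of range. This is a linear scan,
// which is the price of indexing a variable-width encoding.
const char* utf8_at(const char* s, std::size_t index)
{
    assert(s != nullptr);
    for (; *s; ++s) {
        if (!is_continuation(static_cast<unsigned char>(*s))) {
            if (index == 0) return s;
            --index;
        }
    }
    return s;
}
```

For the string "h\xC3\xA9" "llo" (the word "héllo"), strlen reports 6 bytes while utf8_length reports 5 code points, and strstr still finds the two-byte "é" sequence at byte offset 1 without knowing anything about UTF-8.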

And that’s pretty much it: the Tachyon engine uses UTF-8 strings, with minimal overhead.

