Properly handle supplementary characters when saving XML files
Ryan Sakowski
2017-07-06 21:27:05 UTC
About a month ago, I posted a pull request on GitHub that should resolve the numerous "reference to invalid character number at line ###" errors that have been posted on the Windows board. Here's the pull request:


What's going on here is that Audacity for Windows (but not Linux and probably not macOS) is saving Unicode supplementary characters (code points above U+FFFF) as escaped UTF-16 surrogate pairs, which are illegal in XML. (Supplementary characters include, among other things, many emoji and lesser-used writing systems.) Supplementary characters instead need to be saved either directly or as a single numeric character reference with a 5- or 6-digit hexadecimal code; I have confirmed that this is what happens on Linux.

The cause of this problem is that Audacity uses wxString, which, at least on Windows*, stores characters as wchar_t. A wchar_t is 2 bytes (UTF-16) on Windows and 4 bytes (UTF-32) on Linux and macOS. While a 4-byte wchar_t can store any Unicode code point in one unit, a 2-byte wchar_t cannot store supplementary characters, so on Windows such characters are represented as surrogate pairs (a unit in U+D800..U+DBFF followed by a unit in U+DC00..U+DFFF). However, the GetChar function of wxString doesn't seem to decode surrogate pairs, and Audacity does nothing to handle them in the XMLWriter::XMLEsc function (which escapes and filters out certain characters for project files, and is where surrogates are incorrectly being escaped).
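
To make the surrogate-pair arithmetic concrete, here is a minimal sketch in standard C++ of how two UTF-16 units combine into one code point. The function names are made up for illustration and are not part of wxWidgets or Audacity.

```cpp
#include <cassert>

// True if `unit` is a high (leading) surrogate, U+D800..U+DBFF.
constexpr bool isHighSurrogate(char16_t unit) {
    return unit >= 0xD800 && unit <= 0xDBFF;
}

// True if `unit` is a low (trailing) surrogate, U+DC00..U+DFFF.
constexpr bool isLowSurrogate(char16_t unit) {
    return unit >= 0xDC00 && unit <= 0xDFFF;
}

// Combine a valid surrogate pair into the supplementary code point it encodes.
// Each surrogate carries 10 bits of the value (code point - 0x10000).
inline char32_t decodeSurrogatePair(char16_t high, char16_t low) {
    assert(isHighSurrogate(high) && isLowSurrogate(low));
    return 0x10000
         + ((static_cast<char32_t>(high) - 0xD800) << 10)
         + (static_cast<char32_t>(low) - 0xDC00);
}
```

For example, the pair 0xD83D, 0xDE00 decodes to U+1F600. Escaping the two units separately would produce two references to surrogate code points, which XML parsers reject; a single reference to the decoded code point is legal.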

My pull request fixes this by detecting surrogate pairs and passing them unescaped to the output string; the surrogate pairs are eventually decoded and the characters they encode are written properly to the project file.
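
For readers who want to see the general idea without opening the pull request, here is a rough sketch of the approach using std::u16string. It is not the actual patch and not Audacity's XMLWriter::XMLEsc: the function name escapeXmlSketch is hypothetical, only the five predefined XML entities are handled, and dropping a lone (unpaired) surrogate is my assumption about a reasonable fallback.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for an XML escaping routine (not Audacity's code).
// Valid surrogate pairs are copied through unescaped so that a later
// UTF-16 to UTF-8 conversion can write the character directly.
std::u16string escapeXmlSketch(const std::u16string& in) {
    std::u16string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        const char16_t c = in[i];
        switch (c) {
        case u'&':  out += u"&amp;";  break;
        case u'<':  out += u"&lt;";   break;
        case u'>':  out += u"&gt;";   break;
        case u'"':  out += u"&quot;"; break;
        case u'\'': out += u"&apos;"; break;
        default: {
            const bool high = c >= 0xD800 && c <= 0xDBFF;
            const bool nextIsLow = i + 1 < in.size()
                && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF;
            if (high && nextIsLow) {
                // Valid pair: pass both units through untouched.
                out += c;
                out += in[i + 1];
                ++i;
            } else if (c >= 0xD800 && c <= 0xDFFF) {
                // Lone surrogate: dropped here (assumption, see above).
            } else {
                out += c;
            }
            break;
        }
        }
    }
    return out;
}
```

With this sketch, a string containing U+1F600 comes out with its two UTF-16 units intact, while characters such as < and & are still replaced by entities.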

*According to the "Performance characteristics" section of the wxString documentation (http://docs.wxwidgets.org/3.0.2/classwx_string.html#string_performance), wxString uses wchar_t as its character type by default. However, the documentation for wxStringCharType (http://docs.wxwidgets.org/3.0.2/group__group__funcmacro__string.html#gaf558f1d34fbf3cf5e3258e42a40875fd) says that it is the type used by wxString and that it is, by default, wchar_t on Windows and char on various other platforms (in which case I think wxString uses the UTF-8 encoding). I haven't tested which of these is more accurate, but since Audacity on Linux saves supplementary characters properly to project files, I don't think it's a concern.