XML file format saves unicode incorrectly

Before anybody panics: Studio reads these files just fine, despite them being wrong XML. I will make a second bug report for reading if requested.

Roblox’s XML format escapes UTF-8 text in string properties under all circumstances. As an example, it would save a StringValue with its Value set to ☺ as follows:

<roblox xmlns:xmime="http://www.w3.org/2005/05/xmlmime" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://www.roblox.com/roblox.xsd" version="4">
	<Meta name="ExplicitAutoJoints">true</Meta>
	<External>null</External>
	<External>nil</External>
	<Item class="StringValue" referent="RBXF3D08260A7644806A9C1DF20B3C26425">
		<Properties>
			<BinaryString name="AttributesSerialize"></BinaryString>
			<SecurityCapabilities name="Capabilities">0</SecurityCapabilities>
			<bool name="DefinesCapabilities">false</bool>
			<string name="Name">Value</string>
			<int64 name="SourceAssetId">-1</int64>
			<BinaryString name="Tags"></BinaryString>
			<string name="Value">&#226;&#152;&#186;</string>
		</Properties>
	</Item>
</roblox>

Note that the Value property becomes &#226;&#152;&#186;, which is seemingly an escape of the three bytes that ☺ is under the hood in UTF-8 (e2 98 ba). This would be fine, except that is not how character references work.

The sequence &#226;&#152;&#186; is a sequence of what are called character references, and as per the XML standard they expand by code point, not by byte. The above sequence therefore expands to \u{e2}\u{98}\u{ba}, which in turn encodes in UTF-8 as c3 a2 c2 98 c2 ba.
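To make the expansion concrete, here is a minimal sketch using Python's standard `xml.etree.ElementTree` parser (my choice for illustration, not something the report names). It parses only the `Value` element from the file above and prints the resulting code points and their UTF-8 bytes.

```python
import xml.etree.ElementTree as ET

# The Value element exactly as Roblox saves it.
element = ET.fromstring('<string name="Value">&#226;&#152;&#186;</string>')
text = element.text

# A compliant parser turns each character reference into one code point.
code_points = [f"U+{ord(ch):04X}" for ch in text]
utf8_hex = text.encode("utf-8").hex(" ")

print(code_points)  # ['U+00E2', 'U+0098', 'U+00BA']
print(utf8_hex)     # c3 a2 c2 98 c2 ba
```

The result is three separate characters rather than the single ☺ that was originally stored.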

Expected behavior

Every XML parser I have tried expands this the way the standard says it should. My expectation would be that Roblox either escapes Unicode characters properly (in the above example, as &#9786;) or simply does not escape them at all, since XML supports UTF-8 encoding.
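As a rough check of both options, the sketch below (again using Python's `xml.etree.ElementTree`, which is only an illustrative choice) parses the code point escape and the unescaped character, then shows that the same library can write either form.

```python
import xml.etree.ElementTree as ET

# Option 1: escape by code point. Option 2: store the character unescaped.
escaped = ET.fromstring('<string name="Value">&#9786;</string>')
literal = ET.fromstring('<string name="Value">\u263a</string>')

# Both forms parse to the same single character, U+263A.
same_text = escaped.text == literal.text == "\u263a"
utf8_hex = escaped.text.encode("utf-8").hex(" ")

# Serializing as ASCII yields the code point escape; serializing as text keeps the character.
ascii_out = ET.tostring(escaped)
unicode_out = ET.tostring(escaped, encoding="unicode")

print(same_text, utf8_hex)
print(ascii_out)
print(unicode_out)
```

Either way the string round-trips as one character whose UTF-8 bytes are e2 98 ba.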


Thanks for the report! We’ll follow up when there’s any update on the issue.

Hello! Thanks for pointing this out. The XML format deviates a bit from the standard here because strings can contain any sequence of bytes, not just bytes that form valid Unicode code points. There’s not really a good way to emit both code points for Unicode data and the current encoding for non-Unicode data. We may address this in a future version of the format.

I’m going to close this out for now as everything is working properly despite being non-standard.

I figured this was the result it’d get, since if nothing else it’s too late to change it now. I wanted to report it anyway, since it’s been a source of grief when implementing a 3rd party parser for the format, and writing it down is important.

All of the XML parsing libraries we can find handle character references correctly (this is a good thing). As a result, people using Rojo’s parser have been basically unable to use Unicode in the XML format, which is a bummer.

We can definitely fix this by deliberately not doing the right thing, but it’ll involve manually encoding and decoding strings. Not the end of the world, but not what we’d like to do.
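For illustration, the manual encoding and decoding could look something like the sketch below. It is written in Python rather than in Rojo's own code, and the helper names `decode_roblox_string` and `encode_roblox_string` are hypothetical. It assumes every parsed code point is at most 255 (true when the file escapes every non-ASCII byte, as in the example above) and does not attempt to handle control characters that XML 1.0 forbids.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def decode_roblox_string(text: str) -> bytes:
    """Hypothetical helper: treat each parsed code point as one raw byte."""
    # latin-1 maps code points 0..255 to identical byte values;
    # any code point above 255 raises UnicodeEncodeError.
    return text.encode("latin-1")


def encode_roblox_string(raw: bytes) -> str:
    """Hypothetical helper: emit one character reference per non-ASCII byte."""
    parts = []
    for byte in raw:
        if byte >= 0x80:
            parts.append(f"&#{byte};")
        else:
            # ASCII passes through, with &, < and > escaped for XML.
            parts.append(escape(chr(byte)))
    return "".join(parts)


element = ET.fromstring('<string name="Value">&#226;&#152;&#186;</string>')
raw = decode_roblox_string(element.text)
print(raw.decode("utf-8"))                        # ☺
print(encode_roblox_string("☺".encode("utf-8")))  # &#226;&#152;&#186;
```

Returning `bytes` from the decoder leaves it to the caller to decide whether the property holds UTF-8 text or arbitrary binary data, which matches the point above that strings can contain any sequence of bytes.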

If another version of the XML format is ever released (heaven forbid), it’d be great to see this fixed though.
