Avoid using "<![CDATA[ ... ]]>" in RSS
<![CDATA[ ... ]]> is very commonly used in RSS (and Atom) feeds to avoid escaping XML special characters. At first glance it looks very convenient: you simply wrap the content in a <![CDATA[ ... ]]> block and write (almost) anything inside it without worrying about escaping characters:
<item>
  <title><![CDATA[Using <CDATA> in Titles]]></title>
  <link>http://example.com</link>
  <description>
    <![CDATA[
    <p>This description contains <strong>HTML markup</strong>.</p>
    <p>It allows us to use characters like "<b>&</b>" and brackets directly.</p>
    ]]>
  </description>
</item>
Why not CDATA?
CDATA seems perfect, doesn't it? Except that there is no way to escape one character sequence inside a single CDATA block: ]]>, the sequence that ends the block. To encode it, you have to split the CDATA block into multiple parts:
<text>
<![CDATA[hello ]]]]><![CDATA[> world]]>
</text>
The encoded text is "hello ]]> world". As you can see, the XML code is less readable now. CDATA loses most of its simplicity advantage.
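The splitting rule above can be sketched as a small helper. This is a minimal illustration, not a complete serializer, and the name toCdata is hypothetical (it does not come from any library):

```javascript
// Hypothetical helper: wraps text in a CDATA section and splits every "]]>"
// so that no single section contains its own terminator.
function toCdata(text) {
  // "]]>" becomes "]]" + end of section + start of a new section + ">"
  return "<![CDATA[" + text.replaceAll("]]>", "]]]]><![CDATA[>") + "]]>";
}

console.log(toCdata("hello ]]> world"));
// <![CDATA[hello ]]]]><![CDATA[> world]]>
```

Every CDATA writer needs this extra step (or an equivalent one), and it is easy to forget because ]]> almost never shows up in test data.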
Even though splitting makes it possible to encode ]]>, I would say CDATA is still not worth using:
- It adds a special edge case for ]]>, which the serializer must handle.
- It can mislead people into thinking the content is raw HTML or somehow safer. It is not. This false sense of security can even lead inexperienced people to overlook ]]> entirely, especially since it occurs so rarely.
- It makes output less uniform, because sometimes you need to split CDATA blocks.
- It does not change the parsed value. XML parsers expose the same text either way.
- It can make debugging confusing, especially if the content itself discusses CDATA, like this article's title does.
Just look at the RSS feed of this blog: it simply escapes XML characters.
What to do instead?
Just escape these characters (works for HTML too):
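function xmlEscape(text) {
  return text
    // "&" must be replaced first so the entities added below are not escaped again
    .replaceAll("&", "&amp;")
    .replaceAll("<", "&lt;")
    .replaceAll(">", "&gt;")
    .replaceAll('"', "&quot;")
    // numeric reference, valid in XML and in every HTML version
    .replaceAll("'", "&#39;");
}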
Normal escaping is simpler and more uniform.
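As a usage sketch, here is how an item like the one at the beginning of this article could be written with plain escaping. The xmlEscape function is repeated so the snippet runs on its own; renderItem is a hypothetical helper, not part of any library:

```javascript
function xmlEscape(text) {
  return text
    .replaceAll("&", "&amp;")
    .replaceAll("<", "&lt;")
    .replaceAll(">", "&gt;")
    .replaceAll('"', "&quot;")
    .replaceAll("'", "&#39;");
}

// Hypothetical helper: every text field goes through the same escaping path.
function renderItem({ title, link, description }) {
  return [
    "<item>",
    `<title>${xmlEscape(title)}</title>`,
    `<link>${xmlEscape(link)}</link>`,
    `<description>${xmlEscape(description)}</description>`,
    "</item>",
  ].join("\n");
}

console.log(renderItem({
  title: "Using <CDATA> in Titles",
  link: "http://example.com",
  description: "<p>Text with ]]> needs no special handling.</p>",
}));
// The title line is printed as:
// <title>Using &lt;CDATA&gt; in Titles</title>
```

Note that the ]]> inside the description is just three ordinary characters here; only the > gets escaped, like any other >.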
OK, but some people might say that CDATA makes RSS content smaller on average, since characters don't need to be escaped (the escaped form takes more characters) and ]]> is rarely encountered. Fair point, however:
- Feeds are usually gzip-compressed. Repeated strings like &lt;, &gt;, and &amp; compress very well.
- RSS feed size is rarely the bottleneck. Images, HTML pages, CSS, JS, and network latency usually matter much more.
- CDATA has a special edge case. You still need to correctly handle ]]>.
- Normal escaping is simpler and more uniform. One escaping path works for titles, descriptions, Atom, RSS, attributes, metadata, etc.
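The size argument can be checked with a quick experiment. The sketch below assumes Node.js (for its built-in zlib module) and uses made-up sample content; real feeds will give different numbers, so treat it as a way to measure your own feed rather than as a benchmark:

```javascript
const zlib = require("node:zlib");

function xmlEscape(text) {
  return text
    .replaceAll("&", "&amp;")
    .replaceAll("<", "&lt;")
    .replaceAll(">", "&gt;")
    .replaceAll('"', "&quot;")
    .replaceAll("'", "&#39;");
}

// Made-up sample content, repeated to imitate a feed with many entries.
const html = '<p>Some <strong>HTML markup</strong> with "quotes" & ampersands.</p>\n';
const body = Array.from({ length: 200 }, (_, i) => `Entry ${i}: ${html}`).join("");

// The same content encoded both ways (the sample contains no "]]>").
const escaped = `<description>${xmlEscape(body)}</description>`;
const cdata = `<description><![CDATA[${body}]]></description>`;

// Byte size before and after gzip compression.
function sizes(xml) {
  return {
    raw: Buffer.byteLength(xml),
    gzip: zlib.gzipSync(xml).length,
  };
}

console.log("escaped:", sizes(escaped));
console.log("cdata:  ", sizes(cdata));
```

The interesting comparison is between the two gzip values, since that is what actually travels over the network when compression is enabled.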
Conclusion
Here I listed the reasons why you should avoid using CDATA. This is especially true if you are going to implement your own RSS / Atom feed generator.
Many libraries, frameworks, and CMSs still generate CDATA for RSS / Atom feeds, and many of them handle the ]]> sequence in their own ways. They are perfectly fine to use if you have to rely on them. CDATA is common because it is convenient for legacy feed generators and visually cleaner for embedded HTML. But for new code, ordinary XML escaping is usually cleaner and more uniform.
See you later.