How to sanitize HTML text using only vanilla DOM API

What is sanitization and why is it important?
Many websites, especially content management systems, rely heavily on dynamic text content such as saved rich text, comments, and posts. Such content is often stored as raw HTML that is embedded directly into the page. This creates security issues if the website doesn't validate that HTML, and those issues are commonly exploited in Cross-Site Scripting (XSS) attacks. Some common security problems are:
- Inline scripts, which can contain potentially malicious code. For example:
  <script> alert("Potentially malicious code"); </script>
- Inline styles, which don't execute code by themselves but can still modify or break the website UI. For example:
  <style> body { background: red; } </style>
- Dangerous HTML tags and attributes. This is also related to the previous 2 points to some extent. Some HTML elements, especially when used with certain attributes can also have dangerous behaviors. For example:
  <a href="javascript:alert('Potentially malicious link');">Potentially malicious link</a>
  <button onclick="alert('Potentially malicious button');">Potentially malicious button</button>
Sanitization is the removal of such dangerous HTML fragments, or their replacement with safe HTML text. This is why HTML sanitization is so important, especially for websites with dynamic content.
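Replacing a dangerous fragment with safe text can be as simple as escaping the characters that HTML treats as markup, so the browser displays the fragment literally instead of parsing it. Here is a minimal sketch of that idea. `escapeHtml` and `HTML_ENTITIES` are hypothetical names used only for illustration and are not part of the sanitizer built later in this post:

```javascript
// Hypothetical lookup table: markup characters and their HTML entities.
const HTML_ENTITIES = {
  '&': '&amp;',
  '<': '&lt;',
  '>': '&gt;',
  '"': '&quot;',
  "'": '&#39;',
};

// Hypothetical helper: escapes markup characters so the input is rendered
// as plain text instead of being parsed as HTML.
function escapeHtml(text) {
  return String(text).replace(/[&<>"']/g, (char) => HTML_ENTITIES[char]);
}

console.log(escapeHtml('<script> alert("x"); </script>'));
// prints: &lt;script&gt; alert(&quot;x&quot;); &lt;/script&gt;
```

Escaping throws away all formatting, which is why the rest of this post takes the other route and keeps a small set of safe tags.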
Implementing a basic HTML sanitizer
Here I'll implement a basic HTML sanitizer using a JS generator and DOMParser, which does the following:
- Supports the following elements:
  - <a>. The href attribute must be a valid URL that starts with http:// or https://, otherwise the link is not valid and the element is considered invalid. The element also supports the target attribute. If the target attribute is specified as _blank or blank, the target in the sanitized HTML will be _blank, otherwise the attribute will not be added.
  - <img>. The src attribute must be a valid URL that starts with http:// or https://, otherwise the image is not valid and the element is considered invalid.
  - <font>. It supports color and size attributes. No validation is done for attributes, since they are harmless even if they are invalid.
  - <br>, <b>, <strong>, <i>, <em>, <del>, <s>, <u>, <p>, <hr>, <li>, <ul>, <ol>.
  - Text nodes.
- If an element is unsupported or invalid, the element itself is dropped, but its nested elements and text nodes are still sanitized recursively and added in its place.
- Only the supported tags and attributes are copied; unsupported attributes are always ignored.
- There will be no unclosed tags.
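The URL and target rules from the list above can be sketched in isolation. `toSafeHttpUrl` and `toSafeTarget` are hypothetical helper names. They also assume the raw attribute string as input, whereas the sanitizer below reads the DOM `href` and `src` properties, so treat this as an illustration of the rules rather than the exact code path:

```javascript
// Hypothetical helper: returns the normalized URL when it uses http: or
// https:, and null for anything else (including strings that do not parse).
function toSafeHttpUrl(value) {
  let url;
  try {
    url = new URL(value); // throws a TypeError if value is not an absolute URL
  } catch {
    return null;
  }
  return url.protocol === 'http:' || url.protocol === 'https:' ? url.href : null;
}

// Hypothetical helper for the target rule: "_blank" and "blank" both become
// "_blank"; every other value means the attribute is not added.
function toSafeTarget(value) {
  return value === '_blank' || value === 'blank' ? '_blank' : null;
}

console.log(toSafeHttpUrl('https://example.com/page')); // https://example.com/page
console.log(toSafeHttpUrl("javascript:alert('x')"));    // null
console.log(toSafeTarget('blank'));                     // _blank
console.log(toSafeTarget('_self'));                     // null
```

Checking the parsed `protocol` instead of matching the string prefix leaves the parsing to the URL parser, which also normalizes details such as letter case in the scheme.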
The code
The parsing and the construction of the elements are done via DOMParser, because constructing the elements via the methods of the active document (window.document) and innerHTML can still send HTTP requests and execute some callbacks, such as onload or onerror.
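The parsing step on its own might look like the sketch below. `parseInert` is a hypothetical name, and the `ParserClass` parameter is an assumption added so that a DOMParser from another DOM implementation can be passed in. In a browser the global DOMParser is used by default:

```javascript
// Sketch: parse untrusted HTML into an inactive document and return its body.
// `ParserClass` is a hypothetical parameter; it defaults to the browser's
// global DOMParser and is only looked up when the function is called.
function parseInert(html, ParserClass = globalThis.DOMParser) {
  const inactiveDocument = new ParserClass().parseFromString(html, 'text/html');
  return inactiveDocument.body;
}

// Browser usage:
//   const body = parseInert('<img src="https://example.com/a.png" onerror="alert(1)">');
//   const img = body.querySelector('img');
// The <img> element can be inspected like any other element, but because it
// belongs to the inactive document, the image is not requested and the
// onerror handler is not called.
```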
function sanitizeHtml(html) {
  // construct an inactive document by
  // parsing the html with DOMParser
  // in order to prevent any possible
  // code execution and http requests
  const inactiveDocument = new DOMParser()
    .parseFromString(html, 'text/html');
  const inputElement = inactiveDocument.documentElement;

  // construct the output element via the
  // inactive document in order to prevent
  // any possible code execution and http requests
  const outputElement = inactiveDocument.createElement('div');

  function* sanitizeRecursively(root) {
    for (const child of root.childNodes) {
      if (child instanceof HTMLElement) {
        try {
          // check if the element is in the list
          // of the supported types
          if (![
            'A', 'IMG', 'FONT', 'BR', 'B',
            'STRONG', 'I', 'EM', 'DEL', 'S', 'U',
            'P', 'HR', 'LI', 'UL', 'OL'
          ].includes(child.tagName)) {
            throw new Error(`${child.tagName} is not supported`);
          }

          // construct the new child via the inactive document
          // in order to prevent any possible
          // code execution and http requests
          const newChild = inactiveDocument
            .createElement(child.tagName);

          // handling the <a> tag
          if (
            newChild instanceof HTMLAnchorElement &&
            child instanceof HTMLAnchorElement
          ) {
            // new URL() throws on an empty or malformed href,
            // which also marks the element as invalid
            const url = new URL(child.href);

            // validate URL
            if (url.protocol !== 'https:' && url.protocol !== 'http:') {
              throw new Error(
                `href ${url.protocol} is not supported`
              );
            }
            newChild.href = url.href;

            // set target _blank if valid
            if (child.target === 'blank' || child.target === '_blank') {
              newChild.target = '_blank';
            }
          }
          // handling the <img> tag
          else if (
            newChild instanceof HTMLImageElement &&
            child instanceof HTMLImageElement
          ) {
            const url = new URL(child.src);

            // validate URL
            if (url.protocol !== 'https:' && url.protocol !== 'http:') {
              throw new Error(
                `src ${url.protocol} is not supported`
              );
            }
            newChild.src = url.href;
          }
          // handling the <font> tag
          else if (
            newChild instanceof HTMLFontElement &&
            child instanceof HTMLFontElement
          ) {
            // set size if present
            if (child.size) {
              newChild.size = child.size;
            }

            // set color if present
            if (child.color) {
              newChild.color = child.color;
            }
          }

          // append children
          newChild.append(...sanitizeRecursively(child));
          yield newChild;
        } catch (e) {
          console.error(e);

          // if some validation error occurred just try
          // to recursively copy the children
          yield* sanitizeRecursively(child);
        }
      } else if (
        child instanceof Node &&
        child.nodeType === Node.TEXT_NODE
      ) {
        // copying text nodes
        yield child.cloneNode(true);
      }
    }
  }

  // filling with the copied children
  outputElement.append(...sanitizeRecursively(inputElement));
  return outputElement.innerHTML;
}
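The generator in sanitizeHtml relies on two operations: `yield` emits a rebuilt element, while `yield*` splices the children of a rejected element into the parent's output. The same pattern is easier to follow on a simplified stand-in for DOM nodes. The sketch below uses plain objects for elements and strings for text, an assumption made so it runs without a DOM. `ALLOWED` and `keepAllowed` are hypothetical names:

```javascript
// Simplified stand-in for the supported tag list.
const ALLOWED = new Set(['B', 'P']);

// Strings play the role of text nodes, objects the role of elements.
function* keepAllowed(nodes) {
  for (const node of nodes) {
    if (typeof node === 'string') {
      // text is always kept
      yield node;
    } else if (ALLOWED.has(node.tag)) {
      // supported element: rebuild it with sanitized children
      yield { tag: node.tag, children: [...keepAllowed(node.children)] };
    } else {
      // unsupported element: drop the wrapper, keep its sanitized children
      yield* keepAllowed(node.children);
    }
  }
}

const input = [
  { tag: 'DIV', children: ['a', { tag: 'B', children: ['b'] }] },
  'c',
];

console.log(JSON.stringify([...keepAllowed(input)]));
// prints: ["a",{"tag":"B","children":["b"]},"c"]
```

The unsupported DIV disappears, but its text and its supported B child move up one level, which is how the sanitizer treats unsupported or invalid elements.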
Since Node.js doesn't provide the DOM API natively, you can use the jsdom package if you want to run this sanitizer on Node.js.
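One possible way to wire this up is to copy the DOM constructors that sanitizeHtml reads from the global scope out of a jsdom window. This is a sketch under assumptions: `installDomGlobals` and `DOM_GLOBALS` are hypothetical names, and the jsdom usage is shown only in comments because it requires installing the package first:

```javascript
// Names that sanitizeHtml reads from the global scope
// (URL is already a global in Node.js).
const DOM_GLOBALS = [
  'DOMParser', 'Node', 'HTMLElement',
  'HTMLAnchorElement', 'HTMLImageElement', 'HTMLFontElement',
];

// Hypothetical helper: copies the DOM constructors from a window-like object
// (for example a jsdom window) onto a target object, globalThis by default.
function installDomGlobals(sourceWindow, target = globalThis) {
  for (const name of DOM_GLOBALS) {
    if (typeof sourceWindow[name] !== 'function') {
      throw new TypeError(`${name} is missing from the provided window`);
    }
    target[name] = sourceWindow[name];
  }
  return target;
}

// Assumed usage with jsdom installed (npm install jsdom):
//   const { JSDOM } = require('jsdom');
//   installDomGlobals(new JSDOM('').window);
//   const clean = sanitizeHtml('<b onclick="x()">hi</b>');
```

All constructors have to come from the same jsdom window, otherwise the `instanceof` checks inside sanitizeHtml would fail for elements created by a different window.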