How to sanitize HTML text using only vanilla DOM API
What is sanitization and why is it important?
Many websites, especially content management systems, rely heavily on dynamic text content such as saved rich text, comments, and posts. Such content is often stored as raw HTML that is embedded directly into the page markup. This creates security issues if the website does not validate the HTML text. Such issues are often exploited in Cross-Site Scripting (XSS) attacks. Some common security problems are:
- Inline scripts, which can contain potentially malicious code. For example:
  <script> alert("Potentially malicious code"); </script>
- Inline styles, which don't execute code by themselves but can still modify or break the website UI. For example:
  <style> body { background: red; } </style>
- Dangerous HTML tags and attributes. This is also related to the previous two points to some extent. Some HTML elements, especially when used with certain attributes, can also have dangerous behaviors. For example:
  <a href="javascript:alert('Potentially malicious link');">Potentially malicious link</a>
  <button onclick="alert('Potentially malicious button');">Potentially malicious button</button>
  <img src="some_invalid_url" onerror="javascript:alert('Potentially malicious error callback');">
Sanitization is the removal of such dangerous HTML fragments, or their replacement with safe HTML text. This is why HTML sanitization is so important, especially for websites with dynamic content.
Implementing a basic HTML sanitizer
Here I'll implement a basic HTML sanitizer using a JS generator function and DOMParser.
If you are wondering why Element.innerHTML is not used, it's because Element.innerHTML is not safe enough. <script> tags inserted this way are not executed, but network requests can still be sent and the associated event listeners can still run, even when the element is not attached to the document body:
const element = document.createElement("div");
element.innerHTML = `<script>alert("This one will not be executed, good")</script>
<img src="some_invalid_url" onerror="javascript:alert('But this one will be executed!');">`;
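For comparison, here is a minimal sketch of parsing the same kind of markup with DOMParser instead. The helper name parseInert and its injectable ParserClass parameter are assumptions made for illustration; the parameter exists only so the helper can be exercised outside a browser. In a browser, the parsed document is not expected to load the image or run the onerror handler, and the handler simply remains in the tree as an attribute string.

```javascript
// Parse markup into a separate, inactive document instead of the live page.
// `parseInert` and `ParserClass` are hypothetical names used for this sketch.
function parseInert(html, ParserClass = globalThis.DOMParser) {
  return new ParserClass().parseFromString(html, 'text/html');
}

// Browser-only demonstration; skipped where DOMParser does not exist.
if (typeof DOMParser !== 'undefined') {
  const doc = parseInert(
    `<img src="some_invalid_url" onerror="alert('not expected to run')">`
  );
  const img = doc.body.firstElementChild;
  // The element is present in the parsed tree and can be inspected safely.
  console.log(img.tagName); // IMG
  // The handler is still there, but only as text in an attribute.
  console.log(img.getAttribute('onerror'));
}
```

The parsed tree is therefore something that can be walked and filtered before anything reaches the visible page.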
Now let's get back to the sanitizer. The sanitizer does the following:
- Supports the following elements:
  - <a>. The href attribute must be a valid URL that starts with http:// or https://, otherwise the link is not valid and the element is considered invalid. The element also supports the target attribute. If the target attribute is specified as _blank or blank, the target in the sanitized HTML will be _blank, otherwise the attribute will not be added.
  - <img>. The src attribute must be a valid URL that starts with http:// or https://, otherwise the image is not valid and the element is considered invalid.
  - <font>. It supports the color and size attributes. No validation is done for these attributes, since they are harmless even if they are invalid.
  - <br>, <b>, <strong>, <i>, <em>, <del>, <s>, <u>, <p>, <hr>, <li>, <ul>, <ol>.
  - Text nodes.
- If an element is unsupported or invalid, the element itself is dropped, but its nested elements and text nodes are sanitized recursively and added in its place.
- Only the supported tags and attributes are copied; unsupported attributes are always ignored.
- The output contains no unclosed tags, because it is serialized from a DOM tree.
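The URL and target rules above can be sketched as two small pure functions. The names parseHttpUrl and normalizeTarget are hypothetical; the sanitizer in this article inlines the same checks rather than calling helpers. Note that this sketch expects an absolute URL string, whereas the sanitizer reads the href and src properties, which the browser has already resolved to absolute URLs.

```javascript
// Hypothetical helper: returns the normalized URL if it is http(s), otherwise null.
function parseHttpUrl(value) {
  let url;
  try {
    url = new URL(value);
  } catch {
    return null; // not parseable as an absolute URL
  }
  return url.protocol === 'http:' || url.protocol === 'https:'
    ? url.href
    : null;
}

// Hypothetical helper: only "_blank" (or the common typo "blank") survives.
function normalizeTarget(target) {
  return target === '_blank' || target === 'blank' ? '_blank' : null;
}

console.log(parseHttpUrl('https://example.com/a?b=1')); // https://example.com/a?b=1
console.log(parseHttpUrl("javascript:alert('x')"));     // null
console.log(parseHttpUrl('not a url'));                 // null
console.log(normalizeTarget('blank'));                  // _blank
console.log(normalizeTarget('_self'));                  // null
```

Checking the parsed protocol, rather than matching the raw string with a prefix test, avoids being fooled by leading whitespace or mixed-case schemes.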
The code
Both the parsing and the construction of the elements are done through the document returned by DOMParser, because constructing elements via the methods of the active document (window.document) and innerHTML can still send HTTP requests and execute callbacks such as onload or onerror.
function sanitizeHtml(html) {
  // construct an inactive document by
  // parsing the html with DOMParser
  // in order to prevent any possible
  // code execution and http requests
  const inactiveDocument = new DOMParser()
    .parseFromString(html, 'text/html');
  const inputElement = inactiveDocument.documentElement;
  // construct the output element via the
  // inactive document in order to prevent
  // any possible code execution and http requests
  const outputElement = inactiveDocument.createElement('div');

  function* sanitizeRecursively(root) {
    for (const child of root.childNodes) {
      if (child instanceof HTMLElement) {
        try {
          // check if the element is in the list
          // of the supported types
          if (![
            'A', 'IMG', 'FONT', 'BR', 'B',
            'STRONG', 'I', 'EM', 'DEL', 'S', 'U',
            'P', 'HR', 'LI', 'UL', 'OL'
          ].includes(child.tagName)) {
            throw new Error(`${child.tagName} is not supported`);
          }
          // construct the new child via the inactive document
          // in order to prevent any possible
          // code execution and http requests
          const newChild = inactiveDocument
            .createElement(child.tagName);
          // handling the <a> tag
          if (
            newChild instanceof HTMLAnchorElement &&
            child instanceof HTMLAnchorElement
          ) {
            // new URL() throws if href is missing or malformed
            const url = new URL(child.href);
            // validate URL
            if (url.protocol !== 'https:' && url.protocol !== 'http:') {
              throw new Error(
                `href ${url.protocol} is not supported`
              );
            }
            newChild.href = url.href;
            // set target _blank if valid
            if (child.target === 'blank' || child.target === '_blank') {
              newChild.target = '_blank';
            }
          }
          // handling the <img> tag
          else if (
            newChild instanceof HTMLImageElement &&
            child instanceof HTMLImageElement
          ) {
            // new URL() throws if src is missing or malformed
            const url = new URL(child.src);
            // validate URL
            if (url.protocol !== 'https:' && url.protocol !== 'http:') {
              throw new Error(
                `src ${url.protocol} is not supported`
              );
            }
            newChild.src = url.href;
          }
          // handling the <font> tag
          else if (
            newChild instanceof HTMLFontElement &&
            child instanceof HTMLFontElement
          ) {
            // set size if present
            if (child.size) {
              newChild.size = child.size;
            }
            // set color if present
            if (child.color) {
              newChild.color = child.color;
            }
          }
          // append children
          newChild.append(...sanitizeRecursively(child));
          yield newChild;
        } catch (e) {
          console.error(e);
          // if some validation error occurred just try
          // to recursively copy the children
          // (this also unwraps <html>, <head> and <body>,
          // which are not in the supported list)
          yield* sanitizeRecursively(child);
        }
      } else if (
        child instanceof Node &&
        child.nodeType === Node.TEXT_NODE
      ) {
        // copying text nodes
        yield child.cloneNode(true);
      }
    }
  }

  // filling with the copied children
  outputElement.append(...sanitizeRecursively(inputElement));
  return outputElement.innerHTML;
}
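The generator pattern used here can be seen in isolation on a plain object tree, without any DOM. This is only a sketch: the node shape ({ tag, children } for elements, strings for text), the ALLOWED set, and the function name keepAllowed are all hypothetical. It shows how yield emits a rebuilt element while yield* drops the wrapper of an unsupported element and promotes its sanitized children.

```javascript
// Hypothetical allow-list for the demo tree (not the article's full list).
const ALLOWED = new Set(['B', 'P']);

function* keepAllowed(nodes) {
  for (const node of nodes) {
    if (typeof node === 'string') {
      // text is always kept
      yield node;
    } else if (ALLOWED.has(node.tag)) {
      // allowed element: rebuild it with sanitized children
      yield { tag: node.tag, children: [...keepAllowed(node.children)] };
    } else {
      // disallowed element: drop the wrapper, promote its sanitized children
      yield* keepAllowed(node.children);
    }
  }
}

const input = [
  { tag: 'DIV', children: ['a', { tag: 'B', children: ['b'] }] },
  { tag: 'P', children: [{ tag: 'SPAN', children: ['c'] }] },
];

console.log(JSON.stringify([...keepAllowed(input)]));
// ["a",{"tag":"B","children":["b"]},{"tag":"P","children":["c"]}]
```

Because a generator yields zero, one, or many nodes per input node, the caller can spread the result straight into append() without building intermediate arrays for the unwrapped case.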
The code can be easily extended to handle other HTML tags, such as table elements (including rows, cells, etc), iframes, headings, etc.
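As a rough sketch of such an extension, the supported list could be widened with headings and table elements, with a per-tag attribute allow-list for table cells. Everything here is an assumption for illustration: the constant names, the choice of colspan and rowspan, and the rule that their values must be small positive integers are not part of the sanitizer above. Elements that load resources, such as iframes, would additionally need the same kind of URL validation as <img>.

```javascript
// The tags supported by the sanitizer in this article.
const BASE_TAGS = [
  'A', 'IMG', 'FONT', 'BR', 'B', 'STRONG', 'I', 'EM',
  'DEL', 'S', 'U', 'P', 'HR', 'LI', 'UL', 'OL',
];
// Hypothetical additions.
const HEADING_TAGS = ['H1', 'H2', 'H3', 'H4', 'H5', 'H6'];
const TABLE_TAGS = ['TABLE', 'THEAD', 'TBODY', 'TFOOT', 'TR', 'TH', 'TD'];

const SUPPORTED_TAGS = new Set([...BASE_TAGS, ...HEADING_TAGS, ...TABLE_TAGS]);

// Hypothetical per-tag attribute allow-list for the new elements.
const SAFE_ATTRIBUTES = new Map([
  ['TD', ['colspan', 'rowspan']],
  ['TH', ['colspan', 'rowspan']],
]);

// Copy only allow-listed attributes whose value is an integer from 1 to 999.
function copySafeAttributes(from, to) {
  for (const name of SAFE_ATTRIBUTES.get(from.tagName) ?? []) {
    const value = from.getAttribute(name);
    if (value !== null && /^[1-9][0-9]{0,2}$/.test(value)) {
      to.setAttribute(name, value);
    }
  }
}
```

Inside the sanitizer, the tag check would become SUPPORTED_TAGS.has(child.tagName), and copySafeAttributes(child, newChild) would run before the children are appended.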
Since Node.js doesn't provide the DOM API natively, you can use the jsdom package if you want to run this on Node.js. But in that case it's better to use a dedicated library such as DOMPurify, since you will be relying on a third-party library anyway.
Element.setHTML() will simplify things a lot
Web standards are also starting to catch up on this problem. The newly introduced Element.setHTML(), part of the HTML Sanitizer API, will solve many of the pains and complexities of HTML sanitization. As of October 2025, Element.setHTML() has very limited support and is available in Firefox Nightly, so you can start to experiment with it there. Hopefully it will be supported by all the major browsers soon.
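Until support is widespread, one option is to feature-detect the method and fall back to a custom sanitizer. The sketch below assumes a hypothetical wrapper named setSanitizedHtml whose fallbackSanitize parameter is any function that takes an HTML string and returns a sanitized one, for example the sanitizeHtml function from this article. The native method is spelled setHTML, with HTML in capitals.

```javascript
// Hypothetical wrapper: prefer the native sanitizer, otherwise use a custom one.
// `fallbackSanitize` is an assumed parameter: (html: string) => string.
function setSanitizedHtml(element, html, fallbackSanitize) {
  if (typeof element.setHTML === 'function') {
    // Native path: the browser parses and sanitizes with its default configuration.
    element.setHTML(html);
  } else {
    // Fallback path: sanitize first, then assign the already-safe markup.
    element.innerHTML = fallbackSanitize(html);
  }
}
```

A call might look like setSanitizedHtml(container, untrustedHtml, sanitizeHtml), where container is an element on the page. Keep in mind that the two paths may not produce identical output, since the browser's default configuration and a hand-written allow-list can differ.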