How to sanitize HTML text using only vanilla DOM API

DOM sanitization

What is sanitization and why is it important?

Many websites, especially content management systems, rely heavily on dynamic text content such as saved rich text, comments, and posts. This content is often stored as raw HTML that is embedded directly into the page. If the website does not validate that HTML, it opens security holes that are commonly exploited in Cross-Site Scripting (XSS) attacks. Common problems include <script> elements, inline event handlers such as onload or onerror, and javascript: URLs in links.

Sanitization is the removal of such dangerous HTML fragments, or their replacement with safe HTML text. This is why HTML sanitization matters so much, especially for websites with dynamic content.
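The "replacement" option mentioned above can be illustrated with a plain escaping helper. This is only a minimal sketch: escapeHtml is a hypothetical name, it is not part of the sanitizer built in this article, and it escapes everything rather than keeping a set of allowed tags.

```javascript
// Hypothetical helper: turn markup characters into HTML entities so the
// browser renders the input as text instead of interpreting it as HTML.
const HTML_ENTITIES = {
  '&': '&amp;',
  '<': '&lt;',
  '>': '&gt;',
  '"': '&quot;',
  "'": '&#39;',
};

function escapeHtml(text) {
  // A single pass over the string, so an already produced "&amp;" is never escaped twice.
  return String(text).replace(/[&<>"']/g, (char) => HTML_ENTITIES[char]);
}

console.log(escapeHtml('<img src=x onerror="alert(1)">'));
// → &lt;img src=x onerror=&quot;alert(1)&quot;&gt;
```

Escaping is the safest choice when no formatting needs to survive. The sanitizer below is for the case where some tags must be kept.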

Implementing a basic HTML sanitizer

Here I'll implement a basic HTML sanitizer using a JavaScript generator function and DOMParser. It does the following:

  1. Supports the following elements:
    • <a>. The href attribute must be a valid URL that starts with http:// or https://, otherwise the element is considered invalid. The element also supports the target attribute. If target is specified as _blank or blank, the target in the sanitized HTML will be _blank, otherwise the attribute will not be added.
    • <img>. The src attribute must be a valid URL that starts with http:// or https://, otherwise the element is considered invalid.
    • <font>. It supports the color and size attributes. These attributes are not validated, since they are harmless even if they are invalid.
    • <br>.
    • <b>.
    • <strong>.
    • <i>.
    • <em>.
    • <del>.
    • <s>.
    • <u>.
    • <p>.
    • <hr>.
    • <li>.
    • <ul>.
    • <ol>.
    • Text nodes.
  2. If an element is unsupported or invalid, the element itself is dropped, but its nested elements and text nodes are still processed recursively and added to the output.
  3. Only the supported tags and attributes are copied; unsupported attributes are always ignored.
  4. There will be no unclosed tags.
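The rules in the list above can be sketched as three small standalone helpers. The names SUPPORTED_TAGS, isSupportedTag, toHttpUrl and normalizeTarget are assumptions made for this illustration. One difference from a DOM-based check: toHttpUrl takes a raw string and has no base URL, so relative URLs are rejected here.

```javascript
// Allowlist from the list above, in the upper-case form that
// `tagName` reports for HTML elements.
const SUPPORTED_TAGS = new Set([
  'A', 'IMG', 'FONT', 'BR', 'B', 'STRONG', 'I', 'EM',
  'DEL', 'S', 'U', 'P', 'HR', 'LI', 'UL', 'OL',
]);

function isSupportedTag(tagName) {
  return SUPPORTED_TAGS.has(String(tagName).toUpperCase());
}

// Returns the normalized URL if it is absolute and uses http(s), otherwise null.
function toHttpUrl(value) {
  try {
    const url = new URL(value); // throws on strings that are not absolute URLs
    return url.protocol === 'http:' || url.protocol === 'https:'
      ? url.href
      : null;
  } catch {
    return null;
  }
}

// Only "blank" and "_blank" are accepted; both map to "_blank".
function normalizeTarget(target) {
  return target === 'blank' || target === '_blank' ? '_blank' : null;
}

console.log(isSupportedTag('script'));         // false
console.log(toHttpUrl('https://example.com/')); // "https://example.com/"
console.log(toHttpUrl('javascript:alert(1)'));  // null
console.log(normalizeTarget('blank'));          // "_blank"
```

Checking the parsed protocol instead of matching the string prefix avoids being fooled by leading whitespace or mixed-case schemes, because the URL parser normalizes them first.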

The code

Both parsing and element construction are done through DOMParser, because elements constructed via the methods of the active document (window.document) or via innerHTML can still send HTTP requests and execute callbacks such as onload or onerror.
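As a rough illustration of that point, the sketch below parses untrusted markup into a document that is not attached to the page. parseInert is a hypothetical helper name, and the demonstration part only runs where DOMParser exists, so it is skipped in plain Node.js.

```javascript
// Hypothetical helper: parse untrusted markup with DOMParser so the
// resulting document has no browsing context.
function parseInert(html, Parser = globalThis.DOMParser) {
  if (typeof Parser !== 'function') {
    throw new Error('DOMParser is not available in this environment');
  }
  return new Parser().parseFromString(html, 'text/html');
}

// Browser-only demonstration.
if (typeof DOMParser === 'function') {
  const doc = parseInert(
    '<img src="https://example.com/x.png" onerror="alert(1)">'
  );
  const img = doc.querySelector('img');

  // The handler is still present as a plain attribute string. Parsing alone
  // removes nothing, so the sanitizer has to drop it explicitly.
  console.log(img.getAttribute('onerror'));

  // A document produced by DOMParser has no window attached to it.
  console.log(doc.defaultView); // null
}
```

Parsing this way only makes inspection safe. The output becomes safe only because the sanitizer copies allowed tags and attributes into fresh elements.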

function sanitizeHtml(html) {
  // construct an inactive document by
  // parsing the html with DOMParser
  // in order to prevent any possible
  // code execution and http requests
  const inactiveDocument = new DOMParser()
    .parseFromString(html, 'text/html');
  const inputElement = inactiveDocument.documentElement;

  // construct the output element via the
  // inactive document in order to prevent
  // any possible code execution and http requests
  const outputElement = inactiveDocument.createElement('div');

  function* sanitizeRecursively(root) {
    for (const child of root.childNodes) {
      if (child instanceof HTMLElement) {
        try {
          // check if the element is in the list
          // of the supported types
          if (![
            'A', 'IMG', 'FONT', 'BR', 'B', 'STRONG', 'I', 'EM',
            'DEL', 'S', 'U', 'P', 'HR', 'LI', 'UL', 'OL'
          ].includes(child.tagName)) {
            throw new Error(`${child.tagName} is not supported`);
          }

          // construct the new child via the inactive document
          // in order to prevent any possible
          // code execution and http requests
          const newChild = inactiveDocument
            .createElement(child.tagName);

          // handling the <a> tag
          if (
            newChild instanceof HTMLAnchorElement &&
            child instanceof HTMLAnchorElement
          ) {
            const url = new URL(child.href);
            // validate URL
            if (url.protocol !== 'https:' && url.protocol !== 'http:') {
              throw new Error(
                `href ${url.protocol} is not supported`
              );
            }
            newChild.href = url.href;
            // set target _blank if valid
            if (child.target === 'blank' || child.target === '_blank') {
              newChild.target = '_blank';
            }
          }
          // handling the <img> tag
          else if (
            newChild instanceof HTMLImageElement &&
            child instanceof HTMLImageElement
          ) {
            const url = new URL(child.src);
            // validate URL
            if (url.protocol !== 'https:' && url.protocol !== 'http:') {
              throw new Error(
                `src ${url.protocol} is not supported`
              );
            }
            newChild.src = url.href;
          }
          // handling the <font> tag
          else if (
            newChild instanceof HTMLFontElement &&
            child instanceof HTMLFontElement
          ) {
            // set size if present
            if (child.size) {
              newChild.size = child.size;
            }
            // set color if present
            if (child.color) {
              newChild.color = child.color;
            }
          }

          // append children
          newChild.append(...sanitizeRecursively(child));
          yield newChild;
        } catch (e) {
          console.error(e);
          // if some validation error occurred just try
          // to recursively copy the children
          yield* sanitizeRecursively(child);
        }
      } else if (
        child instanceof Node &&
        child.nodeType === Node.TEXT_NODE
      ) {
        // copying text nodes
        yield child.cloneNode(true);
      }
    }
  }

  // filling with the copied children
  outputElement.append(...sanitizeRecursively(inputElement));
  return outputElement.innerHTML;
}

Since Node.js has no native DOM API, you can use the jsdom package if you want to run this in Node.js.
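One possible way to wire that up is sketched below. installDomGlobals and DOM_GLOBALS are hypothetical names for this illustration. The idea is to copy the DOM constructors that sanitizeHtml looks up as globals from a jsdom window onto the Node.js global object. URL is not in the list because Node.js already provides it. The jsdom usage is left as comments, since it assumes the package has been installed.

```javascript
// DOM constructors that the sanitizer above expects to find as globals.
const DOM_GLOBALS = [
  'DOMParser',
  'Node',
  'HTMLElement',
  'HTMLAnchorElement',
  'HTMLImageElement',
  'HTMLFontElement',
];

// Hypothetical helper: copy those constructors from a window-like object
// (for example a jsdom window) onto `target`.
function installDomGlobals(windowLike, target = globalThis) {
  for (const name of DOM_GLOBALS) {
    if (typeof windowLike[name] !== 'function') {
      throw new Error(`${name} is missing on the provided window`);
    }
    target[name] = windowLike[name];
  }
  return target;
}

// Usage in Node.js, assuming `npm install jsdom` has been run and
// sanitizeHtml is defined in the same module:
//
//   const { JSDOM } = require('jsdom');
//   installDomGlobals(new JSDOM('').window);
//   console.log(sanitizeHtml('<b onclick="x()">hi</b>'));
```

All constructors must come from the same jsdom window, because the instanceof checks in the sanitizer only match elements created in that window.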




