Clean HTML String to get Safe HTML from Untrusted HTML in Java using jsoup

Tags: jsoup HTML Parser

Introduction

The jsoup library provides static methods Jsoup.clean() to allow cleaning the HTML String from untrusted input. This feature can be used to sanitizer input of your web application in order to prevent XSS attacks. In this tutorial, we show you how to use this feature to get the safe HTML String from an untrusted HTML input.

Add jsoup library to your Java project

To use jsoup Java library in the Gradle build project, add the following dependency into the build.gradle file.

compile 'org.jsoup:jsoup:1.13.1'

To use jsoup Java library in the Maven build project, add the following dependency into the pom.xml file.

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.13.1</version>
</dependency>

To download the jsoup-1.13.1.jar file you can visit jsoup download page at jsoup.org/download

How to clean HTML String using jsoup library

To clean a HTML String we use Jsoup.clean() static method with a given HTML String and Whitelist object which defines a list of white listed tags.

String outputHtml = Jsoup.clean(htmlContent, Whitelist.basic());

Whitelist class

To support cleaning HTML String jsoup provides the whitelist approach which we can choose a set of HTML tags to allow in HTML String and remove all other tags.

There are some re-defined whitelists in jsoup library.

  • Whitelist.none() allows only text and removes all HTML tags.
  • Whitelist.simpleText() allows b, em, i, strong, u tag and removes other tags.
  • Whitelist.basic() allow a, b, blockquote, br, cite, code, dd, dl, dt, em, i, li, ol, p, pre, q, small, span, strike, strong, sub, sup, u, ul and removes other tags. Whitelist.basicWithImages() allow a, b, blockquote, br, cite, code, dd, dl, dt, em, i, li, ol, p, pre, q, small, span, strike, strong, sub, sup, u, ul and img tag with appropriate attributes and the src attribute point to http or https.
  • Whitelist.relaxed() allow all of text and structural tag a, b, blockquote, br, caption, cite, code, col, colgroup, dd, div, dl, dt, em, h1, h2, h3, h4, h5, h6, i, img, li, ol, p, pre, q, small, span, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, u, ul

Example Jsoup.clean() with Whitelist.none()

import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

public class CleanHtmlWhitelistNoneExample {
    public static void main(String... args) {
        String htmlContent = "<div><p>Simple <b>Solution</b></p></div>";

        String outputHtml = Jsoup.clean(htmlContent, Whitelist.none());

        System.out.println("Input String: " + htmlContent);
        System.out.println("Output String: " + outputHtml);
    }
}
Output:
Input String: <div><p>Simple <b>Solution</b></p></div>
Output String: Simple Solution

Example Jsoup.clean() with Whitelist.simpleText()

import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

public class CleanHtmlWhitelistSimpleTextExample {
    public static void main(String... args) {
        String htmlContent = "<div><p>Simple <b>Solution</b></p></div>";

        String outputHtml = Jsoup.clean(htmlContent, Whitelist.simpleText());

        System.out.println("Input String: " + htmlContent);
        System.out.println("Output String: " + outputHtml);
    }
}
Output:
Input String: <div><p>Simple <b>Solution</b></p></div>
Output String: Simple <b>Solution</b>

Example Jsoup.clean() with Whitelist.basic()

import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

public class CleanHtmlWhitelistBasicExample {
    public static void main(String... args) {
        String htmlContent = "<div><p>Simple <b>Solution</b></p><img src='https://simplesolution.dev/images/Logo_S_v1.png'/></div>";

        String outputHtml = Jsoup.clean(htmlContent, Whitelist.basic());

        System.out.println("Input String: " + htmlContent);
        System.out.println("Output String: " + outputHtml);
    }
}
Output:
Input String: <div><p>Simple <b>Solution</b></p><img src='https://simplesolution.dev/images/Logo_S_v1.png'/></div>
Output String: <p>Simple <b>Solution</b></p>

Example Jsoup.clean() with Whitelist.basicWithImages()

import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

public class CleanHtmlWhitelistBasicWithImagesExample {
    public static void main(String... args) {
        String htmlContent = "<div><p>Simple <b>Solution</b></p><img src='https://simplesolution.dev/images/Logo_S_v1.png'/></div>";

        String outputHtml = Jsoup.clean(htmlContent, Whitelist.basicWithImages());

        System.out.println("Input String: " + htmlContent);
        System.out.println("Output String: " + outputHtml);
    }
}
Output:
Input String: <div><p>Simple <b>Solution</b></p><img src='https://simplesolution.dev/images/Logo_S_v1.png'/></div>
Output String: <p>Simple <b>Solution</b></p>
<img src="https://simplesolution.dev/images/Logo_S_v1.png">

Example Jsoup.clean() with Whitelist.relaxed()

import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

public class CleanHtmlWhitelistRelaxedExample {
    public static void main(String... args) {
        String htmlContent = "<div><p>Simple <b>Solution</b></p><img src='https://simplesolution.dev/images/Logo_S_v1.png'/></div>";

        String outputHtml = Jsoup.clean(htmlContent, Whitelist.relaxed());

        System.out.println("Input String: " + htmlContent);
        System.out.println("Output String: " + outputHtml);
    }
}
Output:
Input String: <div><p>Simple <b>Solution</b></p><img src='https://simplesolution.dev/images/Logo_S_v1.png'/></div>
Output String: <div>
 <p>Simple <b>Solution</b></p>
 <img src="https://simplesolution.dev/images/Logo_S_v1.png">
</div>

Happy Coding 😊

jsoup Get HTML elements by CSS class name in Java

jsoup Get HTML Element by ID in Java

jsoup Get HTML Elements by Tag Name in Java

jsoup Get HTML Elements by Attribute Name in Java

jsoup Get HTML Elements by Attribute Value in Java

jsoup Get All HTML Elements in Java