Clean HTML String to get Safe HTML from Untrusted HTML in Java using jsoup
Introduction
The jsoup library provides the static method Jsoup.clean() for cleaning HTML that comes from untrusted input. You can use it to sanitize user input in a web application and help prevent cross-site scripting (XSS) attacks. In this tutorial, we show how to use it to produce a safe HTML String from an untrusted HTML input.
Add jsoup library to your Java project
To use the jsoup Java library in a Gradle project, add the following dependency to the build.gradle file.
implementation 'org.jsoup:jsoup:1.13.1'
To use the jsoup Java library in a Maven project, add the following dependency to the pom.xml file.
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.13.1</version>
</dependency>
To download the jsoup-1.13.1.jar file directly, visit the jsoup download page at jsoup.org/download
How to clean HTML String using jsoup library
To clean an HTML String, we call the Jsoup.clean() static method with the HTML String and a Whitelist object that defines which tags and attributes are allowed.
String outputHtml = Jsoup.clean(htmlContent, Whitelist.basic());
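To make the XSS angle concrete, here is a minimal sketch modeled on the example in the jsoup documentation. The class name XssCleanSketch, the helper cleanUntrusted and the sample markup are illustrative assumptions, not part of jsoup, and the sketch assumes jsoup 1.13.1 is on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

class XssCleanSketch {

    // Hypothetical helper: cleans untrusted markup with the basic whitelist.
    static String cleanUntrusted(String untrustedHtml) {
        return Jsoup.clean(untrustedHtml, Whitelist.basic());
    }

    public static void main(String... args) {
        // Both the onclick handler and the script element are attack vectors.
        String untrusted =
                "<p><a href='http://example.com/' onclick='stealCookies()'>Link</a></p>"
                + "<script>alert('xss')</script>";
        System.out.println(cleanUntrusted(untrusted));
    }
}
```

In the cleaned output the onclick attribute and the script element are gone, the http link is kept, and the basic whitelist adds rel="nofollow" to it.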
Whitelist class
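To clean an HTML String, jsoup takes a whitelist approach: we choose the set of HTML tags (and attributes) to allow, and everything else is removed. Note that jsoup 1.14.2 renamed this class to Safelist and deprecated Whitelist; the examples here use the 1.13.1 API.
The jsoup library ships with several predefined whitelists.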
- Whitelist.none() allows only text and removes all HTML tags.
- Whitelist.simpleText() allows the b, em, i, strong and u tags and removes all other tags.
- Whitelist.basic() allows a, b, blockquote, br, cite, code, dd, dl, dt, em, i, li, ol, p, pre, q, small, span, strike, strong, sub, sup, u, ul and removes all other tags.
- Whitelist.basicWithImages() allows the same tags as Whitelist.basic(), plus the img tag with its appropriate attributes, where the src attribute must point to http or https.
- Whitelist.relaxed() allows the full range of text and structural tags: a, b, blockquote, br, caption, cite, code, col, colgroup, dd, div, dl, dt, em, h1, h2, h3, h4, h5, h6, i, img, li, ol, p, pre, q, small, span, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, u, ul.
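When none of the predefined whitelists fits, one option is to start from one and extend it. The following is a minimal sketch, assuming jsoup 1.13.1; the class name CustomWhitelistSketch, the helper basicWithHeadings and the choice of extra tags are arbitrary examples rather than a recommended policy.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

class CustomWhitelistSketch {

    // Starts from the basic whitelist and additionally permits h1/h2 headings
    // and a title attribute on links. The extra tags are an arbitrary example.
    static Whitelist basicWithHeadings() {
        return Whitelist.basic()
                .addTags("h1", "h2")
                .addAttributes("a", "title");
    }

    public static void main(String... args) {
        String htmlContent = "<h1>Title</h1><div><p>Simple <b>Solution</b></p></div>";
        // h1 is now kept; div is still removed because it was never added.
        System.out.println(Jsoup.clean(htmlContent, basicWithHeadings()));
    }
}
```

Extending a predefined whitelist keeps its existing protocol restrictions in place, which is usually safer than building a whitelist from scratch.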
Example Jsoup.clean() with Whitelist.none()
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

public class CleanHtmlWhitelistNoneExample {
    public static void main(String... args) {
        String htmlContent = "<div><p>Simple <b>Solution</b></p></div>";
        // Whitelist.none() strips every tag and keeps only the text.
        String outputHtml = Jsoup.clean(htmlContent, Whitelist.none());
        System.out.println("Input String: " + htmlContent);
        System.out.println("Output String: " + outputHtml);
    }
}
Input String: <div><p>Simple <b>Solution</b></p></div>
Output String: Simple Solution
Example Jsoup.clean() with Whitelist.simpleText()
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

public class CleanHtmlWhitelistSimpleTextExample {
    public static void main(String... args) {
        String htmlContent = "<div><p>Simple <b>Solution</b></p></div>";
        // Whitelist.simpleText() keeps the b tag but drops div and p.
        String outputHtml = Jsoup.clean(htmlContent, Whitelist.simpleText());
        System.out.println("Input String: " + htmlContent);
        System.out.println("Output String: " + outputHtml);
    }
}
Input String: <div><p>Simple <b>Solution</b></p></div>
Output String: Simple <b>Solution</b>
Example Jsoup.clean() with Whitelist.basic()
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

public class CleanHtmlWhitelistBasicExample {
    public static void main(String... args) {
        String htmlContent = "<div><p>Simple <b>Solution</b></p><img src='https://simplesolution.dev/images/Logo_S_v1.png'/></div>";
        // Whitelist.basic() keeps p and b but drops div and img.
        String outputHtml = Jsoup.clean(htmlContent, Whitelist.basic());
        System.out.println("Input String: " + htmlContent);
        System.out.println("Output String: " + outputHtml);
    }
}
Input String: <div><p>Simple <b>Solution</b></p><img src='https://simplesolution.dev/images/Logo_S_v1.png'/></div>
Output String: <p>Simple <b>Solution</b></p>
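In the example above the img tag is removed silently. If we would rather reject such input than strip it, jsoup also offers Jsoup.isValid(), which reports whether a body HTML fragment already conforms to a whitelist. Below is a minimal sketch, assuming jsoup 1.13.1; the class name ValidateHtmlSketch and the helper isAllowedByBasic are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

class ValidateHtmlSketch {

    // Hypothetical helper: true only if the fragment contains nothing
    // that Whitelist.basic() would have to remove.
    static boolean isAllowedByBasic(String bodyHtml) {
        return Jsoup.isValid(bodyHtml, Whitelist.basic());
    }

    public static void main(String... args) {
        String allowed = "<p>Simple <b>Solution</b></p>";
        String notAllowed = "<div><p>Simple <b>Solution</b></p></div>";
        System.out.println("allowed: " + isAllowedByBasic(allowed));
        System.out.println("notAllowed: " + isAllowedByBasic(notAllowed));
    }
}
```

Validation is useful for showing an error to the user, but the value that gets stored or rendered should still go through Jsoup.clean().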
Example Jsoup.clean() with Whitelist.basicWithImages()
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

public class CleanHtmlWhitelistBasicWithImagesExample {
    public static void main(String... args) {
        String htmlContent = "<div><p>Simple <b>Solution</b></p><img src='https://simplesolution.dev/images/Logo_S_v1.png'/></div>";
        // Whitelist.basicWithImages() also keeps the img tag with its https src.
        String outputHtml = Jsoup.clean(htmlContent, Whitelist.basicWithImages());
        System.out.println("Input String: " + htmlContent);
        System.out.println("Output String: " + outputHtml);
    }
}
Input String: <div><p>Simple <b>Solution</b></p><img src='https://simplesolution.dev/images/Logo_S_v1.png'/></div>
Output String: <p>Simple <b>Solution</b></p>
<img src="https://simplesolution.dev/images/Logo_S_v1.png">
Example Jsoup.clean() with Whitelist.relaxed()
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

public class CleanHtmlWhitelistRelaxedExample {
    public static void main(String... args) {
        String htmlContent = "<div><p>Simple <b>Solution</b></p><img src='https://simplesolution.dev/images/Logo_S_v1.png'/></div>";
        // Whitelist.relaxed() keeps div, p, b and img.
        String outputHtml = Jsoup.clean(htmlContent, Whitelist.relaxed());
        System.out.println("Input String: " + htmlContent);
        System.out.println("Output String: " + outputHtml);
    }
}
Input String: <div><p>Simple <b>Solution</b></p><img src='https://simplesolution.dev/images/Logo_S_v1.png'/></div>
Output String: <div>
 <p>Simple <b>Solution</b></p>
 <img src="https://simplesolution.dev/images/Logo_S_v1.png">
</div>
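The output above spans several lines because Jsoup.clean() pretty-prints its result by default. If a single-line result is preferred, for example before storing it in a database, the four-argument overload of Jsoup.clean() accepts output settings. The following is a minimal sketch, assuming jsoup 1.13.1; the class name CompactCleanSketch and the helper cleanCompact are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Whitelist;

class CompactCleanSketch {

    // Hypothetical helper: cleans with the relaxed whitelist and
    // turns off pretty-printing so no line breaks are inserted.
    static String cleanCompact(String bodyHtml) {
        Document.OutputSettings settings = new Document.OutputSettings().prettyPrint(false);
        // The empty base URI means relative links are not resolved.
        return Jsoup.clean(bodyHtml, "", Whitelist.relaxed(), settings);
    }

    public static void main(String... args) {
        String htmlContent = "<div><p>Simple <b>Solution</b></p><img src='https://simplesolution.dev/images/Logo_S_v1.png'/></div>";
        System.out.println("Output String: " + cleanCompact(htmlContent));
    }
}
```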
Happy Coding 😊