Convert HTML into Plain Text in Java using jsoup
Tags: jsoup HTML Parser Convert
Introduction
In this tutorial, we show how to use the jsoup library to convert HTML content into plain text, with all HTML tags removed, in a Java application.
Add jsoup library to your Java project
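To use the jsoup library in a Gradle project, add the following dependency to the build.gradle file. Current Gradle versions use the implementation configuration; the older compile configuration was removed in Gradle 7.
implementation 'org.jsoup:jsoup:1.13.1'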
To use the jsoup library in a Maven project, add the following dependency to the pom.xml file.
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.13.1</version>
</dependency>
To download the jsoup-1.13.1.jar file directly, visit the jsoup download page at jsoup.org/download
Convert HTML String into Plain Text
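In the Java application below, we call the Jsoup.clean() method with an empty Whitelist. An empty Whitelist permits no tags, so every HTML tag is removed and only the text content is returned. Note that newer jsoup releases replace Whitelist with Safelist; the examples here target version 1.13.1.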
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

public class ConvertHtmlToText {
    public static void main(String... args) {
        String htmlString = "<div><h1>Simple Solution</h1><p>Convert HTML to Text</p></div>";
        // An empty Whitelist allows no tags, so clean() strips all of them.
        String outputText = Jsoup.clean(htmlString, new Whitelist());
        System.out.println(outputText);
    }
}
Output:
Simple SolutionConvert HTML to Text
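Note that Jsoup.clean() returns sanitized HTML, so a character such as & stays escaped as &amp; in its result. When fully decoded text is wanted, one alternative is to parse the markup and call text() on the resulting document. The following is a minimal sketch of that approach; the class name PlainTextSketch and the helper toPlainText are illustrative names, not part of the original example.

```java
import org.jsoup.Jsoup;

public class PlainTextSketch {
    // Hypothetical helper: parses the HTML and returns the decoded,
    // whitespace-normalized text of the whole document.
    public static String toPlainText(String html) {
        return Jsoup.parse(html).text();
    }

    public static void main(String... args) {
        String htmlString = "<div><h1>Tom &amp; Jerry</h1><p>Convert HTML to Text</p></div>";
        System.out.println(toPlainText(htmlString));
    }
}
```

Unlike the clean() result above, text() generally separates block-level elements such as h1 and p with a space, so adjacent blocks do not run together.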
Convert HTML from Website into Plain Text
In the following example program, we combine Jsoup.connect(), which downloads the HTML content from a URL, with Jsoup.clean(), which then removes the HTML tags.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Whitelist;

import java.io.IOException;

public class ConvertHtmlToTextFromUrl {
    public static void main(String... args) {
        try {
            String url = "https://simplesolution.dev/";
            // Download and parse the page.
            Document document = Jsoup.connect(url).get();
            String htmlString = document.html();
            // Strip every tag from the downloaded HTML.
            String outputText = Jsoup.clean(htmlString, new Whitelist());
            System.out.println(outputText);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Output:
Simple Solution ...
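A live request can hang or be rejected, so in practice it may help to set a timeout and a user agent on the connection, and to read only the text of the page body. The following is a minimal sketch under those assumptions; the class name FetchPlainText, the helper bodyText, the 10-second timeout, and the user-agent string are all illustrative choices rather than values from the original example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;

public class FetchPlainText {
    // Hypothetical helper: returns the text of <body> only,
    // leaving out <head> content such as the page title.
    public static String bodyText(Document document) {
        return document.body().text();
    }

    public static void main(String... args) {
        String url = "https://simplesolution.dev/";
        try {
            Document document = Jsoup.connect(url)
                    .userAgent("Mozilla/5.0 (compatible; jsoup-example)") // assumed value
                    .timeout(10_000) // milliseconds; assumed value
                    .get();
            System.out.println(bodyText(document));
        } catch (IOException e) {
            // Network failures, timeouts, and HTTP error statuses surface here.
            System.err.println("Could not fetch " + url + ": " + e.getMessage());
        }
    }
}
```

Keeping the text extraction in its own method means it can be exercised on a Document built with Jsoup.parse(), without any network access.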
Convert HTML File into Plain Text
The following examples show how to read HTML content from a file and remove the HTML tags. Suppose we have a sample.html file with the following content.
<!DOCTYPE html>
<html>
<body>
<span class="test">Simple Solution</span>
</body>
</html>
Example 1: read the file content using Java NIO classes.
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ConvertHtmlToTextFromFile1 {
    public static void main(String... args) {
        try {
            String fileName = "sample.html";
            Path filePath = Paths.get(fileName);
            // Read the whole file and decode it as UTF-8.
            byte[] fileBytes = Files.readAllBytes(filePath);
            String htmlString = new String(fileBytes, StandardCharsets.UTF_8);
            // Strip every tag from the file content.
            String outputText = Jsoup.clean(htmlString, new Whitelist());
            System.out.println(outputText);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Output:
Simple Solution
Example 2: read the HTML file using the Jsoup.parse() method.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Whitelist;

import java.io.File;
import java.io.IOException;

public class ConvertHtmlToTextFromFile2 {
    public static void main(String... args) {
        try {
            String fileName = "sample.html";
            File file = new File(fileName);
            // Let jsoup read and parse the file, decoding it as UTF-8.
            Document document = Jsoup.parse(file, "UTF-8");
            String htmlString = document.html();
            // Strip every tag from the parsed HTML.
            String outputText = Jsoup.clean(htmlString, new Whitelist());
            System.out.println(outputText);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Output:
Simple Solution
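When only part of the file is needed, the parsed Document can be queried with a CSS selector before the text is extracted. The sketch below assumes the sample.html file shown earlier and selects its span.test element; the class name ExtractElementText and the helper textOf are hypothetical names introduced for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.File;
import java.io.IOException;

public class ExtractElementText {
    // Hypothetical helper: returns the combined text of all elements
    // in the document that match the given CSS selector.
    public static String textOf(Document document, String cssQuery) {
        return document.select(cssQuery).text();
    }

    public static void main(String... args) {
        try {
            Document document = Jsoup.parse(new File("sample.html"), "UTF-8");
            System.out.println(textOf(document, "span.test"));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
```

If the selector matches nothing, select() returns an empty collection and textOf yields an empty string rather than throwing.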
Happy Coding 😊