jsoup parse HTML Document from a File and InputStream in Java

Tags: Java jsoup HTML Parser

Introduction

In this tutorial, we explore how to use the jsoup library in a Java program to parse HTML from a local file or an input stream into a jsoup Document object.

What is jsoup?

jsoup is a Java library for working with real-world HTML. It provides a very convenient API for fetching URLs and extracting and manipulating data, using the best of HTML5 DOM methods and CSS selectors.

For more information about the library, visit the jsoup homepage at jsoup.org.
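
As a rough illustration of the CSS selector and DOM methods mentioned above, the following minimal sketch parses an in-memory HTML string and reads the text of the first matching element. The class name JsoupSelectorSketch and the helper textOf are hypothetical names chosen for this example; they are not part of jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupSelectorSketch {

    // Hypothetical helper: returns the text of the first element matching
    // the CSS selector, or an empty string when nothing matches.
    static String textOf(String html, String cssSelector) {
        Document document = Jsoup.parse(html);
        Element match = document.selectFirst(cssSelector);
        return match == null ? "" : match.text();
    }

    public static void main(String... args) {
        String html = "<p id='content'>jsoup Tutorial</p><a href='/java'>Java Tutorials</a>";

        // Select by id and by attribute presence
        System.out.println(textOf(html, "p#content"));
        System.out.println(textOf(html, "a[href]"));

        // Manipulation: replace the paragraph text through the DOM API
        Document document = Jsoup.parse(html);
        document.getElementById("content").text("Updated text");
        System.out.println(document.getElementById("content").text());
    }
}
```

Returning an empty string for a missing element is a design choice of this sketch. It avoids the NullPointerException that would occur if text() were called on a null result.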

Add jsoup library to your project

To use the jsoup library in a Gradle project, add the following dependency to the build.gradle file.

implementation 'org.jsoup:jsoup:1.13.1'

To use the jsoup library in a Maven project, add the following dependency to the pom.xml file.

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.13.1</version>
</dependency>

To download the jsoup-1.13.1.jar file, visit the jsoup download page at jsoup.org/download.

Sample HTML File

For example, suppose we have a sample.html file on the local machine with the following content.

<!DOCTYPE html>
<html>
    <head>
        <title>Simple Solution</title>
    </head>
    <body>
        <p id='content'>jsoup Tutorial</p>
        <a href="/java">Java Tutorials</a>
    </body>
</html>

Parse HTML File into jsoup Document

jsoup provides the static Jsoup.parse() method to parse an HTML file into a jsoup Document object. The second argument is the character set used to decode the file.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.io.File;
import java.io.IOException;

public class JsoupParseFileExample {
    public static void main(String... args) {
        try {
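            // sample.html is resolved relative to the current working directory
            String fileName = "sample.html";
            File file = new File(fileName);

            // Parse the file, decoding its content as UTF-8
            Document document = Jsoup.parse(file, "UTF-8");

            // Look up the <p id='content'> element by its id
            Element contentElement = document.getElementById("content");

            System.out.println("Document Title: " + document.title());
            System.out.println("Content Text: " + contentElement.text());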
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Output:
Document Title: Simple Solution
Content Text: jsoup Tutorial
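
Jsoup.parse() also has an overload for files that takes a base URI as a third argument, and the charset argument may be null so that jsoup detects the encoding from a byte order mark or meta tag and otherwise falls back to UTF-8. The following is a minimal sketch of that variant. The class name, the helper firstAbsoluteLink, and the temporary file are assumptions made so the example runs without sample.html.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class JsoupParseFileWithBaseUriExample {

    // Hypothetical helper: parses the file and returns the absolute URL of
    // the first link, or an empty string if the document has no link.
    static String firstAbsoluteLink(File file, String baseUri) throws IOException {
        // A null charset lets jsoup detect the encoding, falling back to UTF-8
        Document document = Jsoup.parse(file, null, baseUri);
        Element link = document.selectFirst("a[href]");
        return link == null ? "" : link.absUrl("href");
    }

    public static void main(String... args) throws IOException {
        // Write a small temporary file so the sketch does not depend on sample.html
        Path tempFile = Files.createTempFile("sample", ".html");
        String html = "<html><head><title>Simple Solution</title></head>"
                + "<body><a href=\"/java\">Java Tutorials</a></body></html>";
        Files.write(tempFile, html.getBytes(StandardCharsets.UTF_8));
        try {
            System.out.println(firstAbsoluteLink(tempFile.toFile(), "https://simplesolution.dev"));
        } finally {
            // Clean up the temporary file
            Files.deleteIfExists(tempFile);
        }
    }
}
```

Without a base URI, relative links such as /java cannot be resolved, and absUrl() returns an empty string.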

Parse HTML Document from an InputStream

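jsoup can also parse an HTML document from an InputStream. The third argument is the base URI, which jsoup uses to resolve relative links such as /java into absolute URLs.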

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class JsoupParseInputStreamExample {
    public static void main(String... args) {
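        String fileName = "sample.html";
        Path filePath = Paths.get(fileName);

        // try-with-resources closes the stream even if parsing fails
        try (InputStream inputStream = Files.newInputStream(filePath)) {
            // The base URI is used to resolve relative links into absolute URLs
            Document document = Jsoup.parse(inputStream, "UTF-8", "https://simplesolution.dev");

            // Select the first <a> element in the document
            Element linkElement = document.select("a").first();

            System.out.println("Text: " + linkElement.text());
            System.out.println("URL: " + linkElement.attr("href"));
            // The abs: prefix resolves the attribute value against the base URI
            System.out.println("Absolute URL: " + linkElement.attr("abs:href"));
        } catch (IOException e) {
            e.printStackTrace();
        }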
    }
}

Output:
Text: Java Tutorials
URL: /java
Absolute URL: https://simplesolution.dev/java
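
The InputStream does not have to come from a file. As a minimal sketch, the same Jsoup.parse() overload can read HTML that is already held in memory as bytes, for example the body of a downloaded response. The class name and the helper parseHtmlBytes are hypothetical names used only for this illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class JsoupParseInMemoryStreamExample {

    // Hypothetical helper: wraps the bytes in a stream and parses them as UTF-8.
    static Document parseHtmlBytes(byte[] htmlBytes, String baseUri) throws IOException {
        // try-with-resources closes the stream once parsing is done
        try (InputStream inputStream = new ByteArrayInputStream(htmlBytes)) {
            return Jsoup.parse(inputStream, StandardCharsets.UTF_8.name(), baseUri);
        }
    }

    public static void main(String... args) throws IOException {
        String html = "<html><head><title>Simple Solution</title></head>"
                + "<body><p id='content'>jsoup Tutorial</p></body></html>";
        byte[] htmlBytes = html.getBytes(StandardCharsets.UTF_8);

        Document document = parseHtmlBytes(htmlBytes, "https://simplesolution.dev");

        System.out.println("Document Title: " + document.title());
        System.out.println("Content Text: " + document.getElementById("content").text());
    }
}
```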

Happy Coding 😊
