Jsoup

Topics related to Jsoup:

Getting started with Jsoup

Jsoup is an HTML parsing and data extraction library for Java, focused on flexibility and ease of use. It can be used to extract specific data from HTML pages (commonly known as "web scraping"), to modify the content of HTML pages, and to "clean" untrusted HTML with a whitelist of allowed tags and attributes.
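As a rough illustration of the extraction and modification use cases described above, the following sketch parses an HTML string, reads the text of an element, and rewrites an attribute. The class name JsoupQuickStart, its helper methods, and the sample HTML are made up for this example; only Jsoup.parse, select, first, text, attr, body and html are jsoup calls.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupQuickStart {

    // Returns the text of the first element matching the CSS query, or "" if nothing matches.
    static String firstText(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        Element match = doc.select(cssQuery).first();
        return match == null ? "" : match.text();
    }

    // Sets rel="nofollow" on every link that has an href and returns the modified body HTML.
    static String markLinksNofollow(String html) {
        Document doc = Jsoup.parse(html);
        doc.select("a[href]").attr("rel", "nofollow");
        return doc.body().html();
    }

    public static void main(String[] args) {
        // Sample input, invented for this sketch.
        String html = "<html><head><title>Sample</title></head><body>"
                + "<h1>Hello</h1>"
                + "<p>See <a href=\"http://example.com/\">this link</a>.</p>"
                + "</body></html>";

        System.out.println(firstText(html, "h1"));      // extraction
        System.out.println(markLinksNofollow(html));    // modification
    }
}
```

Both helpers re-parse the input on every call to keep the sketch short; real code would usually parse once and reuse the Document.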

JavaScript support

Jsoup does not execute JavaScript, so any content that is generated dynamically or added to the page after it loads cannot be extracted. If you need to extract content that is added to the page with JavaScript, there are a few alternative options:

  • Use a library which does support JavaScript, such as Selenium, which uses an actual web browser to load pages, or HtmlUnit.

  • Reverse engineer how the page loads its data. Typically, web pages which load data dynamically do so via AJAX, so you can look at the network tab of your browser's developer tools to see where the data is being loaded from, and then use those URLs in your own code. See how to scrape AJAX pages for more details.
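The second option can be sketched roughly as follows: once the network tab shows which URL the page requests its data from, that URL can be fetched directly. The endpoint URL, the class name AjaxEndpointFetch, and the X-Requested-With header are assumptions made for illustration; the real URL, headers, and parameters depend on the site being scraped.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;

import java.io.IOException;

public class AjaxEndpointFetch {

    // Builds (but does not send) a request for a data endpoint seen in the browser's network tab.
    static Connection prepare(String endpointUrl) {
        return Jsoup.connect(endpointUrl)
                // By default jsoup rejects responses that are not HTML or XML, such as JSON.
                .ignoreContentType(true)
                // Some endpoints only answer requests that look like AJAX calls (an assumption here).
                .header("X-Requested-With", "XMLHttpRequest")
                .timeout(10_000);
    }

    // Sends the request and returns the raw response body, which is often JSON.
    static String fetchBody(String endpointUrl) throws IOException {
        return prepare(endpointUrl).execute().body();
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical endpoint; pass the URL observed in the network tab as the first argument.
        String url = args.length > 0 ? args[0] : "https://example.com/api/items?page=1";
        System.out.println(fetchBody(url));
    }
}
```

The returned body is not HTML in most such cases, so it would normally be handed to a JSON parser rather than to Jsoup.parse.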

Official website & documentation

You can find various Jsoup-related resources at jsoup.org, including the Javadoc, usage examples in the Jsoup cookbook, and JAR downloads. See the GitHub repository for the source code, issues, and pull requests.

Download

Jsoup is available on Maven as org.jsoup:jsoup. If you're using Gradle (e.g. with Android Studio), you can add it to your project by adding the following to your build.gradle dependencies section:

compile 'org.jsoup:jsoup:1.8.3'

If you're using Maven, add the following to your POM's dependencies section:

<dependency>
  <!-- jsoup HTML parser library @ http://jsoup.org/ -->
  <groupId>org.jsoup</groupId>
  <artifactId>jsoup</artifactId>
  <version>1.8.3</version>
</dependency>

Jsoup is also available as a downloadable JAR for other environments.

Web crawling with Jsoup

Selectors

A selector is a chain of simple selectors, separated by combinators. Selectors are case insensitive (including against elements, attributes, and attribute values).

The universal selector (*) is implicit when no element selector is supplied (i.e. *.header and .header are equivalent).

Pattern                Matches                                                                                 Example
*                      any element                                                                             *
tag                    elements with the given tag name                                                        div
ns|E                   elements of type E in the namespace ns                                                  fb|name finds <fb:name> elements
#id                    elements with attribute ID of "id"                                                      div#wrap, #logo
.class                 elements with a class name of "class"                                                   div.left, .result
[attr]                 elements with an attribute named "attr" (with any value)                                a[href], [title]
[^attrPrefix]          elements with an attribute name starting with "attrPrefix" (useful for HTML5 datasets)  [^data-], div[^data-]
[attr=val]             elements with an attribute named "attr", and value equal to "val"                       img[width=500], a[rel=nofollow]
[attr="val"]           elements with an attribute named "attr", and value equal to "val"                       span[hello="Cleveland"][goodbye="Columbus"], a[rel="nofollow"]
[attr^=valPrefix]      elements with an attribute named "attr", and value starting with "valPrefix"            a[href^=http:]
[attr$=valSuffix]      elements with an attribute named "attr", and value ending with "valSuffix"              img[src$=.png]
[attr*=valContaining]  elements with an attribute named "attr", and value containing "valContaining"           a[href*=/search/]
[attr~=regex]          elements with an attribute named "attr", and value matching the regular expression      img[src~=(?i)\.(png|jpe?g)]

The above may be combined in any order, for example div.header[title].
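To make the patterns above more concrete, here is a small sketch that runs several of them against a fixed HTML fragment and reports how many elements each one matches. The fragment, the class name SelectorExamples, and the count helper are invented for this example; the selector strings are taken from the Example column.

```java
import org.jsoup.Jsoup;

public class SelectorExamples {

    // Sample markup, invented for this sketch.
    static final String HTML =
            "<div id=\"wrap\" class=\"header\" title=\"Top\">"
          + "<a href=\"http://example.com/search/q\" rel=\"nofollow\">Search</a>"
          + "<a href=\"/local\">Local</a>"
          + "<img src=\"logo.png\" width=\"500\">"
          + "<img src=\"photo.jpeg\">"
          + "<span data-id=\"7\">item</span>"
          + "</div>";

    // Number of elements in the sample markup that match the given selector.
    static int count(String cssQuery) {
        return Jsoup.parse(HTML).select(cssQuery).size();
    }

    public static void main(String[] args) {
        String[] queries = {
            "div#wrap",                       // tag combined with id
            ".header",                        // class, implicit universal selector
            "*.header",                       // same as above, universal selector written out
            "a[href]",                        // attribute present
            "[^data-]",                       // attribute name prefix
            "img[width=500]",                 // attribute value equals
            "a[href^=http:]",                 // attribute value prefix
            "img[src$=.png]",                 // attribute value suffix
            "a[href*=/search/]",              // attribute value contains
            "img[src~=(?i)\\.(png|jpe?g)]",   // attribute value matches regex
            "div.header[title]"               // combination of the above
        };
        for (String query : queries) {
            System.out.println(query + " -> " + count(query));
        }
    }
}
```

Note that the backslash in the regular expression has to be doubled inside a Java string literal.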

Selector full reference

Logging into websites with Jsoup

Parsing Javascript Generated Pages

Formatting HTML Output