Jsoup is an HTML parsing and data extraction library for Java, focused on flexibility and ease of use. It can extract specific data from HTML pages (commonly known as "web scraping"), modify the content of HTML pages, and "clean" untrusted HTML against a whitelist of allowed tags and attributes.
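As a rough illustration of the extraction and modification use cases, the sketch below parses an HTML string, reads text through a CSS query, and sets an attribute on every link. The class name `JsoupBasics`, its helper methods, and the sample markup are hypothetical and exist only for this example; cleaning untrusted HTML is left out because the relevant class name differs between Jsoup versions.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupBasics {
    // Hypothetical sample markup, used only for illustration.
    static final String HTML =
        "<html><body><p class=\"intro\">Hello <a href=\"/about\">about us</a></p></body></html>";

    /** Returns the text of the first element matching the CSS query, or "" if none matches. */
    static String firstText(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        Element match = doc.select(cssQuery).first(); // first() is null when nothing matches
        return match == null ? "" : match.text();
    }

    /** Sets rel="nofollow" on every link and returns the rel value of the first link. */
    static String markLinksNoFollow(String html) {
        Document doc = Jsoup.parse(html);
        doc.select("a[href]").attr("rel", "nofollow"); // applies to all matched elements
        Element first = doc.select("a[href]").first();
        return first == null ? "" : first.attr("rel");
    }

    public static void main(String[] args) {
        System.out.println(firstText(HTML, "p.intro"));
        System.out.println(firstText(HTML, "a[href]"));
        System.out.println(markLinksNoFollow(HTML));
    }
}
```

Returning an empty string when nothing matches is a choice made for this sketch; callers that need to distinguish "no match" from "empty text" would check the `Elements` result directly.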
Jsoup does not execute JavaScript, so content that is generated dynamically or added to the page after page load cannot be extracted. If you need content that is added to the page with JavaScript, there are a few alternatives:
Use a library that does support JavaScript, such as Selenium, which uses an actual web browser to load pages, or HtmlUnit.
Reverse engineer how the page loads its data. Web pages that load data dynamically typically do so via AJAX, so you can look at the network tab of your browser's developer tools to see where the data is being loaded from, and then use those URLs in your own code. See how to scrape AJAX pages for more details.
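For the second option, a minimal sketch of requesting such a data URL with Jsoup might look like the following. The endpoint URL, the class name `AjaxEndpointFetch`, and the `X-Requested-With` header are assumptions for illustration; the real URL and any required headers have to be read from the network tab for the page in question.

```java
import java.io.IOException;

import org.jsoup.Connection;
import org.jsoup.Jsoup;

public class AjaxEndpointFetch {
    // Hypothetical endpoint; replace it with the URL seen in the browser's network tab.
    static final String ENDPOINT = "https://example.com/api/items?page=1";

    /** Builds (but does not send) a request suitable for a non-HTML response such as JSON. */
    static Connection prepare(String url) {
        return Jsoup.connect(url)
                .ignoreContentType(true)                      // do not reject non-HTML content types
                .header("X-Requested-With", "XMLHttpRequest") // assumption: some endpoints check this
                .timeout(10000);                              // timeout in milliseconds
    }

    /** Sends the request and returns the raw response body as a string. */
    static String fetchBody(String url) throws IOException {
        return prepare(url).execute().body();
    }

    public static void main(String[] args) {
        try {
            System.out.println(fetchBody(ENDPOINT));
        } catch (IOException e) {
            // Expected with the placeholder URL above; substitute a real endpoint.
            System.err.println("Request failed: " + e.getMessage());
        }
    }
}
```

The returned body is typically JSON rather than HTML, so it would usually be handed to a JSON parser instead of `Jsoup.parse`.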
You can find various Jsoup-related resources at jsoup.org, including the Javadoc, usage examples in the Jsoup cookbook, and JAR downloads. See the GitHub repository for the source code, issues, and pull requests.
Jsoup is available on Maven Central as org.jsoup:jsoup. If you're using Gradle (e.g. with Android Studio), add the following to the dependencies section of your build.gradle:
compile 'org.jsoup:jsoup:1.8.3'
If you're using Maven (e.g. with Eclipse), add the following to your POM's dependencies section:
<dependency>
<!-- jsoup HTML parser library @ http://jsoup.org/ -->
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>1.8.3</version>
</dependency>
Jsoup is also available as a downloadable JAR for other environments.
A selector is a chain of simple selectors, separated by combinators. Selectors are case insensitive (including against elements, attributes, and attribute values).
The universal selector (*) is implicit when no element selector is supplied (i.e. *.header and .header are equivalent).
Pattern | Matches | Example |
---|---|---|
* | any element | * |
tag | elements with the given tag name | div |
ns\|E | elements of type E in the namespace ns | fb\|name finds <fb:name> elements |
#id | elements with attribute ID of "id" | div#wrap, #logo |
.class | elements with a class name of "class" | div.left, .result |
[attr] | elements with an attribute named "attr" (with any value) | a[href], [title] |
[^attrPrefix] | elements with an attribute name starting with "attrPrefix". Use to find elements with HTML5 datasets | [^data-], div[^data-] |
[attr=val] | elements with an attribute named "attr", and value equal to "val" | img[width=500], a[rel=nofollow] |
[attr="val"] | elements with an attribute named "attr", and value equal to "val" | span[hello="Cleveland"][goodbye="Columbus"], a[rel="nofollow"] |
[attr^=valPrefix] | elements with an attribute named "attr", and value starting with "valPrefix" | a[href^=http:] |
[attr$=valSuffix] | elements with an attribute named "attr", and value ending with "valSuffix" | img[src$=.png] |
[attr*=valContaining] | elements with an attribute named "attr", and value containing "valContaining" | a[href*=/search/] |
[attr~=regex] | elements with an attribute named "attr", and value matching the regular expression | img[src~=(?i)\.(png\|jpe?g)] |
 | The above may be combined in any order | div.header[title] |
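Several of the patterns in the table can be tried against a small document, as in the sketch below. The class name `SelectorDemo`, the `count` helper, and the sample markup are hypothetical; the queries themselves are taken from the table. Note that the backslash in the regex selector has to be doubled inside a Java string literal.

```java
import org.jsoup.Jsoup;

public class SelectorDemo {
    // Hypothetical sample markup, used only for illustration.
    static final String HTML =
        "<div id=\"wrap\" class=\"header\" title=\"Top\">"
      + "<a href=\"http://example.com/search/q\" rel=\"nofollow\">external</a>"
      + "<a href=\"/local\">local</a>"
      + "<img src=\"logo.png\" width=\"500\">"
      + "<img src=\"photo.jpg\">"
      + "<span data-id=\"7\">item</span>"
      + "</div>";

    /** Returns how many elements in the sample document match the CSS query. */
    static int count(String cssQuery) {
        return Jsoup.parse(HTML).select(cssQuery).size();
    }

    public static void main(String[] args) {
        String[] queries = {
            "#wrap",
            "div.header[title]",
            "a[href]",
            "a[rel=nofollow]",
            "a[href^=http:]",
            "a[href*=/search/]",
            "img[src$=.png]",
            "img[width=500]",
            "[^data-]",
            "img[src~=(?i)\\.(png|jpe?g)]"
        };
        // Print each query with the number of matching elements.
        for (String query : queries) {
            System.out.println(query + " -> " + count(query));
        }
    }
}
```

Re-parsing the document on every call keeps the helper short; real code would parse once and reuse the `Document`.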