Getting started with Jsoup Web crawling with Jsoup Selectors Logging into websites with Jsoup Parsing Javascript Generated Pages Formatting HTML Output

Selectors

Remarks:

A selector is a chain of simple selectors, separated by combinators. Selectors are case insensitive (including against elements, attributes, and attribute values).

The universal selector (*) is implicit when no element selector is supplied (i.e. *.header and .header is equivalent).

Pattern	Matches	Example
`*`	any element	`*`
`tag`	elements with the given tag name	`div`
`ns\|E`	elements of type E in the namespace ns	`fb\|name finds <fb:name> elements`
`#id`	elements with attribute ID of "id"	`div#wrap, #logo`
`.class`	elements with a class name of "class"	`div.left, .result`
`[attr]`	elements with an attribute named "attr" (with any value)	`a[href], [title]`
`[^attrPrefix]`	elements with an attribute name starting with "attrPrefix". Use to find elements with HTML5 datasets	`[^data-], div[^data-]`
`[attr=val]`	elements with an attribute named "attr", and value equal to "val"	`img[width=500], a[rel=nofollow]`
`[attr="val"]`	elements with an attribute named "attr", and value equal to "val"	`span[hello="Cleveland"][goodbye="Columbus"], a[rel="nofollow"]`
`[attr^=valPrefix]`	elements with an attribute named "attr", and value starting with "valPrefix"	`a[href^=http:]`
`[attr$=valSuffix]`	elements with an attribute named "attr", and value ending with "valSuffix"	`img[src$=.png]`
`[attr*=valContaining]`	elements with an attribute named "attr", and value containing "valContaining"	`a[href*=/search/]`
`[attr~=regex]`	elements with an attribute named "attr", and value matching the regular expression	`img[src~=(?i)\.(png\|jpe?g)]`
	The above may be combined in any order	`div.header[title]`

Selector full reference

Selecting elements using CSS selectors

String html = "<!DOCTYPE html>" +
              "<html>" +
                "<head>" +
                  "<title>Hello world!</title>" +
                "</head>" +
                "<body>" +
                  "<h1>Hello there!</h1>" +
                  "<p>First paragraph</p>" +
                  "<p class=\"not-first\">Second paragraph</p>" +
                  "<p class=\"not-first third\">Third <a href=\"page.html\">paragraph</a></p>" +
                "</body>" +
              "</html>";

// Parse the document
Document doc = Jsoup.parse(html);

// Get document title
String title = doc.select("head > title").first().text();
System.out.println(title); // Hello world!

Element firstParagraph = doc.select("p").first();

// Get all paragraphs except from the first
Elements otherParagraphs = doc.select("p.not-first");
// Same as
otherParagraphs = doc.select("p");
otherParagraphs.remove(0);

// Get the third paragraph (second in the list otherParagraphs which
// excludes the first paragraph)
Element thirdParagraph = otherParagraphs.get(1);
// Alternative:
thirdParagraph = doc.select("p.third");

// You can also select within elements, e.g. anchors with a href attribute
// within the third paragraph.
Element link = thirdParagraph.select("a[href]");
// or the first <h1> element in the document body
Element headline = doc.select("body").first().select("h1").first();

You can find a detailed overview of supported selectors here.

Extract Twitter Markup

    // Twitter markup documentation: 
    // https://dev.twitter.com/cards/markup
    String[] twitterTags = {
            "twitter:site", 
            "twitter:site:id", 
            "twitter:creator", 
            "twitter:creator:id", 
            "twitter:description", 
            "twitter:title", 
            "twitter:image", 
            "twitter:image:alt", 
            "twitter:player", 
            "twitter:player:width", 
            "twitter:player:height", 
            "twitter:player:stream", 
            "twitter:app:name:iphone", 
            "twitter:app:id:iphone", 
            "twitter:app:url:iphone", 
            "twitter:app:name:ipad", 
            "twitter:app:id:ipad", 
            "twitter:app:url:ipadt",
            "twitter:app:name:googleplay", 
            "twitter:app:id:googleplay", 
            "twitter:app:url:googleplay"        
    };
    
    // Connect to URL and extract source code
    Document doc = Jsoup.connect("http://stackoverflow.com/").get();
    
    for (String twitterTag : twitterTags) {
        
        // find a matching meta tag
        Element meta = doc.select("meta[name=" + twitterTag + "]").first();
        
        // if found, get the value of the content attribute
        String content = meta != null ? meta.attr("content") : "";
        
        // display results
        System.out.printf("%s = %s%n", twitterTag, content);
    }

Output

twitter:site = 
twitter:site:id = 
twitter:creator = 
twitter:creator:id = 
twitter:description = Q&A for professional and enthusiast programmers
twitter:title = Stack Overflow
twitter:image = 
twitter:image:alt = 
twitter:player = 
twitter:player:width = 
twitter:player:height = 
twitter:player:stream = 
twitter:app:name:iphone = 
twitter:app:id:iphone = 
twitter:app:url:iphone = 
twitter:app:name:ipad = 
twitter:app:id:ipad = 
twitter:app:url:ipadt = 
twitter:app:name:googleplay = 
twitter:app:id:googleplay = 
twitter:app:url:googleplay =

Contributors

Topic Id: 515

Example Ids: 1689,11336

This site is not affiliated with any of the contributors.