Web Scraping with jsoup in Java


jsoup: Java HTML Parser

    jsoup is a Java library for working with HTML content. It provides a convenient API for extracting and manipulating data, combining DOM traversal, CSS selectors, and jQuery-like methods. It implements the WHATWG HTML5 specification and parses HTML into the same DOM that modern browsers produce.

  1. Multiple input sources - It reads and parses HTML from a URL, a file, or a string.
  2. CSS selectors - It finds and extracts data using DOM traversal or CSS selectors.
  3. DOM manipulation - It can modify HTML elements, attributes, and text.
  4. XSS prevention - It can clean user-submitted content against a safelist of permitted tags and attributes, to prevent XSS attacks.
  5. Tidy output - It outputs tidy HTML.
  6. Handles invalid markup - jsoup copes with unclosed tags and implicit tags, and still builds a reliable document structure.
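The features above can be tried without any network access by parsing a string. The following is only a minimal sketch: the class name JsoupFeatureSketch, its method names, and the https://example.com/ base URI are made up for illustration, and it assumes jsoup 1.14.1 or later, where the cleaning class is called Safelist (older releases call it Whitelist).

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.safety.Safelist;

public class JsoupFeatureSketch {

    // Parse HTML from a string; the base URI (a placeholder here) lets jsoup resolve relative links.
    public static Document parse(String html) {
        return Jsoup.parse(html, "https://example.com/");
    }

    // CSS selector: absolute URL of the first link, or "" when the page has no link.
    public static String firstLinkUrl(Document doc) {
        Element link = doc.selectFirst("a[href]");
        return link == null ? "" : link.absUrl("href");
    }

    // DOM manipulation: set rel="nofollow" on every link in the document.
    public static void markLinksNoFollow(Document doc) {
        doc.select("a[href]").attr("rel", "nofollow");
    }

    // XSS prevention: keep only the tags and attributes allowed by the basic safelist.
    public static String sanitize(String untrustedHtml) {
        return Jsoup.clean(untrustedHtml, Safelist.basic());
    }

    public static void main(String[] args) {
        // Deliberately sloppy markup: the first <p> is never closed.
        Document doc = parse("<p>Unclosed paragraph<p>See <a href='/docs'>the docs</a>");

        System.out.println("Paragraphs found: " + doc.select("p").size());
        System.out.println("First link: " + firstLinkUrl(doc));

        markLinksNoFollow(doc);
        System.out.println(doc.body().html()); // tidy output of the modified DOM

        System.out.println(sanitize("<p>Hi <script>alert(1)</script><b>there</b></p>"));
    }
}
```

Passing a base URI to Jsoup.parse is optional, but without one absUrl cannot resolve relative links such as /docs and returns an empty string.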


Here is a sample program that fetches a page, prints the text of its first div, and lists the href of every link:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Test {

    public static void main(String[] args) throws Exception {
        // Fetch the page over HTTP and parse it into a DOM tree.
        String url = "https://niravpatelonjava.blogspot.in";
        Document document = Jsoup.connect(url).get();

        // Print the text of the first <div>; first() returns null when nothing matches.
        Element firstDiv = document.select("div").first();
        if (firstDiv != null) {
            System.out.println(firstDiv.text());
        }

        // Print the href attribute of every link on the page.
        Elements links = document.select("a");
        for (Element link : links) {
            System.out.println(link.attr("href"));
        }
    }
}
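The sample prints href values exactly as they appear in the markup, so relative links stay relative. As a possible variation (a sketch only, not part of the original program), the version below resolves each link to an absolute URL and sets an explicit user agent and timeout on the request. The class name LinkLister, the user agent string, the 10 second timeout, and the https://example.com/ default are all illustrative assumptions.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LinkLister {

    // Fetch a page with an explicit user agent and timeout (both values are illustrative).
    public static Document fetch(String url) throws IOException {
        return Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (compatible; LinkLister/0.1)")
                .timeout(10_000) // milliseconds
                .get();
    }

    // Collect the absolute URL of every link; absUrl resolves against the document's base URI.
    public static List<String> absoluteLinks(Document document) {
        List<String> urls = new ArrayList<>();
        for (Element link : document.select("a[href]")) {
            String absolute = link.absUrl("href");
            if (!absolute.isEmpty()) { // skip links that cannot be resolved
                urls.add(absolute);
            }
        }
        return urls;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder default; pass a real URL as the first argument.
        String url = args.length > 0 ? args[0] : "https://example.com/";
        for (String link : absoluteLinks(fetch(url))) {
            System.out.println(link);
        }
    }
}
```

Keeping the link extraction separate from the fetch means it can also be run on a Document parsed from a file or a string.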
 
 
 
Reference: https://jsoup.org
