Web Scraping with the jsoup API in Java
jsoup: Java HTML Parser
jsoup is a Java library for working with HTML content. It provides a convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods. It implements the WHATWG HTML5 specification and parses HTML to the same DOM as modern browsers do.

- Multiple Read Support - It reads and parses HTML from a URL, file, or string.
 
- CSS Selectors - It can find and extract data, using DOM traversal or CSS selectors.
 
- DOM Manipulation - It can manipulate the HTML elements, attributes, and text.
 
- Prevent XSS attacks - It can clean user-submitted content against a given whitelist of safe tags and attributes, to prevent XSS attacks.
 
- Tidy - It outputs tidy HTML.
 
- Handles invalid data - jsoup can handle unclosed tags and implicit tags, and can reliably create the document structure.
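
The features listed above can be tried without any network access. The following is a minimal sketch, not taken from the jsoup documentation: the class name `JsoupFeatures`, the sample markup, and the `https://example.com/` base URI are all made up for illustration. It also assumes jsoup 1.14.1 or later, where the cleaning whitelist class is called `Safelist` (older releases name it `Whitelist`).

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.safety.Safelist;

public class JsoupFeatures {

    // Hypothetical sample markup. The <p> tags are deliberately left unclosed.
    static final String HTML =
        "<html><head><title>Demo</title></head><body>"
        + "<div id=main><p class=intro>Hello <b>jsoup</b>"
        + "<p>Second paragraph <a href='/about'>About</a>"
        + "</div></body></html>";

    // Read support: parse from a String. The base URI (an assumption here)
    // lets jsoup resolve relative links. Jsoup.connect(url).get() and
    // Jsoup.parse(File, charsetName) cover the URL and file cases.
    static Document parse() {
        return Jsoup.parse(HTML, "https://example.com/");
    }

    // CSS selectors: find the paragraph with class "intro" and return its text.
    static String introText(Document doc) {
        Element intro = doc.select("p.intro").first();
        return intro == null ? "" : intro.text();
    }

    // DOM manipulation: change an attribute and the text of every link.
    static void relabelLinks(Document doc) {
        for (Element link : doc.select("a[href]")) {
            link.attr("rel", "nofollow").text("About us");
        }
    }

    // XSS prevention: keep only the tags and attributes the safelist allows.
    static String clean(String untrusted) {
        return Jsoup.clean(untrusted, Safelist.basic());
    }

    public static void main(String[] args) {
        Document doc = parse();

        // Invalid data: the unclosed <p> tags still yield two paragraphs.
        System.out.println("Paragraphs: " + doc.select("p").size());
        System.out.println("Intro: " + introText(doc));

        relabelLinks(doc);

        // Tidy output: print the repaired and modified body.
        System.out.println(doc.body().html());

        String unsafe =
            "<p><a href='http://example.com/' onclick='stealCookies()'>Link</a></p>";
        System.out.println(clean(unsafe));
    }
}
```

Parsing from a string with an explicit base URI keeps the sketch repeatable, since nothing depends on a live page.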
Here is a sample program:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Test {
    public static void main(String[] args) throws Exception {
        String url = "https://niravpatelonjava.blogspot.in";

        // Fetch the page and parse it into a DOM tree.
        Document document = Jsoup.connect(url).get();

        // Print the text of the first <div>, if the page has one.
        Element firstDiv = document.select("div").first();
        if (firstDiv != null) {
            System.out.println(firstDiv.text());
        }

        // Print the href attribute of every link on the page.
        Elements links = document.select("a");
        for (Element link : links) {
            System.out.println(link.attr("href"));
        }
    }
}

Reference: https://jsoup.org
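
The sample program prints each `href` exactly as it is written in the page, so relative links come out relative. As a possible variation, the sketch below resolves links to absolute URLs with `absUrl` and sets a user agent and timeout on the connection. The class name `LinkExtractor`, the user-agent string, and the 10-second timeout are illustrative assumptions, not recommendations from the jsoup project.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LinkExtractor {

    // Fetches a page with an explicit user agent and a 10-second timeout.
    // Both values are hypothetical; adjust them for the target site.
    static Document fetch(String url) throws IOException {
        return Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (compatible; ExampleBot/1.0)")
                .timeout(10_000)
                .get();
    }

    // Collects absolute link targets, skipping links that cannot be resolved.
    static List<String> absoluteLinks(Document doc) {
        List<String> urls = new ArrayList<>();
        for (Element link : doc.select("a[href]")) {
            // absUrl returns an empty string when no absolute URL can be built.
            String abs = link.absUrl("href");
            if (!abs.isEmpty()) {
                urls.add(abs);
            }
        }
        return urls;
    }

    public static void main(String[] args) throws IOException {
        // Uses the first command-line argument if given, else the jsoup site.
        String url = args.length > 0 ? args[0] : "https://jsoup.org";
        for (String link : absoluteLinks(fetch(url))) {
            System.out.println(link);
        }
    }
}
```

Keeping `absoluteLinks` separate from `fetch` means the extraction logic can be exercised on a document parsed from a string, without touching the network.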

 
 
 