Web Scraping with jsoup in Java
jsoup: Java HTML Parser
jsoup is a Java library for working with HTML content. It
provides a convenient API for extracting and manipulating data, using the
best of DOM, CSS, and jQuery-like methods. It implements the WHATWG
HTML5 specification and parses HTML to the same DOM as modern browsers
do.
- Multiple Read Support - It reads and parses HTML from a URL, file, or string.
- CSS Selectors - It can find and extract data using DOM traversal or CSS selectors.
- DOM Manipulation - It can manipulate HTML elements, attributes, and text.
- Prevent XSS attacks - It can clean user-submitted content against a safelist of permitted tags and attributes, to prevent XSS attacks.
- Tidy - It outputs tidy HTML.
- Handles invalid data - jsoup can handle unclosed tags and implicit tags, and can reliably create the document structure.
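The features above can be tried without any network access by parsing an HTML string. The following is a minimal sketch, not code from the jsoup site: the class name JsoupFeatures, its helper methods, and the sample markup are made up for illustration, and it assumes a recent jsoup release on the classpath, where the sanitizing class is called Safelist (older releases named it Whitelist).

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.safety.Safelist;

// Hypothetical demo class; names and sample HTML are illustrative only.
public class JsoupFeatures {

    // CSS selectors: text of the first element matching the selector, or "" if none.
    static String textOf(String html, String cssQuery) {
        Element match = Jsoup.parse(html).selectFirst(cssQuery);
        return match == null ? "" : match.text();
    }

    // DOM manipulation: add target="_blank" to every link and return the body HTML.
    // The input may contain unclosed tags; jsoup closes them while parsing.
    static String openLinksInNewTab(String html) {
        Document doc = Jsoup.parse(html);
        doc.select("a").attr("target", "_blank");
        return doc.body().html();
    }

    // XSS prevention: keep only the basic text tags allowed by Safelist.basic().
    static String sanitize(String untrustedHtml) {
        return Jsoup.clean(untrustedHtml, Safelist.basic());
    }

    public static void main(String[] args) {
        // Note the unclosed <p> tag in the input.
        String html = "<p>Hello <a class=ext href='https://jsoup.org'>jsoup</a>";
        System.out.println(textOf(html, "a.ext"));
        System.out.println(openLinksInNewTab(html));
        System.out.println(sanitize("<p>Hi<script>alert(1)</script></p>"));
    }
}
```

Safelist.basic() is only one of the presets; which tags should be allowed depends on where the cleaned content will be displayed.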
Here is a sample program:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Test {
    public static void main(String[] args) throws Exception {
        String url = "https://niravpatelonjava.blogspot.in";

        // Fetch the page and parse it into a DOM.
        Document document = Jsoup.connect(url).get();

        // Print the text of the first <div>, if the page has one.
        Element firstDiv = document.select("div").first();
        if (firstDiv != null) {
            System.out.println(firstDiv.text());
        }

        // Print the href attribute of every link on the page.
        Elements links = document.select("a");
        for (Element link : links) {
            System.out.println(link.attr("href"));
        }
    }
}
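When printing href values, some of them may be relative (for example "/about"), which is not very useful once scraped. A possible variation is sketched below: it uses jsoup's absUrl to resolve each link against the page address, and sets a user agent and a timeout on the connection. The class name AbsoluteLinks, the helper methods, the user-agent string, and the 10-second timeout are assumptions chosen for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

// Hypothetical variation on the sample; names and settings are illustrative only.
public class AbsoluteLinks {

    // Collect every link target as an absolute URL.
    // absUrl returns "" when the href cannot be resolved, so those are skipped.
    static List<String> absoluteLinks(Document document) {
        List<String> urls = new ArrayList<>();
        for (Element link : document.select("a[href]")) {
            String absolute = link.absUrl("href");
            if (!absolute.isEmpty()) {
                urls.add(absolute);
            }
        }
        return urls;
    }

    // Same as above for an HTML string; baseUri is what relative links resolve against.
    static List<String> absoluteLinks(String html, String baseUri) {
        return absoluteLinks(Jsoup.parse(html, baseUri));
    }

    public static void main(String[] args) throws Exception {
        Document document = Jsoup.connect("https://niravpatelonjava.blogspot.in")
                .userAgent("Mozilla/5.0 (compatible; jsoup-example)") // assumed value
                .timeout(10_000)                                      // milliseconds
                .get();
        absoluteLinks(document).forEach(System.out::println);
    }
}
```

Documents fetched with Jsoup.connect already carry the page URL as their base URI, so the two-argument helper is only needed when the HTML comes from a string or file.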
Reference: https://jsoup.org