Web scraping and parsing has been (and will continue to be) an important part of my research. There are many useful packages in R that make web scraping simple: rvest, XML, RCurl, curl, RSelenium, etc.
rvest package
install.packages("rvest") # if you haven't installed `rvest` yet
library(rvest)            # load the `rvest` package
read_html
read_html(url)
Argument url: a URL string
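As a side note, read_html() (re-exported from the xml2 package) also accepts a literal string of HTML, which is handy for trying out selectors offline. The inline document below is made up for illustration:

```r
library(rvest)

# read_html() parses an inline HTML string the same way it parses a fetched page
doc <- read_html("<html><body><p id='greet'>Hello</p></body></html>")
class(doc)  # an xml_document, just like when reading from a URL
```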
read_html("https://jeffsong9.github.io/")
{xml_document}
<html lang="en">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
[2] <body>\n\n <div class="container">\n <!-- Static navbar -->\ ...
html_nodes
Usage: Parse HTML documents using an XPath or CSS selector. html_nodes(html_doc, css, xpath)
Arguments:
html_doc: an HTML document or a node set
css: a CSS selector
xpath: an XPath selector
e.g.: read_html(some_url) %>%
html_nodes(xpath="//b") extracts all bold text using the XPath selector //b.
read_html(some_url) %>%
html_nodes(xpath='//div[@id="*something*"]') extracts the div whose id attribute matches.
read_html(some_url) %>%
html_nodes("center") extracts nodes using the CSS selector center.
read_html("https://jeffsong9.github.io/") %>%
html_nodes(xpath="//div") %>%
head(5)
{xml_nodeset (5)}
[1] <div class="container">\n <!-- Static navbar -->\n <nav cl ...
[2] <div class="container-fluid">\n <div class="navbar-header"> ...
[3] <div class="navbar-header">\n <button type="button" class ...
[4] <div id="navbar" class="navbar-collapse collapse">\n <ul ...
[5] <div id="page-wrap">\n <div id="wrap-inner">\n <div ...
cf) html_node (singular) returns only the first matching node, whereas html_nodes returns all matches.
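A quick sketch of the difference, using a made-up inline HTML string so it runs without a network connection:

```r
library(rvest)

doc <- read_html("<html><body><b>one</b><b>two</b></body></html>")

html_nodes(doc, "b")                  # node set containing both <b> elements
html_text(html_node(doc, "b"))        # only the first match: "one"
```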
One may be interested in extracting multiple node sets. I find it convenient to save all XPath queries in a vector and use the lapply
function in conjunction with the html_nodes
function.
paths <- c(profile = '//div[@class="profile-info-mod profile-essentials"]',
           side_bar = '//div[@class="floating-sidebar-float"]')
doc <- read_html("https://jeffsong9.github.io/") # fetch the page once
lapply(paths, function(x) html_nodes(doc, xpath = x))
$profile
{xml_nodeset (1)}
[1] <div class="profile-info-mod profile-essentials">\n ...
$side_bar
{xml_nodeset (1)}
[1] <div class="floating-sidebar-float">\n <div class=" ...
html_attr
Usage: Extract attributes with a given name. html_attr(html_doc, name)
Arguments:
html_doc: an HTML document or a node set
name: name of the attribute to extract
e.g. read_html(some_url) %>%
html_nodes(xpath="//a") %>%
html_attr(name="href")
read_html("https://jeffsong9.github.io/") %>%
html_nodes(xpath="//img [@class='photo']")
{xml_nodeset (1)}
[1] <img class="photo" src="/images/songt.jpg" alt="TaikgunSong" style=" ...
read_html("https://jeffsong9.github.io/") %>%
html_nodes(xpath="//img [@class='photo']") %>%
html_attr(name="src")
[1] "/images/songt.jpg"
cf) html_attrs extracts all attributes of a node at once, rather than a single named one.
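To illustrate, the sketch below reuses the img tag from the example above as an inline HTML string; html_attrs on a single node returns a named character vector of every attribute:

```r
library(rvest)

doc <- read_html('<html><body><img class="photo" src="/images/songt.jpg" alt="TaikgunSong"></body></html>')

img <- html_node(doc, "img")
html_attrs(img)  # named vector with elements "class", "src", and "alt"
```

On a node set (from html_nodes), html_attrs instead returns a list with one such vector per node.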