HTML Scraping and Parsing

    Web scraping and parsing has been (and will remain) an important part of my research interests. There are many useful packages in R that make web scraping simple: rvest, XML, RCurl, curl, RSelenium, etc.

  1. Using the rvest package
    1. install.packages("rvest") #if you haven't installed `rvest` yet
      require(rvest) #load the `rvest` package

    2. read_html
      • Usage: Read and parse an HTML page. read_html(url)

      • Argument url: A URL string

      read_html("https://jeffsong9.github.io/")
      {xml_document}
      <html lang="en">
      [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
      [2] <body>\n\n    <div class="container">\n      <!-- Static navbar -->\ ...
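
      As a side note, read_html also accepts a literal HTML string (or a local
      file path), which is handy for testing selectors offline. A minimal
      sketch, using a made-up HTML snippet:

      read_html("<html><body><b>bold</b> text</body></html>") %>%
        html_nodes(xpath="//b") # returns a one-element node set: <b>bold</b>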

    3. html_nodes
      1. Single Node Set
        • Usage: Parse an HTML document using an XPath or CSS selector. html_nodes(html_doc, css, xpath)

        • Arguments:
          html_doc: An HTML document or a node set
          css: A CSS selector
          xpath: An XPath selector

        • e.g.: read_html(some_url) %>%
          html_nodes(xpath="//b") Extracts all bold-font text using the XPath selector //b.

        read_html(some_url) %>%
        html_nodes(xpath='//div[@id="something"]') Extracts <div> nodes whose id attribute is "something".

        read_html(some_url) %>%
        html_nodes("center") Extracts nodes using the CSS selector center.

        • chain subsetting: read_html(some_url) %>%
          html_nodes("center") %>%
          html_nodes("font")
        read_html("https://jeffsong9.github.io/") %>%
          html_nodes(xpath="//div") %>%
          head(5)
        {xml_nodeset (5)}
        [1] <div class="container">\n      <!-- Static navbar -->\n      <nav cl ...
        [2] <div class="container-fluid">\n          <div class="navbar-header"> ...
        [3] <div class="navbar-header">\n            <button type="button" class ...
        [4] <div id="navbar" class="navbar-collapse collapse">\n            <ul  ...
        [5] <div id="page-wrap">\n        <div id="wrap-inner">\n          <div  ...

        cf) html_node, which returns only the first matching node.
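
        To go from a node set to usable data, pipe the result into an
        extractor such as html_text (also part of rvest). A minimal sketch,
        assuming we want the page title:

        read_html("https://jeffsong9.github.io/") %>%
          html_node(xpath="//title") %>% # html_node: first match only
          html_text()                    # extract the text content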

      2. Multiple Node Sets
        One may be interested in extracting multiple node sets. I find it convenient to save all XPath queries in a vector and use the lapply function in conjunction with the html_nodes function.

        paths=c(profile='//div [@class="profile-info-mod profile-essentials"]',
                side_bar='//div [@class="floating-sidebar-float"]')
        
        lapply(paths, function(x) read_html("https://jeffsong9.github.io/") %>%
          html_nodes(xpath=x))
        $profile
        {xml_nodeset (1)}
        [1] <div class="profile-info-mod profile-essentials">\n                  ...
        
        $side_bar
        {xml_nodeset (1)}
        [1] <div class="floating-sidebar-float">\n                  <div class=" ...
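
        Note that the lapply call above downloads and re-parses the page once
        per XPath query. Reading the page a single time and reusing the parsed
        document avoids the repeated network requests; a sketch:

        doc <- read_html("https://jeffsong9.github.io/") # fetch and parse once
        lapply(paths, function(x) html_nodes(doc, xpath=x))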

    4. html_attr
      • Usage: Extract attributes with a given name. html_attr(html_doc, name)

      • Arguments:
        html_doc: An HTML document or a node set
        name: Name of the attribute to extract

      e.g. read_html(some_url) %>%
      html_nodes(xpath="//a") %>%
      html_attr(name="href") Extracts the href attribute (the link target) from every <a> node.

      read_html("https://jeffsong9.github.io/") %>%
        html_nodes(xpath="//img [@class='photo']")
      {xml_nodeset (1)}
      [1] <img class="photo" src="/images/songt.jpg" alt="TaikgunSong" style=" ...
      read_html("https://jeffsong9.github.io/") %>%
        html_nodes(xpath="//img [@class='photo']") %>%
        html_attr(name="src")
      [1] "/images/songt.jpg"

      cf) html_attrs, which extracts all attributes of each node at once.
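
      A sketch, applied to the same profile image:

      read_html("https://jeffsong9.github.io/") %>%
        html_nodes(xpath="//img [@class='photo']") %>%
        html_attrs() # a list with one named character vector per node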