Text Mining

  1. Handling and Processing Strings
    1. Reading in Text Data
    2. Read text data in table format

      read.table(): main function to read a file in table format
      read.csv(): reads CSV files separated by a comma ","
      read.csv2(): reads CSV files separated by a semicolon ";" (with a comma as the decimal mark)
      read.delim(): reads files separated by tabs "\t"
      read.delim2(): similar to read.delim(), with a comma as the decimal mark
      read.fwf(): reads fixed-width format files

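      Pass stringsAsFactors = FALSE so that text columns are read as character strings instead of being converted to factors. (This is the default since R 4.0.0.)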
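
      A minimal sketch of reading tabular text with read.csv(). The file is a temporary one created just for this illustration, and the names csv_path and scores are hypothetical.

      ```r
      # Write a tiny CSV to a temporary file so the example is self-contained.
      csv_path <- tempfile(fileext = ".csv")
      writeLines(c("name,score", "alice,90", "bob,85"), csv_path)

      # Read it back, keeping the text column as character rather than factor.
      scores <- read.csv(csv_path, stringsAsFactors = FALSE)

      class(scores$name)   # "character"
      class(scores$score)  # "integer"
      nrow(scores)         # 2
      ```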

      Raw text data: readLines() reads a file line by line into a character vector.
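
      As a rough illustration of readLines(), the sketch below reads a small temporary file created only for this example; txt_path and raw_lines are hypothetical names.

      ```r
      # Create a small raw text file, including one empty line.
      txt_path <- tempfile(fileext = ".txt")
      writeLines(c("Text mining in R", "", "starts with reading lines."), txt_path)

      # Each line of the file becomes one element of a character vector.
      raw_lines <- readLines(txt_path)
      length(raw_lines)                # 3

      # Drop empty lines before further processing.
      non_empty <- raw_lines[nchar(raw_lines) > 0]
      length(non_empty)                # 2
      ```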

    3. String Manipulation
    4. grep(), gsub(), gregexpr(), paste(), paste0(), substr(), str_split()
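
      The functions listed above can be sketched on a toy vector as follows. The data are made up for illustration, and base R's strsplit() stands in for stringr's str_split() so the example runs without extra packages.

      ```r
      desserts <- c("apple pie", "banana split", "cherry tart")

      grep("an", desserts)                 # 2 (index of the matching element)
      grep("an", desserts, value = TRUE)   # "banana split"
      gsub("a", "A", desserts)             # "Apple pie" "bAnAnA split" "cherry tArt"

      # gregexpr() returns the start positions of all matches in each string.
      an_pos <- as.integer(gregexpr("an", "banana")[[1]])
      an_pos                               # 2 4

      paste("text", "mining")              # "text mining"
      paste0("text", "mining")             # "textmining"
      substr("text mining", 1, 4)          # "text"

      # Base R counterpart of stringr::str_split().
      parts <- strsplit("a,b,c", ",")[[1]]
      parts                                # "a" "b" "c"
      ```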

      Basic functions:
      nchar(): number of characters
      tolower(): convert to lower case
      toupper(): convert to upper case
      casefold(): case folding
      chartr(): character translation
      abbreviate(): abbreviation
      substring(): substrings of a character vector
      substr(): substrings of a character vector
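
      A short sketch of the basic functions on one example string; the string itself is arbitrary. No expected output is shown for abbreviate(), since its result depends on the abbreviation rules applied.

      ```r
      s <- "Text Mining"

      nchar(s)                     # 11
      tolower(s)                   # "text mining"
      toupper(s)                   # "TEXT MINING"
      casefold(s, upper = TRUE)    # "TEXT MINING"

      # chartr() replaces each character in `old` with the one at the
      # same position in `new`.
      chartr("ei", "EI", s)        # "TExt MInIng"

      abbreviate("international", minlength = 5)

      substr(s, 1, 4)              # "Text"
      substring(s, 6)              # "Mining" (from position 6 to the end)
      ```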

      Printing functions:
      print(): generic printing (argument quote = FALSE suppresses quotes)
      noquote(): print with no quotes
      cat(): concatenation
      format(): special formats
      toString(): convert to string
      sprintf(): C-style formatted printing
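
      The printing functions can be compared side by side; this is only a sketch, and the variable names are hypothetical.

      ```r
      s <- "text mining"

      print(s)                    # [1] "text mining"
      print(s, quote = FALSE)     # [1] text mining
      noquote(s)                  # [1] text mining
      cat(s, "\n")                # writes the bare string, no index or quotes

      # format() pads to a fixed width: numbers right-aligned, strings left-aligned.
      padded_num <- format(42L, width = 6)     # "    42"
      padded_chr <- format("abc", width = 6)   # "abc   "

      joined <- toString(1:5)                  # "1, 2, 3, 4, 5"
      msg <- sprintf("%s has %d characters", s, nchar(s))
      msg                                      # "text mining has 11 characters"
      ```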

      Package stringr: nchar() fails on a factor, so use str_length() instead. str_trim() removes leading and trailing whitespace.
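
      The factor issue can be sketched as below. The stringr calls only run if the package happens to be installed, and the base R lines (as.character(), trimws()) are shown as stand-ins rather than as part of the original notes.

      ```r
      colours <- factor(c("red", "green", "blue"))

      # nchar() refuses factors; capture the error message instead of stopping.
      nchar_result <- tryCatch(nchar(colours), error = function(e) conditionMessage(e))

      # Base R workaround: convert to character first.
      base_lengths <- nchar(as.character(colours))   # 3 5 4

      # stringr handles factors directly (skipped when stringr is not installed).
      if (requireNamespace("stringr", quietly = TRUE)) {
        print(stringr::str_length(colours))          # 3 5 4
        print(stringr::str_trim("  padded  "))       # "padded"
      }

      trimmed <- trimws("  padded  ")                # base R counterpart of str_trim()
      ```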

      Regex metacharacters: . \ | ( ) [ ] { } ^ $ * + ?
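
      Metacharacters have a special meaning in a pattern, so matching them literally needs an escape (doubled in an R string) or fixed = TRUE. A small sketch with made-up strings:

      ```r
      prices <- c("cost: $5.00", "cost: 5 dollars")

      # "$" normally anchors the end of a string; escape it to match a literal dollar sign.
      has_dollar <- grepl("\\$", prices)           # TRUE FALSE
      grepl("$", prices, fixed = TRUE)             # TRUE FALSE

      # "." matches any character unless escaped or used with fixed = TRUE.
      any_char <- gsub(".", "-", "a.b")            # "---"
      escaped  <- gsub("\\.", "-", "a.b")          # "a-b"
      literal  <- gsub(".", "-", "a.b", fixed = TRUE)  # "a-b"
      ```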

      encoding
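
      One way to inspect encodings in base R is sketched below. The example string is an assumption chosen for illustration; it uses a \u escape, which R marks as UTF-8.

      ```r
      word <- "caf\u00e9"          # "café", written with a Unicode escape

      Encoding(word)               # "UTF-8"

      # Characters and bytes differ for non-ASCII text: "é" takes two bytes in UTF-8.
      n_chars <- nchar(word, type = "chars")            # 4
      n_bytes <- nchar(enc2utf8(word), type = "bytes")  # 5
      ```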

      R has five main types of objects to store data: vector, factor, matrix (and array), data.frame, and list. A character string is stored as an element of a character vector.
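
      To make the five object types concrete, the sketch below puts the same kind of text into each of them; all names and values are hypothetical.

      ```r
      words_vec <- c("a", "b")                               # character vector
      words_mat <- matrix(c("a", "b", "c", "d"), nrow = 2)   # character matrix
      words_fac <- factor(c("a", "b", "a"))                  # factor
      words_df  <- data.frame(word = c("a", "b"), n = c(1, 2),
                              stringsAsFactors = FALSE)      # data.frame
      words_lst <- list(word = "a", n = 1)                   # list

      is.character(words_vec)      # TRUE
      is.character(words_mat)      # TRUE
      is.character(words_fac)      # FALSE: a factor stores integer codes
      levels(words_fac)            # "a" "b" (the labels are character strings)
      is.character(words_df$word)  # TRUE
      is.character(words_lst$word) # TRUE
      ```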