In Class Ex 5 - Text Analytics
1. Overview and Getting Started
pacman::p_load(tidytext, tidyverse, readtext, quanteda, ggwordcloud)
Refer to the following links for more information about the packages:
quanteda - https://quanteda.io/articles/quickstart.html
readtext - https://readtext.quanteda.io/articles/readtext_vignette.html
1.1 Importing Text Data using readtext
articles <- "data/articles/*"
text_data <- readtext(articles)
1.2 Corpus
1.2.1 Working with a quanteda corpus
Corpus principles
A corpus is designed to be a “library” of original documents that have been converted to plain, UTF-8 encoded text, and stored along with meta-data at the corpus level and at the document-level. We have a special name for document-level meta-data: docvars. These are variables or features that describe attributes of each document.
A corpus is designed to be a more or less static container of texts with respect to processing and analysis. This means that the texts in a corpus are not designed to be changed internally through (for example) cleaning or pre-processing steps, such as stemming or removing punctuation. Rather, texts can be extracted from the corpus as part of processing and assigned to new objects, but the idea is that the corpus will remain as an original reference copy so that other analyses – for instance those in which stems and punctuation were required, such as analysing a reading ease index – can be performed on the same corpus.
A corpus is a special form of character vector, meaning most functions that work with a character input will also work on a corpus. But a corpus object, like other quanteda core objects, has its own convenient print method.
corpus_text <- corpus(text_data)
summary(corpus_text, 5)
Corpus consisting of 338 documents, showing 5 documents:
Text Types Tokens Sentences
Alvarez PLC__0__0__Haacklee Herald.txt 206 433 18
Alvarez PLC__0__0__Lomark Daily.txt 102 170 12
Alvarez PLC__0__0__The News Buoy.txt 90 200 9
Alvarez PLC__0__1__Haacklee Herald.txt 96 187 8
Alvarez PLC__0__1__Lomark Daily.txt 241 504 21
1.3 Cleaning Text
usenet_words <- text_data %>%
  unnest_tokens(word, text) %>%
  filter(str_detect(word, "[a-z']$"),
         !word %in% stop_words$word)
Doing a word count:
usenet_words %>%
  count(word, sort = TRUE)
readtext object consisting of 3260 documents and 0 docvars.
# A data frame: 3,260 × 3
word n text
<chr> <int> <chr>
1 fishing 2177 "\"\"..."
2 sustainable 1525 "\"\"..."
3 company 1036 "\"\"..."
4 practices 838 "\"\"..."
5 industry 715 "\"\"..."
6 transactions 696 "\"\"..."
# ℹ 3,254 more rows
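The str_detect(word, "[a-z']$") filter used above keeps only tokens whose last character is a lowercase letter or an apostrophe, which drops numbers and similar non-word tokens. A self-contained base-R sketch of the same pattern (the sample tokens are made up):

```r
tokens <- c("fishing", "2024", "boat's", "co2")

# TRUE when the token ends in a lowercase letter or an apostrophe
grepl("[a-z']$", tokens)
#> [1]  TRUE FALSE  TRUE FALSE
```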
words_by_doc_id <- usenet_words %>%
  count(doc_id, word, sort = TRUE) %>%
  ungroup()
1.4 Splitting up the doc_id
text_data_split <- text_data %>%
  mutate(Company = str_extract(doc_id, "^[^_]+"),
         News_Agencies = str_extract(doc_id, "(?<=__)[^_]+(?=\\.txt)"))
(?<=__) is a positive lookbehind assertion that ensures the match occurs after “__”. [^_]+ matches one or more characters that are not underscores, representing the news agency. (?=\\.txt) is a positive lookahead assertion that ensures the match occurs before “.txt”.
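As a quick check of the lookaround pattern, here is a self-contained sketch using base R's regmatches() with perl = TRUE (the sample file name is made up but follows the same naming convention as the article files):

```r
# A hypothetical doc_id in the Company__i__j__Agency.txt pattern
x <- "Alvarez PLC__0__0__Haacklee Herald.txt"

# Extract the text between the final "__" and ".txt" using lookarounds
agency <- regmatches(x, regexpr("(?<=__)[^_]+(?=\\.txt)", x, perl = TRUE))
agency
#> [1] "Haacklee Herald"

# The company name is everything before the first underscore
company <- regmatches(x, regexpr("^[^_]+", x))
company
#> [1] "Alvarez PLC"
```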
text_data_splitted <- text_data %>%
  separate_wider_delim("doc_id",
                       delim = "__0__",
                       names = c("X", "Y"),
                       too_few = "align_end")
usenet_words1 <- text_data_split %>%
  unnest_tokens(word, text) %>%
  filter(str_detect(word, "[a-z']$"),
         !word %in% stop_words$word)
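To see what too_few = "align_end" does in the separate_wider_delim() call above, here is a small sketch on a made-up two-row tibble (requires tidyr and tibble; the second doc_id is invented to show the too-few case):

```r
library(tidyr)
library(tibble)

# Made-up doc_ids: the second row contains no "__0__" delimiter at all
df <- tibble(doc_id = c("Alvarez PLC__0__0__Haacklee Herald.txt",
                        "NoDelimiterHere.txt"))

out <- separate_wider_delim(df, doc_id,
                            delim = "__0__",
                            names = c("X", "Y"),
                            too_few = "align_end")

# With align_end, pieces are aligned to the last columns, so the row
# without a delimiter gets X = NA and Y = the whole original string
out
```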
words_by_news_agencies <- usenet_words1 %>%
  count(News_Agencies, word, sort = TRUE) %>%
  ungroup()
2. Importing json files
pacman::p_load(jsonlite, tidygraph, ggraph, visNetwork, graphlayouts, ggforce, skimr, tidytext, tidyverse, tidyr)
mc1_data <- fromJSON("data/mc1.json")
mc2_data <- fromJSON("data/mc2.json")
mc1_data_node <- mc1_data[["nodes"]]
mc1_data_links <- mc1_data[["links"]]
2.2 Importing mc3 data
# Specify the file paths
input_file <- "data/mc3.json"
output_file <- "data/mc3_processed.json"
# Read the JSON file
json_data <- readLines(input_file)
# Replace NaN with null
json_data <- gsub("NaN", "null", json_data)
# Write the updated JSON data back to a file
writeLines(json_data, con = output_file)
# Now you can read the processed JSON file into R
mc3_data <- fromJSON(output_file)
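The NaN-to-null substitution can be verified on a small in-memory JSON string (base R only; the fragment is made up). One caveat: a plain gsub also rewrites the literal text "NaN" if it ever appears inside a quoted string value, so check the data if that is a possibility.

```r
# A made-up JSON record with a bare NaN, which is not valid JSON
raw <- '{"id": 1, "revenue": NaN, "name": "Alvarez PLC"}'

# Replace NaN with null, as in the pre-processing step above
fixed <- gsub("NaN", "null", raw, fixed = TRUE)
fixed
#> [1] "{\"id\": 1, \"revenue\": null, \"name\": \"Alvarez PLC\"}"
```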