### Load standard packages
library(tidyverse) # Collection of all the good stuff like dplyr, ggplot2 etc.
library(magrittr) # For extra piping operators (e.g. %<>%)
library(tidytext)

This session

This session, we will

  1. Review NLP workflows and data structures in R
  2. Explore different types of document-term matrix (DTM) style vector representations of text.
  3. Add different types of dimensionality reduction techniques to the repertoire.
  4. Have a peek into word embeddings
  5. Add some goodies on top

Refresher:

Bag of words model

  • In order for a computer to understand text we need to somehow find a useful representation.
  • If you need to compare different texts, e.g. articles, you will probably go for keywords. These keywords may come from a keyword list with, for example, 200 different keywords.
  • In that case you could represent each document with a (sparse) vector with 1 for “keyword present” and 0 for “keyword absent”.
  • We can also get a bit more sophisticated and count the number of times a word from our dictionary occurs.
  • For a corpus of documents that would give us a document-term matrix.
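The keyword idea above can be sketched in a few lines of base R. This is only an illustration: the keyword list, the two toy documents and the helper names `tokenize()` and `count_keywords()` are made up for this example and are not part of the session's data or packages.

```r
# Hypothetical keyword list and toy documents (not the session data)
keywords <- c("cats", "dogs", "text")
docs <- c(d1 = "A text about cats.",
          d2 = "Dogs, dogs and more dogs!")

# Lowercase a string and split it on anything that is not a letter
tokenize <- function(x) {
  tokens <- strsplit(tolower(x), "[^a-z]+")[[1]]
  tokens[tokens != ""]
}

# Count how often each keyword occurs in one document
count_keywords <- function(d) {
  tab <- table(factor(tokenize(d), levels = keywords))
  setNames(as.integer(tab), keywords)
}

# One row per document, one column per keyword: a tiny document-term matrix
bow_counts <- t(sapply(docs, count_keywords))

# Binary variant: 1 for "keyword present", 0 for "keyword absent"
bow_binary <- (bow_counts > 0) * 1L

bow_counts
bow_binary
```

Words outside the keyword list are simply dropped, which is why a fixed dictionary keeps the vectors short but can miss vocabulary that matters.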

Example

Let’s try creating a bag of words model from our initial example.

text <- tibble(id = c(1:6),
               text = c('A text about cats.',
                        'A text about dogs.',
                        'And another text about a dog.',
                        'Why always writing about cats and dogs, always dogs?',
                        'There are too little text about cats but to many about dogs',
                        'Cats, cats, cats! I love cats soo much. Cats are way better than dogs'))
text_tidy <- text %>% 
  unnest_tokens(word, text, token = 'words') %>% 
  count(id, word)

The document-term matrix (DTM)

  • The simplest form of vector representation of text is a document-term matrix.
  • How do we get a document-term matrix now?
  • We could do it by hand, with well-known tidyverse syntax (Note: this only works when you have one row per unique document-word pair)
text_tidy %>%
  pivot_wider(names_from = word, values_from = n, values_fill = 0)
  • We could also use cast_dtm() to create a DTM in the format of the tm package.
text_dtm <- text_tidy %>%
  cast_dtm(id, word, n)
text_dtm 
<<DocumentTermMatrix (documents: 6, terms: 25)>>
Non-/sparse entries: 42/108
Sparsity           : 72%
Maximal term length: 7
Weighting          : term frequency (tf)
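The sparsity figure in this printout follows directly from the counts it reports. A short check of the arithmetic, using only the numbers shown above (the variable names are chosen for this sketch):

```r
# Numbers taken from the DocumentTermMatrix printout above
n_docs   <- 6
n_terms  <- 25
non_zero <- 42

total_cells  <- n_docs * n_terms        # every document-term combination
sparse_cells <- total_cells - non_zero  # cells holding a zero
sparsity     <- sparse_cells / total_cells

total_cells
sparse_cells
sparsity
```

The same share of zeros can be computed for any dense count matrix `m` with `mean(m == 0)`.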
  • We can simply convert it to a tibble. Since there is no direct conversion function, we first have to transform it to a matrix.
  • Notice how we recover the rownames
text_dtm %>% as.matrix() %>% as_tibble(rownames = 'id') 
  • Sidenote: We can also tidy the DTM again to a tidy token-dataframe.
text_dtm %>% tidy()
  • We can also directly use a similar function to cast a sparse matrix (which we could of course also transform to a tibble again)
text_tidy %>% cast_sparse(row = id, column = word, value = n)
6 x 25 sparse Matrix of class "dgCMatrix"
                                                   
1 1 1 1 1 . . . . . . . . . . . . . . . . . . . . .
2 1 1 . 1 1 . . . . . . . . . . . . . . . . . . . .
3 1 1 . 1 . 1 1 1 . . . . . . . . . . . . . . . . .
4 . 1 1 . 2 1 . . 2 1 1 . . . . . . . . . . . . . .
5 . 2 1 1 1 . . . . . . 1 1 1 1 1 1 1 . . . . . . .
6 . . 5 . 1 . . . . . . 1 . . . . . . 1 1 1 1 1 1 1
  • Finally, we could just apply a text recipe here
library(recipes)
library(textrecipes)

TF-IDF - Term Frequency - Inverse Document Frequency

  • A token is important for a document if it appears very often in that document.
  • A token becomes less important for comparison across a corpus if it appears all over the place in the corpus.
  • For example, “cat” in a corpus of websites talking about cats is not that important.

\[w_{i,j} = tf_{i,j} \cdot \log\left(\frac{N}{df_i}\right)\]

  • \(w_{i,j}\) = the TF-IDF score for a term i in a document j
  • \(tf_{i,j}\) = number of occurrences of term i in document j
  • \(N\) = number of documents in the corpus
  • \(df_i\) = number of documents with term i
# TFIDF weights
text_tidy %<>%
  bind_tf_idf(term = word,
              document = id,
              n = n)
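To make the formula concrete, here is a minimal base R sketch on made-up counts (the `counts` data frame and its values are hypothetical). It computes the raw-count variant written above and, for comparison, a relative-frequency variant in which tf is divided by the document length, which is how bind_tf_idf() defines tf. Both use the natural logarithm.

```r
# Toy counts in tidy format: one row per document-term pair (hypothetical data)
counts <- data.frame(
  doc  = c(1, 1, 2, 2, 3),
  term = c("cats", "text", "dogs", "text", "cats"),
  n    = c(2, 1, 1, 1, 4),
  stringsAsFactors = FALSE
)

# N: number of documents in the corpus
n_docs <- length(unique(counts$doc))

# df_i: number of distinct documents that contain term i
df <- tapply(counts$doc, counts$term, function(x) length(unique(x)))

# idf_i = log(N / df_i), looked up for every row
idf <- log(n_docs / as.numeric(df[counts$term]))

# Variant 1: tf is the raw count, exactly as in the formula
counts$tf_idf_raw <- counts$n * idf

# Variant 2: tf is the share of the document's tokens
doc_totals <- tapply(counts$n, counts$doc, sum)
tf_rel <- counts$n / as.numeric(doc_totals[as.character(counts$doc)])
counts$tf_idf_rel <- tf_rel * idf

counts
```

Note that a term occurring in every document gets idf = log(1) = 0 in both variants, so it drops out of any comparison across the corpus.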
  • We could of course also cast a TF-IDF weighted DTM…
text_tidy %>%
  select(id, word, tf_idf) %>%
  pivot_wider(names_from = word, values_from = tf_idf, values_fill = 0)
  • btw: this is roughly equivalent to running a text recipe like this (the default weighting options of step_tfidf() may differ slightly from bind_tf_idf()):
text %>%
  recipe(~.) %>% 
  step_tokenize(text, token = 'words') %>% # tokenize
  step_tfidf(text) %>% # TFIDF weighting
  prep() %>% juice()
  • Sidenote: when we use a POS engine such as spacyr for tokenization, we can also add recipe steps for lemmatization, POS filtering etc.
text %>%
  recipe(~.) %>% 
  step_tokenize(text, engine = "spacyr") %>%
  step_pos_filter(text, keep_tags = "NOUN") %>%
  step_lemma(text) %>%
  step_tf(text) %>%
  prep() %>%
  juice()
  • A last reminder on the powerful pairwise_xx() functions from the widyr package
  • For instance, pairwise similarities/distances
library(widyr)
text_tidy %>% pairwise_dist(id, word, tf_idf, method = "manhattan") %>%
  mutate(similarity = 1 - (distance / max(distance)) ) %>%
  select(-distance) %>%
  arrange(desc(similarity))
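The distance-to-similarity rescaling used here can be checked on a tiny matrix with base R's dist(). The three toy documents below are invented for illustration:

```r
# Hypothetical weighted document-term matrix: 3 documents, 3 terms
m <- rbind(doc1 = c(1, 0, 2),
           doc2 = c(1, 1, 2),
           doc3 = c(0, 3, 0))

# Manhattan distance: sum of absolute differences per term
d <- as.matrix(dist(m, method = "manhattan"))

# Rescale to [0, 1]: the most distant pair gets similarity 0
similarity <- 1 - d / max(d)

d
round(similarity, 3)
```

Because the scaling divides by the largest distance in the corpus, these similarities are relative: adding one very different document lowers max-scaled distances and therefore raises the similarity of all other pairs.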

Dimensionality reduction techniques

rm(list=ls())
  • OK, let's first get some more interesting data. We will work with the CORDIS project descriptions of EU Horizon 2020 projects again.
text <- read_csv('https://github.com/SDS-AAU/SDS-master/raw/master/M2/data/cordis-h2020reports.gz')
colnames(text) <- colnames(text) %>% str_to_lower()
text %<>%
  select(-x1) %>%
  rename(id = projectid) %>%
  relocate(id) %>%
  filter(language == 'en') %>%
  drop_na(id)
  • Let's create a tidy token list
text_tidy <- text %>%
  rename(text = summary) %>%
  select(id, text) %>%
  unnest_tokens(word, text, token = "words")
  • Some preprocessing
# preprocessing
text_tidy %<>%
  filter(str_length(word) > 2 ) %>% # Remove words with fewer than 3 characters
  filter(!(word %in% c('project', 'research'))) %>%
  anti_join(stop_words, by = 'word') 
  • We can also add bigrams
text_tidy %<>%
  unnest_tokens(word, word, token = 'ngrams', n = 2, n_min = 1) %>%
  group_by(word) %>% filter(n() > 25) %>% ungroup() 
text_tidy %>%
  count(word, sort = TRUE)
  • Let's finish this up and also add TF-IDF weights
text_tidy %<>%
  count(id, word) %>%
  bind_tf_idf(term = word,
              document = id,
              n = n) %>%
  select(-tf, -idf)
  • Is there a big difference?
text_tidy %>%
  count(word, wt = tf_idf, sort = TRUE)
  • And finally, let's get a DTM dataframe
text_dtm <- text_tidy %>%
  select(id, word, n) %>%
  pivot_wider(names_from = word, values_from = n, values_fill = 0)
  • And, just in case, a TFIDF weighted version
  • We could also prepare a recipe which does pretty much the same…
recipe_base <- text %>%
  rename(text = summary) %>%
  select(id, text) %>%
  # Base recipe starts
  recipe(~.) %>% 
  update_role(id, new_role = "id") %>% # Update role of ID
  step_tokenize(text, token = 'words') %>% # tokenize
  step_stopwords(text, keep = FALSE) %>% # remove stopwords
  step_untokenize(text) %>% # Here we now have to first untokenize
  step_tokenize(text, token = "ngrams", options = list(n = 1, n_min = 1)) %>% # and tokenize again
  step_tokenfilter(text, min_times = 25) 
  • Sidenote: Here, we can further preprocess to do whatever we would like, such as obtaining a DTM

recipe_base %>% 
  step_tf(text) %>% 
  prep() %>% 
  juice() %>% 
  head(100)
text_pca <- text_dtm %>% 
  column_to_rownames('id') %>% 
  prcomp(center = TRUE, scale. = TRUE, rank. = 10)
text_pca %>% glimpse()
List of 5
 $ sdev    : num [1:499] 3.58 3.11 2.97 2.85 2.71 ...
 $ rotation: num [1:608, 1:10] 0.01761 0.00292 0.07104 -0.03197 0.01753 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:608] "aim" "allowing" "based" "blood" ...
  .. ..$ : chr [1:10] "PC1" "PC2" "PC3" "PC4" ...
 $ center  : Named num [1:608] 0.2265 0.0541 0.6733 0.0701 0.1543 ...
  ..- attr(*, "names")= chr [1:608] "aim" "allowing" "based" "blood" ...
 $ scale   : Named num [1:608] 0.537 0.235 1.049 0.445 0.856 ...
  ..- attr(*, "names")= chr [1:608] "aim" "allowing" "based" "blood" ...
 $ x       : num [1:499, 1:10] -3.259 -0.996 -1.711 -1.379 -1.575 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:499] "115844" "633197" "633249" "633261" ...
  .. ..$ : chr [1:10] "PC1" "PC2" "PC3" "PC4" ...
 - attr(*, "class")= chr "prcomp"
text_pca[['x']] %>%
  head()
              PC1        PC2       PC3        PC4        PC5        PC6        PC7        PC8        PC9
115844 -3.2588756 -0.8478672  1.286494 -0.3304838  0.6253670 -0.3161002  0.1642597 -2.2037321 -0.1871126
633197 -0.9960611  4.4346346 -1.054370 -2.9036039 -1.4704782 -0.9094432 -1.6293613 -1.6208713 -0.2130936
633249 -1.7111795  3.7095798 -2.546628 -2.6489614 -2.1026976 -0.7091236  0.6661537 -0.1671077  0.3804010
633261 -1.3789058  4.1268532 -2.175831 -4.1895254 -0.8737219 -1.0295514 -1.1417048 -1.2886798 -1.7668852
633382 -1.5749243  4.2602715 -3.418563 -3.7036367 -1.1608198 -1.0926355 -1.1411842  0.2951679 -0.2694360
633571  1.2576733  1.6711741 -2.251064 -0.9706029 -1.5562738  0.6804761 -0.2523918 -0.2671309  0.9906243
             PC10
115844  0.8770474
633197  2.4375617
633249  1.2127379
633261  2.6082576
633382  1.6113388
633571 -3.8692253
text_pca %>% tidy()
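The sdev element of a prcomp object can be turned into the share of variance each component explains. The sketch below uses simulated counts rather than the CORDIS data; `toy_dtm` and its dimensions are assumptions made for illustration only.

```r
set.seed(42)

# Hypothetical document-term counts: 50 documents, 8 terms
toy_dtm <- matrix(rpois(50 * 8, lambda = 2), nrow = 50,
                  dimnames = list(NULL, paste0("term", 1:8)))

# Same settings as above, but keeping only 3 components
toy_pca <- prcomp(toy_dtm, center = TRUE, scale. = TRUE, rank. = 3)

# rank. limits x and rotation, but sdev is still returned for all components,
# so the variance shares can be computed over the full spectrum
var_explained <- toy_pca$sdev^2 / sum(toy_pca$sdev^2)

# Cumulative share of variance captured by the retained components
round(cumsum(var_explained)[1:3], 2)
```

With unrelated random counts like these, no component dominates. On real text the first few components typically capture more, but with hundreds of terms ten components often still explain only a modest share of the total variance.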
  • Again, alternatively with a recipe…
recipe_pca <- recipe_base %>% # tokenize
  step_tfidf(text, prefix = '') %>% # TFIDF weighting
  step_pca(all_predictors(), num_comp = 10) %>% # PCA
  prep() 
recipe_pca %>% juice()
  • Some plotting
recipe_pca %>% juice() %>%
  ggplot(aes(x = PC01, y = PC02)) +
  geom_point() 

  • We can also use the tidy results of the recipe for some further analysis
recipe_pca %>%
  tidy(7) %>%
  filter(component %in% paste0("PC", 1:4)) %>%
  group_by(component) %>%
    arrange(desc(value)) %>%
    slice(c(1:2, (n()-2):n())) %>%
  ungroup() %>%
  mutate(component = fct_inorder(component)) %>%
  ggplot(aes(value, terms, fill = terms)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~component, nrow = 1) +
  labs(y = NULL)

  • Note: Also check out these further dimensionality reduction steps:
    • step_kpca()
    • step_ica()
    • step_isomap()
    • step_nnmf()

Topic Models: Latent-Dirichlet-Allocation (LDA)

  • While we already did it somewhat ‘on-the-fly’, here is a more formal introduction to LDA.
  • In contrast to dimensionality reduction techniques, which mostly aim at preprocessing data or easing visualization, LDA is geared more towards EDA and interpretation.
  • It is a generative approach to identify topics (clusters) within the word usage in documents.
    • Topics are represented as a probability distribution over the words in the vocabulary. High-probability words can be used to characterize the topic.
    • Documents are represented as a mixture of topics.
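The generative story behind these two bullets can be simulated in base R. This is a sketch of the assumed data-generating process only, not of how LDA is fitted; the vocabulary, the two topics and the helper names `rdirichlet1()` and `generate_doc()` are invented for illustration.

```r
set.seed(1)
vocab <- c("cat", "dog", "pet", "gene", "cell", "protein")

# Two hypothetical topics, each a probability distribution over the vocabulary
beta <- rbind(
  pets    = c(0.35, 0.35, 0.25, 0.02, 0.02, 0.01),
  biology = c(0.02, 0.02, 0.01, 0.30, 0.30, 0.35)
)
colnames(beta) <- vocab

# One draw from a Dirichlet distribution via normalised gamma variates
rdirichlet1 <- function(alpha) {
  g <- rgamma(length(alpha), shape = alpha)
  g / sum(g)
}

# Generate one document: draw a topic mixture, then a topic and a word per token
generate_doc <- function(n_words, alpha = c(0.5, 0.5)) {
  gamma_d <- rdirichlet1(alpha)  # document-topic mixture
  topics <- sample(nrow(beta), n_words, replace = TRUE, prob = gamma_d)
  words <- vapply(topics,
                  function(k) sample(vocab, 1, prob = beta[k, ]),
                  character(1))
  list(gamma_d = gamma_d, words = words)
}

doc <- generate_doc(20)
doc$gamma_d
table(doc$words)
```

Fitting LDA runs this story in reverse: given only the observed words, it estimates the topic-word distributions and the per-document mixtures that most plausibly generated them.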


library(topicmodels)
text_dtm <- text_tidy %>%
  cast_dtm(document = id, term = word, value = n)
text_lda <- text_dtm %>% 
  LDA(k = 6, method = "Gibbs",
      control = list(seed = 1337))
  • \(\beta\) is an output of the LDA model, indicating the probability that a word occurs in a certain topic.
  • Therefore, looking at the top probability words of a topic often gives us a good intuition regarding its properties.
# LDA output is defined for tidy(), so we can easily extract it
lda_beta <- text_lda %>% 
  tidy(matrix = "beta") 
lda_beta %>%
  # slice
  group_by(topic) %>%
  arrange(topic, desc(beta)) %>%
  slice(1:10) %>%
  ungroup() %>%
  # visualize
  mutate(term = reorder_within(term, beta, topic)) %>%
  group_by(topic, term) %>%    
  arrange(desc(beta)) %>%  
  ungroup() %>%
  ggplot(aes(term, beta, fill = as.factor(topic))) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  scale_x_reordered() +
  labs(title = "Top 10 terms in each LDA topic",
       x = NULL, y = expression(beta)) +
  facet_wrap(~ topic, ncol = 3, scales = "free")

  • Documents are represented as a mix of topics. This association of a document to a topic is captured by \(\gamma\)
lda_gamma <- text_lda %>% 
  tidy(matrix = "gamma")
lda_gamma %>%
  group_by(topic) %>%
    arrange(desc(gamma)) %>% 
    slice(1:10) %>%
  ungroup() %>%
  left_join(text %>% select(id, projectacronym) %>% mutate(id = id %>% as.character()), by = c('document' = 'id'))
  • Note that an LDA can also be performed via a recipe:
recipe_lda <- recipe_base %>% # tokenize
  step_untokenize(text) %>% # Is a bit silly, needs the full text vectors instead of tokens....
  step_lda(text, num_topics = 6) %>% # LDA
  prep() 
recipe_lda %>% juice() %>% 
  head(100)
  • As a bonus, a great way to interactively visualize LDA models.
  • It’s a bit cumbersome in R, though…
library(LDAvis)
# A bit of a lengthy function....
topicmodels_json_ldavis <- function(fitted, doc_dtm, method = "PCA"){
  require(topicmodels); require(dplyr); require(LDAvis)
  
  # Find required quantities
  phi <- posterior(fitted)$terms %>% as.matrix() # Topic-term distribution (uses the function argument, not the global model)
  theta <- posterior(fitted)$topics %>% as.matrix() # Document-topic matrix
  
  text_tidy <- doc_dtm %>% tidy()
  vocab <- colnames(phi)
  doc_length <- tibble(document = rownames(theta)) %>% left_join(text_tidy %>% count(document, wt = count), by = 'document')
  tf <- tibble(term = vocab) %>% left_join(text_tidy %>% count(term, wt = count), by = "term") 
  
  if(method == "PCA"){mds <- jsPCA}
  if(method == "TSNE"){library(tsne); mds <- function(x){tsne(svd(x)$u)} }
  
  # Convert to json
  json_lda <- LDAvis::createJSON(phi = phi, theta = theta, vocab = vocab, doc.length = doc_length %>% pull(n), term.frequency = tf %>% pull(n),
                                 reorder.topics = FALSE, mds.method = mds,plot.opts = list(xlab = "Dim.1", ylab = "Dim.2")) 
  return(json_lda)
}
library(LDAvis)
json_lda <- topicmodels_json_ldavis(fitted = text_lda, 
                                    doc_dtm = text_dtm, 
                                    method = "TSNE")

# json_lda %>% serVis() # For direct output
# json_lda %>% serVis(out.dir = 'LDAviz') # For saving the html

I did not figure out how to embed the resulting plot, but the outcome can be seen here

Embeddings (Bonus)

  • One last thing we have not ventured into yet is embeddings

  • I will not go into details here, just see it as a peek at what’s to come in further sessions.

  • The idea of word embeddings is (in a nutshell) that

  • There are packages for training your own embeddings, such as text2vec, but we will not bother with that for now.

  • The only thing we will do for now is to load pretrained embeddings (GloVe, cf. Pennington et al., 2014)

library(textdata)

glove6b <- embedding_glove6b(dimensions = 100)
glove6b
  • Et voilà, a large pretrained embedding model for around 400k of the most common words.
  • For now we loaded the smallest of these embedding models; there are much bigger ones.
  • Let's join it with our tidy token list
word_embeddings <- text_tidy %>%
  inner_join(glove6b, by = c('word' = 'token'))
word_embeddings %>% head()
  • We could now create average document embeddings by taking the mean of each embedding dimension over all words in a document
  • We could also (even better) weight that by the word’s TF-IDF score.
  • These embeddings could now be used, for instance, for some clustering or SML exercise
  • I guess you can already see how to use these embeddings in an SML model.
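The UMAP call that follows expects a `doc_embeddings` object with one row per document, which is not created in the code shown here. As a hedged illustration of the TF-IDF weighted averaging described above, here is a base R sketch on invented data; the three-dimensional vectors, the token table and its weights are all hypothetical, and this is not necessarily how the original object was built.

```r
# Hypothetical 3-dimensional word vectors (real GloVe vectors have far more dimensions)
embeddings <- rbind(
  cat     = c(0.9, 0.1, 0.0),
  dog     = c(0.8, 0.2, 0.1),
  protein = c(0.0, 0.9, 0.7),
  cell    = c(0.1, 0.8, 0.9)
)
colnames(embeddings) <- paste0("d", 1:3)

# Tidy token list with made-up TF-IDF weights
tokens <- data.frame(
  id     = c("A", "A", "B", "B"),
  word   = c("cat", "dog", "protein", "cell"),
  tf_idf = c(0.3, 0.1, 0.2, 0.2),
  stringsAsFactors = FALSE
)

# Weighted mean of the word vectors within each document
weighted_doc_vector <- function(d) {
  w <- d$tf_idf / sum(d$tf_idf)  # weights sum to 1 per document
  colSums(embeddings[d$word, , drop = FALSE] * w)
}

doc_embeddings <- t(sapply(split(tokens, tokens$id), weighted_doc_vector))
doc_embeddings
```

With the session's objects, the same idea would amount to grouping `word_embeddings` by `id` and taking a `weighted.mean()` of every embedding column with `tf_idf` as the weight, assuming the embedding columns can be selected by a common name pattern.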
library(uwot) # for UMAP
embeddings_umap <- doc_embeddings  %>% 
  column_to_rownames("id") %>%
  umap(n_neighbors = 15, 
       metric = "cosine", 
       min_dist = 0.01, 
       scale = TRUE,
       verbose = TRUE, 
       n_threads = 8) 
embeddings_umap %<>% as.data.frame()
embeddings_umap  %>% 
  ggplot(aes(x = V1, y = V2)) + 
  geom_point(shape = 21, alpha = 0.5) 

  • OK, we see a rather clear separation of documents.
  • Just for fun, let's add a density-based clustering (very good for spatial clustering) on top (even though we already see the results)
library(dbscan)
  • Do the hierarchical density-based clustering
embeddings_hdbscan <- embeddings_umap %>% as.matrix() %>% hdbscan(minPts = 15)
  • Plot it
embeddings_umap %>% 
  bind_cols(cluster = embeddings_hdbscan$cluster %>% as.factor(), 
            prob = embeddings_hdbscan$membership_prob) %>%
  ggplot(aes(x = V1, y = V2, col = cluster)) + 
  geom_point(aes(alpha = prob), shape = 21) 

  • Note: We can also assign the embeddings via a recipe
  • Unfortunately, we cannot do a TF-IDF weighting here ‘out-of-the-box’, but have to work with average embeddings instead.
recipe_embedding <- recipe_base %>% # tokenize
  step_word_embeddings(text, embeddings = glove6b, aggregation = 'mean')
recipe_embedding %>% prep() %>% juice() %>% 
  head(100)
  • The same goes for UMAP, which can be accessed in recipes via the embed package.
  • However, embed is a bit heavy in terms of dependencies, since it uses keras and tensorflow, a deep learning framework, in the background, and requires installing a separate Miniconda environment.
  • If you have no experience with keras and tensorflow so far, I suggest you wait with this one until later sessions, when we properly introduce them.
library(embed)
Error: package or namespace load failed for ‘embed’:
 .onLoad failed in loadNamespace() for 'tensorflow', details:
  call: py_module_import(module, convert = convert)
  error: ModuleNotFoundError: No module named 'tensorflow'
recipe_umap <- recipe_embedding %>%
  step_umap(starts_with('w_embed'), n_neighbors = 15) 
recipe_umap %>% prep() %>% juice() %>% 
  head(100)
  • So, that’s all I have for now

Summary

  • There are many ways to convert text data into a vector representation.
  • These range from simple and weighted bags-of-words, over topic models and different types of dimensionality reduction, to word and document embeddings.
  • All of them are useful, depending on the purpose.

Endnotes

Packages & Ecosystem

  • textrecipes: Text preprocessing recipes
  • embed: Extra embedding recipes
  • topicmodels: LDA topicmodelling in R
  • LDAvis: A bit clunky but awesome interactive LDA visualizations
  • text2vec: Package for vector space modelling (aka embeddings & other vectorizations) of text data
  • textdata: Useful datasets for text, such as GloVe embeddings, sentiment lexica etc.
  • uwot: UMAP for R

References

Chapters:

  • Julia Silge and David Robinson (2020). Text Mining with R: A Tidy Approach, O’Reilly. Available online here
  • Emil Hvidfeldt and Julia Silge (2020). Supervised Machine Learning for Text Analysis in R, available online here

Articles:

  • Blei, David M., Andrew Y. Ng, and Michael I. Jordan. “Latent Dirichlet allocation.” Journal of Machine Learning Research 3, no. Jan (2003): 993-1022.
  • Pennington, Jeffrey, Richard Socher, and Christopher D. Manning. “GloVe: Global vectors for word representation.” In Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.

Further sources

Session Info

sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] dbscan_1.1-5       uwot_0.1.8         Matrix_1.2-18      textdata_0.4.1     tsne_0.1-3        
 [6] LDAvis_0.3.2       topicmodels_0.2-11 widyr_0.1.3        textrecipes_0.3.0  recipes_0.1.13    
[11] tidytext_0.2.6     magrittr_1.5       forcats_0.5.0      stringr_1.4.0      dplyr_1.0.2       
[16] purrr_0.3.4        readr_1.4.0        tidyr_1.1.2        tibble_3.0.4       ggplot2_3.3.2     
[21] tidyverse_1.3.0    knitr_1.30        

loaded via a namespace (and not attached):
  [1] colorspace_1.4-1     ellipsis_0.3.1       class_7.3-17         modeltools_0.2-23    rsconnect_0.8.16    
  [6] rprojroot_1.3-2      base64enc_0.1-3      fs_1.5.0             rstudioapi_0.11      farver_2.0.3        
 [11] textfeatures_0.3.3   SnowballC_0.7.0      RSpectra_0.16-0      prodlim_2019.11.13   fansi_0.4.1         
 [16] lubridate_1.7.9      xml2_1.3.2           codetools_0.2-16     splines_4.0.2        rsparse_0.4.0       
 [21] zeallot_0.1.0        pkgload_1.1.0        mlapi_0.1.0          jsonlite_1.7.1       RhpcBLASctl_0.20-137
 [26] broom_0.7.1          servr_0.19           dbplyr_1.4.4         tfruns_1.4           compiler_4.0.2      
 [31] httr_1.4.2           backports_1.1.10     assertthat_0.2.1     cli_2.1.0            later_1.1.0.1       
 [36] tools_4.0.2          NLP_0.2-0            gtable_0.3.0         glue_1.4.2           reshape2_1.4.4      
 [41] rappdirs_0.3.1       float_0.2-4          Rcpp_1.0.5           slam_0.1-47          cellranger_1.1.0    
 [46] vctrs_0.3.4          RJSONIO_1.3-1.4      timeDate_3043.102    gower_0.2.2          xfun_0.19           
 [51] stopwords_2.0        testthat_3.0.0       rvest_0.3.6          mime_0.9             lifecycle_0.2.0     
 [56] pacman_0.5.1         MASS_7.3-53          scales_1.1.1         ipred_0.9-9          lgr_0.4.1           
 [61] promises_1.1.1       hms_0.5.3            parallel_4.0.2       yaml_2.2.1           curl_4.3            
 [66] reticulate_1.18      rpart_4.1-15         stringi_1.5.3        tokenizers_0.2.1     desc_1.2.0          
 [71] lava_1.6.8           rlang_0.4.8          pkgconfig_2.0.3      lattice_0.20-41      labeling_0.4.2      
 [76] tidyselect_1.1.0     RcppAnnoy_0.0.16     plyr_1.8.6           R6_2.5.0             text2vec_0.6        
 [81] generics_0.1.0       DBI_1.1.0            whisker_0.4          pillar_1.4.6         haven_2.3.1         
 [86] withr_2.3.0          survival_3.2-7       nnet_7.3-14          janeaustenr_0.1.5    modelr_0.1.8        
 [91] crayon_1.3.4         usethis_1.6.3        grid_4.0.2           readxl_1.3.1         data.table_1.13.0   
 [96] blob_1.2.1           reprex_0.3.0         digest_0.6.27        tm_0.7-7             httpuv_1.5.4        
[101] spacyr_1.2.1         stats4_4.0.2         munsell_0.5.0       
