18.3 Plotting text

18.3.1 The task

Challenge: Plot text (as a visualization) and provide some statistics about character or word frequency.

Seems simple, but realize: Many plotted images are no longer text, but pixel-based (each location has a color value).

Use an example text.

As an example, we can use the bardr package (Billings, 2021) to load William Shakespeare’s complete works and extract the famous Sonnet 18 (see Wikipedia):

# Get some literary work: 
library(bardr)  # the complete works of William Shakespeare, as provided by Project Gutenberg

# Extract some work: Poetry > Sonnet 18
works  <- all_works_df
poetry <- subset(works, works$genre == "Poetry")
sonnet_18_start <- grep("compare thee to a summer", poetry$content)            # find 1st line
sonnet_18 <- poetry$content[(sonnet_18_start - 1):(sonnet_18_start + 14 - 1)]  # extract sonnet
sonnet_18 <- gsub(pattern = "\\\032", replacement = "'", x = sonnet_18)        # corrections
text <- sonnet_18

## Write text to file: 
# cat(sonnet_18, file = "bard.txt", sep = "\n")
  • Figure 18.2 shows the sonnet text, as plotted by the plot_text() function of the ds4psy package (Neth, 2021b):

Figure 18.2: Visualizing Shakespeare’s Sonnet 18 (using the plot_charmap() function of ds4psy).

18.3.2 Plot text

Basic question: Text as strings of words, sentences, or paragraphs, or as a collection of individual characters?

Here: Individual characters (i.e., on a grid of x/y-locations, similar to crossword puzzles).

  • Advantage: Easy to place words, as each character has a unique coordinate.
  • Disadvantage: Equidistant spacing creates a grid-like appearance.

Data structure: Table of individual characters char, plus their x- and y-locations.

Goal:

#>   char x y
#> 1    S 1 1
#> 2    h 2 1
#> 3    a 3 1
#> 4    l 4 1
#> 5    l 5 1
#> 6      6 1

Subtasks

  1. Turn the text (provided as a character vector with one element per line of text) into a vector of individual characters.
#>  [1] "                    18"                            
#>  [2] "Shall I compare thee to a summer's day?"           
#>  [3] "Thou art more lovely and more temperate:"          
#>  [4] "Rough winds do shake the darling buds of May,"     
#>  [5] "And summer's lease hath all too short a date:"     
#>  [6] "Sometime too hot the eye of heaven shines,"        
#>  [7] "And often is his gold complexion dimmed,"          
#>  [8] "And every fair from fair sometime declines,"       
#>  [9] "By chance, or nature's changing course untrimmed:" 
#> [10] "But thy eternal summer shall not fade,"            
#> [11] "Nor lose possession of that fair thou ow'st,"      
#> [12] "Nor shall death brag thou wand'rest in his shade," 
#> [13] "When in eternal lines to time thou grow'st,"       
#> [14] "  So long as men can breathe or eyes can see,"     
#> [15] "  So long lives this, and this gives life to thee."
#> [1] "S" "h" "a" "l" "l" " "
#> [1] " " "t" "h" "e" "e" "."

Ad 1: Note that both the input (one element per line of text) and the output (one element per character) are character vectors, but of different lengths.
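
A minimal sketch of this first subtask in base R, assuming a short stand-in vector `txt` (a hypothetical example, not the full sonnet): strsplit() with an empty split pattern breaks each line into its characters, and unlist() flattens the per-line results into one vector.

```r
# Stand-in for the sonnet: a character vector with one element per line
txt <- c("Shall I compare thee",
         "to a summer's day?")

# Split each line into single characters (yields a list with one vector per line):
chars_by_line <- strsplit(txt, split = "")

# Flatten the list into a single character vector:
chars <- unlist(chars_by_line)

length(txt)    # number of lines
length(chars)  # number of characters (including spaces and punctuation)
head(chars)
```

Both `txt` and `chars` are character vectors, but `chars` has as many elements as `sum(nchar(txt))`.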

  2. Generate the x- and y-coordinates for individual characters.

Ad 2: Mapping increasing coordinates to rows and positions in text.

We could use a combination of 2 loops that iterate through each line of text and each individual character per line. However, we can also use a single loop (for each line of text) and use numerical indexing to map the coordinates for the characters within the current line.

(See map_text_coord() definition.)

Result:

#>   char x y
#> 1    S 1 1
#> 2    h 2 1
#> 3    a 3 1
#> 4    l 4 1
#> 5    l 5 1
#> 6      6 1
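
The single-loop approach described above could be sketched as follows. Note that `map_chars_xy()` is a hypothetical helper written for illustration and is not the map_text_coord() definition; its `flip_y` argument is an assumption modeled on the later call in this section.

```r
# Hypothetical helper (not the ds4psy definition):
# Map each character of a text (one element per line) to x/y-coordinates.
map_chars_xy <- function(txt, flip_y = FALSE) {

  n_lines <- length(txt)
  n_chars <- nchar(txt)     # number of characters per line
  n_total <- sum(n_chars)

  # Pre-allocate the result vectors:
  char  <- rep(NA_character_, n_total)
  x_pos <- rep(NA_integer_,   n_total)
  y_pos <- rep(NA_integer_,   n_total)

  end <- 0  # index of the last position filled so far

  for (i in seq_len(n_lines)) {  # single loop over the lines of text

    n_i <- n_chars[i]
    if (n_i == 0) { next }       # skip empty lines (they still count as a row)

    ix <- (end + 1):(end + n_i)  # positions of line i in the result vectors

    char[ix]  <- strsplit(txt[i], split = "")[[1]]
    x_pos[ix] <- seq_len(n_i)    # x: position within the line
    y_pos[ix] <- i               # y: line number

    end <- end + n_i
  }

  if (flip_y) {  # let the first line get the largest y-value (top of a plot)
    y_pos <- n_lines - y_pos + 1L
  }

  data.frame(char = char, x = x_pos, y = y_pos, stringsAsFactors = FALSE)
}

demo <- map_chars_xy(c("Shall I", "compare"))
head(demo)
```

Numerical indexing via `ix` replaces the inner loop over characters: all characters of a line are assigned in one vectorized step.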

18.3.3 Adding counts and colors

While such visualizations can be aesthetically pleasing, combining them with quantitative aspects (e.g., counting pattern matches) may also enable new insights.

Subtasks

  1. Count character frequency

Result:

#> chars
#>  e  o  t  s  a  h  n  r  i  l  m  d  u  f  g  c  y  '  v  p  S  w  A  b  B  N 
#> 63 44 39 38 37 31 31 28 26 23 22 20 13 10 10  9  8  6  5  4  4  4  3  3  2  2 
#>  1  8  I  k  M  R  T  W  x 
#>  1  1  1  1  1  1  1  1  1

  2. Map the frequency value to each character in the char vector (i.e., a typical join() or merge() problem).

  3. Map the frequency values to an appropriate color palette.
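
The three subtasks above can be sketched in base R as follows. This is only an illustration under stated assumptions: the short string, the decision to ignore spaces, and the grey-to-black palette are choices made for this example and do not reproduce the ds4psy functions used in this section.

```r
# Stand-in character vector (hypothetical example, not the full sonnet):
chars <- unlist(strsplit("so long lives this", split = ""))

# 1. Count character frequency (ignoring spaces), sorted by decreasing frequency:
tab <- sort(table(chars[chars != " "]), decreasing = TRUE)

# 2. Map the frequency to each character: match() acts as a lookup (left join)
#    that preserves the order of characters; spaces are not in tab and yield NA.
#    (merge() would also work, but re-sorts rows unless an index is added.)
df <- data.frame(char = chars, stringsAsFactors = FALSE)
df$char_freq <- as.vector(tab)[match(df$char, names(tab))]

# 3. Map frequencies to colors: one color per distinct frequency value
pal  <- colorRampPalette(c("grey90", "black"))
lvls <- sort(unique(df$char_freq))         # distinct frequencies (NA is dropped)
cols <- pal(length(lvls))                  # light = rare, dark = frequent
df$col <- cols[match(df$char_freq, lvls)]  # an NA frequency yields an NA color

head(df)
```

Using match() rather than merge() keeps the rows in reading order, which matters because each row corresponds to a plot position.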


# (a) Frequency map:
library(ds4psy)  # provides count_chars_words() and map_text_coord()

fm <- count_chars_words(x = text[-1])
names(fm) <- c("chars", names(fm)[-1])
dim(fm)
#> [1] 612   4
head(fm)
#>   chars char_freq  word word_freq
#> 1     S         4 Shall         1
#> 2     h        31 Shall         1
#> 3     a        37 Shall         1
#> 4     l        23 Shall         1
#> 5     l        23 Shall         1
#> 6             104              NA

# (b) Char map:
cm <- map_text_coord(x = text[-1], flip_y = TRUE)
cm$ix <- 1:nrow(cm)  # add row nr 
dim(cm)
#> [1] 612   4
head(cm)
#>   char x  y ix
#> 1    S 1 14  1
#> 2    h 2 14  2
#> 3    a 3 14  3
#> 4    l 4 14  4
#> 5    l 5 14  5
#> 6      6 14  6

# (c) Combine:
tb <- cbind(cm, fm)
dim(tb)
#> [1] 612   8
head(tb)
#>   char x  y ix chars char_freq  word word_freq
#> 1    S 1 14  1     S         4 Shall         1
#> 2    h 2 14  2     h        31 Shall         1
#> 3    a 3 14  3     a        37 Shall         1
#> 4    l 4 14  4     l        23 Shall         1
#> 5    l 5 14  5     l        23 Shall         1
#> 6      6 14  6             104              NA

# Check:
all(tb$char == tb$chars)  
#> [1] TRUE

Plot text with character/word statistics:

  • Show character frequency (using a continuous color scale):
library(ggplot2)  # provides ggplot(), aes(), geom_text(), etc.

ggplot(tb, aes(x = x, y = y)) +
  # geom_tile(aes(fill = word_freq)) +
  geom_text(aes(label = char, col = char_freq), fontface = 1) +
  scale_color_gradient(low = "grey90", high = "black") + 
  # coord_equal() + 
  theme_classic()

Figure 18.3: Text with character frequency.

Using factor(char_freq) for an ordinal color scale:

library(unikn)  # provides usecol()

n_freq <- length(unique(tb$char_freq))  # number of distinct frequency values

ggplot(tb, aes(x = x, y = y)) +
  # geom_tile(aes(fill = word_freq)) +
  geom_text(aes(label = char, col = factor(char_freq)), fontface = 1) +
  # scale_color_gradient(low = "grey90", high = "black") + 
  scale_color_manual(values = usecol(c("grey90", "black"), n = n_freq)) + 
  theme_classic()
  • Figure 18.4 shows a version generated by varying the color and label rotation options of plot_chars():

Figure 18.4: Varying label colors and angles in Shakespeare’s Sonnet 18 (using the plot_chars() function of ds4psy).

  • If we want to locate and highlight specific target terms, we can use the regular expression options provided by the plot_chars() function (see Figure 18.5):

Figure 18.5: Locating and visualizing pattern matches in Shakespeare’s Sonnet 18 (using the plot_chars() function of ds4psy).

Note that all these plots are character-based: The basic unit is an individual character, and all characters have the same width and height. In formatted text (e.g., when using a real typesetting system such as TeX), not all characters have the same width. This would require a different approach to plotting text (see, e.g., the text decoration functions in the unikn package). But even when plotting longer strings (i.e., entire words or sentences), the plotted text still changes its format by becoming part of an image. Thus, real desktop imaging software distinguishes text from graphical objects for as long as possible and only converts text into a graphical format when necessary (e.g., for printing it).

Nevertheless, our examples show that the boundaries between plain text and graphics are easily blurred, especially when we start thinking about word search problems, crossword puzzles, or word clouds. Thinking about new ways of visual expression will also change the ways we see characters and texts.

18.3.4 Extensions

  1. Creating word search puzzles:

    • start with an \(m \times n\) grid of NA values and
    • a list of words to place.
    • specify orientation: 2 directions (fw/bw), 4 orientations (horizontal, vertical, 2 diagonals)
    • aim to place elements (under constraints)
    • fill rest so that no new words are created
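
A rough sketch of how such a puzzle generator might start, in base R. The function `make_word_search()`, its arguments, and the random trial-and-error placement are assumptions made for illustration. The sketch covers the first four points of the list but not the last: the remaining cells are filled with random letters without checking whether new words are created.

```r
# Hypothetical sketch (not a ds4psy function): place words on an m x n grid
make_word_search <- function(words, m = 10, n = 10, max_tries = 200) {

  grid <- matrix(NA_character_, nrow = m, ncol = n)  # start with a grid of NA values

  # 4 orientations as (row, column) steps: horizontal, vertical, 2 diagonals.
  # Reversing a word covers the 2 directions (forward/backward).
  steps <- list(c(0, 1), c(1, 0), c(1, 1), c(1, -1))

  for (word in toupper(words)) {

    chars  <- strsplit(word, split = "")[[1]]
    len    <- length(chars)
    placed <- FALSE

    for (k in seq_len(max_tries)) {

      step <- steps[[sample(length(steps), 1)]]          # random orientation
      cur  <- if (runif(1) < 0.5) rev(chars) else chars  # random direction

      # Random start cell and the resulting cell coordinates:
      rows <- sample(m, 1) + step[1] * (seq_len(len) - 1)
      cols <- sample(n, 1) + step[2] * (seq_len(len) - 1)

      # Constraint 1: the word must stay within the grid
      if (any(rows < 1 | rows > m | cols < 1 | cols > n)) { next }

      # Constraint 2: cells must be empty or already contain the same letter
      cells <- grid[cbind(rows, cols)]
      if (all(is.na(cells) | cells == cur)) {
        grid[cbind(rows, cols)] <- cur
        placed <- TRUE
        break
      }
    }

    if (!placed) { warning("Could not place word: ", word) }
  }

  # Fill the remaining cells with random letters
  # (this sketch does NOT check whether the fill creates new words):
  empty <- is.na(grid)
  grid[empty] <- sample(LETTERS, sum(empty), replace = TRUE)

  grid
}

set.seed(18)  # for reproducible randomness
puzzle <- make_word_search(c("summer", "lovely", "shade"), m = 8, n = 8)
dim(puzzle)
```

The resulting character matrix could then be converted into a table of characters and x/y-coordinates and plotted like the sonnet above.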