Chapter 6 Cleaning 1

Cleaning proceeds through steps such as tokenization, stop-word removal, and normalization.

Tokenization
Stop-word removal
Normalization

6.1 Tokenization 1

A token is the smallest unit of analysis in a corpus. The token unit can be set in various ways: characters, words, n-grams, sentences, paragraphs, and so on.

# Load each package by name; require() returns TRUE when loading succeeds.
pkg_l <- c("tidyverse", "tidytext", "textdata")
lapply(pkg_l, require, character.only = TRUE)

## Loading required package: tidyverse

## Warning: package 'tidyverse' was built under R version 4.0.5

## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --

## √ ggplot2 3.3.2     √ purrr   0.3.4
## √ tibble  3.1.0     √ dplyr   1.0.2
## √ tidyr   1.1.2     √ stringr 1.4.0
## √ readr   1.4.0     √ forcats 0.5.0

## Warning: package 'tibble' was built under R version 4.0.5

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

## Loading required package: tidytext

## Loading required package: textdata

## Warning: package 'textdata' was built under R version 4.0.4

## [[1]]
## [1] TRUE
## 
## [[2]]
## [1] TRUE
## 
## [[3]]
## [1] TRUE

6.1.1 Word tokens

The default tokenizer of unnest_tokens() is “words”, which tokenizes the text by word.

text_v <- "You still fascinate and inspire me.
You influence me for the better. 
You’re the object of my desire, the #1 Earthly reason for my existence."

tibble(text = text_v) %>% 
  unnest_tokens(output = word, input = text,
                token = "words")

## # A tibble: 25 x 1
##    word     
##    <chr>    
##  1 you      
##  2 still    
##  3 fascinate
##  4 and      
##  5 inspire  
##  6 me       
##  7 you      
##  8 influence
##  9 me       
## 10 for      
## # ... with 15 more rows

Passing FALSE to the strip_punct = argument keeps punctuation. Punctuation sometimes provides an important clue to the meaning of a text, so there are cases where it should not be removed during cleaning.

tibble(text = text_v) %>% 
  unnest_tokens(output = word, input = text,
                token = "words",
                strip_punct = F)

## # A tibble: 30 x 1
##    word     
##    <chr>    
##  1 you      
##  2 still    
##  3 fascinate
##  4 and      
##  5 inspire  
##  6 me       
##  7 .        
##  8 you      
##  9 influence
## 10 me       
## # ... with 20 more rows

6.1.2 Character tokens

Passing “characters” to the token = argument tokenizes the text by character.

tibble(text = text_v) %>% 
  unnest_tokens(output = word, input = text,
                token = "characters") %>% 
  count(word, sort = T)

## # A tibble: 21 x 2
##    word      n
##    <chr> <int>
##  1 e        20
##  2 t        10
##  3 o         8
##  4 r         8
##  5 i         7
##  6 n         7
##  7 s         6
##  8 y         6
##  9 a         5
## 10 f         5
## # ... with 11 more rows

6.1.3 Multiple characters

To use runs of several characters as the token unit, pass “character_shingles” to the token = argument. The default length is 3 characters.

tibble(text = text_v) %>% 
  unnest_tokens(output = word, input = text,
                token = "character_shingles", n = 4) %>% 
  count(word, sort = T)

## # A tibble: 104 x 2
##    word      n
##    <chr> <int>
##  1 ence      2
##  2 ethe      2
##  3 reth      2
##  4 1ear      1
##  5 andi      1
##  6 arth      1
##  7 asci      1
##  8 ason      1
##  9 atea      1
## 10 bett      1
## # ... with 94 more rows

6.1.4 Multiple words (n-gram)

To split the text into tokens of several words, pass “ngrams” to the token = argument. The default is 3 words.

tibble(text = text_v) %>% 
  unnest_tokens(output = word, input = text,
                token = "ngrams", n = 4) %>% 
  count(word, sort = T)

## # A tibble: 22 x 2
##    word                         n
##    <chr>                    <int>
##  1 1 earthly reason for         1
##  2 and inspire me you           1
##  3 better you’re the object     1
##  4 desire the 1 earthly         1
##  5 earthly reason for my        1
##  6 fascinate and inspire me     1
##  7 for the better you’re        1
##  8 influence me for the         1
##  9 inspire me you influence     1
## 10 me for the better            1
## # ... with 12 more rows

6.1.5 Regular expressions

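Regular expressions (regex) allow the text to be split into tokens in more varied ways.

Set the token = argument to “regex” and pass the regular expression to pattern =.

"\\n" means “new line”. Because each sentence of the letter sits on its own line, splitting on "\\n" tokenizes the text by sentence. To tokenize on whitespace instead, pass "\\s", which matches whitespace.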

tibble(text = text_v) %>% 
  unnest_tokens(output = word, input = text,
                token = "regex", pattern = "\\n")

## # A tibble: 3 x 1
##   word                                                                     
##   <chr>                                                                    
## 1 "you still fascinate and inspire me."                                    
## 2 "you influence me for the better. "                                      
## 3 "you’re the object of my desire, the #1 earthly reason for my existence."
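The whitespace variant mentioned above is not shown in the original, so here is a minimal sketch. It assumes the same love-letter text (re-declared so the block stands alone, with \u2019 standing for the curly apostrophe) and uses "\\s+" rather than "\\s" so that a space followed by a line break counts as one separator. The name ws_df is introduced only for this illustration.

```r
library(dplyr)
library(tidytext)

# Same text as text_v in this chapter; \u2019 is the curly apostrophe.
text_v <- "You still fascinate and inspire me.
You influence me for the better. 
You\u2019re the object of my desire, the #1 Earthly reason for my existence."

# Split on runs of whitespace. Punctuation stays attached to each token,
# so "me." and "#1" survive as they are written.
ws_df <- tibble(text = text_v) %>%
  unnest_tokens(output = word, input = text,
                token = "regex", pattern = "\\s+")
ws_df
```

Unlike the “words” tokenizer, this keeps tokens such as “#1” and “desire,” intact, which is useful when symbols carry meaning.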

See ?unnest_tokens for the full range of options of the unnest_tokens function.

?unnest_tokens
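Sentences were listed among the possible token units at the start of this section but were not demonstrated. A minimal sketch, assuming the same love-letter text (re-declared so the block stands alone) and the “sentences” tokenizer of unnest_tokens(); sent_df is a name introduced only for this illustration.

```r
library(dplyr)
library(tidytext)

# Same text as text_v in this chapter; \u2019 is the curly apostrophe.
text_v <- "You still fascinate and inspire me.
You influence me for the better. 
You\u2019re the object of my desire, the #1 Earthly reason for my existence."

# One row per sentence; the text is lowercased by default (to_lower = TRUE).
sent_df <- tibble(text = text_v) %>%
  unnest_tokens(output = sentence, input = text, token = "sentences")
sent_df
```

Unlike the "\\n" pattern, this tokenizer does not depend on each sentence sitting on its own line.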

6.2 Removing stop words

Let's tokenize the love letter shown earlier by word and compute the word frequencies.

tibble(text = text_v) %>% 
  unnest_tokens(output = word, input = text) %>% 
  count(word, sort = TRUE)

## # A tibble: 19 x 2
##    word          n
##    <chr>     <int>
##  1 the           3
##  2 for           2
##  3 me            2
##  4 my            2
##  5 you           2
##  6 1             1
##  7 and           1
##  8 better        1
##  9 desire        1
## 10 earthly       1
## 11 existence     1
## 12 fascinate     1
## 13 influence     1
## 14 inspire       1
## 15 object        1
## 16 of            1
## 17 reason        1
## 18 still         1
## 19 you’re        1

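The word frequencies computed with count show that “the” was used 3 times, and “for”, “me”, “my”, and “you” twice each. In other words, this text is about you and me. That is reasonable, since a declaration of love is a matter between you and me.

The result also contains words that do not help in grasping meaning from word frequency: frequently used words such as the articles, prepositions, and conjunctions “the”, “for”, “of”, and “and”. Treating such words as stop words and excluding them from the analysis can help in identifying the meaning more accurately.

There are broadly two ways to remove stop words.

anti_join(): store the stop-word list in a data frame, then anti-join it with the text data frame using anti_join(). An anti-join keeps only the rows of the first data frame that have no match in the second, so the rows whose word appears in the stop-word list are dropped.
filter(): use filter() together with str_detect() to screen out stop words. This is useful when the stop words are hard to enumerate in a list.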

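6.2.1 Stop-word lexicons

Commonly used stop-word lists are provided as stop-word lexicons. The tidytext package collects stop words in stop_words. Let's look at the structure of stop_words first.

The kableExtra package prints data frames neatly (see the package documentation for usage).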

install.packages("kableExtra")

The function that loads a dataset into the R session is data().

library(kableExtra)

## Warning: package 'kableExtra' was built under R version 4.0.4

## 
## Attaching package: 'kableExtra'

## The following object is masked from 'package:dplyr':
## 
##     group_rows

data(stop_words)
stop_words %>% glimpse()

## Rows: 1,149
## Columns: 2
## $ word    <chr> "a", "a's", "able", "about", "above", "according", "accordingl~
## $ lexicon <chr> "SMART", "SMART", "SMART", "SMART", "SMART", "SMART", "SMART",~

stop_words[c(1:3, 701:703, 1001:1003),] %>% 
  kbl() %>% kable_classic(full_width = F)

word	lexicon
a	SMART
a's	SMART
able	SMART
during	snowball
before	snowball
after	snowball
parted	onix
parting	onix
parts	onix

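This is a data frame with 1,149 rows (1,149 stop words) and 2 columns (word and lexicon). The words in the word column are the stop words, and the values in the lexicon column name the lexicon each word comes from. stop_words in tidytext contains three stop-word lexicons (SMART, snowball, onix). With the filter function you can select and use only the stop words from a particular lexicon.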

stop_words$lexicon %>% unique

## [1] "SMART"    "snowball" "onix"
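Selecting a single lexicon with filter, as described above, can be sketched as follows. This is an illustration rather than part of the original analysis: it assumes the snowball lexicon and the same love-letter text (re-declared so the block stands alone), and the names snowball_sw and snow_df are introduced only here.

```r
library(dplyr)
library(tidytext)

# Same text as text_v in this chapter; \u2019 is the curly apostrophe.
text_v <- "You still fascinate and inspire me.
You influence me for the better. 
You\u2019re the object of my desire, the #1 Earthly reason for my existence."

# Keep only the stop words that belong to the snowball lexicon.
snowball_sw <- stop_words %>%
  filter(lexicon == "snowball")

# Remove just those words from the tokenized text.
snow_df <- tibble(text = text_v) %>%
  unnest_tokens(output = word, input = text) %>%
  anti_join(snowball_sw, by = "word") %>%
  count(word, sort = TRUE)
snow_df
```

A smaller lexicon removes fewer words, so the choice of lexicon is itself an analytic decision.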

Let's screen out the stop words with the lexicon and then compute the word frequencies.

data(stop_words)

tibble(text = text_v) %>%
  unnest_tokens(output = word, input = text) %>% 
  anti_join(stop_words) %>% 
  count(word, sort = TRUE)

## Joining, by = "word"

## # A tibble: 10 x 2
##    word          n
##    <chr>     <int>
##  1 1             1
##  2 desire        1
##  3 earthly       1
##  4 existence     1
##  5 fascinate     1
##  6 influence     1
##  7 inspire       1
##  8 object        1
##  9 reason        1
## 10 you’re        1

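The result shows that pronoun tokens such as “you” have all been removed, but “you’re” remains. The stop-word lexicon spells the word you're with a straight apostrophe ('), whereas the text spells it you’re with a typographic apostrophe (’), so the two strings do not match.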
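One alternative to editing the lexicon, not used in the original, is to normalize the apostrophes in the text before tokenizing. A minimal sketch under that assumption, with the letter re-declared so the block stands alone (\u2019 is the typographic apostrophe) and norm_df a name introduced only for this illustration:

```r
library(dplyr)
library(stringr)
library(tidytext)

# Same text as text_v in this chapter; \u2019 is the typographic apostrophe.
text_v <- "You still fascinate and inspire me.
You influence me for the better. 
You\u2019re the object of my desire, the #1 Earthly reason for my existence."

# Replace typographic apostrophes with straight ones so that "you're"
# in the text matches the spelling used in stop_words.
norm_df <- tibble(text = str_replace_all(text_v, "\u2019", "'")) %>%
  unnest_tokens(output = word, input = text) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)
norm_df
```

This fixes every contraction at once, whereas adding entries to the lexicon handles only the words that are listed.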

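6.2.2 Modifying the stop-word lexicon 1

Let's add “you’re” to the stop-word lexicon, together with the number “1”.

First, store the terms to add in a data frame with the same structure as the stop-word lexicon.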

names(stop_words)

## [1] "word"    "lexicon"

stop_add <- tibble(word = c("you’re", "1"),
                   lexicon = "added")
stop_add

## # A tibble: 2 x 2
##   word   lexicon
##   <chr>  <chr>  
## 1 you’re added  
## 2 1      added

Combine it with the stop-word lexicon using bind_rows().

stop_words2 <- bind_rows(stop_words, stop_add)
stop_words2 %>% tail()

## # A tibble: 6 x 2
##   word     lexicon
##   <chr>    <chr>  
## 1 younger  onix   
## 2 youngest onix   
## 3 your     onix   
## 4 yours    onix   
## 5 you’re   added  
## 6 1        added

Clean the text with the new stop-word lexicon and then compute the word frequencies.

tibble(text = text_v) %>%
  unnest_tokens(output = word, input = text) %>% 
  anti_join(stop_words2) %>% 
  count(word, sort = TRUE)

## Joining, by = "word"

## # A tibble: 8 x 2
##   word          n
##   <chr>     <int>
## 1 desire        1
## 2 earthly       1
## 3 existence     1
## 4 fascinate     1
## 5 influence     1
## 6 inspire       1
## 7 object        1
## 8 reason        1

Both “you’re” and the number have been removed.

6.2.3 `filter()`

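Not every number can be listed in a stop-word lexicon. Instead, screen numbers out with filter(), str_detect(), and the negation operator !, using [:digit:] or \\d, which stand for a digit in regular expressions (regex).

str_subset() returns the strings that match the pattern, whereas str_detect() returns a logical value (TRUE or FALSE) for each string.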

df <- tibble(text = text_v) %>%
  unnest_tokens(output = word, input = text) %>% 
  anti_join(stop_words)

## Joining, by = "word"

df$word %>% str_subset(pattern = "\\d")

## [1] "1"

df$word %>% str_detect(pattern = "\\d")

##  [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE

After removing the stop words, additionally remove the tokens that contain a digit and the token “you’re”, which is written with “’”.

tibble(text = text_v) %>%
  unnest_tokens(output = word, input = text) %>% 
  anti_join(stop_words) %>% 
  filter(
    !str_detect(word, pattern = "\\d"),
    !str_detect(word, pattern = "you’re")
    ) %>% 
  count(word, sort = TRUE)

## Joining, by = "word"

## # A tibble: 8 x 2
##   word          n
##   <chr>     <int>
## 1 desire        1
## 2 earthly       1
## 3 existence     1
## 4 fascinate     1
## 5 influence     1
## 6 inspire       1
## 7 object        1
## 8 reason        1

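6.2.4 Modifying the stop-word lexicon 2

Some commonly used stop words actually provide important clues to the meaning of a document. Pronouns such as “you”, “me”, and “my” are classified as stop words because they are so common, yet they can play an important role in understanding context. Let's find the first-person pronouns in the stop-word lexicon and remove them from it.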

stop_words$word %>% 
  str_subset("(^i$|^i[:punct:]+|^mys*|^me$|^mine$)")

##  [1] "i"      "i'd"    "i'll"   "i'm"    "i've"   "me"     "my"     "myself"
##  [9] "i"      "me"     "my"     "myself" "i'm"    "i've"   "i'd"    "i'll"  
## [17] "i"      "me"     "my"     "myself"

stop_words3 <- stop_words %>% 
  filter(
    !str_detect(word, "(^i$|^i[:punct:]+|^mys*|^me$|^mine$)")
    )
stop_words3$word %>% 
  str_subset("^i")

##  [1] "ie"          "if"          "ignored"     "immediate"   "in"         
##  [6] "inasmuch"    "inc"         "indeed"      "indicate"    "indicated"  
## [11] "indicates"   "inner"       "insofar"     "instead"     "into"       
## [16] "inward"      "is"          "isn't"       "it"          "it'd"       
## [21] "it'll"       "it's"        "its"         "itself"      "it"         
## [26] "its"         "itself"      "is"          "it's"        "isn't"      
## [31] "if"          "into"        "in"          "if"          "important"  
## [36] "in"          "interest"    "interested"  "interesting" "interests"  
## [41] "into"        "is"          "it"          "its"         "itself"
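The original stops at inspecting stop_words3. As an illustrative follow-up, the sketch below applies it to the letter, with the text re-declared and stop_words3 rebuilt from the same pattern so the block stands alone; pron_df is a name introduced only here.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Same text as text_v in this chapter; \u2019 is the curly apostrophe.
text_v <- "You still fascinate and inspire me.
You influence me for the better. 
You\u2019re the object of my desire, the #1 Earthly reason for my existence."

# Lexicon without the first-person pronouns (same pattern as above).
stop_words3 <- stop_words %>%
  filter(!str_detect(word, "(^i$|^i[:punct:]+|^mys*|^me$|^mine$)"))

# "me" and "my" now survive the anti-join; "you" is still removed
# because the pattern covers first-person words only.
pron_df <- tibble(text = text_v) %>%
  unnest_tokens(output = word, input = text) %>%
  anti_join(stop_words3, by = "word") %>%
  count(word, sort = TRUE)
pron_df
```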

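6.2.5 Building a stop-word list

To keep the set of removed words to a minimum, you can also build the stop-word list yourself. Let's clean the text with a list containing “the”, “for”, and “and”. Store the list in a data frame, then anti-join it with the token data frame using anti_join().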

stop_df <- tibble(word = c("the","for", "and"))

tibble(text = text_v) %>% 
  unnest_tokens(output = word, input = text) %>% 
  anti_join(stop_df) %>% 
  filter(!str_detect(word, "\\d")) %>% 
  count(word, sort = TRUE)

## Joining, by = "word"

## # A tibble: 15 x 2
##    word          n
##    <chr>     <int>
##  1 me            2
##  2 my            2
##  3 you           2
##  4 better        1
##  5 desire        1
##  6 earthly       1
##  7 existence     1
##  8 fascinate     1
##  9 influence     1
## 10 inspire       1
## 11 object        1
## 12 of            1
## 13 reason        1
## 14 still         1
## 15 you’re        1

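6.3 Normalization

6.3.1 Stemming and lemmatization

“me” and “my”, and likewise “you” and “you’re”, differ in form but share the same meaning. Since each pair means the same thing, the forms should be grouped together. This can be done with existing packages or with regular expressions. Korean text is covered in <Cleaning 2>; this section focuses on English.

Using packages

Stemming packages

Lemmatization packages

Morpheme-extraction packages

Using regular expressions

Make a list of the derived word forms, then replace the words on the list with a representative form, for example with str_replace_all() or recode(). Let's change the words related to “I” to “ME” and the words related to “you” to “YOU”.

Convert all uppercase letters in the text to lowercase with str_to_lower() from the stringr package. The default for locale = is “en” (English); it can be set to another language, such as “tr” for Turkish, when needed.

For other case conversions of English letters, see the “Convert case of a string” entry in the stringr package documentation.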

txt_low_v <- c("You still fascinate and inspire me.
You influence me for the better. 
You’re the object of my desire, the #1 Earthly reason for my existence.") %>% 
  str_to_lower()

Build the word patterns as regular expressions and join them with the | operator into a single string.

meword_v <- c("^i$|^i[:punct:]+|^mys*|^me$|^mine$")
youword_v <- c("^you$|^you[:punct:]+|^yours*")

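The recode function in dplyr changes values in a data frame. Because recode works on vectors, use it together with the mutate function to change the values of a data frame column. (Note the argument order: recode(vector, old value = new value).)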

tibble(text = text_v) %>%
  unnest_tokens(output = word, input = text) %>% 
  mutate( word = recode(word, my = "me", `you’re` = "you") )%>% 
  anti_join(stop_df) %>% 
  count(word, sort = TRUE)

## Joining, by = "word"

## # A tibble: 14 x 2
##    word          n
##    <chr>     <int>
##  1 me            4
##  2 you           3
##  3 1             1
##  4 better        1
##  5 desire        1
##  6 earthly       1
##  7 existence     1
##  8 fascinate     1
##  9 influence     1
## 10 inspire       1
## 11 object        1
## 12 of            1
## 13 reason        1
## 14 still         1

“me”, the word referring to “I”, was used 4 times, and the word referring to “you” 3 times.
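The recode() call lists each form by hand and does not use the patterns meword_v and youword_v defined earlier. One way to apply those patterns, offered as a sketch rather than as the original method, is to test each token with str_detect() inside case_when(). The text and patterns are re-declared so the block stands alone; rep_df is a name introduced only for this illustration.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Same text as text_v in this chapter; \u2019 is the curly apostrophe.
text_v <- "You still fascinate and inspire me.
You influence me for the better. 
You\u2019re the object of my desire, the #1 Earthly reason for my existence."

# Same patterns as meword_v and youword_v above.
meword_v <- "^i$|^i[:punct:]+|^mys*|^me$|^mine$"
youword_v <- "^you$|^you[:punct:]+|^yours*"

# Replace the whole token, not just the matched part, so that a token
# such as "myself" would become "ME" rather than "MEelf".
rep_df <- tibble(text = text_v) %>%
  unnest_tokens(output = word, input = text) %>%
  mutate(word = case_when(
    str_detect(word, meword_v)  ~ "ME",
    str_detect(word, youword_v) ~ "YOU",
    TRUE                        ~ word
  )) %>%
  count(word, sort = TRUE)
rep_df
```

Using case_when() instead of str_replace_all() avoids partial replacements, because the patterns are anchored only at the start of some alternatives.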

This time, let's modify the existing stop-word lexicon instead. Remove “you”, “you're”, “me”, and “my” from stop_words, using the filter function in dplyr to drop the rows that hold those values. !word %in% c("you","you're", "me", "my") keeps only the rows whose word column is not one of "you", "you're", "me", "my": %in% tests whether a value is contained in the set, and ! negates that test.

stop_md <- stop_words
stop_md <- stop_md %>% filter( !word %in% c("you","you're", "me", "my") ) 
tibble(text = text_v) %>%
  unnest_tokens(output = word, input = text) %>% 
  mutate( word = recode(word, `my` = "me", `you’re` = "you") )%>% 
  anti_join(stop_md) %>% 
  count(word, sort = TRUE)

## Joining, by = "word"

## # A tibble: 11 x 2
##    word          n
##    <chr>     <int>
##  1 me            4
##  2 you           3
##  3 1             1
##  4 desire        1
##  5 earthly       1
##  6 existence     1
##  7 fascinate     1
##  8 influence     1
##  9 inspire       1
## 10 object        1
## 11 reason        1

The word frequencies were computed without removing “you”, “you’re”, “me”, and “my”.

6.3.1.1 `map()`

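Let's also look up the words related to “I” in the stop-word lexicon. Because the words related to “I” include “my” as well, use map() from the purrr package to run str_subset() once for each pattern. map() plays a role similar to the base R apply family, lapply() in particular. Since map() returns a list, unlist() converts the result to a vector.

map_chr() returns a character vector directly, but only when each call returns a single string.

In regular expressions (regex), ^ marks the start of a string and $ marks its end. When a pattern has ^ in front and $ at the end, only strings that match it exactly are found.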

library(tidyverse)
pattern <- c("i'", "my", "^me$")
i_v <- map(pattern,
    str_subset, 
    string = stop_words$word) %>% 
  unlist()
i_v

##  [1] "i'd"    "i'll"   "i'm"    "i've"   "i'm"    "i've"   "i'd"    "i'll"  
##  [9] "my"     "myself" "my"     "myself" "my"     "myself" "me"     "me"    
## [17] "me"
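The effect of the ^ and $ anchors can be checked on a small vector. The words below are made up for illustration and are not taken from the stop-word lexicon.

```r
library(stringr)

# Hypothetical words that all contain the letters "me".
words_v <- c("me", "meet", "some", "home")

any_v   <- str_subset(words_v, "me")    # "me" anywhere: all four words
start_v <- str_subset(words_v, "^me")   # starts with "me": "me", "meet"
end_v   <- str_subset(words_v, "me$")   # ends with "me": "me", "some", "home"
exact_v <- str_subset(words_v, "^me$")  # exactly "me": "me"

exact_v
```

This is why the pattern for “me” is written as "^me$": without the anchors it would also pick up any stop word that merely contains those letters.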

Let's remove all the words related to “I” and “you” from the stop-word lexicon.

library(tidytext)
library(stringr)

# Collect the "you"-related stop words (you, you'd, your, yourself, ...).
# you_v was used but never defined; this definition is an assumption, and it
# is anchored so that "young", "younger" and "youngest" are not caught.
you_v <- str_subset(stop_words$word, "^you$|^you'|^your")
you_v %>% unique()

# Drop the "I"- and "you"-related words from the lexicon, then check the result.
stop_md2 <- stop_words %>% filter( !word %in% c(i_v, you_v) ) 
str_subset(stop_md2$word, "you")
str_subset(stop_md2$word, "my")

text_tk <- tibble(text = text_v) %>%
  unnest_tokens(output = word, input = text) 

# Join the words with | into one pattern anchored to the whole token.
i2_v <- str_c("^(", str_c(unique(i_v), collapse = "|"), ")$")
# The lexicon spells "you're" with ', the letter with ’, so add the latter.
you2_v <- str_c("^(", str_c(unique(c(you_v, "you’re")), collapse = "|"), ")$")

# Replace every matching token with its representative form.
text_tk2 <- text_tk %>% 
  mutate(word = word %>% 
           str_replace_all(i2_v, "ME") %>% 
           str_replace_all(you2_v, "YOU"))

# The same approach removes be-verbs; \\b keeps the "is" inside "this" intact.
v_be <- regex("\\b(am|are|is|was|were)\\b", ignore_case = TRUE)
mytxt <- c("I am a boy. You are a boy.")
str_replace_all(mytxt, v_be, "")

# Clean with the modified lexicon and count the words.
text_tk2 %>% 
  anti_join(stop_md2) %>% 
  count(word, sort = TRUE)