10.2 Language in NLP

  • corpus: a collection of documents
  • document: a single unit of text within the corpus, e.g., a tweet, a statement, or a text file
  • tokenization: “the process of splitting text into tokens” (Silge 2017); it also fixes the unit of analysis (the token), e.g., single words, sequences of words, or entire sentences
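  • bag of words (method): an approach in which all tokens of a document are put together in a “bag” without regard to their order (alternatives: bigrams (word pairs), word embeddings)
    • possible issue with a simple bag of words: negation is detached from the words it modifies, e.g., in “I’m not happy and I don’t like it!” the tokens “happy” and “like” are counted just as they would be in a positive statement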
  • stop words: very common but uninformative terms such as “the”, “and”, or “they”
  • document-term/feature matrix (DTM/DFM): common format for storing text data, with one row per document, one column per term/feature, and counts in the cells (examples later)
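
The terms above can be illustrated with a minimal sketch in plain Python. The two-document corpus, the tiny stop-word list, and the helper names `tokenize` and `bag_of_words` are assumptions made for illustration only; they are not taken from Silge (2017) or from any particular library.

```python
import re
from collections import Counter

# Hypothetical toy corpus: each string is one document.
corpus = [
    "I'm not happy and I don't like it!",
    "I'm happy and I like it!",
]

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"i'm", "i", "and", "it", "the", "they"}


def tokenize(text):
    """Lowercase the text and split it into word tokens (apostrophes kept)."""
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words(text, stop_words=frozenset()):
    """Count the tokens of one document, ignoring their order."""
    return Counter(t for t in tokenize(text) if t not in stop_words)


# One bag per document, with stop words removed.
bags = [bag_of_words(doc, STOP_WORDS) for doc in corpus]

# Document-term matrix: one row per document, one column per term.
vocabulary = sorted(set().union(*bags))
dtm = [[bag[term] for term in vocabulary] for bag in bags]

print(vocabulary)  # ["don't", 'happy', 'like', 'not']
for row in dtm:
    print(row)     # [1, 1, 1, 1] then [0, 1, 1, 0]
```

In this sketch both documents have a count of 1 for “happy” and “like”; only the separate columns for “not” and “don't” distinguish them, and nothing in the bag records which word each negation belongs to.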


Silge, Julia. 2017. Text Mining with R: A Tidy Approach. First edition. Beijing, China.