7.2 Language in NLP

  • corpus: a collection of documents

  • documents: the individual texts making up a corpus, e.g., single tweets, single statements, single text files, etc.

  • tokenization: splitting text into the units of analysis, e.g., single words, sequences of words, or entire sentences (the resulting units are called tokens)

  • bag of words (method): approach where all tokens are put together in a “bag” without considering their order (alternatively: bigrams (word pairs), word embeddings)

    • possible issues with a simple bag-of-words model: in “I’m not happy and I don’t like it!”, the negations (“not”, “don’t”) are separated from the words they modify, so the sentence looks much like a positive statement

  • stop words: very common but uninformative terms such as “the”, “and”, “they”, etc.

  • document-term/feature matrix (DTM/DFM): common format to store text data (examples on the next slides)
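
The terms above can be sketched end to end in a few lines of standard-library Python. The toy corpus, the tiny stop-word list, and the punctuation handling below are illustrative assumptions, not part of the slides; real stop-word lists (e.g., those shipped with NLTK or quanteda) are much longer:

```python
from collections import Counter

# Toy corpus: each document is one short text (a tweet, a statement, ...).
corpus = [
    "I'm happy and I like it!",
    "I'm not happy and I don't like it!",
]

# Illustrative stop-word list (assumption; real lists are far longer).
stop_words = {"the", "and", "they", "i", "it", "i'm"}

def tokenize(document):
    """Lowercase, strip punctuation, split into word tokens, drop stop words."""
    tokens = document.lower().replace("!", "").replace(",", "").split()
    return [t for t in tokens if t not in stop_words]

tokenized = [tokenize(doc) for doc in corpus]

# Document-term matrix: one row per document, one column per vocabulary term;
# token order is discarded, only counts remain (the "bag of words").
vocabulary = sorted({term for doc in tokenized for term in doc})
counts = [Counter(doc) for doc in tokenized]
dtm = [[c[term] for term in vocabulary] for c in counts]

for doc, row in zip(corpus, dtm):
    print(row, doc)
```

Note how the issue from the bullet above shows up here: “happy” is counted once in both rows, so the two documents look similar even though they express opposite sentiments; only the detached negation terms distinguish them.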