10.2 Language in NLP

corpus: a collection of documents
document: a single unit of text within the corpus, e.g., a single tweet, a single statement, or a single text file
tokenization: “the process of splitting text into tokens” (Silge 2017); it also involves defining the unit of analysis (the token), e.g., single words, sequences of words, or entire sentences
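bag of words (method): approach where all tokens of a document are put together in a “bag” without considering their order (alternatives that retain some context: bigrams (word pairs), word embeddings)
- possible issue with a simple bag of words: in “I’m not happy and I don’t like it!”, the negations are separated from “happy” and “like”, so the sentence looks much like a positive one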
stop words: very common but uninformative terms such as “the”, “and”, “they”, etc.
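document-term/feature matrix (DTM/DFM): common format to store text data, with one row per document, one column per term (feature), and cells that typically hold term counts (examples later)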
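
The terms above can be illustrated with a minimal Python sketch that uses only the standard library. It is not taken from Silge (2017): the two-document toy corpus, the very short stop word list, and the helper names (`tokenize`, `bag_of_words`, `bigrams`, `document_term_matrix`) are all assumptions made for illustration, and real projects would normally rely on a dedicated text-mining package.

```python
import re
from collections import Counter

# Toy corpus: each string is one document (hypothetical example texts).
corpus = [
    "I'm not happy and I don't like it!",
    "I'm happy and I like it!",
]

# Small illustrative stop word list (an assumption); real lists are much longer.
STOP_WORDS = {"i'm", "i", "and", "it", "the", "they"}


def tokenize(text):
    """Lowercase the text and split it into word tokens (apostrophes kept)."""
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words(tokens, stop_words=frozenset()):
    """Count tokens, ignoring their order and dropping stop words."""
    return Counter(t for t in tokens if t not in stop_words)


def bigrams(tokens):
    """Return adjacent token pairs, which retain some word order."""
    return list(zip(tokens, tokens[1:]))


def document_term_matrix(bags):
    """Build a DTM: one row per document, one column per term."""
    vocabulary = sorted(set().union(*bags))
    matrix = [[bag[term] for term in vocabulary] for bag in bags]
    return vocabulary, matrix


tokens_per_doc = [tokenize(doc) for doc in corpus]
bags = [bag_of_words(tokens, STOP_WORDS) for tokens in tokens_per_doc]
vocabulary, dtm = document_term_matrix(bags)

print(vocabulary)                      # column labels of the DTM
for row in dtm:
    print(row)                         # one row of counts per document
print(bigrams(tokens_per_doc[0])[:3])  # first three word pairs of document 1
```

In this sketch the two rows of the matrix differ only in the columns for "not" and "don't"; nothing in the bag of words ties "not" to "happy". The bigram ("not", "happy") keeps that link, which is why word pairs are listed as an alternative.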

Silge, Julia. 2017. Text Mining with R: A Tidy Approach. First edition. Beijing, China.