10.2 Language in NLP
- corpus: a collection of documents
- documents: single tweets, single statements, single text files, etc.
- tokenization: “the process of splitting text into tokens” (Silge 2017); more generally, the step that defines the unit of analysis (the token), e.g., single words, sequences of words, or entire sentences (see the tokenization sketch after this list)
- bag of words (method): an approach in which all tokens are thrown into one “bag” without regard to their order (alternatives that preserve more context: bigrams (word pairs), word embeddings); a bag-of-words count is sketched after this list
- possible issues with a simple bag of words: in “I’m not happy and I don’t like it!”, discarding word order detaches the negations from “happy” and “like”, so the sentence can look positive (see the bigram sketch below)
- stop words: very common but uninformative terms such as “the”, “and”, “they”, etc., usually removed before analysis (shown in the bag-of-words sketch below)
- document-term/feature matrix (DTM/DFM): the most common format for storing processed text data: one row per document, one column per term/feature, with cells holding the (weighted) counts (a casting sketch follows this list; more examples later)
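To make the definitions concrete, here is a minimal sketch of tokenization in R, assuming the tidytext package described in Silge (2017); the two-document `corpus` tibble is invented for illustration.

```r
library(dplyr)
library(tidytext)

# Toy corpus: one row per document (here, one tweet each)
corpus <- tibble(
  doc_id = c(1, 2),
  text   = c("I'm not happy and I don't like it!",
             "They say the new release is happy news.")
)

# Tokenize into single words: unnest_tokens() lowercases and
# strips most punctuation by default, one token per output row
tokens <- corpus %>%
  unnest_tokens(word, text)
```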
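Counting those tokens per document, with their order discarded, yields a bag-of-words representation; tidytext ships a built-in `stop_words` table that can be removed with an anti-join. A sketch, continuing the toy example above:

```r
# Drop stop words, then count the remaining tokens per document;
# word order plays no role anymore. Note that "not" is itself a
# stop word in standard lexicons, compounding the negation issue
word_counts <- tokens %>%
  anti_join(stop_words, by = "word") %>%
  count(doc_id, word, sort = TRUE)
```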
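One common way to soften the negation problem is to tokenize into bigrams instead of single words, so that pairs such as “not happy” survive as features; again a sketch using tidytext’s n-gram tokenizer:

```r
# Word pairs instead of single words: "not happy" and
# "don't like" now remain visible as features
bigrams <- corpus %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)
```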
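Finally, the per-document counts can be cast into the matrix format from the last bullet; tidytext’s `cast_dtm()` builds a tm-style DTM and `cast_dfm()` a quanteda-style DFM (assuming the tm or quanteda package is installed, respectively):

```r
# One row per document, one column per term, cells = counts
dtm <- word_counts %>%
  cast_dtm(document = doc_id, term = word, value = n)
```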
References
Silge, Julia, and David Robinson. 2017. Text Mining with R: A Tidy Approach. First edition. Sebastopol, CA: O’Reilly Media.