7.2 Language in NLP
corpus: a collection of documents
documents: the individual texts that make up a corpus, e.g., single tweets, single statements, single text files, etc.
tokenization: the process of splitting text into the chosen units of analysis (referred to as tokens), e.g., single words, sequences of words, or entire sentences
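A minimal word-level tokenization sketch using only the Python standard library (real projects often use dedicated tokenizers, e.g., from spaCy or NLTK; the regex used here is an illustrative assumption):

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

print(tokenize("I'm not happy and I don't like it!"))
# → ["i'm", 'not', 'happy', 'and', 'i', "don't", 'like', 'it']
```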
bag of words (method): approach where all tokens are put together in a “bag” without considering their order (alternatives: bigrams (word pairs), word embeddings)
- possible issue with a simple bag of words: in “I’m not happy and I don’t like it!”, the negations (“not”, “don’t”) are separated from the words they modify, so the negated meaning is lost
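The order problem above can be sketched in a few lines: two sentences with opposite meanings produce identical bags because only token counts are kept (a stdlib-only sketch; `bag_of_words` is an illustrative helper name):

```python
from collections import Counter

def bag_of_words(tokens):
    """Count token frequencies, discarding all word order."""
    return Counter(tokens)

a = bag_of_words("the dog bit the man".split())
b = bag_of_words("the man bit the dog".split())
print(a == b)  # True: identical bags, although the meanings differ
```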
stop words: very common but uninformative terms such as “the”, “and”, “they”, etc.
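Stop-word removal is typically a simple filter against a predefined list (the tiny list below is an illustrative assumption; NLP libraries ship much longer curated lists):

```python
# Tiny illustrative stop-word list; real lists contain hundreds of terms.
STOP_WORDS = {"the", "and", "they", "a", "is", "it", "i"}

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

print(remove_stop_words(["the", "dog", "and", "the", "cat"]))
# → ['dog', 'cat']
```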
document-term/feature matrix (DTM/DFM): common format to store text data (examples on the next slides)
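Ahead of the examples on the next slides, a minimal sketch of how a DTM can be built: one row per document, one column per vocabulary term, each cell holding a token count (stdlib-only; libraries such as scikit-learn and quanteda do this for you):

```python
from collections import Counter

docs = ["the cat sat", "the dog sat", "the dog barked"]
tokenized = [d.split() for d in docs]

# Columns: the sorted vocabulary across all documents.
vocab = sorted({t for doc in tokenized for t in doc})

# Rows: per-document counts of each vocabulary term.
dtm = [[Counter(doc)[term] for term in vocab] for doc in tokenized]

print(vocab)   # ['barked', 'cat', 'dog', 'sat', 'the']
for row in dtm:
    print(row)
# [0, 1, 0, 1, 1]
# [0, 0, 1, 1, 1]
# [1, 0, 1, 0, 1]
```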