In text analysis, "topic" is a general term that covers several related ideas: the main theme of a text, its subject matter, or the category to which it belongs.
Topic and LDA
The term is also used more specifically in "topic model," a kind of statistical model commonly used in text analytics. The best-known topic model, latent Dirichlet allocation (LDA), automatically discovers the topics present in a collection of texts. LDA is a probabilistic model that takes a collection of documents as input, together with the number of topics to look for, and produces a set of topics as output. Each topic is a probability distribution over the words in the vocabulary, and each document is represented as a probability distribution over the topics.
LDA can be used to find out what the major themes of a text are, to automatically group documents into categories, or to determine which documents are most similar to each other.
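The two kinds of distribution described above can be made concrete with a small sketch of the generative story that LDA assumes. This is not a fitting procedure: it only samples topics and documents in the forward direction, whereas fitting LDA to real text works backwards from observed words to these distributions. The vocabulary, the number of topics, and the Dirichlet parameter 0.5 are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and sizes, chosen only for illustration.
vocab = ["goal", "match", "team", "vote", "party", "election"]
n_topics = 2
n_docs = 3
doc_length = 8

# Each topic is a probability distribution over the vocabulary.
# Shape: (n_topics, vocabulary size); every row sums to 1.
topic_word = rng.dirichlet(np.full(len(vocab), 0.5), size=n_topics)

# Each document is a probability distribution over the topics.
# Shape: (n_docs, n_topics); every row sums to 1.
doc_topic = rng.dirichlet(np.full(n_topics, 0.5), size=n_docs)


def generate_document(theta, length):
    """Sample words for one document given its topic mixture theta."""
    words = []
    for _ in range(length):
        z = rng.choice(n_topics, p=theta)              # pick a topic for this word
        w = rng.choice(len(vocab), p=topic_word[z])    # pick a word from that topic
        words.append(vocab[w])
    return words


docs = [generate_document(doc_topic[d], doc_length) for d in range(n_docs)]

for d, words in enumerate(docs):
    print(f"doc {d}: topic mixture {np.round(doc_topic[d], 2)} -> {' '.join(words)}")
```

In practice one would start from real documents and use a library implementation to estimate topic_word and doc_topic; the sketch only shows what those two outputs mean.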
Topic and Embedding
Topics also come up in the context of word embeddings. A word embedding maps each word to a vector of real numbers that captures aspects of the word's meaning. For example, words that tend to appear in similar contexts end up with vectors that are close together in the embedding space.
One way of creating word embeddings is to train a neural network to learn the mapping from words to vectors. A well-known example of this approach is word2vec.
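Closeness between embedding vectors is commonly measured with cosine similarity. The sketch below uses three hand-written, hypothetical 3-dimensional vectors rather than trained embeddings, so the numbers carry no real meaning; it only shows how "close together" can be computed.

```python
import numpy as np

# Hypothetical embeddings, invented for illustration (not from a trained model).
embeddings = {
    "cat": np.array([0.9, 0.8, 0.1]),
    "dog": np.array([0.8, 0.9, 0.2]),
    "car": np.array([0.1, 0.2, 0.9]),
}


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 1.0 means same direction."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def most_similar(word):
    """Return the other word whose vector is closest to the given word's vector."""
    others = [w for w in embeddings if w != word]
    return max(others, key=lambda w: cosine_similarity(embeddings[word], embeddings[w]))


print(most_similar("cat"))
```

With real embeddings the dictionary would hold vectors with tens or hundreds of dimensions, but the comparison works the same way.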
Topic and LSA
Another way of creating word embeddings is a technique called latent semantic analysis (LSA). LSA starts from a term-document matrix that records how words are distributed across documents. The rows of the matrix represent words, and the columns represent documents. Each entry counts how often a given word appears in a given document.
LSA creates word embeddings by taking the singular value decomposition (SVD) of the term-document matrix and keeping only the components with the largest singular values. The truncated decomposition gives each word a short vector, with one coordinate per retained component. These vectors can be used as word embeddings.
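The steps above can be sketched with NumPy on a toy corpus. The four sentences, the whitespace tokenization, and the choice of k = 2 retained components are assumptions made for illustration; real uses of LSA typically involve far larger corpora and some reweighting of the raw counts.

```python
import numpy as np

# Toy corpus, invented for illustration.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "stocks fell as markets closed",
    "markets rose as stocks rallied",
]

# Build the vocabulary and a word -> row index lookup.
vocab = sorted({w for d in docs for w in d.split()})
word_index = {w: i for i, w in enumerate(vocab)}

# Term-document matrix: rows are words, columns are documents,
# entries are raw counts of each word in each document.
X = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        X[word_index[w], j] += 1

# Singular value decomposition: X = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep the k largest singular values; each word gets a k-dimensional vector.
k = 2
word_vectors = U[:, :k] * s[:k]


def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


cat, dog, stocks = (word_vectors[word_index[w]] for w in ("cat", "dog", "stocks"))
print("cat vs dog:   ", round(cosine_similarity(cat, dog), 3))
print("cat vs stocks:", round(cosine_similarity(cat, stocks), 3))
```

Scaling the rows of U by the singular values is one common convention for the word vectors; using the rows of U unscaled is another.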
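LSA is a linear-algebra technique based on matrix factorization, whereas LDA is a probabilistic model. Both produce low-dimensional representations of text: LSA yields word vectors directly, while LDA yields topic distributions, and a word's probabilities across the topics can serve as a vector representation of that word, although LDA is less commonly used for this purpose.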