Latent Dirichlet Allocation (LDA) is a generative mixed-membership model for topic recognition. In LDA, each document is assigned a distribution over a set of latent topics and, in turn, each topic is assigned a distribution over the corpus vocabulary. We use collapsed Gibbs sampling to draw samples from the LDA posterior. We then train on half of the pages of ten classic novels and perform inference on the remaining half. Using the topic distribution as a signature for classification, we run a nearest-neighbor search that matches each inferred test-set topic distribution to the closest entry in our trained dictionary of topic distributions. We achieve 100% accuracy in recovering the true labels of the test set.
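To make the sampling step concrete, the following is a minimal sketch of a collapsed Gibbs sampler for LDA in plain Python. It is not the implementation used for the reported experiments. The function name `collapsed_gibbs_lda`, the symmetric hyperparameters `alpha` and `beta`, their default values, and the iteration count are all assumptions chosen for illustration. Each token's topic is resampled from the standard full conditional, which is proportional to (document-topic count + alpha) × (topic-word count + beta) / (topic total + V × beta).

```python
import random

def collapsed_gibbs_lda(docs, n_topics, vocab_size,
                        alpha=0.1, beta=0.01, n_iters=200, seed=0):
    """Sketch of collapsed Gibbs sampling for LDA (illustrative only).

    docs: list of documents, each a list of word ids in [0, vocab_size).
    Returns (theta, phi): per-document topic distributions and
    per-topic word distributions estimated from the final sample.
    """
    rng = random.Random(seed)
    n_dk = [[0] * n_topics for _ in docs]               # topic counts per document
    n_kw = [[0] * vocab_size for _ in range(n_topics)]  # word counts per topic
    n_k = [0] * n_topics                                # total tokens per topic
    z = []                                              # topic assignment per token

    # Random initialisation of topic assignments.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(z_d)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current token from all counts.
                k = z[d][i]
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # Unnormalised full conditional over topics for this token.
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][w] + beta)
                    / (n_k[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                # Add the token back under its newly sampled topic.
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1

    theta = [
        [(n_dk[d][t] + alpha) / (len(doc) + n_topics * alpha)
         for t in range(n_topics)]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(n_kw[t][w] + beta) / (n_k[t] + vocab_size * beta)
         for w in range(vocab_size)]
        for t in range(n_topics)
    ]
    return theta, phi
```

Here `theta[d]` plays the role of the topic distribution of document `d`, and `phi[k]` is the word distribution of topic `k`. A practical run would typically discard burn-in iterations and average over several samples, which this sketch omits for brevity.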
We hypothesize that thematic content can be used as a signal for recovering a document's label. This hypothesis rests on the assumption that the topics of a book remain consistent throughout it. To test it, we use LDA to identify the topic distributions of excerpts from ten classic novels publicly available on Project Gutenberg.
These novels are:
Beowulf, The Divine Comedy by Dante Alighieri, Dracula by Bram Stoker, Frankenstein by Mary Shelley, The Adventures of Huckleberry Finn by Mark Twain, Moby Dick by Herman Melville, Sherlock Holmes by Sir Arthur Conan Doyle, A Tale of Two Cities by Charles Dickens, The Republic by Plato, Ulysses by James Joyce.
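The label-recovery idea can be illustrated with a small nearest-neighbor sketch over topic distributions. The distance metric is not specified here, so the Hellinger distance below is an assumption, as are the helper names `hellinger` and `nearest_label`. The three-topic distributions are made-up numbers for illustration, not results from the experiments.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete distributions (assumed metric)."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2
                               for a, b in zip(p, q)))

def nearest_label(query, trained):
    """Return the label whose stored topic distribution is closest to `query`.

    trained: dict mapping a book label to its topic distribution.
    """
    return min(trained, key=lambda label: hellinger(query, trained[label]))

# Hypothetical trained dictionary of topic distributions (illustrative values).
trained = {
    "Dracula": [0.7, 0.2, 0.1],
    "Moby Dick": [0.1, 0.8, 0.1],
    "The Republic": [0.1, 0.1, 0.8],
}

# Hypothetical topic distribution inferred for a held-out excerpt.
query = [0.6, 0.3, 0.1]
predicted = nearest_label(query, trained)
```

If the topics of a book are consistent across its pages, the distribution inferred for a held-out excerpt should lie closest to the stored distribution of the book it came from.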