Find Themes in Text | Topic Analysis Tool

How to Find Themes in Text with Latent Dirichlet Allocation

What is the theme, or topic, of a large corpus or body of work? Finding themes in text is a common academic question, but in a large text it is difficult to analyze multiple categories by hand. Luckily, we can find the answer by looking at the question mathematically: not by sorting the individual words themselves (that would be difficult and lead to errors in classification), but by looking at their statistical relationships to each other.

How do we do this? The best way to conceptualize the problem is to think of standing in a room with some boxes. Take the text and cut it into individual words. Then, and here’s the trick, treat the words as units with no meaning of their own and dump them randomly into the boxes. The point is that the boxes start out filled with units that have no real meaning. Our tool analyzes these units only statistically, but it remembers where each one came from in the text.
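The box picture above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual code: the function names `tokenize` and `random_boxes`, the regular expression, and the fixed seed are all assumptions made for the example.

```python
import random
import re

def tokenize(text):
    """Lowercase the text and cut it into individual word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def random_boxes(tokens, num_boxes, seed=0):
    """Drop every token into a random box, remembering where it came from.

    Each entry is (position in text, word, box index).
    """
    rng = random.Random(seed)  # fixed seed only so the example is repeatable
    return [(i, word, rng.randrange(num_boxes)) for i, word in enumerate(tokens)]

tokens = tokenize("The river runs deep. Water fills the river.")
assignments = random_boxes(tokens, num_boxes=3)
```

At this point the boxes carry no meaning at all. Any structure only appears once the words are re-sorted by how they occur together.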

Take the word river as a data point. Our text talks a lot about rivers, so the tool sees it quite often. Let’s say water is another data point. Water and river appear together often, so our tool decides to put them in the same box. Astonishingly, it repeats the same calculation millions of times.

Our tool does another thing to help with sorting accuracy. Topics are sorted by closeness of association, but they can also be sorted by novelty. In our text above, river and water go together, but perhaps there is also a section on Christmas trees. In all likelihood, river and water won’t go with Christmas, but Christmas tree will perhaps be associated with gift, family, Santa and so on. Our tool would identify the Christmas data point as a topic and treat its associated words as a unit.

The math is like this.

lift = (number of times the word is in the topic box) / (number of times the word is in the whole text)

Lift compares how often a word appears in one topic box with how often it appears in the whole text. The higher the ratio, the more that word belongs to that topic and not the others. In our example, Christmas will most likely be rare in a river fishing box, so its lift there would be low.

A word that appears 30 times in topic 3 but 200 times across all topics gets a modest lift (30/200 = 0.15). A word that appears 20 times in topic 3 but only 25 times in total gets a huge lift (20/25 = 0.8); it’s basically owned by that topic. The highest-lift words become the topic’s label.
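The two cases above work out as follows. This is a small sketch of the lift formula exactly as stated in this article; the function name `lift` is only illustrative.

```python
def lift(count_in_topic, count_in_text):
    """Share of a word's total occurrences that landed in one topic box."""
    if count_in_text <= 0:
        raise ValueError("the word must appear at least once in the text")
    return count_in_topic / count_in_text

modest = lift(30, 200)  # 30 of 200 occurrences are in topic 3
strong = lift(20, 25)   # 20 of 25 occurrences are in topic 3

print(f"modest lift: {modest:.2f}")  # modest lift: 0.15
print(f"strong lift: {strong:.2f}")  # strong lift: 0.80
```

With this definition lift can never exceed 1.0, which would mean every occurrence of the word sits in a single topic box.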

What is interesting from a math perspective is that the tool assumes there will be order, but it does not know in advance what the statistical distribution will be. It just assumes that there will be topics. It then takes the parts and starts sorting to bring order back to the text, making hundreds of judgements to realign word associations and to see which words, by frequency, go together.
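One common way to carry out this repeated re-sorting is collapsed Gibbs sampling. The sketch below is an assumption about how such a sorter could work, not a description of this tool's internals. It also assumes the text has already been split into chunks such as paragraphs, and the smoothing values `alpha` and `beta` are conventional illustrative choices.

```python
import random

def gibbs_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over a list of tokenized chunks.

    Returns one {word: count} dict per topic box.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]      # tokens per box in each chunk
    topic_word = [dict() for _ in range(num_topics)]  # word counts per box
    topic_total = [0] * num_topics                    # total tokens per box
    assign = []                                       # box of every token

    # Start by dumping every word into a random box.
    for d, doc in enumerate(docs):
        boxes = []
        for w in doc:
            z = rng.randrange(num_topics)
            boxes.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] = topic_word[z].get(w, 0) + 1
            topic_total[z] += 1
        assign.append(boxes)

    # Repeatedly take each word out of its box and re-sort it.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # A box is more likely if this chunk already uses it
                # and if it already holds many copies of this word.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k].get(w, 0) + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] = topic_word[z].get(w, 0) + 1
                topic_total[z] += 1

    # Drop words whose count in a box fell back to zero.
    return [{w: c for w, c in box.items() if c > 0} for box in topic_word]

docs = [
    ["river", "water", "fish", "river", "water"],
    ["christmas", "tree", "gift", "santa", "family"],
    ["water", "river", "boat", "fish"],
    ["gift", "christmas", "family", "tree"],
]
topics = gibbs_lda(docs, num_topics=2, iterations=50)
```

Because the sampler is random, which box ends up holding the river words and which holds the Christmas words can differ between seeds. More iterations usually give the boxes time to settle.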

Topic Discovery

Latent Dirichlet Allocation — finds the hidden thematic structure in large texts

Text Input
Exclude Words

A standard English stopword list runs silently. Add domain-specific words you want filtered out.

Number of Topics
5

Start with 5. Adjust after seeing first results.

Words per Topic
10

Top N words shown per topic.

Iterations
200

More iterations give more stable results but run slower. 200 is a solid default.

Discovered Topics — auto-labeled by lift, click to rename

How to Use the Tool

Paste your text into the input box. The longer the better: a few paragraphs will work, but a full chapter works better, because more text gives more reliable word statistics.

Set the number of topic boxes. Five is a good starting number; too many can return scattered word results.

Enter any words you want excluded. Common English words are already filtered out, but if a word dominates your text without being meaningful (like a character name or a repeated filler), add it to the Exclude Words field.
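The exclusion step can be pictured roughly like this. The stopword set below is a tiny stand-in, since the tool's built-in list is not shown in this article, and `filter_tokens` is a hypothetical helper name.

```python
# Tiny illustrative stopword set; the tool's real built-in list is not shown here.
DEFAULT_STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it"}

def filter_tokens(tokens, extra_excluded=()):
    """Remove built-in stopwords plus any user-supplied excluded words."""
    excluded = DEFAULT_STOPWORDS | {w.lower() for w in extra_excluded}
    return [t for t in tokens if t.lower() not in excluded]

tokens = ["Ahab", "and", "the", "whale", "in", "the", "sea"]
kept = filter_tokens(tokens, extra_excluded=["Ahab"])
print(kept)  # ['whale', 'sea']
```

Here the character name is dropped along with the common words, so it can no longer dominate a topic box.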

Hit Run. The tool will sort every word by occurrence and return your topics sorted two ways: one by frequency (cyan bar) and one by lift (gold bar). The auto-generated label comes from the highest-lift words, the words most distinctive of that topic.
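The two orderings can be sketched as follows, using the counts from the lift example earlier plus one invented word (bank) for contrast. The helper name `rank_topic` and the " / " label format are assumptions for illustration; the tool's real label format may differ.

```python
def rank_topic(topic_counts, total_counts, top_n=10):
    """Return a topic's words ranked by frequency and by lift."""
    by_frequency = sorted(
        topic_counts, key=lambda w: (-topic_counts[w], w)
    )[:top_n]
    by_lift = sorted(
        topic_counts, key=lambda w: (-topic_counts[w] / total_counts[w], w)
    )[:top_n]
    return by_frequency, by_lift

# Counts inside one topic box, and counts across the whole text.
topic_counts = {"river": 30, "water": 20, "bank": 5}
total_counts = {"river": 200, "water": 25, "bank": 50}

by_frequency, by_lift = rank_topic(topic_counts, total_counts, top_n=2)
label = " / ".join(by_lift)  # hypothetical label built from the highest-lift words

print(by_frequency)  # ['river', 'water']
print(by_lift)       # ['water', 'river']
```

The most frequent word and the highest-lift word need not be the same: river is the most frequent word in this box, but water is the most distinctive of it.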