Entropy Lens – Word Surprisal Analyzer

The Core Idea: Surprisal and the Entropy Lens

Humans run on communication. The complexity of what we convey is astonishing, and it goes beyond what the most sophisticated machines can generate. We understand a whole array of nuances: history, context, slang, the unspoken. We also hold an entire catalog of word frequencies in mind, built from a lifetime of reading and listening. Our brains assign rough probabilities to word sequences without our ever consciously doing the math, a sort of built-in entropy lens.

A simple example: “The cat sat on the…” and we already know “mat” is coming. For English speakers it is predictable. Surprisal is the formal term that quantifies exactly that: how expected or unexpected any given word is. It is measured in bits.

The math is simple: surprisal is the negative base-2 logarithm of a word’s probability. Words that have a high probability of showing up next in a string of spoken or written words register low. The word “the” is about four bits, while an obscure word might be 20. Consider “The cat sat on the…edge of the precipice dangling the necklace in his mouth, grinning I think at the commotion.” The first part of the sentence has low bit values, while the rest does not.
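To make the bit values concrete, here is a minimal Python sketch of the standard surprisal formula, the negative base-2 logarithm of a probability. The function name surprisal_bits is illustrative and is not taken from the tool; the two probabilities are round numbers chosen to match the “about four bits” and “about 20 bits” figures above.

```python
import math

def surprisal_bits(probability: float) -> float:
    """Return surprisal in bits: -log2(p). Rarer words give larger values."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be greater than 0 and at most 1")
    return -math.log2(probability)

# A very common word: roughly 7 occurrences per 100 words.
print(round(surprisal_bits(0.07), 2))   # 3.84

# A rare word: roughly 1 occurrence per million words.
print(round(surprisal_bits(1e-6), 2))   # 19.93
```

Each halving of a word’s probability adds exactly one bit of surprisal.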

Where the probabilities come from

So where do the probabilities come from? This tool uses the Brown Corpus as its reference. It is a collection from the 1960s that contains just over one million words of English text. A word’s probability is calculated from its frequency in that corpus. The word “the” occurs about 70,000 times out of roughly one million words, and other frequencies fan out from there.

Words that do not appear in the corpus at all are given an almost-zero probability so they do not skew the results.
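The lookup described above could be sketched roughly like this in Python. This is an illustration under stated assumptions, not the tool’s actual code: a tiny made-up word list stands in for the Brown Corpus, and the floor value for unseen words (UNSEEN_PROBABILITY) is a guess, since the exact value the tool uses is not stated.

```python
from collections import Counter

# Toy stand-in for a reference corpus (the real tool uses the Brown Corpus).
TOY_CORPUS = "the cat sat on the mat and the dog sat on the rug".split()

# Assumed floor for words missing from the corpus; the real value is not stated.
UNSEEN_PROBABILITY = 1e-7

def build_probabilities(tokens):
    """Turn raw word counts into relative frequencies that sum to 1."""
    counts = Counter(token.lower() for token in tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def word_probability(word, probabilities, floor=UNSEEN_PROBABILITY):
    """Look up a word; fall back to a tiny nonzero floor if it was never seen."""
    return probabilities.get(word.lower(), floor)

probabilities = build_probabilities(TOY_CORPUS)
print(word_probability("the", probabilities))        # 4 of 13 tokens
print(word_probability("precipice", probabilities))  # unseen, gets the floor
```

Keeping the floor above zero matters because the logarithm of zero is undefined, so a true zero probability would have no finite surprisal.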

How to Use the Tool

Enter your text and hit Analyze.

Mean surprisal

The tool averages surprisal across all the words in your text, giving a single number that describes how novel your word choices are. Higher means your word choices are consistently unusual; lower means you’re using more common words. Neither is good or bad. It is often tied to the genre you are writing in or the audience you are writing for.
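As a rough sketch of that averaging step, assuming a simple lowercase word tokenizer and a small hypothetical probability table (TOY_PROBABILITIES and the floor value are made up for illustration and are not the tool’s data):

```python
import math
import re

# Hypothetical probabilities for illustration only, not real corpus values.
TOY_PROBABILITIES = {"the": 0.07, "on": 0.006, "cat": 0.00002, "sat": 0.00004}
FLOOR = 1e-7  # assumed probability for words missing from the table

def tokenize(text):
    """Lowercase the text and pull out runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def mean_surprisal(text, probabilities, floor=FLOOR):
    """Average -log2(p) over every word; returns 0.0 for empty input."""
    words = tokenize(text)
    if not words:
        return 0.0
    total_bits = sum(-math.log2(probabilities.get(w, floor)) for w in words)
    return total_bits / len(words)

print(mean_surprisal("The cat sat on the precipice", TOY_PROBABILITIES))
```

Because it is a plain average, one or two very rare words in a short passage can pull the number up noticeably.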

The color map

This is a visual translation of each word’s surprisal value. Dark cyan = low surprisal (common word, predictable). As surprisal rises it shifts through teal into gold. Gold words are the statistical outliers, rare or obscure words.
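One way such a color ramp could work is linear interpolation between color stops. The sketch below is only a guess at the approach: the RGB values are the standard CSS colors named darkcyan, teal, and gold, and the 4-to-20-bit range is an assumption borrowed from the examples earlier on this page, not the tool’s actual settings.

```python
# Assumed color stops (CSS darkcyan, teal, gold); the tool's exact colors may differ.
STOPS = [(0, 139, 139), (0, 128, 128), (255, 215, 0)]

def surprisal_to_rgb(bits, low=4.0, high=20.0):
    """Map a surprisal value onto the color ramp, clamping outside [low, high]."""
    t = min(max((bits - low) / (high - low), 0.0), 1.0)
    scaled = t * (len(STOPS) - 1)          # position along the ramp
    i = min(int(scaled), len(STOPS) - 2)   # index of the lower stop
    frac = scaled - i                      # how far toward the next stop
    return tuple(
        round(a + (b - a) * frac) for a, b in zip(STOPS[i], STOPS[i + 1])
    )

print(surprisal_to_rgb(4))    # (0, 139, 139), dark cyan
print(surprisal_to_rgb(12))   # (0, 128, 128), teal
print(surprisal_to_rgb(20))   # (255, 215, 0), gold
```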

Bigram entropy, the different thing

This part of the tool measures something related but more specific to your own text. A bigram is a pair of adjacent words, and bigram entropy is a tiny snapshot of how predictable those pairs are in your writing. For example, compare “He sneezed at the cat, sneezed at the dog, and sneezed at the open window.” with “He sneezed, and sneezed and sneezed.”

For most short passages, bigram entropy won’t show much. In longer texts it becomes a useful thing to keep in mind: a signal that your words are falling into predictable patterns with each other.
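The page does not say exactly how the tool computes bigram entropy, so the following Python sketch shows one common definition as an assumption: the Shannon entropy, in bits, of the distribution of adjacent word pairs within the text itself. The function name bigram_entropy is illustrative.

```python
import math
import re
from collections import Counter

def bigram_entropy(text):
    """Shannon entropy (bits) of the adjacent-word-pair distribution in a text."""
    words = re.findall(r"[a-z']+", text.lower())
    bigrams = list(zip(words, words[1:]))  # each word paired with the next one
    if not bigrams:
        return 0.0
    counts = Counter(bigrams)
    total = len(bigrams)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Two pairs, each seen twice: exactly one bit of uncertainty.
print(bigram_entropy("a b a b a"))  # 1.0

print(bigram_entropy("He sneezed at the cat, sneezed at the dog, and sneezed at the open window."))
print(bigram_entropy("He sneezed, and sneezed and sneezed."))
```

Under this definition, text that keeps reusing the same word pairs scores lower than text of the same length whose pairs are all different, which is why very short passages tell you little.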

Surprisal Function Explained: Measuring Information in Probability & AI by ClickVector