July 11

Avatar's character-defining words

A few years ago, I read this wonderful article on The Pudding about which words are "most hip hop". Since Avatar: The Last Airbender has recently enjoyed a resurgence in popularity now that it's finally on Netflix, I was inspired to do a similar project: which words are most central to each character in ATLA?

In the Pudding article, they used a dataset of 26 million words. We'll have to do our best with the 120,138 words in the Avatar script.

To find the defining words for a character (let's say it's Sokka), we'll need to look for the words that he says frequently but other characters don't. As a first attempt, we might look for the words that are most likely to have been spoken by Sokka rather than by any other character. Here are just a few of them:

     word        odds by Sokka   Sokka count   others count
1.   scare       100%            1             0
2.   cookin      100%            1             0
3.   weirdness   100%            1             0

I can see Sokka saying "cookin," but I'd hardly call it a "character-defining word."

Do you see the problem here? Our dataset is too small, so we are just selecting all the obscure words that were only spoken once in the entire series, where the speaker just happened to be Sokka.
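
To make the problem concrete, here is a minimal sketch of that first attempt. The function name `speaker_odds` and the toy lines are made up for illustration (they are not taken from the transcripts or from my actual code), and the tokenization is just a whitespace split.

```python
from collections import Counter, defaultdict

def speaker_odds(lines, character):
    """Rank one character's words by the share of all uses that belong to them.

    lines: iterable of (speaker, text) pairs.
    Returns (word, odds, character_count, others_count) tuples, highest odds first.
    """
    counts = defaultdict(Counter)
    for speaker, text in lines:
        counts[speaker].update(text.lower().split())

    # Total uses of each word across every speaker.
    total = Counter()
    for speaker_counts in counts.values():
        total.update(speaker_counts)

    mine = counts[character]
    rows = [(w, n / total[w], n, total[w] - n) for w, n in mine.items()]
    rows.sort(key=lambda r: (-r[1], r[0]))  # ties broken alphabetically
    return rows

# Invented toy dialogue, only for illustration.
toy_lines = [
    ("Sokka", "boomerang came back"),
    ("Sokka", "boomerang is cookin"),
    ("Katara", "your boomerang came back"),
]

for row in speaker_odds(toy_lines, "Sokka")[:3]:
    print(row)
```

Even in this tiny example, a word Sokka says once ("cookin") outranks a word he says twice ("boomerang"), simply because nobody else happened to say it.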

In order to fix this, we'll use a statistic called the tf-idf score, which is used to measure how important a word is to a document in a collection of documents. In this case, we can consider Sokka to be a document (really, the document is all of Sokka's lines) and find which words are most important to his lines. Tf-idf will prioritize words that Sokka says frequently, but deprioritize words that have been said by other characters.
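
For the curious, here is a rough sketch of the idea in plain Python. It assumes the textbook definition (term frequency times the log of the inverse document frequency); libraries often use smoothed or normalized variants, so treat the exact formula as an assumption rather than a description of my code. The function names and toy documents are invented for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: dict mapping a character name to their list of tokens.

    Returns a dict mapping each name to {word: tf-idf score}, using
    tf = count / document length and idf = log(N / document frequency).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))  # count each word once per document

    scores = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        scores[name] = {
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        }
    return scores

def top_words(scores, name, k=10):
    """Return the k highest-scoring words for one character."""
    ranked = sorted(scores[name].items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]

# Invented toy documents, only for illustration.
toy_docs = {
    "Sokka": ["boomerang", "boomerang", "cookin", "aang"],
    "Katara": ["water", "aang", "aang"],
    "Zuko": ["honor", "aang", "honor"],
}

scores = tf_idf(toy_docs)
print(top_words(scores, "Sokka", k=3))
```

In the toy data, "boomerang" now beats "cookin" because Sokka says it more often, and "aang" scores zero for everyone because every character says it.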

This approach works a lot better, although it isn't perfect. Here's the breakdown for all characters in the series who have at least 200 significant words of dialogue.[1] Multi-word proper nouns (e.g., "ty lee") are counted as one word.

[1] In the computations I ignore stop words, which are typically common words like "I", "am", and "the" that are not significant in this context.

Character-defining words in Avatar

32 characters' top 10 central words across the series, using tf-idf
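
The preprocessing behind these lists (dropping stop words and treating multi-word proper nouns as one word) could look something like the sketch below. The stop-word set and phrase list here are tiny placeholders I made up for illustration; a real run would use a full stop-word list and a complete list of names.

```python
import re

# Tiny illustrative lists; both are placeholders, not the real ones.
STOP_WORDS = {"i", "am", "the", "a", "to", "is", "with"}
PHRASES = ["ty lee", "ba sing se"]  # each should count as a single word

def tokenize(text):
    """Lowercase the text, merge known phrases, and drop stop words."""
    text = text.lower()
    for phrase in PHRASES:
        # Temporarily join the phrase with underscores so it survives splitting.
        # Note: this is a naive substring replace and ignores word boundaries.
        text = text.replace(phrase, phrase.replace(" ", "_"))
    words = re.findall(r"[a-z_']+", text)
    return [w.replace("_", " ") for w in words if w not in STOP_WORDS]

print(tokenize("I am going to Ba Sing Se with Ty Lee"))
```

Merging phrases before splitting is what lets "ba sing se" show up as one entry in a list instead of three separate words.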

A lot of names appear in the lists of words, and we can see some sensible patterns emerge: members of the Gaang mention each other by name a lot, and to a lesser extent so do members of the Fire Nation. For minor characters, we see words relating to the context they appear in, like "chakras" for Guru Pathik, "ba sing se" for Long Feng, and "library" for Professor Zei.

An interesting note on methodology: I had to run the tf-idf analysis for the most important characters separately from the full list of characters. Otherwise, if we keep them together when calculating tf-idf scores, the major characters end up with lists full of words like "want", "know", and "need".

Why might this be? Well, these are all introspective, character-developing words — even though our main cast says them all the time, minor characters who don't get as much development say them much less! So including minor characters makes those words seem much rarer than we would typically expect them to be.
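
A bit of arithmetic shows the effect. The sketch below assumes the plain idf formula, log(N / number of documents containing the word), and the counts are hypothetical (only the total of 32 characters comes from this post).

```python
import math

def idf(n_docs, n_docs_with_word):
    """Plain inverse document frequency: log(N / df)."""
    return math.log(n_docs / n_docs_with_word)

# Hypothetical counts: suppose all 6 major characters say "want",
# but only 12 of the 32 characters overall do.
major_only = idf(6, 6)
everyone = idf(32, 12)

print(f"idf among major characters only: {major_only:.3f}")
print(f"idf among all characters:        {everyone:.3f}")
```

When every major character uses a word, its idf among just those characters is zero, so it drops out of their lists. Add a crowd of minor characters who never say it and the idf becomes positive, which pushes the word back up for anyone who says it a lot.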

Avatar is renowned for its character development, so it might be illuminating to see how characters' central words change over time. Our analysis stands to gain more nuance if we apply it to each individual season rather than to the series as a whole.

Here's what that looks like. This analysis shows only the undisputed main characters who also appear in all three seasons (sorry, Azula and Toph), although characters from the previous section are still used for the calculations.

Character growth by vocabulary in Avatar

5 characters' top 10 central words in each season, using tf-idf
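
One plausible way to set this up is to build a separate collection of documents for each season, so that a character's Season 1 lines and Season 3 lines are scored independently. The sketch below only shows the grouping step; the function name and the toy data are invented for illustration, and the actual source code may organize things differently.

```python
from collections import defaultdict

def docs_by_season(lines):
    """Group tokens into one document per character, per season.

    lines: iterable of (season, character, tokens) triples.
    Returns {season: {character: [tokens...]}}.
    """
    seasons = defaultdict(lambda: defaultdict(list))
    for season, character, tokens in lines:
        seasons[season][character].extend(tokens)
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {season: dict(chars) for season, chars in seasons.items()}

# Invented toy data, only for illustration.
toy_lines = [
    (1, "Zuko", ["need", "capture"]),
    (1, "Zuko", ["honor"]),
    (1, "Iroh", ["tea"]),
    (2, "Zuko", ["want", "realized"]),
]

grouped = docs_by_season(toy_lines)
print(grouped[1]["Zuko"])
```

Each season's inner dictionary can then be scored with tf-idf on its own, which is what lets a character's list change from one season to the next.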

Some of the clearer insights come from Zuko's shifting lexicon over the seasons. Although I mentioned that I separated major characters during calculations to prevent character-developing words from being overrepresented, they still show up a lot in Zuko's lists, since he is arguably given the most character development in the show.

Season 1 Zuko is mostly occupied with capturing the Avatar, a goal which he never stops to question ("need"). But in Season 2, he becomes more concerned with his family and begins his internal struggle to discover where he truly fits in ("want", "realized"). By the middle of Season 3, Zuko's metamorphosis is complete and he achieves certitude about his choices and the ability to reflect on his former self ("know", "wanted", "thought").

Let's cap off this post with a couple more statistics. A relatively basic one that we've neglected is the total word count of every character. Make your guess: who do you think talks the most? Here's the top ten.

Avatar characters by word count
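
Counting words per character is about as simple as it sounds. Here is a minimal sketch, assuming the script has already been parsed into (speaker, text) pairs and that a "word" is anything separated by whitespace; the toy lines are invented.

```python
from collections import Counter

def word_counts(lines):
    """Total number of whitespace-separated words spoken by each character."""
    totals = Counter()
    for speaker, text in lines:
        totals[speaker] += len(text.split())
    return totals

# Invented toy dialogue, only for illustration.
toy_lines = [
    ("Sokka", "boomerang came back"),
    ("Katara", "we have to keep going"),
    ("Sokka", "meat and sarcasm guy"),
]

# most_common(10) would give the top ten on the full script.
print(word_counts(toy_lines).most_common(2))
```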

Finally, let's try to empirically determine who are the optimists and pessimists of Avatar! VADER, which stands for Valence Aware Dictionary and sEntiment Reasoner, is a sentiment analysis tool that can determine whether a given sentence is positive, negative, or neutral. Although VADER was built for social media, it will work well enough for our purposes.

Using VADER, we can determine what fraction of each character's lines (i.e., sentences) are positive, negative, or neutral. Here's an arbitrary selection of ten characters ranked by their net positivity, meaning the percentage of positive lines minus the percentage of negative lines.

Avatar characters by positivity
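
Here is a sketch of how net positivity could be computed. It assumes the commonly used VADER convention of treating a compound score of at least 0.05 as positive and at most -0.05 as negative; whether those are the exact cutoffs behind the chart is an assumption. The scorer is passed in as a function so the arithmetic can be tried without NLTK installed; `make_vader_scorer` wraps NLTK's `SentimentIntensityAnalyzer` and needs the `vader_lexicon` resource (via `nltk.download("vader_lexicon")`).

```python
def make_vader_scorer():
    """Return a function mapping text to VADER's compound score (needs NLTK)."""
    # Imported lazily so the rest of this sketch runs without NLTK.
    from nltk.sentiment.vader import SentimentIntensityAnalyzer
    analyzer = SentimentIntensityAnalyzer()
    return lambda text: analyzer.polarity_scores(text)["compound"]

def net_positivity(lines, score, threshold=0.05):
    """Fraction of positive lines minus fraction of negative lines.

    lines: non-empty list of strings; score: function from text to a
    compound score in [-1, 1]. The 0.05 threshold is an assumed convention.
    """
    scores = [score(line) for line in lines]
    positive = sum(1 for s in scores if s >= threshold)
    negative = sum(1 for s in scores if s <= -threshold)
    return (positive - negative) / len(lines)

# A stand-in scorer with made-up scores, only to show the arithmetic.
fake_scores = {"great": 0.6, "awful": -0.6, "ok": 0.0}
print(net_positivity(["great", "great", "awful", "ok"], fake_scores.get))
```

With the real analyzer, the call would be `net_positivity(character_lines, make_vader_scorer())`, where `character_lines` is a hypothetical list of one character's lines.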

Although I didn't include him in the chart, the character with the highest fraction of positive lines (and a net positivity second only to Ty Lee) is, surprisingly, the grumpy and sexist Master Pakku. Rather than suggesting that Pakku is a wellspring of good cheer, this result highlights how VADER can fail — it lacks context and an understanding of sarcasm, so Pakku's snarky lines like "Well, have fun teaching yourself" are tagged as positive.


That's all for now. If you are interested in the details of my methodology (there's nothing particularly special) or want to see the data I left out of this post, then feel free to take a look at the full source code.

Thanks to The Pudding for inspiration (and CSS), AvatarSpirit.net for the transcripts, and the maintainers of NLTK for the sentiment analysis implementation!