It's interesting to look, as I did in my last post, at the plot structure of typical episodes of a TV show as derived through topic models. But while it may help in understanding individual TV shows, the method also shows some promise on a more ambitious goal: understanding the general structural elements that most TV shows and movies draw from. TV and movie scripts are carefully crafted structures: I wrote earlier about how the Simpsons moves away from the school after its first few minutes, for example, and with this larger corpus even individual words frequently show a strong bias towards the front or end of scripts. This crafting shows up in the ways language is distributed through them in time.
So that's what I'm going to do here: make some general observations about the ways that scripts shift thematically. On its own, this stuff is pretty interesting--when I first started analyzing the set, I thought it might be an end in itself. But it turns out that by combining those thematic scripts with the topic models, it's possible to do something I find really fascinating, and a little mysterious: you can sketch out, derived from the tens of thousands of hours of dialogue in the corpus, what you could literally call a plot "arc" through multidimensional space.
Words in screen time
First, let's lay the groundwork. Many, many individual words show strong trends towards the beginning or end of scripts. In fact, plotting movies in what I'm calling "screen time" usually has a much more recognizable signature than plotting things in the "historic time" you can explore yourself in the movie bookworm. So what I've done is cut every script there into "twelfths" of a movie or TV show; the charts here show the course of an episode or movie from the first minute at the left to the last one at the right. For example: the phrase "love you" (as in, mostly, "I love you") is most frequent towards the end of movies or TV shows: characters in movies are almost three times more likely to profess their love in the last scene of a movie than in the first.
(I'm including TV shows and "unknown" as a sanity check. The same pattern holds for television; but movies tend to be more romantic.) That love waits for the end to declare itself is not necessarily surprising: but it is fundamental to the medium, and a phenomenon well worth understanding better. What other conventions are there like this? How do individual TV shows play off these shared structural elements of the medium as a whole?
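For anyone who wants to try the "screen time" counting at home, here is a minimal sketch of the idea rather than the actual pipeline behind the charts. It assumes a hypothetical input format (each script as a list of timestamped dialogue lines), takes the last timestamp as the runtime, and maps timestamps straight to parts without the intermediate three-minute chunks. The function names and the toy scripts are invented for illustration.

```python
def part_index(t, runtime, n_parts=12):
    """Map a timestamp to a screen-time part, 0 .. n_parts - 1."""
    return min(int(n_parts * t / runtime), n_parts - 1)


def phrase_rate_by_part(scripts, phrase, n_parts=12):
    """Occurrences of `phrase` per word of dialogue, for each part of screen time.

    `scripts` is a hypothetical format: one list per script of
    (seconds_from_start, line_of_dialogue) pairs.
    """
    hits = [0] * n_parts
    words = [0] * n_parts
    phrase = phrase.lower()
    for lines in scripts:
        # Assumption: the last timestamp stands in for the runtime.
        runtime = max(t for t, _ in lines) or 1
        for t, text in lines:
            part = part_index(t, runtime, n_parts)
            text = text.lower()
            hits[part] += text.count(phrase)
            words[part] += len(text.split())
    return [h / w if w else 0.0 for h, w in zip(hits, words)]


# Two invented toy scripts, cut into thirds to keep the example readable.
toy_scripts = [
    [(0, "Good morning everyone"), (600, "If you go now"), (1200, "I love you")],
    [(0, "See you at the office"), (1500, "What if you stay"), (3000, "I love you too")],
]
rates = phrase_rate_by_part(toy_scripts, "love you", n_parts=3)
print(rates)
```

Dividing by the number of words in each part, rather than reporting raw counts, keeps a talky middle act from looking like a spike in any particular phrase.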
Lots of individual words or phrases have their own trends, even ones that don't map so firmly to our understanding of plot points. Take the phrase "if you," for instance. It's rare at the beginning of scripts, common in the middle, and then rare again towards the end. Why? Well, presumably it encapsulates a particular type of dialogue, in which the protagonist has to give or seek advice about the consequences of their actions. First scenes won't have the necessary exposition laid out for the hero to reach a crossroads; and as the script wraps up in the third act, the importance of decisions fades out in favor of consequences. You can see them working through those results grammatically, as well: the word "because" (not shown) becomes more and more common as the typical script proceeds.
These are interesting, and at some point I'll share a link for exploring these words on your own. I could probably go on for quite a while. Right now, though, my question is less about the individual instances than about generalizing some of the patterns underlying these shifts.
One way to approach these overall questions of structure would be to use all the words in the scripts, and search for interesting patterns. This is actually feasible, though computationally pretty difficult, and something I'll keep in mind for later. But for now, the topic models I described last time provide a nice way to understand what's actually going on in a typical plot arc. (If you're new here and don't know what topic models or LDA are, all the digital humanists got together and decided that this post by Ted Underwood is the one to read for an explanation.)
Topics in screen time
Just as words show patterns in screen time, so do topics. But since I only have 127 topics, it's easy to search all of them for the strongest patterns. Using linear models to test fittedness to a straight line, I went through the data set to pull out the nine topics that show the strongest directionality. Here they are (click to enlarge):
All of these topics show clear and strong directionality. Over the course of a script, characters talk more about the truth and lying; they more often narrate murders; they use the language of apology and authenticity. (Two out of the nine topics display "sorry" prominently.) They talk less, on the other hand, about "business" and "companies," about their "job" and the "office," and about times of day.
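A rough sketch of that search, for the curious: fit a least-squares line to each topic's share across the parts of screen time, then rank topics by how well the line fits. The post only says "linear models," so ranking by R-squared is an assumption here, as are the function names and the toy numbers (the topic labels are borrowed from the post, the values are invented).

```python
def linear_fit(ys):
    """Least-squares line through (0, ys[0]), (1, ys[1]), ...

    Returns (slope, r_squared).
    """
    n = len(ys)
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    r_squared = (sxy * sxy) / (sxx * syy) if syy else 0.0
    return slope, r_squared


def most_directional(topic_shares, k=9):
    """Names of the k topics whose shares sit closest to a straight line."""
    fits = {name: linear_fit(shares) for name, shares in topic_shares.items()}
    return sorted(fits, key=lambda name: fits[name][1], reverse=True)[:k]


# Invented shares over six parts of screen time, for illustration only.
toy_shares = {
    "sorry really talk": [0.010, 0.011, 0.012, 0.013, 0.014, 0.016],
    "morning tomorrow": [0.016, 0.015, 0.013, 0.012, 0.011, 0.010],
    "look eyes face": [0.010, 0.013, 0.015, 0.015, 0.013, 0.010],
}
print(most_directional(toy_shares, k=2))
```

The sign of the slope says whether a topic grows or fades; a topic that peaks in the middle, like the third toy row, scores near zero on both counts.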
Using the same method but working from the middle out, I can also see which topics are clustered in the middle of episodes or movies; or conversely, which ones are predominantly used at the beginnings or ends of scripts. Some of these curves aren't quite as pretty, but do show the elements that tend to get dropped or emphasized when the game is actually afoot. The two topics used mostly at the edges are grand language about the world and the government; there are some really beautiful curves that show up mostly in the middle of movies. Two pairs in particular stick out. One pair is "look eyes face Look eye" and "hand head cut hands," suggesting that a focus on the human body and the human face occurs mostly in the heart of films. The other is "question answer questions" and "talk doesn people wants says," suggesting that descriptive language about narration or questioning is also comparatively rare in the first and third acts.
It's easy to get a little just-so about these curves. For instance, after falling away later, the face does return in last scenes. Does the face return, in ways that the rest of the body doesn't, because it contains the same grand eternal resonance that the nation and the world can have? Maybe that's crazy. But certainly, there are major elements of structural similarity being shown by these patterns.
[One caveat before I get into the really exciting stuff: the way I've chosen to arrange the data is creating a few odd artifacts here. In particular, dividing up into three-minute chunks and then dividing into twelfths occasionally does strange things. This is particularly true for shows shorter than 24 minutes, which typically end up lacking any three-minute chunks assigned to twelfth 2, 3, or 4. (I think.) In later versions I may change this; but it could take a few days of on-again-off-again computation, and I'm so interested in these models right now I don't want to take the time off.]
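One way to read "working from the middle out" in code, offered as a guess at the approach rather than a record of it: regress each topic's share on its distance from the midpoint of the script instead of on its position. A negative slope then means the topic lives in the middle, a positive one that it lives at the edges. The helper names and numbers below are invented.

```python
def slope_and_r2(xs, ys):
    """Least-squares slope and R-squared for paired observations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    r_squared = (sxy * sxy) / (sxx * syy) if syy else 0.0
    return sxy / sxx, r_squared


def middle_out_fit(shares):
    """Fit topic share against distance from the midpoint of screen time."""
    mid = (len(shares) - 1) / 2
    distance = [abs(i - mid) for i in range(len(shares))]
    return slope_and_r2(distance, shares)


# Invented shares over six parts of screen time.
face = [0.010, 0.013, 0.015, 0.015, 0.013, 0.010]    # peaks in the middle
world = [0.014, 0.011, 0.010, 0.010, 0.011, 0.014]   # peaks at the edges

print(middle_out_fit(face))
print(middle_out_fit(world))
```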
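To see why short shows leave holes, here is a toy version of the chunk-then-twelfth arithmetic. The assignment rule is an assumption (chunk i of n goes to twelfth floor(12 * i / n)); the post doesn't spell out the real rule, so the particular twelfths that come up empty here need not match the ones named above. The point is only that seven chunks cannot fill twelve bins.

```python
import math


def twelfths_hit(runtime_minutes, chunk_minutes=3, n_parts=12):
    """Twelfths that receive at least one chunk under the assumed rule."""
    n_chunks = math.ceil(runtime_minutes / chunk_minutes)
    # Assumed rule: chunk i (counting from 0) lands in part floor(n_parts * i / n_chunks).
    return sorted({(n_parts * i) // n_chunks for i in range(n_chunks)})


def empty_twelfths(runtime_minutes, chunk_minutes=3, n_parts=12):
    """Twelfths that receive no chunk at all."""
    hit = set(twelfths_hit(runtime_minutes, chunk_minutes, n_parts))
    return [p for p in range(n_parts) if p not in hit]


for minutes in (21, 36, 42):
    print(minutes, "minutes -> empty twelfths:", empty_twelfths(minutes))
```

Under this rule a 21-minute sitcom has only seven chunks to hand out, so five twelfths stay empty, while anything of 36 minutes or more covers all twelve.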
Moving through topic space in screen time
So: those charts show the distribution of topics in screen time. I think that's a practice less susceptible to embarrassing errors than one I've objected to in the past, of plotting topics in historical time, because the guarantee that each author is equally represented across time periods helps to eliminate some questions of linguistic drift. (There are other problems, though: I'm sure that having about 200 documents that all say "Space: the final frontier" in their first couple minutes is skewing certain elements of the model in odd ways).
I'm not interested, fundamentally, in topics, though: I'm interested in the TV shows. So how can we move from writing about topics, which is boring, to writing about television, which isn't? The answer is: we think about the ways that individual episodes of TV shows move through the space created by topics. One way to do that is by reducing the dimensionality of the topic space using principal components analysis. Let me ease into that, though, by using just a few of the topics to explain.
Suppose we take what I think is the most interesting linear trend above; the shift from the mundane world of talking about times of day to the personal world of giving and taking apologies.
This chart is showing proportion of the morning-tomorrow topic on the y axis, and of the sorry-really-talk topic on the x axis. A typical script starts in part 0, using 1% "apology" language and 1.6% "time" language. By the end, those proportions have been flipped: that script moves from one pole to the other.
We don't just have to worry about a "typical" script, though: we can drop any text we want into this same space and see how it moves. These tend to be noisier because of the smaller sample, so I'm going to use 6 parts rather than 12. (If I really wanted to dazzle you, I'd use a moving average that would artificially smooth the curves: but that's not totally necessary). I've randomly picked 6 of the top 20 shows in the sample to see how they move on this same chart: here's what it looks like.
Law and Order SVU follows the pattern almost exactly; The West Wing comes close, although it has mostly finished its time talk by the end of the first half hour. Other shows depart more significantly; 24 is on track until it doubles back in the last 20 minutes, and Survivor moves left to right but not top to bottom. Doctor Who seems to move in almost the opposite direction from the convention. Still, with just two relatively scarce variables, it's surprising to me how well a random sample holds up; it's probably more than a coincidence, as well, that The West Wing and Law and Order are the shows that use the vocabulary the most, as well as using it the most typically. The ones that fail are the most under-represented. For example, Doctor Who hardly talks about times of day at all (perhaps doing it from a time travel topic instead).
[By the way, you might look at these charts and say that they're completely random, this method is bunk, and there's no sign of the left to right trends in the individual shows. I don't think you'd be crazy to do so, based on the charts above--but at least wait until we get to the higher-dimensional stuff before passing definitive judgment.]
This is interesting, but there are even better ways to approach the general problem of plot arcs. These are charts of motion in two-dimensional space: but we have a full 127 dimensions to move around in, one for every topic. Principal Components provides a nice way to plot them all simultaneously. So what I'm doing is taking a restricted set of only 6 documents, one for each sixth in screen time of all the text spoken in all of the movies or TV shows for that period: and calculating the first two components. (Forgive me for linking to one of the earliest series on this blog, a deeper explanation of PCA. And I should note that though I'm plotting 12 chunks, I use 6 chunks of screen time rather than 12 to calculate the components because of the aforementioned problems with shows under 23 minutes.)
Here we arrive at the heart of this long post. Plotting individual trends, there are lots of ways you can imagine a curve moving. But when we combine all the topics present in all the shows, a single structure emerges. It's quite literally a "plot arc," reduced down from a much more complicated curve moving through 127-dimensional space.
What is this saying? That in the grand corpus of tens of thousands of hours of studio-approved, investor-funded, union-written scripts, two major trends stand out: one set of directional trends, advancing continuously through the course of the film, and one cyclical, through which the language returns back to its origins. It's tempting to get mystical on this, and humanists often do so when applying techniques like PCA. So perhaps I should emphasize that it's hard to imagine any other shape coming out of the PCA algorithm with the inputs I put in (which were specifically designed to destroy the genre signals that would ordinarily be output by PCA). Even before I ran it, I was pretty confident this sort of curve would emerge. About the only other reasonable possible option for story structure would be a circle a la Dan Harmon. But a true circle is pretty implausible--we know, intuitively, that the first lines of a story won't be exactly the same as the last, although they may be similar in some ways. So this pure rainbow shape is probably more a confirmation that the method works than a radical insight into the nature of narration. (But maybe, some part of me wants to say, I shouldn't give too much up? Maybe it's just a little bit fundamental to narratology? Just don't tell anyone outside this parenthesis I said so).
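A dependency-free sketch of that calculation, under loud assumptions: the post doesn't say what software computed the components, the numbers are invented, and four topics stand in for the real 127. Six rows (one per sixth of screen time) are centered, and the first two principal axes are found by power iteration with deflation, which is a standard way to get leading components but not necessarily the one used here.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def center(rows):
    """Subtract each column's mean; return (centered rows, column means)."""
    means = [sum(col) / len(rows) for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows], means


def top_component(rows, iterations=500, seed=0):
    """Leading principal axis of centered rows, by power iteration on X^T X."""
    rng = random.Random(seed)
    v = [rng.uniform(-1, 1) for _ in rows[0]]
    for _ in range(iterations):
        scores = [dot(row, v) for row in rows]                      # X v
        w = [sum(s * row[j] for s, row in zip(scores, rows))        # X^T (X v)
             for j in range(len(v))]
        norm = math.sqrt(dot(w, w))
        v = [x / norm for x in w]
    return v


def pca_2d(rows):
    """Column means, first two principal axes, and 2-D coordinates of each row."""
    centered, means = center(rows)
    pc1 = top_component(centered)
    # Deflate: strip the pc1 direction out of every row, then look again.
    residual = [[x - dot(row, pc1) * p for x, p in zip(row, pc1)] for row in centered]
    pc2 = top_component(residual)
    coords = [(dot(row, pc1), dot(row, pc2)) for row in centered]
    return means, (pc1, pc2), coords


# Invented topic shares: one row per sixth of screen time, one column per topic.
# Column 0 rises, column 1 falls, column 2 peaks mid-script, column 3 dips mid-script.
sixths = [
    [0.10, 0.30, 0.10, 0.50],
    [0.14, 0.26, 0.16, 0.44],
    [0.18, 0.22, 0.20, 0.40],
    [0.22, 0.18, 0.20, 0.40],
    [0.26, 0.14, 0.16, 0.44],
    [0.30, 0.10, 0.10, 0.50],
]
means, (pc1, pc2), coords = pca_2d(sixths)
for part, (x, y) in enumerate(coords):
    print(part, round(x, 3), round(y, 3))
```

The signs of the axes are arbitrary, so the picture may come out mirrored or upside down from run to run of a different implementation; what matters is that one coordinate moves steadily while the other goes out and comes back.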
What's really interesting is not just knowing that there's an arc, but knowing what makes it up. For this, we just have to look at the loadings for the components. Here they are, overlaid: this is not a particularly beautiful visualization and you'd need to expand to really read it, but it gives a better sense of what it means to move through this space. Although there are 127 dimensions, I'm only showing the 15 most important ones on the plot here.
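Picking which topics to draw can be sketched in a few lines. The post doesn't say how "most important" was decided, so this assumes importance means the length of a topic's loading vector in the plane of the first two components. The loadings below are invented; only the topic labels come from the post.

```python
import math


def top_loadings(topic_names, pc1, pc2, k=15):
    """The k topics with the longest (pc1, pc2) loading vectors."""
    ranked = sorted(
        zip(topic_names, pc1, pc2),
        key=lambda row: math.hypot(row[1], row[2]),
        reverse=True,
    )
    return ranked[:k]


# Invented loadings, for illustration only.
names = ["morning tomorrow", "sorry really talk", "look eyes face", "question answer questions"]
toy_pc1 = [-0.60, 0.70, 0.05, 0.10]
toy_pc2 = [-0.20, -0.10, 0.50, 0.30]

for name, x, y in top_loadings(names, toy_pc1, toy_pc2, k=2):
    print(name, x, y)
```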
Just for fun, let's put that into a single narrative: the typical script starts among a group of wise-cracking teenagers at the school, making plans for the day to come and the weekend. At the office, however, a dead body is discovered. The wisecracking ceases, and instead the befuddled victims try to describe more accurately how the murder happened, to apologize for their mistakes, and to inspire each other not to give in to defeat but to fight for victory. A heartfelt plea to the Almighty for help lies over their testimony at the trial; and they carefully move into the future, apologizing once again and reflecting on the new truths they have learned.
If anything jumps out at you from this, it might be the dominance of tropes from detective/crime fiction; just as we saw in the last post, those do seem to be some of the strongest elements, and I am somewhat tempted to exclude them from future versions of this. On the other hand, it does capture the centrality of the murder mystery to contemporary television fiction; no other genre comes close in laying out the general scene. Whatever general principles of plot we learn by watching TV and movies, they're almost certainly deeply inflected by the crime show. (I don't have a place in my personal memory palace for literary criticism, but I believe there's an argument out there that the detective novel is central to the formation of the 20th century novel--anyone?)
But it's important to remember that this is not a general scheme that every show follows--the point is that scripts tend to start with either school scenes, or work scenes, or the discovery of a body. There are many different ways to reach this same plot curve, including some probably nonsensical ones. (Wisecracking at the office doesn't lead to a trial, very often--although we could actually search through the corpus with this method to find the rare episode where this is the case.)
How Individual shows mirror the general plot arc
Just as I was able to overlay individual shows on the toy example of "apology" and "temporality," I can do the same thing with the individual plot arcs. And--lo and behold--those shows tend to follow the same plot arc.
They're not perfect (although ER is pretty close)--but they're also certainly not random. The general curve reflects quite well the overall shape of the arc traced by the aggregate show. (Brief note: because I realized while writing this that to prove efficacy I needed to segregate training data from output, these are arcs that are generated on a subcorpus of the full set specifically designed not to include CSI, ER, or any of the other top 20 shows. So ER is following a plot arc I derived from a batch of shows that don't include any ER, CSI, or any of the other shows you're about to see.)
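The train-and-project step can be sketched like this, with everything hypothetical except the idea: hold the chosen shows out, compute column means and components from what remains, then place a held-out show in that space by subtracting the training means and taking dot products with the components. The data layout, the placeholder show, and the numbers (three topics, axis-aligned "components") are invented to keep the arithmetic visible.

```python
def split_holdout(scripts, held_out_shows):
    """Separate scripts into (training, held_out) by show name."""
    training = [s for s in scripts if s["show"] not in held_out_shows]
    held_out = [s for s in scripts if s["show"] in held_out_shows]
    return training, held_out


def project(rows, means, components):
    """Coordinates of each row on each component, centered on the training means."""
    return [
        tuple(sum((x - m) * c for x, m, c in zip(row, means, comp)) for comp in components)
        for row in rows
    ]


# Invented data: each script holds topic shares per part of screen time.
scripts = [
    {"show": "ER", "parts": [[0.1, 0.4, 0.5], [0.3, 0.2, 0.5]]},
    {"show": "Some Other Show", "parts": [[0.2, 0.3, 0.5], [0.2, 0.3, 0.5]]},
]
training, held_out = split_holdout(scripts, {"ER"})

# Stand-ins for values that would be computed from `training` alone.
training_means = [0.2, 0.3, 0.5]
components = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]

arc = project(held_out[0]["parts"], training_means, components)
print(arc)
```

Centering on the training means rather than on the show's own means is what lets different shows land in different neighborhoods of the same space.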
But eagle-eyed readers might have noticed that the x and y-axes have different numbers than for the archetype. How do these fit in? This is where I got a result that I didn't expect, though perhaps I should have. Here are the twenty shows with the most minutes of dialogue in the movie bookworm, arranged on a single plot. Almost all of them sweep out an arc roughly like the ER or CSI ones (although some are rougher than others, and a few shows, like the "Gilmore Girls" and "The Simpsons," seem to extend the action further in their final scene rather than returning home.)
But although they trace out arcs, they do it in their portion of the plot arc space. (To really make sense out of this chart, you might have to not just click to expand, but open in a separate window, zoom in and zoom out, pan around; or, just trust me.) The portions of plot-arc space they land in correspond to genre: the crime shows live in an area something like the early middle of a show, while science fiction camps out after the end of the end.
If you know PCA, this may not initially seem weird to you: after all, the method is usually fairly good at segregating genres. But remember, this is PCA on a space made out of only 6 massive documents, one for each ten minutes of the hour; each TV show is evenly divided across those 6, so the normal avenues for genre signal should be almost entirely muted. But it turns out that the signal for structurality is also tied up with the signal for genre. So all those Star Treks are camped out together in the southeast, because the entirety of science fiction language is more "endingy" than even the final scenes of your typical "NCIS" episode. (Particularly on the up-down dimension, which, remember, tends to separate the mundane from the eternal. There aren't all that many water-cooler scenes in Star Trek, it turns out; and conversely, the forensics shows aren't dispensing many eternal truths.) What these positions tell us, roughly, is that the genre signal is somewhat, but not immensely, stronger than the plot arc signal even on terms designed to discover plot relationships above all.
So that clustering is interesting enough: but the omnipresence of the curves suggests that they all follow the same path through space in some way, regardless of where they start. Look at each one of these, allowed to expand out: some shows (Charmed, Cold Case, all three Star Treks) follow the curve perfectly, while most others at least move left to right. The two with the biggest problems are Survivor (which obviously faces different narrative issues) and 24 (which I can't really explain away by its plot structure). There's some tendency for shows that start in the southwest quadrant to end less decisively. But even then, they usually follow something like the standard curve.
From a multidimensional point of view, this tells us something really interesting for future research: that plot is about motion through multidimensional space, not about position in it. Most of our existing text analysis toolkits are built around spatial and probabilistic measures that don't really conceive of any possibility for motion; but these archetypal plot arcs are better thought of as forces that pull a given script, whatever its starting point, in different directions and with different forces as time unfolds, like tides going in and out.
Even at this point, there are some interesting questions that could be answered with the tools at hand. What are the most typical shows? Do any shows have inverted arcs? Do successful shows follow the arc better than failures? Why doesn't Community, Dan Harmon's story circle be damned, follow the standard arc? Some of these are probably best investigated in particular subsections: what can we learn about the transatlantic genre of the cop/detective show, maybe by trying to filter out national vocabulary? I might turn to those low-hanging fruit soon, and I'm ending mostly because this post is long enough and the data could use some cleaning. (Next time I write on this, I'll be using a slightly different topic model, and possibly a 1-minute chunking rather than a three-minute one). But I could easily fill out another half-dozen questions in that same vein right now, and I'm sure you can too.
One extremely important observation I've noticed, but haven't had time to fully plumb, is this: TV shows show much greater regularities of form than do films. This could be a data problem, but I suspect it's real; everything about the TV system enforces heavy constraints and tight structures that even a 90-minute movie can more easily flout. (Particularly because my topics seem to better capture themes of interpersonal conflict than individual growth.)
But because this is getting out past where I usually take text analysis, really pushing the investigation farther may require some really difficult thought. How best to capture motion along arcs as a phenomenon in itself? I could just multiply my multidimensional space across 6 more dimensions, but I'm already pushing the limits of the data. The closest standard technique might be transition matrices but applied to topic spaces rather than words.
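One hedged reading of "transition matrices applied to topic spaces," sketched as a starting point rather than a worked-out method: label each segment of an episode with its dominant topic, then count how often one dominant topic hands off to another between consecutive segments. Collapsing a segment to a single topic is a crude assumption that throws away most of the space, and the data below are invented.

```python
def dominant_topic(shares):
    """Index of the topic with the largest share in a segment."""
    return max(range(len(shares)), key=shares.__getitem__)


def transition_matrix(episodes, n_topics):
    """Row-normalized counts of dominant-topic handoffs between consecutive segments."""
    counts = [[0] * n_topics for _ in range(n_topics)]
    for segments in episodes:
        states = [dominant_topic(s) for s in segments]
        for a, b in zip(states, states[1:]):
            counts[a][b] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# Two invented episodes, three segments each, three topics.
episodes = [
    [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1], [0.1, 0.2, 0.7]],
    [[0.5, 0.4, 0.1], [0.3, 0.6, 0.1], [0.2, 0.5, 0.3]],
]
matrix = transition_matrix(episodes, n_topics=3)
for row in matrix:
    print(row)
```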
In some ways, the problem resembles musicology, a field which has a firm grasp on structure, more than text. The math reminds me in particular of Dmitri Tymoczko's "A geometry of music," which treats chord progressions as paths through multidimensional space. But the time spans that we're working with are much greater, the kind of spaces where you want a sort of Schenkerian analysis of plot. The pure formulaic banality of much network TV makes this project particularly appealing here; if the sitcom episode is a rondo and the hourlong drama a sonata form, each cinematic drama probably more closely resembles a meandering, idiosyncratic tone poem. (To say nothing of the novel. Though I'm sure there are ways this could be creatively deployed on that white whale of narrative, I doubt the signal for plot arcs would be anywhere near as strong).
But there's also a danger that analogy to high-level musical structures implies. Schenker's Ursatz can seem like a weird cross between a mirage and a tautology. As I said, some kind of structures like these have to exist just because of the math; that doesn't mean they're real in any meaningful sense, or that they're useful. So I'm curious what anyone else thinks on this one in particular.
Two images with a lot more shows are graphically problematic, but might help you to locate your favorite show or find some errors. The prettiest way to handle this would be an interactive graphic with D3 using my d3.layout.trail library, so you could see the curves unveil in space when you added a show; but I've got to get a little actual history done this week, too, so maybe not.
Appendix 1 (open in new window): Top 200 shows, in one giant plot. Shows the tendency for southwestern shows to end closer to the middle of the plot arc, which worries me a bit: and the tendency of all shows to cluster in the portions that form a sort of macro version of the arc, which, I don't even know.
Appendix 2 (open in new window and expand): Top 200 shows in alphabetical order, faceted one per frame.