Sapping Attention: Fundamental plot arcs, seen through multidimensional analysis of thousands of TV and movie scripts

Tuesday, December 16, 2014

Fundamental plot arcs, seen through multidimensional analysis of thousands of TV and movie scripts

Note: a somewhat more complete and slightly less colloquial, but eminently more citeable, version of this work is in the Proceedings of the 2015 IEEE International Conference on Big Data. Plus, it was only there that I came around to calling the whole endeavor "plot arceology."

It's interesting to look, as I did in my last post, at the plot structure of typical episodes of a TV show as derived through topic models. But while it may help in understanding individual TV shows, the method also shows some promise for a more ambitious goal: understanding the general structural elements that most TV shows and movies draw from. TV and movie scripts are carefully crafted structures: I wrote earlier about how the Simpsons moves away from the school after its first few minutes, for example, and with this larger corpus even individual words frequently show a strong bias towards the front or end of scripts. This crafting shows up in the ways language is distributed through them in time.

So that's what I'm going to do here: make some general observations about the ways that scripts shift thematically. On its own, this stuff is pretty interesting--when I first started analyzing the set, I thought it might be an end in itself. But it turns out that by combining those thematic shifts with the topic models, it's possible to do something I find really fascinating, and a little mysterious: you can sketch out, derived from the tens of thousands of hours of dialogue in the corpus, what you could literally call a plot "arc" through multidimensional space.

Words in screen time

First, let's lay the groundwork. Many, many individual words show strong trends towards the beginning or end of scripts. In fact, plotting movies in what I'm calling "screen time" usually has a much more recognizable signature than plotting things in the "historic time" you can explore yourself in the movie bookworm. So what I've done is cut every script there into "twelfths" of a movie or TV show; the charts here show the course of an episode or movie from the first minute at the left to the last one at the right. For example: the phrase "love you" (as in, mostly, "I love you") is most frequent towards the end of movies or TV shows: characters in movies are almost three times more likely to profess their love in the last scene of a movie than in the first.
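
To make the binning concrete, here is a minimal sketch of counting a phrase across equal slices of screen time. It is not the code behind the charts: the input format (a list of timestamped lines), the function name, and the per-1,000-words normalization are all assumptions made for illustration.

```python
def phrase_rate_by_part(lines, phrase, n_parts=12):
    """Occurrences of `phrase` per 1,000 words in each equal slice of screen time.

    `lines` is a list of (seconds_from_start, text) pairs for one script;
    that input format is an assumption made for this sketch.
    """
    if not lines:
        return [0.0] * n_parts
    hits = [0] * n_parts
    words = [0] * n_parts
    duration = max(t for t, _ in lines) or 1  # guard against a zero-length script
    target = phrase.lower()
    for t, text in lines:
        # The final line would land in part n_parts, so clamp it into the last part.
        part = min(int(n_parts * t / duration), n_parts - 1)
        lowered = text.lower()
        hits[part] += lowered.count(target)
        words[part] += len(lowered.split())
    return [1000.0 * h / w if w else 0.0 for h, w in zip(hits, words)]


# Invented three-line "script", split into thirds to keep the example readable.
toy_script = [
    (0, "Good morning, see you at the office"),
    (600, "If you leave now you will regret it"),
    (1200, "I love you"),
]
print(phrase_rate_by_part(toy_script, "love you", n_parts=3))
```

Summing `hits` and `words` over every script before dividing would give the corpus-wide curve rather than a single script's.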

(I'm including TV shows and "unknown" as a sanity check. The same pattern holds for television; but movies tend to be more romantic.) That love waits for the end to declare itself is not necessarily surprising: but it is fundamental to the medium, and a phenomenon well worth understanding better. What other conventions are there like this? How do individual TV shows play off these shared structural elements of the medium as a whole?

Lots of individual words or phrases have their own trends, even ones that don't map so firmly to our understanding of plot points. Take the phrase "if you," for instance. It's rare at the beginning of scripts, common in the middle, and then rare again towards the end. Why? Well, presumably it encapsulates a particular type of dialogue, in which the protagonist has to give or seek advice about the consequences of their actions. First scenes won't have the necessary exposition laid out for the hero to reach a crossroads; and as the script wraps up in the third act, the importance of decisions fades out in favor of consequences. You can see them working through those results grammatically, as well: the word "because" (not shown) becomes more and more common as the typical script proceeds.

These are interesting, and at some point I'll share a link for exploring these words on your own. I could probably go on for quite a while. Right now, though, my question is less about the individual instances than about generalizing some of the patterns underlying these shifts.

One way to approach these overall questions of structure would be to use all the words in the scripts, and search for interesting patterns. This is actually feasible, though computationally pretty difficult, and something I'll keep in mind for later. But for now, the topic models I described last time provide a nice way to understand what's actually going on in a typical plot arc. (If you're new here and don't know what topic models or LDA are, all the digital humanists got together and decided that this post by Ted Underwood is the one to read for an explanation.)

Topics in screen time

Just as words show patterns in screen time, so do topics. But since I only have 127 topics, it's easy to search all of them for the strongest patterns. Using linear models to test fittedness to a straight line, I went through the data set to pull out the nine topics that show the strongest directionality. Here they are (click to enlarge):

All of these topics show clear and strong directionality. Over the course of a script, characters talk more about the truth and lying; they more often narrate murders; they use the language of apology and authenticity. (Two out of the nine topics display "sorry" prominently.) They talk less, on the other hand, about "business" and "companies," about their "job" and the "office," and about times of day.
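
For readers who want to try the ranking step, here is a rough sketch of one way to score "directionality": fit a straight line to each topic's curve across the parts of a script and rank topics by R-squared. The post only says that linear models were used, so the choice of R-squared as the score, the function names, and the toy curves are assumptions. Each curve needs at least two parts.

```python
def linear_fit(ys):
    """Least-squares fit of ys against 0..n-1; returns (slope, r_squared)."""
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    r_squared = (sxy * sxy) / (sxx * syy) if syy else 0.0
    return slope, r_squared


def most_directional(topic_curves, k=9):
    """Return the k topic labels whose curves are best fit by a straight line.

    `topic_curves` maps a topic label to its share of dialogue in each part.
    """
    scored = [(linear_fit(curve)[1], label) for label, curve in topic_curves.items()]
    return [label for _, label in sorted(scored, reverse=True)[:k]]


# Invented curves, four parts each.
toy_curves = {
    "rising": [1, 2, 3, 4],
    "wobbly": [2, 3, 2, 3],
    "falling": [4, 3, 2.1, 1],
}
print(most_directional(toy_curves, k=2))
```

The sign of the slope then says whether a topic grows or shrinks over the script, which is what separates the "sorry" topics from the "office" ones.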

Using the same method but working from the middle out, I can also see which topics are clustered in the middle of episodes or movies; or conversely, which ones are predominantly used in the beginning or ends of scripts. Some of these curves aren't quite as pretty, but do show the elements that tend to get dropped or emphasized when the game is actually afoot. The two topics used mostly at the edges are grand language about the world and the government: there are some really beautiful curves that show up mostly in the middle of movies. Two pairs in particular stick out. One pair is "look eyes face Look eye" and "hand head cut hands," suggesting that a focus on the human body and the human face occurs mostly in the heart of films. The other is "question answer questions" and "talk doesn people wants says," suggesting that descriptive language about narration or questioning also is comparatively rare in the first and third acts.
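
One simple way to read "working from the middle out" is to correlate each topic's curve with the distance of each part from the midpoint of the script. The sketch below does that; it is an interpretation rather than the method actually used, and the function names are invented.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy) if sxx and syy else 0.0


def edge_vs_middle(curve):
    """Positive when a topic sits at the edges, negative when it sits in the middle."""
    center = (len(curve) - 1) / 2
    distance_from_middle = [abs(i - center) for i in range(len(curve))]
    return pearson(distance_from_middle, curve)


print(edge_vs_middle([1, 2, 3, 3, 2, 1]))  # a hump in the middle of the script
print(edge_vs_middle([3, 2, 1, 1, 2, 3]))  # a topic used at the start and end
```

Sorting topics by this score in both directions would pull out the "faces and hands" style topics at one end and the "world and government" style topics at the other.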

It's easy to get a little just-so about these curves. For instance, after falling away later, the face does return in last scenes. Does the face return, in ways that the rest of the body doesn't, because it contains the same grand eternal resonance that the nation and the world can have? Maybe that's crazy. But certainly, there are major elements of structural similarity being shown by these patterns.

[One caveat before I get into the really exciting stuff: the way I've chosen to arrange the data is creating a few odd artifacts here. In particular, dividing up into three-minute chunks and then dividing into twelfths occasionally does strange things. This is particularly true for shows shorter than 24 minutes, which typically end up lacking any three-minute chunks assigned to twelfth 2, 3, or 4 (I think). In later versions I may change this; but it could take a few days of on-again-off-again computation, and I'm so interested in these models right now I don't want to take the time off.]
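
The artifact is easy to reproduce in miniature. The sketch below assumes a chunk is assigned to whichever twelfth contains its start time. The post does not spell out the actual rule, so the specific empty twelfths this produces may well differ from the ones named above. The general point survives any rule: a show with fewer than twelve three-minute chunks cannot fill twelve bins.

```python
def empty_parts(show_minutes, chunk_minutes=3, n_parts=12):
    """Return the parts (0-based) that receive no chunk for a show of this length."""
    starts = range(0, show_minutes, chunk_minutes)
    # Assumed rule: a chunk belongs to the part that contains its start time.
    used = {n_parts * start // show_minutes for start in starts}
    return [part for part in range(n_parts) if part not in used]


print(empty_parts(21))  # a sitcom-length show leaves gaps
print(empty_parts(36))  # twelve chunks fill twelve parts
```

Dropping to six parts, as the later sections do, is one way to sidestep the gaps for short shows.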

Moving through topic space in screen time

So: those charts show the distribution of topics in screen time. I think that's a practice less susceptible to embarrassing errors than one I've objected to in the past, of plotting topics in historical time, because the guarantee that each author is equally represented across time periods helps to eliminate some questions of linguistic drift. (There are other problems, though: I'm sure that having about 200 documents that all say "Space: the final frontier" in their first couple minutes is skewing certain elements of the model in odd ways.)

I'm not interested, fundamentally, in topics, though: I'm interested in the TV shows. So how can we move from writing about topics, which is boring, to writing about television, which isn't? The answer is: we think about the ways that individual episodes of TV shows move through the space created by topics. One way to do that is by reducing the dimensionality of the topic space using principal components analysis. Let me ease into that, though, by using just a few of the topics to explain.

Suppose we take what I think is the most interesting linear trend above; the shift from the mundane world of talking about times of day to the personal world of giving and taking apologies.

This chart is showing proportion of the morning-tomorrow topic on the y axis, and of the sorry-really-talk topic on the x axis. A typical script starts in part 0, using 1% "apology" language and 1.6% "time" language. By the end, those proportions have been flipped: that script moves from one pole to the other.

We don't just have to worry about a "typical" script, though: we can drop any text we want into this same space and see how it moves. These tend to be noisier because of the smaller sample, so I'm going to use 6 parts rather than 12. (If I really wanted to dazzle you, I'd use a moving average that would artificially smooth the curves: but that's not totally necessary). I've randomly picked 6 of the top 20 shows in the sample to see how they move on this same chart: here's what it looks like.

Law and Order SVU follows the pattern almost exactly; The West Wing comes close, although it has mostly finished its time talk by the end of the first half hour. Other shows depart more significantly; 24 is on track until it doubles back in the last 20 minutes, and Survivor moves left to right but not top to bottom. Doctor Who seems to almost move in the opposite direction of the convention. Still, with just two relatively scarce variables, it's surprising to me how well a random sample holds up; it's probably more than a coincidence, as well, that The West Wing and Law and Order are the shows that use the vocabulary the most, as well as using it the most typically. The ones that fail are the most under-represented. For example, Doctor Who hardly talks about times of day at all (perhaps doing it from a time travel topic instead).
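
Dropping a single show into the two-topic chart amounts to averaging its chunks into six points and, optionally, smoothing them. Here is a minimal sketch of both steps. The input format, the function names, and the centered three-point window are assumptions, and the toy numbers are invented to mimic the 1% / 1.6% flip described above. It expects at least one chunk.

```python
def path_in_topic_space(chunks, n_parts=6):
    """Average consecutive chunks into n_parts points.

    `chunks` is a list of (x_share, y_share) pairs in screen-time order,
    e.g. the share of an "apology" topic and of a "time of day" topic.
    """
    n = len(chunks)
    path = []
    for part in range(n_parts):
        lo = part * n // n_parts
        hi = max((part + 1) * n // n_parts, lo + 1)  # never average an empty slice
        group = chunks[lo:hi]
        mean_x = sum(x for x, _ in group) / len(group)
        mean_y = sum(y for _, y in group) / len(group)
        path.append((mean_x, mean_y))
    return path


def moving_average(values, window=3):
    """Centered moving average that shrinks the window at the ends."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


# Invented show: twelve chunks drifting from "time" talk towards "apology" talk.
toy_chunks = [(0.010 + 0.0005 * i, 0.016 - 0.0005 * i) for i in range(12)]
toy_path = path_in_topic_space(toy_chunks)
print(toy_path[0], toy_path[-1])
```

Smoothing makes the curves prettier but also hides exactly the doubling-back that makes a show like 24 stand out, which is a reason to look at the raw six points first.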

[By the way, you might look at these charts and say that they're completely random, this method is bunk, and there's no sign of the left to right trends in the individual shows. I don't think you'd be crazy to do so, based on the charts above--but at least wait until we get to the higher-dimensional stuff before passing definitive judgment.]

Plot arcs

This is interesting, but there are even better ways to approach the general problem of plot arcs. These are charts of motion in two-dimensional space: but we have a full 127 dimensions to move around in, one for every topic. Principal Components provides a nice way to plot them all simultaneously. So what I'm doing is taking a restricted set of only 6 documents, one for each sixth in screen time of all the text spoken in all of the movies or TV shows for that period: and calculating the first two components. (Forgive me for linking to one of the earliest series on this blog, a deeper explanation of PCA. And I should note that though I'm plotting 12 chunks, I use 6 chunks of screen time rather than 12 to calculate the components because of the aforementioned problems with shows under 23 minutes.)
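
In code, the procedure is short: fit the components on the six aggregate documents, then project the twelve chunks into the same plane. The sketch below uses NumPy's SVD and a synthetic four-topic stand-in for the real 127-topic data. The toy topics, one pair trending in a straight line and one pair humped in the middle, are invented, and the function names are mine rather than anything from the original analysis.

```python
import numpy as np


def toy_topic_shares(n_parts):
    """Synthetic stand-in for aggregate topic shares: 4 invented topics, not real data."""
    t = np.linspace(0.0, 1.0, n_parts)
    hump = 0.5 * np.sin(np.pi * t)
    shares = np.column_stack([1 - t, t, hump, 1 - hump])
    return shares / shares.sum(axis=1, keepdims=True)


def fit_pca(docs, n_components=2):
    """Return (mean, components) for a (n_docs, n_topics) array of topic shares."""
    mean = docs.mean(axis=0)
    # Rows of vt are the principal axes, ordered by explained variance.
    _, _, vt = np.linalg.svd(docs - mean, full_matrices=False)
    return mean, vt[:n_components]


def project(points, mean, components):
    """Coordinates of each row of `points` on the fitted components."""
    return (points - mean) @ components.T


mean, components = fit_pca(toy_topic_shares(6))   # fit on sixths
arc = project(toy_topic_shares(12), mean, components)  # plot twelfths
print(arc.round(3))
```

With this toy input the first component tracks the straight-line trend and the second the hump, so the twelve points trace an arc. The sign of each component is arbitrary, so the arc may come out flipped or mirrored.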

Here we arrive at the heart of this long post. Plotting individual trends, there are lots of ways you can imagine a curve moving. But when we combine all the topics present in all the shows, a single structure emerges. It's quite literally a "plot arc," reduced down from a much more complicated curve moving through 127-dimensional space.

What is this saying? That in the grand corpus of tens of thousands of hours of studio-approved, investor-funded, union-written scripts, two major trends stand out: one set of directional trends, advancing continuously through the course of the film, and one cyclical, through which the language returns to its origins. It's tempting to get mystical on this, and humanists often do so when applying techniques like PCA. So perhaps I should emphasize that it's hard to imagine any other shape coming out of the PCA algorithm with the inputs I put in (which were specifically designed to destroy the genre signals that would ordinarily be output by PCA). Even before I ran it, I was pretty confident this sort of curve would emerge. About the only other reasonable possible option for story structure would be a circle a la Dan Harmon. But a true circle is pretty implausible--we know, intuitively, that the first lines of a story won't be exactly the same as the last, although they may be similar in some ways. So this pure rainbow shape is probably more a confirmation that the method works than a radical insight into the nature of narration. (But maybe, some part of me wants to say, I shouldn't give too much up? Maybe it's just a little bit fundamental to narratology? Just don't tell anyone outside this parenthesis I said so.)

What's really interesting is not just knowing that there's an arc, but knowing what makes it up. For this, we just have to look at the loadings for the components. Here they are, overlaid: this is not a particularly beautiful visualization and you'd need to expand to really read it, but it gives a better sense of what it means to move through this space. Although there are 127 dimensions, I'm only showing the 15 most important ones on the plot here.
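
Picking the topics to label comes down to ranking the loadings. The post does not say how "most important" was defined, so the sketch below assumes it means the length of each topic's loading vector across the two components. The numbers are invented for illustration and the labels are borrowed from topics mentioned earlier.

```python
import numpy as np


def top_loadings(components, labels, k=15):
    """Return (label, loading_on_pc1, loading_on_pc2) for the k strongest topics.

    `components` is a (2, n_topics) array; strength is the Euclidean length
    of each topic's two loadings, which is an assumed definition.
    """
    strength = np.linalg.norm(components, axis=0)
    order = np.argsort(strength)[::-1][:k]
    return [(labels[i], float(components[0, i]), float(components[1, i])) for i in order]


# Invented loadings for four topics.
toy_components = np.array([
    [0.8, -0.5, 0.1, 0.0],
    [0.1, 0.2, 0.6, -0.05],
])
toy_labels = ["morning tomorrow", "sorry really talk", "look eyes face", "question answer"]
for label, pc1, pc2 in top_loadings(toy_components, toy_labels, k=2):
    print(f"{label}: {pc1:+.2f}, {pc2:+.2f}")
```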

Just for fun, let's put that into a single narrative: the typical script starts among a group of wise-cracking teenagers at the school, making plans for the day to come and the weekend. At the office, however, a dead body is discovered. The wisecracking ceases, and instead the befuddled victims try to describe more accurately how the murder happened, to apologize for their mistakes, and to inspire each other not to give in to defeat but to fight for victory. A heartfelt plea to the Almighty for help lies over their testimony at the trial; and they carefully move into the future, apologizing once again and reflecting on the new truths they have learned.

If anything jumps out at you from this, it might be the dominance of tropes from detective/crime fiction; just as we saw in the last post, those do seem to be some of the strongest elements, and I am somewhat tempted to exclude them from future versions of this. On the other hand, it does capture the centrality of the murder mystery to contemporary television fiction; no other genre comes close in laying out the general scene. Whatever the general principles of plot we learn by watching TV and movies, they're almost certainly deeply inflected by the crime show. (I don't have a place in my personal memory palace for literary criticism, but I believe there's an argument out there that the detective novel is central to the formation of the 20th century novel--anyone?)

But it's important to remember that this is not a general scheme that every show follows--the point is that scripts tend to start with either school scenes, or work scenes, or the discovery of a body. There are many different ways to reach this same plot curve, including some probably nonsensical ones. (Wisecracking at the office doesn't lead to a trial, very often--although we could actually search through the corpus with this method to find the rare episode where this is the case.)

How individual shows mirror the general plot arc

Just as I was able to overlay individual shows on the toy example of "apology" and "temporality," I can do the same thing with the individual plot arcs. And--lo and behold--those shows tend to follow the same plot arc.

They're not perfect (although ER is pretty close)--but they're also certainly not random. The general curve reflects quite well the overall shape of the arc traced by the aggregate show. (Brief note: because I realized while writing this that to prove efficacy I needed to segregate training data from output, these are arcs that are generated on a subcorpus of the full set specifically designed not to include CSI, ER, or any of the other top 20 shows. So ER is following a plot arc I derived from a batch of shows that don't include any ER, CSI, or any of the other shows you're about to see.)
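
The held-out setup described in the note above can be sketched as follows: fit the components on every show except the one being plotted, then project that show. This is a simplified illustration. It averages the training shows with equal weight, whereas the aggregate documents in the post pool all the text, and the show names and data are invented.

```python
import numpy as np


def held_out_arc(sixths_by_show, held_out, n_components=2):
    """Fit components on every show except `held_out`, then project that show.

    `sixths_by_show` maps a show name to a (6, n_topics) array of topic shares.
    """
    training = [m for name, m in sixths_by_show.items() if name != held_out]
    if not training:
        raise ValueError("need at least one show besides the held-out one")
    aggregate = np.mean(training, axis=0)  # unweighted average: an assumption
    mean = aggregate.mean(axis=0)
    _, _, vt = np.linalg.svd(aggregate - mean, full_matrices=False)
    return (sixths_by_show[held_out] - mean) @ vt[:n_components].T


# Invented shows sharing one pattern, with different overall levels.
t = np.linspace(0.0, 1.0, 6)
base = np.column_stack([1 - t, t, 0.5 * np.sin(np.pi * t)])
toy_shows = {"Show A": base, "Show B": base + 0.1, "Show C": base * 0.9}
print(held_out_arc(toy_shows, "Show C").round(3))
```

Because the held-out show never touches the fit, any arc it traces cannot be an echo of its own vocabulary.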

But eagle-eyed readers might have noticed that the x and y-axes have different numbers than for the archetype. How do these fit in? This is where I got a result that I didn't expect, though perhaps I should have. Here are the twenty shows with the most minutes of dialogue in the movie bookworm, arranged on a single plot. Almost all of them sweep out an arc roughly like the ER or CSI ones (although some are rougher than others, and a few shows, like the "Gilmore Girls" and "The Simpsons," seem to extend the action further in their final scene rather than returning home.)

But although they trace out arcs, they do it in their portion of the plot arc space. (To really make sense out of this chart, you might have to not just click to expand, but open in a separate window, zoom in and zoom out, pan around; or, just trust me.) The portions of plot-arc space they land in correspond to genre: the crime shows live in an area something like the early middle of a show, while science fiction camps out after the end of the end.

If you know PCA, this may not initially seem weird to you: after all, the method is usually fairly good at segregating genres. But remember, this is PCA on a space made out of only 6 massive documents, one for each ten minutes of the hour; each TV show is evenly divided across those 6, so the normal avenues for genre signal should be almost entirely muted. But it turns out that the signal for structurality is also tied to the signal for genre. So all those Star Treks are camped out together in the southeast, because the entirety of science fiction language is more "endingy" than even the final scenes of your typical "NCIS" episode. (Particularly on the up-down dimension, which, remember, tends to separate the mundane from the eternal. There aren't all that many water-cooler scenes in Star Trek, it turns out; and conversely, the forensics shows aren't dispensing many eternal truths.) What these positions tell us, roughly, is that the genre signal is somewhat, but not immensely, stronger than the plot arc signal even on terms designed to discover plot relationships above all.

So that clustering is interesting enough: but the omnipresence of the curves suggests that they all follow the same path through space in some way, regardless of where they start. Look at each one of these, allowed to expand out: some shows (Charmed, Cold Case, all three Star Treks) follow the curve perfectly, while most others at least move left to right. The two with the biggest problems are Survivor (which obviously faces different narrative issues) and 24 (which I can't really explain away by its plot structure). There's some tendency for shows that start in the southwest quadrant to end less decisively. But even then, they usually follow something like the standard curve.

From a multidimensional point of view, this tells us something really interesting for future research: that plot is about motion through multidimensional space, not about position in it. Most of our existing text analysis toolkits are built around spatial and probabilistic measures that don't really conceive of any possibility for motion; but these archetypal plot arcs are better thought of as forces that pull a given script, whatever its starting point, in different directions and with different strengths as time unfolds, like tides going in and out.
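
One hedged way to turn "motion, not position" into a number is to compare the steps a show takes with the steps the archetype takes, ignoring where either path starts. The sketch below uses the mean cosine similarity between corresponding steps. This measure is my illustration and not something computed for the charts above, and it skips any step of zero length.

```python
import numpy as np


def step_vectors(path):
    """Displacement between consecutive points of a path."""
    return np.diff(np.asarray(path, dtype=float), axis=0)


def motion_similarity(show_path, archetype_path):
    """Mean cosine similarity between corresponding steps of two paths."""
    a, b = step_vectors(show_path), step_vectors(archetype_path)
    dots = (a * b).sum(axis=1)
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    valid = norms > 0  # a zero-length step has no direction to compare
    return float(np.mean(dots[valid] / norms[valid]))


# Invented archetype, plus the same arc shifted to a different "genre" region.
archetype = [(0, 0), (1, 1), (2, 1.5), (3, 1), (4, 0)]
shifted = [(x + 10, y - 5) for x, y in archetype]
print(motion_similarity(shifted, archetype))
```

A show camped out far from the archetype scores as high as the archetype itself, as long as it moves the same way, which is the sense in which the Star Treks can be "endingy" in position and still perfectly conventional in motion.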

Breakpoint

Even at this point, there are some interesting questions that could be answered with the tools at hand. What are the most typical shows? Do any shows have inverted arcs? Do successful shows follow the arc better than failures? Why doesn't Community, Dan Harmon's story circle be damned, follow the standard arc? Some of these are probably best investigated in particular subsections: what can we learn about the transatlantic genre of the cop/detective show, maybe by trying to filter out national vocabulary? I might turn to those low-hanging fruit soon, and I'm ending mostly because this post is long enough and the data could use some cleaning. (Next time I write on this, I'll be using a slightly different topic model, and possibly a 1-minute chunking rather than a three-minute one.) But I could easily fill out another half-dozen questions in that same vein right now, and I'm sure you can too.

One extremely important observation I've noticed, but haven't had time to fully plumb, is this: TV shows show much greater regularities of form than do films. This could be data problems, but I suspect it's real; everything about the TV system enforces heavy constraints and tight structures that even a 90-minute movie can more easily flout. (Particularly because my topics seem to better capture themes of interpersonal conflict than individual growth.)

But because this is getting out past where I usually take text analysis, really pushing the investigation farther may require some really difficult thought. How best to capture motion along arcs as a phenomenon in itself? I could just multiply my multidimensional space across 6 more dimensions, but I'm already pushing the limits of the data. The closest standard technique might be transition matrices but applied to topic spaces rather than words.
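
As a very rough sketch of what a transition matrix over topics might look like, one option is to treat each chunk's topic shares as soft counts and accumulate "topic i now, topic j next" weights between consecutive chunks. This is one possible reading of the idea rather than a worked-out method, and every name below is invented. It expects a non-empty list of chunks.

```python
def soft_transition_matrix(chunks):
    """Row-normalized topic-to-topic transitions between consecutive chunks.

    `chunks` is a list of topic-share vectors in screen-time order.
    """
    n_topics = len(chunks[0])
    counts = [[0.0] * n_topics for _ in range(n_topics)]
    for now, nxt in zip(chunks, chunks[1:]):
        for i, p in enumerate(now):
            for j, q in enumerate(nxt):
                counts[i][j] += p * q  # weight of "topic i now, topic j next"
    matrix = []
    for row in counts:
        total = sum(row)
        # A topic that never appears before a transition keeps an all-zero row.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# Invented script that is purely topic 0, then topic 1, then topic 2.
toy_sequence = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
for row in soft_transition_matrix(toy_sequence):
    print(row)
```

A matrix like this throws away screen time itself, so it could only ever complement the arcs: it says which themes tend to follow which, not when.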

In some ways, the problem resembles musicology, a field which has a firm grasp on structure, more than text. The math reminds me in particular of Dmitri Tymoczko's "A Geometry of Music," which treats chord progressions as paths through multidimensional space. But the time spans that we're working with are much greater, the kind of spaces where you want a sort of Schenkerian analysis of plot. The pure formulaic banality of much network TV makes this project particularly appealing here; if the sitcom episode is a rondo and the hourlong drama a sonata form, each cinematic drama probably more closely resembles a meandering, idiosyncratic tone poem. (To say nothing of the novel. Though I'm sure there are ways this could be creatively deployed on that white whale of narrative, I doubt the signal for plot arcs would be anywhere near as strong.)

But there's also a danger that analogy to high-level musical structures implies. Schenker's Ursatz can seem like a weird cross between a mirage and a tautology. As I said, some kind of structures like these have to exist just because of the math; that doesn't mean they're real in any meaningful sense, or that they're useful. So I'm curious what anyone else thinks on this one in particular.

Appendix

Two images with a lot more shows are graphically problematic, but might help you to locate your favorite show or find some errors. The prettiest way to handle this would be an interactive graphic with D3 using my d3.layout.trail library, so you could see the curves unveil in space when you added a show; but I've got to get a little actual history done this week, too, so maybe not.

Appendix 1 (open in new window): Top 200 shows, in one giant plot. Shows the tendency for southwestern shows to end closer to the middle of the plot arc, which worries me a bit: and the tendency of all shows to cluster in the portions that form a sort of macro version of the arc, which, I don't even know.

Appendix 2 (open in new window and expand): Top 200 shows in alphabetical order, faceted one per frame.

23 comments:

Kaleberg, December 16, 2014 at 12:56 PM
Wow, this is great stuff. You have analytically uncovered the "three acts" theory of story structure. As Julius Epstein, who wrote Casablanca, said to Vincent Sassone who was taking a writing course: "You're wasting your money. I'll tell you how to write a screenplay in three sentences. Act I, get your guy up a tree. Act II, throw rocks at him. Act III, get your guy outta the tree." Basically, there's the setup, the complications and the resolution.

In other words, the structure you are detecting is completely intentional. It's also pretty old. You should dump some Euripides or Aeschylus into your analyzer. It's probably cross cultural too.
mexico guy, December 17, 2014 at 9:13 PM
Unbelievably good! Got this recommended by the Browser and it didn't disappoint. Great work.
Anonymous, December 18, 2014 at 8:55 AM
Never seen so much blood come out of a stone before.
Ted Underwood, December 18, 2014 at 12:38 PM
This is really excellent stuff. Basic structural principles of narrative. I kind of can't even summarize my response. Superimposing the arcs of different shows is an excellent viz idea.

Obviously, going to want to do this with novels as well. One could group novels by author or genre to create patterns as strong as the ones you're getting from television serials.

Whew! Awesome idea.
Anonymous, December 18, 2014 at 12:42 PM
Wheel, reinvented. Congratulations.
Ted Underwood, December 18, 2014 at 1:21 PM
The paragraph about Star Trek is especially wild. Deep relationship between plot structure and genre?!?!? Structuralists should looooove this stuff.
Anonymous, December 18, 2014 at 1:46 PM
What follows is pasted in from an e-mail from a friend that I want to respond to here:

> [...] And believe it or not I actually buy his [Dan Harmon's]/your/Joseph Campbell's mythology stuff: when I read Campbell a long time ago I couldn't see the point of his superficial all-myths-say-the-same-thing structuralism, but as a barometer of today's highly formalized pop culture mythology it actually makes sense.
> Have I given you my long-winded paean before about how The State guys (David Wain, Michael Showalter, Michael Ian Black, etc.) are actually pretty great narratologists? Their shows are constantly commenting on how the consistency of modern narratives have warped our sense of what counts as significant and eventful. But those deconstructions are usually equivalent to close readings rather than radical structural changes, and I'll be if you ran Wet Hot American Summer, Reno 911, or Stella through the bookworm bot the plot arcs wouldn't surprise you.
> But that doesn't mean I'd agree with your more dismal suspicion that Schenkerian sketches of this kind are banal or tautological. Cognitive narratologists are really trying to figure out what kind of conventional narrative order the mind (in both the individual sense and the cultural one) needs or wants or uses — not as a salve but as a scaffolding that can support a lot of renovation/rethinking at different levels.
> Or to put it another way, we are rarely aware of how concepts of time or causation are narrative fabrications until they're pointed out to us. The conceptual blending guys, Fauconnier and Turner, have talked about how our concept of "punishment" relies on exactly the kind of narrative that you pinged in one of your earlier charts (feel/hate/sorry/hurt/wrong fault): this is maybe an obvious one because we are more aware of judgments about guilt and retribution being selective readings of a situation because they're rooted in legal systems which are overtly deliberative. But you're also drawing out a lot of nonobvious conventions that are important for highlighting how deep our mythologies run (came/knew/wanted/saw/day/remember took is one of my favorite topics, and the Survivor pseudo-coda at point 4 in your PCA chart is very interesting too). I have a pet theory about the meta-arcing of your mini arcs but I'll save that for another day.
Anonymous, December 20, 2014 at 11:06 AM
Kurt Vonnegut did something like this for his thesis at U of Chicago. The faculty laughed him out of the University. Good luck for everyone who enjoyed his later work in fiction.
Erez Aiden, January 4, 2015 at 4:47 PM
Hey Ben -

This is really, really neat. I enjoyed this whole post and think there is a lot to what you've uncovered.

But I do worry that the fact that whole genre-s are more 'ending-y' is not so much a discovery as an issue with the method, an indication that genre features need to be controlled for more systematically.

Seems to me that progressing faster-or-slower through the plot is something that can vary by genre, but that any method which effectively says s/t like: well this genre starts at the end and keeps going is no longer doing something that passes the "reasonable" sniff test for plot structure.

I realize that what you are saying in the text of the blog post is more nuanced than this, but on first pass I think the data may simply indicate that star-trek is full of words that are ending-y for other genres, and so the classifier doesn't do a very good job on star trek. And indeed on many other specific shows.

Notwithstanding that it does seem to get meaningful results out of the aggregate of all shows.

Very cool!
Ben, January 7, 2015 at 10:23 AM
Hi Erez! Yes, I worry about the classifier a bit. Still, I think there's a chance even the strongest version of the Star Trek thing might hold. If typical episodes of TV shows tend to deal with big philosophical questions just at the beginning and end, and Star Trek does the same but deals with much bigger questions and deals with them all the time, then this might be a fair characterization. But I agree that it's unlikely that a first-pass method like this is actually clocking into exactly those features--what disturbs me the most is that the second-to-last chart shows some very strange patterns, where the arcs would look better if distorted to be rotated around a focal point. Something's fishy, and I'm going to pull away from PCA after about one more post.

But I don't think I'm saying that "Star Trek" starts at the end, exactly: I'm saying it starts at the end compared to other plots. Say we were looking instead at a metric for identifying laugh lines, and I found that action movies started off really jokey and gradually got less humorous. Then that method was applied to comedies, and found that they ended at a well more beginning-y place than the action movies began, and started at a far laughier place. That wouldn't show a problem with the classifier, because laughter-ratios actually are connected to genre; it would just be a generally interesting finding that both genres got less funny as they progressed. What's important are not the points in multidimensional space, but the directionality of movement among them. So PCA as a "classifier" doesn't do a great job on Star Trek--but the higher-dimensional classifier that tags it as "plotty" does work fairly well, because it moves through vector space from its beginning in a relatively predictable way.

But of course you and Ted are right that it may be disingenuous to call this "plot." "Structural arrangement of thematic elements" would be something more like what's actually going on.
dashore, January 15, 2015 at 7:44 AM
Fantastic:

"the problem resembles musicology... So I'm curious what anyone else thinks on this one in particular."

I'm intrigued by the comparison to Schenkerian analysis; actually excited just to see another "word person" who knows what Schenkerian analysis is (though the more general, successor notion of "voice leading analysis" might be sufficient).

One thing to say is that Schenker proposed a kind of symmetry between analysis and generation: the structures revealed by reduction were the structures that composers working in the framework of tonal harmony elaborate (through techniques like arpeggiation or stepwise motion) in the first place. The basic structures (the tonic triad, the descending line (3, 2, 1)) are not merely what are revealed by analysis, but also what, through elaboration, generate the music as we hear it. And Schenker had reason to believe in this kind of symmetry because he studied the same kinds of composition techniques (harmony and counterpoint) that composers like Bach and Beethoven had studied centuries earlier.

Voice leading analysis is not only, in its conception, symmetrical but homogeneous. The structures that it identifies are (except in some weird cases where it must posit them, but that aside) actual notes that occur in the music. So reductionist analysis takes place in the same medium as the object it reduces. You can *play* the product of an analysis on a piano.

As I think you acknowledge, we should be hesitant to treat your PCA analyses in the same way. Topic modeling (at least as Blei describes it) relies on the fiction that "topics" generate documents, but it's important that we neither forget that this is a methodological fiction nor let it slip into an assertion. Put differently: the tonic and the descending line may (at least plausibly) generate the surface structure of music through elaboration. But TV writers are not generating scripts out of probabilistic collocations of words (certainly not as they occur in set 3 or 6 min segments). Insofar as we conceive of TV writing as having generative principles (it probably has many of them) at all, there's no reason to think those principles are composed of words, much less collocations of words - they may be "plot-points" or "tropes" or "situations" or "story elements." So your analysis is heterogeneous where Schenker's is homogeneous: topics may usefully correlate with some generative structure or another (the debate about whether this is "plot" or not is a useful one), but they are not those structures. This reminds me of Moretti's observation of the distinction between objects of history and objects of knowledge. The topics you identify are objects of knowledge but stand at a distance from the objects of history.

I think your post is pretty clear and right on about all of this - just wanted to help elaborate the comparison.

One more note about voice leading analysis: the point of its reductions (stripping away elaborations, simplifying rhythm, compressing prolongations, etc.) is not to "discover" the same hidden structure in every score (in the way that archetype analysis gets its thrill from discovering "the hero" in different texts). Rather, identifying the Ursatz (or rather positing it - you're right to see it as a kind of tautology) makes it possible to analyze the variously employed techniques of elaboration that individuate different works. The analogy to your post: the arc structure you "discover" is less interesting than the subsequent analyses of deviations from arc structure that it permits. (Though again, I suppose these to be deviations from a statistical norm, not different elaborations of underlying structures, as in Schenker, though perhaps you could come up with a taxonomy or classification of deviations as he did with elaborations.)
gianlu, June 4, 2015 at 6:33 AM
Hi, your post is fascinating, thanks for sharing it!
I am a master's student and a newbie on topic modeling, but I have one question. Is the topic model trained to recognize a specific set of topics or is it trained only on script chunks (regardless of their content)? I find LDA interesting but I would like to understand how to use it properly in this kind of analysis.
Thanks again!