Sapping Attention: Plot arceology 2016: emotion and tension

Monday, July 18, 2016

Plot arceology 2016: emotion and tension

Some scientists came up with a list of the 6 core story types. On the surface, this is extremely similar to Matt Jockers's work from last year. Like Jockers, they use a method for disentangling plots that is based on sentiment analysis, justify it mostly with reference to Kurt Vonnegut, and choose a method for extracting ur-shapes that naturally but opaquely produces harmonic-shaped curves. (Jockers uses the Fourier transform; the authors here use SVD.) I started writing up some thoughts on this two weeks ago, stopped, and then got a media inquiry about the paper, so I thought I'd post my concerns here. These sort of ramp up from the basic but important (only about 40% of the texts they are using are actually fictional stories) to the big one that ties back into Jockers's original work: why use sentiment analysis at all? This leads back into a sort of defense of my method of topic trajectories for describing plots and some bigger requests for others working in the field.

Basic methodology

1. They use some unspecified mechanism to limit the ~50,000 books in Project Gutenberg to 1,700 "stories" or "works of fiction." They use these terms interchangeably. But this suffers from two problems.

1a. First, whatever fiction/nonfiction classifier they use seems to work extraordinarily poorly--almost certainly worse than simply using the Library of Congress classifications that Gutenberg itself distributes with some of its dumps. It includes personal narratives, political essays, instructions for building bird houses, psychology texts, and so forth. If you click the "random" button on their page (which is a great thing to include), you'll see many of these.

1b. It also includes many collections of short stories or pairs of novels published in a single volume. Some of these are the highest scoring "plots" for their basic arcs: for instance, "The Wonder Book of Bible Stories" is the best instance of the inverse of plot 3, and only 1 of the top 5 representatives of (- SV 2) seems to actually be a single narrative.

I ran a spot check of 50 random texts in their browser. I counted 18 non-fiction; 20 novels or short stories; and 12 collections of short stories, or other multi-work texts. So roughly 40% of the texts used are actually what the authors say they are. This makes the conclusions only provisional at best. So many of the titles in the captions are obviously not stories that it's a little baffling they didn't bother to clean up their data set, or use one of the many *actual* fiction collections out there. [Edit: I noticed in the appendix that they classify "fiction" on the basis of length and download count. How they chose the parameters they use isn't clear to me; in any case, it's obvious that just length and download count are *terrible* inputs into a fiction/nonfiction classifier, so it's no wonder they do so poorly.]

2. The null hypothesis that they test against is "word salad": a completely reshuffled ordering of the words. They do indeed seem to show that their stories have stronger shapes than word salads. But this is an extremely weak finding. It's akin to saying that you can predict the stock market because you can show that stock prices exhibit greater regularity than random digits. Of course stocks are not random dice rolls every second; they have a trajectory that they move from randomly. But for a time series like this, I think the null hypothesis should be at the least a random walk, not completely random words: that is, particularly when using normalized scores as here, the assumption should be that any given paragraph has the same emotional valence as the previous paragraph, not a completely new one. That is to say, it is easy to generate a "null narrative" that is distinct from a "null text." This is not to say that there isn't some benefit to checking the weaker null hypothesis first. [Although see below in the comments: Scott Enderle suggests that the random noise they get shouldn't be producing results like it is. So what's going on is yet more unclear.]
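
To make the difference between a "null narrative" and a "null text" concrete, here is a minimal sketch in Python with NumPy. It is not the procedure used in the paper: the series length, the Gaussian step size, and the lag-1 autocorrelation check are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def lag1_autocorr(x):
    """Correlation between a series and the same series shifted by one step."""
    x = np.asarray(x, dtype=float)
    return float(np.corrcoef(x[:-1], x[1:])[0, 1])

n_steps = 1000  # hypothetical number of text windows in one book

# "Null narrative": each window keeps the previous valence plus a small random step.
steps = rng.normal(0.0, 1.0, size=n_steps)
null_narrative = np.cumsum(steps)

# "Null text" (word salad): the same values, but in a random order.
null_text = rng.permutation(null_narrative)

walk_r = lag1_autocorr(null_narrative)
salad_r = lag1_autocorr(null_text)
print(f"random walk lag-1 autocorrelation: {walk_r:.2f}")
print(f"word salad  lag-1 autocorrelation: {salad_r:.2f}")
```

The random walk stays strongly correlated with its own recent past while the shuffled version does not, so beating the shuffled baseline says very little about whether a text has more structure than a random walk would.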

Another question about plot as a time series is: can you predict what will happen? No one working in the field, to my knowledge, has tried to do this, but it could be interesting. In terms of emotional valences, this makes clear, I think, why the word salad null hypothesis is silly; if you want to predict the end of the book from the middle and beginning, you could do better than say "It will start randomly vacillating every word from negative to positive and so on."

3. They decide to test success by number of downloads, and argue that shapes (SV 3) and (-SV 3) are most successful because they "have markedly higher downloads, and somewhat higher variance." The designation as "higher" is based entirely on mean downloads, since the medians are roughly the same. If the mean and the median don't tell the same story, there's probably something else going on. Maybe there's simply more variance, for example, and the number of downloads varies log-normally. When the summary statistics don't agree, it's a stretch to claim any actual conclusions.
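
A small simulation shows how this can happen. This is only a sketch under the assumption that downloads are log-normally distributed; the group sizes, the median of 100, and the two spread values are made up for illustration and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n_books = 100_000  # hypothetical; large so the sample statistics are stable

# Both groups share the same median, exp(mu) = 100; only the spread differs.
mu = np.log(100.0)
group_a = rng.lognormal(mean=mu, sigma=1.0, size=n_books)
group_b = rng.lognormal(mean=mu, sigma=1.5, size=n_books)

median_a, median_b = np.median(group_a), np.median(group_b)
mean_a, mean_b = group_a.mean(), group_b.mean()

print(f"medians: {median_a:.0f} vs {median_b:.0f}")
print(f"means:   {mean_a:.0f} vs {mean_b:.0f}")
```

For a log-normal variable the median is exp(mu) and the mean is exp(mu + sigma^2 / 2), so the group with more variance gets a much higher mean without the typical book being downloaded any more often.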

What are we doing here?

Next on to some bigger questions of what it means to study plot. This paper and Jockers's are two of the more prominent recent works using sentiment analysis as a proxy for "plot." I saw Ted Underwood on Twitter arguing that the next step must be following up on David Bamman's work on experimenting on whether sentiment analysis actually works by using Mechanical Turk to annotate the "emotional trajectory of texts."

I'm basically done thinking about all of these; the combination of my paper on topic-modeling arcs and my meta-reflections on algorithms, plots, and the Jockers-Swafford affair of 2015 for Debates in the Digital Humanities 2016 give most of what I have to say formally about the issue. There are some slides from the IEEE paper that have nice interactives about the beginnings and ends of TV shows. But I thought I'd just blog out a few additional directions I'd like to see followed up.

I have some issues with the idea of validating sentiment analysis results being especially useful for literary analysis, principally because I don't think that even perfectly working sentiment analysis would be a very good way to measure plot. Citing Vonnegut is a bit of a bait-and-switch; he writes about "good fortune" and "ill fortune," not "positive sentiment" and "negative sentiment." Sentiment analysis is already trained on large numbers of human samples of whether something is positive or negative; if we want to explicitly test Vonnegut's hypothesis, we ought to be building new models that classify text as "fortunate" and "misfortunate," which should subtly differ from "positive sentiment" and "negative sentiment."

Or we should be testing theories of plot that, unlike Vonnegut's, actually have any influence beyond a web video from a few years ago. (Vonnegut doesn't even strike me as a writer who was especially good at plot, to be honest.) Train an LSTM model on human-tagged data that can accurately extract the "call to adventure" or the reaching of the innermost cave from a script, and then we might have something interesting, because there's a real interplay between the stories we consume through mass media and popularizations of Joseph Campbell.

Of course that brings me to the final problem here, which is that you *can't* use Mechanical Turk to label stories by their Campbellian archetypes because ordinary readers don't speak in those terms. Is that a problem? Can we expect to find structures that most people wouldn't recognize?

I've said before that I think formal musical analysis is the real place to look here. One could, I imagine, try to classify every Beethoven sonata movement by its emotional trajectory; in some popular understanding of music that is what actually happens. But if macro-musicologists tried to do that, they'd obviously be missing out on the actual formal elements the composer was working with. Early 19th-century European music is organized tonally; a good model of its structure would look at tonal organization, not some nebulous notion of emotionality.

I do not believe there are general story principles as firm as classical-era sonata form. I do think that some combination of Joseph Campbell, commercial organization, and three-act structure conformism leaves contemporary television and movies somewhat predisposed to one or a few narratives that could be usefully explored. Which is why I think it's a huge strategic blunder for everyone working with plots to be looking at novels--probably the least coherent narrative form in existence--instead of any of the many other forms of narrative out there.

Even if there are "master plots," I suspect they will be revealed as much in terms of tension as emotion. (Tension is also more easily analogized to classical musical form, for better or worse, as dominant-tonic relationships.) A plot classifier shouldn't be looking at local emotion; it should be looking at arcs of introduction of tension and release. This requires a very different form of machine reading; every gun on every mantelpiece needs to be tracked until it goes off. (As with everything else these days, this seems structurally better suited for neural networks than the locally tokenized texts we're mostly working with.) Tension explains a wide variety of plots that none of the emotionally based mechanisms can. For example, the preponderance of plots in my TV and movie database are procedurals which are not organized around a single character's rise and fall; instead, they proceed from crime to punishment, from disease to cure, or from acquisition to sale.

I have no idea how to define "tension." You could do it through Mechanical Turk, I guess. But what's really interesting is that we may be able to define it operationally. What sort of events in texts demand resolutions? What distinguishes beginnings from ends? These are more unsupervised questions than ones about emotional trajectories, and ones that might provide us with much more interesting questions to build on as well as answers.

17 comments:

Ted Underwood, July 18, 2016 at 12:53 PM
These are great reflections, and I agree with pretty much everything. To clarify: when I say we need more evidence about human reactions, I don't necessarily mean that we need to validate the sentiment analysis. That's one piece of it — a piece David Bamman covered well. But I agree that the bigger question is, How much does sentiment actually tell us about plot?

I want evidence about human responses -- in the form of genre or popularity or *something* -- mainly in order to address that part of the question, which you rightly identify as crucial. For instance, when I casually tried to use sentiment trajectories to distinguish Shakespearean tragedies and comedies, I didn't get significant results. That's a striking null. A sentiment-based method ought to be able to distinguish those two genres if it can distinguish *anything.*

There are lots of opportunities for further research here.

Ben, July 18, 2016 at 1:03 PM
Ah, responded to this already on Twitter.

Treating this as a prediction problem would be one way to get at it. If someone were to say: which elements of a plot are predictive of it ending happily? Sadly? Then we might get away from treating plot time as one big mass of equally important stuff and finding which inflection points matter. Even sticking with sentiment analysis, I find it really unlikely that spreading sentiment over 100% really works well. These techniques make a huge distinction between a "man in hole" where he falls in the hole at 40% of the way through, and 60% of the way through. Something less rigid may work better; predicting the end state would be a good start.

In the musical analogy, these models are all getting hung up on the modulations in the development. But if you listen to the exposition of a sonata, you know what the key signatures and thematic groups in the recapitulation are going to be, give or take a theme.

Likewise anyone who reads the first third of Pride and Prejudice knows what the two major couples in the end are going to be.

Bill Benzon, July 18, 2016 at 1:45 PM
Hi Ben,

I'm sympathetic to your skepticism about sentiment analysis >> plot and am sympathetic to your nod to musical analysis. And I've got a very specific interest which I've been calling ring-form composition, because that's the standard name. But I'm now thinking of it as ring-form rhetorical structure for reasons that will emerge soon enough.

What's ring-form? A text with a linear structure like this: A B C...X...C' B' A'. There's a structural center and the other segments are such that the second half is a mirror of the first.

I got interested in ring-form in email discussions with Mary Douglas, the anthropologist. Toward the end of her career she got interested in the Old Testament, which is one of the areas where ring-form has traditionally been studied (Homeric epic is another). She ended up writing books on Numbers and on Leviticus in which she argued/demonstrated each is ring-form. Then she gave a series of lectures at Yale (published as Thinking in Circles) in which she laid down some rules of thumb about the form and argued, in one chapter, that Tristram Shandy exhibits ring-form.

Meanwhile I'd been finding ring-form in various texts: Osamu Tezuka's manga, Metropolis; the Nutcracker Suite, Sorcerer's Apprentice, and Pastoral Symphony episodes of Disney's Fantasia; Conrad's Heart of Darkness; Coppola's Apocalypse Now (loosely based on Heart of Darkness); the 1954 Japanese film, Gojira (mangled and deformed into Godzilla, King of the Monsters for an American audience); a few lyric poems; Obama's eulogy for Clementa Pinckney; and, of all things, Alan Liu's PMLA essay on meaning in DH. That's quite a variety of texts, not all of them narratives, and at least one not even artistic (Liu's essay). So, it's a rhetorical form. But where it informs a narrative, as it does in some of these instances, you identify the form with reference to what we ordinarily think of as the plot.

[continued in next comment]

Bill Benzon, July 18, 2016 at 1:46 PM
[continued from previous comment]

Now, for Obama's eulogy for Clementa Pinckney. It's a sermon. What tipped me to the ring-form is that the word "grace" first appeared roughly mid-way in the sermon and then kept on to the end, where he hammered it and then segued into "Amazing Grace." That told me it broke into two parts. Once I knew that, identifying the symmetry was not too difficult.

We've got a video of the eulogy. Obama was working in a tradition where audience response is important. That response, of course, is in the video. If we graphed the amplitude of the sound level we'd have a crude "sentiment" analysis. To my ear there's a noticeable increase at the structural center, but that pales in comparison to the climactic ending.

I'd love a computational routine that would be able to pick out ring-forms in actual narratives. But I'm skeptical. Would a sentiment analysis of Heart of Darkness identify the structural center? I don't know and really have no way of guessing. I can tell you that the structural center occurs in the longest paragraph in the text, but I don't think that that would be a generally useful clue to much of anything. It just happens to be an interesting feature of this rather peculiar text.

I have no general conclusions to offer. This is just stuff that I've somewhat laboriously managed to dig up over the years. I've got a bunch of posts on ring-form at my blog, New Savanna, but this post lists most of the best ones along with some explanatory text: Literary Studies from a Martian Point of View: An Open Letter to Charlie Altieri. Here's a working paper that gathers some of them into a PDF: Ring Composition: Some Notes on a Particular Literary Morphology. Here's the Obama stuff: Obama’s Eulogy for Clementa Pinckney: Technics of Power and Grace.

Scott, July 18, 2016 at 2:46 PM
My concerns about this go even deeper, and I think basically invalidate the entire project. I think the results that SVD gives here strongly suggest that these sentiment patterns are just random walks, plain and simple -- Brownian noise.

The results Reagan et al. get are very strange. Entirely reshuffling a random walk gives white noise, which yields chaotic SVD eigenfunctions. They say they used random permutations to generate word salad, but their SVDs look like Brownian noise. I was only able to get white noise to look like Brownian noise with heavy filtering. (I used a few different constant averages to do that.)

If it were really white noise, the SVD eigenfunctions would also look like white noise. And if it were Brownian noise with hidden structure -- which seems to be the fundamental claim -- then we should expect the SVD to pick up on that hidden structure. I've been investigating this for the last few days; here's a notebook illustrating the results:

https://github.com/senderle/svd-noise/blob/master/Noise.ipynb

The last example shows what the SVD looks like when there's hidden regularity in the data. It's very obvious. The hidden regularity would have to be extremely subtle given the results in Reagan et al.

Scott, July 18, 2016 at 2:48 PM
Following up -- I think this reiterates your point, Ben, about smaller local regularities. But I suspect even smaller local regularities might show up in the SVD. I'm also looking into autocorrelation as a way of picking up those regularities.

Bill Benzon, July 18, 2016 at 6:15 PM
Since we're talking Fourier analysis and harmonic curves I'll offer some remarks specifically about Heart of Darkness. As you know Kurtz is a central character, but he doesn't appear until a ways into the text. So I got curious about his presence in the text. I divided the text into equal sized bins of 500 words, counted the occurrence of "Kurtz" in each bin and graphed the results. (Seems I'd done what's called a periodogram.) It turns out that “Kurtz” appears in the text of Heart of Darkness at periodic intervals. There is a short cycle of roughly 2000 words and a longer one that divides the text into four sections: an initial section with no appearances, a second section with relatively low activity, and a third section with more activity. As a check, I repeated the procedure with bins of 600 words.

I wonder what other periodicity we'd find in this text. And other texts? And if we ran Jockers' procedure on HoD?

I've written this up in a short working paper: Periodicity in Heart of Darkness.
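
For anyone who wants to try the binning procedure described above, here is a rough sketch in Python with NumPy. It is not the code behind the working paper: the function names, the tokenizer, and the toy text are hypothetical, and the Fourier power spectrum at the end is one assumed way of looking for cycles in the bin counts.

```python
import re
import numpy as np

def binned_counts(text, target, bin_size=500):
    """Count occurrences of `target` in consecutive bins of `bin_size` words."""
    # Crude tokenizer: possessives such as "kurtz's" become their own token
    # and are therefore not counted as matches.
    words = re.findall(r"[a-z']+", text.lower())
    n_bins = -(-len(words) // bin_size)  # ceiling division
    counts = np.zeros(n_bins, dtype=int)
    for i, word in enumerate(words):
        if word == target.lower():
            counts[i // bin_size] += 1
    return counts

def power_spectrum(counts):
    """Power at each frequency (in cycles per bin) of the mean-centered counts."""
    centered = counts - counts.mean()
    power = np.abs(np.fft.rfft(centered)) ** 2
    freqs = np.fft.rfftfreq(len(counts))
    return freqs, power

# Toy text: "kurtz" opens every other 10-word bin, so the cycle is 2 bins long.
toy_text = " ".join((["kurtz"] + ["river"] * 19) * 4)
counts = binned_counts(toy_text, "Kurtz", bin_size=10)
freqs, power = power_spectrum(counts)
peak_freq = freqs[np.argmax(power)]
print(counts.tolist(), "strongest cycle:", 1 / peak_freq, "bins")
```

On a real text you would pass the full novel and a bin size of 500 (then 600 as a check) and look at which frequencies stand out above the rest.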