A Story Generation and Evaluation Platform



Systems for story generation are asked to produce plausible and enjoyable stories given an input context. This task is underspecified, as a vast number of diverse stories can originate from a single input. The large output space makes it difficult to build and evaluate story generation models, as (1) existing datasets lack rich enough contexts to meaningfully guide models, and (2) existing evaluations (both crowdsourced and automatic) are unreliable for assessing long-form creative text. To address these issues, we introduce a dataset and evaluation platform built with STORIUM, an online collaborative storytelling community. Our author-generated dataset contains 6K lengthy stories (125M tokens) with fine-grained natural language annotations, in the form of cards, interspersed throughout each narrative, forming a robust source for guiding models. Our evaluation platform is integrated directly with STORIUM, where real authors can query a model for suggested story continuations and then edit them. We provide a leaderboard with automatic metrics computed over these edits, which correlate well with both user ratings of generated stories and qualitative feedback from semi-structured user interviews. We release both the dataset and evaluation platform to spur more principled research into story generation.

A high-level outline of our dataset and platform. In this example from a real STORIUM game, the character ADIRA MAKAROVA uses the strength card DEADLY AIM to DISRUPT THE GERMANS, a challenge card. Our model conditions on the natural language annotations in the scene intro, challenge card, strength card, and character, along with the text of the previous scene entry (not shown) to generate a suggested story continuation. Players may then edit the model output, by adding or deleting text, before publishing the entry. We collect these edits, using the matched text as the basis of our USER metric. New models can be added to the platform by simply implementing four methods: startup, shutdown, preprocess, and generate.


If you use our dataset or evaluation platform, please cite:
  Author = {Nader Akoury, Shufan Wang, Josh Whiting, Stephen Hood, Nanyun Peng and Mohit Iyyer},
  Booktitle = {Empirical Methods for Natural Language Processing,
  Year = "2020",
  Title = {{STORIUM}: {A} {D}ataset and {E}valuation {P}latform for {S}tory {G}eneration}
Read the paper


If you have any questions or comments about this work, please visit my website which has my contact information, CV, and an up-to-date listing of my publications.