# reddit-short-stories

A small unlabelled dataset of 4,308 short stories (4 million words) scraped from https://reddit.com/r/WritingPrompts for your machine learning needs.

Scraped and formatted by Trevor Du

## Dataset description

* Each line of [reddit_short_stories.txt](https://github.com/tdude92/reddit-short-stories/blob/main/reddit_short_stories.txt) is one full short story.
* Each short story begins with an "\" token and ends with an "\" token (e.g. "\ once upon a time, the end \").
* Newline characters in a story are replaced with the "\" token (e.g. "\ line 1 \ line 2 \").

## Data Collection Method

r/WritingPrompts is a forum on the popular discussion website https://reddit.com. By tradition, users start threads titled with a *Writing Prompt*, and other users reply in those threads with short stories they have written based on the prompt.

The scraper saved a comment on an r/WritingPrompts post if all of the following conditions were satisfied:

* The post is flaired "Writing Prompt".
* The post has >=1.0k upvotes.
* The author of the comment is not a moderator of r/WritingPrompts (to avoid scraping automod posts and mod announcements).
* The comment has >=200 upvotes.
* The comment has >=200 words.
* <20 comments have already been scraped from the comment's parent post.

Note: Only a portion of r/WritingPrompts was scraped, not the entire subreddit. I hope to scrape more of r/WritingPrompts and other subreddits in the future.
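
The filter conditions listed under Data Collection Method can be read as a single predicate. The sketch below is only an illustration and is not the scraper's actual code: the `Post` and `Comment` classes, the `should_save` function, and its parameters are hypothetical names. It also assumes that "1.0k" means 1,000 upvotes and that words are counted by splitting on whitespace.

```python
from dataclasses import dataclass


@dataclass
class Post:
    """Hypothetical stand-in for an r/WritingPrompts post."""
    flair: str
    upvotes: int


@dataclass
class Comment:
    """Hypothetical stand-in for a comment on a post."""
    author: str
    upvotes: int
    body: str


def should_save(comment, post, moderators, scraped_from_post):
    """Return True if the comment meets every condition listed in this README.

    moderators: set of moderator usernames (assumed to be known up front).
    scraped_from_post: number of comments already saved from this post.
    """
    return (
        post.flair == "Writing Prompt"
        and post.upvotes >= 1000                  # assumes ">=1.0k" means 1,000
        and comment.author not in moderators      # skips automod and mod posts
        and comment.upvotes >= 200
        and len(comment.body.split()) >= 200      # assumes whitespace word count
        and scraped_from_post < 20
    )
```

For example, a 200-word comment with 250 upvotes on a 1,500-upvote "Writing Prompt" post passes the check, and it stops passing once 20 comments have already been saved from that post.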