(Below is the original Reddit comment announcing this data collection and describing the process used to build it.)
This is an archive of Reddit comments from October 2007 through May 2015 (May is a complete month). This reflects 14 months of work and a lot of API calls. This dataset includes nearly every publicly available Reddit comment. Approximately 350,000 comments out of ~1.65 billion were unavailable due to Reddit API issues.
Q: How are the files structured?
Each file is compressed with bzip2 compression. When uncompressed, each file is a series of JSON blocks delimited by new lines (\n). The name of each file follows the format RC_yyyy-mm.bz2 where yyyy is the year and mm is the month. RC stands for "reddit comments."
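As a rough sketch of how a file in this layout could be read, the Python below streams one monthly archive line by line without decompressing it to disk first. The helper names archive_name and iter_comments are made up for illustration, and the UTF-8 encoding is an assumption rather than something stated above.

```python
import bz2
import json


def archive_name(year, month):
    """Build a file name in the RC_yyyy-mm.bz2 format described above."""
    return f"RC_{year:04d}-{month:02d}.bz2"


def iter_comments(path):
    """Yield one dict per non-empty line of a bzip2-compressed,
    newline-delimited JSON file. UTF-8 is assumed here."""
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines, e.g. a trailing newline
                yield json.loads(line)


# Example usage (assumes the file is in the current directory):
#   for comment in iter_comments(archive_name(2007, 10)):
#       print(comment)
```

Reading lazily like this keeps memory use flat, which matters because the uncompressed monthly files can be very large.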
Q: What does Reddit use for comment ids?
Comment ids are in base 36. If a comment id starts with t1_, simply remove that prefix and convert the rest to base 10 to get an integer representation of the comment id. Most comments should be in sequential order.
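The conversion could look something like the following Python sketch. The function names are hypothetical, and the lowercase digit alphabet used for the reverse direction is an assumption about how the ids are written.

```python
BASE36_DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"  # assumed lowercase


def comment_id_to_int(comment_id):
    """Strip an optional t1_ prefix and parse the rest as base 36."""
    if comment_id.startswith("t1_"):
        comment_id = comment_id[len("t1_"):]
    return int(comment_id, 36)


def int_to_comment_id(number):
    """Convert a non-negative integer back to a base-36 id (no prefix)."""
    if number < 0:
        raise ValueError("comment ids cannot be negative")
    if number == 0:
        return "0"
    digits = []
    while number:
        number, remainder = divmod(number, 36)
        digits.append(BASE36_DIGITS[remainder])
    return "".join(reversed(digits))
```

With integer ids, sorting comments or checking whether a run of ids is sequential becomes ordinary integer arithmetic.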
Q: I noticed 1-5 comments are missing on average for each 100 sequential ids -- why?
Those comments were either removed, private or unavailable from the API. 99% of the time, it’s due to the comment being posted in a private subreddit.
Q: I’m doing analysis on scores and need to know when the comments were fetched.
Most of the JSON blocks should have a “retrieved_on” key that I added to reflect when that particular comment was pulled from the Reddit API. There is also an “archived” key in each JSON block that tells you whether that comment has been archived (meaning that people can no longer vote on or reply to it).
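For score analysis, one possible way to use these two keys is sketched below in Python. It assumes that "retrieved_on" holds epoch seconds and that each comment also carries a "created_utc" key with the posting time in epoch seconds (possibly as a string); neither detail is stated above, so check your data first. The function names are hypothetical.

```python
def seconds_until_fetch(comment):
    """Return how long after posting the comment was fetched, in seconds,
    or None when either timestamp is missing.

    Assumes "retrieved_on" and "created_utc" are epoch seconds; the
    "created_utc" key is an assumption not described in this post.
    """
    retrieved = comment.get("retrieved_on")  # only most comments have it
    created = comment.get("created_utc")
    if retrieved is None or created is None:
        return None
    return int(retrieved) - int(created)


def is_archived(comment):
    """True when the comment is archived, so voting can no longer
    change its score."""
    return bool(comment.get("archived", False))
```

A score taken from an archived comment can be treated as settled, while a score from a comment fetched shortly after posting may still have been changing.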
Q: I have additional questions and would like to contact you. What is your contact information?
No problem. My e-mail is jason@pushshift.io