Luckily for us, the team behind StackOverflow provides most of the data behind the StackExchange universe, to which StackOverflow belongs, under a CC BY-SA license. At the time of writing this book, the latest data dump can be found at https://archive.org/download/stackexchange. It contains data dumps of all the Q&A sites of the StackExchange family. For StackOverflow, you will find multiple files, of which we only need stackoverflow.com-Posts.7z, which is 11.3 GB.
After downloading and extracting it, we have around 59 GB of XML data, containing all questions and answers as individual row tags within the root tag posts:
<?xml version="1.0" encoding="utf-8"?>
<posts>
...
<row Id="4572748" PostTypeId="2" ParentId="4568987" CreationDate="2011-01-01T00:01:03.387" Score="4" ViewCount="" Body="&lt;p&gt;IANAL, but &lt;a href=&quot;http://support.apple.com/kb/HT2931&quot; rel=&quot;nofollow&quot;&gt;this&lt;/a&gt; indicates to me that you cannot use the loops in your application:&lt;/p&gt;&lt;blockquote&gt;&lt;p&gt;...however, individual audio loops may not be commercially or otherwise distributed on a standalone basis, nor may they be repackaged in whole or in part as audio samples, sound effects or music beds.&quot;&lt;/p&gt;&lt;p&gt;So don't worry, you can make commercial music with GarageBand, you just can't distribute the loops as loops.&lt;/p&gt; &lt;/blockquote&gt; " OwnerUserId="203568" LastActivityDate="2011-01-01T00:01:03.387" CommentCount="1" />
…
</posts>
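Because the extracted file is far too large to load into memory at once, it helps to read it as a stream. The following is a minimal sketch of one way to do that with the standard library's xml.etree.ElementTree.iterparse. The function name iter_rows and the tiny in-memory sample are made up for illustration and are not part of the dump:

```python
import io
import xml.etree.ElementTree as ET


def iter_rows(source):
    """Yield the attributes of each <row> element as a plain dict."""
    # iterparse reads the input incrementally instead of building the full tree.
    context = iter(ET.iterparse(source, events=("start", "end")))
    _, root = next(context)  # the first event is the start of the root <posts> tag
    for event, elem in context:
        if event == "end" and elem.tag == "row":
            yield dict(elem.attrib)
            root.clear()  # discard rows already handled to keep memory use low


# Tiny made-up sample standing in for the real dump file.
sample = b"""<?xml version="1.0" encoding="utf-8"?>
<posts>
<row Id="1" PostTypeId="1" Body="&lt;p&gt;A question&lt;/p&gt;" />
<row Id="2" PostTypeId="2" ParentId="1" Body="&lt;p&gt;An answer&lt;/p&gt;" />
</posts>"""

rows = list(iter_rows(io.BytesIO(sample)))
print(len(rows), rows[1]["ParentId"])
```

For the real data, the path of the extracted XML file can be passed to iter_rows in place of the BytesIO object. Note that the parser decodes the escaped HTML in Body, so the values arrive as ordinary HTML strings.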
The following table describes the attributes of a row tag:
Name | Type | Description |
Id | Integer | This is a unique identifier of the post. |
PostTypeId | Integer | This describes the category of the post. The values of interest to us are 1 (question) and 2 (answer). Other values will be ignored. |
ParentId | Integer | This is a unique identifier of the question to which this answer belongs. It is missing for questions, in which case we will set it to -1. |
CreationDate | DateTime | This is the date of submission. |
Score | Integer | This is the score of the post. |
ViewCount | Integer or empty | This is the number of user views for this post. |
Body | String | This is the complete post as encoded HTML text. |
OwnerUserId | Id | This is a unique identifier of the poster. If 1, then it is a wiki question. |
Title | String | This is the title of the question (missing for answers). |
AcceptedAnswerId | Id | This is the ID for the accepted answer (missing for answers). |
CommentCount | Integer | This is the number of comments for the post. |
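The attribute types in the table can be turned into a small conversion step. The sketch below is only one possible reading of it: the function name parse_post is hypothetical, it assumes that PostTypeId 1 marks a question and 2 an answer (the sample row above has PostTypeId 2 and a ParentId), and mapping an empty ViewCount to None or a missing OwnerUserId to -1 are choices made here, not something the dump prescribes:

```python
from datetime import datetime


def opt_int(attrib, name, default):
    """Return attrib[name] as int, or default if it is missing or empty."""
    value = attrib.get(name, "")
    return int(value) if value else default


def parse_post(attrib):
    """Convert the string attributes of one row into typed Python values."""
    post_type = int(attrib["PostTypeId"])
    if post_type not in (1, 2):  # assumption: 1 = question, 2 = answer
        return None  # all other post types are ignored
    return {
        "Id": int(attrib["Id"]),
        "PostTypeId": post_type,
        "ParentId": opt_int(attrib, "ParentId", -1),  # -1 for questions
        "CreationDate": datetime.strptime(
            attrib["CreationDate"], "%Y-%m-%dT%H:%M:%S.%f"
        ),
        "Score": int(attrib["Score"]),
        "ViewCount": opt_int(attrib, "ViewCount", None),  # may be empty
        "Body": attrib["Body"],
        "OwnerUserId": opt_int(attrib, "OwnerUserId", -1),  # assumed default
        "Title": attrib.get("Title"),  # None for answers
        "AcceptedAnswerId": opt_int(attrib, "AcceptedAnswerId", None),
        "CommentCount": opt_int(attrib, "CommentCount", 0),
    }


# Attributes of the sample answer shown earlier, with the body shortened.
answer = parse_post({
    "Id": "4572748", "PostTypeId": "2", "ParentId": "4568987",
    "CreationDate": "2011-01-01T00:01:03.387", "Score": "4",
    "ViewCount": "", "Body": "<p>IANAL, but ...</p>",
    "OwnerUserId": "203568", "CommentCount": "1",
})
print(answer["ParentId"], answer["ViewCount"], answer["CreationDate"])
```

Returning None for unwanted post types lets the caller filter rows in a single pass while streaming through the file.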