Fetching the data

Luckily for us, the team behind StackOverflow provides most of the data behind the StackExchange universe, to which StackOverflow belongs, under a CC BY-SA license. At the time of writing this book, the latest data dump can be found at https://archive.org/download/stackexchange. It contains data dumps of all the Q&A sites of the StackExchange family. For StackOverflow, you will find multiple files, of which we only need stackoverflow.com-Posts.7z, which is 11.3 GB.

After downloading and extracting it, we have around 59 GB of XML data, containing all questions and answers as individual row tags within the root tag posts:

<?xml version="1.0" encoding="utf-8"?>
<posts>
...
<row Id="4572748" PostTypeId="2" ParentId="4568987" CreationDate="2011-01-01T00:01:03.387" Score="4" ViewCount="" Body="&lt;p&gt;IANAL, but &lt;a href=&quot;http://support.apple.com/kb/HT2931&quot; rel=&quot;nofollow&quot;&gt;this&lt;/a&gt; indicates to me that you cannot use the loops in your application:&lt;/p&gt;&lt;blockquote&gt;&lt;p&gt;...however, individual audio loops may not be commercially or otherwise distributed on a standalone basis, nor may they be repackaged in whole or in part as audio samples, sound effects or music beds.&quot;&lt;/p&gt;&lt;p&gt;So don't worry, you can make commercial music with GarageBand, you just can't distribute the loops as loops.&lt;/p&gt; &lt;/blockquote&gt; " OwnerUserId="203568" LastActivityDate="2011-01-01T00:01:03.387" CommentCount="1" />

</posts>
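At roughly 59 GB, the file is too large to load into memory in one go, so it makes sense to read it row by row. The following is a minimal sketch of one way to do that with the standard library's xml.etree.ElementTree.iterparse. The function name iter_rows and the two-row sample are made up for illustration. In practice, you would pass the path of the extracted XML file instead of the in-memory sample.

```python
import io
import xml.etree.ElementTree as ET


def iter_rows(source):
    """Yield the attributes of each <row> element as a dict of strings."""
    # iterparse reads the input incrementally instead of building the full tree
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root tag <posts>
    for event, elem in context:
        if event == "end" and elem.tag == "row":
            yield dict(elem.attrib)
            root.clear()  # discard rows that are already processed


# Tiny made-up sample in the same layout as the dump (not real data)
sample = b"""<?xml version="1.0" encoding="utf-8"?>
<posts>
<row Id="1" PostTypeId="1" Score="3" ViewCount="10" Title="A question" Body="&lt;p&gt;Hi&lt;/p&gt;" />
<row Id="2" PostTypeId="2" ParentId="1" Score="4" ViewCount="" Body="&lt;p&gt;An answer&lt;/p&gt;" />
</posts>
"""

rows = list(iter_rows(io.BytesIO(sample)))
print(rows[1]["Body"])
```

Note that the parser already decodes the escaped HTML in Body, so the attribute value arrives as ordinary markup such as <p>An answer</p>. All values are still strings at this point, including the empty ViewCount.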

The following table describes the attributes of each row:

| Name | Type | Description |
| --- | --- | --- |
| Id | Integer | This is a unique identifier of the post. |
| PostTypeId | Integer | This describes the category of the post. The values of interest to us are 1 (question) and 2 (answer). Other values will be ignored. |
| ParentId | Integer | This is a unique identifier of the question to which this answer belongs. It is missing for questions, in which case we will set it to -1. |
| CreationDate | DateTime | This is the date of submission. |
| Score | Integer | This is the score of the post. |
| ViewCount | Integer or empty | This is the number of user views for this post. |
| Body | String | This is the complete post as encoded HTML text. |
| OwnerUserId | Id | This is a unique identifier of the poster. If 1, then it is a wiki question. |
| Title | String | This is the title of the question (missing for answers). |
| AcceptedAnswerId | Id | This is the ID for the accepted answer (missing for answers). |
| CommentCount | Integer | This is the number of comments for the post. |

Normally, we try to stick to the Python style guides for variable naming. In this chapter, however, we will use the names from the XML format so that they are easier to follow. For example, we will have ParentId instead of parent_id.
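To make the attribute descriptions concrete, here is a small sketch of how the raw string attributes of one row could be turned into typed values while keeping the XML names as keys. The helper name parse_row is hypothetical. Representing an empty ViewCount or a missing Title as None is an assumption made for this example, as is the date pattern, which is inferred from the sample row shown earlier. Only the -1 for a missing ParentId comes from the description above.

```python
from datetime import datetime

QUESTION, ANSWER = 1, 2


def parse_row(attrib):
    """Convert one row's string attributes into typed values.

    Returns None for posts that are neither questions nor answers.
    """
    post_type = int(attrib["PostTypeId"])
    if post_type not in (QUESTION, ANSWER):
        return None  # other post types are ignored

    view_count = attrib.get("ViewCount", "")
    return {
        "Id": int(attrib["Id"]),
        "PostTypeId": post_type,
        # Questions have no ParentId, so we fall back to -1
        "ParentId": int(attrib.get("ParentId", -1)),
        # Pattern assumed from the sample row, e.g. 2011-01-01T00:01:03.387
        "CreationDate": datetime.strptime(
            attrib["CreationDate"], "%Y-%m-%dT%H:%M:%S.%f"
        ),
        "Score": int(attrib["Score"]),
        # ViewCount may be empty; None is our own choice for that case
        "ViewCount": int(view_count) if view_count else None,
        "Body": attrib["Body"],
        "Title": attrib.get("Title"),  # missing for answers
        "CommentCount": int(attrib["CommentCount"]),
    }


# Attributes taken from the sample row shown earlier (Body shortened)
answer = {
    "Id": "4572748", "PostTypeId": "2", "ParentId": "4568987",
    "CreationDate": "2011-01-01T00:01:03.387", "Score": "4",
    "ViewCount": "", "Body": "<p>IANAL, but ...</p>", "CommentCount": "1",
}
post = parse_row(answer)
print(post["ParentId"], post["ViewCount"], post["CreationDate"].year)
```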