Slimming the data down to chewable chunks

We will need to train many variants before we arrive at the final classifier. With the data in its current form, the following will slow us down considerably:

  • Each post stores attributes that we might not need.
  • The data is stored as XML, which is not the fastest format to parse.
  • The dump contains posts that date back to 2011. Restricting to just the year 2017, we will still end up with over 6,000,000 posts, which should be enough.
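The three points above suggest a single slimming pass: stream through the XML, keep only posts from 2017, and retain only the attributes we care about. The following is a minimal sketch of that idea, not the actual procedure used for the dump. It assumes that each post is a `row` element and that the attributes are named `Id`, `CreationDate`, and `Score`; these names, the `slim_posts` helper, and the in-memory `SAMPLE` are illustrative assumptions.

```python
import io
import xml.etree.ElementTree as ET


def slim_posts(source, year="2017", keep=("Id", "CreationDate", "Score")):
    """Yield one small dict per <row> element created in `year`.

    The element name and attribute names are assumptions about the
    dump layout; adjust them to match the real file.
    """
    # iterparse streams the file instead of loading it all into memory
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "row" and elem.get("CreationDate", "").startswith(year):
            # keep only the attributes we asked for
            yield {name: elem.get(name) for name in keep}
        # drop the element's attributes and text once we are done with it
        elem.clear()


# Tiny made-up sample standing in for the real dump
SAMPLE = b"""<posts>
  <row Id="1" CreationDate="2011-05-01T10:00:00" Score="3" Body="old" />
  <row Id="2" CreationDate="2017-02-03T12:30:00" Score="7" Body="new" />
</posts>"""

slimmed = list(slim_posts(io.BytesIO(SAMPLE)))
print(slimmed)
```

In practice, the yielded dictionaries could be written out in a format that is faster to read back than XML, such as tab-separated text, so that each training run does not have to parse the original dump again.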