We will need to train many variants before we arrive at the final classifier. With the data in its current form, the following will slow us down considerably:
- The posts store attributes that we might not need.
- The data is stored as XML, which is not the fastest format to parse.
- The dump contains posts dating back to 2011. Even if we restrict it to the year 2017, we still end up with over 6,000,000 posts, which should be enough.
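
One way to address all three points in a single pass is sketched below. It is only an illustration under several assumptions: that each post is a `<row>` element, that the creation date is held in a `CreationDate` attribute starting with the year, and that `Id`, `Score`, and `Body` are the attributes worth keeping. The function name `filter_posts` and the tab-separated output format are hypothetical choices, not something the dump prescribes.

```python
import csv
import xml.etree.ElementTree as ET

# Assumed attribute names; adjust them to whatever the dump actually provides.
KEEP_ATTRIBUTES = ("Id", "CreationDate", "Score", "Body")


def filter_posts(xml_source, tsv_target, year="2017", keep=KEEP_ATTRIBUTES):
    """Stream <row> elements, keep posts from `year`, write chosen attributes as TSV.

    `xml_source` is a filename or binary file object, `tsv_target` a text
    file object. Returns the number of rows written.
    """
    writer = csv.writer(tsv_target, delimiter="\t", lineterminator="\n")
    written = 0
    # iterparse hands over each element as soon as it is complete, so the
    # file is processed incrementally instead of being loaded at once.
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag == "row":
            # Assumes the date string starts with the four-digit year.
            if elem.get("CreationDate", "").startswith(year):
                writer.writerow([elem.get(name, "") for name in keep])
                written += 1
            # Release the attributes of the processed element.
            elem.clear()
    return written
```

Running this once and training on the resulting file means the XML parsing cost is paid a single time rather than for every classifier variant. The `csv` module quotes fields that contain tabs or newlines, so post bodies should survive the round trip when read back with `csv.reader` using the same delimiter.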