The data we use to build the spam-detection model comes from the SMS Spam Collection at http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/, which contains 747 spam messages and 4,827 non-spam (ham) messages.
These messages were collected from different sources and labeled as either spam or non-spam. If you open the downloaded file in Notepad or any other text editor, you will see the following format:
ham What you doing?how are you?
ham Ok lar... Joking wif u oni...
ham dun say so early hor... U c already then say...
ham MY NO. IN LUTON 0125698789 RING ME IF UR AROUND! H*
ham Siva is in hostel aha:-.
ham Cos i was out shopping with darren jus now n i called him 2 ask wat present he wan lor. Then he started guessing who i was wif n he finally guessed darren lor.
spam FreeMsg: Txt: CALL to No: 86888 & claim your reward of 3 hours talk time to use from your phone now! ubscribe6GBP/ mnth inc 3hrs 16 stop?txtStop
spam Sunshine Quiz! Win a super Sony DVD recorder if you can name the capital of Australia? Text MQUIZ to 82277. B
spam URGENT! Your Mobile No 07808726822 was awarded a L2,000 Bonus Caller Prize on 02/09/03! This is our 2nd attempt to contact YOU! Call 0871-872-9758 BOX95QU
In the preceding sample, every line starts with the category name (ham for non-spam, spam for spam), followed by the actual message.
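To make the line format concrete, here is a minimal Python sketch that splits each line into its category and message. It is an illustration only, not the loading code used elsewhere in this text: the function name `parse_line` and the inline `sample` list are hypothetical, and the sketch assumes the label is separated from the message by whitespace (a tab or a space).

```python
from collections import Counter

VALID_LABELS = ("ham", "spam")

def parse_line(line):
    """Split one line into (label, message) at the first run of whitespace.

    Hypothetical helper: assumes the label is the first token on the line.
    """
    label, message = line.rstrip("\n").split(None, 1)
    if label not in VALID_LABELS:
        raise ValueError("unexpected label: " + repr(label))
    return label, message

# A few lines copied from the sample above (the separator is assumed).
sample = [
    "ham\tOk lar... Joking wif u oni...",
    "ham Siva is in hostel aha:-.",
    "spam\tSunshine Quiz! Win a super Sony DVD recorder if you can name "
    "the capital of Australia? Text MQUIZ to 82277. B",
]

records = [parse_line(line) for line in sample]

# Count how many messages fall into each category.
counts = Counter(label for label, _ in records)
print(counts)
```

Splitting only once keeps any later whitespace inside the message intact, which matters because the message text itself contains spaces.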