Spam {msos} | R Documentation
Spam
Description
In the Hewlett-Packard spam data, a set of n = 4601 emails was classified according to whether each email was spam, where "0" means not spam and "1" means spam. Fifty-seven explanatory variables based on the content of the emails were recorded, including various word and symbol frequencies. The emails were sent to George Forman (not the boxer) at Hewlett-Packard Labs, so emails containing the words "George" or "hp" are likely to be non-spam, while "credit" or "!" suggest spam. The data were collected by Hopkins et al. [1999] and are in the data matrix Spam. (They are also in the R data frame spam from the ElemStatLearn package [Halvorsen, 2009], as well as at the UCI Machine Learning Repository [Frank and Asuncion, 2010].)
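As a rough illustration of the claim that "George" and "hp" point toward non-spam while "credit" and "!" point toward spam, the sketch below compares column means between the two classes. The helper mean_by_class is a hypothetical name and is not part of msos, and the toy matrix holds made-up values used only to show the shape of the result. The guarded block at the end assumes that msos is installed, that data("Spam", package = "msos") loads the matrix, and that its columns carry the variable names listed under Format.

```r
# Hypothetical helper (not part of msos): mean of the selected columns,
# computed separately for non-spam (0) and spam (1) rows.
mean_by_class <- function(x, cols, class_col = "spam") {
  cls <- x[, class_col]
  # One column of class means per requested variable.
  sapply(cols, function(v) tapply(x[, v], cls, mean))
}

# Toy matrix with made-up values, only to show the shape of the result.
toy <- cbind(
  WFgeorge = c(2.0, 1.0, 0.0, 0.0),
  CFexclam = c(0.0, 0.1, 0.5, 0.7),
  spam     = c(0, 0, 1, 1)
)
toy_means <- mean_by_class(toy, c("WFgeorge", "CFexclam"))
print(toy_means)  # rows "0" and "1", one column per variable

# With the real data (assumption: the msos package is installed).
if (requireNamespace("msos", quietly = TRUE)) {
  data("Spam", package = "msos")
  print(mean_by_class(Spam, c("WFgeorge", "WFhp", "WFcredit", "CFexclam")))
}
```

A comparison of class means is only a first look. It says nothing about how well any single variable separates the classes.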
Usage
Spam
Format
A double matrix with 4601 observations on the following 58 variables.
- WFmake
Percentage of words in the e-mail that match make.
- WFaddress
Percentage of words in the e-mail that match address.
- WFall
Percentage of words in the e-mail that match all.
- WF3d
Percentage of words in the e-mail that match 3d.
- WFour
Percentage of words in the e-mail that match our.
- WFover
Percentage of words in the e-mail that match over.
- WFremove
Percentage of words in the e-mail that match remove.
- WFinternet
Percentage of words in the e-mail that match internet.
- WForder
Percentage of words in the e-mail that match order.
- WFmail
Percentage of words in the e-mail that match mail.
- WFreceive
Percentage of words in the e-mail that match receive.
- WFwill
Percentage of words in the e-mail that match will.
- WFpeople
Percentage of words in the e-mail that match people.
- WFreport
Percentage of words in the e-mail that match report.
- WFaddresses
Percentage of words in the e-mail that match addresses.
- WFfree
Percentage of words in the e-mail that match free.
- WFbusiness
Percentage of words in the e-mail that match business.
- WFemail
Percentage of words in the e-mail that match email.
- WFyou
Percentage of words in the e-mail that match you.
- WFcredit
Percentage of words in the e-mail that match credit.
- WFyour
Percentage of words in the e-mail that match your.
- WFfont
Percentage of words in the e-mail that match font.
- WF000
Percentage of words in the e-mail that match 000.
- WFmoney
Percentage of words in the e-mail that match money.
- WFhp
Percentage of words in the e-mail that match hp.
- WFgeorge
Percentage of words in the e-mail that match george.
- WF650
Percentage of words in the e-mail that match 650.
- WFlab
Percentage of words in the e-mail that match lab.
- WFlabs
Percentage of words in the e-mail that match labs.
- WFtelnet
Percentage of words in the e-mail that match telnet.
- WF857
Percentage of words in the e-mail that match 857.
- WFdata
Percentage of words in the e-mail that match data.
- WF415
Percentage of words in the e-mail that match 415.
- WF85
Percentage of words in the e-mail that match 85.
- WFtechnology
Percentage of words in the e-mail that match technology.
- WF1999
Percentage of words in the e-mail that match 1999.
- WFparts
Percentage of words in the e-mail that match parts.
- WFpm
Percentage of words in the e-mail that match pm.
- WFdirect
Percentage of words in the e-mail that match direct.
- WFcs
Percentage of words in the e-mail that match cs.
- WFmeeting
Percentage of words in the e-mail that match meeting.
- WForiginal
Percentage of words in the e-mail that match original.
- WFproject
Percentage of words in the e-mail that match project.
- WFre
Percentage of words in the e-mail that match re.
- WFedu
Percentage of words in the e-mail that match edu.
- WFtable
Percentage of words in the e-mail that match table.
- WFconference
Percentage of words in the e-mail that match conference.
- CFsemicolon
Percentage of characters in the e-mail that match the semicolon, ";".
- CFparen
Percentage of characters in the e-mail that match the parenthesis, "(".
- CFbracket
Percentage of characters in the e-mail that match the bracket, "[".
- CFexclam
Percentage of characters in the e-mail that match the exclamation mark, "!".
- CFdollar
Percentage of characters in the e-mail that match the dollar sign, "$".
- CFpound
Percentage of characters in the e-mail that match the pound sign, "#".
- CRLaverage
Average length of uninterrupted sequences of capital letters.
- CRLlongest
Length of longest uninterrupted sequence of capital letters.
- CRLtotal
Total number of capital letters in the e-mail.
- spam
Denotes whether the e-mail was considered spam, i.e. unsolicited commercial e-mail (1), or not (0).
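The three kinds of explanatory variable above (word frequencies WF*, character frequencies CF*, and capital run lengths CRL*) can be sketched in base R on a single toy message. This is a minimal sketch and not the procedure used by Hopkins et al.: the helper names word_freq, char_freq and capital_runs are hypothetical, the message is made up, and the tokenization rule (words are runs of alphanumeric characters, matched case-insensitively) is an assumption that this page does not spell out.

```r
# Hypothetical helpers; the tokenization rule is an assumption.
word_freq <- function(text, word) {
  # Split on runs of non-alphanumeric characters and drop empty tokens.
  words <- strsplit(tolower(text), "[^[:alnum:]]+")[[1]]
  words <- words[nzchar(words)]
  100 * sum(words == tolower(word)) / length(words)
}

char_freq <- function(text, char) {
  # Percentage of all characters (spaces included) equal to `char`.
  chars <- strsplit(text, "")[[1]]
  100 * sum(chars == char) / length(chars)
}

capital_runs <- function(text) {
  # Lengths of all maximal runs of consecutive capital letters.
  runs <- regmatches(text, gregexpr("[[:upper:]]+", text))[[1]]
  len <- nchar(runs)
  if (length(len) == 0) {
    return(c(CRLaverage = 0, CRLlongest = 0, CRLtotal = 0))
  }
  c(CRLaverage = mean(len), CRLlongest = max(len), CRLtotal = sum(len))
}

# Made-up message: 10 words, two of which are "free".
msg <- "FREE money!! Order now and receive your free credit report."

print(word_freq(msg, "free"))   # 20
print(char_freq(msg, "!"))      # share of "!" among all characters
print(capital_runs(msg))        # runs are "FREE" and "O"
```

Whether spaces count toward the character total and whether single capitals count as runs are design choices made here for simplicity. The values stored in Spam may follow different conventions.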
Source
Mark Hopkins, Erik Reeber, George Forman, and Jaap Suermondt. Spam data. Hewlett-Packard Labs, 1501 Page Mill Rd., Palo Alto, CA 94304, 1999.