JUNK {regclass} | R Documentation
Junk-mail dataset
Description
Email data for building a junk mail classifier based on word and character frequencies.
Usage
data("JUNK")
Format
A data frame with 4601 observations on the following 58 variables.
Junk
a factor with levels Junk and Safe
make
a numeric vector, the percentage (0-100) of words in the email that are the word "make"
address
a numeric vector
all
a numeric vector
X3d
a numeric vector, the percentage (0-100) of words in the email that are the word "3d"
our
a numeric vector
over
a numeric vector
remove
a numeric vector
internet
a numeric vector
order
a numeric vector
mail
a numeric vector
receive
a numeric vector
will
a numeric vector
people
a numeric vector
report
a numeric vector
addresses
a numeric vector
free
a numeric vector
business
a numeric vector
email
a numeric vector
you
a numeric vector
credit
a numeric vector
your
a numeric vector
font
a numeric vector
X000
a numeric vector, the percentage (0-100) of words in the email that are the word "000"
money
a numeric vector
hp
a numeric vector
hpl
a numeric vector
george
a numeric vector
X650
a numeric vector
lab
a numeric vector
labs
a numeric vector
telnet
a numeric vector
X857
a numeric vector
data
a numeric vector
X415
a numeric vector
X85
a numeric vector
technology
a numeric vector
X1999
a numeric vector
parts
a numeric vector
pm
a numeric vector
direct
a numeric vector
cs
a numeric vector
meeting
a numeric vector
original
a numeric vector
project
a numeric vector
re
a numeric vector
edu
a numeric vector
table
a numeric vector
conference
a numeric vector
semicolon
a numeric vector, the percentage (0-100) of characters in the email that are semicolons
parenthesis
a numeric vector
bracket
a numeric vector
exclamation
a numeric vector
dollarsign
a numeric vector
hashtag
a numeric vector
capital_run_length_average
a numeric vector, average length of uninterrupted sequences of capital letters
capital_run_length_longest
a numeric vector, length of the longest uninterrupted sequence of capital letters
capital_run_length_total
a numeric vector, total number of capital letters in the email
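The variable definitions above can be illustrated on a toy message. The following is a minimal sketch, not the procedure used to build the dataset: the helper names word_percent, char_percent, and capital_runs are hypothetical, and the tokenization rule (a word is a maximal run of letters or digits, matched case-insensitively) is an assumption.

```r
# Hypothetical helper: percentage (0-100) of words in `text` equal to `word`.
# Assumption: a "word" is a maximal run of letters or digits,
# and matching ignores case.
word_percent <- function(text, word) {
  tokens <- regmatches(text, gregexpr("[[:alnum:]]+", text))[[1]]
  if (length(tokens) == 0) return(0)
  100 * mean(tolower(tokens) == tolower(word))
}

# Hypothetical helper: percentage (0-100) of characters equal to `char`.
char_percent <- function(text, char) {
  chars <- strsplit(text, "")[[1]]
  if (length(chars) == 0) return(0)
  100 * mean(chars == char)
}

# Hypothetical helper: lengths of uninterrupted runs of capital letters.
capital_runs <- function(text) {
  runs <- regmatches(text, gregexpr("[[:upper:]]+", text))[[1]]
  nchar(runs)
}

msg <- "FREE money!! Order NOW and receive your free gift"

free_pct <- word_percent(msg, "free")   # 2 of the 9 words
bang_pct <- char_percent(msg, "!")      # 2 of the 49 characters
runs     <- capital_runs(msg)           # runs: FREE, O, NOW

mean(runs)  # analogue of capital_run_length_average
max(runs)   # analogue of capital_run_length_longest
sum(runs)   # analogue of capital_run_length_total
```

Because the percentages are taken over all words (or characters) in a message, a single occurrence in a long email yields a value well below 1.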
Details
The junk emails came from the postmaster and from individuals who classified the email as junk. The safe emails came from work and personal correspondence. Note that most of the variables are percentages and can range from 0 to 100, though most values are much less than 1 (1%).
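One way to check the scale of the percentage variables is sketched below. It assumes the regclass package is installed; the helper share_below_one is hypothetical and not part of the package. The three capital_run_length_* columns are excluded because they are lengths rather than percentages.

```r
# Hypothetical helper: share of values below 1 among the numeric columns
# of `df`, ignoring the columns named in `exclude`.
share_below_one <- function(df, exclude = character(0)) {
  numeric_cols <- names(df)[vapply(df, is.numeric, logical(1))]
  pct_cols <- setdiff(numeric_cols, exclude)
  mean(as.matrix(df[, pct_cols, drop = FALSE]) < 1)
}

# Columns that hold run lengths, not percentages.
run_length_cols <- c("capital_run_length_average",
                     "capital_run_length_longest",
                     "capital_run_length_total")

# Only runs when regclass is available (assumption: installed from CRAN).
if (requireNamespace("regclass", quietly = TRUE)) {
  data("JUNK", package = "regclass")
  print(dim(JUNK))             # observations and variables
  print(table(JUNK$Junk))      # class balance: Junk vs Safe
  print(share_below_one(JUNK, exclude = run_length_cols))
}
```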
Source
Adapted from the Spambase Data Set at the UCI data repository https://archive.ics.uci.edu/ml/datasets/Spambase. Creators: Mark Hopkins, Erik Reeber, George Forman, Jaap Suermondt; Hewlett-Packard Labs, 1501 Page Mill Rd., Palo Alto, CA 94304. Donor: George Forman (gforman at nospam hpl.hp.com)
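Examples
The sketch below shows one possible way to build a junk mail classifier from these variables, using logistic regression from base R. It is an illustration under stated assumptions, not the approach prescribed by the package: the helpers fit_junk_model and junk_accuracy are hypothetical, and the choice of predictors, the 70/30 split, and the 0.5 cutoff are arbitrary.

```r
# Hypothetical helper: logistic regression for P(Junk) on chosen predictors.
fit_junk_model <- function(df, predictors) {
  df$is_junk <- as.integer(df$Junk == "Junk")   # 1 = Junk, 0 = Safe
  glm(reformulate(predictors, response = "is_junk"),
      data = df, family = binomial)
}

# Hypothetical helper: share of emails classified correctly at a 0.5 cutoff.
junk_accuracy <- function(model, df) {
  prob <- predict(model, newdata = df, type = "response")
  predicted <- ifelse(prob > 0.5, "Junk", "Safe")
  mean(predicted == df$Junk)
}

# Only runs when regclass is available (assumption: installed from CRAN).
if (requireNamespace("regclass", quietly = TRUE)) {
  data("JUNK", package = "regclass")
  set.seed(1)
  train_rows <- sample(nrow(JUNK), size = round(0.7 * nrow(JUNK)))
  model <- fit_junk_model(
    JUNK[train_rows, ],
    c("free", "money", "remove", "exclamation", "dollarsign")
  )
  print(junk_accuracy(model, JUNK[-train_rows, ]))  # holdout accuracy
}
```

Evaluating on rows held out from fitting gives a less optimistic estimate of accuracy than scoring the training rows.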