people {usa} | R Documentation |
Synthetic Sample of US population
Description
A statistically representative synthetic sample of 20,000 Americans. Each record is a simulated survey respondent.
Usage
people
Format
A tibble with 20,000 rows and 40 variables:
- id
Sequential unique ID
- fname
Random first name, see details
- lname
Random last name, see details
- gender
Biological sex
- age
Age capped at 85
- race
Race and Ethnicity
- edu
Educational attainment
- div
Census regional division
- married
Marital status
- house_size
Household size
- children
Has children
- us_citizen
Is a US citizen
- us_born
Was born in the US
- house_income
Family income
- emp_status
Employment status
- emp_sector
Employment sector
- hours_work
Hours worked per week
- hours_vary
Hours vary week to week
- mil
Has served in the military
- house_own
Home ownership
- metro
Lives in metropolitan area
- internet
Household has internet access
- foodstamp
Receives food stamps
- house_moved
Moved in the last year
- pub_contact
Contacted or visited a public official
- boycott
Participated in a boycott
- hood_group
Participated in a community association
- hood_talks
Talked with neighbors
- hood_trust
Trusts neighbors
- tablet
Uses a tablet or e-reader
- texting
Uses text messaging
- social
Uses social media
- volunteer
Volunteered
- register
Is registered to vote
- vote
Voted in the 2014 midterm elections
- party
Political party
- religion
Religious (evangelical) affiliation
- ideology
Political ideology
- govt
Follows government and public affairs
- guns
Owns a gun
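A minimal sketch of loading and inspecting the data. It assumes the usa package named in this page's header is installed, and it does nothing if the package is not available. The columns used (party, age) are taken from the variable list above.

```r
# Assumption: the usa package is installed; skip quietly otherwise.
if (requireNamespace("usa", quietly = TRUE)) {
  people <- usa::people

  # Number of rows and columns in the tibble
  print(dim(people))

  # Tabulate one categorical variable, keeping any missing values visible
  print(table(people$party, useNA = "ifany"))

  # Summarise a numeric variable
  print(summary(people$age))
}
```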
Details
This dataset was originally produced by the Pew Research Center for their report entitled “For Weighting Online Opt-In Samples, What Matters Most?” The synthetic population dataset was created to serve as a reference for making online opt-in surveys more representative of the overall population.
See Appendix B: Synthetic population dataset in that report for a more detailed description of the method and the rationale behind creating this dataset.
In short, the dataset was created to overcome the limitations of using large federal benchmark survey datasets such as the American Community Survey (ACS) or Current Population Survey (CPS). These surveys often do not contain the exact questions asked in online opt-in surveys, which keeps them from being used for proper adjustment.
This synthetic dataset was created by combining nine separate benchmark datasets. Each shared a set of common demographic variables, but many added unique variables such as gun ownership or voter registration. The surveys were stratified, sampled, combined, and imputed to fill in the values missing from each. From this large combined dataset, only the original 20,000 records from the ACS were kept to ensure an accurate demographic distribution.
The names were RANDOMLY assigned to respondents to better simulate a synthetic sample of the population. First names were taken from the babynames dataset, which contains the Social Security Administration's record of baby names from 1880 to 2017 along with sex and proportion. First names were randomly assigned in proportion to their frequency by birth year and sex. Last names were taken from the Census Bureau, which provides the 162,254 most common last names in the 2010 Census, covering over 90% of the population. For a given surname, the proportion of people with that name belonging to each race and ethnicity is provided. Last names were randomly assigned in proportion to their frequency by race.
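The proportional random assignment described above can be sketched in base R. This is an illustration only, not the code used to build the dataset: the lookup table, the respondents, and the helper draw_name() are hypothetical, and the proportions are invented.

```r
set.seed(1)  # make the random draws reproducible

# Hypothetical lookup table: share of each first name within a
# birth-year-by-sex cell (values invented for illustration).
first_names <- data.frame(
  year = c(1980, 1980, 1980, 1980, 1990, 1990),
  sex  = c("F", "F", "M", "M", "F", "M"),
  name = c("Jennifer", "Amanda", "Michael", "Jason", "Jessica", "Joshua"),
  prop = c(0.6, 0.4, 0.7, 0.3, 1.0, 1.0),
  stringsAsFactors = FALSE
)

# Hypothetical respondents, each with a birth year and sex.
respondents <- data.frame(
  id   = 1:6,
  year = c(1980, 1980, 1980, 1990, 1990, 1980),
  sex  = c("F", "M", "F", "F", "M", "M"),
  stringsAsFactors = FALSE
)

# Draw one name for a respondent, weighting the candidate names by
# their share within the respondent's year-by-sex cell.
draw_name <- function(year, sex, lookup) {
  cell <- lookup[lookup$year == year & lookup$sex == sex, ]
  cell$name[sample.int(nrow(cell), size = 1, prob = cell$prop)]
}

respondents$fname <- mapply(
  draw_name, respondents$year, respondents$sex,
  MoreArgs = list(lookup = first_names)
)

print(respondents)
```

The same pattern would apply to last names, with the lookup keyed by race instead of birth year and sex.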
Source
“For Weighting Online Opt-In Samples, What Matters Most?” Pew Research Center, Washington, D.C. (January 26, 2018) https://www.pewresearch.org/methods/2018/01/26/for-weighting-online-opt-in-samples-what-matters-most/