regex_usa {qdapRegex} | R Documentation |
Canned Regular Expressions (United States of America)
Description
A dataset containing a list U.S. specific, canned regular expressions for use in various functions within the qdapRegex package.
Usage
data(regex_usa)
Format
A list with 54 elements
Details
The following canned regular expressions are included:
- rm_abbreviation
abbreviations containing single lower case or capital letter followed by a period and then an optional space (this must be repeated 2 or more times)
- rm_between
Remove characters between a left and right boundary including the boundaries; note contains
"%s"
that is replaced bysprintf
and is not a valid regex on its own- rm_between2
Remove characters between a left and right boundary NOT including the boundaries; note contains
"%s"
that is replaced bysprintf
and is not a valid regex on its own- rm_caps
words containing 2 or more consecutive upper case letters and no lower case
- rm_caps_phrase
phrases of 1 word or more containing 1 or more consecutive upper case letters and no lower case; if phrase is one word long then phrase must be 2 or more consecutive capital letters
- rm_citation
substring that looks for in-text and parenthetical APA6 style citations (attempts to exclude references)
- rm_citation2
substring that looks for in-text APA6 style citations (attempts to exclude references)
- rm_citation3
substring that looks for parenthetical APA6 style citations (attempts to exclude references)
- rm_city_state
substring with city (single lower case word or multiple consecutive capitalized words before a comma and state) & state (2 consecutive capital letters)
- rm_city_state_zip
substring with city (single lower case word or multiple consecutive capitalized words before a comma and state) & state (2 consecutive capital letters) & zip code (exactly 5 or 5+4 consecutive digits)
- rm_date
dates in the form of 2 digit month, 2 digit day, and 2 or 4 digit year. Separator between month, day, and year may be dot (.), slash (/), or dash (-)
- rm_date2
dates in the form of 3-9 letters followed by one or more spaces, 2 digits, a comma(,), one or more spaces, and 4 digits
- rm_date3
dates in the form of XXXX-XX-XX; hyphen separated string of 4 digit year, 2 digit month, and 2 digit day
- rm_date4
dates in the form of both
rm_date
,rm_date2
, andrm_date3
- rm_dollar
substring with dollar sign ($) followed by (1) just dollars (no decimal), (2) dollars and cents (whole number and decimal), or (3) just cents (decimal value); dollars may contain commas
- rm_email
substring with (1) alphanumeric characters or dash (-), plus (+), or underscore (_) (This may be repeated) (2) followed by at (@), followed by the same regex sequence as before the at (@), and ending with dot (.) and 2-14 digits
- rm_emoticon
common emoticons (logic is complicated to explain in words) using ">?[:;=8XB]{1}[-~+o^]?[|\")(>DO>{pP3/]+|</?3|XD+|D:<|x[-~+o^]?[|\")(>DO>{pP3/]+" regex pattern; general pattern is optional hat character, followed by eyes character, followed by optional nose character, and ending with a mouth character
- rm_endmark
substring of the last endmark group in a string; endmarks include (! ? . * OR |)
- rm_endmark3
substring of the last endmark group in a string; endmarks include (! ? OR .)
- rm_endmark3
substring of the last endmark group in a string; endmarks include (! ? . * | ; OR :)
- rm_hash
substring that begins with a hash (#) followed by a word
- rm_nchar_words
substring of letters (that may contain apostrophes) n letters long (apostrophe not counted in length); note contains
"%s"
that is replaced bysprintf
and is not a valid regex on its own- rm_nchar_words2
substring of letters (that may contain apostrophes) n letters long (apostrophe counted in length); note contains
"%s"
that is replaced bysprintf
and is not a valid regex on its own- rm_non_ascii
substring of 2 digits or letters a-f inside of a left and right angle brace in the form of
"<a4>"
- rm_non_words
substring of any character that isn't a letter, apostrophe, or single space
- rm_number
substring that may begin with dash (-) for negatives, and is (1) just whole number (no decimal), (2) whole number and decimal, or (3) just decimal value; regex pattern provided by Jason Gray
- rm_percent
substring beginning with (1) just whole number (no decimal), (2) whole number and decimal, or (3) just decimal value and followed by a percent sign (%)
- rm_phone
phone numbers in the form of optional country code, valid 3 digit prefix, and 7 digits (may contain hyphens and parenthesis); logic is complex to explain (see https://stackoverflow.com/a/21008254/1000343 for more)
- rm_postal_code
U.S. state abbreviations (and District of Columbia) that is constrained to just possible U.S. state names, not just two consecutive capital letters; taken from Mike Hamilton's submission found https://regexlib.com/REDetails.aspx?regexp_id=2177
- rm_repeated_characters
substring with a repetition of repeated characters within a word; regex pattern retrieved from StackOverflow's, vks: https://stackoverflow.com/a/29438461/1000343
- rm_repeated_phrases
substring with a phrase (a sequence of 1 or more words) that is repeated 2 or more times (case is ignored; separating periods and commas are ignored); regex pattern retrieved from StackOverflow's, BrodieG: https://stackoverflow.com/a/28786617/1000343
- rm_repeated_words
substring with a word (marked with a boundary) that is repeat 2 or more times (case is ignored)
- rm_tag
substring that begins with an at (@) followed by a word
- rm_tag2
Twitter substring that begins with an at (@) followed by a word composed of alpha-numeric characters and underscores, no longer than 15 characters
- rm_title_name
substring beginning with title (Mrs., Mr., Ms., Dr.) that is case independent or full title (Miss, Mizz, mizz) followed by a single lower case word or multiple capitalized words
- rm_time
substring that (1) must begin with 0-2 digits, (2) must be followed by a single colon (:), (3) optionally may be followed by either a colon (:) or a dot (.), (4) optionally may be followed by 1-infinite digits (if previous condition is true)
- rm_time2
substring that is identical to
rm_time
with the additional search for Ante Meridiem/Post Meridiem abbreviations (e.g., AM, p.m., etc.)- rm_transcript_time
substring that is specific to transcription time stamps in the form of HH:MM:SS.OS where OS is milliseconds. HH: and .OS are optional. The SS.OS period divide may also be a comma or additional colon. The HH:SS divid may also be a period. String may be affixed with pound sign (#).
- rm_twitter_url
Twitter short link/url; substring optionally beginning with http, followed by t.co ending on a space or end of string (whichever comes first)
- rm_url
substring beginning with http, www., or ftp and ending on a space or end of string (whichever comes first); note that this regex is simple and may not cover all valid URLs or may include invalid URLs
- rm_url2
substring beginning with http, www., or ftp and more constrained than
rm_url
; based on @imme_emosol's response from https://mathiasbynens.be/demo/url-regex- rm_url3
substring beginning with http or ftp and more constrained than
rm_url
&rm_url2
though light-weight, making it ideal for validation purposes; taken from @imme_emosol's response found https://mathiasbynens.be/demo/url-regex- rm_white
substring of white space(s); this regular expression combines
rm_white_bracket
,rm_white_colon
,rm_white_comma
,rm_white_endmark
,rm_white_lead
,rm_white_trail
, andrm_white_multiple
- rm_white_bracket
substring of white space(s) following left brackets ("{", "(", "[") or preceding right brackets ("}", ")", "]")
- rm_white_colon
substring of white space(s) preceding colon(s)/semicolon(s)
- rm_white_comma
substring of white space(s) preceding a comma
- rm_white_endmark
substring of white space(s) preceding a single occurrence/combination of period(s), question mark(s), and exclamation point(s)
- rm_white_lead
substring of leading white space(s)
- rm_white_lead_trail
substring of leading/trailing white space(s)
- rm_white_multiple
substring of multiple, consecutive white spaces
- rm_white_punctuation
substring of white space(s) preceding a comma or a single occurrence/combination of colon(s), semicolon(s), period(s), question mark(s), and exclamation point(s)
- rm_white_trail
substring of trailing white space(s)
- rm_zip
substring of 5 digits optionally followed by a dash and 4 more digits
Extra
Use qdapRegex:::examine_regex()
to interactively explore the
regular expressions in regex_usa
. This will provide a browser + console
based break down of each regex in the dictionary.