clean_source {rock} | R Documentation |
Cleaning & editing sources
Description
These functions can be used to 'clean' one or more sources or perform search and replace tasks. Cleaning consists of two operations: splitting the source at utterance markers, and conducting search and replaces using regular expressions.
Usage
clean_source(
input,
output = NULL,
replacementsPre = rock::opts$get("replacementsPre"),
replacementsPost = rock::opts$get("replacementsPost"),
extraReplacementsPre = NULL,
extraReplacementsPost = NULL,
removeNewlines = FALSE,
removeTrailingNewlines = TRUE,
rlWarn = rock::opts$get(rlWarn),
utteranceSplits = rock::opts$get("utteranceSplits"),
preventOverwriting = rock::opts$get("preventOverwriting"),
encoding = rock::opts$get("encoding"),
silent = rock::opts$get("silent")
)
clean_sources(
input,
output,
outputPrefix = "",
outputSuffix = "_cleaned",
recursive = TRUE,
filenameRegex = ".*",
replacementsPre = rock::opts$get(replacementsPre),
replacementsPost = rock::opts$get(replacementsPost),
extraReplacementsPre = NULL,
extraReplacementsPost = NULL,
removeNewlines = FALSE,
utteranceSplits = rock::opts$get(utteranceSplits),
preventOverwriting = rock::opts$get(preventOverwriting),
encoding = rock::opts$get(encoding),
silent = rock::opts$get(silent)
)
search_and_replace_in_source(
input,
replacements = NULL,
output = NULL,
preventOverwriting = TRUE,
encoding = "UTF-8",
rlWarn = rock::opts$get(rlWarn),
silent = FALSE
)
search_and_replace_in_sources(
input,
output,
replacements = NULL,
outputPrefix = "",
outputSuffix = "_postReplacing",
preventOverwriting = rock::opts$get("preventOverwriting"),
recursive = TRUE,
filenameRegex = ".*",
encoding = rock::opts$get("encoding"),
silent = rock::opts$get("silent")
)
wordwrap_source(
input,
output = NULL,
length = 60,
removeNewlines = FALSE,
removeTrailingNewlines = TRUE,
rlWarn = rock::opts$get(rlWarn),
preventOverwriting = rock::opts$get("preventOverwriting"),
encoding = rock::opts$get(encoding),
silent = rock::opts$get(silent),
utteranceMarker = rock::opts$get("utteranceMarker")
)
Arguments
input |
For |
output |
For |
replacementsPre , replacementsPost |
Each is a list of two-element vectors,
where the first element in each vector contains a regular expression to search for
in the source(s), and the second element contains the replacement (these are passed
as |
extraReplacementsPre , extraReplacementsPost |
To perform more replacements
than the default set, these can be conveniently specified in |
removeNewlines |
Whether to remove all newline characters from the source before starting to clean them. Be careful: if the source contains YAML fragments, these will also be affected by this, and will probably become invalid! |
removeTrailingNewlines |
Whether to remove trailing newline characters (i.e. at the end of a character value in a character vector); |
rlWarn |
Whether to let |
utteranceSplits |
This is a vector of regular expressions that specify where to
insert breaks between utterances in the source(s). Such breaks are specified using
|
preventOverwriting |
Whether to prevent overwriting of output files. |
encoding |
The encoding of the source(s). |
silent |
Whether to suppress the warning about not editing the cleaned source. |
outputPrefix , outputSuffix |
The prefix and suffix to add to the filenames when writing the processed files to disk. |
recursive |
Whether to search all subdirectories ( |
filenameRegex |
A regular expression to match against located files; only files matching this regular expression are processed. |
replacements |
The strings to search & replace, as a list of two-element vectors,
where the first element in each vector contains a regular expression to search for
in the source(s), and the second element contains the replacement (these are passed
as |
length |
At how many characters to word wrap. |
utteranceMarker |
The character(s) between utterances (i.e. marking where one utterance ends and the next one starts). By default, this is a line break; only change this if you know what you are doing. |
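The interplay of recursive, filenameRegex, outputPrefix and outputSuffix can be illustrated with base R alone. The sketch below is not rock's implementation: the directory layout, the helper sketchOutputName and the way the extension is preserved are all assumptions made for illustration.

```r
# Hypothetical sketch of file selection and output naming (assumptions only,
# not taken from the rock source code).
sketchDir <- file.path(tempdir(), "rock-sketch")
dir.create(file.path(sketchDir, "sub"), recursive = TRUE, showWarnings = FALSE)
writeLines("First source.", file.path(sketchDir, "interview-1.rock"))
writeLines("Second source.", file.path(sketchDir, "sub", "interview-2.rock"))
writeLines("Not a source.", file.path(sketchDir, "notes.txt"))

# recursive = TRUE also searches subdirectories; the pattern plays the
# role of filenameRegex and keeps only files ending in ".rock".
selected <- list.files(sketchDir, pattern = "\\.rock$",
                       recursive = TRUE, full.names = TRUE)

# Assumed naming scheme: prefix + file name without extension + suffix,
# followed by the original extension.
sketchOutputName <- function(path, prefix = "", suffix = "_cleaned") {
  ext <- tools::file_ext(path)
  stem <- tools::file_path_sans_ext(basename(path))
  paste0(prefix, stem, suffix, if (nzchar(ext)) paste0(".", ext) else "")
}

outputNames <- vapply(selected, sketchOutputName, character(1),
                      USE.NAMES = FALSE)
print(outputNames)
```

With the default-like suffix used here, notes.txt is skipped and the two .rock files map to names ending in _cleaned.rock.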
Details
The cleaning functions, when called with their default arguments, will do the following:
Double periods (..) will be replaced with single periods (.).
Four or more periods (.... or .....) will be replaced with three periods.
Three or more newline characters will be replaced by one newline character (which will become more, if the sentence before that character marks the end of an utterance).
All sentences will become separate utterances (in a semi-smart manner; specifically, breaks in speaking, if represented by three periods, are not considered sentence ends, whereas ellipses ("…" or unicode 2026, see the example) are).
If there are commas without a space following them, a space will be inserted.
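The default behaviour described above can be approximated in base R. The following is a minimal sketch under stated assumptions: the regular expressions, the helper applyReplacements and the split pattern are illustrative guesses, not the values rock actually stores in its options. The list mirrors the format of replacementsPre and replacementsPost (two-element vectors of pattern and replacement).

```r
# Illustrative patterns only; these are assumptions, not rock's defaults.
sketchReplacements <- list(
  c("\\.{4,}", "..."),              # four or more periods -> three
  c("(?<!\\.)\\.\\.(?!\\.)", "."),  # exactly two periods -> one
  c("\\n{3,}", "\n"),               # three or more newlines -> one
  c(",(?=\\S)", ", ")               # comma without a following space
)

# Hypothetical helper: apply each pattern/replacement pair in order.
applyReplacements <- function(x, replacements) {
  for (pair in replacements) {
    x <- gsub(pair[1], pair[2], x, perl = TRUE)
  }
  x
}

# Insert a line break after sentence-final punctuation (including the
# ellipsis character), but not after a run of three periods.
splitSketch <- "(?<!\\.\\.)([.?!]|\u2026)[ \\t]+"

sketchInput <- "Wait.. what,really?\n\n\n\nYes..... I think... so. Fine."
sketchCleaned <- applyReplacements(sketchInput, sketchReplacements)
sketchUtterances <- gsub(splitSketch, "\\1\n", sketchCleaned, perl = TRUE)
cat(sketchUtterances)
```

The order of the pairs matters in this sketch: runs of four or more periods are reduced first, so that the two-period rule only ever sees genuine double periods.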
Value
A character vector for clean_source, or a list of character vectors for clean_sources.
Examples
exampleSource <-
"Do you like icecream?
Well, that depends\u2026 Sometimes, when it's..... Nice. Then I do,
but otherwise... not really, actually."
### Default settings:
cat(clean_source(exampleSource));
### First remove existing newlines:
cat(clean_source(exampleSource,
                 removeNewlines=TRUE));
### Example with a YAML fragment
exampleWithYAML <-
  c(
    "Do you like icecream?",
    "",
    "",
    "Well, that depends\u2026 Sometimes, when it's..... Nice.",
    "Then I do,",
    "but otherwise... not really, actually.",
    "",
    "---",
    "This acts as some YAML. So this won't be split.",
    "Not real YAML, mind... It just has the delimiters, really.",
    "---",
    "This is an utterance again."
  );
cat(
  rock::clean_source(
    exampleWithYAML
  ),
  sep="\n"
);
exampleSource <-
"Do you like icecream?
Well, that depends\u2026 Sometimes, when it's..... Nice. Then I do,
but otherwise... not really, actually."
### Simple text replacements:
cat(search_and_replace_in_source(exampleSource,
                                 replacements=list(c("\u2026", "..."),
                                                   c("Nice", "Great"))));
### Using a regular expression to capitalize all words following
### a period:
cat(search_and_replace_in_source(exampleSource,
                                 replacements=list(c("\\.(\\s*)([a-z])", ".\\1\\U\\2"))));
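The examples above do not cover word wrapping. As a rough illustration of what wrapping at a given length means, the sketch below uses base R's strwrap(); this is an assumption made for illustration and may differ from how wordwrap_source breaks lines (for instance in how it treats utterance markers). The names sketchText and sketchWrapped are hypothetical.

```r
# Sketch only: base R's strwrap() stands in for whatever rock does internally.
sketchText <- paste(
  "Well, that depends. Sometimes, when it's nice, then I do,",
  "but otherwise not really, actually."
)

# strwrap() breaks at whitespace and returns one element per wrapped line;
# each line is shorter than `width` characters.
sketchWrapped <- strwrap(sketchText, width = 40)

cat(sketchWrapped, sep = "\n")
```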