The add_st_tags function processes and annotates text using a Universal Dependencies (UD) model
via the udpipe package. It tokenizes text and tags it with part-of-speech (Stanford) tags,
and can optionally extract and handle hesitation markers. Options control parsing,
tokenizer type, and handling of flattened input.
Usage
add_st_tags(
x,
mdl = udmodel,
st_hesitation = FALSE,
flattened = TRUE,
skip_parse = TRUE,
...
)
Arguments
- x
A character vector of input text to be processed.
- mdl
A udpipe model to use for processing the text. The default is the udmodel.
- st_hesitation
A logical value indicating whether to extract hesitation markers from the input text. If TRUE, the function will extract hesitation markers and return them separately. Default is FALSE.
- flattened
A logical value indicating whether the input text is flattened. If FALSE, i.e. if the character vector is in tokenized form, the function will flatten the text before processing. Default is TRUE.
- skip_parse
A logical value determining whether the function should skip parsing and only return tokenized and tagged text. If FALSE, the function returns the full UD model when parsing. Default is TRUE.
- ...
Additional arguments to be passed to the udpipe_annotate() function. For example:
tokenizer = "horizontal" forces udpipe_annotate() to tokenize on tokens separated by white space. This will combine words and trailing punctuation marks, unless they have previously been separated by white space.
tokenizer = "vertical" forces udpipe_annotate() to tokenize on tokens separated by line breaks. This can be useful if you want the tokenizer to recognise multi-word entities as a single token, or to avoid separating hyphenated words etc.
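The two tokenizer settings described above expect differently shaped input strings. As a minimal sketch in base R, assuming you already hold your tokens in a character vector, the strings could be built as follows. The helper names to_vertical and to_horizontal are hypothetical and not part of the package, and no udpipe call is made here.

```r
# Hypothetical helpers (not part of the package); base R only.
tokens <- c("I'm", "in", "a", "part-time", "job", ",", "at the moment", ".")

# One token per line: the shape described for tokenizer = "vertical".
# A multi-word expression such as "at the moment" stays on a single line.
to_vertical <- function(tokens) {
  paste0(paste(tokens, collapse = "\n"), "\n")
}

# Tokens separated by single spaces: the shape described for
# tokenizer = "horizontal". Any token that itself contains a space
# will be split again on that space by a whitespace tokenizer.
to_horizontal <- function(tokens) {
  paste(tokens, collapse = " ")
}

vertical_text <- to_vertical(tokens)
horizontal_text <- to_horizontal(tokens)

# Inspect the vertical layout, one token per line
cat(vertical_text)
```

The resulting strings could then be passed as x together with tokenizer = "vertical" or tokenizer = "horizontal" respectively.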
Value
If skip_parse is FALSE, the function returns a tibble with the full udpipe model when parsing.
If st_hesitation is TRUE (experimental), the function returns a character vector of tokenized and tagged
text with hesitation markers extracted and handled separately.
Otherwise, the function returns a character vector of tokenized and tagged text.
Examples
if (FALSE) {
# Example text:
text <- "This is an example sentence to be tagged"
# Example speech, tokenized:
speech <- c("I", "don't", "know", "erm", ",", "whether", "to",
            "include", "hesitation", "markers", ".")
# Initiate udpipe model
init_udpipe_model()
# Tag text
add_st_tags(text)
# Tag speech
# Input is already tokenized, so flattened = FALSE
add_st_tags(speech, st_hesitation = TRUE, flattened = FALSE)
text <- "I'm in a part-time job, at the moment."
text2 <- "I'm\nin\na\npart-time\njob\n,\n\nat the moment\n.\n"
# tokenizes using default model - may separate some hyphenated words
add_st_tags(text)
# tokenizes on whitespaces - punctuation marks can be lumped in with words
add_st_tags(text, tokenizer = "horizontal")
# tokenizes on user-defined line breaks - possible to capture multi-word expressions
add_st_tags(text2, tokenizer = "vertical")
}