Compute the Biber-style dimension scores of a data frame.
Usage
dtag_tbl(
tbl,
input = 1,
text = 2,
tokenized = FALSE,
ttr = 400,
deflated = TRUE,
exclude = NULL
)
Arguments
- tbl
A data frame with at least two columns: one holding the input id and one holding the _ST-tagged text.
- input
The column holding the input id. Defaults to the 1st column, but can also be given by name, e.g. "colname1".
- text
The column holding the text. Defaults to the 2nd column, but can also be given by name, e.g. "colname12". The text should be tagged with _ST tags and be in flattened (non-tokenized) form.
- tokenized
Logical. The default is FALSE, in which case the function tokenizes the text with str_split(text, " "). Set to TRUE if text is already tokenized and in a list column.
- ttr
Maximum number of tokens to consider for TTR, defaults to 400.
- deflated
Logical. If TRUE (default), Dimension scores are calculated without using the low mean frequency features from Biber's original study, following the MAT tagger algorithm (Nini 2019).
- exclude
A character vector of dimension tags to leave out of the analysis. Each tag must be a quoted string that includes its angle brackets. For example:
(1) If some of the word counts in the texts are below 400, you may want to exclude the type-token ratio from the analysis with exclude = "<TTR>".
(2) Some of the tags (such as <WZPRES> and <GER>) were manually checked in the original Biber study, but are automatically tagged here. You can exclude some of these tags by naming them in the exclude character vector, such as exclude = c("<GER>", "<WZPRES>").
(3) To exclude all of the tags manually checked by Biber, use the argument exclude = "<MANUAL>". This is the same as the argument exclude = c("<DEMP>", "<GER>", "<PASTP>", "<PRESP>", "<SERE>", "<THAC>", "<THVC>", "<TOBJ>", "<TSUB>", "<WZPAST>", "<WZPRES>").
(4) You can combine (1) and (3) with exclude = c("<MANUAL>", "<TTR>").
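The relationship between "<MANUAL>" and the eleven tags it stands for can be sketched in base R. This is only an illustration of the equivalence described in (3) and (4): expand_exclude() is a hypothetical helper written for this example, not a function of the package, and the package may resolve the shorthand differently internally.

```r
# Tags this page lists as manually checked in Biber's original study.
manual_tags <- c("<DEMP>", "<GER>", "<PASTP>", "<PRESP>", "<SERE>", "<THAC>",
                 "<THVC>", "<TOBJ>", "<TSUB>", "<WZPAST>", "<WZPRES>")

# Hypothetical helper (not part of the package): replace the "<MANUAL>"
# shorthand with the tags it stands for and drop any duplicates.
expand_exclude <- function(exclude) {
  if (is.null(exclude)) return(character(0))
  if ("<MANUAL>" %in% exclude) {
    exclude <- c(setdiff(exclude, "<MANUAL>"), manual_tags)
  }
  unique(exclude)
}

# Case (4): the shorthand combined with "<TTR>".
expanded <- expand_exclude(c("<MANUAL>", "<TTR>"))
print(expanded)
```

Under these assumptions, the expanded vector holds "<TTR>" plus the eleven manually checked tags, which is what passing the full list by hand would amount to.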
Value
A tibble containing:
wordcount - number of non-punctuation tokens found in text
dimension - Dimension1 ~ Dimension6 from Biber 1988 for each feature
feature - the <MDA> tag or AWL or TTR
detail - brief description of the feature
count - number of times the feature is counted in text
value - for an <MDA> tag, the normalized frequency per 100 tokens
z-score - value scaled to the biber_mean and biber_sd
d-score - same as z-score, but with the sign of negative dimension features reversed
biber_mean and biber_sd for each feature, based on Biber 1988
closest matching text type for each input, based on Biber 1989
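How value, biber_mean, biber_sd, z-score and d-score relate can be sketched as follows. The numbers and feature tags are invented for illustration and are not values from Biber 1988; the negative column is a hypothetical flag added for this example, and the column names z_score and d_score use underscores only to keep the code simple.

```r
# Illustrative numbers only; these are NOT values from Biber (1988).
features <- data.frame(
  feature    = c("<FEAT_A>", "<FEAT_B>"),  # hypothetical feature tags
  value      = c(6.0, 1.0),                # normalized frequency per 100 tokens
  biber_mean = c(4.0, 2.0),                # reference mean
  biber_sd   = c(1.0, 0.5),                # reference standard deviation
  negative   = c(FALSE, TRUE)              # hypothetical flag: negative dimension feature
)

# z-score: distance of value from the reference mean, in reference SDs.
features$z_score <- (features$value - features$biber_mean) / features$biber_sd

# d-score: the z-score with its sign reversed for negative dimension features.
features$d_score <- ifelse(features$negative, -features$z_score, features$z_score)

print(features)
```

In this sketch the second feature occurs less often than its reference mean, so its z-score is negative; because it is flagged as a negative feature, its d-score is positive.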
Details
This function adds multidimensional analysis <MDA> tags to a data frame. The data frame should contain one column that identifies the input id and another that holds the text, which must already be tagged with _ST tags and flattened into non-tokenized form.
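The difference between the flattened form and the tokenized form can be sketched in base R. The example sentences and the word_TAG shape of the tagged text are assumptions made for illustration, and base strsplit() stands in here for the str_split(text, " ") call mentioned under the tokenized argument.

```r
# Hypothetical tagged input; the word_TAG shape is an assumption.
tbl <- data.frame(
  id   = c("doc1", "doc2"),
  text = c("The_DT cat_NN sat_VBD ._.", "Dogs_NNS bark_VBP ._."),
  stringsAsFactors = FALSE
)

# Flattened form: each row holds the whole text as a single string.
is.character(tbl$text)

# Tokenized form: split each string on single spaces, one character vector per row.
tokens <- strsplit(tbl$text, " ", fixed = TRUE)

# Store the tokens as a list column, the shape expected when tokenized = TRUE.
tbl_tokenized <- tbl
tbl_tokenized$text <- tokens

print(tbl_tokenized$text[[1]])
```

A data frame shaped like tbl would be passed with tokenized = FALSE (the default), and one shaped like tbl_tokenized with tokenized = TRUE.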
After tagging the text, the function calculates Dimension scores based on the Biber 1988 standard, and approximates the closest text type according to the Biber 1989 text classification.
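One way to picture the step from per-feature scores to Dimension scores is the sketch below. It assumes that a Dimension score is the sum of the d-scores of the features belonging to that dimension, which is how dimension scores are commonly described for Biber 1988; the function's actual internals may differ, and the d-scores shown are invented rather than real output.

```r
# Illustrative d-scores; not real output from dtag_tbl().
scores <- data.frame(
  dimension = c("Dimension1", "Dimension1", "Dimension2"),
  d_score   = c(1.5, -0.5, 2.0)
)

# Assumption: a Dimension score is the sum of the d-scores of its features.
dimension_scores <- tapply(scores$d_score, scores$dimension, sum)

print(dimension_scores)
```

Because the sign of negative features is already reversed in the d-score, a plain sum per dimension is enough in this sketch.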
References
Biber, D. (1988). Variation across Speech and Writing. Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9780511621024
Biber, D. (1989). A typology of English texts. Linguistics, 27(1), 3-44. https://doi.org/10.1515/ling.1989.27.1.3
Nini, A. (2019). The Multi-Dimensional Analysis Tagger. In Berber Sardinha, T. & Veirano Pinto, M. (eds), Multi-Dimensional Analysis: Research Methods and Current Issues, 67-94. London; New York: Bloomsbury Academic.