Tag a directory of folders of plain text files with <MDA> tags
Arguments
- path
A character string denoting the folder containing the target folders (at any level).
- n
An optional argument denoting the maximum number of text files to be analyzed.
- ST
Logical argument denoting whether the text files have _ST tags included already.
- deflated
Logical argument. If TRUE (default), the function returns, in addition to the normal results, "deflated" dimension scores, in which rare features from Biber's original study (mean frequency < 0.1) are left out of the Dimension score calculations.
- exclude
A character vector of dimension tags to leave out of the analysis. Each tag must be a quoted string that includes its angle brackets. For example:
(1) If some of the texts have fewer than 400 words, you may want to exclude type-token ratios from the analysis with
exclude = "<TTR>"
(2) Some of the tags (such as <WZPRES> and <GER>) were manually checked in the original Biber study, but are tagged automatically here. You can exclude some of these tags by naming them in the exclude character vector, such as
exclude = c("<GER>", "<WZPRES>")
(3) To exclude all of the tags manually checked by Biber, use the argument
exclude = "<MANUAL>"
This is the same as the argument
exclude = c("<DEMP>", "<GER>", "<PASTP>", "<PRESP>", "<SERE>", "<THAC>", "<THVC>", "<TOBJ>", "<TSUB>", "<WZPAST>", "<WZPRES>")
(4) You can combine (1) and (3) with
exclude = c("<MANUAL>", "<TTR>")
- ...
Additional arguments to be passed on.
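As a rough illustration of the exclude shorthand described above, the sketch below expands "<MANUAL>" into the eleven tags it stands for. The helper expand_exclude() and the vector manual_tags are hypothetical names introduced only for this example; how the function resolves "<MANUAL>" internally is not documented here.

```r
# Hypothetical helper (not part of the package): expands the "<MANUAL>"
# shorthand into the individual tags listed under the exclude argument.
manual_tags <- c("<DEMP>", "<GER>", "<PASTP>", "<PRESP>", "<SERE>", "<THAC>",
                 "<THVC>", "<TOBJ>", "<TSUB>", "<WZPAST>", "<WZPRES>")

expand_exclude <- function(exclude) {
  if ("<MANUAL>" %in% exclude) {
    # Replace the shorthand with the tags it represents
    exclude <- c(setdiff(exclude, "<MANUAL>"), manual_tags)
  }
  unique(exclude)
}

# Combination (4): all manually checked tags plus the type-token ratio
excluded <- expand_exclude(c("<MANUAL>", "<TTR>"))
print(excluded)
```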
Value
A list of data frames containing:
corpus_dimension_scores
corpus - name of corpus folder
corpus_text_type - closest text type for average corpus dimensions
most_common_text_type - the mode of the closest text type for the documents within the corpus folder
Dimension scores - calculated scores for Dimension1 ~ Dimension6
document_dimension_scores
corpus - name of corpus folder
doc_id - name of text file
Dimension scores - calculated scores for Dimension1 ~ Dimension6
closest_text_type - closest matching text type for each doc_id, based on Biber 1989
dimension_tags
dimension - Dimension1 ~ Dimension6 from Biber 1988 for each feature
feature - the <MDA> tag or AWL or TTR
detail - brief description of the feature
count - number of times the feature is counted in text
value - for <MDA> tags, the normalised frequency per 100 tokens
z-score - value standardised using biber_mean and biber_sd
d-score - same as z-score, but with the sign of negative dimension features reversed
biber_mean and biber_sd for each feature, based on Biber 1988
tokenized_tags
corpus - name of corpus folder
doc_id - name of text file
st - text tokenized on each _ST tag
mda - text tokenized on each <MDA> tag
texts
corpus - name of corpus folder
doc_id - name of text file
raw_text - untagged, flattened text for each doc_id
tagged_text - flattened text with _ST and <MDA> tags for each doc_id
wordcount - number of non-punctuation tokens found in text
Tukey_hsd
dimension - the dimension for pairwise comparison
contrast - the corpora under pairwise comparison
null.value - the expected difference in means after aov (zero)
estimate - the difference in means after aov
conf.low - the lower bound of the 95% family-wise confidence interval
conf.high - the upper bound of the 95% family-wise confidence interval
p.value - the p-value of the pairwise comparison
Details
The target texts to be tagged should be placed in a directory of folders whose names are prefixed with $$. The function will then read in any text files from the target folders, and retrieve the folder names as the "corpus" variable.
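The folder layout described above can be sketched as follows. The folder names ($$fiction, $$news), the file name sample.txt and the mda_demo root are made up for illustration, and the code only builds and inspects the layout in a temporary directory; it does not reproduce how the function itself locates the folders.

```r
# Build a small demo layout: <root>/$$fiction/sample.txt, <root>/$$news/sample.txt
root <- file.path(tempdir(), "mda_demo")
corpora <- c("$$fiction", "$$news")  # hypothetical corpus folders

for (corpus in corpora) {
  dir.create(file.path(root, corpus), recursive = TRUE, showWarnings = FALSE)
  writeLines("A short sample text.", file.path(root, corpus, "sample.txt"))
}

# Folders whose names start with "$$" (startsWith avoids regex escaping of "$")
all_dirs <- list.dirs(root, full.names = FALSE, recursive = TRUE)
corpus_dirs <- all_dirs[startsWith(basename(all_dirs), "$$")]

# Plain text files found below the root folder
txt_files <- list.files(root, pattern = "\\.txt$", recursive = TRUE)

print(corpus_dirs)
print(txt_files)
```

Here root plays the role of the folder that the path argument is meant to point to.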
If the texts have already been tagged with Stanford _ST tags, choose the option ST = TRUE. Otherwise, the function add_st_tags() will run over the texts, for which it is necessary to have a udpipe model loaded. See add_st_tags for details.
The function then adds multidimensional analysis <MDA> tags, and calculates Dimension scores based on the Biber 1988 standard. Note that some of the tags from the original study can be excluded from the analysis with the exclude argument.
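The relationship between the value, z-score and d-score columns of dimension_tags can be sketched with a toy table. All numbers below are invented for illustration and are not Biber's published means or standard deviations; the feature names, the loading column and the word count of 400 are likewise hypothetical. Summing the d-scores to obtain a dimension score follows Biber's general approach and is an assumption about this function, not a statement of its exact implementation.

```r
# Toy data: illustrative values only, NOT Biber's published statistics
wordcount <- 400  # assumed number of tokens in one text
tags <- data.frame(
  feature    = c("<A>", "<B>", "<C>"),            # hypothetical features
  loading    = c("positive", "positive", "negative"),
  count      = c(12, 4, 16),
  biber_mean = c(2.0, 1.5, 3.0),
  biber_sd   = c(0.5, 0.25, 2.0)
)

# value: frequency normalised per 100 tokens
tags$value <- tags$count / wordcount * 100

# z-score: value standardised against the reference mean and sd
tags$z_score <- (tags$value - tags$biber_mean) / tags$biber_sd

# d-score: z-score with the sign reversed for negative-loading features
tags$d_score <- ifelse(tags$loading == "negative", -tags$z_score, tags$z_score)

# Assumed aggregation: the dimension score is the sum of the d-scores
dimension_score <- sum(tags$d_score)

print(tags)
print(dimension_score)
```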
If the argument deflated = TRUE, the function also returns Dimension scores calculated without the low mean frequency features from Biber's original study, following the MAT tagger algorithm (Nini 2019).
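A minimal sketch of what "deflating" means, assuming a dimension score is the sum of per-feature d-scores: features whose reference mean is below 0.1 are dropped before summing. The table is invented for illustration (it does not contain Biber's figures), and the exact procedure used by the function or by MAT may differ in detail.

```r
# Toy data: illustrative values only, NOT Biber's published statistics
tags <- data.frame(
  feature    = c("<A>", "<B>", "<C>"),  # hypothetical features
  biber_mean = c(2.0, 0.05, 3.0),       # <B> is a "rare" feature (mean < 0.1)
  d_score    = c(2, 6, -0.5)
)

# Normal score: every feature contributes
full_score <- sum(tags$d_score)

# Deflated score: keep only features with a reference mean of at least 0.1
keep <- tags$biber_mean >= 0.1
deflated_score <- sum(tags$d_score[keep])

print(c(full = full_score, deflated = deflated_score))
```

In this toy case the rare feature carries a large standardised score, so removing it changes the total noticeably, which is the situation the deflated scores are meant to guard against.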
The function returns a list of tibbles including the tagged texts, word counts, and individual and corpus-level scores for each dimension. If the function detects more than one corpus folder (folders prefixed with $$), it also returns the results of post-hoc significance tests: a set of confidence intervals on the differences between the mean dimension scores of the corpora, based on the Studentized range statistic (Tukey's 'Honest Significant Difference' method).
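The post-hoc comparison can be illustrated with base R's aov() and TukeyHSD() on invented document scores. The corpus labels and scores are made up, and the correspondence suggested in the comments between base R's columns (diff, lwr, upr, p adj) and the Tukey_hsd columns (estimate, conf.low, conf.high, p.value) is an assumption; this is a sketch of the method, not the function's own code.

```r
# Toy document-level scores for one dimension in three hypothetical corpora
scores <- data.frame(
  corpus     = factor(rep(c("a", "b", "c"), each = 4)),
  Dimension1 = c(1, 2, 3, 4,  5, 6, 7, 8,  1.5, 2.5, 3.5, 4.5)
)

# One-way analysis of variance, then Tukey's Honest Significant Difference
fit <- aov(Dimension1 ~ corpus, data = scores)
hsd <- TukeyHSD(fit, "corpus", conf.level = 0.95)

# One row per pairwise contrast, with columns:
#   diff    difference in means       (presumably "estimate")
#   lwr/upr family-wise 95% interval  (presumably "conf.low" / "conf.high")
#   p adj   adjusted p-value          (presumably "p.value")
result <- as.data.frame(hsd$corpus)
print(result)
```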
References
Biber, D. (1988). Variation across Speech and Writing. Cambridge: Cambridge University Press. doi:10.1017/CBO9780511621024
Biber, D. (1989). A typology of English texts. Linguistics, 27(1), 3-44. https://doi.org/10.1515/ling.1989.27.1.3
Nini, A. (2019). The Multi-Dimensional Analysis Tagger. In Berber Sardinha, T. & Veirano Pinto, M. (eds), Multi-Dimensional Analysis: Research Methods and Current Issues, 67-94. London; New York: Bloomsbury Academic.