This function produces concordance lines of text from data by finding up to two tag matches in tokenized text.
Usage
conc_by_tag(
  data,
  what = "token",
  tag = "mda",
  match,
  cols = NULL,
  tag2 = NULL,
  match2 = NULL,
  ...
)
Arguments
- data
A relational data frame containing the text to concordance. The data frame is expected to have one column of tokens, in tokenized form, at least one column of the corresponding tags, and identifying details such as corpus, doc_id etc.
- what
The name of the column containing the text to concatenate. Default is "token".
- tag
The name of the column containing the tags to match. Default is "mda".
- match
The tag to match within the tag column. The match can take regex, so you can use anchoring characters (^ and $) for specific searches.
- cols
The names of the columns to include in the output. It may be useful to include some extra reference columns (such as doc_id), or other tags for more fine-grained filtering.
- tag2
The name of the second column containing the tags to match (optional).
- match2
The second tag to match within the tag2 column (optional).
- ...
Additional arguments to be passed on to dtagger::quick_conc. For example:
  - pass on the separated = TRUE argument to enable sorting search results by adjacent tokens to the left and right;
  - pass on the n = 3 argument to limit the search window to 3 tokens either side of the match.
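The effect of anchoring in a match pattern can be illustrated with base R alone. This is a minimal sketch that assumes the tag matching behaves like ordinary R regular-expression matching (as with grepl); the tag values are made up for illustration and are not taken from any real tagset.

```r
# Hypothetical tag values, for illustration only
tags <- c("ADJ", "ADJP", "NOUN", "PROPN", "ADV")

# Unanchored pattern: matches any tag that contains "ADJ"
unanchored <- tags[grepl("ADJ", tags)]
print(unanchored)  # "ADJ" "ADJP"

# Anchored pattern: matches only the tag that is exactly "ADJ"
anchored <- tags[grepl("^ADJ$", tags)]
print(anchored)    # "ADJ"
```

If a tagset contains tags that share a prefix, an anchored pattern such as "^ADJ$" is the safer choice for match and match2.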
Value
A tibble containing:
case - a case number for the match found.
left - tokens immediately adjacent (up to n) to the left of the matched node, as defined by the what argument (default is token). In case of separated = TRUE, the left tokens are separated into left(n):left1.
match - the matched search item, as defined by the match argument.
right - tokens immediately adjacent (up to n) to the right of the matched node, as defined by the what argument (default is token). In case of separated = TRUE, the right tokens are separated into right1:right(n).
index - the row index position of the matched result in the input data frame.
other cols - as defined by the tag, tag2 and cols arguments.
Details
The purpose of this function is to allow fine-grained concordance searches of tagged text. The input should be a data frame with one column of tokens (one token per row) and separate columns for tags, document and corpus details.
Typically, the function can be used with output from the udpipe::udpipe_annotate and dtagger::dtag_tbl, dtagger::dtag_directory or dtagger::add_tag_tbl functions.
The concordancer can take up to two tag inputs, for example matching all upos == "ADJ" tags and dep_rel == "amod" tags, and seeing the resulting key words in context.
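A two-tag search of this kind might look like the sketch below. The toy data frame is invented for illustration: its column names follow udpipe conventions, but the values are made up. The sketch assumes that a plain data frame is accepted as input and that tag/match and tag2/match2 are combined so that a row must satisfy both; neither assumption is confirmed here. The call to conc_by_tag only runs if dtagger is installed.

```r
# Toy tokenized data (hypothetical values; one token per row)
toy <- data.frame(
  doc_id  = "doc1",
  token   = c("The", "quick", "fox", "is", "happy"),
  upos    = c("DET", "ADJ", "NOUN", "AUX", "ADJ"),
  dep_rel = c("det", "amod", "nsubj", "cop", "root"),
  stringsAsFactors = FALSE
)

# Base R cross-check: rows where both conditions hold
# (assumes the two matches are combined with AND)
both <- which(grepl("^ADJ$", toy$upos) & grepl("^amod$", toy$dep_rel))
print(toy$token[both])  # "quick"

# The concordance call itself, guarded so the sketch runs without dtagger
if (requireNamespace("dtagger", quietly = TRUE)) {
  res <- dtagger::conc_by_tag(
    toy,
    what   = "token",
    tag    = "upos",
    match  = "^ADJ$",
    cols   = "doc_id",
    tag2   = "dep_rel",
    match2 = "^amod$",
    n      = 2  # passed on to dtagger::quick_conc via ...
  )
  print(res)
}
```

Comparing the index column of the result with the base R cross-check is a quick way to confirm how the two matches are combined on your own data.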