Annotation
A Matcher outputs instances of Annotation. The iamsystem algorithm tries to match a sequence of tokens in a document to the sequence of tokens of a keyword (term). An Annotation instance stores the sequence of document tokens that was matched to one or more keywords. It also stores the name of the fuzzy algorithm that matched each token, which is useful for machine learning or debugging.
Annotation’s format
The to_string method returns a string representation containing three tab-separated fields:
A concatenation of the token labels as they appear in the document.
The start-end offsets in the Brat format (start and end are separated by a space; a semicolon separates the offsets of discontinuous tokens).
A string representation of the detected keywords.
For example:
from iamsystem import Entity
from iamsystem import Matcher

ent = Entity(label="infectious disease", kb_id="D007239")
matcher = Matcher.build(
    keywords=[ent], abbreviations=[("infect", "infectious")], w=2
)
text = "Infect mononucleosis disease"
annots = matcher.annot_text(text=text)
for annot in annots:
    print(annot)
    print(annot.to_string(text=text))
    print(annot.to_string(text=text, debug=True))
# Infect disease 0 6;21 28 infectious disease (D007239) # noqa
# Infect disease 0 6;21 28 infectious disease (D007239) Infect mononucleosis disease # noqa
# Infect disease 0 6;21 28 infectious disease (D007239) Infect mononucleosis disease infect(abbs);disease(exact) # noqa
Passing the document to the to_string method adds the document substring that begins at the first token's start offset and ends at the last token's end offset. If debug is True, it also adds each token’s normalized label and the name(s) of the fuzzy algorithm(s) that detected it.
The method to_dict returns a dictionary representation of an annotation.
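To make the Brat offset field concrete, here is a minimal sketch in plain Python that does not use iamsystem. The helper brat_offsets and the (index, start, end) token tuples are hypothetical; merging adjacent tokens into one fragment is an assumption inferred from the example outputs in this page, not a description of the library's internals.

```python
from typing import List, Tuple

# Hypothetical token representation: (index in document, start offset, end offset).
Token = Tuple[int, int, int]


def brat_offsets(tokens: List[Token]) -> str:
    """Format token offsets as 'start end' fragments joined by ';'."""
    fragments: List[List[int]] = []
    previous_index = None
    for index, start, end in sorted(tokens):
        if previous_index is not None and index == previous_index + 1:
            # Adjacent token in the document: extend the current fragment.
            fragments[-1][1] = end
        else:
            # Gap in the token sequence: open a new fragment.
            fragments.append([start, end])
        previous_index = index
    return ";".join(f"{start} {end}" for start, end in fragments)


# "Infect mononucleosis disease": tokens 0 and 2 are matched, token 1 is skipped.
print(brat_offsets([(0, 0, 6), (2, 21, 28)]))  # 0 6;21 28
# "Presence of a lung cancer": tokens 3 and 4 are matched and adjacent.
print(brat_offsets([(3, 14, 18), (4, 19, 25)]))  # 14 25
```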
Multiple keywords per annotation
An Annotation has multiple keywords if and only if these keywords have the same tokenization output, i.e. the same sequence of tokens. This happens when two terms have the same label, but also when the normalization process removes punctuation or when stopwords are ignored. In the example below, only one annotation is produced and it has 3 keywords:
from iamsystem import Entity
from iamsystem import Matcher
from iamsystem import english_tokenizer

ent1 = Entity(label="Infectious Disease", kb_id="J80")
ent2 = Entity(label="infectious disease", kb_id="C0042029")
ent3 = Entity(
    label="infectious disease, unspecified", kb_id="C0042029"
)
matcher = Matcher.build(
    keywords=[ent1, ent2, ent3],
    tokenizer=english_tokenizer(),
    stopwords=["unspecified"],
)
text = "History of infectious disease"
annots = matcher.annot_text(text=text)
annot = annots[0]
for keyword in annot.keywords:
    print(keyword)
# Infectious Disease (J80)
# infectious disease (C0042029)
# infectious disease, unspecified (C0042029)
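Why these three labels collapse into a single annotation can be illustrated without the library. The sketch below uses a deliberately simplified, hypothetical tokenize function (lowercasing, splitting on non-word characters, dropping stopwords); the real tokenizer and normalization may differ, so treat it only as an illustration of the "same sequence of tokens" rule.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def tokenize(label: str, stopwords: Set[str]) -> Tuple[str, ...]:
    """Simplified normalization: lowercase, split on non-word characters, drop stopwords."""
    words = re.findall(r"\w+", label.lower())
    return tuple(word for word in words if word not in stopwords)


def group_by_tokens(
    labels: Iterable[str], stopwords: Set[str]
) -> Dict[Tuple[str, ...], List[str]]:
    """Group labels that produce the same token sequence."""
    groups: Dict[Tuple[str, ...], List[str]] = defaultdict(list)
    for label in labels:
        groups[tokenize(label, stopwords)].append(label)
    return dict(groups)


labels = [
    "Infectious Disease",
    "infectious disease",
    "infectious disease, unspecified",
]
groups = group_by_tokens(labels, stopwords={"unspecified"})
for tokens, members in groups.items():
    print(tokens, members)
```

All three labels end up under the single key ("infectious", "disease"), which mirrors why the matcher reports one annotation carrying three keywords.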
Overlapping and ancestors
In a knowledge base, labels can share the same prefix. For example, the keywords “lung” and “lung cancer” have the same prefix “lung”. “lung” is called an ancestor of “lung cancer” because the iamsystem algorithm constructs a graph representation of keywords. Note that an ancestor is not defined by a binary relation (e.g. subsumption) that could exist in the knowledge base; it only means that two keywords have a common prefix.
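The prefix-based notion of ancestor can be sketched in a few lines. The functions is_ancestor and ancestors are hypothetical helpers written for this illustration, and whitespace splitting stands in for the real tokenizer.

```python
from typing import List, Sequence


def is_ancestor(candidate: Sequence[str], keyword: Sequence[str]) -> bool:
    """True if `candidate` is a strict token prefix of `keyword`."""
    return (
        len(candidate) < len(keyword)
        and list(keyword[: len(candidate)]) == list(candidate)
    )


def ancestors(keyword: str, vocabulary: List[str]) -> List[str]:
    """Return the vocabulary entries whose tokens are a strict prefix of `keyword`."""
    tokens = keyword.split()
    return [other for other in vocabulary if is_ancestor(other.split(), tokens)]


vocabulary = ["lung", "lung cancer", "cancer"]
print(ancestors("lung cancer", vocabulary))  # ['lung']
```

Here “cancer” is not an ancestor of “lung cancer”: it shares a token with it but not a prefix.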
Full overlapping
Definition: let a1 and a2 be two annotations. If a1.start <= a2.start and a1.end >= a2.end then we say that a1 fully overlaps a2. Furthermore, if a1 has all the tokens of a2 then a2 is called a nested annotation. By default, the matcher removes nested annotations. For example:
from iamsystem import Matcher

matcher = Matcher.build(keywords=["lung", "lung cancer"], w=1)
text = "Presence of a lung cancer"
annots = matcher.annot_text(text=text)
for annot in annots:
    print(annot)
# lung cancer 14 25 lung cancer
matcher.remove_nested_annots = False
annots_2 = matcher.annot_text(text=text)
for annot in annots_2:
    print(annot)
# lung 14 18 lung
# lung cancer 14 25 lung cancer
Another example where the first annotation fully overlaps the second but the latter is not a nested annotation:
from iamsystem import Matcher

matcher = Matcher.build(
    keywords=["North America", "South America"], w=3
)
text = "North and South America"
annots = matcher.annot_text(text=text)
for annot in annots:
    print(annot)
# North America 0 5;16 23 North America
# South America 10 23 South America
The first annotation, starting at offset 0 and ending at offset 23, fully overlaps the second. However, it doesn’t have all the tokens of the second annotation (it lacks “South”), thus the second annotation is not a nested annotation and it’s not removed. The Brat format shows that the North America keyword is matched to a discontinuous sequence of tokens in the document.
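The two examples above can be replayed with a small stand-alone sketch. The Span class and the functions below are hypothetical stand-ins for illustration, not iamsystem's Annotation class or its removal function; the token indices are positions of the matched words in each document.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-in for an annotation: offsets plus matched token indices."""
    label: str
    start: int
    end: int
    tokens: FrozenSet[int]


def fully_overlaps(a1: Span, a2: Span) -> bool:
    """a1 starts at or before a2 and ends at or after it."""
    return a1.start <= a2.start and a1.end >= a2.end


def is_nested(a2: Span, a1: Span) -> bool:
    """a2 is nested in a1 if a1 fully overlaps it and has all of its tokens.

    A strict subset is used so that, in this sketch, two spans with identical
    token sets are not treated as nested in each other.
    """
    return fully_overlaps(a1, a2) and a2.tokens < a1.tokens


def remove_nested(spans: List[Span]) -> List[Span]:
    """Keep only the spans that are not nested in any other span."""
    return [s for s in spans if not any(is_nested(s, other) for other in spans)]


# "Presence of a lung cancer": lung is token 3, cancer is token 4.
lung = Span("lung", 14, 18, frozenset({3}))
lung_cancer = Span("lung cancer", 14, 25, frozenset({3, 4}))
print([s.label for s in remove_nested([lung, lung_cancer])])

# "North and South America": North=0, and=1, South=2, America=3.
north = Span("North America", 0, 23, frozenset({0, 3}))
south = Span("South America", 10, 23, frozenset({2, 3}))
print(fully_overlaps(north, south), is_nested(south, north))
print([s.label for s in remove_nested([north, south])])
```

In the first case only “lung cancer” survives; in the second, both spans are kept because “South America” is fully overlapped but not nested.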
Under the hood, the rm_nested_annots function is called to remove nested annotations. Ancestors are a frequent cause of nested annotations but not the only one. This function can remove nested annotations while keeping ancestors. Whether to remove or keep ancestors depends on your use case. In a semantic annotation task, only the longest terms should be kept, so ancestors need to be removed. In an information retrieval task, ancestors could be kept in the index.
Partial overlapping
Definition: let a1 and a2 be two annotations. If a1.start < a2.start and a2.start < a1.end then we say that a1 partially overlaps a2.
from iamsystem import Matcher

matcher = Matcher.build(keywords=["lung cancer", "cancer prognosis"])
annots = matcher.annot_text(text="lung cancer prognosis")
for annot in annots:
    print(annot)
# lung cancer 0 11 lung cancer
# cancer prognosis 5 21 cancer prognosis
The first annotation partially overlaps the second because it ends after the second starts. In this example, both annotations share the “cancer” token.
As in the previous example, the rm_nested_annots function has no effect here, because neither annotation is nested in the other.
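The partial-overlap definition translates directly into a predicate. This is a literal transcription of the definition above on plain (start, end) tuples; the name partially_overlaps is a hypothetical helper, not part of the library.

```python
from typing import Tuple

Offsets = Tuple[int, int]  # (start, end)


def partially_overlaps(a1: Offsets, a2: Offsets) -> bool:
    """a1 starts strictly before a2, and a2 starts before a1 ends."""
    return a1[0] < a2[0] and a2[0] < a1[1]


lung_cancer = (0, 11)
cancer_prognosis = (5, 21)
print(partially_overlaps(lung_cancer, cancer_prognosis))  # True
print(partially_overlaps(cancer_prognosis, lung_cancer))  # False
```

As written, the relation is not symmetric: the annotation that starts first is the one said to partially overlap the other.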