Matcher
The simplest use case is searching for a list of words in a document. Matcher is the main public API of this package for that task. I recommend using the Matcher.build method to simplify its construction:
With a list of words (keywords)
from iamsystem import Matcher

matcher = Matcher.build(
    keywords=["acute respiratory distress syndrome", "diarrrhea"]
)
annots = matcher.annot_text(
    text="Pt c/o Acute Respiratory Distress Syndrome and diarrrhea"
)
for annot in annots:
    print(annot)
# Acute Respiratory Distress Syndrome 7 42 acute respiratory distress syndrome
# diarrrhea 47 56 diarrrhea
The matcher outputs a list of Annotation objects. By default, it performs exact matching only. A limitation of passing plain words to the matcher is that no attributes are associated with them.
With a list of entities
Often, keywords are derived from a knowledge graph that associates each label with a unique identifier. The Entity class has a kb_id attribute to store this identifier.
from iamsystem import Entity
from iamsystem import Matcher

ent1 = Entity(label="acute respiratory distress syndrome", kb_id="J80")
ent2 = Entity(label="diarrrhea", kb_id="R19.7")
text = "Pt c/o acute respiratory distress syndrome and diarrrhea"
matcher = Matcher.build(keywords=[ent1, ent2])
annots = matcher.annot_text(text=text)
for annot in annots:
    print(annot)
# acute respiratory distress syndrome 7 42 acute respiratory distress syndrome (J80)
# diarrrhea 47 56 diarrrhea (R19.7)
With a custom keyword subclass
If you need to add other attributes to a keyword, you can create your own IKeyword implementation. The example below subclasses IEntity.
from iamsystem import Entity
from iamsystem import IEntity
from iamsystem import Matcher


class MyKeyword(IEntity):
    def __init__(
        self, label: str, category: str, kb_name: str, uri: str
    ):
        """label is the only mandatory attribute."""
        self.label = label
        self.kb_name = kb_name
        self.category = category
        self.kb_id = uri

    def __str__(self):
        """Called by print(annot)"""
        return f"{self.kb_id}"


ent1 = MyKeyword(
    label="acute respiratory distress syndrome",
    category="disease",
    kb_name="wikipedia",
    uri="https://www.wikidata.org/wiki/Q344873",
)
ent2 = Entity(label="diarrrhea", kb_id="R19.7")
text = "Pt c/o acute respiratory distress syndrome and diarrrhea"
matcher = Matcher.build(keywords=[ent1, ent2])
annots = matcher.annot_text(text=text)
for annot in annots:
    print(annot)
# acute respiratory distress syndrome 7 42 https://www.wikidata.org/wiki/Q344873
# diarrrhea 47 56 diarrrhea (R19.7)
Note that you can mix different keyword types in the same matcher, as done here with MyKeyword and Entity.
Context window (w)
The iamsystem algorithm tries to match a sequence of tokens in a document to a sequence of tokens in a keyword/term. The w parameter determines how discontinuous the sequence of tokens can be. The default, w=1, means that the sequence must be continuous.
Let’s say we want to detect the keyword “calcium level” in a document. With w=1, the matcher wouldn’t find the keyword in “calcium blood level” since the sequence of tokens in the document is discontinuous. One solution would be to add “blood” to the stopwords list, but that is a bad solution if “blood” is used by another keyword. Another solution is to set w=2, which lets the algorithm search up to 2 words after the token “calcium”.
from iamsystem import Matcher

matcher = Matcher.build(keywords=["calcium level"], w=2)
annots = matcher.annot_text(text="calcium blood level")
for annot in annots:
    print(annot)
# calcium level 0 7;14 19 calcium level
The semicolon indicates that the sequence is discontinuous. The first token “calcium” starts at character 0 and ends at character 6 (7-1). The second token “level” starts at character 14 and ends at character 18 (19-1).
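To make the role of w and the offset format more concrete, here is a minimal pure-Python sketch of window-based matching. It is a toy illustration under simplifying assumptions (regex tokenization, lowercasing, greedy first match), not iamsystem's actual implementation; the names `tokenize`, `format_spans` and `find_keyword` are hypothetical and are not part of the library.

```python
import re
from typing import List, Optional, Tuple

Token = Tuple[str, int, int]  # (lowercased text, start offset, end offset)


def tokenize(text: str) -> List[Token]:
    """Split on word characters and keep each token's character offsets."""
    return [(m.group().lower(), m.start(), m.end())
            for m in re.finditer(r"\w+", text)]


def format_spans(doc: List[Token], matched: List[int]) -> str:
    """Merge adjacent matched tokens into one span; separate gaps with ';'."""
    spans: List[List[int]] = []
    for pos, idx in enumerate(matched):
        _, start, end = doc[idx]
        if pos > 0 and idx == matched[pos - 1] + 1:
            spans[-1][1] = end  # token is adjacent: extend the current span
        else:
            spans.append([start, end])  # gap: open a new span
    return ";".join(f"{s} {e}" for s, e in spans)


def find_keyword(text: str, keyword: str, w: int = 1) -> Optional[str]:
    """Return the offsets of the first match of `keyword`, or None.

    After a token is matched, the next keyword token must occur within
    the following `w` document tokens (w=1 means strictly adjacent).
    """
    doc = tokenize(text)
    kw = [tok for tok, _, _ in tokenize(keyword)]
    if not kw:
        return None
    for i, (tok, _, _) in enumerate(doc):
        if tok != kw[0]:
            continue
        matched = [i]
        for wanted in kw[1:]:
            last = matched[-1]
            window = range(last + 1, min(last + 1 + w, len(doc)))
            nxt = next((j for j in window if doc[j][0] == wanted), None)
            if nxt is None:
                break  # next keyword token not found inside the window
            matched.append(nxt)
        else:
            return format_spans(doc, matched)
    return None


print(find_keyword("calcium blood level", "calcium level", w=1))  # None
print(find_keyword("calcium blood level", "calcium level", w=2))  # 0 7;14 19
print(find_keyword("calcium level", "calcium level"))             # 0 13
```

In this sketch a continuous match collapses into a single start/end pair, while each gap opens a new pair after a semicolon, which mirrors the output format described above.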
Unidirectional detection
Word order is important. When the words in the document do not appear in the same order as the words of the keyword, the algorithm fails to detect it. For example:
from iamsystem import Matcher
matcher = Matcher.build(keywords=["calcium level"], w=2)
annots = matcher.annot_text(text="level calcium")
print(len(annots)) # 0
This problem can be solved by changing the order of the tokens in a sentence, which is the responsibility of the tokenizer. See the Tokenizer section on Change tokens order.
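The idea behind reordering can be sketched in plain Python: if the tokens of both the keyword and the document are put into the same canonical order, the original word order no longer matters. The sketch below assumes alphabetical sorting as that canonical order, which is only one possible choice; `ordered_tokens` and `contains_run` are hypothetical helpers and do not belong to iamsystem.

```python
import re
from typing import List


def plain_tokens(text: str) -> List[str]:
    """Lowercase and tokenize, keeping the original word order."""
    return re.findall(r"\w+", text.lower())


def ordered_tokens(text: str) -> List[str]:
    """Lowercase and tokenize, then sort the tokens alphabetically."""
    return sorted(plain_tokens(text))


def contains_run(doc_tokens: List[str], kw_tokens: List[str]) -> bool:
    """True if kw_tokens occur as a contiguous run in doc_tokens (like w=1)."""
    n = len(kw_tokens)
    return n > 0 and any(
        doc_tokens[i:i + n] == kw_tokens
        for i in range(len(doc_tokens) - n + 1)
    )


keyword = "calcium level"
text = "level calcium"

# Original order: the keyword's token sequence is not found.
plain = contains_run(plain_tokens(text), plain_tokens(keyword))
# Canonical (sorted) order on both sides: the sequences now agree.
reordered = contains_run(ordered_tokens(text), ordered_tokens(keyword))

print(plain)      # False
print(reordered)  # True
```

One consequence of this design is that, in a longer sentence, sorting can place unrelated tokens between the tokens of a keyword, so a larger context window may be needed alongside the reordering.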