Matcher

The simplest use case is to search for a list of words in a document. Matcher is the main public API of this package for doing so. I recommend using the Matcher build method to simplify its construction:

With a list of words (keywords)

from iamsystem import Matcher

matcher = Matcher.build(
    keywords=["acute respiratory distress syndrome", "diarrrhea"]
)
annots = matcher.annot_text(
    text="Pt c/o Acute Respiratory Distress " "Syndrome and diarrrhea"
)
for annot in annots:
    print(annot)
# Acute Respiratory Distress Syndrome	7 42	acute respiratory distress syndrome
# diarrrhea	47 56	diarrrhea

The matcher outputs a list of Annotation objects. By default, it performs exact matching only. A limitation of passing plain words to the matcher is that no attributes are associated with them.

With a list of entities

Often, keywords are derived from a knowledge graph that associates a label with a unique identifier. The Entity class has a kb_id attribute to store this identifier.

from iamsystem import Entity
from iamsystem import Matcher

ent1 = Entity(label="acute respiratory distress syndrome", kb_id="J80")
ent2 = Entity(label="diarrrhea", kb_id="R19.7")
text = "Pt c/o acute respiratory distress syndrome and diarrrhea"
matcher = Matcher.build(keywords=[ent1, ent2])
annots = matcher.annot_text(text=text)
for annot in annots:
    print(annot)
# acute respiratory distress syndrome	7 42	acute respiratory distress syndrome (J80)
# diarrrhea	47 56	diarrrhea (R19.7)

With a custom keyword subclass

If you need to add other attributes to a keyword, you can create your own IKeyword implementation.

from iamsystem import Entity
from iamsystem import IEntity
from iamsystem import Matcher

class MyKeyword(IEntity):
    def __init__(
        self, label: str, category: str, kb_name: str, uri: str
    ):
        """label is the only mandatory attribute."""
        self.label = label
        self.kb_name = kb_name
        self.category = category
        self.kb_id = uri

    def __str__(self):
        """Called by print(annot)"""
        return f"{self.kb_id}"

ent1 = MyKeyword(
    label="acute respiratory distress syndrome",
    category="disease",
    kb_name="wikipedia",
    uri="https://www.wikidata.org/wiki/Q344873",
)
ent2 = Entity(label="diarrrhea", kb_id="R19.7")
text = "Pt c/o acute respiratory distress syndrome and diarrrhea"
matcher = Matcher.build(keywords=[ent1, ent2])
annots = matcher.annot_text(text=text)
for annot in annots:
    print(annot)
# acute respiratory distress syndrome	7 42	https://www.wikidata.org/wiki/Q344873
# diarrrhea	47 56	diarrrhea (R19.7)

Note that you can mix different keyword types in the same matcher.

Context window (w)

The iamsystem algorithm tries to match a sequence of tokens in a document to a sequence of tokens in a keyword/term. The w parameter determines how discontinuous the sequence of tokens can be. The default, w=1, means that the sequence must be continuous.

Let’s say we want to detect the keyword “calcium level” in a document. With w=1, the matcher wouldn’t find the keyword in “calcium blood level” since the sequence of tokens in the document is discontinuous. One solution would be to add “blood” to the Stopwords list, but that would be a bad solution if “blood” is used by another keyword. Another solution is to set w=2, which lets the algorithm search up to 2 words after the token “calcium”.

from iamsystem import Matcher

matcher = Matcher.build(keywords=["calcium level"], w=2)
annots = matcher.annot_text(text="calcium blood level")
for annot in annots:
    print(annot)
# calcium level	0 7;14 19	calcium level

The semicolon indicates that the sequence is discontinuous. The first token “calcium” starts at character 0 and ends at character 6 (7-1). The second token “level” starts at character 14 and ends at character 18 (19-1).
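To make the role of w and the offset notation more concrete, here is a minimal, self-contained sketch. It is not iamsystem's implementation: the helpers tokenize, match_with_window and format_spans are hypothetical names introduced for illustration, the tokenizer simply splits on word characters, and w is assumed to be the maximum distance, counted in tokens, between two consecutive matched tokens.

```python
import re
from typing import List, Optional, Tuple

Token = Tuple[str, int, int]  # (lowercased text, start offset, exclusive end offset)


def tokenize(text: str) -> List[Token]:
    """Hypothetical tokenizer: split on word characters and keep offsets."""
    return [(m.group().lower(), m.start(), m.end()) for m in re.finditer(r"\w+", text)]


def match_with_window(keyword: str, doc: List[Token], w: int = 1) -> Optional[List[int]]:
    """Return the document token indices of the first in-order match, or None."""
    kw = [tok for tok, _, _ in tokenize(keyword)]
    if not kw:
        return None
    for first, (tok, _, _) in enumerate(doc):
        if tok != kw[0]:
            continue
        indices = [first]
        for kw_tok in kw[1:]:
            last = indices[-1]
            # Assumption: look at most w tokens ahead of the last matched token.
            window = range(last + 1, min(last + 1 + w, len(doc)))
            nxt = next((i for i in window if doc[i][0] == kw_tok), None)
            if nxt is None:
                break
            indices.append(nxt)
        else:
            # Every keyword token was found within its window.
            return indices
    return None


def format_spans(doc: List[Token], indices: List[int]) -> str:
    """Merge adjacent tokens into one span; separate discontinuous spans with ';'."""
    spans: List[List[int]] = []
    for prev, i in zip([None] + indices, indices):
        _, start, end = doc[i]
        if prev is not None and i == prev + 1:
            spans[-1][1] = end  # adjacent token: extend the current span
        else:
            spans.append([start, end])  # gap: open a new span
    return ";".join(f"{s} {e}" for s, e in spans)


doc = tokenize("calcium blood level")
print(match_with_window("calcium level", doc, w=1))  # None
indices = match_with_window("calcium level", doc, w=2)
print(format_spans(doc, indices))  # 0 7;14 19
```

With w=1 the only candidate after “calcium” is “blood”, so the sketch finds nothing. With w=2 it also considers “level” and reports two spans because the matched tokens are not adjacent.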

Unidirectional detection

Word order is important. When the words in the document do not appear in the same order as in the keyword, the algorithm fails to detect the keyword. For example:

from iamsystem import Matcher

matcher = Matcher.build(keywords=["calcium level"], w=2)
annots = matcher.annot_text(text="level calcium")
print(len(annots))  # 0

This problem can be solved by changing the order of the tokens in a sentence, which is the responsibility of the tokenizer. See the Tokenizer section on Change tokens order.
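The idea behind reordering can be illustrated without iamsystem. The sketch below assumes alphabetical ordering applied to both the keyword tokens and the sentence tokens. The function tokenize_and_order is a hypothetical helper and not part of the library's API. Each token keeps its original character offsets, so a match found on the reordered sequence can still be mapped back to the text.

```python
import re
from typing import List, Tuple

Token = Tuple[str, int, int]  # (lowercased text, start offset, exclusive end offset)


def tokenize_and_order(sentence: str) -> List[Token]:
    """Hypothetical helper: tokenize, then sort tokens alphabetically."""
    tokens = [
        (m.group().lower(), m.start(), m.end())
        for m in re.finditer(r"\w+", sentence)
    ]
    return sorted(tokens, key=lambda tok: tok[0])


keyword = [tok for tok, _, _ in tokenize_and_order("calcium level")]
document = tokenize_and_order("level calcium")

print(keyword)   # ['calcium', 'level']
print(document)  # [('calcium', 6, 13), ('level', 0, 5)]
# After ordering, both sequences list their tokens in the same order.
print(keyword == [tok for tok, _, _ in document])  # True
```

Because both sides are sorted the same way, “level calcium” and “calcium level” yield the same token sequence. The cost is that adjacency in the original sentence is lost, so the window w then applies to the reordered sequence.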