Stopwords

Default Stopwords

It can be useful to remove stopwords, i.e. words that are not relevant to finding a match. For example, the words ‘unspecified’ and ‘NOS’ (Not Otherwise Specified) are frequently used in medical terminologies to denote an entity that has been incompletely characterized. Declaring such a word as a stopword lets a keyword match text that omits it:

from iamsystem import Entity
from iamsystem import Matcher
from iamsystem import english_tokenizer

ent = Entity(
    label="Essential hypertension, unspecified", kb_id="I10.9"
)
matcher = Matcher.build(
    keywords=[ent],
    tokenizer=english_tokenizer(),
    stopwords=["unspecified"],
)
text = "Medical history: essential hypertension"
annots = matcher.annot_text(text=text)
for annot in annots:
    print(annot)
# essential hypertension	17 39	Essential hypertension, unspecified (I10.9)
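
To make the idea concrete outside the library, here is a minimal sketch in plain Python. It is not how iamsystem is implemented: the helper names `tokenize` and `find_keyword` are hypothetical, and the regex-based tokenizer is a simplifying assumption. Stopwords are dropped from both the keyword and the text before the token sequences are compared.

```python
import re


def tokenize(text):
    """Return (lowercased token, start offset, end offset) triples."""
    return [(m.group().lower(), m.start(), m.end())
            for m in re.finditer(r"\w+", text)]


def find_keyword(keyword, text, stopwords):
    """Return the (start, end) span of the first match, or None.

    Hypothetical helper: stopwords are removed from the keyword and
    from the text, then the remaining tokens are compared in order.
    """
    stop = {w.lower() for w in stopwords}
    kw_tokens = [t for t, _, _ in tokenize(keyword) if t not in stop]
    doc_tokens = [tok for tok in tokenize(text) if tok[0] not in stop]
    n = len(kw_tokens)
    if n == 0:
        return None
    for i in range(len(doc_tokens) - n + 1):
        window = doc_tokens[i:i + n]
        if [t for t, _, _ in window] == kw_tokens:
            # Span runs from the first matched token to the last one.
            return (window[0][1], window[-1][2])
    return None


label = "Essential hypertension, unspecified"
text = "Medical history: essential hypertension"
print(find_keyword(label, text, stopwords=["unspecified"]))
print(find_keyword(label, text, stopwords=[]))
```

With ‘unspecified’ declared as a stopword, the sketch finds the keyword at the same offsets as the example above; with an empty stopword list it finds nothing, because the text never contains ‘unspecified’.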

NegativeStopwords

Sometimes it is useful to ignore every word except those that appear in the keywords. For example, we may want to find the label “calcium blood” regardless of the words that occur between “calcium” and “blood”, as long as their order is preserved. One solution is to change the Context window (w). Another is to use NegativeStopwords, which ignores all words except those that the user wants to keep:

from iamsystem import Matcher

text = "the level of calcium can be measured in the blood."
matcher = Matcher.build(keywords=["calcium blood"], negative=True)
annots = matcher.annot_text(text=text)
for annot in annots:
    print(annot)
# calcium blood	13 20;44 49	calcium blood
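
The same behaviour can be sketched in plain Python to show what "negative" filtering means. This is an illustration under stated assumptions, not iamsystem's implementation: `tokenize` and `find_with_negative_stopwords` are hypothetical helpers, and the set of words to keep is assumed to be exactly the keyword's own tokens.

```python
import re


def tokenize(text):
    """Return (lowercased token, start offset, end offset) triples."""
    return [(m.group().lower(), m.start(), m.end())
            for m in re.finditer(r"\w+", text)]


def find_with_negative_stopwords(keyword, text):
    """Return the list of (start, end) spans of the matched tokens, or None.

    Hypothetical helper: every text token that is not part of the
    keyword is treated as a stopword, so only keyword tokens remain.
    """
    kw_tokens = [t for t, _, _ in tokenize(keyword)]
    if not kw_tokens:
        return None
    keep = set(kw_tokens)
    kept = [tok for tok in tokenize(text) if tok[0] in keep]
    n = len(kw_tokens)
    for i in range(len(kept) - n + 1):
        window = kept[i:i + n]
        # The order of the keyword tokens must be preserved.
        if [t for t, _, _ in window] == kw_tokens:
            return [(start, end) for _, start, end in window]
    return None


text = "the level of calcium can be measured in the blood."
print(find_with_negative_stopwords("calcium blood", text))
```

Because only the kept tokens are compared, the distance between “calcium” and “blood” does not matter, but a text in which “blood” comes before “calcium” would not match.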