Stopwords
Default Stopwords
It can be useful to remove stopwords, i.e. words that are not relevant for finding a match. For example, the words ‘unspecified’ and ‘NOS’ (Not Otherwise Specified) are frequently used in medical terminologies to denote an entity that has been incompletely characterized.
from iamsystem import Entity
from iamsystem import Matcher
from iamsystem import english_tokenizer

ent = Entity(
    label="Essential hypertension, unspecified", kb_id="I10.9"
)
matcher = Matcher.build(
    keywords=[ent],
    tokenizer=english_tokenizer(),
    stopwords=["unspecified"],
)
text = "Medical history: essential hypertension"
annots = matcher.annot_text(text=text)
for annot in annots:
    print(annot)
# essential hypertension 17 39 Essential hypertension, unspecified (I10.9)
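To see why removing a stopword lets the shorter text match the longer label, the idea can be sketched in plain Python. This is only an illustration under simplifying assumptions (a regex tokenizer, exact token comparison); the helper names `tokenize`, `remove_stopwords` and `contains_sequence` are hypothetical and are not part of iamsystem, whose internal algorithm may differ.

```python
import re


def tokenize(text):
    """Lowercase the text and return its alphanumeric tokens (assumed tokenizer)."""
    return re.findall(r"\w+", text.lower())


def remove_stopwords(tokens, stopwords):
    """Drop every token that is listed as a stopword."""
    return [tok for tok in tokens if tok not in stopwords]


def contains_sequence(haystack, needle):
    """Return True if `needle` occurs as a contiguous run inside `haystack`."""
    n = len(needle)
    return any(haystack[i:i + n] == needle for i in range(len(haystack) - n + 1))


stopwords = {"unspecified"}

# The stopword is removed from the label as well as from the text,
# so the label is reduced to "essential hypertension".
label_tokens = remove_stopwords(
    tokenize("Essential hypertension, unspecified"), stopwords
)
text_tokens = remove_stopwords(
    tokenize("Medical history: essential hypertension"), stopwords
)

found = contains_sequence(text_tokens, label_tokens)
print(label_tokens)  # ['essential', 'hypertension']
print(found)         # True
```

Without the stopword list, the label would keep its third token, ‘unspecified’, which never appears in the text, so no match would be found.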
NegativeStopwords
Sometimes it is useful to ignore every word except those that appear in the keywords. For example, we may want to find the label “calcium blood” whatever words occur between “calcium” and “blood”, as long as their order is kept. One solution is to change the context window (w). Another is to use NegativeStopwords, which ignores all words except those that the user wants to keep:
from iamsystem import Matcher

text = "the level of calcium can be measured in the blood."
matcher = Matcher.build(keywords=["calcium blood"], negative=True)
annots = matcher.annot_text(text=text)
for annot in annots:
    print(annot)
# calcium blood 13 20;44 49 calcium blood
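The principle behind negative stopwords can be sketched in plain Python as follows. This is a simplified illustration, not iamsystem's implementation: it assumes a regex tokenizer, and the names `tokenize`, `kept_vocabulary` and `is_stopword` are hypothetical. A token counts as a stopword unless it belongs to the vocabulary of the keywords.

```python
import re


def tokenize(text):
    """Lowercase the text and return its alphanumeric tokens (assumed tokenizer)."""
    return re.findall(r"\w+", text.lower())


keywords = ["calcium blood"]

# Words to keep: every token that appears in at least one keyword.
kept_vocabulary = {tok for kw in keywords for tok in tokenize(kw)}


def is_stopword(token):
    """Negative stopwords: anything outside the keyword vocabulary is ignored."""
    return token not in kept_vocabulary


text = "the level of calcium can be measured in the blood."

# Keep each surviving token together with its character offsets.
kept = [
    (m.group().lower(), m.start(), m.end())
    for m in re.finditer(r"\w+", text)
    if not is_stopword(m.group().lower())
]
print(kept)  # [('calcium', 13, 20), ('blood', 44, 49)]
```

Once every other word is ignored, ‘calcium’ and ‘blood’ become adjacent in the filtered token sequence, which is why the annotation in the example above spans the two discontinuous offsets 13 20 and 44 49.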