Stopwords

Default Stopwords

It can be useful to remove stopwords, i.e. words that are not relevant for finding a match. For example, the words ‘unspecified’ and ‘NOS’ (Not Otherwise Specified) are frequently used in medical terminologies to denote an entity that has been incompletely characterized.

    from iamsystem import Matcher, Term, english_tokenizer

    tokenizer = english_tokenizer()
    matcher = Matcher(tokenizer=tokenizer)
    # Ignore the word "unspecified" when matching.
    matcher.add_stopwords(words=["unspecified"])
    term = Term(label="Essential hypertension, unspecified", code="I10.9")
    matcher.add_keywords(keywords=[term])
    text = "Medical history: essential hypertension"
    annots = matcher.annot_text(text=text)
    for annot in annots:
        print(annot)
    # essential hypertension        17 39   Essential hypertension, unspecified (I10.9)
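
To make the effect of a stopword list concrete, here is a minimal plain-Python sketch of the idea. It is not how iamsystem is implemented: the helper names (tokenize, remove_stopwords, contains_sequence) are hypothetical and the tokenizer is a simplification. It only shows why the label can match the shorter mention once ‘unspecified’ is dropped.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (simplified tokenizer)."""
    return re.findall(r"\w+", text.lower())


def remove_stopwords(tokens, stopwords):
    """Return the tokens that are not in the stopword set."""
    return [tok for tok in tokens if tok not in stopwords]


def contains_sequence(haystack, needle):
    """Check whether `needle` occurs as a contiguous run inside `haystack`."""
    n = len(needle)
    return any(haystack[i:i + n] == needle for i in range(len(haystack) - n + 1))


stopwords = {"unspecified"}

# Stopwords are removed from both the label and the document.
label_tokens = remove_stopwords(
    tokenize("Essential hypertension, unspecified"), stopwords
)
text_tokens = remove_stopwords(
    tokenize("Medical history: essential hypertension"), stopwords
)

found = contains_sequence(text_tokens, label_tokens)
print(label_tokens, found)
```

Without the stopword list, the label would keep its third token, ‘unspecified’, which the text does not contain, so the sequence check would fail.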

NegativeStopwords

Sometimes it is useful to ignore every word except those that appear in the keywords. For example, suppose we want to find the label “calcium blood” whatever words occur between “calcium” and “blood”, as long as their order is preserved. One solution is to change the Context window (w). Another is to use NegativeStopwords, which treats every word as a stopword except those the user wants to keep:

    from iamsystem import (
        Keyword,
        Matcher,
        NegativeStopwords,
        NoStopwords,
        Terminology,
        english_tokenizer,
    )

    text = "the level of calcium can be measured in the blood."
    termino = Terminology()
    termino.add_keywords(keywords=[Keyword(label="calcium blood")])
    neg_stopwords = NegativeStopwords()
    tokenizer = english_tokenizer()
    # Keep only the unigrams of the keywords; every other word is ignored.
    unigrams = termino.get_unigrams(tokenizer=tokenizer, stopwords=NoStopwords())
    neg_stopwords.add_words(words_to_keep=unigrams)
    matcher = Matcher(tokenizer=tokenizer, stopwords=neg_stopwords)
    matcher.add_keywords(keywords=termino)
    annots = matcher.annot_text(text=text, w=1)
    for annot in annots:
        print(annot)
    # calcium blood 13 20;44 49     calcium blood

Note that you can use the Terminology class to retrieve all the unigrams of your keywords.
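
The inverted logic of negative stopwords can be illustrated with a short plain-Python sketch. This is an assumption-laden illustration, not iamsystem's implementation: tokenize and is_stopword are hypothetical helpers, and the tokenizer is simplified. A word counts as a stopword unless it belongs to the set of words to keep, which is built here from the keyword unigrams.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (simplified tokenizer)."""
    return re.findall(r"\w+", text.lower())


keywords = ["calcium blood"]

# Collect every unigram that appears in a keyword label.
words_to_keep = {tok for label in keywords for tok in tokenize(label)}


def is_stopword(token):
    """Negative stopwords: any token outside `words_to_keep` is a stopword."""
    return token not in words_to_keep


text = "the level of calcium can be measured in the blood."
kept = [tok for tok in tokenize(text) if not is_stopword(tok)]
print(kept)
```

After filtering, only the tokens “calcium” and “blood” remain and they end up adjacent, which is why the keyword can be matched without widening the window.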