Fuzzy Algorithms
Introduction
iamsystem algorithm tries to match a sequence of tokens in a document to a sequence of tokens in a keyword. The default fuzzy algorithm of the Matcher class is the exact match algorithm. In general, in entity linking tasks, exact matching has high precision but low recall since a single character difference in a token can lead to a miss.
In this package, a fuzzy algorithm is an algorithm that is a called for each token in a document and can return one or more synonym, i.e. another string with the same meaning. The combination of several fuzzy algorithms offers great flexibility in the matching strategy, it increases recall but can also decrease precision.
This package doesn’t contain any implementation of approximate string matching algorithms, it relies on and wraps external libraries to do so. Some external libraries are not in the requirement file of this package, so you will need to install them manually depending on the fuzzy algorithm you wish to add.
Which fuzzy algorithm to choose
The set of fuzzy algorithms is configured by the user. Which one to add depends heavily on your documents and the keywords you want to detect.
If your documents contain a lot of typos, String Distance algorithms can help. If your documents contain a lot of abbreviations, it’s useful to have a sense inventory and add abbreviations to the Abbreviations class. If your documents and keywords contain inflected forms (singular, plurial, conjugated form), it is useful to add a normalization method (lemmatization, stemming) with the WordNormalizer class. If your keywords contain regular expressions, the FuzzyRegex class takes care of that.
Remember that for each token in the document, all fuzzy algorithms added to the Matcher will be called, so the more algorithms you add, the slower iamsystem. However, algorithms that are context independant can be cached to avoid calling them multiple times.
Abbreviations
The Abbreviations class allows you to provide a sense inventory of abbreviations to the matcher.
1from iamsystem import Entity
2from iamsystem import Matcher
3
4ent1 = Entity(label="acute respiratory distress", kb_id="J80")
5ent2 = Entity(label="patient", kb_id="D007290")
6ent3 = Entity(label="patient hospitalized", kb_id="D007297")
7ent4 = Entity(label="physiotherapy", kb_id="D007297")
8matcher = Matcher.build(
9 keywords=[ent1, ent2, ent3, ent4],
10 abbreviations=[
11 ("Pt", "patient"),
12 ("PT", "physiotherapy"),
13 ("ARD", "Acute Respiratory Distress"),
14 ],
15)
16annots = matcher.annot_text(
17 text="Pt hospitalized with ARD. Treament: PT"
18)
19for annot in annots:
20 print(annot.to_string(debug=True))
21# Pt hospitalized 0 15 patient hospitalized (D007297) pt(abbs);hospitalized(exact) # noqa
22# ARD 21 24 acute respiratory distress (J80) ard(abbs)
23# PT 36 38 patient (D007290) pt(abbs)
24# PT 36 38 physiotherapy (D007297) pt(abbs)
Note the following:
The first word “Pt” is associated with a single annotation.
Since “hospitalized” comes after the abbreviation and since the matcher removes nested annotation by default (See Full overlapping), the ambiguity is removed.
The last word “PT” has two annotations
The Abbreviations is context independent and cannot resolve the ambiguity here. To solve this problem, the annotations need to be post-processed (rules, language models…) to identify the most likely long form.
In the case where two abbreviations have different string cases (Pt stands only for patient and PT for physiotherapy), the Abbreviations class can be configured to be case sensitive. The Abbreviations class can be configured with a method that checks if the document’s token is an abbreviation or not:
1from iamsystem import Abbreviations
2from iamsystem import Entity
3from iamsystem import Matcher
4from iamsystem import TokenT
5from iamsystem import english_tokenizer
6
7def upper_case_only(token: TokenT) -> bool:
8 """Return True if all token's characters are uppercase."""
9 return token.label.isupper()
10
11def first_letter_capitalized(token: TokenT) -> bool:
12 """Return True if the first letter is uppercase."""
13 return token.label[0].isupper() and not token.label.isupper()
14
15tokenizer = english_tokenizer()
16ent1 = Entity(label="acute respiratory distress", kb_id="J80")
17ent2 = Entity(label="patient", kb_id="D007290")
18ent3 = Entity(label="patient hospitalized", kb_id="D007297")
19ent4 = Entity(label="physiotherapy", kb_id="D007297")
20matcher = Matcher.build(
21 keywords=[ent1, ent2, ent3, ent4], tokenizer=tokenizer
22)
23
24abbs_upper = Abbreviations(
25 name="upper case abbs", token_is_an_abbreviation=upper_case_only
26)
27abbs_upper.add(
28 short_form="PT", long_form="physiotherapy", tokenizer=tokenizer
29)
30abbs_upper.add(
31 short_form="ARD",
32 long_form="Acute Respiratory Distress",
33 tokenizer=tokenizer,
34)
35abbs_capitalized = Abbreviations(
36 name="capitalized abbs",
37 token_is_an_abbreviation=first_letter_capitalized,
38)
39abbs_capitalized.add(
40 short_form="Pt", long_form="patient", tokenizer=tokenizer
41)
42matcher.add_fuzzy_algo(fuzzy_algo=abbs_upper)
43matcher.add_fuzzy_algo(fuzzy_algo=abbs_capitalized)
44annots = matcher.annot_text(
45 text="Pt hospitalized with ARD. Treament: PT"
46)
47for annot in annots:
48 print(annot.to_string(debug=True))
49# Pt hospitalized 0 15 patient hospitalized (D007297) pt(capitalized abbs);hospitalized(exact) # noqa
50# ARD 21 24 acute respiratory distress (J80) ard(upper case abbs)
51# PT 36 38 physiotherapy (D007297) pt(upper case abbs)
Notice that TokenT is a generic token type, so if you use a custom tokenizer (i.e. from an external library like spaCy) you can access custom attributes.
String Distance
This package utilizes the spellwise and pysimstring libraries to access string distance algorithms.
Spellwise
In the example below, iamsystem is configured with two spellwise algorithms: Levenshtein distance which measures the number of edits needed to transform one word into another, and Soundex which is a phonetic algorithm.
1from iamsystem import Entity
2from iamsystem import ESpellWiseAlgo
3from iamsystem import Matcher
4
5ent1 = Entity(label="acute respiratory distress", kb_id="J80")
6matcher = Matcher.build(
7 keywords=[ent1],
8 spellwise=[
9 dict(
10 measure=ESpellWiseAlgo.LEVENSHTEIN,
11 max_distance=1,
12 min_nb_char=5,
13 ),
14 dict(measure=ESpellWiseAlgo.SOUNDEX, max_distance=1),
15 ],
16)
17annots = matcher.annot_text(text="acute resiratory distresssss")
18for annot in annots:
19 print(annot.to_string(debug=True))
20# acute resiratory distresssss 0 28 acute respiratory distress (J80) acute(exact,LEVENSHTEIN,SOUNDEX);resiratory(LEVENSHTEIN);distresssss(SOUNDEX) # noqa
The spellwise parameter of the build function expects an iterable of dictionary. The key-value pairs of a dictionary are passed to the SpellWiseWrapper init function. Since a string distance algorithm is context independent, the build function placed them in a CacheFuzzyAlgos to avoid calling them multiple times. For a list of available Spellwise algorithms, see ESpellWiseAlgo.
String distance algorithms are often used to detect typos in a document. False positives are common since two words could have a short string distance. To avoid calling a string distance algorithm on common words of a language, you can set string_distance_ignored_w parameter:
1from iamsystem import ESpellWiseAlgo
2from iamsystem import Matcher
3
4matcher = Matcher.build(
5 keywords=["poids"],
6 spellwise=[
7 dict(
8 measure=ESpellWiseAlgo.LEVENSHTEIN,
9 max_distance=1,
10 min_nb_char=4,
11 )
12 ],
13)
14annots = matcher.annot_text(text="Absence de poils.")
15for annot in annots:
16 print(annot)
17matcher = Matcher.build(
18 keywords=["poids"],
19 spellwise=[
20 dict(
21 measure=ESpellWiseAlgo.LEVENSHTEIN,
22 max_distance=1,
23 min_nb_char=4,
24 )
25 ],
26 string_distance_ignored_w=["poils"],
27)
28# poils 11 16 poids
29annots_2 = matcher.annot_text(text="Absence de poils.")
30for annot in annots_2:
31 print(annot) # 0
Since poils is one substitution from poids, the algorithm returns a false positive. By telling the algorithm to ignore common words French words like poils, the string distance algorithm is called only for unknown words.
SimString
The pysimstring library provides an API to the fast simstring algorithm implemented in C++. The simstring parameter of the build function expects an iterable of dictionary. The key-value pairs of a dictionary are passed to the SimStringWrapper init function. Since a string distance algorithm is context independent, the build function placed them in a CacheFuzzyAlgos to avoid calling them multiple times.
1from iamsystem import Entity
2from iamsystem import Matcher
3from iamsystem.fuzzy.simstring import ESimStringMeasure
4
5ent1 = Entity(label="acute respiratory distress", kb_id="J80")
6matcher = Matcher.build(
7 keywords=[ent1],
8 simstring=[dict(measure=ESimStringMeasure.COSINE, threshold=0.7)],
9)
10annots = matcher.annot_text(text="acute respiratori disstress")
11for annot in annots:
12 print(annot)
13# acute respiratori disstress 0 27 acute respiratory distress (J80)
Using the cosine similarity and a threshold of 0.7, the tokens respiratori matched to respiratory and disstress matched to distress.
CacheFuzzyAlgos
Fuzzy algorithms that are not context depend can be cached to avoid calling them multiple times. The CacheFuzzyAlgos stores fuzzy algorithms, calls them once and then stores their results.
1from iamsystem import Abbreviations
2from iamsystem import CacheFuzzyAlgos
3from iamsystem import Entity
4from iamsystem import ESpellWiseAlgo
5from iamsystem import Matcher
6from iamsystem import SpellWiseWrapper
7
8ent1 = Entity(label="acute respiratory distress", kb_id="J80")
9matcher = Matcher.build(keywords=[ent1])
10abbs = Abbreviations(name="abbs")
11abbs.add(short_form="a", long_form="acute", tokenizer=matcher)
12test = dict(
13 measure=ESpellWiseAlgo.LEVENSHTEIN, max_distance=1, min_nb_char=5
14)
15levenshtein = SpellWiseWrapper(**test)
16soundex = SpellWiseWrapper(ESpellWiseAlgo.SOUNDEX, max_distance=1)
17cache = CacheFuzzyAlgos()
18for algo in [levenshtein, soundex]:
19 algo.add_words(words=matcher.get_keywords_unigrams())
20 cache.add_algo(algo=algo)
21# cache.add_algo(algo=abbs) ## no need to be this one in cache
22matcher.add_fuzzy_algo(fuzzy_algo=cache)
23matcher.add_fuzzy_algo(fuzzy_algo=abbs)
24annots = matcher.annot_text(text="a resiratory distresssss")
25for annot in annots:
26 print(annot.to_string(debug=True))
27# a resiratory distresssss 0 24 acute respiratory distress (J80) a(abbs);resiratory(LEVENSHTEIN);distresssss(SOUNDEX) # noqa
Note that although we could have put the Abbreviations instance in the cache, it’s not necessary to do so since this algorithm is as fast as the cache. If you use the build function of the matcher, string distance algorithms are automatically cached.
FuzzyRegex
Regular expressions are very useful and can be used with iamsystem. For example, if you want to detect blood test results in electronic health records, such as calcium levels in blood, you can have a regular expression in your keyword: “calcium (^d*[.,]?d*$) mmol/L”. The fuzzy_regex parameter expects an iterable of dictionary. Key-value pairs of the dictionary correspond to FuzzyRegex init function parameters.
The regular expression (^d*[.,]?d*$) is placed in the FuzzyRegex instance, with a patter name (ex: numval), and the pattern name is placed in the keyword (“calcium numval mmol/L”).
1from iamsystem import Matcher
2from iamsystem import english_tokenizer
3from iamsystem import split_find_iter_closure
4
5tokenizer = english_tokenizer()
6tokenizer.split = split_find_iter_closure(pattern=r"(\w|\.|,)+")
7matcher = Matcher.build(
8 keywords=["calcium numval mmol/L"],
9 tokenizer=tokenizer,
10 stopwords=["level", "is", "normal"],
11 fuzzy_regex=[
12 dict(
13 name="regex_num",
14 pattern=r"^\d*[.,]?\d*$",
15 pattern_name="numval",
16 )
17 ],
18)
19annots = matcher.annot_text(
20 text="the blood calcium level is normal: 2.1 mmol/L"
21)
22for annot in annots:
23 print(annot)
24# calcium 2.1 mmol L 10 17;35 45 calcium numval mmol/L
Note that the Default split function must be modified to detect decimal values. Also note that the label of the keyword “calcium numval mmol/L” (line 7) contains the same pattern name numval. When the fuzzy algorithm receives the token value 2.1, it finds that it matches its regular expression and returns the pattern name numval.
In the example above, stopwords have been added, otherwise the algorithm wouldn’t have found the keyword with a context window of 1. It’s often the case that intermediate words are not known in avance, so this method wouldn’t work. Another way to do exactly the same annotation is to use the NegativeStopwords class which ignores all unigrams that are not in the keywords:
1from iamsystem import Matcher
2from iamsystem import english_tokenizer
3from iamsystem import split_find_iter_closure
4
5tokenizer = english_tokenizer()
6tokenizer.split = split_find_iter_closure(pattern=r"(\w|\.|,)+")
7matcher = Matcher.build(
8 keywords=["calcium numval mmol/L"],
9 tokenizer=tokenizer,
10 negative=True,
11 fuzzy_regex=[
12 dict(
13 name="regex_num",
14 pattern=r"^\d*[.,]?\d*$",
15 pattern_name="numval",
16 )
17 ],
18)
19annots = matcher.annot_text(
20 text="the blood calcium level is normal: 2.1 mmol/L"
21)
22for annot in annots:
23 print(annot)
24# calcium 2.1 mmol L 10 17;35 45 calcium numval mmol/L
WordNormalizer
Word normalization is a common pre-processing step in NLP. The idea is to group words that have the same normalized form; for example “eating”, “eats”… have the same canonical form “eat”.
The WordNormalizer offers the possibility to add a normalization function. A token in a document will match a token in a keyword if they have the same normalized form.
In the example below, nltk is used to access a French stemmer. The stemming function is given to the WordNormalizer class:
1from nltk.stem.snowball import FrenchStemmer
2
3from iamsystem import Entity
4from iamsystem import Matcher
5from iamsystem import french_tokenizer
6
7ent1 = Entity(label="cancer de la prostate", kb_id="C61")
8stemmer = FrenchStemmer()
9matcher = Matcher.build(
10 keywords=[ent1],
11 tokenizer=french_tokenizer(),
12 stopwords=["de", "la"],
13 normalizers=[dict(name="french_stemmer", norm_fun=stemmer.stem)],
14)
15annots = matcher.annot_text(text="cancer prostatique")
16for annot in annots:
17 print(annot)
18# cancer prostatique 0 18 cancer de la prostate (C72)
Abstract Base classes
You might be interested in the fuzzy algorithms abstract base classes if you want to create a new custom fuzzy algorithm. The hierarchy is the following:
Implements this class to create a context dependent algorithm. For each token for which a synonym is expected, the context words and the algorithm’s states are available.
Implements this class to create a context-free algorithm that depends only on the current token. The class has access to the generic token for which a synonym is expected. Examples of such algorithms: FuzzyRegex, Abbreviations.
Implements this class to create a context-free algorithm that depends only on the normalized form of the token. The class has access to the normalized label of the token for which a synonym is expected. These algorithms can be cached with CacheFuzzyAlgos. Examples of such algorithms: String Distance, WordNormalizer.