Usage
Matcher
The simplest use case is searching for a list of words in a document. Matcher is the main public API of this package, and its build method is the recommended way to construct it:
With a list of words (keywords)
from iamsystem import Matcher

matcher = Matcher.build(
    keywords=["acute respiratory distress syndrome", "diarrrhea"]
)
annots = matcher.annot_text(
    text="Pt c/o Acute Respiratory Distress Syndrome and diarrrhea"
)
for annot in annots:
    print(annot)
# Acute Respiratory Distress Syndrome 7 42 acute respiratory distress syndrome # noqa
# diarrrhea 47 56 diarrrhea
The matcher outputs a list of Annotation instances. By default, it performs exact matching only. A limitation of passing plain words to the matcher is that no attributes are associated with them.
With a list of entities
Often, keywords are derived from a knowledge graph that associates a label with a unique identifier. The Entity class has a kb_id attribute to store an identifier.
from iamsystem import Entity
from iamsystem import Matcher

ent1 = Entity(label="acute respiratory distress syndrome", kb_id="J80")
ent2 = Entity(label="diarrrhea", kb_id="R19.7")
text = "Pt c/o acute respiratory distress syndrome and diarrrhea"
matcher = Matcher.build(keywords=[ent1, ent2])
annots = matcher.annot_text(text=text)
for annot in annots:
    print(annot)
# acute respiratory distress syndrome 7 42 acute respiratory distress syndrome (J80) # noqa
# diarrrhea 47 56 diarrrhea (R19.7)
With a custom keyword subclass
If you need to add other attributes to a keyword, you can create your own IKeyword implementation. The example below subclasses IEntity to keep the kb_id attribute.
from iamsystem import Entity
from iamsystem import IEntity
from iamsystem import Matcher


class MyKeyword(IEntity):
    def __init__(
        self, label: str, category: str, kb_name: str, uri: str
    ):
        """label is the only mandatory attribute."""
        self.label = label
        self.kb_name = kb_name
        self.category = category
        self.kb_id = uri

    def __str__(self):
        """Called by print(annot)"""
        return f"{self.kb_id}"


ent1 = MyKeyword(
    label="acute respiratory distress syndrome",
    category="disease",
    kb_name="wikipedia",
    uri="https://www.wikidata.org/wiki/Q344873",
)
ent2 = Entity(label="diarrrhea", kb_id="R19.7")
text = "Pt c/o acute respiratory distress syndrome and diarrrhea"
matcher = Matcher.build(keywords=[ent1, ent2])
annots = matcher.annot_text(text=text)
for annot in annots:
    print(annot)
# acute respiratory distress syndrome 7 42 https://www.wikidata.org/wiki/Q344873 # noqa
# diarrrhea 47 56 diarrrhea (R19.7)
Note that you can mix different keyword types in the same matcher.
Context window (w)
The iamsystem algorithm tries to match a sequence of tokens in a document to a sequence of tokens in a keyword/term. The w parameter determines how discontinuous the sequence of tokens can be. The default, w=1, means that the sequence must be continuous.
Let’s say we want to detect the keyword “calcium level” in a document. With w=1, the matcher wouldn’t find the keyword in “calcium blood level” since the sequence of tokens in the document is discontinuous. One solution would be to add “blood” to the Stopwords list; however, if “blood” is used by another keyword, this would be a bad solution. Another solution is to set w=2, which lets the algorithm search up to 2 words after the token “calcium”.
from iamsystem import Matcher

matcher = Matcher.build(keywords=["calcium level"], w=2)
annots = matcher.annot_text(text="calcium blood level")
for annot in annots:
    print(annot)
# calcium level 0 7;14 19 calcium level
The semicolon indicates that the sequence is discontinuous. The first token “calcium” starts at character 0 and ends at character 6 (7-1). The second token “level” starts at character 14 and ends at character 18 (19-1).
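To make the role of w concrete, here is a minimal pure-Python sketch of window-limited, in-order matching. It is not iamsystem's actual algorithm: the function name find_in_window and its greedy search are assumptions made for illustration only.

```python
from typing import List, Optional


def find_in_window(
    doc_tokens: List[str], kw_tokens: List[str], w: int
) -> Optional[List[int]]:
    """Return the document indices matching kw_tokens in order, or None.

    Hypothetical helper: once a token is matched at index i, the next
    keyword token must appear among the w following document tokens.
    """
    for start, token in enumerate(doc_tokens):
        if token != kw_tokens[0]:
            continue
        indices = [start]
        for kw_token in kw_tokens[1:]:
            last = indices[-1]
            window = range(last + 1, min(last + 1 + w, len(doc_tokens)))
            nxt = next((i for i in window if doc_tokens[i] == kw_token), None)
            if nxt is None:
                break  # this keyword token is out of reach
            indices.append(nxt)
        else:
            return indices  # every keyword token was found
    return None


doc = "calcium blood level".split()
keyword = "calcium level".split()
print(find_in_window(doc, keyword, w=1))  # None
print(find_in_window(doc, keyword, w=2))  # [0, 2]
print(find_in_window("level calcium".split(), keyword, w=2))  # None
```

The last call shows that widening the window does not help when the words appear in the wrong order.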
Unidirectional detection
Word order is important. When the words appear in the document in a different order than in the keyword, the algorithm fails to detect the keyword. For example:
from iamsystem import Matcher
matcher = Matcher.build(keywords=["calcium level"], w=2)
annots = matcher.annot_text(text="level calcium")
print(len(annots)) # 0
This problem can be solved by changing the order of the tokens in a sentence, which is the responsibility of the tokenizer. See the Tokenizer section on Change tokens order.
Tokenizer
The iamsystem matcher is highly dependent on how documents and keywords are tokenized and normalized. The ITokenizer is responsible for turning text into tokens. To do so, the TokenizerImp class performs alphanumeric tokenization with two inner functions:
split the text into (start,end) offsets
normalize each token
The english_tokenizer and french_tokenizer are concrete implementations.
Other libraries offer more elaborate tokenizers, and I recommend you use them. To use the tokenizer of another library, you can build an adapter by creating a new implementation of the ITokenizer interface. For example, this package provides a spaCy custom component that consumes spaCy’s tokenizer.
Default split function
By default, the Matcher class calls the french_tokenizer, which splits a document on word characters (a letter, digit or underscore: [a-zA-Z0-9_]).
I recommend that you check the generated tokens to verify that they match your needs. For example:
from iamsystem import english_tokenizer

tokenizer = english_tokenizer()
tokens = tokenizer.tokenize("SARS-CoV+")
for token in tokens:
    print(token)
# Token(label='SARS', norm_label='sars', start=0, end=4, i=0)
# Token(label='CoV', norm_label='cov', start=5, end=8, i=1)
The ‘+’ sign is ignored even though it is important. The split function can be modified as follows:
from iamsystem import english_tokenizer
from iamsystem import split_find_iter_closure

tokenizer = english_tokenizer()
tokenizer.split = split_find_iter_closure(pattern=r"(\w+|\+)")
tokens = tokenizer.tokenize("SARS-CoV+")
for token in tokens:
    print(token)
# Token(label='SARS', norm_label='sars', start=0, end=4, i=0)
# Token(label='CoV', norm_label='cov', start=5, end=8, i=1)
# Token(label='+', norm_label='+', start=8, end=9, i=2)
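To see why the pattern change matters, the two regular expressions can be compared with Python's standard re module alone. This is only a sketch of regex-based splitting: split_offsets is a hypothetical helper, not the package's split function.

```python
import re
from typing import List, Tuple


def split_offsets(pattern: str, text: str) -> List[Tuple[int, int, str]]:
    """Return (start, end, label) for every regex match in text."""
    return [
        (m.start(), m.end(), m.group()) for m in re.finditer(pattern, text)
    ]


print(split_offsets(r"\w+", "SARS-CoV+"))
# [(0, 4, 'SARS'), (5, 8, 'CoV')]
print(split_offsets(r"(\w+|\+)", "SARS-CoV+"))
# [(0, 4, 'SARS'), (5, 8, 'CoV'), (8, 9, '+')]
```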
Change default Tokenizer
To change Matcher’s default tokenizer, pass it to the build method.
from iamsystem import Entity
from iamsystem import Matcher
from iamsystem import english_tokenizer
from iamsystem import split_find_iter_closure

ent1 = Entity(label="SARS-CoV+", kb_id="95209-3")
text = "Pt c/o acute respiratory distress syndrome. RT-PCR sars-cov+"
tokenizer = english_tokenizer()
tokenizer.split = split_find_iter_closure(pattern=r"(\w+|\+)")
matcher = Matcher.build(keywords=[ent1], tokenizer=tokenizer)
annots = matcher.annot_text(text=text)
for annot in annots:
    print(annot)
# sars cov + 51 60 SARS-CoV+ (95209-3)
Default normalize function
You can override the normalize function of a tokenizer to suit your needs. The english_tokenizer normalizes each token by lowercasing it. The french_tokenizer lowercases and removes accents. The only difference between the two is this removal of diacritics, done with the unidecode library, which tries to transliterate the label into ASCII characters. Using the french_tokenizer for English documents adds very little overhead.
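The two normalization strategies can be approximated as follows. This sketch uses the standard unicodedata module as a stand-in for unidecode, which is an assumption: unidecode transliterates more characters (for example ligatures) than this decomposition-based approach. The function names are hypothetical.

```python
import unicodedata


def lowercase(label: str) -> str:
    """Normalization in the spirit of the english_tokenizer."""
    return label.lower()


def lowercase_no_accents(label: str) -> str:
    """Lowercase, then drop combining marks left by NFKD decomposition."""
    decomposed = unicodedata.normalize("NFKD", label.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))


print(lowercase("Diarrhée"))  # diarrhée
print(lowercase_no_accents("Diarrhée"))  # diarrhee
```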
Change tokens order
Word order is important for iamsystem. In the example below, the keyword “blood calcium level” is mentioned, but the tokens are discontinuous and not in the right order. One solution is to order the tokens alphabetically. By doing this, the tokens of the document and the keyword are in the same order. Given a wide enough window, the keyword can be found.
from iamsystem import Matcher
from iamsystem import english_tokenizer

text = "the level of calcium can measured in the blood."
tokenizer = english_tokenizer()
matcher = Matcher.build(
    keywords=["blood calcium level"],
    tokenizer=tokenizer,
    order_tokens=True,
    w=5,
)
annots = matcher.annot_text(text=text)
for annot in annots:
    print(annot)
# level calcium blood 4 9;13 20;41 46 blood calcium level
The order_tokens parameter changes iamsystem’s matching strategy, but it doesn’t change the order of the document’s tokens. This approach is not suitable if the document is very long or the number of keywords is large.
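The idea behind alphabetical ordering can be sketched with plain Python. This illustrates the principle only and makes no claim about how order_tokens is implemented; ordered_tokens and is_subsequence are hypothetical helpers.

```python
import re
from typing import List


def ordered_tokens(text: str) -> List[str]:
    """Lowercase, split on word characters and sort alphabetically."""
    return sorted(re.findall(r"\w+", text.lower()))


def is_subsequence(needle: List[str], haystack: List[str]) -> bool:
    """True if needle's items appear in haystack in the same order."""
    remaining = iter(haystack)
    return all(token in remaining for token in needle)


text = "the level of calcium can measured in the blood."
keyword = "blood calcium level"

# In document order, the keyword's tokens are not found in sequence.
print(is_subsequence(keyword.split(), re.findall(r"\w+", text)))  # False
# Once both sides are sorted, they are.
print(is_subsequence(ordered_tokens(keyword), ordered_tokens(text)))  # True
```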
Stopwords
Default Stopwords
It can be useful to remove stopwords, i.e. words that are not relevant to finding a match. For example, the words ‘unspecified’ and ‘NOS’ (Not Otherwise Specified) are frequently used in medical terminologies to denote an entity that has been incompletely characterized.
from iamsystem import Entity
from iamsystem import Matcher
from iamsystem import english_tokenizer

ent = Entity(
    label="Essential hypertension, unspecified", kb_id="I10.9"
)
matcher = Matcher.build(
    keywords=[ent],
    tokenizer=english_tokenizer(),
    stopwords=["unspecified"],
)
text = "Medical history: essential hypertension"
annots = matcher.annot_text(text=text)
for annot in annots:
    print(annot)
# essential hypertension 17 39 Essential hypertension, unspecified (I10.9) # noqa
NegativeStopwords
Sometimes it’s useful to ignore all the words but those of the keywords. For example, we want to find the label “calcium blood” whatever the words between calcium and blood as long as the order is kept. One solution would be to change the Context window (w). Another solution is to use NegativeStopwords to ignore all words except those that the user wants to keep:
from iamsystem import Matcher

text = "the level of calcium can be measured in the blood."
matcher = Matcher.build(keywords=["calcium blood"], negative=True)
annots = matcher.annot_text(text=text)
for annot in annots:
    print(annot)
# calcium blood 13 20;44 49 calcium blood
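Conceptually, negative stopwords amount to dropping every document token that is not a unigram of some keyword. A minimal sketch of that filtering step, with a hypothetical keep_keyword_tokens helper and a simplified regex tokenizer (both assumptions, not the package's code):

```python
import re
from typing import List, Set


def keep_keyword_tokens(text: str, keyword_unigrams: Set[str]) -> List[str]:
    """Keep only the tokens that belong to at least one keyword."""
    tokens = re.findall(r"\w+", text.lower())
    return [token for token in tokens if token in keyword_unigrams]


text = "the level of calcium can be measured in the blood."
print(keep_keyword_tokens(text, {"calcium", "blood"}))
# ['calcium', 'blood']
```

Once the other words are ignored, “calcium” and “blood” are adjacent, so the default window of 1 is enough.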
Annotation
A Matcher outputs instances of Annotation. The iamsystem algorithm tries to match a sequence of tokens in a document to a sequence of tokens in a keyword/term. An Annotation instance stores the sequence of document tokens matched to one or more keywords. The name of the fuzzy algorithm that matched each token is also stored, for machine learning or debugging purposes.
Annotation’s format
The to_string method returns a string representation containing three tab-separated fields:
A concatenation of the token labels as they appear in the document.
The start-end offsets in the Brat format (start and end are separated by a space; a semicolon separates the offsets of discontinuous tokens).
A string representation of the detected keywords.
For example:
from iamsystem import Entity
from iamsystem import Matcher

ent = Entity(label="infectious disease", kb_id="D007239")
matcher = Matcher.build(
    keywords=[ent], abbreviations=[("infect", "infectious")], w=2
)
text = "Infect mononucleosis disease"
annots = matcher.annot_text(text=text)
for annot in annots:
    print(annot)
    print(annot.to_string(text=text))
    print(annot.to_string(text=text, debug=True))
# Infect disease 0 6;21 28 infectious disease (D007239) # noqa
# Infect disease 0 6;21 28 infectious disease (D007239) Infect mononucleosis disease # noqa
# Infect disease 0 6;21 28 infectious disease (D007239) Infect mononucleosis disease infect(abbs);disease(exact) # noqa
Passing the document to the to_string function adds the document substring that begins at the first token start offset and ends at the last token end offset. If debug equals True, it adds each token’s normalized label and the name(s) of the fuzzy algorithm(s) that detected it.
The method to_dict returns a dictionary representation of an annotation.
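If you post-process the string output, the Brat offsets field is easy to parse back into (start, end) pairs, which are ordinary Python slice bounds. The helper below is a hypothetical sketch and not part of the package:

```python
from typing import List, Tuple


def parse_brat_offsets(offsets: str) -> List[Tuple[int, int]]:
    """Turn '0 6;21 28' into [(0, 6), (21, 28)]."""
    spans = []
    for span in offsets.split(";"):
        start, end = span.split()
        spans.append((int(start), int(end)))
    return spans


text = "Infect mononucleosis disease"
spans = parse_brat_offsets("0 6;21 28")
print(spans)  # [(0, 6), (21, 28)]
print([text[start:end] for start, end in spans])  # ['Infect', 'disease']
```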
Multiple keywords per annotation
An Annotation has multiple keywords if and only if these keywords have the same tokenization output, i.e. the same sequence of tokens. This happens if two terms have the same label but also if the normalization process removes punctuation or if stopwords are ignored. In the example below, only one annotation is produced and it has 3 keywords:
from iamsystem import Entity
from iamsystem import Matcher
from iamsystem import english_tokenizer

ent1 = Entity(label="Infectious Disease", kb_id="J80")
ent2 = Entity(label="infectious disease", kb_id="C0042029")
ent3 = Entity(
    label="infectious disease, unspecified", kb_id="C0042029"
)
matcher = Matcher.build(
    keywords=[ent1, ent2, ent3],
    tokenizer=english_tokenizer(),
    stopwords=["unspecified"],
)
text = "History of infectious disease"
annots = matcher.annot_text(text=text)
annot = annots[0]
for keyword in annot.keywords:
    print(keyword)
# Infectious Disease (J80)
# infectious disease (C0042029)
# infectious disease, unspecified (C0042029)
Overlapping and ancestors
In a knowledge base, labels can share the same prefix. For example, the keywords “lung” and “lung cancer” have the same prefix “lung”. “lung” is called an ancestor of “lung cancer” because the iamsystem algorithm constructs a graph representation of keywords. Note that an ancestor is not defined by a binary relation (e.g. subsumption) that could exist in the knowledge base, but only by two keywords having a common prefix.
Full overlapping
Definition: let a1 and a2 be two annotations. If a1.start <= a2.start and a1.end >= a2.end, then we say that a1 fully overlaps a2. Furthermore, if a1 has all the tokens of a2, then a2 is called a nested annotation. By default, the matcher removes nested annotations. For example:
from iamsystem import Matcher

matcher = Matcher.build(keywords=["lung", "lung cancer"], w=1)
text = "Presence of a lung cancer"
annots = matcher.annot_text(text=text)
for annot in annots:
    print(annot)
# lung cancer 14 25 lung cancer
matcher.remove_nested_annots = False
annots_2 = matcher.annot_text(text=text)
for annot in annots_2:
    print(annot)
# lung 14 18 lung
# lung cancer 14 25 lung cancer
Another example where the first annotation fully overlaps the second but the latter is not a nested annotation:
from iamsystem import Matcher

matcher = Matcher.build(
    keywords=["North America", "South America"], w=3
)
text = "North and South America"
annots = matcher.annot_text(text=text)
for annot in annots:
    print(annot)
# North America 0 5;16 23 North America
# South America 10 23 South America
The first annotation, starting at offset 0 and ending at offset 23, fully overlaps the second. However, it doesn’t have all the tokens of the second annotation, so the second annotation is not a nested annotation and is not removed. The Brat format shows that the North America keyword is a discontinuous sequence of tokens in the document.
Under the hood, the rm_nested_annots function is called to remove nested annotations. Ancestors are a frequent cause of nested annotations but not the only one. This function makes it possible to remove nested annotations while keeping ancestors. Whether to remove or keep ancestors depends on your use case. In a semantic annotation task, only the longest terms must be kept, so the ancestors need to be removed. In an information retrieval task, ancestors could be kept in the index.
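The two notions can be written as small predicates. This is a sketch under stated assumptions: Span is a hypothetical stand-in for an annotation, token positions are represented as a set of indices, and the end comparison is inclusive so that the “North and South America” case (both annotations end at offset 23) counts as a full overlap. It does not reproduce rm_nested_annots.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-in for an annotation."""

    start: int
    end: int
    token_ids: FrozenSet[int]  # positions of the matched tokens


def fully_overlaps(a1: Span, a2: Span) -> bool:
    return a1.start <= a2.start and a1.end >= a2.end


def is_nested(a1: Span, a2: Span) -> bool:
    """True if a2 is nested in a1: full overlap and all tokens shared."""
    return fully_overlaps(a1, a2) and a2.token_ids <= a1.token_ids


# "Presence of a lung cancer": "lung" is token 3, "cancer" is token 4.
lung_cancer = Span(14, 25, frozenset({3, 4}))
lung = Span(14, 18, frozenset({3}))
print(is_nested(lung_cancer, lung))  # True

# "North and South America": tokens 0 to 3.
north_america = Span(0, 23, frozenset({0, 3}))
south_america = Span(10, 23, frozenset({2, 3}))
print(fully_overlaps(north_america, south_america))  # True
print(is_nested(north_america, south_america))  # False
```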
Partial overlapping
Definition: let a1 and a2 be two annotations. If a1.start < a2.start and a2.start < a1.end, then we say that a1 partially overlaps a2.
from iamsystem import Matcher

matcher = Matcher.build(keywords=["lung cancer", "cancer prognosis"])
annots = matcher.annot_text(text="lung cancer prognosis")
for annot in annots:
    print(annot)
# lung cancer 0 11 lung cancer
# cancer prognosis 5 21 cancer prognosis
The first annotation partially overlaps the second because it ends after the second starts. In this example, both annotations share the “cancer” token.
The rm_nested_annots function has no effect here, since neither annotation contains all the tokens of the other.
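The partial overlap definition translates directly into a predicate on (start, end) pairs. The function name is hypothetical and the sketch only restates the definition given in this section:

```python
from typing import Tuple

Offsets = Tuple[int, int]  # (start, end)


def partially_overlaps(a1: Offsets, a2: Offsets) -> bool:
    """a1 starts first and a2 starts before a1 ends."""
    return a1[0] < a2[0] < a1[1]


lung_cancer = (0, 11)
cancer_prognosis = (5, 21)
print(partially_overlaps(lung_cancer, cancer_prognosis))  # True
print(partially_overlaps(cancer_prognosis, lung_cancer))  # False
```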
Fuzzy Algorithms
Introduction
The iamsystem algorithm tries to match a sequence of tokens in a document to a sequence of tokens in a keyword. The default fuzzy algorithm of the Matcher class is the exact match algorithm. In general, in entity linking tasks, exact matching has high precision but low recall, since a single character difference in a token can lead to a miss.
In this package, a fuzzy algorithm is an algorithm that is called for each token in a document and can return one or more synonyms, i.e. other strings with the same meaning. Combining several fuzzy algorithms offers great flexibility in the matching strategy: it increases recall but can also decrease precision.
This package doesn’t contain any implementation of approximate string matching algorithms; it relies on and wraps external libraries to do so. Some external libraries are not in the requirements file of this package, so you will need to install them manually depending on the fuzzy algorithm you wish to add.
Which fuzzy algorithm to choose
The set of fuzzy algorithms is configured by the user. Which one to add depends heavily on your documents and the keywords you want to detect.
If your documents contain a lot of typos, String Distance algorithms can help. If your documents contain a lot of abbreviations, it’s useful to have a sense inventory and add abbreviations to the Abbreviations class. If your documents and keywords contain inflected forms (singular, plural, conjugated forms), it is useful to add a normalization method (lemmatization, stemming) with the WordNormalizer class. If your keywords contain regular expressions, the FuzzyRegex class takes care of that.
Remember that for each token in the document, all fuzzy algorithms added to the Matcher will be called, so the more algorithms you add, the slower iamsystem becomes. However, algorithms that are context independent can be cached to avoid calling them multiple times.
Abbreviations
The Abbreviations class allows you to provide a sense inventory of abbreviations to the matcher.
from iamsystem import Entity
from iamsystem import Matcher

ent1 = Entity(label="acute respiratory distress", kb_id="J80")
ent2 = Entity(label="patient", kb_id="D007290")
ent3 = Entity(label="patient hospitalized", kb_id="D007297")
ent4 = Entity(label="physiotherapy", kb_id="D007297")
matcher = Matcher.build(
    keywords=[ent1, ent2, ent3, ent4],
    abbreviations=[
        ("Pt", "patient"),
        ("PT", "physiotherapy"),
        ("ARD", "Acute Respiratory Distress"),
    ],
)
annots = matcher.annot_text(
    text="Pt hospitalized with ARD. Treament: PT"
)
for annot in annots:
    print(annot.to_string(debug=True))
# Pt hospitalized 0 15 patient hospitalized (D007297) pt(abbs);hospitalized(exact) # noqa
# ARD 21 24 acute respiratory distress (J80) ard(abbs)
# PT 36 38 patient (D007290) pt(abbs)
# PT 36 38 physiotherapy (D007297) pt(abbs)
Note the following:
The first word “Pt” is associated with a single annotation.
Since “hospitalized” comes after the abbreviation and since the matcher removes nested annotations by default (see Full overlapping), the ambiguity is removed.
The last word “PT” has two annotations.
The Abbreviations class is context independent and cannot resolve the ambiguity here. To solve this problem, the annotations need to be post-processed (rules, language models…) to identify the most likely long form.
In the case where two abbreviations have different string cases (Pt stands only for patient and PT for physiotherapy), the Abbreviations class can be configured to be case sensitive. The Abbreviations class can be configured with a method that checks if the document’s token is an abbreviation or not:
from iamsystem import Abbreviations
from iamsystem import Entity
from iamsystem import Matcher
from iamsystem import TokenT
from iamsystem import english_tokenizer


def upper_case_only(token: TokenT) -> bool:
    """Return True if all token's characters are uppercase."""
    return token.label.isupper()


def first_letter_capitalized(token: TokenT) -> bool:
    """Return True if the first letter is uppercase but not the whole token."""
    return token.label[0].isupper() and not token.label.isupper()


tokenizer = english_tokenizer()
ent1 = Entity(label="acute respiratory distress", kb_id="J80")
ent2 = Entity(label="patient", kb_id="D007290")
ent3 = Entity(label="patient hospitalized", kb_id="D007297")
ent4 = Entity(label="physiotherapy", kb_id="D007297")
matcher = Matcher.build(
    keywords=[ent1, ent2, ent3, ent4], tokenizer=tokenizer
)

abbs_upper = Abbreviations(
    name="upper case abbs", token_is_an_abbreviation=upper_case_only
)
abbs_upper.add(
    short_form="PT", long_form="physiotherapy", tokenizer=tokenizer
)
abbs_upper.add(
    short_form="ARD",
    long_form="Acute Respiratory Distress",
    tokenizer=tokenizer,
)
abbs_capitalized = Abbreviations(
    name="capitalized abbs",
    token_is_an_abbreviation=first_letter_capitalized,
)
abbs_capitalized.add(
    short_form="Pt", long_form="patient", tokenizer=tokenizer
)
matcher.add_fuzzy_algo(fuzzy_algo=abbs_upper)
matcher.add_fuzzy_algo(fuzzy_algo=abbs_capitalized)
annots = matcher.annot_text(
    text="Pt hospitalized with ARD. Treament: PT"
)
for annot in annots:
    print(annot.to_string(debug=True))
# Pt hospitalized 0 15 patient hospitalized (D007297) pt(capitalized abbs);hospitalized(exact) # noqa
# ARD 21 24 acute respiratory distress (J80) ard(upper case abbs)
# PT 36 38 physiotherapy (D007297) pt(upper case abbs)
Notice that TokenT is a generic token type, so if you use a custom tokenizer (e.g. from an external library like spaCy) you can access custom attributes.
String Distance
This package utilizes the spellwise and pysimstring libraries to access string distance algorithms.
Spellwise
In the example below, iamsystem is configured with two spellwise algorithms: Levenshtein distance which measures the number of edits needed to transform one word into another, and Soundex which is a phonetic algorithm.
from iamsystem import Entity
from iamsystem import ESpellWiseAlgo
from iamsystem import Matcher

ent1 = Entity(label="acute respiratory distress", kb_id="J80")
matcher = Matcher.build(
    keywords=[ent1],
    spellwise=[
        dict(
            measure=ESpellWiseAlgo.LEVENSHTEIN,
            max_distance=1,
            min_nb_char=5,
        ),
        dict(measure=ESpellWiseAlgo.SOUNDEX, max_distance=1),
    ],
)
annots = matcher.annot_text(text="acute resiratory distresssss")
for annot in annots:
    print(annot.to_string(debug=True))
# acute resiratory distresssss 0 28 acute respiratory distress (J80) acute(exact,LEVENSHTEIN,SOUNDEX);resiratory(LEVENSHTEIN);distresssss(SOUNDEX) # noqa
The spellwise parameter of the build function expects an iterable of dictionaries. The key-value pairs of each dictionary are passed to the SpellWiseWrapper init function. Since a string distance algorithm is context independent, the Matcher build function places them in a CacheFuzzyAlgos to avoid calling them multiple times. For a list of available Spellwise algorithms, see ESpellWiseAlgo.
String distance algorithms are often used to detect typos in a document. False positives are common, since two unrelated words can be separated by a short string distance. To avoid calling a string distance algorithm on common words of a language, you can set the string_distance_ignored_w parameter:
from iamsystem import ESpellWiseAlgo
from iamsystem import Matcher

matcher = Matcher.build(
    keywords=["poids"],
    spellwise=[
        dict(
            measure=ESpellWiseAlgo.LEVENSHTEIN,
            max_distance=1,
            min_nb_char=4,
        )
    ],
)
annots = matcher.annot_text(text="Absence de poils.")
for annot in annots:
    print(annot)
# poils 11 16 poids
matcher = Matcher.build(
    keywords=["poids"],
    spellwise=[
        dict(
            measure=ESpellWiseAlgo.LEVENSHTEIN,
            max_distance=1,
            min_nb_char=4,
        )
    ],
    string_distance_ignored_w=["poils"],
)
annots_2 = matcher.annot_text(text="Absence de poils.")
for annot in annots_2:
    print(annot)
# nothing is printed: 0 annotations
Since poils is one substitution from poids, the algorithm returns a false positive. By adding poils to string_distance_ignored_w, the string distance algorithm is not called.
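For readers unfamiliar with edit distance, here is a self-contained sketch of the Levenshtein computation and of the effect of an ignore list. It is a textbook dynamic-programming version, not the spellwise implementation, and fuzzy_match is a hypothetical helper whose parameters merely mirror the options discussed above.

```python
from typing import FrozenSet


def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,  # deletion
                    current[j - 1] + 1,  # insertion
                    previous[j - 1] + cost,  # substitution
                )
            )
        previous = current
    return previous[-1]


def fuzzy_match(
    token: str,
    keyword: str,
    max_distance: int = 1,
    ignored: FrozenSet[str] = frozenset(),
) -> bool:
    """Skip the distance computation for ignored (common) words."""
    if token in ignored:
        return False
    return levenshtein(token, keyword) <= max_distance


print(levenshtein("poils", "poids"))  # 1
print(levenshtein("resiratory", "respiratory"))  # 1
print(fuzzy_match("poils", "poids"))  # True
print(fuzzy_match("poils", "poids", ignored=frozenset({"poils"})))  # False
```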
I recommend passing all common words of a language to the string_distance_ignored_w parameter. This makes iamsystem faster, since string distance algorithms are then called only for unknown words, and it reduces false positives.
SimString
The pysimstring library provides an API to the fast simstring algorithm implemented in C++. The simstring parameter of the Matcher build function expects an iterable of dictionaries. The key-value pairs of each dictionary are passed to the SimStringWrapper init function. Since a string distance algorithm is context independent, the build function places them in a CacheFuzzyAlgos to avoid calling them multiple times.
from iamsystem import Entity
from iamsystem import Matcher
from iamsystem.fuzzy.simstring import ESimStringMeasure

ent1 = Entity(label="acute respiratory distress", kb_id="J80")
matcher = Matcher.build(
    keywords=[ent1],
    simstring=[dict(measure=ESimStringMeasure.COSINE, threshold=0.7)],
)
annots = matcher.annot_text(text="acute respiratori disstress")
for annot in annots:
    print(annot)
# acute respiratori disstress 0 27 acute respiratory distress (J80)
Using the cosine similarity and a threshold of 0.7, the tokens respiratori matched to respiratory and disstress matched to distress.
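To give an intuition of why these tokens pass the threshold, the sketch below computes a cosine similarity over sets of character trigrams. This is a simplification and an assumption on my part: the simstring library may extract n-grams differently (for instance with begin and end markers), so the exact scores it computes can differ from these.

```python
import math
from typing import Set


def trigrams(word: str) -> Set[str]:
    """Set of overlapping 3-character substrings (no padding)."""
    return {word[i : i + 3] for i in range(len(word) - 2)}


def cosine(a: str, b: str) -> float:
    """Set-based cosine similarity: |X & Y| / sqrt(|X| * |Y|)."""
    x, y = trigrams(a), trigrams(b)
    if not x or not y:
        return 0.0
    return len(x & y) / math.sqrt(len(x) * len(y))


print(round(cosine("respiratori", "respiratory"), 2))  # 0.89
print(round(cosine("disstress", "distress"), 2))  # 0.77
```

Under these assumptions both scores are above 0.7, which is consistent with the matches reported in the example.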
CacheFuzzyAlgos
Fuzzy algorithms that are context independent can be cached to avoid calling them multiple times. The CacheFuzzyAlgos stores fuzzy algorithms, calls them once per token and then stores their results.
from iamsystem import Abbreviations
from iamsystem import CacheFuzzyAlgos
from iamsystem import Entity
from iamsystem import ESpellWiseAlgo
from iamsystem import Matcher
from iamsystem import SpellWiseWrapper

ent1 = Entity(label="acute respiratory distress", kb_id="J80")
matcher = Matcher.build(keywords=[ent1])
abbs = Abbreviations(name="abbs")
abbs.add(short_form="a", long_form="acute", tokenizer=matcher)
test = dict(
    measure=ESpellWiseAlgo.LEVENSHTEIN, max_distance=1, min_nb_char=5
)
levenshtein = SpellWiseWrapper(**test)
soundex = SpellWiseWrapper(ESpellWiseAlgo.SOUNDEX, max_distance=1)
cache = CacheFuzzyAlgos()
for algo in [levenshtein, soundex]:
    algo.add_words(words=matcher.get_keywords_unigrams())
    cache.add_algo(algo=algo)
# cache.add_algo(algo=abbs)  # no need to put this one in the cache
matcher.add_fuzzy_algo(fuzzy_algo=cache)
matcher.add_fuzzy_algo(fuzzy_algo=abbs)
annots = matcher.annot_text(text="a resiratory distresssss")
for annot in annots:
    print(annot.to_string(debug=True))
# a resiratory distresssss 0 24 acute respiratory distress (J80) a(abbs);resiratory(LEVENSHTEIN);distresssss(SOUNDEX) # noqa
Note that although we could have put the Abbreviations instance in the cache, it’s not necessary to do so since this algorithm is as fast as the cache. If you use the Matcher build function, string distance algorithms are automatically cached.
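The benefit of caching a context-independent algorithm can be illustrated with a small memoizing wrapper. CachedSynonyms and toy_algo are hypothetical names used for this sketch only; they do not mirror the CacheFuzzyAlgos internals.

```python
from typing import Callable, Dict, List


class CachedSynonyms:
    """Hypothetical cache around a context-independent synonym function."""

    def __init__(self, algo: Callable[[str], List[str]]):
        self.algo = algo
        self.cache: Dict[str, List[str]] = {}
        self.calls = 0  # number of times the wrapped algorithm really ran

    def __call__(self, norm_label: str) -> List[str]:
        if norm_label not in self.cache:
            self.calls += 1
            self.cache[norm_label] = self.algo(norm_label)
        return self.cache[norm_label]


def toy_algo(norm_label: str) -> List[str]:
    """Stand-in for a slow string distance lookup."""
    return ["respiratory"] if norm_label == "resiratory" else []


cached = CachedSynonyms(toy_algo)
for token in ["resiratory", "distress", "resiratory", "resiratory"]:
    cached(token)
print(cached.calls)  # 2
```

Because the result depends only on the normalized label, a repeated token never triggers a second computation. A context-dependent algorithm could not be cached this way.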
FuzzyRegex
Regular expressions are very useful and can be used with iamsystem. For example, if you want to detect blood test results in electronic health records, such as calcium levels in blood, you can have a regular expression in your keyword: “calcium (^\d*[.,]?\d*$) mmol/L”. The fuzzy_regex parameter expects an iterable of dictionaries. The key-value pairs of each dictionary correspond to the FuzzyRegex init function parameters.
The regular expression (^\d*[.,]?\d*$) is placed in the FuzzyRegex instance, with a pattern name (e.g. numval), and the pattern name is placed in the keyword (“calcium numval mmol/L”).
from iamsystem import Matcher
from iamsystem import english_tokenizer
from iamsystem import split_find_iter_closure

tokenizer = english_tokenizer()
tokenizer.split = split_find_iter_closure(pattern=r"(\w|\.|,)+")
matcher = Matcher.build(
    keywords=["calcium numval mmol/L"],
    tokenizer=tokenizer,
    stopwords=["level", "is", "normal"],
    fuzzy_regex=[
        dict(
            name="regex_num",
            pattern=r"^\d*[.,]?\d*$",
            pattern_name="numval",
        )
    ],
)
annots = matcher.annot_text(
    text="the blood calcium level is normal: 2.1 mmol/L"
)
for annot in annots:
    print(annot)
# calcium 2.1 mmol L 10 17;35 45 calcium numval mmol/L
Note that the Default split function must be modified to detect decimal values. Also note that the label of the keyword “calcium numval mmol/L” (the keywords argument) contains the same pattern name, numval, as the pattern_name argument. When the fuzzy algorithm receives the token 2.1, it finds that it matches its regular expression and returns the pattern name numval.
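Both regular expressions from this example can be checked in isolation with the standard re module. This sketch only reproduces the splitting and the pattern test. The mapping from a matching token to the pattern name is shown schematically and is not the FuzzyRegex code.

```python
import re

split_pattern = re.compile(r"(\w|\.|,)+")
numval_pattern = re.compile(r"^\d*[.,]?\d*$")

text = "the blood calcium level is normal: 2.1 mmol/L"
tokens = [m.group() for m in split_pattern.finditer(text)]
print(tokens)
# ['the', 'blood', 'calcium', 'level', 'is', 'normal', '2.1', 'mmol', 'L']

# Schematic substitution: a token matching the pattern is replaced by
# the pattern name, which is what appears in the keyword label.
replaced = ["numval" if numval_pattern.match(t) else t for t in tokens]
print(replaced)
# ['the', 'blood', 'calcium', 'level', 'is', 'normal', 'numval', 'mmol', 'L']
```

Be aware that this pattern also accepts an empty string or a lone “.” or “,”, since every part of it is optional. A stricter pattern may be preferable for your data.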
In the example above, stopwords have been added; otherwise the algorithm wouldn’t have found the keyword with a context window of 1. Intermediate words are often not known in advance, in which case this method wouldn’t work. Another way to produce exactly the same annotation is to use the NegativeStopwords class, which ignores all unigrams that are not in the keywords:
from iamsystem import Matcher
from iamsystem import english_tokenizer
from iamsystem import split_find_iter_closure

tokenizer = english_tokenizer()
tokenizer.split = split_find_iter_closure(pattern=r"(\w|\.|,)+")
matcher = Matcher.build(
    keywords=["calcium numval mmol/L"],
    tokenizer=tokenizer,
    negative=True,
    fuzzy_regex=[
        dict(
            name="regex_num",
            pattern=r"^\d*[.,]?\d*$",
            pattern_name="numval",
        )
    ],
)
annots = matcher.annot_text(
    text="the blood calcium level is normal: 2.1 mmol/L"
)
for annot in annots:
    print(annot)
# calcium 2.1 mmol L 10 17;35 45 calcium numval mmol/L
WordNormalizer
Word normalization is a common pre-processing step in NLP. The idea is to group words that have the same normalized form; for example “eating”, “eats”… have the same canonical form “eat”.
The WordNormalizer offers the possibility to add a normalization function. A token in a document will match a token in a keyword if they have the same normalized form.
In the example below, nltk is used to access a French stemmer. The stemming function is given to the WordNormalizer class:
from nltk.stem.snowball import FrenchStemmer

from iamsystem import Entity
from iamsystem import Matcher
from iamsystem import french_tokenizer

ent1 = Entity(label="cancer de la prostate", kb_id="C61")
stemmer = FrenchStemmer()
matcher = Matcher.build(
    keywords=[ent1],
    tokenizer=french_tokenizer(),
    stopwords=["de", "la"],
    normalizers=[dict(name="french_stemmer", norm_fun=stemmer.stem)],
)
annots = matcher.annot_text(text="cancer prostatique")
for annot in annots:
    print(annot)
# cancer prostatique 0 18 cancer de la prostate (C61)
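The matching rule itself (two tokens match when their normalized forms are equal) can be shown with a deliberately naive normalizer. toy_stem is a hypothetical suffix stripper written for this sketch. It is far cruder than a real stemmer or lemmatizer and should not be used in practice.

```python
def toy_stem(word: str) -> str:
    """Strip a few English suffixes (illustration only)."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def same_normal_form(doc_token: str, keyword_token: str) -> bool:
    """Tokens match if their normalized forms are identical."""
    return toy_stem(doc_token) == toy_stem(keyword_token)


print([toy_stem(w) for w in ["eat", "eats", "eating"]])
# ['eat', 'eat', 'eat']
print(same_normal_form("eating", "eats"))  # True
print(same_normal_form("eating", "drinks"))  # False
```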
Abstract Base classes
You might be interested in the abstract base classes of the fuzzy algorithms if you want to create a new custom fuzzy algorithm. The hierarchy is the following:
Implement this class to create a context-dependent algorithm. For each token for which a synonym is expected, the context words and the algorithm’s states are available.
Implement this class to create a context-free algorithm that depends only on the current token. The class has access to the generic token for which a synonym is expected. Examples of such algorithms: FuzzyRegex, Abbreviations.
Implement this class to create a context-free algorithm that depends only on the normalized form of the token. The class has access to the normalized label of the token for which a synonym is expected. These algorithms can be cached with CacheFuzzyAlgos. Examples of such algorithms: String Distance, WordNormalizer.
Brat
Brat is an open source text annotation tool. This package provides a Brat adapter to generate Brat annotation files (.ann extension) in order to visualise iamsystem’s annotations in the Brat web interface.
Brat Formatter
Given a sequence of tokens, there are several ways of creating a Brat annotation. The default Brat formatter groups continuous sequences of tokens:
from iamsystem import Matcher

matcher = Matcher.build(keywords=["North America"])
annots = matcher.annot_text(text="North America")
for annot in annots:
    print(annot)
# North America 0 13 North America
Indeed, “North America” has two tokens, “North” and “America” but a continuous annotation (0 13) is created.
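The grouping behaviour can be sketched as follows, assuming each token is a (start, end, index) triple where index is the token's position in the document. The brat_offsets function is hypothetical and only mimics the behaviour described above. It is not the package's formatter.

```python
from typing import List, Optional, Tuple

Token = Tuple[int, int, int]  # (start, end, position in the document)


def brat_offsets(tokens: List[Token]) -> str:
    """Merge tokens with consecutive positions into a single Brat span."""
    spans: List[List[int]] = []
    previous_index: Optional[int] = None
    for start, end, index in tokens:
        if previous_index is not None and index == previous_index + 1:
            spans[-1][1] = end  # extend the current span
        else:
            spans.append([start, end])  # open a new span
        previous_index = index
    return ";".join(f"{start} {end}" for start, end in spans)


# "North America": two adjacent tokens give one span.
print(brat_offsets([(0, 5, 0), (6, 13, 1)]))  # 0 13
# "North and South America": tokens 0 and 3 give two spans.
print(brat_offsets([(0, 5, 0), (16, 23, 3)]))  # 0 5;16 23
```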
In order to have one Brat span for each token, you can use the IndividualTokenFormatter:
from iamsystem import IndividualTokenFormatter
from iamsystem import Matcher

matcher = Matcher.build(keywords=["North America"])
annots = matcher.annot_text(text="North America")
formatter = IndividualTokenFormatter()
for annot in annots:
    annot.brat_formatter = formatter
    print(annot)
# North America 0 5;6 13 North America
If you have stopwords in your matching sequences, you can include them in the Brat annotation using TokenStopFormatter. Stopwords are included if and only if they form a continuous sequence of tokens. Check the differences:
from iamsystem import Entity
from iamsystem import Matcher
from iamsystem import TokenStopFormatter

matcher = Matcher.build(
    keywords=[Entity(label="cancer de prostate", kb_id="C61")],
    stopwords=["de", "la"],
)
annots = matcher.annot_text(text="cancer de la prostate")
formatter = TokenStopFormatter()
for annot in annots:
    print(f"Default formatter: {annot}")
    annot.brat_formatter = formatter
    print(f"TokenStop formatter: {annot}")
# Default formatter: cancer prostate 0 6;13 21 cancer de prostate (C61) # noqa
# TokenStop formatter: cancer de la prostate 0 21 cancer de prostate (C61) # noqa
If your match is a discontinuous sequence of tokens and you want a single continuous Brat annotation running from the start offset of the first token to the end offset of the last token, you can use the SpanFormatter. Check the differences:
from iamsystem import Matcher
from iamsystem import SpanFormatter

matcher = Matcher.build(
    keywords=["North America"], stopwords=["and"], w=2
)
text = "North and South America"
annots = matcher.annot_text(text=text)
formatter = SpanFormatter(text=text)
for annot in annots:
    print(f"Default formatter: {annot}")
    annot.brat_formatter = formatter
    print(f"Span formatter: {annot}")
# Default formatter: North America 0 5;16 23 North America
# Span formatter: North and South America 0 23 North America
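To make the offset strings in the outputs above easier to read, the following sketch recomputes them from token offsets without using iamsystem. It is only an illustration under one assumption: that "continuous" means the tokens have consecutive positions in the document. The function names (grouped_spans, individual_spans, full_span) are hypothetical and are not the library's formatter classes. The stopword-aware behaviour of TokenStopFormatter is not covered.

```python
from typing import List, Tuple

# (token position in the document, start offset, end offset)
Token = Tuple[int, int, int]


def grouped_spans(tokens: List[Token]) -> str:
    """Merge runs of consecutive tokens into one span (default-like)."""
    spans: List[List[int]] = []
    prev_index = None
    for index, start, end in sorted(tokens):
        if prev_index is not None and index == prev_index + 1:
            spans[-1][1] = end  # extend the current span
        else:
            spans.append([start, end])  # open a new span
        prev_index = index
    return ";".join(f"{s} {e}" for s, e in spans)


def individual_spans(tokens: List[Token]) -> str:
    """One span per token, whatever the token positions are."""
    return ";".join(f"{s} {e}" for _, s, e in sorted(tokens))


def full_span(tokens: List[Token]) -> str:
    """A single span from the first token's start to the last token's end."""
    ordered = sorted(tokens)
    return f"{ordered[0][1]} {ordered[-1][2]}"


adjacent = [(0, 0, 5), (1, 6, 13)]  # "North America"
split = [(0, 0, 5), (3, 16, 23)]  # "North and South America"

print(grouped_spans(adjacent))  # 0 13
print(individual_spans(adjacent))  # 0 5;6 13
print(grouped_spans(split))  # 0 5;16 23
print(full_span(split))  # 0 23
```

In Brat's notation, a semicolon separates the fragments of a discontinuous annotation, so "0 5;16 23" denotes two fragments and "0 23" a single one.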
Brat Document
The BratDocument class stores Brat entities and Brat notes. Each entity corresponds to one annotation and is made of:
An ID
A Brat type declared in Brat’s configuration file (annotation.conf)
start-end offsets
text substring
from iamsystem import BratDocument
from iamsystem import Entity
from iamsystem import Matcher

ent1 = Entity(label="North America", kb_id="NA")
matcher = Matcher.build(keywords=[ent1], w=3)
annots = matcher.annot_text(text="North and South America")
brat_document = BratDocument()
brat_document.add_annots(
    annots, brat_type="CONTINENT", keyword_attr=None
)
print(str(brat_document))
# T1 CONTINENT 0 5;16 23 North America
# #1 IAMSYSTEM T1 North America (NA)
The first line is the brat entity, the second is the brat note. T1 is the ID of the brat entity. Each note is linked to a brat entity by its ID, here T1. In the brat note, ‘North America (NA)’ is the comment related to this entity. By default, this comment is generated by calling the __str__ method of the Keyword. Here the __str__ method of the Entity class concatenated the label ‘North America’ and the code ‘(NA)’. You can modify this last value by overriding the get_note function of the BratDocument class.
Also note that in the above example, the Brat type “CONTINENT” is passed as a parameter and applies to all annotations. If you have multiple Brat types, a better way to do this is to store the Brat type in a Keyword subclass attribute and to pass the attribute name to the add_annots function:
from iamsystem import BratDocument
from iamsystem import Entity
from iamsystem import Matcher


class TypedEntity(Entity):
    """Entity subclass that also stores its own Brat type."""

    def __init__(self, label: str, code: str, brat_type: str):
        super().__init__(label, code)
        self.brat_type = brat_type


ent1 = TypedEntity(label="North America", code="NA", brat_type="CONTINENT")
matcher = Matcher.build(keywords=[ent1], w=3)
annots = matcher.annot_text(text="North and South America")
brat_document = BratDocument()
# keyword_attr names the attribute that holds the Brat type.
brat_document.add_annots(annots=annots, keyword_attr="brat_type")
print(str(brat_document))
# T1 CONTINENT 0 5;16 23 North America
# #1 IAMSYSTEM T1 North America (NA)
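The two kinds of lines printed above follow the Brat standoff layout, in which the ID, the "type plus reference" field and the trailing text are separated by tabs. The sketch below builds such lines by hand to show which part goes where. The helper names entity_line and note_line are hypothetical and are not part of iamsystem; BratDocument does this work for you.

```python
from typing import List, Tuple


def entity_line(
    ent_id: str, brat_type: str, spans: List[Tuple[int, int]], text: str
) -> str:
    """Build a text-bound annotation line: ID, type with offsets, text."""
    offsets = ";".join(f"{start} {end}" for start, end in spans)
    return f"{ent_id}\t{brat_type} {offsets}\t{text}"


def note_line(note_id: str, note_type: str, ent_id: str, comment: str) -> str:
    """Build a note line that points at an entity through its ID."""
    return f"{note_id}\t{note_type} {ent_id}\t{comment}"


print(entity_line("T1", "CONTINENT", [(0, 5), (16, 23)], "North America"))
print(note_line("#1", "IAMSYSTEM", "T1", "North America (NA)"))
```

The note carries the entity ID (T1) in its second field, which is how Brat attaches the comment to the right annotation.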
Brat Writer
This package provides a utility class to write a BratDocument to a file.
import os
import tempfile

from iamsystem import BratDocument
from iamsystem import BratWriter
from iamsystem import Entity
from iamsystem import Matcher

ent1 = Entity(label="North America", kb_id="NA")
matcher = Matcher.build(keywords=[ent1], w=3)
annots = matcher.annot_text(text="North and South America")
doc = BratDocument()
doc.add_annots(annots=annots, brat_type="CONTINENT")
# mkdtemp already creates the directory.
temp_path = tempfile.mkdtemp()
filename = os.path.join(temp_path, "docs.ann")
with open(filename, "w") as f:
    BratWriter.saveEntities(
        brat_entities=doc.get_entities(), write=f.write
    )
    BratWriter.saveNotes(brat_notes=doc.get_notes(), write=f.write)
spaCy
With a list of words
This package provides a spaCy component to add the iamsystem algorithm to a spaCy pipeline.
from spacy.lang.fr import French

from iamsystem.spacy.component import IAMsystemBuildSpacy  # noqa

nlp = French()
nlp.add_pipe(
    "iamsystem_matcher",
    name="iamsystem",
    last=True,
    config={
        "build_params": {
            "keywords": [
                "North America",
                "South America",
            ],
            "abbreviations": [("amer", "America")],
            "stopwords": ["and"],
            "w": 2,
            "remove_nested_annots": True,
            "spellwise": [dict(max_distance=1, measure="Levenshtein")],
        },
    },
)
doc = nlp("Northh and South Amer.")
assert len(doc.spans["iamsystem"]) == 2
spans = doc.spans["iamsystem"]
for span in spans:
    print(span._.iamsystem)
# Northh Amer 0 6;17 21 North America
# South Amer 11 21 South America
The build_params expects serializable Matcher build parameters. See IAMsystemBuildSpacy to configure this component.
With a list of keywords
Since the Keyword implementation is not JSON serializable, passing Keyword instances to the keywords parameter raises an error. You have three options:
Create your own IKeyword implementation that is JSON serializable.
Pass a registered function:
from typing import Iterable

import spacy

from iamsystem import Entity
from iamsystem import IKeyword
from iamsystem import Terminology


@spacy.registry.misc("umls_ents.v1")
def get_termino_umls() -> Iterable[IKeyword]:
    """An imaginary set of umls ents."""
    termino = Terminology()
    ent1 = Entity("Insuffisance Cardiaque", "I50.9")
    ent2 = Entity("Insuffisance Cardiaque Gauche", "I50.1")
    termino.add_keywords(keywords=[ent1, ent2])
    return termino

"build_params": {
    "keywords": {"@misc": "umls_ents.v1"},
}
Note that if you call nlp.to_disk your keywords will not be serialized.
Pass each Keyword as a dictionary using its asdict() function, together with the module and class name of the Keyword dataclass:
config={
    "serialized_kw": {
        "module": "iamsystem",
        "class_name": "Keyword",
        "kws": [Keyword(label="insuffisance cardiaque").asdict()],
    },
    "build_params": {"w": 1},
},
Note that if you call nlp.to_disk your keywords will be serialized.
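The serialization constraint behind these options can be reproduced with the standard library alone. The sketch below uses a hypothetical DemoKeyword dataclass, which is not iamsystem's Keyword, and the dataclasses.asdict function rather than the asdict() method shown above. It shows that a plain object is rejected by the JSON encoder, while its dictionary form survives a round trip and can be rebuilt once the class is known.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class DemoKeyword:
    """Hypothetical stand-in for a keyword class; not part of iamsystem."""

    label: str


kw = DemoKeyword(label="insuffisance cardiaque")

# A plain object cannot be encoded as JSON.
try:
    json.dumps({"keywords": [kw]})
    serializable = True
except TypeError:
    serializable = False
print(serializable)  # False

# Its dictionary form can, and the class rebuilds the object afterwards.
payload = json.dumps({"kws": [asdict(kw)]})
restored = [DemoKeyword(**d) for d in json.loads(payload)["kws"]]
print(restored == [kw])  # True
```

This is presumably why the third option asks for the module and class name next to the dictionaries: the dictionaries alone do not say which class to rebuild.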