Tokenizer

The iamsystem matcher depends heavily on how documents and keywords are tokenized and normalized. The ITokenizer interface is responsible for turning text into tokens. To do so, the TokenizerImp class performs alphanumeric tokenization with two inner functions:

split the text into (start,end) offsets
normalize each token

The english_tokenizer and french_tokenizer are concrete implementations.
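The two steps can be sketched in plain Python. This is only an illustration of the split-then-normalize idea, not iamsystem's code: SimpleToken, split_alnum, lowercase and tokenize are hypothetical names, and the regular expression \w+ is an assumption about what "alphanumeric tokenization" means here.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class SimpleToken:
    """Hypothetical token container (not iamsystem's Token class)."""

    label: str
    norm_label: str
    start: int
    end: int
    i: int


def split_alnum(text: str) -> List[Tuple[int, int]]:
    """Step 1: return the (start, end) offsets of each run of word characters."""
    return [(m.start(), m.end()) for m in re.finditer(r"\w+", text)]


def lowercase(label: str) -> str:
    """Step 2: normalize a token label (here, lowercasing only)."""
    return label.lower()


def tokenize(
    text: str,
    split: Callable[[str], List[Tuple[int, int]]] = split_alnum,
    normalize: Callable[[str], str] = lowercase,
) -> List[SimpleToken]:
    """Apply the split function, then normalize each resulting token."""
    tokens = []
    for i, (start, end) in enumerate(split(text)):
        label = text[start:end]
        tokens.append(SimpleToken(label, normalize(label), start, end, i))
    return tokens


tokens = tokenize("SARS-CoV+")
for token in tokens:
    print(token)
```

Keeping split and normalize as separate, replaceable functions is what makes it possible to change one without touching the other.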

Other libraries offer more elaborate tokenizers, and I recommend using them. To use another library's tokenizer, build an adapter by creating a new implementation of the ITokenizer interface. For example, this package provides a spaCy custom component that consumes spaCy’s tokenizer.
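As a rough sketch of the adapter idea, the class below wraps an external tokenizer that returns only strings and recovers the character offsets that a matcher needs. Everything here is hypothetical: AdaptedToken and ExternalTokenizerAdapter are illustrative names, the class does not subclass iamsystem's ITokenizer, and the methods the real interface requires should be checked in the library before relying on this shape.

```python
from typing import Callable, List, NamedTuple


class AdaptedToken(NamedTuple):
    """Hypothetical token record with the fields shown in this page's examples."""

    label: str
    norm_label: str
    start: int
    end: int
    i: int


class ExternalTokenizerAdapter:
    """Wrap a tokenizer returning plain strings and recover their offsets."""

    def __init__(self, external: Callable[[str], List[str]]):
        self.external = external

    def tokenize(self, text: str) -> List[AdaptedToken]:
        tokens = []
        cursor = 0  # search position, so repeated words map to successive spans
        for i, label in enumerate(self.external(text)):
            # str.index raises ValueError if the external tokenizer altered the text
            start = text.index(label, cursor)
            end = start + len(label)
            tokens.append(AdaptedToken(label, label.lower(), start, end, i))
            cursor = end
        return tokens


# str.split stands in for the external library's tokenizer.
adapter = ExternalTokenizerAdapter(external=str.split)
tokens = adapter.tokenize("RT-PCR  sars-cov+")
for token in tokens:
    print(token)
```

Recovering offsets this way only works when the external tokenizer returns substrings of the original text. A tokenizer that rewrites its tokens would need to expose offsets itself.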

Default split function

By default, the Matcher class uses the french_tokenizer, which splits a document into runs of word characters (a letter, digit, or underscore: [a-zA-Z0-9_]).

I recommend that you check the generated tokens to verify that they match your needs. For example:

from iamsystem import english_tokenizer

tokenizer = english_tokenizer()
tokens = tokenizer.tokenize("SARS-CoV+")
for token in tokens:
    print(token)
# Token(label='SARS', norm_label='sars', start=0, end=4, i=0)
# Token(label='CoV', norm_label='cov', start=5, end=8, i=1)

The ‘+’ sign is ignored even though it is important. The split function can be modified as follows:

from iamsystem import english_tokenizer
from iamsystem import split_find_iter_closure

tokenizer = english_tokenizer()
tokenizer.split = split_find_iter_closure(pattern=r"(\w+|\+)")
tokens = tokenizer.tokenize("SARS-CoV+")
for token in tokens:
    print(token)
# Token(label='SARS', norm_label='sars', start=0, end=4, i=0)
# Token(label='CoV', norm_label='cov', start=5, end=8, i=1)
# Token(label='+', norm_label='+', start=8, end=9, i=2)
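When several symbols must be kept, writing the pattern by hand becomes error-prone because characters such as ‘+’ have a special meaning in regular expressions. The sketch below builds such a pattern with the standard library only. build_split_pattern and split_labels are hypothetical helpers, not part of iamsystem; the resulting string is meant to be the kind of value given to the pattern argument above, which I have not verified against the library.

```python
import re
from typing import Iterable, List


def build_split_pattern(symbols: Iterable[str]) -> str:
    """Return a regex matching a run of word characters or any one of `symbols`."""
    # re.escape protects characters such as '+' that are regex operators.
    escaped = [re.escape(symbol) for symbol in symbols]
    return "|".join([r"\w+"] + escaped)


def split_labels(text: str, pattern: str) -> List[str]:
    """Return the substrings matched by `pattern`, in document order."""
    return [m.group(0) for m in re.finditer(pattern, text)]


pattern = build_split_pattern(["+"])
labels_default = split_labels("SARS-CoV+", r"\w+")
labels_plus = split_labels("SARS-CoV+", pattern)

print(pattern)
print(labels_default)
print(labels_plus)
```

Each symbol becomes its own token, so the keywords must be tokenized with the same pattern for a match to be possible.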

Change default Tokenizer

To change Matcher’s default tokenizer, pass it to the constructor.

from iamsystem import Entity
from iamsystem import Matcher
from iamsystem import english_tokenizer
from iamsystem import split_find_iter_closure

ent1 = Entity(label="SARS-CoV+", kb_id="95209-3")
text = "Pt c/o acute respiratory distress syndrome. RT-PCR sars-cov+"
tokenizer = english_tokenizer()
tokenizer.split = split_find_iter_closure(pattern=r"(\w+|\+)")
matcher = Matcher.build(keywords=[ent1], tokenizer=tokenizer)
annots = matcher.annot_text(text=text)
for annot in annots:
    print(annot)
# sars cov +	51 60	SARS-CoV+ (95209-3)

Default normalize function

You can override the normalize function of a tokenizer to suit your needs. The english_tokenizer normalizes each token by lowercasing it. The french_tokenizer lowercases each token and removes accents. The only difference between the two is this removal of diacritics, done with the unidecode library, which tries to transliterate the label into ASCII characters. Using the french_tokenizer for English documents adds very little overhead.
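The difference between the two normalizations can be illustrated with the standard library. Note the substitution: this sketch uses unicodedata instead of unidecode, so it only strips combining marks and does not transliterate characters such as ‘œ’. The function names are hypothetical.

```python
import unicodedata


def normalize_lower(label: str) -> str:
    """Lowercase only, in the spirit of the english_tokenizer."""
    return label.lower()


def normalize_lower_no_accents(label: str) -> str:
    """Lowercase, then drop diacritics, in the spirit of the french_tokenizer.

    Assumption: NFD decomposition plus removal of combining marks stands in
    for unidecode, which additionally transliterates to ASCII.
    """
    decomposed = unicodedata.normalize("NFD", label.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


label = "Insuffisance Rénale"
print(normalize_lower(label))
print(normalize_lower_no_accents(label))
```

With accent removal, a keyword written “insuffisance renale” and a document containing “Insuffisance Rénale” end up with the same normalized labels.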

Change tokens order

Word order matters to iamsystem. In the example below, the keyword “blood calcium level” is mentioned, but its tokens are discontinuous and not in the right order. One solution is to order the tokens alphabetically, so that the tokens of the document and of the keyword are in the same order. Given a wide enough window, the keyword can then be found.

from iamsystem import Matcher
from iamsystem import english_tokenizer

text = "the level of calcium can measured in the blood."
tokenizer = english_tokenizer()
matcher = Matcher.build(
    keywords=["blood calcium level"],
    tokenizer=tokenizer,
    order_tokens=True,
    w=5,
)
annots = matcher.annot_text(text=text)
for annot in annots:
    print(annot)
# level calcium blood	4 9;13 20;41 46	blood calcium level

The order_tokens parameter changes iamsystem’s matching strategy, but it does not change the order of the document’s tokens. This approach is not suitable if the document is very long or the number of keywords is large.
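Why alphabetical ordering helps can be shown without iamsystem. The sketch below is not the library's algorithm: it ignores the window and simply checks whether the keyword's tokens appear in order among the document's tokens, before and after sorting both. is_subsequence is a hypothetical helper.

```python
from typing import List


def is_subsequence(needle: List[str], haystack: List[str]) -> bool:
    """True if the tokens of `needle` appear in `haystack` in the same order."""
    remaining = iter(haystack)
    # `tok in remaining` consumes the iterator up to the first match,
    # so each keyword token must be found after the previous one.
    return all(tok in remaining for tok in needle)


keyword = "blood calcium level".split()
document = "the level of calcium can measured in the blood".split()

found_raw = is_subsequence(keyword, document)
found_sorted = is_subsequence(sorted(keyword), sorted(document))

print(found_raw)     # False: 'blood' comes last in the document
print(found_sorted)  # True: both sequences are now in alphabetical order
```

Sorting also places unrelated tokens between those of the keyword ('can' and 'in' here), which is one reason a wide window is needed.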