Tokenizer

The iamsystem matcher depends heavily on how documents and keywords are tokenized and normalized. An ITokenizer is responsible for turning text into tokens. The TokenizerImp class does this by performing alphanumeric tokenization with two inner functions:

  • split the text into (start,end) offsets

  • normalize each token

The english_tokenizer and french_tokenizer are concrete implementations.
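
To make the two steps concrete, here is a minimal sketch in plain Python that does not use iamsystem at all. SketchToken, split_alphanum, lower_case and tokenize are hypothetical names chosen for illustration; the field names simply mirror the ones printed in the examples of this page.

```python
import re
from typing import Callable, Iterable, List, NamedTuple, Tuple

Offsets = Tuple[int, int]


class SketchToken(NamedTuple):
    """Hypothetical stand-in for a token, for illustration only."""

    label: str
    norm_label: str
    start: int
    end: int
    i: int


def split_alphanum(text: str) -> Iterable[Offsets]:
    """Step 1: yield the (start, end) offsets of each run of word characters."""
    for match in re.finditer(r"\w+", text):
        yield match.start(), match.end()


def lower_case(label: str) -> str:
    """Step 2: normalize a token label by lowercasing it."""
    return label.lower()


def tokenize(
    text: str,
    split: Callable[[str], Iterable[Offsets]] = split_alphanum,
    normalize: Callable[[str], str] = lower_case,
) -> List[SketchToken]:
    """Combine the two steps: split into offsets, then normalize each label."""
    tokens = []
    for i, (start, end) in enumerate(split(text)):
        label = text[start:end]
        tokens.append(SketchToken(label, normalize(label), start, end, i))
    return tokens


for token in tokenize("SARS-CoV+"):
    print(token)
# SketchToken(label='SARS', norm_label='sars', start=0, end=4, i=0)
# SketchToken(label='CoV', norm_label='cov', start=5, end=8, i=1)
```

Keeping split and normalize as separate functions means either one can be swapped without touching the other.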

Other libraries offer more elaborate tokenizers, and I recommend you use them. To use the tokenizer of another library, you can build an adapter by creating a new implementation of the ITokenizer interface. For example, this package provides a spaCy custom component that consumes spaCy’s tokenizer.
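
The adapter idea can be sketched as follows. This is only an illustration under stated assumptions: external_tokenize is a hypothetical third-party tokenizer that returns bare strings, SketchToken is a hypothetical stand-in for the library's token class, and the only requirement assumed of the interface is a tokenize(text) method that returns a sequence of tokens. A real adapter would return iamsystem tokens instead.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class SketchToken:
    """Hypothetical stand-in for an iamsystem token."""

    label: str
    norm_label: str
    start: int
    end: int
    i: int


def external_tokenize(text: str) -> List[str]:
    """Hypothetical third-party tokenizer that returns only token strings."""
    return text.split()


class WhitespaceAdapter:
    """Expose the external tokenizer through a tokenize(text) method."""

    def tokenize(self, text: str) -> Sequence[SketchToken]:
        tokens = []
        cursor = 0
        for i, label in enumerate(external_tokenize(text)):
            # The external tool dropped the offsets, so recover them by
            # searching forward from the end of the previous token.
            start = text.index(label, cursor)
            end = start + len(label)
            tokens.append(SketchToken(label, label.lower(), start, end, i))
            cursor = end
        return tokens


for token in WhitespaceAdapter().tokenize("RT-PCR  sars-cov+"):
    print(token)
```

The adapter's main job is to supply what the matcher needs and the external tokenizer may not provide: character offsets and a normalized label for every token.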

Default split function

By default, the Matcher class uses the french_tokenizer, which splits a document into runs of word characters (letters, digits or underscores: [a-zA-Z0-9_]).

I recommend that you check the generated tokens to verify that they match your needs. For example:

from iamsystem import english_tokenizer

tokenizer = english_tokenizer()
tokens = tokenizer.tokenize("SARS-CoV+")
for token in tokens:
    print(token)
# Token(label='SARS', norm_label='sars', start=0, end=4, i=0)
# Token(label='CoV', norm_label='cov', start=5, end=8, i=1)

The ‘+’ sign is ignored even though it is important. The split function can be modified as follows:

from iamsystem import english_tokenizer
from iamsystem import split_find_iter_closure

tokenizer = english_tokenizer()
tokenizer.split = split_find_iter_closure(pattern=r"(\w+|\+)")
tokens = tokenizer.tokenize("SARS-CoV+")
for token in tokens:
    print(token)
# Token(label='SARS', norm_label='sars', start=0, end=4, i=0)
# Token(label='CoV', norm_label='cov', start=5, end=8, i=1)
# Token(label='+', norm_label='+', start=8, end=9, i=2)
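
To see why the pattern matters, here is a sketch of how a pattern-based split closure could be written with the standard re module. make_split is a hypothetical helper, not the library's implementation of split_find_iter_closure; it only shows the offsets each pattern would produce.

```python
import re
from typing import Callable, Iterator, Tuple

Offsets = Tuple[int, int]


def make_split(pattern: str) -> Callable[[str], Iterator[Offsets]]:
    """Return a split function that yields the offsets of every pattern match."""
    compiled = re.compile(pattern)

    def split(text: str) -> Iterator[Offsets]:
        for match in compiled.finditer(text):
            yield match.start(), match.end()

    return split


default_split = make_split(r"\w+")      # word characters only
plus_split = make_split(r"(\w+|\+)")    # word characters, or a single '+'

print(list(default_split("SARS-CoV+")))  # [(0, 4), (5, 8)]
print(list(plus_split("SARS-CoV+")))     # [(0, 4), (5, 8), (8, 9)]
```

Any character that no alternative of the pattern matches, such as the hyphen here, is skipped and never becomes a token.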

Change default Tokenizer

To change the Matcher’s default tokenizer, pass your tokenizer to Matcher.build with the tokenizer parameter.

from iamsystem import Entity
from iamsystem import Matcher
from iamsystem import english_tokenizer
from iamsystem import split_find_iter_closure

ent1 = Entity(label="SARS-CoV+", kb_id="95209-3")
text = "Pt c/o acute respiratory distress syndrome. RT-PCR sars-cov+"
tokenizer = english_tokenizer()
tokenizer.split = split_find_iter_closure(pattern=r"(\w+|\+)")
matcher = Matcher.build(keywords=[ent1], tokenizer=tokenizer)
annots = matcher.annot_text(text=text)
for annot in annots:
    print(annot)
# sars cov +	51 60	SARS-CoV+ (95209-3)

Default normalize function

You can override the normalize function of a tokenizer to suit your needs. The english_tokenizer normalizes each token by lowercasing it. The french_tokenizer lowercases each token and also removes accents. This removal of diacritics is the only difference between the two tokenizers; it is done with the unidecode library, which tries to transliterate the label into ASCII characters. Using the french_tokenizer for English documents adds very little overhead.
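
The difference between the two normalizations can be illustrated with the standard library alone. Note the substitution: this sketch uses unicodedata as a stand-in for unidecode, so its output may differ for characters that have no Unicode decomposition. Both function names are hypothetical.

```python
import unicodedata


def lower_case(label: str) -> str:
    """Lowercase only, in the spirit of the English normalization."""
    return label.lower()


def lower_no_accents(label: str) -> str:
    """Lowercase, then strip combining marks (accents) from the label."""
    decomposed = unicodedata.normalize("NFKD", label.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))


print(lower_case("Insuffisance Rénale"))        # insuffisance rénale
print(lower_no_accents("Insuffisance Rénale"))  # insuffisance renale
```

With accents removed, a keyword written "renale" and a document that says "rénale" end up with the same normalized label.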

Change tokens order

Word order is important for iamsystem. In the example below, the keyword “blood calcium level” is mentioned, but its tokens are discontinuous and not in the right order. One solution is to order the tokens alphabetically, so that the tokens of the document and those of the keyword appear in the same order. Given a wide enough window, the keyword can then be found.

from iamsystem import Matcher
from iamsystem import english_tokenizer

text = "the level of calcium can measured in the blood."
tokenizer = english_tokenizer()
matcher = Matcher.build(
    keywords=["blood calcium level"],
    tokenizer=tokenizer,
    order_tokens=True,
    w=5,
)
annots = matcher.annot_text(text=text)
for annot in annots:
    print(annot)
# level calcium blood	4 9;13 20;41 46	blood calcium level

The order_tokens parameter changes iamsystem’s matching strategy, but it does not change the order of the document’s tokens. This approach is not suitable if the document is very long or the number of keywords is large.
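
The idea behind alphabetical ordering can be shown with a small standalone sketch. It is not how iamsystem implements order_tokens, and it ignores the window parameter entirely; norm_tokens and is_subsequence are hypothetical helpers used only to show that sorting puts the keyword and the document in a comparable order.

```python
import re
from typing import List


def norm_tokens(text: str) -> List[str]:
    """Lowercased runs of word characters, in document order."""
    return [m.group().lower() for m in re.finditer(r"\w+", text)]


def is_subsequence(needle: List[str], haystack: List[str]) -> bool:
    """True if the needle tokens appear in the haystack in the same order."""
    remaining = iter(haystack)
    # Each membership test consumes the iterator up to the matched token,
    # so later tokens can only be found further along the haystack.
    return all(token in remaining for token in needle)


keyword = norm_tokens("blood calcium level")
document = norm_tokens("the level of calcium can measured in the blood.")

print(is_subsequence(keyword, document))                  # False
print(is_subsequence(sorted(keyword), sorted(document)))  # True
```

In document order the keyword is not found because "blood" comes last; once both token lists are sorted alphabetically, the keyword's tokens appear in the same relative order in the document.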