API Documentation

Documentation of classes and methods.

Matcher

class iamsystem.Matcher(tokenizer: ~iamsystem.tokenization.api.ITokenizer = <iamsystem.tokenization.tokenize.TokenizerImp object>, stopwords: ~iamsystem.stopwords.api.IStopwords[~iamsystem.tokenization.api.TokenT] = None)[source]

Bases: IMatcher[TokenT]

Main public API to perform semantic annotation (aka entity linking) with the iamsystem algorithm.

__init__(tokenizer: ~iamsystem.tokenization.api.ITokenizer = <iamsystem.tokenization.tokenize.TokenizerImp object>, stopwords: ~iamsystem.stopwords.api.IStopwords[~iamsystem.tokenization.api.TokenT] = None)[source]

Create an IAMsystem matcher to annotate documents. Prefer the build() method to create a matcher.

Parameters

tokenizer – default french_tokenizer(). An ITokenizer instance responsible for tokenizing and normalizing.
stopwords – an IStopwords used to ignore empty words in keywords and documents. If None, default to Stopwords.

add_fuzzy_algo(fuzzy_algo: FuzzyAlgo[TokenT]) → None[source]

Add a fuzzy algorithm that provides synonym(s) to help match a token of a document with a token of a keyword.

Parameters: fuzzy_algo – a FuzzyAlgo instance.
Returns: None.

add_keyword(keyword: IKeyword) → None[source]

Add a keyword to find in a document.

Parameters: keyword – IKeyword to search in a document.
Returns: None.

add_keywords(keywords: Iterable[Union[str, IKeyword]]) → None[source]

Utility function to add multiple keywords.

Parameters: keywords – an iterable of string (labels) or IKeyword to search in a document.
Returns: None.

add_stopwords(words: Iterable[str]) → None[source]

Add words (tokens) to be ignored in IKeyword and in documents.

Parameters: words – a list of words to ignore.
Returns: None.

annot_text(text: str) → List[IAnnotation[TokenT]][source]

Annotate a document.

Parameters: text – the document to annotate.
Returns: a list of Annotation.

annot_tokens(tokens: Sequence[TokenT]) → List[IAnnotation[TokenT]][source]

Annotate a sequence of tokens.

Parameters: tokens – an ordered or unordered sequence of tokens.
Returns: a list of Annotation.

classmethod build(keywords: ~typing.Iterable[~typing.Union[str, ~iamsystem.keywords.api.IKeyword]], tokenizer: ~iamsystem.tokenization.api.ITokenizer = None, stopwords: ~typing.Union[~iamsystem.stopwords.api.IStopwords[~iamsystem.tokenization.api.TokenT], ~typing.Iterable[str]] = <iamsystem.stopwords.simple.NoStopwords object>, w=1, order_tokens=False, negative=False, remove_nested_annots=True, strategy: ~typing.Union[str, ~iamsystem.matcher.strategy.EMatchingStrategy] = EMatchingStrategy.WINDOW, string_distance_ignored_w: ~typing.Optional[~typing.Iterable[str]] = None, abbreviations: ~typing.Optional[~typing.Iterable[~typing.Tuple[str, str]]] = None, spellwise: ~typing.Optional[~typing.List[~typing.Dict[~typing.Any, ~typing.Any]]] = None, simstring: ~typing.Optional[~typing.List[~typing.Dict[~typing.Any, ~typing.Any]]] = None, normalizers: ~typing.Optional[~typing.List[~typing.Dict[~typing.Any, ~typing.Any]]] = None, fuzzy_regex: ~typing.Optional[~typing.List[~typing.Dict[~typing.Any, ~typing.Any]]] = None) → Matcher[TokenT][source]

Create an IAMsystem matcher to annotate documents.

Parameters

keywords – an iterable of keywords string or IKeyword instances.
tokenizer – default french_tokenizer(). An ITokenizer instance responsible for tokenizing and normalizing.
stopwords – an IStopwords instance or an iterable of strings. If None, default to NoStopwords.
w – Window. How discontinuous the sequence of a keyword’s tokens can be. By default, w=1 means the sequence must be continuous. w=2 means each token can be separated by another token.
order_tokens – order tokens alphabetically if order doesn’t matter in the matching strategy.
negative – every unigram not in the keywords is a stopword. Default to False. If stopwords are also passed, they will be removed from keywords’ tokens and so still be stopwords.
remove_nested_annots – if two annotations overlap, remove the shorter one. Default to True.
strategy – an IAMsystem matching strategy responsible for searching keywords in document. Default to WindowMatching.
string_distance_ignored_w – words ignored by string distance algorithms to avoid false positives matched.
abbreviations – an iterable of tuples (short_form, long_form).
spellwise – an iterable of SpellWiseWrapper init parameters. If ‘string_distance_ignored_w’ is set, these words are passed to SpellWiseWrapper init function.
simstring – an iterable of SimStringWrapper init parameters. If ‘string_distance_ignored_w’ is set, these words are passed to SimStringWrapper init function.
normalizers – an iterable of WordNormalizer init parameters.
fuzzy_regex – an iterable of FuzzyRegex init parameters.
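
The effect of the w parameter can be pictured with a small stand-alone sketch. It does not use iamsystem: matches_within_window is a hypothetical helper that only mirrors the description above (consecutive keyword tokens at most w positions apart) and is not the library's matching strategy.

```python
from typing import Sequence


def matches_within_window(
    doc_tokens: Sequence[str], kw_tokens: Sequence[str], w: int = 1
) -> bool:
    """Hypothetical helper: True if kw_tokens occur in order in doc_tokens,
    with consecutive matched tokens at most w positions apart."""
    # Positions where the first keyword token is found.
    positions = {i for i, tok in enumerate(doc_tokens) if tok == kw_tokens[0]}
    for kw_tok in kw_tokens[1:]:
        # Keep positions of kw_tok reachable from a previous match within the window.
        positions = {
            j
            for j, tok in enumerate(doc_tokens)
            if tok == kw_tok and any(0 < j - p <= w for p in positions)
        }
    return bool(positions)


doc = ["calcium", "blood", "level"]
keyword = ["calcium", "level"]
print(matches_within_window(doc, keyword, w=1))  # False: "blood" breaks continuity
print(matches_within_window(doc, keyword, w=2))  # True: one token may sit in between
```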

property fuzzy_algos: Iterable[FuzzyAlgo[TokenT]]

The fuzzy algorithms used by the algorithm.

Returns: FuzzyAlgo instances responsible for finding possible synonyms for each token of a document.

get_initial_state() → INode[source]: Return the initial state from which the iamsystem algorithm will start searching for a sequence of keywords’ tokens.

get_keywords_unigrams() → Set[str][source]: Get all the unigrams (single words excluding stopwords) in the keywords.

get_synonyms(tokens: Sequence[TokenT], token: TokenT, transitions: Iterable[StateTransition]) → List[Tuple[Tuple[str, ...], List[str]]][source]

Get synonyms of a token with configured fuzzy algorithms.

Parameters

tokens – document’s tokens.
token – the token for which synonyms are expected.
transitions – algorithm’s states.

Returns

tuples of synonyms and fuzzy algorithm’s names.

is_stopword(word: str) → bool[source]: Return True if word is a stopword.

is_token_a_stopword(token: TokenT) → bool[source]

Check if a token is a stopword.

Parameters: token – a generic token that implements IToken.
Returns: True if the token is a stopword.

property keywords: Collection[IKeyword]: Return the keywords added.

property remove_nested_annots: bool: Whether to remove nested annotations. Default to True.

property stopwords: IStopwords[TokenT]: Return the IStopwords used by the matcher.

property strategy: IMatchingStrategy[TokenT]: Return the matching strategy.

tokenize(text: str) → Sequence[TokenT][source]

Tokenize a text with the tokenizer’s instance.

Parameters: text – a document or a keyword.
Returns: A sequence of tokens, the type depends on the tokenizer but must implement IToken protocol.

property tokenizer: ITokenizer[TokenT]: Return the ITokenizer used by the matcher.

property w: int: Return the window parameter of this matcher.

EMatchingStrategy

class iamsystem.EMatchingStrategy(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Enumeration of matching strategies.

LARGE_WINDOW = <iamsystem.matcher.strategy.LargeWindowMatching object>: Same annotations as WINDOW but faster when the window is large.

NO_OVERLAP = <iamsystem.matcher.strategy.NoOverlapMatching object>: No overlapping or nested annotations; the fastest strategy.

WINDOW = <iamsystem.matcher.strategy.WindowMatching object>: Default matching strategy.

Span

class iamsystem.matcher.annotation.Span(tokens: List[TokenT])[source]

Bases: ISpan[TokenT], IOffsets

A class that represents a sequence of tokens in a document.

end: int: end-offset is the index of the last character + 1, that is to say the first character to exclude from the returned substring when slicing with [start:end]

property end_i: The index of the last token within the parent document.

get_text_substring(text: str) → str[source]: Return text substring.

start: int: The start offset of the first token.

property start_i: The index of the first token within the parent document.

property tokens: List[TokenT]

The tokens of the document that matched the keywords attribute of this instance.

Returns: an ordered sequence of TokenT, a generic type that implements IToken.

property tokens_label: The concatenation of each token’s label.

property tokens_norm_label: The concatenation of each token’s norm_label.

Annotation

class iamsystem.Annotation(tokens: List[TokenT], algos: List[List[str]], node: INode, stop_tokens: List[TokenT], text: Optional[str] = None)[source]

Bases: Span[TokenT], IAnnotation[TokenT]

Output class of Matcher storing information on the detected entities.

property algos: List[List[str]]: For each token, the list of algorithms that matched. One to several algorithms per token.

annot_to_str(annot: IAnnotation): A class function that generates a string representation of an annotation.

end: int: end-offset is the index of the last character + 1, that is to say the first character to exclude from the returned substring when slicing with [start:end]

property end_i: The index of the last token within the parent document.

get_text_substring(text: str) → str: Return text substring.

get_tokens_algos() → Iterable[Tuple[TokenT, List[str]]][source]

Get each token and the list of fuzzy algorithms that matched it.

Returns: an iterable of tuples (token0, [‘algo1’,…]) where token0 is a token and [‘algo1’,…] a list of fuzzy algorithms.

property keywords: Sequence[IKeyword]: The linked entities, IKeyword instances that matched a document’s tokens.

property label: @Deprecated. An annotation label. Return ‘tokens_label’ attribute

classmethod set_brat_formatter(brat_formatter: Union[EBratFormatters, IBratFormatter])[source]

Change Brat Formatter to change text-span and offsets.

Parameters: brat_formatter – A Brat formatter to produce a different Brat annotation. If None, default to ContSeqFormatter.
Returns: None

start: int: The start offset of the first token.

property start_i: The index of the first token within the parent document.

property stop_tokens: List[TokenT]: The list of stopwords tokens inside the annotation detected by the Matcher stopwords instance.

property text: Optional[str]: Return the annotated text.

to_dict(text: str = None) → Dict[str, Any][source]

Return a dictionary representation of this object.

Parameters: text – the document from which this annotation comes. Default to None.
Returns: A dictionary of relevant attributes.

to_string(text=False, debug=False) → str[source]

Get a default string representation of this object.

Parameters

text – the document from which this annotation comes. Default to None. If set, add the document substring: text[first-token-start-offset : last-token-end-offset].
debug – default to False. If True, add the sequence of tokens and fuzzyalgo names.

Returns

a concatenated string

property tokens: List[TokenT]

The tokens of the document that matched the keywords attribute of this instance.

Returns: an ordered sequence of TokenT, a generic type that implements IToken.

property tokens_label: The concatenation of each token’s label.

property tokens_norm_label: The concatenation of each token’s norm_label.

rm_nested_annots

iamsystem.rm_nested_annots(annots: List[Annotation], keep_ancestors=False)[source]

In case of two nested annotations, remove the shorter one. For example, if we have “prostate” and “prostate cancer” annotations, the “prostate” annotation is removed.

Parameters

annots – a list of annotations.
keep_ancestors – Default to False. Whether to keep the nested annotations that are ancestors and remove only other cases.

Returns

a filtered list of annotations.
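
As an illustration of the nesting rule, here is a minimal sketch on plain (start, end) character offsets. It is not the library's implementation: drop_nested_spans is a hypothetical helper, it works on tuples rather than Annotation objects, and it does not model keep_ancestors.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def drop_nested_spans(spans: List[Span]) -> List[Span]:
    """Hypothetical helper: keep spans that are not contained in a longer span."""
    kept = []
    for i, (start, end) in enumerate(spans):
        nested = any(
            j != i
            and o_start <= start
            and end <= o_end
            and (o_end - o_start) > (end - start)
            for j, (o_start, o_end) in enumerate(spans)
        )
        if not nested:
            kept.append((start, end))
    return kept


text = "prostate cancer"
spans = [(0, 8), (0, 15)]  # "prostate" and "prostate cancer"
filtered = drop_nested_spans(spans)
print([text[s:e] for s, e in filtered])  # ['prostate cancer']
```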

replace_annots

iamsystem.replace_annots(text: str, annots: Sequence[Annotation], new_labels: Sequence[str])[source]

Replace each annotation in a document (text parameter) by a new label. Warning: an annotation is ignored if overlapped by another one.

Parameters

text – the document from which the annotations come.
annots – an ordered sequence of annotations.
new_labels – one new label per annotation, same length as annots expected.

Returns

a new document.
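
The replacement logic can be sketched on plain offsets. This is a simplified illustration, not the library's code: replace_spans is a hypothetical helper, spans are assumed to be sorted by start offset, and the choice to skip the later of two overlapping spans is an assumption.

```python
from typing import List, Sequence, Tuple


def replace_spans(
    text: str, spans: Sequence[Tuple[int, int]], new_labels: Sequence[str]
) -> str:
    """Hypothetical helper: replace each (start, end) span by its new label."""
    if len(spans) != len(new_labels):
        raise ValueError("one new label per span is expected")
    parts: List[str] = []
    cursor = 0  # end offset of the last replaced span
    for (start, end), label in zip(spans, new_labels):
        if start < cursor:
            continue  # assumption: a span overlapped by a previous one is ignored
        parts.append(text[cursor:start])
        parts.append(label)
        cursor = end
    parts.append(text[cursor:])
    return "".join(parts)


text = "heart failure and diabetes"
new_text = replace_spans(text, [(0, 13), (18, 26)], ["<HF>", "<DM>"])
print(new_text)  # <HF> and <DM>
```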

Keyword and subclasses

IKeyword

class iamsystem.IKeyword(*args, **kwargs)[source]

Bases: Protocol

A string to search in a document (ex: “heart failure”).

label: str

IEntity

class iamsystem.IEntity(*args, **kwargs)[source]

Bases: IKeyword, Protocol

An entity of a knowledge base.

kb_id: str

Keyword

class iamsystem.Keyword(label: str)[source]

Bases: IKeyword

Base class to search keywords in a document.

__init__(label: str) → None

asdict()[source]: Returns the fields of the dataclass instance.

label: str: The string to search in a document (ex: ‘heart failure’).

Entity

class iamsystem.Entity(label: str, kb_id: str)[source]

Bases: Keyword, IEntity

A class that represents an entity of a knowledge base.

__init__(label: str, kb_id: str) → None

kb_id: str: The entity id in the knowledge base. Ex: https://www.wikidata.org/wiki/Q304330

Terminology

class iamsystem.Terminology[source]

Bases: IStoreKeywords

A utility class to store a set of keywords.

__init__()[source]

add_keyword(keyword: IKeyword) → None[source]

Add a keyword.

Parameters: keyword – an IKeyword or a subclass.
Returns: None

add_keywords(keywords: Iterable[IKeyword]) → None[source]

Add multiple keywords.

Parameters: keywords – an iterable of IKeyword or subclass instances.
Returns: None

get_unigrams(tokenizer: ITokenizer, stopwords: IStopwords) → Set[str][source]: Get all the unigrams (single words excluding stopwords) in the keywords.

property keywords: Collection[IKeyword]: Get the collection of keywords.

property size: int: Get the number of keywords.

Tokenization

IOffsets

class iamsystem.IOffsets(*args, **kwargs)[source]

Bases: Protocol

Offsets interface. Default implementation Offsets.

end: int: end-offset is the index of the last character + 1, that is to say the first character to exclude from the returned substring when slicing with [start:end]

start: int: start-offset is the index of the first character.

Offsets

class iamsystem.Offsets(start: int, end: int)[source]

Bases: IOffsets

Store the start and end offsets of a token.

__init__(start: int, end: int)[source]

Parameters

start – start-offset is the index of the first character.
end – end-offset is the index of the last character + 1, that is to say the first character to exclude from the returned substring when slicing with [start:end]
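
Because the end offset is exclusive, the two offsets can be used directly in a Python slice. The sketch below uses a hypothetical SimpleOffsets dataclass as a stand-in for any object exposing start and end; it does not import iamsystem.

```python
from dataclasses import dataclass


@dataclass
class SimpleOffsets:
    """Hypothetical stand-in for an object with start and end offsets."""
    start: int
    end: int


text = "heart failure"
offsets = SimpleOffsets(start=6, end=13)  # covers "failure"
substring = text[offsets.start:offsets.end]
print(substring)  # failure
```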

IToken

class iamsystem.IToken(*args, **kwargs)[source]

Bases: IOffsets, Protocol

Token interface. Default implementation Token

i: int: The index of the token within the parent document.

label: str: the label as it is in the document/keyword.

norm_label: str: the normalized label used by iamsystem’s algorithm to perform entity linking.

Token

class iamsystem.Token(start: int, end: int, label: str, norm_label: str, i: int)[source]

Bases: Offsets, IToken

Store the label, normalized label, start and end offsets of a token.

__init__(start: int, end: int, label: str, norm_label: str, i: int)[source]

Create a token.

Parameters

start – start-offset is the index of the first character.
end – end-offset is the index of the last character + 1, that is to say the first character to exclude from the returned substring when slicing with [start:end]
label – the label as it is in the document/keyword.
norm_label – the normalized label (used by iamsystem’s algorithm to perform entity linking).
i – the index of the token within the parent document.

ITokenizer

class iamsystem.ITokenizer(*args, **kwargs)[source]

Bases: Protocol[TokenT]

Tokenizer Interface. Default implementation TokenizerImp.

tokenize(text: str) → Sequence[TokenT][source]

Tokenize a string.

Parameters: text – an unnormalized string.
Returns: A sequence of generic type (TokenT) that implements IToken protocol.

TokenizerImp

class iamsystem.TokenizerImp(split: Callable[[str], Iterable[IOffsets]], normalize: Callable[[str], str])[source]

Bases: ITokenizer[Token]

An ITokenizer implementation. Class responsible for the tokenization and normalization of tokens. See also french_tokenizer(), english_tokenizer().

__init__(split: Callable[[str], Iterable[IOffsets]], normalize: Callable[[str], str])[source]

Create a custom tokenizer that splits and normalizes a string.

Parameters

split – a function that splits a text into (start, end) tuples. This function must return an iterable of IOffsets. See also split_find_iter_closure().
normalize – a function that normalizes a string. This function must return a string.
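
How a split function and a normalize function combine into tokens can be sketched without the library. SimpleToken, simple_tokenize and split_on_words are hypothetical names, and the split here returns plain (start, end) tuples instead of IOffsets objects.

```python
import re
from typing import Callable, Iterable, List, NamedTuple, Tuple


class SimpleToken(NamedTuple):
    """Hypothetical stand-in for a token: offsets, labels and index."""
    start: int
    end: int
    label: str
    norm_label: str
    i: int


def simple_tokenize(
    text: str,
    split: Callable[[str], Iterable[Tuple[int, int]]],
    normalize: Callable[[str], str],
) -> List[SimpleToken]:
    tokens = []
    for i, (start, end) in enumerate(split(text)):
        label = text[start:end]  # the label as it appears in the text
        tokens.append(SimpleToken(start, end, label, normalize(label), i))
    return tokens


def split_on_words(text: str) -> List[Tuple[int, int]]:
    # Assumption: a token is a run of "word" characters.
    return [(m.start(), m.end()) for m in re.finditer(r"\w+", text)]


tokens = simple_tokenize("Heart Failure", split_on_words, str.lower)
for token in tokens:
    print(token)
```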

tokenize(text: str) → Sequence[Token][source]: Split the text into a sequence of Token.

english_tokenizer

iamsystem.english_tokenizer() → TokenizerImp[source]

An opinionated English tokenizer. It splits the text by ‘word’ characters and normalizes by lowercasing.

Returns: a TokenizerImp implementation.

french_tokenizer

iamsystem.french_tokenizer() → TokenizerImp[source]

An opinionated French tokenizer. It splits the text by ‘word’ characters and normalizes by lowercasing and applying a unicode normalization form.

Returns: a TokenizerImp implementation.
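
The text above does not say which unicode normalization is applied. As one plausible reading, the sketch below lowercases, decomposes with NFKD and drops combining marks so that accented and unaccented spellings compare equal. The function name and the choice of NFKD are assumptions, not the library's documented behaviour.

```python
import unicodedata


def lower_and_strip_accents(text: str) -> str:
    """Hypothetical normalizer: lowercase, then remove combining accents."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(lower_and_strip_accents("Insuffisance Rénale"))  # insuffisance renale
```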

Build a custom split function

iamsystem.split_find_iter_closure(pattern: str) → Callable[[str], Iterable[IOffsets]][source]

Build a split function that maps a document to (start, end) tuples.

Parameters: pattern – a regular expression used to split a sentence into tokens.
Returns: a split function.
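
A closure of this kind can be sketched with the standard re module. make_split is a hypothetical name and returns plain (start, end) tuples; the real function returns IOffsets, and using finditer is an assumption suggested by the function's name.

```python
import re
from typing import Callable, List, Tuple


def make_split(pattern: str) -> Callable[[str], List[Tuple[int, int]]]:
    """Hypothetical factory: build a split function from a regex pattern."""
    compiled = re.compile(pattern)  # compiled once, reused on every call

    def split(text: str) -> List[Tuple[int, int]]:
        return [(m.start(), m.end()) for m in compiled.finditer(text)]

    return split


split_words = make_split(r"\w+")
split_keep_dots = make_split(r"[\w.]+")
text = "vit. D"
print(split_words(text))      # [(0, 3), (5, 6)]
print(split_keep_dots(text))  # [(0, 4), (5, 6)]
```

Changing the pattern changes what counts as a token: the second pattern keeps the trailing dot attached to "vit".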

Order tokens

iamsystem.tokenize_and_order_decorator(tokenize: Callable[[str], Sequence[TokenT]]) → Callable[[str], Sequence[TokenT]][source]

Decorate a tokenize function: the tokens are sorted alphabetically by their label.

Parameters: tokenize – a tokenize function to decorate.
Returns: the decorated tokenize function.
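
A decorator with this behaviour can be sketched as follows. order_by_label, SimpleToken and whitespace_tokenize are hypothetical names used only for illustration; the tokens keep their original index i after sorting.

```python
import functools
from typing import Callable, List, NamedTuple, Sequence


class SimpleToken(NamedTuple):
    """Hypothetical minimal token: a label and its index in the text."""
    label: str
    i: int


def order_by_label(
    tokenize: Callable[[str], Sequence[SimpleToken]]
) -> Callable[[str], List[SimpleToken]]:
    @functools.wraps(tokenize)
    def wrapper(text: str) -> List[SimpleToken]:
        # Sort the tokens alphabetically by their label.
        return sorted(tokenize(text), key=lambda token: token.label)

    return wrapper


@order_by_label
def whitespace_tokenize(text: str) -> List[SimpleToken]:
    return [SimpleToken(label=word, i=i) for i, word in enumerate(text.split())]


tokens = whitespace_tokenize("level calcium blood")
print([t.label for t in tokens])  # ['blood', 'calcium', 'level']
```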

Stopwords classes

IStopwords

class iamsystem.IStopwords(*args, **kwargs)[source]

Bases: Protocol[TokenT]

Stopwords Interface.

is_token_a_stopword(token: TokenT) → bool[source]

Check if a token is a stopword.

Parameters: token – a generic Token that implements IToken protocol.
Returns: true if this token is a stopword.

Stopwords

class iamsystem.Stopwords(stopwords: Optional[Iterable[str]] = None)[source]

Bases: SimpleStopwords[TokenT]

A simple implementation of IStopwords protocol.

__init__(stopwords: Optional[Iterable[str]] = None)[source]

Create a Stopwords instance to store stopwords.

Parameters: stopwords – a set of stopwords. Default to None.

add(words: Iterable[str]) → None[source]

Add stopwords.

Parameters: words – a list of string.
Returns: None

is_stopword(word: str) → bool[source]: True if, after lowercasing, the word belongs to the stopwords set

property stopwords: Get the set of stopwords.

NegativeStopwords

class iamsystem.NegativeStopwords(words_to_keep: Optional[Iterable[str]] = None)[source]

Bases: IStopwords[TokenT]

Like a negative image (a total inversion, in which light areas appear dark and vice versa), every token is a stopword until proven otherwise.

__init__(words_to_keep: Optional[Iterable[str]] = None)[source]

Create a NegativeStopwords instance to store words to keep and/or define functions that check if a word should be kept.

Parameters: words_to_keep – a set of words not to ignore.

add_fun_is_a_word_to_keep(fun: Callable[[TokenT], bool]) → None[source]

Add a function that checks if a word should be kept.

Parameters: fun – a Callable that takes a token as a parameter and returns a boolean.
Returns: None.

add_words(words_to_keep: Iterable[str]) → None[source]

Add words not to be ignored.

Parameters: words_to_keep – a list of string.
Returns: None

is_token_a_stopword(token: TokenT) → bool[source]

Check whether a token is not a token to keep.

Parameters: token – a token.
Returns: False if the token’s lowercase form belongs to the set of words to keep or if a function added with add_fun_is_a_word_to_keep() returns True.
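
The "stopword until proven otherwise" idea can be sketched on plain strings. KeepListStopwords is a hypothetical class that mirrors the description above; the real class works on tokens rather than strings.

```python
from typing import Callable, Iterable, List, Optional, Set


class KeepListStopwords:
    """Hypothetical sketch: every word is a stopword unless it is kept."""

    def __init__(self, words_to_keep: Optional[Iterable[str]] = None) -> None:
        self._keep: Set[str] = {w.lower() for w in (words_to_keep or [])}
        self._keep_funs: List[Callable[[str], bool]] = []

    def add_fun_is_a_word_to_keep(self, fun: Callable[[str], bool]) -> None:
        self._keep_funs.append(fun)

    def is_stopword(self, word: str) -> bool:
        if word.lower() in self._keep:
            return False
        # Not a stopword if any registered function asks to keep it.
        return not any(fun(word) for fun in self._keep_funs)


stopwords = KeepListStopwords(["calcium", "level"])
stopwords.add_fun_is_a_word_to_keep(str.isdigit)  # also keep numbers
kept = [w for w in "the calcium level was 12".split() if not stopwords.is_stopword(w)]
print(kept)  # ['calcium', 'level', '12']
```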

NoStopwords

class iamsystem.NoStopwords[source]

Bases: SimpleStopwords[TokenT]

Utility class. Class to use when no stopwords are used.

is_stopword(word: str) → bool[source]: Return False.

is_token_a_stopword(token: TokenT) → bool[source]: Return False.

Fuzzy algorithms

Abstract Base classes

FuzzyAlgo

class iamsystem.FuzzyAlgo(name: str)[source]

Bases: Generic[TokenT], ABC

Fuzzy Algorithm base class.

NO_SYN: Iterable[Tuple[str, ...]] = []: Default value to return by a fuzzy algorithm if no synonym found.

abstract get_synonyms(tokens: Sequence[TokenT], token: TokenT, transitions: Iterable[StateTransition]) → List[Tuple[Tuple[str, ...], str]][source]

Main API function to retrieve all synonyms provided by a fuzzy algorithm.

Parameters

tokens – the sequence of tokens of the document. Useful when the fuzzy algorithm needs context, namely the tokens around the token of interest.
token – the token of this sequence for which synonyms are expected.
transitions – the state transitions in which the algorithm currently is. Useful if the fuzzy algorithm needs to know the next or possible transitions.

Returns

0 to many synonyms (SynAlgo type).

static word_to_syn(word: str) → Tuple[str, ...][source]

Utility function to transform a string to expected SynType.

Parameters: word – a word synonym produced by the algorithm. Ex: word=’insuffisance’ for token ‘ins’.
Returns: SynType, the expected output format.

static words_seq_to_syn(words: Sequence[str]) → Tuple[str, ...][source]

Utility function to transform a sequence of string to the expected output type.

Parameters: words – a sequence of words produced by the algorithm. Ex: words=[‘insuffisance’, ‘cardiaque’] for the token ‘ic’.
Returns: SynType, the expected output format.

ContextFreeAlgo

class iamsystem.ContextFreeAlgo(name: str)[source]

Bases: FuzzyAlgo[TokenT], ABC

A FuzzyAlgo that doesn’t take into account context, only the current token.

get_synonyms(tokens: Sequence[TokenT], token: TokenT, transitions: Iterable[StateTransition]) → List[Tuple[Tuple[str, ...], str]][source]: Delegate to get_syns_of_token.

abstract get_syns_of_token(token: TokenT) → Iterable[Tuple[str, ...]][source]: Returns synonyms of this token.

NormLabelAlgo

class iamsystem.NormLabelAlgo(name: str)[source]

Bases: ContextFreeAlgo[TokenT], INormLabelAlgo, ABC

A FuzzyAlgo that uses only the normalized label of a token. These fuzzy algorithms can be put in cache to avoid calling them multiple times. See CacheFuzzyAlgos.

get_syns_of_token(token: TokenT) → Iterable[Tuple[str, ...]][source]: Delegate to get_syns_of_word.

abstract get_syns_of_word(word: str) → Iterable[Tuple[str, ...]][source]: Returns synonyms of this word (e.g. the normalized label of a token).

CacheFuzzyAlgos

class iamsystem.CacheFuzzyAlgos(name: str = 'Cache')[source]

Bases: FuzzyAlgo, Generic[TokenT]

A FuzzyAlgo that provides a cache for NormLabelAlgo algorithms. Since these algorithms don’t depend on context, their output can be cached to avoid calling them multiple times.

__init__(name: str = 'Cache')[source]

Create a fuzzy algorithm to allow a partial match between a text token and a keyword token.

Parameters: name – algorithm’s name.

add_algo(algo: INormLabelAlgo) → None[source]: Add NormLabelAlgo.

empty_cache() → None[source]: Empty the cache. Done automatically when an algorithm is added.

get_synonyms(tokens: Sequence[IToken], token: TokenT, transitions: Iterable[StateTransition]) → List[Tuple[Tuple[str, ...], str]][source]: Overrides. Implements superclass abstract method.

get_syns_of_word(word: str) → List[Tuple[Tuple[str, ...], str]][source]: Retrieve all synonyms of fuzzy algorithms from cache or by calling them once.

property max_nb_of_words: The maximum number of words to put in cache. Default: 100,000 words.
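
The caching idea can be sketched with a dictionary keyed by the word. CachedSynonyms is a hypothetical class, the algorithms are plain functions here, and the behaviour once the cache is full (compute without storing) is an assumption.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Syn = Tuple[str, ...]


class CachedSynonyms:
    """Hypothetical sketch of a cache in front of context-free algorithms."""

    def __init__(
        self, algos: Iterable[Callable[[str], List[Syn]]], max_nb_of_words: int = 100_000
    ) -> None:
        self._algos = list(algos)
        self._cache: Dict[str, List[Syn]] = {}
        self._max_nb_of_words = max_nb_of_words

    def get_syns_of_word(self, word: str) -> List[Syn]:
        if word in self._cache:
            return self._cache[word]
        syns = [syn for algo in self._algos for syn in algo(word)]
        if len(self._cache) < self._max_nb_of_words:
            self._cache[word] = syns  # assumption: stop storing when full
        return syns


calls: List[str] = []


def upper_algo(word: str) -> List[Syn]:
    calls.append(word)  # record each real call to the algorithm
    return [(word.upper(),)]


cache = CachedSynonyms([upper_algo])
first = cache.get_syns_of_word("ic")
second = cache.get_syns_of_word("ic")
print(first, second, calls)  # the algorithm is called only once
```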

Abbreviations

class iamsystem.Abbreviations(name: str, token_is_an_abbreviation: ~typing.Callable[[~iamsystem.tokenization.api.TokenT], bool] = <function Abbreviations.<lambda>>)[source]

Bases: ContextFreeAlgo[TokenT], INormLabelAlgo

A FuzzyAlgo to handle abbreviations. This class doesn’t take into account the context of a document to return a long form.

__init__(name: str, token_is_an_abbreviation: ~typing.Callable[[~iamsystem.tokenization.api.TokenT], bool] = <function Abbreviations.<lambda>>)[source]

Create an instance to store abbreviations.

Parameters

name – a name given to this algorithm. (ex: ‘medical abbs’)
token_is_an_abbreviation – a function that verifies whether a token is an abbreviation (ex: checks that all letters are uppercase). The function is called before the dictionary look-up is performed to retrieve long forms. Default: no checks performed, the function always returns True.

add(short_form: str, long_form: str, tokenizer: ITokenizer) → None[source]

Add an abbreviation.

Parameters

short_form – an abbreviation short form (ex: CHF).
long_form – an abbreviation long form. (ex: congestive heart failure).
tokenizer – an ITokenizer to tokenize the long form. It is recommended to use your Matcher tokenizer.

Returns

None.
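
The look-up described above (an optional check, then a dictionary access) can be sketched as follows. SimpleAbbreviations is a hypothetical class; splitting the long form on whitespace and lowercasing the dictionary keys are assumptions standing in for a real tokenizer.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List, Tuple


class SimpleAbbreviations:
    """Hypothetical sketch: short form -> tokenized long form(s)."""

    def __init__(self, is_abbreviation: Callable[[str], bool] = lambda word: True) -> None:
        self._is_abbreviation = is_abbreviation
        self._long_forms: DefaultDict[str, List[Tuple[str, ...]]] = defaultdict(list)

    def add(self, short_form: str, long_form: str) -> None:
        # Assumption: whitespace split and lowercasing replace the tokenizer.
        self._long_forms[short_form.lower()].append(tuple(long_form.lower().split()))

    def get_long_forms(self, word: str) -> List[Tuple[str, ...]]:
        if not self._is_abbreviation(word):  # check before the dictionary look-up
            return []
        return list(self._long_forms.get(word.lower(), []))


abbs = SimpleAbbreviations(is_abbreviation=str.isupper)
abbs.add("CHF", "congestive heart failure")
print(abbs.get_long_forms("CHF"))  # [('congestive', 'heart', 'failure')]
print(abbs.get_long_forms("chf"))  # [] because the uppercase check fails
```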

add_tokenized_long_form(short_form, long_form: Sequence[str]) → None[source]: Add an abbreviation already tokenized.

get_syns_of_token(token: TokenT) → Iterable[Tuple[str, ...]][source]: Return the abbreviation long form(s).

get_syns_of_word(word: str) → Iterable[Tuple[str, ...]][source]: Return the abbreviation long form(s).

FuzzyRegex

class iamsystem.FuzzyRegex(name: str, pattern: str, pattern_name: str)[source]

Bases: ContextFreeAlgo, INormLabelAlgo

A FuzzyAlgo to handle regular expressions. Useful when one or multiple tokens of a keyword need to be matched to a regular expression.

__init__(name: str, pattern: str, pattern_name: str)[source]

Create a FuzzyRegex instance.

Parameters

name – a name given to this algorithm.
pattern – a regular expression.
pattern_name – a name given to this pattern (ex: ‘numval’) that is also a token of an IKeyword.

get_syns_of_token(token: TokenT) → Iterable[Tuple[str, ...]][source]: Return the pattern_name if this token matches the regular expression.

get_syns_of_word(word: str) → Iterable[Tuple[str, ...]][source]: Return the pattern_name if this word matches it.

replace_pattern_in_keyword(keyword: IKeyword, tokenizer: ITokenizer) → IKeyword[source]: Utility function to replace keyword’s tokens that match the pattern by the pattern name.

token_matches_pattern(token: TokenT) → bool[source]: Return True if this token matches this instance’s pattern.
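
The idea of mapping any token that matches a pattern to the pattern name can be sketched with the re module. SimpleFuzzyRegex is a hypothetical class, and requiring a full match of the word is an assumption.

```python
import re
from typing import List, Tuple


class SimpleFuzzyRegex:
    """Hypothetical sketch: return pattern_name for words matching a regex."""

    def __init__(self, pattern: str, pattern_name: str) -> None:
        self._regex = re.compile(pattern)
        self.pattern_name = pattern_name

    def get_syns_of_word(self, word: str) -> List[Tuple[str, ...]]:
        # Assumption: the whole word must match the pattern.
        if self._regex.fullmatch(word):
            return [(self.pattern_name,)]
        return []


numval = SimpleFuzzyRegex(pattern=r"\d+([.,]\d+)?", pattern_name="numval")
print(numval.get_syns_of_word("2.1"))    # [('numval',)]
print(numval.get_syns_of_word("level"))  # []
```

With such a synonym, a keyword written with the token ‘numval’ can match any numeric value in a document.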

WordNormalizer

class iamsystem.WordNormalizer(name: str, norm_fun: Callable[[str], str])[source]

Bases: NormLabelAlgo

A FuzzyAlgo to handle normalization techniques such as stemming and lemmatization.

__init__(name: str, norm_fun: Callable[[str], str])[source]

Create an instance that will store the normalized tokens of a set of IKeyword.

Parameters

name – a name given to this algorithm (ex: ‘english stemmer’).
norm_fun – a normalizing function, for example a stemming function or lemmatization function.

add_words(words: Iterable[str]) → None[source]

A list of possible word synonyms, in general all the tokens of your keywords. An easy way to provide these tokens is to call get_keywords_unigrams() of the matcher.

Parameters: words – A list of words to normalize and store.
Returns: None.

get_syns_of_word(word: str) → Iterable[Tuple[str, ...]][source]

Return all the words that have the same normalized form as this word.

For example, if the normalize function is an english stemmer, and you provided add_words=[“eating”], this instance stored the stem “eat” associated to the word “eating”. Then, if a document contains the token “eats”, since the stem is the same, this function returns the synonym “eating”.

Parameters: word – a string, i.e. a word from a document.
Returns: word synonyms and algorithm name.
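
The "eating"/"eats" example can be reproduced with a small stand-alone sketch. SimpleWordNormalizer and toy_stem are hypothetical; toy_stem is a deliberately naive suffix stripper used only to make the example runnable, not a real stemmer.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Iterable, List, Set


def toy_stem(word: str) -> str:
    """Naive stand-in for a stemmer: strip a trailing 'ing' or 's'."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


class SimpleWordNormalizer:
    """Hypothetical sketch: index words by their normalized form."""

    def __init__(self, norm_fun: Callable[[str], str]) -> None:
        self._norm_fun = norm_fun
        self._norm_to_words: DefaultDict[str, Set[str]] = defaultdict(set)

    def add_words(self, words: Iterable[str]) -> None:
        for word in words:
            self._norm_to_words[self._norm_fun(word)].add(word)

    def get_syns_of_word(self, word: str) -> List[str]:
        # Words sharing the same normalized form are returned as synonyms.
        return sorted(self._norm_to_words.get(self._norm_fun(word), set()))


normalizer = SimpleWordNormalizer(norm_fun=toy_stem)
normalizer.add_words(["eating"])
print(normalizer.get_syns_of_word("eats"))  # ['eating']
```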

SpellWise

SpellWiseWrapper

class iamsystem.SpellWiseWrapper(measure: Union[str, ESpellWiseAlgo], max_distance: int, min_nb_char=5, words2ignore: Optional[IWords2ignore] = None, name: str = None)[source]

Bases: StringDistance

A FuzzyAlgo that wraps an algorithm from the spellwise library.

__init__(measure: Union[str, ESpellWiseAlgo], max_distance: int, min_nb_char=5, words2ignore: Optional[IWords2ignore] = None, name: str = None)[source]

Create an instance to take advantage of a spellwise algorithm.

Parameters

measure – The measure string or a value selected from the ESpellWiseAlgo enumerated list.
max_distance – maximum edit distance (see spellwise documentation).
min_nb_char – the minimum number of characters a word must have in order not to be ignored.
words2ignore – words that must be ignored by the algorithm to avoid false positives, for example English vocabulary words.
name – a name given to this algorithm. Default: spellwise algorithm’s name.

add_words(words: Iterable[str], warn=False) → None[source]

A list of possible word synonyms, in general all the tokens of your keywords. An easy way to provide these tokens is to call get_keywords_unigrams() method after you added your keywords to the matcher instance.

Parameters

words – A list of possible synonyms.
warn – raise a warning if a word added is ignored. Default False.

Returns

None.

get_syns_of_word(word: str) → Iterable[Tuple[str, ...]][source]: Compute the string distance if the word is not to be ignored, and return the keywords’ unigrams within the maximum distance of that word.

property max_distance: Maximum edit distance (see spellwise documentation).
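To make the roles of max_distance, min_nb_char and words2ignore concrete, here is a minimal pure-Python sketch. It does not call spellwise: levenshtein and close_words are hypothetical helpers written for this illustration, and words2ignore is modeled as a plain set rather than an IWords2ignore instance.

```python
from typing import AbstractSet, Iterable, List


def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute or keep
                )
            )
        previous = current
    return previous[-1]


def close_words(
    word: str,
    vocabulary: Iterable[str],
    max_distance: int,
    min_nb_char: int = 5,
    words2ignore: AbstractSet[str] = frozenset(),
) -> List[str]:
    """Return vocabulary words within max_distance edits of `word`."""
    # Short words and explicitly ignored words are skipped to limit false positives.
    if len(word) < min_nb_char or word in words2ignore:
        return []
    return [v for v in vocabulary if levenshtein(word, v) <= max_distance]


unigrams = ["insuffisance", "cardiaque"]
found = close_words("insufisance", unigrams, max_distance=1)
```

In this sketch a four-letter token such as "card" is never compared, because it is shorter than min_nb_char.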

ESpellWiseAlgo

class iamsystem.ESpellWiseAlgo(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: Enum

Enumerated list of spellwise library algorithms. See spellwise documentation for more information.

CAVERPHONE_1 = <class 'spellwise.algorithms.caverphone_one.CaverphoneOne'>

CAVERPHONE_2 = <class 'spellwise.algorithms.caverphone_two.CaverphoneTwo'>

EDITEX = <class 'spellwise.algorithms.editex.Editex'>

LEVENSHTEIN = <class 'spellwise.algorithms.levenshtein.Levenshtein'>

SOUNDEX = <class 'spellwise.algorithms.soundex.Soundex'>

TYPOX = <class 'spellwise.algorithms.typox.Typox'>

SimString

SimStringWrapper

class iamsystem.SimStringWrapper(words: Iterable[str], measure: Union[str, ESimStringMeasure] = ESimStringMeasure.JACCARD, name: str = None, threshold=0.5, min_nb_char=5, words2ignore: Optional[IWords2ignore] = None)[source]

Bases: StringDistance

SimString algorithm interface.

__init__(words: Iterable[str], measure: Union[str, ESimStringMeasure] = ESimStringMeasure.JACCARD, name: str = None, threshold=0.5, min_nb_char=5, words2ignore: Optional[IWords2ignore] = None)[source]

Create a fuzzy algorithm that calls simstring.

Parameters

words – the words to index in the simstring database. An easy way to provide these words is to call get_keywords_unigrams().
name – a name given to this algorithm. Default measure name.
measure – a similarity measure string or selected from ESimStringMeasure. Default JACCARD.
threshold – similarity measure threshold.
min_nb_char – the minimum number of characters a word must have in order not to be ignored.
words2ignore – words that must be ignored by the algorithm to avoid false positives, for example English vocabulary words.

get_syns_of_word(word: str) → Iterable[Tuple[str, ...]][source]: Retrieve simstring similar words.
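The effect of the threshold parameter can be illustrated without simstring itself. The sketch below assumes character trigrams with one padding space on each side as features; that feature choice and the helper names char_ngrams, jaccard and similar_words are assumptions made for this example, not the library's implementation.

```python
from typing import Iterable, List, Set


def char_ngrams(word: str, n: int = 3) -> Set[str]:
    """Character n-grams of a word padded with one space on each side."""
    padded = f" {word} "
    return {padded[i : i + n] for i in range(len(padded) - n + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similar_words(
    word: str,
    indexed_words: Iterable[str],
    threshold: float = 0.5,
    min_nb_char: int = 5,
) -> List[str]:
    """Return indexed words whose similarity to `word` reaches the threshold."""
    if len(word) < min_nb_char:
        return []  # short words are ignored
    features = char_ngrams(word)
    return [
        candidate
        for candidate in indexed_words
        if jaccard(features, char_ngrams(candidate)) >= threshold
    ]


indexed = ["insuffisance", "cardiaque"]
matches = similar_words("insufisance", indexed, threshold=0.5)
```

Raising the threshold makes matching stricter; lowering it retrieves more candidates at the cost of more false positives.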

ESimStringMeasure

class iamsystem.ESimStringMeasure(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: Enum

Enumerated list of simstring measures.

COSINE = 'cosine'

DICE = 'dice'

EXACT = 'exact'

JACCARD = 'jaccard'

OVERLAP = 'overlap'

Brat

Formatter

EBratFormatters

class iamsystem.EBratFormatters(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: Enum

An enumerated list of available Brat Formatters.

CONTINUOUS_SEQ = <iamsystem.brat.formatter.ContSeqFormatter object>: Merge a continuous sequence of tokens but ignore stopwords.

CONTINUOUS_SEQ_STOP = <iamsystem.brat.formatter.ContSeqStopFormatter object>: Merge a continuous sequence of tokens with stopwords.

DEFAULT = <iamsystem.brat.formatter.ContSeqFormatter object>: Default to CONTINUOUS_SEQ.

SPAN = <iamsystem.brat.formatter.SpanFormatter object>: A Brat annotation from first token start-offsets to last token end-offsets.

TOKEN = <iamsystem.brat.formatter.TokenFormatter object>: A fragment for each token.

ContSeqFormatter

class iamsystem.ContSeqFormatter[source]

Bases: IBratFormatter

Default Brat Formatter: annotate a document by selecting continuous sequences of tokens but ignore stopwords.

__init__()

get_text_and_offsets(annot: IAnnotation) → Tuple[str, str][source]: Return tokens’ labels and token’s offsets (merge if continuous)

ContSeqStopFormatter

class iamsystem.ContSeqStopFormatter(remove_trailing_stop=True)[source]

Bases: IBratFormatter

A Brat formatter that takes into account stopwords: annotate a document by selecting continuous sequences of tokens/stopwords.

__init__(remove_trailing_stop=True)[source]

Create a brat formatter.

Parameters: remove_trailing_stop – if True, trailing stopwords in a discontinuous sequence will be removed. Ex: [[‘North’, ‘and’], [‘America’]] -> [[‘North’], [‘America’]]
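The trimming behaviour can be sketched as follows. This is a hypothetical helper operating on lists of token labels, written only to illustrate the example above; the formatter itself works on annotation tokens and offsets.

```python
from typing import AbstractSet, List


def remove_trailing_stopwords(
    groups: List[List[str]], stopwords: AbstractSet[str]
) -> List[List[str]]:
    """Drop stopwords found at the end of each continuous group of tokens."""
    cleaned = []
    for group in groups:
        trimmed = list(group)
        while trimmed and trimmed[-1].lower() in stopwords:
            trimmed.pop()
        if trimmed:  # a group made only of stopwords disappears
            cleaned.append(trimmed)
    return cleaned


groups = [["North", "and"], ["America"]]
trimmed_groups = remove_trailing_stopwords(groups, stopwords={"and"})
```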

get_text_and_offsets(annot: IAnnotation) → Tuple[str, str][source]

Return text (document substring) and annotation’s offsets in the Brat format.

Parameters: annot – an annotation.
Returns: A text span and its offsets: ‘The start-offset is the index of the first character of the annotated span in the text (“.txt” file), i.e. the number of characters in the document preceding it. The end-offset is the index of the first character after the annotated span.’
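The offset convention quoted above, together with the merging of continuous tokens, can be sketched like this. Tok and brat_text_and_offsets are hypothetical names for this illustration; the sketch assumes that tokens with consecutive positions form one fragment and that fragments are separated by ‘;’ as in the Brat standoff format.

```python
from typing import List, NamedTuple, Tuple


class Tok(NamedTuple):
    i: int      # position of the token in the document
    start: int  # index of the token's first character
    end: int    # index of the first character after the token


def brat_text_and_offsets(text: str, tokens: List[Tok]) -> Tuple[str, str]:
    """Merge tokens with consecutive positions, then format text and offsets."""
    groups: List[List[Tok]] = []
    for tok in sorted(tokens, key=lambda t: t.i):
        if groups and tok.i == groups[-1][-1].i + 1:
            groups[-1].append(tok)  # continues the current fragment
        else:
            groups.append([tok])    # starts a new fragment
    spans = [(group[0].start, group[-1].end) for group in groups]
    offsets = ";".join(f"{start} {end}" for start, end in spans)
    span_text = " ".join(text[start:end] for start, end in spans)
    return span_text, offsets


document = "North and South America"
north, south, america = Tok(0, 0, 5), Tok(2, 10, 15), Tok(3, 16, 23)
discontinuous = brat_text_and_offsets(document, [north, america])
continuous = brat_text_and_offsets(document, [south, america])
```

Here the end-offset 23 equals the length of the document, since ‘America’ is its last word.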

TokenFormatter

class iamsystem.TokenFormatter[source]

Bases: IBratFormatter

Annotate a document by creating (start,end) offsets for each token (in comparison to ContSeqFormatter, it doesn’t merge continuous sequences).

__init__()

get_text_and_offsets(annot: IAnnotation) → Tuple[str, str][source]: Return tokens’ labels and each token’s offsets (continuous tokens are not merged).

SpanFormatter

class iamsystem.SpanFormatter[source]

Bases: IBratFormatter

A simple Brat formatter that only uses the start and end offsets of an annotation.

__init__()

get_text_and_offsets(annot: IAnnotation) → Tuple[str, str][source]: Return text, offsets by start and end offsets of the annotation.
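The difference between a per-token strategy and a span strategy is easiest to see on a discontinuous annotation. The two functions below are hypothetical stand-ins that take raw (start, end) pairs, not the formatter classes themselves.

```python
from typing import List, Tuple

Offsets = List[Tuple[int, int]]


def token_fragments(text: str, token_offsets: Offsets) -> Tuple[str, str]:
    """One Brat fragment per token, fragments separated by ';'."""
    offsets = ";".join(f"{start} {end}" for start, end in token_offsets)
    label = " ".join(text[start:end] for start, end in token_offsets)
    return label, offsets


def single_span(text: str, token_offsets: Offsets) -> Tuple[str, str]:
    """One fragment from the first token's start to the last token's end."""
    start = min(s for s, _ in token_offsets)
    end = max(e for _, e in token_offsets)
    return text[start:end], f"{start} {end}"


document = "North and South America"
annotation_offsets = [(0, 5), (16, 23)]  # tokens 'North' and 'America'
per_token = token_fragments(document, annotation_offsets)
whole_span = single_span(document, annotation_offsets)
```

The span strategy covers the words between the matched tokens, which is simpler to display but less precise.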

BratDocument

class iamsystem.BratDocument(brat_formatter: IBratFormatter = None)[source]

Bases: object

Class representing a Brat Document containing Brat’s annotations, namely Brat Entity and Brat Note in this package. A BratDocument should be linked to a single text document. Entities and notes can be serialized in a text file with ‘ann’ extension, one per line. See https://brat.nlplab.org/standoff.html

__init__(brat_formatter: IBratFormatter = None)[source]

Create a Brat Document.

Parameters: brat_formatter – a strategy to create the spans of Brat annotations, such as merging continuous sequences of tokens. The default BratFormatter creates a Brat span for each individual token.

add_annots(annots: List[IAnnotation], keyword_attr: str = None, brat_type: str = None) → None[source]

Add iamsystem annotations to convert them to Brat format.

Parameters

annots – a list of Annotation, Matcher output.
keyword_attr – the attribute name of an IKeyword that stores brat_type. Default to None. If None, brat_type parameter must be used.
brat_type – A string, the Brat entity type for all these annotations. Default to None. If None, keyword_attr parameter must be used.

Returns

None

add_entity(brat_type: str, offsets: str, text: str) → None[source]

Add a Brat Entity.

Parameters

brat_type – A Brat entity type (see Brat documentation).
offsets – a list of (start,end) annotation offsets. See IOffsets. A list is expected since the tokens can be discontinuous.
text – document substring using (start,end) offsets (not the document itself).

Returns

None

entities_to_string() → str[source]: Brat entities in the Brat format ready to be serialized to ‘.ann’ text file.

get_entities() → Iterable[BratEntity][source]: An iterable of Brat entities.

get_notes() → Iterable[BratNote][source]: An iterable of Brat notes.

notes_to_string() → str[source]: Brat notes in the Brat format ready to be serialized to ‘.ann’ text file.

BratEntity

class iamsystem.BratEntity(entity_id: str, brat_type: str, offsets: str, text: str)[source]

Bases: object

Class representing a Brat Entity. https://brat.nlplab.org/standoff.html: ‘Each entity annotation has a unique ID and is defined by type (e.g. Person or Organization) and the span of characters containing the entity mention (represented as a “start end” offset pair).’

Format: ID TYPE START END[;START END]* TEXT.

__init__(entity_id: str, brat_type: str, offsets: str, text: str)[source]

Create a Brat Entity.

Parameters

entity_id – a unique ID (^T[0-9]+$).
brat_type – A Brat entity type (see Brat documentation).
offsets – (start,end) offsets.
text – document substring using (start,end) offsets.
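As an illustration of the format line above, the sketch below builds one entity line. The Brat standoff format separates the ID, the ‘TYPE offsets’ group and the text with tab characters; entity_line is a hypothetical helper, not the BratEntity class, and the ID check simply applies the pattern given for entity_id.

```python
import re

ENTITY_ID = re.compile(r"^T[0-9]+$")


def entity_line(entity_id: str, brat_type: str, offsets: str, text: str) -> str:
    """Format 'ID<TAB>TYPE START END[;START END]*<TAB>TEXT'."""
    if not ENTITY_ID.match(entity_id):
        raise ValueError(f"invalid entity id: {entity_id!r}")
    return f"{entity_id}\t{brat_type} {offsets}\t{text}"


line = entity_line("T1", "Location", "0 5;16 23", "North America")
```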

BratNote

class iamsystem.BratNote(note_id: str, ref_id: str, note: str)[source]

Bases: object

Class representing a Brat Note. See https://brat.nlplab.org/standoff.html. Brat notes are used to store additional information on a detected entity. Format: #ID TYPE REFID NOTE

__init__(note_id: str, ref_id: str, note: str)[source]

Create a Brat Note.

Parameters

note_id – a unique ID (^#[0-9]+$)
ref_id – a unique ID. For a BratEntity, the format is (^T[0-9]+$)
note – any string comment.

TYPE = 'IAMSYSTEM': BratNote type. Replace it with ‘AnnotatorNotes’ to make the notes editable in the Brat interface.
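A note line following the ‘#ID TYPE REFID NOTE’ layout can be sketched as below. note_line is a hypothetical helper; the tab separators follow the Brat standoff format, and the note_type argument stands in for the TYPE class attribute.

```python
import re

NOTE_ID = re.compile(r"^#[0-9]+$")


def note_line(note_id: str, ref_id: str, note: str, note_type: str = "IAMSYSTEM") -> str:
    """Format '#ID<TAB>TYPE REFID<TAB>NOTE'."""
    if not NOTE_ID.match(note_id):
        raise ValueError(f"invalid note id: {note_id!r}")
    return f"{note_id}\t{note_type} {ref_id}\t{note}"


default_note = note_line("#1", "T1", "heart failure (I50)")
editable_note = note_line("#1", "T1", "heart failure (I50)", note_type="AnnotatorNotes")
```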

BratWriter

class iamsystem.BratWriter[source]

Bases: object

Utility class to write IAMsystem annotations in Brat format to a text file.

__init__()

classmethod saveEntities(brat_entities: Iterable[BratEntity], write: Callable[[str], Any]) → None[source]

Write Brat entities.

Parameters

brat_entities – an iterable of Brat entities.
write – a write function (e.g. f.write from ‘with open(filename, ‘w’) as f:’).

Returns

None

classmethod saveNotes(brat_notes: Iterable[BratNote], write: Callable[[str], Any]) → None[source]

Write Brat notes.

Parameters

brat_notes – an iterable of Brat notes.
write – a write function (e.g. f.write from ‘with open(filename, ‘w’) as f:’).

Returns

None
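Because the write parameter is any callable that accepts a string, the destination does not have to be a file. The sketch below shows the idea with an in-memory buffer; save_lines is a hypothetical stand-in for the writer, and ending each record with a newline is an assumption of this example.

```python
import io
from typing import Any, Callable, Iterable


def save_lines(lines: Iterable[str], write: Callable[[str], Any]) -> None:
    """Send each Brat line to the provided write function, one per line."""
    for line in lines:
        write(line + "\n")


entity_lines = [
    "T1\tLocation 0 5;16 23\tNorth America",
    "T2\tLocation 10 23\tSouth America",
]

# Any callable works: here an in-memory buffer instead of an open file.
buffer = io.StringIO()
save_lines(entity_lines, buffer.write)
content = buffer.getvalue()
```

With a real file, the same call would receive f.write from an open ‘.ann’ file.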

spaCy

IAMsystemSpacy

class iamsystem.spacy.IAMsystemSpacy(nlp: ~spacy.language.Language, name: str, keywords: ~typing.Iterable[~iamsystem.keywords.api.IKeyword], fuzzy_algos: ~typing.Iterable[~iamsystem.fuzzy.api.FuzzyAlgo], w: int = 1, remove_nested_annots: bool = True, stopwords: ~iamsystem.stopwords.api.IStopwords[~iamsystem.spacy.token.TokenSpacyAdapter] = None, norm_fun: ~typing.Callable[[str], str] = <function lower_no_accents>, attr: str = 'iamsystem')[source]

Bases: BaseCustomComp

A stateful component. ‘Component factories are callables that take settings and return a pipeline component function. This is useful if your component is stateful and if you need to customize their creation’. See: https://spacy.io/usage/processing-pipelines#custom-components

__init__(nlp: ~spacy.language.Language, name: str, keywords: ~typing.Iterable[~iamsystem.keywords.api.IKeyword], fuzzy_algos: ~typing.Iterable[~iamsystem.fuzzy.api.FuzzyAlgo], w: int = 1, remove_nested_annots: bool = True, stopwords: ~iamsystem.stopwords.api.IStopwords[~iamsystem.spacy.token.TokenSpacyAdapter] = None, norm_fun: ~typing.Callable[[str], str] = <function lower_no_accents>, attr: str = 'iamsystem')[source]

Create a custom spaCy component. Matcher uses spaCy tokenizer to tokenize the documents and the keywords.

Parameters

nlp – a spacy Language.
name – the name of this spaCy component.
keywords – a list of IKeywords to detect in a document.
fuzzy_algos – a list of FuzzyAlgo.
w – Matcher’s window parameter.
remove_nested_annots – whether to remove nested annotations.
stopwords – IStopwords instance.
norm_fun – a function that normalizes the ‘norm_’ attribute of a spaCy token, attribute used by iamsystem.
attr – the attribute to store iamsystem’s annotation in a spaCy span instance.

property matcher: IMatcher[TokenSpacyAdapter]: A matcher that uses spaCy tokenizer.

IAMsystemBuildSpacy

class iamsystem.spacy.IAMsystemBuildSpacy(nlp: Language, name: str, build_params: Dict[Any, Any], serialized_kw: Dict[Any, Any] = None, attr: str = 'iamsystem', norm_fun: Callable[[str], str] = None)[source]

Bases: BaseCustomComp

A serializable custom component.

__init__(nlp: Language, name: str, build_params: Dict[Any, Any], serialized_kw: Dict[Any, Any] = None, attr: str = 'iamsystem', norm_fun: Callable[[str], str] = None)[source]

Create a custom spaCy component. Matcher uses spaCy tokenizer to tokenize the documents and the keywords.

Parameters

nlp – a spacy Language.
name – the name of this spaCy component.
attr – the attribute to store iamsystem’s annotation in a spaCy span instance.
serialized_kw –
a way to import serialized keywords. A dictionary containing 3 fields:
- ‘module’: module name of the class to import, e.g. ‘iamsystem’.
- ‘class_name’: the Keyword class to import.
- ‘kw’: an iterable of dict created with the asdict() function.
If None, keywords are expected in ‘build_params’.
norm_fun – a function that normalizes the ‘norm_’ attribute of a spaCy token, attribute used by iamsystem. Default to lower case and remove accents.
build_params – build() parameters. The spaCy tokenizer is always used, whatever the tokenizer value.
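The three-field layout of serialized_kw can be sketched with a dataclass. ToyKeyword and the module name ‘my_project.keywords’ are hypothetical, and asdict is assumed here to be dataclasses.asdict; the component would import the class from the named module, whereas this sketch rebuilds the objects directly to show that the round trip is lossless.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, List


@dataclass
class ToyKeyword:
    """Hypothetical keyword class used only for this illustration."""
    label: str
    code: str


keywords: List[ToyKeyword] = [
    ToyKeyword(label="heart failure", code="I50"),
    ToyKeyword(label="renal failure", code="N19"),
]

# The three fields described above; the module name is a placeholder.
serialized_kw: Dict[str, Any] = {
    "module": "my_project.keywords",
    "class_name": ToyKeyword.__name__,
    "kw": [asdict(keyword) for keyword in keywords],
}

# Rebuilding the keywords from the plain dictionaries.
restored = [ToyKeyword(**fields) for fields in serialized_kw["kw"]]
```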

property matcher: IMatcher[TokenSpacyAdapter]: A matcher that uses spaCy tokenizer.

TokenSpacyAdapter

class iamsystem.spacy.TokenSpacyAdapter(spacy_token: ~spacy.tokens.token.Token, norm_fun: ~typing.Callable[[str], str] = <function lower_no_accents>)[source]

Bases: IToken

A custom Token that wraps spaCy’s Token and implements the iamsystem’s IToken interface.

__init__(spacy_token: ~spacy.tokens.token.Token, norm_fun: ~typing.Callable[[str], str] = <function lower_no_accents>)[source]

Create an iamsystem token from a spaCy token.

Parameters

spacy_token – a spacy.tokens.Token instance.
norm_fun – a function that normalizes the ‘norm_’ attribute of a spaCy token, attribute used by iamsystem.
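A norm_fun is any function from str to str. The sketch below shows one way to lowercase a string and strip its accents with the standard library; it is named lower_no_accents_sketch to make clear that it is an illustration and not necessarily how the library’s lower_no_accents default is implemented.

```python
import unicodedata


def lower_no_accents_sketch(text: str) -> str:
    """Lowercase the text and drop combining accent marks."""
    # NFD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


normalized = lower_no_accents_sketch("Insuffisance Rénale")
```

Normalizing both keywords and document tokens with the same function is what lets ‘Rénale’ match ‘renale’.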

IsStopSpacy

class iamsystem.spacy.IsStopSpacy(*args, **kwargs)[source]

Bases: IStopwords[TokenSpacyAdapter]

Stopwords that use spaCy’s ‘is_stop’ token attribute.

__init__(*args, **kwargs)

is_token_a_stopword(token: TokenSpacyAdapter) → bool[source]: Return spaCy’s token attribute ‘is_stop’.

SpacyTokenizer

class iamsystem.spacy.SpacyTokenizer(nlp: Language, norm_fun: Callable[[str], str])[source]

Bases: ITokenizer[TokenSpacyAdapter]

A class that wraps spaCy’s tokenizer.

__init__(nlp: Language, norm_fun: Callable[[str], str])[source]

Create a tokenizer for iamsystem algorithm that uses spaCy’s tokenizer.

Parameters

nlp – a spacy Language.
norm_fun – a function that normalizes the ‘norm_’ attribute of a spaCy token, attribute used by iamsystem algorithm.

tokenize(text: str) → Sequence[TokenSpacyAdapter][source]

Tokenize a text. This function is used only to tokenize the keywords by the matcher since this custom component receives from spaCy the document already tokenized.

Parameters: text – a string to tokenize with spaCy component.
Returns: an ordered sequence of tokens.