API Documentation

Documentation of classes and methods.

Matcher

class iamsystem.Matcher(tokenizer: ~iamsystem.tokenization.api.ITokenizer = <iamsystem.tokenization.tokenize.TokenizerImp object>, stopwords: ~iamsystem.stopwords.api.IStopwords[~iamsystem.tokenization.api.TokenT] = None)[source]

Bases: IMatcher[TokenT]

Main public API to perform semantic annotation (aka entity linking) with the iamsystem algorithm.

__init__(tokenizer: ~iamsystem.tokenization.api.ITokenizer = <iamsystem.tokenization.tokenize.TokenizerImp object>, stopwords: ~iamsystem.stopwords.api.IStopwords[~iamsystem.tokenization.api.TokenT] = None)[source]

Create an IAMsystem matcher to annotate documents. Prefer the build() class method to create a matcher.

Parameters

add_fuzzy_algo(fuzzy_algo: FuzzyAlgo[TokenT]) None[source]

Add a fuzzy algorithm that provides synonyms to help match a token of a document to a token of a keyword.

Parameters

fuzzy_algo – a FuzzyAlgo instance.

Returns

None.

add_keyword(keyword: IKeyword) None[source]

Add a keyword to find in a document.

Parameters

keyword – IKeyword to search in a document.

Returns

None.

add_keywords(keywords: Iterable[Union[str, IKeyword]]) None[source]

Utility function to add multiple keywords.

Parameters

keywords – an iterable of strings (labels) or IKeyword instances to search in a document.

Returns

None.

add_stopwords(words: Iterable[str]) None[source]

Add words (tokens) to be ignored in IKeyword and in documents.

Parameters

words – a list of words to ignore.

Returns

None.

annot_text(text: str) List[IAnnotation[TokenT]][source]

Annotate a document.

Parameters

text – the document to annotate.

Returns

a list of Annotation.

annot_tokens(tokens: Sequence[TokenT]) List[IAnnotation[TokenT]][source]

Annotate a sequence of tokens.

Parameters

tokens – an ordered or unordered sequence of tokens.

Returns

a list of Annotation.

classmethod build(keywords: ~typing.Iterable[~typing.Union[str, ~iamsystem.keywords.api.IKeyword]], tokenizer: ~iamsystem.tokenization.api.ITokenizer = None, stopwords: ~typing.Union[~iamsystem.stopwords.api.IStopwords[~iamsystem.tokenization.api.TokenT], ~typing.Iterable[str]] = <iamsystem.stopwords.simple.NoStopwords object>, w=1, order_tokens=False, negative=False, remove_nested_annots=True, strategy: ~typing.Union[str, ~iamsystem.matcher.strategy.EMatchingStrategy] = EMatchingStrategy.WINDOW, string_distance_ignored_w: ~typing.Optional[~typing.Iterable[str]] = None, abbreviations: ~typing.Optional[~typing.Iterable[~typing.Tuple[str, str]]] = None, spellwise: ~typing.Optional[~typing.List[~typing.Dict[~typing.Any, ~typing.Any]]] = None, simstring: ~typing.Optional[~typing.List[~typing.Dict[~typing.Any, ~typing.Any]]] = None, normalizers: ~typing.Optional[~typing.List[~typing.Dict[~typing.Any, ~typing.Any]]] = None, fuzzy_regex: ~typing.Optional[~typing.List[~typing.Dict[~typing.Any, ~typing.Any]]] = None) Matcher[TokenT][source]

Create an IAMsystem matcher to annotate documents.

Parameters
  • keywords – an iterable of keyword strings or IKeyword instances.

  • tokenizer – an ITokenizer instance responsible for tokenizing and normalizing. Default: french_tokenizer().

  • stopwords – an IStopwords instance or an iterable of strings. If None, defaults to NoStopwords.

  • w – window size: how far apart the tokens of a keyword may be in the document. The default, w=1, means the sequence must be continuous; w=2 means each token can be separated from the next by one other token.

  • order_tokens – order tokens alphabetically if order doesn’t matter in the matching strategy.

  • negative – if True, every unigram not in the keywords is a stopword. Defaults to False. If stopwords are also passed, they are removed from the keywords’ tokens and so remain stopwords.

  • remove_nested_annots – if two annotations overlap, remove the shorter one. Defaults to True.

  • strategy – an IAMsystem matching strategy responsible for searching keywords in a document. Defaults to EMatchingStrategy.WINDOW (WindowMatching).

  • string_distance_ignored_w – words ignored by string distance algorithms to avoid false-positive matches.

  • abbreviations – an iterable of tuples (short_form, long_form).

  • spellwise – an iterable of SpellWiseWrapper init parameters. If ‘string_distance_ignored_w’ is set, these words are passed to the SpellWiseWrapper init function.

  • simstring – an iterable of SimStringWrapper init parameters. If ‘string_distance_ignored_w’ is set, these words are passed to the SimStringWrapper init function.

  • normalizers – an iterable of WordNormalizer init parameters.

  • fuzzy_regex – an iterable of FuzzyRegex init parameters.
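The effect of the w parameter can be illustrated with a minimal sketch in plain Python. It is not iamsystem's implementation: the helper name within_window is hypothetical, and tokens are plain strings rather than IToken objects. It only shows the constraint described above, namely that consecutive matched tokens may be at most w positions apart.

```python
from typing import Sequence


def within_window(doc_tokens: Sequence[str], keyword_tokens: Sequence[str], w: int) -> bool:
    """Return True if keyword_tokens occur in order in doc_tokens with
    consecutive matched tokens at most w positions apart.

    Hypothetical helper that illustrates the window semantics only.
    """

    def search(start: int, k: int, prev: int) -> bool:
        if k == len(keyword_tokens):
            return True
        # The first keyword token may start anywhere; later ones must
        # fall within w positions of the previously matched token.
        stop = len(doc_tokens) if prev < 0 else min(len(doc_tokens), prev + w + 1)
        for i in range(start, stop):
            if doc_tokens[i] == keyword_tokens[k] and search(i + 1, k + 1, i):
                return True
        return False

    return search(0, 0, -1)


doc = ["heart", "acute", "failure"]
print(within_window(doc, ["heart", "failure"], w=1))  # False: "acute" breaks continuity
print(within_window(doc, ["heart", "failure"], w=2))  # True: one token may intervene
```

With w=1 the keyword is found only when its tokens are adjacent; larger windows trade precision for recall.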

property fuzzy_algos: Iterable[FuzzyAlgo[TokenT]]

The fuzzy algorithms used by the algorithm.

Returns

FuzzyAlgo instances responsible for finding possible synonyms for each token of a document.

get_initial_state() INode[source]

Return the initial state from which the iamsystem algorithm starts searching for a sequence of keywords’ tokens.

get_keywords_unigrams() Set[str][source]

Get all the unigrams (single words excluding stopwords) in the keywords.

get_synonyms(tokens: Sequence[TokenT], token: TokenT, transitions: Iterable[StateTransition]) List[Tuple[Tuple[str, ...], List[str]]][source]

Get synonyms of a token with configured fuzzy algorithms.

Parameters
  • tokens – document’s tokens.

  • token – the token for which synonyms are expected.

  • transitions – algorithm’s states.

Returns

tuples of synonyms and fuzzy algorithm’s names.

is_stopword(word: str) bool[source]

Return True if word is a stopword.

is_token_a_stopword(token: TokenT) bool[source]

Check if a token is a stopword.

Parameters

token – a generic token that implements IToken.

Returns

True if the token is a stopword.

property keywords: Collection[IKeyword]

Return the keywords added.

property remove_nested_annots: bool

Whether to remove nested annotations. Default to True.

property stopwords: IStopwords[TokenT]

Return the IStopwords used by the matcher.

property strategy: IMatchingStrategy[TokenT]

Return the matching strategy.

tokenize(text: str) Sequence[TokenT][source]

Tokenize a text with the tokenizer’s instance.

Parameters

text – a document or a keyword.

Returns

A sequence of tokens, the type depends on the tokenizer but must implement IToken protocol.

property tokenizer: ITokenizer[TokenT]

Return the ITokenizer used by the matcher.

property w: int

Return the window parameter of this matcher.

EMatchingStrategy

class iamsystem.EMatchingStrategy(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Enumeration of matching strategies.

LARGE_WINDOW = <iamsystem.matcher.strategy.LargeWindowMatching object>

Same annotations as WINDOW but faster when the window is large.

NO_OVERLAP = <iamsystem.matcher.strategy.NoOverlapMatching object>

No overlapping or nested annotations; the fastest strategy.

WINDOW = <iamsystem.matcher.strategy.WindowMatching object>

Default matching strategy.

Span

class iamsystem.matcher.annotation.Span(tokens: List[TokenT])[source]

Bases: ISpan[TokenT], IOffsets

A class that represents a sequence of tokens in a document.

end: int

end-offset is the index of the last character + 1, that is to say the first character to exclude from the returned substring when slicing with [start:end]

property end_i

The index of the last token within the parent document.

get_text_substring(text: str) str[source]

Return text substring.

start: int

The start offset of the first token.

property start_i

The index of the first token within the parent document.

property tokens: List[TokenT]

The tokens of the document that matched the keywords attribute of this instance.

Returns

an ordered sequence of TokenT, a generic type that implements IToken.

property tokens_label

The concatenation of each token’s label.

property tokens_norm_label

The concatenation of each token’s norm_label.

Annotation

class iamsystem.Annotation(tokens: List[TokenT], algos: List[List[str]], node: INode, stop_tokens: List[TokenT], text: Optional[str] = None)[source]

Bases: Span[TokenT], IAnnotation[TokenT]

Output class of Matcher storing information on the detected entities.

property algos: List[List[str]]

For each token, the list of algorithms that matched. One to several algorithms per token.

annot_to_str(annot: IAnnotation)

A class function that generates a string representation of an annotation.

end: int

end-offset is the index of the last character + 1, that is to say the first character to exclude from the returned substring when slicing with [start:end]

property end_i

The index of the last token within the parent document.

get_text_substring(text: str) str

Return text substring.

get_tokens_algos() Iterable[Tuple[TokenT, List[str]]][source]

Get each token and the list of fuzzy algorithms that matched it.

Returns

an iterable of tuples (token0, [‘algo1’,…]) where token0 is a token and [‘algo1’,…] a list of fuzzy algorithms.

property keywords: Sequence[IKeyword]

The linked entities, IKeyword instances that matched a document’s tokens.

property label

Deprecated. An annotation label; returns the ‘tokens_label’ attribute.

classmethod set_brat_formatter(brat_formatter: Union[EBratFormatters, IBratFormatter])[source]

Change Brat Formatter to change text-span and offsets.

Parameters

brat_formatter – A Brat formatter to produce a different Brat annotation. If None, default to ContSeqFormatter.

Returns

None

start: int

The start offset of the first token.

property start_i

The index of the first token within the parent document.

property stop_tokens: List[TokenT]

The list of stopwords tokens inside the annotation detected by the Matcher stopwords instance.

property text: Optional[str]

Return the annotated text.

to_dict(text: str = None) Dict[str, Any][source]

Return a dictionary representation of this object.

Parameters

text – the document this annotation comes from. Defaults to None.

Returns

A dictionary of relevant attributes.

to_string(text=False, debug=False) str[source]

Get a default string representation of this object.

Parameters
  • text – the document this annotation comes from. Default to None. If set, add the document substring: text[first-token-start-offset : last-token-end-offset].

  • debug – defaults to False. If True, add the sequence of tokens and the fuzzy algorithm names.

Returns

a concatenated string

property tokens: List[TokenT]

The tokens of the document that matched the keywords attribute of this instance.

Returns

an ordered sequence of TokenT, a generic type that implements IToken.

property tokens_label

The concatenation of each token’s label.

property tokens_norm_label

The concatenation of each token’s norm_label.

rm_nested_annots

iamsystem.rm_nested_annots(annots: List[Annotation], keep_ancestors=False)[source]

In case of two nested annotations, remove the shorter one. For example, if we have “prostate” and “prostate cancer” annotations, the “prostate” annotation is removed.

Parameters
  • annots – a list of annotations.

  • keep_ancestors – Default to False. Whether to keep the nested annotations that are ancestors and remove only other cases.

Returns

a filtered list of annotations.
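As a rough illustration of the default behaviour (keep_ancestors=False), the sketch below removes any span that lies inside a longer one. It is a simplified stand-in, not the library's code: annotations are reduced to hypothetical (start, end, label) tuples based on character offsets, and the function name drop_nested is an assumption.

```python
from typing import List, Tuple

# (start, end, label): a simplified stand-in for an annotation
SpanT = Tuple[int, int, str]


def drop_nested(spans: List[SpanT]) -> List[SpanT]:
    """Remove every span that is contained within a strictly longer span."""
    kept = []
    for s in spans:
        nested = any(
            o is not s
            and o[0] <= s[0]
            and s[1] <= o[1]
            and (o[1] - o[0]) > (s[1] - s[0])
            for o in spans
        )
        if not nested:
            kept.append(s)
    return kept


spans = [(0, 8, "prostate"), (0, 15, "prostate cancer")]
print(drop_nested(spans))  # [(0, 15, 'prostate cancer')]
```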

replace_annots

iamsystem.replace_annots(text: str, annots: Sequence[Annotation], new_labels: Sequence[str])[source]

Replace each annotation in a document (text parameter) by a new label. Warning: an annotation is ignored if overlapped by another one.

Parameters
  • text – the document the annotations come from.

  • annots – an ordered sequence of annotations.

  • new_labels – one new label per annotation; must have the same length as annots.

Returns

a new document.
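The replacement logic can be pictured with the following sketch, which works on hypothetical (start, end) character spans instead of Annotation objects. The overlap policy shown here (skip a span that starts before the end of the previously replaced one) is an assumption made for illustration; the library's exact rule may differ.

```python
from typing import Sequence, Tuple


def replace_spans(text: str, spans: Sequence[Tuple[int, int]], new_labels: Sequence[str]) -> str:
    """Replace each (start, end) span of text by the label at the same position."""
    if len(spans) != len(new_labels):
        raise ValueError("one label per span expected")
    parts = []
    cursor = 0
    for (start, end), label in zip(spans, new_labels):
        if start < cursor:
            # Overlaps a span already replaced: skip it (assumed policy).
            continue
        parts.append(text[cursor:start])
        parts.append(label)
        cursor = end
    parts.append(text[cursor:])
    return "".join(parts)


text = "history of prostate cancer"
print(replace_spans(text, [(11, 26)], ["C61"]))  # history of C61
```

The label "C61" is only an illustrative replacement string.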

Keyword and subclasses

IKeyword

class iamsystem.IKeyword(*args, **kwargs)[source]

Bases: Protocol

A string to search in a document (ex: “heart failure”).

label: str

IEntity

class iamsystem.IEntity(*args, **kwargs)[source]

Bases: IKeyword, Protocol

An entity of a knowledge base.

kb_id: str

Keyword

class iamsystem.Keyword(label: str)[source]

Bases: IKeyword

Base class to search keywords in a document.

__init__(label: str) None

asdict()[source]

Returns the fields of the dataclass instance.

label: str

The string to search in a document (ex: ‘heart failure’).

Entity

class iamsystem.Entity(label: str, kb_id: str)[source]

Bases: Keyword, IEntity

A class that represents an entity of a knowledge base.

__init__(label: str, kb_id: str) None

kb_id: str

The entity id in the knowledge base. Ex: https://www.wikidata.org/wiki/Q304330

Terminology

class iamsystem.Terminology[source]

Bases: IStoreKeywords

A utility class to store a set of keywords.

__init__()[source]

add_keyword(keyword: IKeyword) None[source]

Add a keyword.

Parameters

keyword – an IKeyword or a subclass.

Returns

None

add_keywords(keywords: Iterable[IKeyword]) None[source]

Add multiple keywords.

Parameters

keywords – an iterable of IKeyword instances (or subclasses).

Returns

None

get_unigrams(tokenizer: ITokenizer, stopwords: IStopwords) Set[str][source]

Get all the unigrams (single words excluding stopwords) in the keywords.

property keywords: Collection[IKeyword]

Get the collection of keywords.

property size: int

Get the number of keywords.

Tokenization

IOffsets

class iamsystem.IOffsets(*args, **kwargs)[source]

Bases: Protocol

Offsets interface. Default implementation: Offsets.

end: int

end-offset is the index of the last character + 1, that is to say the first character to exclude from the returned substring when slicing with [start:end]

start: int

start-offset is the index of the first character.

Offsets

class iamsystem.Offsets(start: int, end: int)[source]

Bases: IOffsets

Store the start and end offsets of a token.

__init__(start: int, end: int)[source]

Parameters
  • start – start-offset is the index of the first character.

  • end – end-offset is the index of the last character + 1, that is to say the first character to exclude from the returned substring when slicing with [start:end]
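Because the end offset is exclusive, a start/end pair can be used directly as a Python slice. The short sketch below shows this with a hypothetical SimpleOffsets class that merely mimics the two documented attributes.

```python
from dataclasses import dataclass


@dataclass
class SimpleOffsets:
    """Hypothetical stand-in for an object exposing start and end offsets."""

    start: int  # index of the first character
    end: int    # index of the last character + 1 (exclusive)


text = "acute heart failure"
offsets = SimpleOffsets(start=6, end=11)
substring = text[offsets.start:offsets.end]
print(substring)  # heart
```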

IToken

class iamsystem.IToken(*args, **kwargs)[source]

Bases: IOffsets, Protocol

Token interface. Default implementation: Token.

i: int

The index of the token within the parent document.

label: str

the label as it is in the document/keyword.

norm_label: str

the normalized label used by iamsystem’s algorithm to perform entity linking.

Token

class iamsystem.Token(start: int, end: int, label: str, norm_label: str, i: int)[source]

Bases: Offsets, IToken

Store the label, normalized label, start and end offsets of a token.

__init__(start: int, end: int, label: str, norm_label: str, i: int)[source]

Create a token.

Parameters
  • start – start-offset is the index of the first character.

  • end – end-offset is the index of the last character + 1, that is to say the first character to exclude from the returned substring when slicing with [start:end]

  • label – the label as it is in the document/keyword.

  • norm_label – the normalized label (used by iamsystem’s algorithm to perform entity linking).

  • i – the index of the token within the parent document.

ITokenizer

class iamsystem.ITokenizer(*args, **kwargs)[source]

Bases: Protocol[TokenT]

Tokenizer interface. Default implementation: TokenizerImp.

tokenize(text: str) Sequence[TokenT][source]

Tokenize a string.

Parameters

text – an unnormalized string.

Returns

A sequence of generic type (TokenT) that implements IToken protocol.

TokenizerImp

class iamsystem.TokenizerImp(split: Callable[[str], Iterable[IOffsets]], normalize: Callable[[str], str])[source]

Bases: ITokenizer[Token]

An ITokenizer implementation responsible for tokenizing a text and normalizing its tokens. See also french_tokenizer(), english_tokenizer().

__init__(split: Callable[[str], Iterable[IOffsets]], normalize: Callable[[str], str])[source]

Create a custom tokenizer that splits and normalizes a string.

Parameters
  • split – a function that splits a text into (start, end) offsets. This function must return an iterable of IOffsets. See also split_find_iter_closure().

  • normalize – a function that normalizes a string. This function must return a string.

tokenize(text: str) Sequence[Token][source]

Split the text into a sequence of Token.
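To show how a split function and a normalize function combine into tokens, here is a minimal sketch in plain Python. SimpleToken, tokenize and split_words are hypothetical names; the fields mirror the documented Token attributes (start, end, label, norm_label, i), and plain (start, end) tuples stand in for IOffsets.

```python
import re
from typing import Callable, Iterable, List, NamedTuple, Tuple


class SimpleToken(NamedTuple):
    """Hypothetical token mirroring the documented Token fields."""

    start: int
    end: int
    label: str
    norm_label: str
    i: int


def tokenize(
    text: str,
    split: Callable[[str], Iterable[Tuple[int, int]]],
    normalize: Callable[[str], str],
) -> List[SimpleToken]:
    """Apply split to get offsets, then normalize each extracted label."""
    tokens = []
    for i, (start, end) in enumerate(split(text)):
        label = text[start:end]
        tokens.append(SimpleToken(start, end, label, normalize(label), i))
    return tokens


def split_words(text: str) -> List[Tuple[int, int]]:
    # One (start, end) pair per run of word characters.
    return [(m.start(), m.end()) for m in re.finditer(r"\w+", text)]


tokens = tokenize("Heart Failure", split_words, str.lower)
print(tokens[1])
```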

english_tokenizer

iamsystem.english_tokenizer() TokenizerImp[source]

An opinionated English tokenizer. It splits the text by ‘word’ characters and normalizes by lowercasing.

Returns

a TokenizerImp implementation.

french_tokenizer

iamsystem.french_tokenizer() TokenizerImp[source]

An opinionated French tokenizer. It splits the text by ‘word’ characters and normalizes by lowercasing and Unicode normalization.

Returns

a TokenizerImp implementation.

Build a custom split function

iamsystem.split_find_iter_closure(pattern: str) Callable[[str], Iterable[IOffsets]][source]

Build a split function that maps a document to (start, end) tuples.

Parameters

pattern – a regex to split sentence characters.

Returns

a split function.
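The idea of a split closure can be sketched as follows. This is an illustration only: make_split is a hypothetical name, the pattern is assumed to match the tokens themselves (as the function name suggests a use of re.finditer), and plain tuples replace IOffsets objects.

```python
import re
from typing import Callable, List, Tuple


def make_split(pattern: str) -> Callable[[str], List[Tuple[int, int]]]:
    """Build a split function bound to a compiled regular expression."""
    compiled = re.compile(pattern)

    def split(text: str) -> List[Tuple[int, int]]:
        # Each regex match becomes one (start, end) pair.
        return [(m.start(), m.end()) for m in compiled.finditer(text)]

    return split


split_on_words = make_split(r"\w+")
print(split_on_words("heart failure"))  # [(0, 5), (6, 13)]
```

Compiling the pattern once in the outer function avoids recompiling it for every document.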

Order tokens

iamsystem.tokenize_and_order_decorator(tokenize: Callable[[str], Sequence[TokenT]]) Callable[[str], Sequence[TokenT]][source]

Decorate a tokenize function: the tokens are sorted alphabetically by their label.

Parameters

tokenize – a tokenize function to decorate.

Returns

the decorated tokenize function.
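A decorator of this kind can be sketched in a few lines. In this simplified version, tokens are plain strings sorted directly, whereas the documented function sorts token objects by their label; order_by_label and whitespace_tokenize are hypothetical names.

```python
from typing import Callable, List, Sequence


def order_by_label(tokenize: Callable[[str], Sequence[str]]) -> Callable[[str], List[str]]:
    """Wrap a tokenize function so that its output is sorted alphabetically."""

    def wrapper(text: str) -> List[str]:
        return sorted(tokenize(text))

    return wrapper


@order_by_label
def whitespace_tokenize(text: str) -> List[str]:
    # Toy tokenizer: lowercase, then split on whitespace.
    return text.lower().split()


print(whitespace_tokenize("heart failure acute"))  # ['acute', 'failure', 'heart']
```

Sorting both keyword tokens and document tokens the same way makes matching independent of word order.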

Stopwords classes

IStopwords

class iamsystem.IStopwords(*args, **kwargs)[source]

Bases: Protocol[TokenT]

Stopwords Interface.

is_token_a_stopword(token: TokenT) bool[source]

Check if a token is a stopword.

Parameters

token – a generic Token that implements IToken protocol.

Returns

true if this token is a stopword.

Stopwords

class iamsystem.Stopwords(stopwords: Optional[Iterable[str]] = None)[source]

Bases: SimpleStopwords[TokenT]

A simple implementation of IStopwords protocol.

__init__(stopwords: Optional[Iterable[str]] = None)[source]

Create a Stopword instance to store stopwords.

Parameters

stopwords – a set of stopwords. Default to None.

add(words: Iterable[str]) None[source]

Add stopwords.

Parameters

words – a list of string.

Returns

None

is_stopword(word: str) bool[source]

Return True if, after lowercasing, the word belongs to the stopwords set.

property stopwords

Get the set of stopwords.

NegativeStopwords

class iamsystem.NegativeStopwords(words_to_keep: Optional[Iterable[str]] = None)[source]

Bases: IStopwords[TokenT]

Like a negative image (a total inversion, in which light areas appear dark and vice versa), every token is a stopword until proven otherwise.

__init__(words_to_keep: Optional[Iterable[str]] = None)[source]

Create a NegativeStopwords instance to store words to keep and/or define functions that check if a word should be kept.

Parameters

words_to_keep – a set of words not to ignore.

add_fun_is_a_word_to_keep(fun: Callable[[TokenT], bool]) None[source]

Add a function that checks if a word should be kept.

Parameters

fun – a Callable that takes a token as a parameter and returns a boolean.

Returns

None.

add_words(words_to_keep: Iterable[str]) None[source]

Add words not to be ignored.

Parameters

words_to_keep – a list of string.

Returns

None

is_token_a_stopword(token: TokenT) bool[source]

Check whether the token is not a token to keep.

Parameters

token – a token.

Returns

False if the lowercased token belongs to the set of words to keep or if a function added with add_fun_is_a_word_to_keep() returns True.
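The inverted logic can be summarized with a small sketch that operates on plain strings rather than tokens. KeepListStopwords and its method names are hypothetical; only the decision rule follows the description above.

```python
from typing import Callable, Iterable, List, Optional, Set


class KeepListStopwords:
    """Hypothetical sketch: every word is a stopword unless explicitly kept."""

    def __init__(self, words_to_keep: Optional[Iterable[str]] = None) -> None:
        self._keep: Set[str] = {w.lower() for w in (words_to_keep or [])}
        self._keep_funs: List[Callable[[str], bool]] = []

    def add_keep_function(self, fun: Callable[[str], bool]) -> None:
        self._keep_funs.append(fun)

    def is_stopword(self, word: str) -> bool:
        if word.lower() in self._keep:
            return False
        # A word is also kept if any registered function accepts it.
        return not any(fun(word) for fun in self._keep_funs)


stop = KeepListStopwords(words_to_keep=["heart", "failure"])
stop.add_keep_function(str.isdigit)  # keep numeric tokens as well
print(stop.is_stopword("Heart"), stop.is_stopword("the"), stop.is_stopword("42"))
```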

NoStopwords

class iamsystem.NoStopwords[source]

Bases: SimpleStopwords[TokenT]

Utility class. Class to use when no stopwords are used.

is_stopword(word: str) bool[source]

Return False.

is_token_a_stopword(token: TokenT) bool[source]

Return False.

Fuzzy algorithms

Abstract Base classes

FuzzyAlgo

class iamsystem.FuzzyAlgo(name: str)[source]

Bases: Generic[TokenT], ABC

Fuzzy Algorithm base class.

NO_SYN: Iterable[Tuple[str, ...]] = []

Default value returned by a fuzzy algorithm when no synonym is found.

abstract get_synonyms(tokens: Sequence[TokenT], token: TokenT, transitions: Iterable[StateTransition]) List[Tuple[Tuple[str, ...], str]][source]

Main API function to retrieve all synonyms provided by a fuzzy algorithm.

Parameters
  • tokens – the sequence of tokens of the document. Useful when the fuzzy algorithm needs context, namely the tokens around the token of interest.

  • token – the token of this sequence for which synonyms are expected.

  • transitions – the state transitions in which the algorithm currently is. Useful if the fuzzy algorithm needs to know the next or possible transitions.

Returns

0 to many synonyms (SynAlgo type).

static word_to_syn(word: str) Tuple[str, ...][source]

Utility function to transform a string to expected SynType.

Parameters

word – a word synonym produced by the algorithm. Ex: word=’insuffisance’ for token ‘ins’.

Returns

SynType, the expected output format.

static words_seq_to_syn(words: Sequence[str]) Tuple[str, ...][source]

Utility function to transform a sequence of string to the expected output type.

Parameters

words – a sequence of words produced by the algorithm. Ex: words=[‘insuffisance’, ‘cardiaque’] for the token ‘ic’.

Returns

SynType, the expected output format.

ContextFreeAlgo

class iamsystem.ContextFreeAlgo(name: str)[source]

Bases: FuzzyAlgo[TokenT], ABC

A FuzzyAlgo that doesn’t take into account context, only the current token.

get_synonyms(tokens: Sequence[TokenT], token: TokenT, transitions: Iterable[StateTransition]) List[Tuple[Tuple[str, ...], str]][source]

Delegate to get_syns_of_token.

abstract get_syns_of_token(token: TokenT) Iterable[Tuple[str, ...]][source]

Returns synonyms of this token.

NormLabelAlgo

class iamsystem.NormLabelAlgo(name: str)[source]

Bases: ContextFreeAlgo[TokenT], INormLabelAlgo, ABC

A FuzzyAlgo that uses only the normalized label of a token. These fuzzy algorithms can be put in cache to avoid calling them multiple times. See CacheFuzzyAlgos.

get_syns_of_token(token: TokenT) Iterable[Tuple[str, ...]][source]

Delegate to get_syns_of_word.

abstract get_syns_of_word(word: str) Iterable[Tuple[str, ...]][source]

Returns synonyms of this word (e.g. the normalized label of a token).

CacheFuzzyAlgos

class iamsystem.CacheFuzzyAlgos(name: str = 'Cache')[source]

Bases: FuzzyAlgo, Generic[TokenT]

A FuzzyAlgo that provides a cache for NormLabelAlgo algorithms. Since these algorithms don’t depend on context, their output can be cached to avoid calling them multiple times.

__init__(name: str = 'Cache')[source]

Create a fuzzy algorithm to allow a partial match between a text token and a keyword token.

Parameters

name – algorithm’s name.

add_algo(algo: INormLabelAlgo) None[source]

Add NormLabelAlgo.

empty_cache() None[source]

Empty the cache. Done automatically when an algorithm is added.

get_synonyms(tokens: Sequence[IToken], token: TokenT, transitions: Iterable[StateTransition]) List[Tuple[Tuple[str, ...], str]][source]

Overrides. Implements superclass abstract method.

get_syns_of_word(word: str) List[Tuple[Tuple[str, ...], str]][source]

Retrieve all synonyms of fuzzy algorithms from cache or by calling them once.

property max_nb_of_words

The maximum number of words to put in cache. Default: 100,000 words.
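The caching idea can be sketched as below. This is not the library's code: SynonymCache is a hypothetical class, algorithms are reduced to plain functions from a word to synonyms, and the behaviour once the cache is full (compute without storing) is an assumption.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Syn = Tuple[str, ...]
WordAlgo = Callable[[str], Iterable[Syn]]


class SynonymCache:
    """Hypothetical cache in front of context-free synonym functions."""

    def __init__(self, max_nb_of_words: int = 100_000) -> None:
        self.max_nb_of_words = max_nb_of_words
        self._algos: List[Tuple[str, WordAlgo]] = []
        self._cache: Dict[str, List[Tuple[Syn, str]]] = {}

    def add_algo(self, name: str, algo: WordAlgo) -> None:
        self._algos.append((name, algo))
        self._cache.clear()  # cached results are stale once an algorithm is added

    def get_syns_of_word(self, word: str) -> List[Tuple[Syn, str]]:
        if word in self._cache:
            return self._cache[word]
        syns = [(syn, name) for name, algo in self._algos for syn in algo(word)]
        if len(self._cache) < self.max_nb_of_words:
            self._cache[word] = syns
        return syns


calls: List[str] = []


def plural(word: str) -> List[Syn]:
    calls.append(word)  # record each real computation
    return [(word + "s",)]


cache = SynonymCache()
cache.add_algo("plural", plural)
first = cache.get_syns_of_word("tumor")
second = cache.get_syns_of_word("tumor")
print(first, len(calls))
```

The second lookup is served from the cache, so the underlying function runs only once per distinct word.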

Abbreviations

class iamsystem.Abbreviations(name: str, token_is_an_abbreviation: ~typing.Callable[[~iamsystem.tokenization.api.TokenT], bool] = <function Abbreviations.<lambda>>)[source]

Bases: ContextFreeAlgo[TokenT], INormLabelAlgo

A FuzzyAlgo to handle abbreviations. This class doesn’t take into account the context of a document to return a long form.

__init__(name: str, token_is_an_abbreviation: ~typing.Callable[[~iamsystem.tokenization.api.TokenT], bool] = <function Abbreviations.<lambda>>)[source]

Create an instance to store abbreviations.

Parameters
  • name – a name given to this algorithm. (ex: ‘medical abbs’)

  • token_is_an_abbreviation – a function that verifies whether a token is an abbreviation (ex: checks that all letters are uppercase). The function is called before the dictionary look-up is performed to retrieve long forms. Default: no check is performed; the function always returns True.

add(short_form: str, long_form: str, tokenizer: ITokenizer) None[source]

Add an abbreviation.

Parameters
  • short_form – an abbreviation short form (ex: CHF).

  • long_form – an abbreviation long form. (ex: congestive heart failure).

  • tokenizer – an ITokenizer to tokenize the long form. It is recommended to use your Matcher tokenizer.

Returns

None.

add_tokenized_long_form(short_form, long_form: Sequence[str]) None[source]

Add an abbreviation already tokenized.

get_syns_of_token(token: TokenT) Iterable[Tuple[str, ...]][source]

Return the abbreviation long form(s).

get_syns_of_word(word: str) Iterable[Tuple[str, ...]][source]

Return the abbreviation long form(s).
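The look-up described in this class can be approximated by a dictionary from short forms to tokenized long forms, guarded by an optional check. The sketch below uses hypothetical names (AbbreviationTable, long_forms) and a whitespace split in place of a real tokenizer.

```python
from typing import Callable, Dict, List, Tuple


class AbbreviationTable:
    """Hypothetical sketch of a short form to long form(s) look-up."""

    def __init__(self, is_abbreviation: Callable[[str], bool] = lambda word: True) -> None:
        self._is_abbreviation = is_abbreviation
        self._long_forms: Dict[str, List[Tuple[str, ...]]] = {}

    def add(self, short_form: str, long_form: str) -> None:
        # Toy tokenization: lowercase, then split on whitespace.
        tokens = tuple(long_form.lower().split())
        self._long_forms.setdefault(short_form.lower(), []).append(tokens)

    def long_forms(self, word: str) -> List[Tuple[str, ...]]:
        # The check runs before the dictionary look-up.
        if not self._is_abbreviation(word):
            return []
        return self._long_forms.get(word.lower(), [])


table = AbbreviationTable(is_abbreviation=str.isupper)
table.add("CHF", "congestive heart failure")
print(table.long_forms("CHF"))  # [('congestive', 'heart', 'failure')]
print(table.long_forms("chf"))  # []: rejected by the uppercase check
```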

FuzzyRegex

class iamsystem.FuzzyRegex(name: str, pattern: str, pattern_name: str)[source]

Bases: ContextFreeAlgo, INormLabelAlgo

A FuzzyAlgo to handle regular expressions. Useful when one or multiple tokens of a keyword need to be matched to a regular expression.

__init__(name: str, pattern: str, pattern_name: str)[source]

Create a FuzzyRegex instance.

Parameters
  • name – a name given to this algorithm.

  • pattern – a regular expression.

  • pattern_name – a name given to this pattern (ex: ‘numval’) that is also a token of an IKeyword.

get_syns_of_token(token: TokenT) Iterable[Tuple[str, ...]][source]

Return the pattern_name if this token matches the regular expression.

get_syns_of_word(word: str) Iterable[Tuple[str, ...]][source]

Return the pattern_name if this word matches it.

replace_pattern_in_keyword(keyword: IKeyword, tokenizer: ITokenizer) IKeyword[source]

Utility function to replace keyword’s tokens that match the pattern by the pattern name.

token_matches_pattern(token: TokenT) bool[source]

Return True if this token matches this instance’s pattern.
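The principle is that a document token matching the pattern is given the pattern name as its synonym, so that it can match a keyword containing that name. A minimal sketch, assuming a full match of the token is required (the library may match differently) and using the hypothetical function name regex_synonyms:

```python
import re
from typing import List, Tuple


def regex_synonyms(word: str, pattern: str, pattern_name: str) -> List[Tuple[str, ...]]:
    """Return the pattern name as a synonym when the whole word matches."""
    if re.fullmatch(pattern, word):
        return [(pattern_name,)]
    return []


numeric = r"[0-9]+(\.[0-9]+)?"
print(regex_synonyms("3.5", numeric, "numval"))   # [('numval',)]
print(regex_synonyms("high", numeric, "numval"))  # []
```

A keyword such as "calcium numval" would then match a document containing "calcium 3.5".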

WordNormalizer

class iamsystem.WordNormalizer(name: str, norm_fun: Callable[[str], str])[source]

Bases: NormLabelAlgo

A FuzzyAlgo to handle normalization techniques such as stemming and lemmatization.

__init__(name: str, norm_fun: Callable[[str], str])[source]

Create an instance that will store the normalized tokens of a set of IKeyword.

Parameters
  • name – a name given to this algorithm (ex: ‘english stemmer’).

  • norm_fun – a normalizing function, for example a stemming function or lemmatization function.

add_words(words: Iterable[str]) None[source]

A list of possible word synonyms, in general all the tokens of your keywords. An easy way to provide these tokens is to call get_keywords_unigrams() of the matcher.

Parameters

words – A list of words to normalize and store.

Returns

None.

get_syns_of_word(word: str) Iterable[Tuple[str, ...]][source]

Return all the words that have the same normalized form as this word.

For example, if the normalizing function is an English stemmer and you called add_words with [“eating”], this instance stores the stem “eat” associated with the word “eating”. Then, if a document contains the token “eats”, since the stem is the same, this function returns the synonym “eating”.

Parameters

word – a string, i.e. a word from a document.

Returns

word synonyms and algorithm name.
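The “eating”/“eats” example can be reproduced with a toy stemmer. Everything in this sketch is hypothetical (toy_stem, NormalizerIndex); a real use would plug in an actual stemming or lemmatization function.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple


def toy_stem(word: str) -> str:
    """Very rough stemmer used only for this illustration."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


class NormalizerIndex:
    """Hypothetical index from normalized form to the original words."""

    def __init__(self, norm_fun: Callable[[str], str]) -> None:
        self._norm_fun = norm_fun
        self._index: Dict[str, Set[str]] = {}

    def add_words(self, words: Iterable[str]) -> None:
        for word in words:
            self._index.setdefault(self._norm_fun(word), set()).add(word)

    def synonyms(self, word: str) -> List[Tuple[str, ...]]:
        # Words sharing the same normalized form are returned as synonyms.
        return [(w,) for w in sorted(self._index.get(self._norm_fun(word), set()))]


index = NormalizerIndex(toy_stem)
index.add_words(["eating"])
print(index.synonyms("eats"))  # [('eating',)]
```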

SpellWise

SpellWiseWrapper

class iamsystem.SpellWiseWrapper(measure: Union[str, ESpellWiseAlgo], max_distance: int, min_nb_char=5, words2ignore: Optional[IWords2ignore] = None, name: str = None)[source]

Bases: StringDistance

A FuzzyAlgo that wraps an algorithm from the spellwise library.

__init__(measure: Union[str, ESpellWiseAlgo], max_distance: int, min_nb_char=5, words2ignore: Optional[IWords2ignore] = None, name: str = None)[source]

Create an instance to take advantage of a spellwise algorithm.

Parameters
  • measure – the measure string or a value selected from the ESpellWiseAlgo enumeration.

  • max_distance – maximum edit distance (see spellwise documentation).

  • min_nb_char – the minimum number of characters a word must have in order not to be ignored.

  • words2ignore – words that must be ignored by the algorithm to avoid false positives, for example English vocabulary words.

  • name – a name given to this algorithm. Default: spellwise algorithm’s name.

add_words(words: Iterable[str], warn=False) None[source]

A list of possible word synonyms, in general all the tokens of your keywords. An easy way to provide these tokens is to call get_keywords_unigrams() method after you added your keywords to the matcher instance.

Parameters
  • words – A list of possible synonyms.

  • warn – raise a warning if a word added is ignored. Default False.

Returns

None.

get_syns_of_word(word: str) Iterable[Tuple[str, ...]][source]

Compute the string distance if the word is not a word to be ignored, and return the keywords’ unigrams within the maximum distance from that word.

property max_distance

Maximum edit distance (see spellwise documentation).
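To make max_distance and min_nb_char concrete, the sketch below implements a plain Levenshtein distance and filters a vocabulary with it. It does not call the spellwise library; levenshtein and close_words are hypothetical helpers written for this illustration.

```python
from typing import Iterable, List


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def close_words(word: str, vocabulary: Iterable[str], max_distance: int, min_nb_char: int = 5) -> List[str]:
    """Return vocabulary words within max_distance, ignoring short words."""
    if len(word) < min_nb_char:
        return []  # too short: string distance would produce many false positives
    return [v for v in vocabulary if levenshtein(word, v) <= max_distance]


vocabulary = ["insuffisance", "cardiaque"]
print(close_words("insufisance", vocabulary, max_distance=1))  # ['insuffisance']
print(close_words("ins", vocabulary, max_distance=1))          # []
```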

ESpellWiseAlgo

class iamsystem.ESpellWiseAlgo(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: Enum

Enumerated list of spellwise library algorithms. See spellwise documentation for more information.

CAVERPHONE_1 = <class 'spellwise.algorithms.caverphone_one.CaverphoneOne'>
CAVERPHONE_2 = <class 'spellwise.algorithms.caverphone_two.CaverphoneTwo'>
EDITEX = <class 'spellwise.algorithms.editex.Editex'>
LEVENSHTEIN = <class 'spellwise.algorithms.levenshtein.Levenshtein'>
SOUNDEX = <class 'spellwise.algorithms.soundex.Soundex'>
TYPOX = <class 'spellwise.algorithms.typox.Typox'>

SimString

SimStringWrapper

class iamsystem.SimStringWrapper(words: Iterable[str], measure: Union[str, ESimStringMeasure] = ESimStringMeasure.JACCARD, name: str = None, threshold=0.5, min_nb_char=5, words2ignore: Optional[IWords2ignore] = None)[source]

Bases: StringDistance

SimString algorithm interface.

__init__(words: Iterable[str], measure: Union[str, ESimStringMeasure] = ESimStringMeasure.JACCARD, name: str = None, threshold=0.5, min_nb_char=5, words2ignore: Optional[IWords2ignore] = None)[source]

Create a fuzzy algorithm that calls simstring.

Parameters
  • words – the words to index in the simstring database. An easy way to provide these words is to call get_keywords_unigrams().

  • name – a name given to this algorithm. Default: the measure name.

  • measure – a similarity measure string or selected from ESimStringMeasure. Default JACCARD.

  • threshold – similarity measure threshold.

  • min_nb_char – the minimum number of characters a word must have in order not to be ignored.

  • words2ignore – words that must be ignored by the algorithm to avoid false positives, for example English vocabulary words.

get_syns_of_word(word: str) Iterable[Tuple[str, ...]][source]

Retrieve simstring similar words.
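
To give an intuition of what a similarity threshold means, here is a minimal sketch of the Jaccard measure over character trigrams. It is an assumption-laden illustration: it does not use simstring, and the feature extraction (unpadded trigrams) and the helper names are hypothetical.

```python
from typing import Iterable, List, Set


def char_ngrams(word: str, n: int = 3) -> Set[str]:
    """Set of character n-grams of a word (no padding, for simplicity)."""
    return {word[i : i + n] for i in range(len(word) - n + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_words(word: str, indexed_words: Iterable[str], threshold: float = 0.5) -> List[str]:
    """Keep the indexed words whose similarity to `word` reaches the threshold."""
    grams = char_ngrams(word)
    return [w for w in indexed_words if jaccard(grams, char_ngrams(w)) >= threshold]


score = jaccard(char_ngrams("insuficiency"), char_ngrams("insufficiency"))
found = similar_words("insuficiency", ["insufficiency", "cardiac"])
```

With these trigrams the misspelled word shares 9 of 12 distinct features with “insufficiency”, a score of 0.75, so it passes the default threshold of 0.5 while “cardiac” does not.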

ESimStringMeasure

class iamsystem.ESimStringMeasure(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: Enum

Enumerated list of simstring measures.

COSINE = 'cosine'
DICE = 'dice'
EXACT = 'exact'
JACCARD = 'jaccard'
OVERLAP = 'overlap'

Brat

Formatter

EBratFormatters

class iamsystem.EBratFormatters(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: Enum

An enumerated list of available Brat Formatters.

CONTINUOUS_SEQ = <iamsystem.brat.formatter.ContSeqFormatter object>

Merge a continuous sequence of tokens but ignore stopwords.

CONTINUOUS_SEQ_STOP = <iamsystem.brat.formatter.ContSeqStopFormatter object>

Merge a continuous sequence of tokens with stopwords.

DEFAULT = <iamsystem.brat.formatter.ContSeqFormatter object>

Default to CONTINUOUS_SEQ.

SPAN = <iamsystem.brat.formatter.SpanFormatter object>

A Brat annotation from first token start-offsets to last token end-offsets.

TOKEN = <iamsystem.brat.formatter.TokenFormatter object>

A fragment for each token.
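
The difference between the TOKEN, SPAN and CONTINUOUS_SEQ strategies can be sketched on plain offsets. This is a simplified illustration, not the formatters' code: Tok and the three functions are hypothetical, and stopword handling is left out.

```python
from typing import List, NamedTuple


class Tok(NamedTuple):
    i: int  # position of the token in the document
    start: int  # start offset (inclusive)
    end: int  # end offset (exclusive)


def token_offsets(tokens: List[Tok]) -> str:
    """One fragment per token."""
    return ";".join(f"{t.start} {t.end}" for t in tokens)


def span_offsets(tokens: List[Tok]) -> str:
    """A single fragment from the first token start to the last token end."""
    return f"{tokens[0].start} {tokens[-1].end}"


def continuous_offsets(tokens: List[Tok]) -> str:
    """Merge tokens that follow each other in the document into one fragment."""
    groups = [[tokens[0]]]
    for tok in tokens[1:]:
        if tok.i == groups[-1][-1].i + 1:
            groups[-1].append(tok)
        else:
            groups.append([tok])
    return ";".join(f"{g[0].start} {g[-1].end}" for g in groups)


# Document: "North and South America"
south_america = [Tok(2, 10, 15), Tok(3, 16, 23)]  # adjacent tokens
north_america = [Tok(0, 0, 5), Tok(3, 16, 23)]  # discontinuous tokens
```

For adjacent tokens the continuous strategy gives the same result as the span strategy; for discontinuous tokens it keeps one fragment per group, whereas the span strategy also covers the words in between.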

ContSeqFormatter

class iamsystem.ContSeqFormatter[source]

Bases: IBratFormatter

Default Brat Formatter: annotate a document by selecting continuous sequences of tokens but ignore stopwords.

__init__()
get_text_and_offsets(annot: IAnnotation) Tuple[str, str][source]

Return the tokens’ labels and the tokens’ offsets (merged if continuous).

ContSeqStopFormatter

class iamsystem.ContSeqStopFormatter(remove_trailing_stop=True)[source]

Bases: IBratFormatter

A Brat formatter that takes into account stopwords: annotate a document by selecting continuous sequences of tokens/stopwords.

__init__(remove_trailing_stop=True)[source]

Create a brat formatter.

Parameters

remove_trailing_stop – if True, trailing stopwords in a discontinuous sequence will be removed. Ex: [[‘North’, ‘and’], [‘America’]] -> [[‘North’], [‘America’]]
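
The effect of remove_trailing_stop can be sketched on lists of strings. The real formatter works on token objects; remove_trailing_stopwords below is a hypothetical helper written only to reproduce the example above.

```python
from typing import Iterable, List, Set


def remove_trailing_stopwords(sequences: Iterable[List[str]], stopwords: Set[str]) -> List[List[str]]:
    """Drop the stopwords that end each continuous sequence."""
    cleaned = []
    for sequence in sequences:
        sequence = list(sequence)  # do not mutate the caller's lists
        while sequence and sequence[-1] in stopwords:
            sequence.pop()
        if sequence:  # a sequence made only of stopwords disappears
            cleaned.append(sequence)
    return cleaned


result = remove_trailing_stopwords([["North", "and"], ["America"]], stopwords={"and"})
```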

get_text_and_offsets(annot: IAnnotation) Tuple[str, str][source]
Return text (document substring) and annotation’s offsets in the Brat format.

Parameters

annot – an annotation.

Returns

A text span and its offsets: ‘The start-offset is the index of the first character of the annotated span in the text (“.txt” file), i.e. the number of characters in the document preceding it. The end-offset is the index of the first character after the annotated span.’

TokenFormatter

class iamsystem.TokenFormatter[source]

Bases: IBratFormatter

Annotate a document by creating (start,end) offsets for each token (in comparison to ContSeqFormatter, it doesn’t merge continuous sequences).

__init__()
get_text_and_offsets(annot: IAnnotation) Tuple[str, str][source]

Return the tokens’ labels and the tokens’ offsets, one fragment per token.

SpanFormatter

class iamsystem.SpanFormatter[source]

Bases: IBratFormatter

A simple Brat formatter that only uses start, end offsets of an annotation

__init__()
get_text_and_offsets(annot: IAnnotation) Tuple[str, str][source]

Return the text and the offsets given by the start and end offsets of the annotation.

BratDocument

class iamsystem.BratDocument(brat_formatter: IBratFormatter = None)[source]

Bases: object

Class representing a Brat Document containing Brat’s annotations, namely Brat Entity and Brat Note in this package. A BratDocument should be linked to a single text document. Entities and notes can be serialized in a text file with ‘ann’ extension, one per line. See https://brat.nlplab.org/standoff.html

__init__(brat_formatter: IBratFormatter = None)[source]

Create a Brat Document.

Parameters

brat_formatter – a strategy to create Brat annotation spans, such as merging continuous sequences of tokens. The default BratFormatter creates a Brat span for each individual token.

add_annots(annots: List[IAnnotation], keyword_attr: str = None, brat_type: str = None) None[source]

Add iamsystem annotations to convert them to Brat format.

Parameters
  • annots – a list of Annotation, Matcher output.

  • keyword_attr – the attribute name of an IKeyword that stores brat_type. Default to None. If None, brat_type parameter must be used.

  • brat_type – A string, the Brat entity type for all these annotations. Default to None. If None, keyword_attr parameter must be used.

Returns

None

add_entity(brat_type: str, offsets: str, text: str) None[source]

Add a Brat Entity.

Parameters
  • brat_type – A Brat entity type (see Brat documentation).

  • offsets – a list of (start,end) annotation offsets. See IOffsets. A list is expected since the tokens can be discontinuous.

  • text – document substring using (start,end) offsets (not the document itself).

Returns

None

entities_to_string() str[source]

Brat entities in the Brat format ready to be serialized to ‘.ann’ text file.

get_entities() Iterable[BratEntity][source]

An iterable of Brat entities.

get_notes() Iterable[BratNote][source]

An iterable of Brat notes.

notes_to_string() str[source]

Brat notes in the Brat format ready to be serialized to ‘.ann’ text file.

BratEntity

class iamsystem.BratEntity(entity_id: str, brat_type: str, offsets: str, text: str)[source]

Bases: object

Class representing a Brat Entity. https://brat.nlplab.org/standoff.html: ‘Each entity annotation has a unique ID and is defined by type (e.g. Person or Organization) and the span of characters containing the entity mention (represented as a “start end” offset pair).’

Format: ID TYPE START END[;START END]* TEXT.

__init__(entity_id: str, brat_type: str, offsets: str, text: str)[source]

Create a Brat Entity.

Parameters
  • entity_id – a unique ID (^T[0-9]+$).

  • brat_type – A Brat entity type (see Brat documentation).

  • offsets – (start,end) offsets.

  • text – document substring using (start,end) offsets.
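
As an illustration of the format line above, the sketch below builds one entity line following the Brat standoff convention, in which the ID, the type with its offsets, and the text are separated by tabs. format_entity is a hypothetical helper, not a method of BratEntity, and the entity type is made up.

```python
import re


def format_entity(entity_id: str, brat_type: str, offsets: str, text: str) -> str:
    """Build a line 'ID<TAB>TYPE START END[;START END]*<TAB>TEXT'."""
    if not re.fullmatch(r"T[0-9]+", entity_id):
        raise ValueError(f"entity id must match ^T[0-9]+$, got {entity_id!r}")
    return f"{entity_id}\t{brat_type} {offsets}\t{text}"


# Discontinuous entity in the document "North and South America".
line = format_entity("T1", "LOCATION", "0 5;16 23", "North America")
```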

BratNote

class iamsystem.BratNote(note_id: str, ref_id: str, note: str)[source]

Bases: object

Class representing a Brat Note. See https://brat.nlplab.org/standoff.html. Brat notes are used to store additional information on a detected entity. Format: #ID TYPE REFID NOTE

__init__(note_id: str, ref_id: str, note: str)[source]

Create a Brat Note.

Parameters
  • note_id – a unique ID (^#[0-9]+$)

  • ref_id – a unique ID. For a BratEntity, the format is (^T[0-9]+$)

  • note – any string comment.

TYPE = 'IAMSYSTEM'

BratNote type. Replace it with ‘AnnotatorNotes’ to make the notes human writable in the Brat interface.
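
A note line can be sketched in the same spirit, again following the Brat standoff convention of tab-separated fields. format_note and NOTE_TYPE are hypothetical names used only for this illustration; the note text is made up.

```python
import re

NOTE_TYPE = "IAMSYSTEM"  # mirrors the TYPE attribute documented above


def format_note(note_id: str, ref_id: str, note: str, note_type: str = NOTE_TYPE) -> str:
    """Build a line '#ID<TAB>TYPE REFID<TAB>NOTE'."""
    if not re.fullmatch(r"#[0-9]+", note_id):
        raise ValueError(f"note id must match ^#[0-9]+$, got {note_id!r}")
    return f"{note_id}\t{note_type} {ref_id}\t{note}"


default_line = format_note("#1", "T1", "matched keyword: North America")
editable_line = format_note("#1", "T1", "matched keyword: North America", note_type="AnnotatorNotes")
```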

BratWriter

class iamsystem.BratWriter[source]

Bases: object

Utility class to write IAMsystem annotations in Brat format to a text file.

__init__()
classmethod saveEntities(brat_entities: Iterable[BratEntity], write: Callable[[str], Any]) None[source]

Write Brat entities.

Parameters
  • brat_entities – an iterable of Brat entities.

  • write – a write function (ex: f.write from ‘with open(filename, ‘w’) as f:’)

Returns

None

classmethod saveNotes(brat_notes: Iterable[BratNote], write: Callable[[str], Any]) None[source]

Write Brat notes.

Parameters
  • brat_notes – an iterable of Brat notes.

  • write – a write function (ex: f.write from ‘with open(filename, ‘w’) as f:’)

Returns

None
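
Passing a write function rather than a file name means the same code can target a file, an in-memory buffer or any other sink. The sketch below shows the pattern with a hypothetical save_lines helper; it is not BratWriter's code, and whether BratWriter adds the line breaks itself is not assumed here.

```python
import io
from typing import Any, Callable, Iterable


def save_lines(lines: Iterable[str], write: Callable[[str], Any]) -> None:
    """Send each line, followed by a line break, to the write function."""
    for line in lines:
        write(line + "\n")


entity_lines = [
    "T1\tLOCATION 0 5;16 23\tNorth America",
    "T2\tLOCATION 10 23\tSouth America",
]

# An in-memory buffer exposes the same write(str) callable as an open file.
buffer = io.StringIO()
save_lines(entity_lines, buffer.write)
content = buffer.getvalue()
```

With a real file, the same call would receive f.write from a with open(filename, "w") as f: block.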

spaCy

IAMsystemSpacy

class iamsystem.spacy.IAMsystemSpacy(nlp: ~spacy.language.Language, name: str, keywords: ~typing.Iterable[~iamsystem.keywords.api.IKeyword], fuzzy_algos: ~typing.Iterable[~iamsystem.fuzzy.api.FuzzyAlgo], w: int = 1, remove_nested_annots: bool = True, stopwords: ~iamsystem.stopwords.api.IStopwords[~iamsystem.spacy.token.TokenSpacyAdapter] = None, norm_fun: ~typing.Callable[[str], str] = <function lower_no_accents>, attr: str = 'iamsystem')[source]

Bases: BaseCustomComp

A stateful component. ‘Component factories are callables that take settings and return a pipeline component function. This is useful if your component is stateful and if you need to customize their creation’. See: https://spacy.io/usage/processing-pipelines#custom-components

__init__(nlp: ~spacy.language.Language, name: str, keywords: ~typing.Iterable[~iamsystem.keywords.api.IKeyword], fuzzy_algos: ~typing.Iterable[~iamsystem.fuzzy.api.FuzzyAlgo], w: int = 1, remove_nested_annots: bool = True, stopwords: ~iamsystem.stopwords.api.IStopwords[~iamsystem.spacy.token.TokenSpacyAdapter] = None, norm_fun: ~typing.Callable[[str], str] = <function lower_no_accents>, attr: str = 'iamsystem')[source]

Create a custom spaCy component. Matcher uses spaCy tokenizer to tokenize the documents and the keywords.

Parameters
  • nlp – a spacy Language.

  • name – the name of this spaCy component.

  • keywords – a list of IKeywords to detect in a document.

  • fuzzy_algos – a list of FuzzyAlgo.

  • w – Matcher’s window parameter.

  • remove_nested_annots – whether to remove nested annotations.

  • stopwords – IStopwords instance.

  • norm_fun – a function that normalizes the ‘norm_’ attribute of a spaCy token, attribute used by iamsystem.

  • attr – the attribute to store iamsystem’s annotation in a spaCy span instance.

property matcher: IMatcher[TokenSpacyAdapter]

A matcher that uses spaCy tokenizer.

IAMsystemBuildSpacy

class iamsystem.spacy.IAMsystemBuildSpacy(nlp: Language, name: str, build_params: Dict[Any, Any], serialized_kw: Dict[Any, Any] = None, attr: str = 'iamsystem', norm_fun: Callable[[str], str] = None)[source]

Bases: BaseCustomComp

A serializable custom component.

__init__(nlp: Language, name: str, build_params: Dict[Any, Any], serialized_kw: Dict[Any, Any] = None, attr: str = 'iamsystem', norm_fun: Callable[[str], str] = None)[source]

Create a custom spaCy component. Matcher uses spaCy tokenizer to tokenize the documents and the keywords.

Parameters
  • nlp – a spacy Language.

  • name – the name of this spaCy component.

  • attr – the attribute to store iamsystem’s annotation in a spaCy span instance.

  • serialized_kw

    a way to import serialized keywords. A dictionary containing 3 fields:

    • ’module’: module name of the class to import. ex: ‘iamsystem’.

    • ’class_name’: the Keyword class to import.

    • ’kw’: an iterable of dict created with the asdict() function.

    If None, keywords are expected in ‘build_params’.

  • norm_fun – a function that normalizes the ‘norm_’ attribute of a spaCy token, attribute used by iamsystem. Default to lower case and remove accents.

  • build_params – build() parameters. The spaCy tokenizer will be used whatever the tokenizer value.
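
The three fields of serialized_kw can be illustrated with a round trip: serialize keyword objects with asdict(), then rebuild them by importing the class by name. Everything below is a sketch under assumptions: ToyKeyword, the module name toy_keywords and load_keywords are hypothetical, and the component's own loading code may differ.

```python
import importlib
import sys
import types
from dataclasses import asdict, dataclass
from typing import Any, Dict, List


@dataclass
class ToyKeyword:
    """Hypothetical keyword class standing in for a real IKeyword implementation."""

    label: str
    code: str


# Register the class under a made-up module name so that it can be imported
# by name even though this sketch lives in a single file.
toy_module = types.ModuleType("toy_keywords")
toy_module.ToyKeyword = ToyKeyword
sys.modules["toy_keywords"] = toy_module

keywords = [ToyKeyword("North America", "NA"), ToyKeyword("South America", "SA")]

serialized_kw: Dict[str, Any] = {
    "module": "toy_keywords",  # module name of the class to import
    "class_name": "ToyKeyword",  # the keyword class to import
    "kw": [asdict(kw) for kw in keywords],  # plain dicts, one per keyword
}


def load_keywords(serialized: Dict[str, Any]) -> List[Any]:
    """Import the class named in the dictionary and rebuild each keyword."""
    module = importlib.import_module(serialized["module"])
    keyword_class = getattr(module, serialized["class_name"])
    return [keyword_class(**fields) for fields in serialized["kw"]]


restored = load_keywords(serialized_kw)
```

Because the dictionary contains only strings and plain dicts, it can be stored in a configuration file and reloaded later.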

property matcher: IMatcher[TokenSpacyAdapter]

A matcher that uses spaCy tokenizer.

TokenSpacyAdapter

class iamsystem.spacy.TokenSpacyAdapter(spacy_token: ~spacy.tokens.token.Token, norm_fun: ~typing.Callable[[str], str] = <function lower_no_accents>)[source]

Bases: IToken

A custom Token that wraps spaCy’s Token and implements the iamsystem’s IToken interface.

__init__(spacy_token: ~spacy.tokens.token.Token, norm_fun: ~typing.Callable[[str], str] = <function lower_no_accents>)[source]

Create an iamsystem token from a spaCy token.

Parameters
  • spacy_token – a spacy.tokens.Token instance.

  • norm_fun – a function that normalizes the ‘norm_’ attribute of a spaCy token, attribute used by iamsystem.
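
A norm_fun is any function from string to string. As a minimal sketch of a normalizer that lowercases and removes accents, the function below uses only the standard library; lower_no_accents_sketch is a hypothetical name, and the library's own lower_no_accents may be implemented differently.

```python
import unicodedata


def lower_no_accents_sketch(text: str) -> str:
    """Lowercase a string and drop its combining accent marks."""
    # NFD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(char for char in decomposed if not unicodedata.combining(char))


normalized = lower_no_accents_sketch("Insuffisance Rénale")
```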

IsStopSpacy

class iamsystem.spacy.IsStopSpacy(*args, **kwargs)[source]

Bases: IStopwords[TokenSpacyAdapter]

Stopwords that use spaCy’s ‘is_stop’ attribute.

__init__(*args, **kwargs)
is_token_a_stopword(token: TokenSpacyAdapter) bool[source]

Return spaCy’s token attribute ‘is_stop’.

SpacyTokenizer

class iamsystem.spacy.SpacyTokenizer(nlp: Language, norm_fun: Callable[[str], str])[source]

Bases: ITokenizer[TokenSpacyAdapter]

A class that wraps spaCy’s tokenizer.

__init__(nlp: Language, norm_fun: Callable[[str], str])[source]

Create a tokenizer for iamsystem algorithm that uses spaCy’s tokenizer.

Parameters
  • nlp – a spacy Language.

  • norm_fun – a function that normalizes the ‘norm_’ attribute of a spaCy token, attribute used by iamsystem algorithm.

tokenize(text: str) Sequence[TokenSpacyAdapter][source]

Tokenize a text. The matcher uses this function only to tokenize the keywords, since this custom component receives documents already tokenized by spaCy.

Parameters

text – a string to tokenize with spaCy component.

Returns

an ordered sequence of tokens.