API Documentation
Documentation of classes and methods.
Matcher
- class iamsystem.Matcher(tokenizer: ~iamsystem.tokenization.api.ITokenizer = <iamsystem.tokenization.tokenize.TokenizerImp object>, stopwords: ~iamsystem.stopwords.api.IStopwords[~iamsystem.tokenization.api.TokenT] = None)[source]
Bases: IMatcher[TokenT]
Main public API to perform semantic annotation (aka entity linking) with the iamsystem algorithm.
- __init__(tokenizer: ~iamsystem.tokenization.api.ITokenizer = <iamsystem.tokenization.tokenize.TokenizerImp object>, stopwords: ~iamsystem.stopwords.api.IStopwords[~iamsystem.tokenization.api.TokenT] = None)[source]
Create an IAMsystem matcher to annotate documents. Prefer the build() method to create a matcher.
- Parameters
tokenizer – default french_tokenizer(). An ITokenizer instance responsible for tokenizing and normalizing.
stopwords – an IStopwords to ignore empty words in keywords and documents. If None, default to Stopwords.
- add_fuzzy_algo(fuzzy_algo: FuzzyAlgo[TokenT]) None[source]
Add a fuzzy algorithm that provides synonym(s) to help match a token of a document with a token of a keyword.
- Parameters
fuzzy_algo – a FuzzyAlgo instance.
- Returns
None.
- add_keyword(keyword: IKeyword) None[source]
Add a keyword to find in a document.
- Parameters
keyword – IKeyword to search in a document.
- Returns
None.
- add_keywords(keywords: Iterable[Union[str, IKeyword]]) None[source]
Utility function to add multiple keywords.
- Parameters
keywords – an iterable of strings (labels) or IKeyword to search in a document.
- Returns
None.
- add_stopwords(words: Iterable[str]) None[source]
Add words (tokens) to be ignored in IKeyword and in documents.
- Parameters
words – a list of words to ignore.
- Returns
None.
- annot_text(text: str) List[IAnnotation[TokenT]][source]
Annotate a document.
- Parameters
text – the document to annotate.
- Returns
a list of Annotation.
- annot_tokens(tokens: Sequence[TokenT]) List[IAnnotation[TokenT]][source]
Annotate a sequence of tokens.
- Parameters
tokens – an ordered or unordered sequence of tokens.
- Returns
a list of Annotation.
- classmethod build(keywords: ~typing.Iterable[~typing.Union[str, ~iamsystem.keywords.api.IKeyword]], tokenizer: ~iamsystem.tokenization.api.ITokenizer = None, stopwords: ~typing.Union[~iamsystem.stopwords.api.IStopwords[~iamsystem.tokenization.api.TokenT], ~typing.Iterable[str]] = <iamsystem.stopwords.simple.NoStopwords object>, w=1, order_tokens=False, negative=False, remove_nested_annots=True, strategy: ~typing.Union[str, ~iamsystem.matcher.strategy.EMatchingStrategy] = EMatchingStrategy.WINDOW, string_distance_ignored_w: ~typing.Optional[~typing.Iterable[str]] = None, abbreviations: ~typing.Optional[~typing.Iterable[~typing.Tuple[str, str]]] = None, spellwise: ~typing.Optional[~typing.List[~typing.Dict[~typing.Any, ~typing.Any]]] = None, simstring: ~typing.Optional[~typing.List[~typing.Dict[~typing.Any, ~typing.Any]]] = None, normalizers: ~typing.Optional[~typing.List[~typing.Dict[~typing.Any, ~typing.Any]]] = None, fuzzy_regex: ~typing.Optional[~typing.List[~typing.Dict[~typing.Any, ~typing.Any]]] = None) Matcher[TokenT][source]
Create an IAMsystem matcher to annotate documents.
- Parameters
keywords – an iterable of keyword strings or IKeyword instances.
tokenizer – default french_tokenizer(). An ITokenizer instance responsible for tokenizing and normalizing.
stopwords – provide an IStopwords. If None, default to NoStopwords.
w – the window: how far apart the tokens of a keyword may be in a document. By default, w=1 means the sequence must be continuous. w=2 means each token can be separated by another token.
order_tokens – order tokens alphabetically if order doesn’t matter in the matching strategy.
negative – every unigram not in the keywords is a stopword. Default to False. If stopwords are also passed, they will be removed from keywords’ tokens and so still be stopwords.
remove_nested_annots – if two annotations overlap, remove the shorter one. Default to True.
strategy – an IAMsystem matching strategy responsible for searching keywords in a document. Default to WindowMatching.
string_distance_ignored_w – words ignored by string distance algorithms to avoid false-positive matches.
abbreviations – an iterable of tuples (short_form, long_form).
spellwise – an iterable of SpellWiseWrapper init parameters. If ‘string_distance_ignored_w’ is set, these words are passed to the SpellWiseWrapper init function.
simstring – an iterable of SimStringWrapper init parameters. If ‘string_distance_ignored_w’ is set, these words are passed to the SimStringWrapper init function.
normalizers – an iterable of WordNormalizer init parameters.
fuzzy_regex – an iterable of FuzzyRegex init parameters.
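The window parameter w can be pictured with a small standalone sketch. It does not call iamsystem and is not the library's matching algorithm; the function name found_within_window is hypothetical, and the sketch only assumes the rule stated above: with w=1 keyword tokens must be adjacent, with w=2 one other token may sit between two keyword tokens.

```python
from typing import Sequence


def found_within_window(doc: Sequence[str], keyword: Sequence[str], w: int = 1) -> bool:
    """Return True if the keyword tokens occur in order in doc, each one
    at most w positions after the previously matched token."""
    if not keyword:
        return False

    def search(pos: int, k: int) -> bool:
        if k == len(keyword):
            return True
        # The next keyword token may sit at pos, pos + 1, ..., pos + w - 1.
        for i in range(pos, min(pos + w, len(doc))):
            if doc[i] == keyword[k] and search(i + 1, k + 1):
                return True
        return False

    return any(doc[s] == keyword[0] and search(s + 1, 1) for s in range(len(doc)))


doc = ["calcium", "blood", "level"]
keyword = ["calcium", "level"]
print(found_within_window(doc, keyword, w=1))  # False: "blood" breaks the sequence
print(found_within_window(doc, keyword, w=2))  # True: one token may sit in between
```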
- property fuzzy_algos: Iterable[FuzzyAlgo[TokenT]]
The fuzzy algorithms used by the algorithm.
- Returns
FuzzyAlgo instances responsible for finding possible synonyms for each token of a document.
- get_initial_state() INode[source]
Return the initial state from which the iamsystem algorithm will start searching for a sequence of keywords’ tokens.
- get_keywords_unigrams() Set[str][source]
Get all the unigrams (single words excluding stopwords) in the keywords.
- get_synonyms(tokens: Sequence[TokenT], token: TokenT, transitions: Iterable[StateTransition]) List[Tuple[Tuple[str, ...], List[str]]][source]
Get synonyms of a token with configured fuzzy algorithms.
- Parameters
tokens – document’s tokens.
token – the token for which synonyms are expected.
transitions – algorithm’s states.
- Returns
tuples of synonyms and fuzzy algorithm’s names.
- is_token_a_stopword(token: TokenT) bool[source]
Check if a token is a stopword.
- Parameters
token – a generic token that implements IToken.
- Returns
True if the token is a stopword.
- property remove_nested_annots: bool
Whether to remove nested annotations. Default to True.
- property stopwords: IStopwords[TokenT]
Return the IStopwords used by the matcher.
- property strategy: IMatchingStrategy[TokenT]
Return the matching strategy.
- tokenize(text: str) Sequence[TokenT][source]
Tokenize a text with the tokenizer’s instance.
- Parameters
text – a document or a keyword.
- Returns
A sequence of tokens; the type depends on the tokenizer but must implement the IToken protocol.
- property tokenizer: ITokenizer[TokenT]
Return the ITokenizer used by the matcher.
- property w: int
Return the window parameter of this matcher.
EMatchingStrategy
- class iamsystem.EMatchingStrategy(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]
Enumeration of matching strategies.
- LARGE_WINDOW = <iamsystem.matcher.strategy.LargeWindowMatching object>
Same annotations as WINDOW but faster when the window is large.
- NO_OVERLAP = <iamsystem.matcher.strategy.NoOverlapMatching object>
No overlapping or nested annotations; the fastest strategy.
- WINDOW = <iamsystem.matcher.strategy.WindowMatching object>
Default matching strategy.
Span
- class iamsystem.matcher.annotation.Span(tokens: List[TokenT])[source]
Bases: ISpan[TokenT], IOffsets
A class that represents a sequence of tokens in a document.
- end: int
end-offset is the index of the last character + 1, that is to say the first character to exclude from the returned substring when slicing with [start:end]
- property end_i
The index of the last token within the parent document.
- start: int
The start offset of the first token.
- property start_i
The index of the first token within the parent document.
- property tokens: List[TokenT]
The tokens of the document that matched the keywords attribute of this instance.
- Returns
an ordered sequence of TokenT, a generic type that implements IToken.
- property tokens_label
The concatenation of each token’s label.
- property tokens_norm_label
The concatenation of each token’s norm_label.
Annotation
- class iamsystem.Annotation(tokens: List[TokenT], algos: List[List[str]], node: INode, stop_tokens: List[TokenT], text: Optional[str] = None)[source]
Bases: Span[TokenT], IAnnotation[TokenT]
Output class of Matcher storing information on the detected entities.
- property algos: List[List[str]]
For each token, the list of algorithms that matched. One to several algorithms per token.
- annot_to_str(annot: IAnnotation)
A class function that generates a string representation of an annotation.
- end: int
end-offset is the index of the last character + 1, that is to say the first character to exclude from the returned substring when slicing with [start:end]
- property end_i
The index of the last token within the parent document.
- get_text_substring(text: str) str
Return text substring.
- get_tokens_algos() Iterable[Tuple[TokenT, List[str]]][source]
Get each token and the list of fuzzy algorithms that matched it.
- Returns
an iterable of tuples (token0, [‘algo1’,…]) where token0 is a token and [‘algo1’,…] a list of fuzzy algorithms.
- property keywords: Sequence[IKeyword]
The linked entities, IKeyword instances that matched a document’s tokens.
- property label
Deprecated. An annotation label. Returns the ‘tokens_label’ attribute.
- classmethod set_brat_formatter(brat_formatter: Union[EBratFormatters, IBratFormatter])[source]
Change Brat Formatter to change text-span and offsets.
- Parameters
brat_formatter – a Brat formatter to produce a different Brat annotation. If None, default to ContSeqFormatter.
- Returns
None
- start: int
The start offset of the first token.
- property start_i
The index of the first token within the parent document.
- property stop_tokens: List[TokenT]
The list of stopwords tokens inside the annotation detected by the Matcher stopwords instance.
- property text: Optional[str]
Return the annotated text.
- to_dict(text: str = None) Dict[str, Any][source]
Return a dictionary representation of this object.
- Parameters
text – the document this annotation comes from. Default to None.
- Returns
A dictionary of relevant attributes.
- to_string(text=False, debug=False) str[source]
Get a default string representation of this object.
- Parameters
text – the document this annotation comes from. Default to None. If set, add the document substring: text[first-token-start-offset : last-token-end-offset].
debug – default to False. If True, add the sequence of tokens and fuzzyalgo names.
- Returns
a concatenated string
- property tokens: List[TokenT]
The tokens of the document that matched the keywords attribute of this instance.
- Returns
an ordered sequence of TokenT, a generic type that implements IToken.
- property tokens_label
The concatenation of each token’s label.
- property tokens_norm_label
The concatenation of each token’s norm_label.
rm_nested_annots
- iamsystem.rm_nested_annots(annots: List[Annotation], keep_ancestors=False)[source]
In case of two nested annotations, remove the shorter one. For example, if we have “prostate” and “prostate cancer” annotations, the “prostate” annotation is removed.
- Parameters
annots – a list of annotations.
keep_ancestors – Default to False. Whether to keep the nested annotations that are ancestors and remove only other cases.
- Returns
a filtered list of annotations.
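The nesting rule can be illustrated on plain (start, end) character spans. This is a standalone sketch, not the implementation of rm_nested_annots: the function name remove_nested is hypothetical, it works on offset tuples rather than Annotation objects, and it ignores the keep_ancestors option.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) with end exclusive


def remove_nested(spans: List[Span]) -> List[Span]:
    """Drop every span that is strictly contained in another span."""

    def is_nested(inner: Span, outer: Span) -> bool:
        return inner != outer and outer[0] <= inner[0] and inner[1] <= outer[1]

    return [s for s in spans if not any(is_nested(s, other) for other in spans)]


# "prostate" (0, 8) is nested in "prostate cancer" (0, 15).
print(remove_nested([(0, 8), (0, 15), (20, 24)]))  # [(0, 15), (20, 24)]
```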
replace_annots
- iamsystem.replace_annots(text: str, annots: Sequence[Annotation], new_labels: Sequence[str])[source]
Replace each annotation in a document (text parameter) by a new label. Warning: an annotation is ignored if overlapped by another one.
- Parameters
text – the document the annotations come from.
annots – an ordered sequence of annotations.
new_labels – one new label per annotation, same length as annots expected.
- Returns
a new document.
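A minimal sketch of the replacement logic on (start, end) offsets follows. It is a standalone illustration, not the body of replace_annots: the name replace_spans is hypothetical, the spans are assumed to be sorted by start offset, and "ignored if overlapped" is interpreted here as skipping any span that starts before the end of the previously replaced one.

```python
from typing import List, Sequence, Tuple


def replace_spans(text: str, spans: Sequence[Tuple[int, int]], new_labels: Sequence[str]) -> str:
    """Replace each (start, end) span of text by its label."""
    if len(spans) != len(new_labels):
        raise ValueError("one label per span is expected")
    parts: List[str] = []
    cursor = 0
    for (start, end), label in zip(spans, new_labels):
        if start < cursor:
            continue  # overlaps the previously replaced span: ignored
        parts.append(text[cursor:start])  # untouched text before the span
        parts.append(label)
        cursor = end
    parts.append(text[cursor:])
    return "".join(parts)


text = "prostate cancer and lung cancer"
print(replace_spans(text, [(0, 15), (20, 31)], ["C61", "C34"]))  # C61 and C34
```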
Keyword and subclasses
IKeyword
IEntity
Keyword
Entity
Terminology
- class iamsystem.Terminology[source]
Bases: IStoreKeywords
A utility class to store a set of keywords.
- add_keyword(keyword: IKeyword) None[source]
Add a keyword.
- Parameters
keyword – an IKeyword or a subclass.
- Returns
None
- add_keywords(keywords: Iterable[IKeyword]) None[source]
Add multiple keywords.
- Parameters
keywords – an iterable of IKeyword or subclass instances.
- Returns
None
- get_unigrams(tokenizer: ITokenizer, stopwords: IStopwords) Set[str][source]
Get all the unigrams (single words excluding stopwords) in the keywords.
- property size: int
Get the number of keywords.
Tokenization
IOffsets
- class iamsystem.IOffsets(*args, **kwargs)[source]
Bases: Protocol
Offsets interface. Default implementation: Offsets.
- end: int
end-offset is the index of the last character + 1, that is to say the first character to exclude from the returned substring when slicing with [start:end]
- start: int
start-offset is the index of the first character.
Offsets
IToken
- class iamsystem.IToken(*args, **kwargs)[source]
Bases: IOffsets, Protocol
Token interface. Default implementation: Token.
- i: int
The index of the token within the parent document.
- label: str
the label as it is in the document/keyword.
- norm_label: str
the normalized label used by iamsystem’s algorithm to perform entity linking.
Token
- class iamsystem.Token(start: int, end: int, label: str, norm_label: str, i: int)[source]
Store the label, normalized label, start and end offsets of a token.
- __init__(start: int, end: int, label: str, norm_label: str, i: int)[source]
Create a token.
- Parameters
start – start-offset is the index of the first character.
end – end-offset is the index of the last character + 1, that is to say the first character to exclude from the returned substring when slicing with [start:end]
label – the label as it is in the document/keyword.
norm_label – the normalized label (used by iamsystem’s algorithm to perform entity linking).
i – the index of the token within the parent document.
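The offset convention above (end is the index of the last character + 1) means a token's label can be recovered by slicing the document with [start:end]. The sketch below is standalone: SketchToken and sketch_tokenize are hypothetical names that only mirror the fields listed above, and the `\w+` split and lowercase normalization are assumptions made for the example.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class SketchToken:
    start: int       # index of the first character
    end: int         # index of the last character + 1
    label: str       # the label as it is in the document
    norm_label: str  # normalized label (here: lowercased)
    i: int           # index of the token within the document


def sketch_tokenize(text: str) -> List[SketchToken]:
    """Build one token per run of word characters."""
    return [
        SketchToken(m.start(), m.end(), m.group(), m.group().lower(), i)
        for i, m in enumerate(re.finditer(r"\w+", text))
    ]


text = "Acute Heart-Failure"
last = sketch_tokenize(text)[-1]
print(last.start, last.end, last.i)  # 12 19 2
print(text[last.start:last.end])     # Failure
```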
ITokenizer
- class iamsystem.ITokenizer(*args, **kwargs)[source]
Bases: Protocol[TokenT]
Tokenizer interface. Default implementation: TokenizerImp.
TokenizerImp
- class iamsystem.TokenizerImp(split: Callable[[str], Iterable[IOffsets]], normalize: Callable[[str], str])[source]
Bases: ITokenizer[Token]
An ITokenizer implementation. Class responsible for the tokenization and normalization of tokens. See also french_tokenizer(), english_tokenizer().
- __init__(split: Callable[[str], Iterable[IOffsets]], normalize: Callable[[str], str])[source]
Create a custom tokenizer that splits and normalizes a string.
- Parameters
split – a function that splits a text into (start,end) tuples. This function must return an iterable of IOffsets. See also split_find_iter_closure().
normalize – a function that normalizes a string. This function must return a string.
english_tokenizer
- iamsystem.english_tokenizer() TokenizerImp[source]
An opinionated English tokenizer. It splits the text by ‘word’ characters and normalizes by lowercasing.
- Returns
a TokenizerImp implementation.
french_tokenizer
- iamsystem.french_tokenizer() TokenizerImp[source]
An opinionated French tokenizer. It splits the text by ‘word’ characters and normalizes by lowercasing and applying a Unicode normalization form.
- Returns
a TokenizerImp implementation.
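The kind of normalization described for the French tokenizer can be sketched with the standard library. The exact normalization form used by french_tokenizer() is not stated here, so this example assumes NFKD decomposition followed by removal of combining marks; the function name lower_no_accents is hypothetical.

```python
import unicodedata


def lower_no_accents(word: str) -> str:
    """Lowercase, decompose accented characters (NFKD), drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", word.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))


print(lower_no_accents("Insuffisance Rénale Aiguë"))  # insuffisance renale aigue
```

A function of this shape could play the role of the normalize argument of TokenizerImp, since it takes a string and returns a string.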
Build a custom split function
Order tokens
- iamsystem.tokenize_and_order_decorator(tokenize: Callable[[str], Sequence[TokenT]]) Callable[[str], Sequence[TokenT]][source]
Decorate a tokenize function: the tokens are sorted alphabetically by their label.
- Parameters
tokenize – a tokenize function to decorate.
- Returns
the decorated tokenize function.
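The decorator pattern described above can be sketched as follows. This is a standalone illustration rather than the library's code: order_by_label, Tok and split_tokenize are hypothetical names, and tokens are reduced to a label and an index.

```python
from typing import Callable, List, NamedTuple, Sequence


class Tok(NamedTuple):
    label: str
    i: int  # position in the original text


def order_by_label(tokenize: Callable[[str], Sequence[Tok]]) -> Callable[[str], List[Tok]]:
    """Wrap a tokenize function so that its tokens come back sorted by label."""

    def ordered(text: str) -> List[Tok]:
        return sorted(tokenize(text), key=lambda token: token.label)

    return ordered


@order_by_label
def split_tokenize(text: str) -> List[Tok]:
    return [Tok(label, i) for i, label in enumerate(text.split())]


print([t.label for t in split_tokenize("heart failure acute")])
# ['acute', 'failure', 'heart']
```

Each token keeps its original index i, so the document order is still recoverable after sorting.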
Stopwords classes
IStopwords
Stopwords
- class iamsystem.Stopwords(stopwords: Optional[Iterable[str]] = None)[source]
Bases: SimpleStopwords[TokenT]
A simple implementation of the IStopwords protocol.
- __init__(stopwords: Optional[Iterable[str]] = None)[source]
Create a Stopword instance to store stopwords.
- Parameters
stopwords – a set of stopwords. Default to None.
- add(words: Iterable[str]) None[source]
Add stopwords.
- Parameters
words – a list of string.
- Returns
None
- is_stopword(word: str) bool[source]
True if, after lowercasing, the word belongs to the stopwords set
- property stopwords
Get the set of stopwords.
NegativeStopwords
- class iamsystem.NegativeStopwords(words_to_keep: Optional[Iterable[str]] = None)[source]
Bases: IStopwords[TokenT]
Like a negative image (a total inversion, in which light areas appear dark and vice versa), every token is a stopword until proven otherwise.
- __init__(words_to_keep: Optional[Iterable[str]] = None)[source]
Create a NegativeStopwords instance to store words to keep and/or define functions that check if a word should be kept.
- Parameters
words_to_keep – a set of words not to ignore.
- add_fun_is_a_word_to_keep(fun: Callable[[TokenT], bool]) None[source]
Add a function that checks if a word should be kept.
- Parameters
fun – a Callable that takes a token as a parameter and returns a boolean.
- Returns
None.
- add_words(words_to_keep: Iterable[str]) None[source]
Add words not to be ignored.
- Parameters
words_to_keep – a list of string.
- Returns
None
- is_token_a_stopword(token: TokenT) bool[source]
Check whether the token is not a token to keep.
- Parameters
token – a token.
- Returns
False if the token’s lowercase form belongs to the set of words to keep or if a function added with add_fun_is_a_word_to_keep() returns True.
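The "stopword until proven otherwise" logic can be sketched in a few lines. The class below is a hypothetical stand-in, not NegativeStopwords itself: it works on plain strings instead of tokens, and the method name is_stopword is an assumption made for the example.

```python
from typing import Callable, Iterable, List, Optional, Set


class SketchNegativeStopwords:
    """Every word is a stopword unless it is listed or a function keeps it."""

    def __init__(self, words_to_keep: Optional[Iterable[str]] = None):
        self._keep: Set[str] = {w.lower() for w in (words_to_keep or [])}
        self._funs: List[Callable[[str], bool]] = []

    def add_fun_is_a_word_to_keep(self, fun: Callable[[str], bool]) -> None:
        self._funs.append(fun)

    def is_stopword(self, word: str) -> bool:
        if word.lower() in self._keep:
            return False
        # Kept if at least one registered function says so.
        return not any(fun(word) for fun in self._funs)


stop = SketchNegativeStopwords(words_to_keep=["calcium", "level"])
stop.add_fun_is_a_word_to_keep(lambda w: w.isdigit())
print([w for w in "the calcium level is 2".split() if not stop.is_stopword(w)])
# ['calcium', 'level', '2']
```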
Fuzzy algorithms
Abstract Base classes
FuzzyAlgo
- class iamsystem.FuzzyAlgo(name: str)[source]
Bases: Generic[TokenT], ABC
Fuzzy algorithm base class.
- NO_SYN: Iterable[Tuple[str, ...]] = []
Default value to return by a fuzzy algorithm if no synonym found.
- abstract get_synonyms(tokens: Sequence[TokenT], token: TokenT, transitions: Iterable[StateTransition]) List[Tuple[Tuple[str, ...], str]][source]
Main API function to retrieve all synonyms provided by a fuzzy algorithm.
- Parameters
tokens – the sequence of tokens of the document. Useful when the fuzzy algorithm needs context, namely the tokens around the token of interest.
token – the token of this sequence for which synonyms are expected.
transitions – the state transitions in which the algorithm currently is. Useful if the fuzzy algorithm needs to know the next or possible transitions.
- Returns
0 to many synonyms (SynAlgo type).
- static word_to_syn(word: str) Tuple[str, ...][source]
Utility function to transform a string to expected SynType.
- Parameters
word – a word synonym produced by the algorithm. Ex: word=’insuffisance’ for token ‘ins’.
- Returns
SynType, the expected output format.
- static words_seq_to_syn(words: Sequence[str]) Tuple[str, ...][source]
Utility function to transform a sequence of string to the expected output type.
- Parameters
words – a sequence of words produced by the algorithm. Ex: words=[‘insuffisance’, ‘cardiaque’] for the token ‘ic’.
- Returns
SynType, the expected output format.
ContextFreeAlgo
- class iamsystem.ContextFreeAlgo(name: str)[source]
Bases: FuzzyAlgo[TokenT], ABC
A FuzzyAlgo that doesn’t take into account context, only the current token.
NormLabelAlgo
- class iamsystem.NormLabelAlgo(name: str)[source]
Bases: ContextFreeAlgo[TokenT], INormLabelAlgo, ABC
A FuzzyAlgo that uses only the normalized label of a token. These fuzzy algorithms can be cached to avoid calling them multiple times. See CacheFuzzyAlgos.
CacheFuzzyAlgos
- class iamsystem.CacheFuzzyAlgos(name: str = 'Cache')[source]
Bases: FuzzyAlgo, Generic[TokenT]
A FuzzyAlgo that provides a cache for NormLabelAlgo algorithms. Since these algorithms don’t depend on context, their output can be cached to avoid calling them multiple times.
- __init__(name: str = 'Cache')[source]
Create a fuzzy algorithm to allow a partial match between a text token and a keyword token.
- Parameters
name – algorithm’s name.
- add_algo(algo: INormLabelAlgo) None[source]
Add a NormLabelAlgo.
- get_synonyms(tokens: Sequence[IToken], token: TokenT, transitions: Iterable[StateTransition]) List[Tuple[Tuple[str, ...], str]][source]
Overrides. Implements superclass abstract method.
- get_syns_of_word(word: str) List[Tuple[Tuple[str, ...], str]][source]
Retrieve all synonyms of fuzzy algorithms from cache or by calling them once.
- property max_nb_of_words
The maximum number of words to put in cache. Default: 100,000 words.
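The caching idea can be sketched as a dictionary keyed by word. This is a standalone illustration, not CacheFuzzyAlgos itself: SketchCache is a hypothetical class, algorithms are reduced to plain functions from a word to a list of synonyms, and the behaviour once the limit is reached (compute without storing) is an assumption.

```python
from typing import Callable, Dict, List


class SketchCache:
    """Call context-free algorithms once per word and remember the result."""

    def __init__(self, max_nb_of_words: int = 100_000):
        self.max_nb_of_words = max_nb_of_words
        self._algos: List[Callable[[str], List[str]]] = []
        self._cache: Dict[str, List[str]] = {}

    def add_algo(self, algo: Callable[[str], List[str]]) -> None:
        self._algos.append(algo)

    def get_syns_of_word(self, word: str) -> List[str]:
        if word in self._cache:
            return self._cache[word]
        syns = [syn for algo in self._algos for syn in algo(word)]
        if len(self._cache) < self.max_nb_of_words:
            self._cache[word] = syns
        return syns


calls: List[str] = []


def slow_algo(word: str) -> List[str]:
    calls.append(word)  # record each real call
    return [word.upper()]


cache = SketchCache()
cache.add_algo(slow_algo)
cache.get_syns_of_word("chf")
cache.get_syns_of_word("chf")
print(len(calls))  # 1: the second lookup was served from the cache
```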
Abbreviations
- class iamsystem.Abbreviations(name: str, token_is_an_abbreviation: ~typing.Callable[[~iamsystem.tokenization.api.TokenT], bool] = <function Abbreviations.<lambda>>)[source]
Bases: ContextFreeAlgo[TokenT], INormLabelAlgo
A FuzzyAlgo to handle abbreviations. This class doesn’t take into account the context of a document to return a long form.
- __init__(name: str, token_is_an_abbreviation: ~typing.Callable[[~iamsystem.tokenization.api.TokenT], bool] = <function Abbreviations.<lambda>>)[source]
Create an instance to store abbreviations.
- Parameters
name – a name given to this algorithm. (ex: ‘medical abbs’)
token_is_an_abbreviation – a function that verifies if a token is an abbreviation (ex: checks all letters are uppercase). The function is called before the dictionary look-up is performed to retrieve long forms. Default: no checks performed, the function always returns True.
- add(short_form: str, long_form: str, tokenizer: ITokenizer) None[source]
Add an abbreviation.
- Parameters
short_form – an abbreviation short form (ex: CHF).
long_form – an abbreviation long form. (ex: congestive heart failure).
tokenizer – an ITokenizer to tokenize the long form. It is recommended to use your Matcher tokenizer.
- Returns
None.
- add_tokenized_long_form(short_form, long_form: Sequence[str]) None[source]
Add an abbreviation already tokenized.
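The look-up described above (check the token first, then fetch the long forms) can be sketched as follows. SketchAbbreviations is a hypothetical stand-in that works on strings; storing short forms in lowercase is an assumption made for the example, not documented behaviour.

```python
from typing import Callable, Dict, List, Sequence, Tuple


class SketchAbbreviations:
    """Map a short form to one or more tokenized long forms."""

    def __init__(self, token_is_an_abbreviation: Callable[[str], bool] = lambda label: True):
        self._is_abb = token_is_an_abbreviation
        self._long_forms: Dict[str, List[Tuple[str, ...]]] = {}

    def add_tokenized_long_form(self, short_form: str, long_form: Sequence[str]) -> None:
        self._long_forms.setdefault(short_form.lower(), []).append(tuple(long_form))

    def get_long_forms(self, label: str) -> List[Tuple[str, ...]]:
        # The check runs before the dictionary look-up.
        if not self._is_abb(label):
            return []
        return self._long_forms.get(label.lower(), [])


abbs = SketchAbbreviations(token_is_an_abbreviation=str.isupper)
abbs.add_tokenized_long_form("CHF", ["congestive", "heart", "failure"])
print(abbs.get_long_forms("CHF"))  # [('congestive', 'heart', 'failure')]
print(abbs.get_long_forms("chf"))  # []: rejected by the uppercase check
```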
FuzzyRegex
- class iamsystem.FuzzyRegex(name: str, pattern: str, pattern_name: str)[source]
Bases: ContextFreeAlgo, INormLabelAlgo
A FuzzyAlgo to handle regular expressions. Useful when one or multiple tokens of a keyword need to be matched to a regular expression.
- __init__(name: str, pattern: str, pattern_name: str)[source]
Create a FuzzyRegex instance.
- Parameters
name – a name given to this algorithm.
pattern – a regular expression.
pattern_name – a name given to this pattern (ex: ‘numval’) that is also a token of an IKeyword.
- get_syns_of_token(token: TokenT) Iterable[Tuple[str, ...]][source]
Return the pattern_name if this token matches the regular expression.
- get_syns_of_word(word: str) Iterable[Tuple[str, ...]][source]
Return the pattern_name if this word matches it.
- replace_pattern_in_keyword(keyword: IKeyword, tokenizer: ITokenizer) IKeyword[source]
Utility function to replace keyword’s tokens that match the pattern by the pattern name.
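The idea of returning pattern_name as a synonym can be sketched with the re module. This is a standalone illustration: regex_synonyms is a hypothetical function, the NUMVAL pattern is an example, and the use of re.fullmatch (whole-word match) is an assumption about how the pattern is applied.

```python
import re
from typing import List, Tuple


def regex_synonyms(word: str, pattern: str, pattern_name: str) -> List[Tuple[str, ...]]:
    """Return the pattern name as the only synonym if the whole word matches."""
    return [(pattern_name,)] if re.fullmatch(pattern, word) else []


NUMVAL = r"\d+([.,]\d+)?"  # example pattern: integer or decimal number
print(regex_synonyms("2.1", NUMVAL, "numval"))    # [('numval',)]
print(regex_synonyms("level", NUMVAL, "numval"))  # []
```

With such a synonym, a keyword containing the token ‘numval’ can match a document token such as "2.1".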
WordNormalizer
- class iamsystem.WordNormalizer(name: str, norm_fun: Callable[[str], str])[source]
Bases: NormLabelAlgo
A FuzzyAlgo to handle normalization techniques such as stemming and lemmatization.
- __init__(name: str, norm_fun: Callable[[str], str])[source]
Create an instance that will store the normalized tokens of a set of IKeyword.
- Parameters
name – a name given to this algorithm (ex: ‘english stemmer’).
norm_fun – a normalizing function, for example a stemming function or lemmatization function.
- add_words(words: Iterable[str]) None[source]
A list of possible word synonyms, in general all the tokens of your keywords. An easy way to provide these tokens is to call get_keywords_unigrams() of the matcher.
- Parameters
words – A list of words to normalize and store.
- Returns
None.
- get_syns_of_word(word: str) Iterable[Tuple[str, ...]][source]
Return all the words that have the same normalized form as this word.
For example, if the normalize function is an English stemmer, and you provided add_words=[“eating”], this instance stored the stem “eat” associated to the word “eating”. Then, if a document contains the token “eats”, since the stem is the same, this function returns the synonym “eating”.
- Parameters
word – a string, i.e. a word from a document.
- Returns
word synonyms and algorithm name.
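The "eating" / "eats" example can be reproduced with a standalone sketch. SketchWordNormalizer and toy_stem are hypothetical: toy_stem is a deliberately naive suffix stripper standing in for a real stemmer, and the class only stores words by their normalized form.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Iterable, List, Set


def toy_stem(word: str) -> str:
    """Naive stemmer for the example: strip a trailing 'ing' or 's'."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


class SketchWordNormalizer:
    def __init__(self, norm_fun: Callable[[str], str]):
        self._norm_fun = norm_fun
        self._words_by_norm: DefaultDict[str, Set[str]] = defaultdict(set)

    def add_words(self, words: Iterable[str]) -> None:
        for word in words:
            self._words_by_norm[self._norm_fun(word)].add(word)

    def get_syns_of_word(self, word: str) -> List[str]:
        # Words stored under the same normalized form are synonyms.
        return sorted(self._words_by_norm.get(self._norm_fun(word), set()))


normalizer = SketchWordNormalizer(toy_stem)
normalizer.add_words(["eating"])
print(normalizer.get_syns_of_word("eats"))  # ['eating']
```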
SpellWise
SpellWiseWrapper
- class iamsystem.SpellWiseWrapper(measure: Union[str, ESpellWiseAlgo], max_distance: int, min_nb_char=5, words2ignore: Optional[IWords2ignore] = None, name: str = None)[source]
Bases: StringDistance
A FuzzyAlgo that wraps an algorithm from the spellwise library.
- __init__(measure: Union[str, ESpellWiseAlgo], max_distance: int, min_nb_char=5, words2ignore: Optional[IWords2ignore] = None, name: str = None)[source]
Create an instance to take advantage of a spellwise algorithm.
- Parameters
measure – the measure string or a value selected from the ESpellWiseAlgo enumerated list.
max_distance – maximum edit distance (see spellwise documentation).
min_nb_char – the minimum number of characters a word must have in order not to be ignored.
words2ignore – words that must be ignored by the algorithm to avoid false positives, for example English vocabulary words.
name – a name given to this algorithm. Default: spellwise algorithm’s name.
- add_words(words: Iterable[str], warn=False) None[source]
A list of possible word synonyms, in general all the tokens of your keywords. An easy way to provide these tokens is to call the get_keywords_unigrams() method after you added your keywords to the matcher instance.
- Parameters
words – A list of possible synonyms.
warn – raise a warning if a word added is ignored. Default False.
- Returns
None.
- get_syns_of_word(word: str) Iterable[Tuple[str, ...]][source]
Compute string distance if it is not a word to be ignored and return the keywords’ unigrams within the maximum distance of that word.
- property max_distance
Maximum edit distance (see spellwise documentation).
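How max_distance, min_nb_char and words2ignore interact can be sketched with a plain Levenshtein distance. This standalone example does not use the spellwise library and is not how SpellWiseWrapper is implemented; the function names are hypothetical, and words2ignore is reduced to a simple collection of strings.

```python
from typing import Collection, Iterable, List


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                 # deletion
                               current[j - 1] + 1,              # insertion
                               previous[j - 1] + (ca != cb)))   # substitution
        previous = current
    return previous[-1]


def close_words(word: str, vocabulary: Iterable[str], max_distance: int = 1,
                min_nb_char: int = 5, words2ignore: Collection[str] = ()) -> List[str]:
    """Return vocabulary words within max_distance, unless the word is skipped."""
    if len(word) < min_nb_char or word in words2ignore:
        return []
    return [v for v in vocabulary if levenshtein(word, v) <= max_distance]


vocabulary = ["north", "america"]
print(close_words("northh", vocabulary))  # ['north']
print(close_words("nort", vocabulary))    # []: shorter than min_nb_char
```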
ESpellWiseAlgo
- class iamsystem.ESpellWiseAlgo(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]
Bases: Enum
Enumerated list of spellwise library algorithms. See spellwise documentation for more information.
- CAVERPHONE_1 = <class 'spellwise.algorithms.caverphone_one.CaverphoneOne'>
- CAVERPHONE_2 = <class 'spellwise.algorithms.caverphone_two.CaverphoneTwo'>
- EDITEX = <class 'spellwise.algorithms.editex.Editex'>
- LEVENSHTEIN = <class 'spellwise.algorithms.levenshtein.Levenshtein'>
- SOUNDEX = <class 'spellwise.algorithms.soundex.Soundex'>
- TYPOX = <class 'spellwise.algorithms.typox.Typox'>
SimString
SimStringWrapper
- class iamsystem.SimStringWrapper(words: Iterable[str], measure: Union[str, ESimStringMeasure] = ESimStringMeasure.JACCARD, name: str = None, threshold=0.5, min_nb_char=5, words2ignore: Optional[IWords2ignore] = None)[source]
Bases: StringDistance
SimString algorithm interface.
- __init__(words: Iterable[str], measure: Union[str, ESimStringMeasure] = ESimStringMeasure.JACCARD, name: str = None, threshold=0.5, min_nb_char=5, words2ignore: Optional[IWords2ignore] = None)[source]
Create a fuzzy algorithm that calls simstring.
- Parameters
words – the words to index in the simstring database. An easy way to provide these words is to call get_keywords_unigrams().
name – a name given to this algorithm. Default: the measure name.
measure – a similarity measure string or a value selected from ESimStringMeasure. Default JACCARD.
threshold – similarity measure threshold.
min_nb_char – the minimum number of characters a word must have in order not to be ignored.
words2ignore – words that must be ignored by the algorithm to avoid false positives, for example English vocabulary words.
Brat
Formatter
EBratFormatters
- class iamsystem.EBratFormatters(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]
Bases: Enum
An enumerated list of available Brat Formatters.
- CONTINUOUS_SEQ = <iamsystem.brat.formatter.ContSeqFormatter object>
Merge a continuous sequence of tokens but ignore stopwords.
- CONTINUOUS_SEQ_STOP = <iamsystem.brat.formatter.ContSeqStopFormatter object>
Merge a continuous sequence of tokens with stopwords.
- DEFAULT = <iamsystem.brat.formatter.ContSeqFormatter object>
Default to CONTINUOUS_SEQ.
- SPAN = <iamsystem.brat.formatter.SpanFormatter object>
A Brat annotation from first token start-offsets to last token end-offsets.
- TOKEN = <iamsystem.brat.formatter.TokenFormatter object>
A fragment for each token.
ContSeqFormatter
ContSeqStopFormatter
- class iamsystem.ContSeqStopFormatter(remove_trailing_stop=True)[source]
Bases: IBratFormatter
A Brat formatter that takes into account stopwords: annotate a document by selecting continuous sequences of tokens/stopwords.
- __init__(remove_trailing_stop=True)[source]
Create a brat formatter.
- Parameters
remove_trailing_stop – if True, trailing stopwords in a discontinuous sequence will be removed. Ex: [[‘North’, ‘and’], [‘America’]] -> [[‘North’], [‘America’]]
- get_text_and_offsets(annot: IAnnotation) Tuple[str, str][source]
Return text (document substring) and annotation’s offsets in the Brat format.
- Parameters
annot – an annotation.
- Returns
A text span and its offsets: ‘The start-offset is the index of the first character of the annotated span in the text (“.txt” file), i.e. the number of characters in the document preceding it. The end-offset is the index of the first character after the annotated span.’
BratDocument
- class iamsystem.BratDocument(brat_formatter: IBratFormatter = None)[source]
Bases: object
Class representing a Brat Document containing Brat’s annotations, namely Brat Entity and Brat Note in this package. A BratDocument should be linked to a single text document. Entities and notes can be serialized in a text file with the ‘ann’ extension, one per line. See https://brat.nlplab.org/standoff.html
- __init__(brat_formatter: IBratFormatter = None)[source]
Create a Brat Document.
- Parameters
brat_formatter – a strategy to create Brat annotation spans, such as merging continuous sequences of tokens. The default BratFormatter creates a Brat span for each individual token.
- add_annots(annots: List[IAnnotation], keyword_attr: str = None, brat_type: str = None) None[source]
Add iamsystem annotations to convert them to Brat format.
- Parameters
annots – a list of Annotation, Matcher output.
keyword_attr – the attribute name of an IKeyword that stores brat_type. Defaults to None. If None, the brat_type parameter must be used.
brat_type – a string, the Brat entity type for all these annotations. Defaults to None. If None, the keyword_attr parameter must be used.
- Returns
None
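The interplay between keyword_attr and brat_type can be sketched as follows. DemoKeyword and resolve_brat_type are hypothetical names, and giving keyword_attr precedence when both are supplied is an assumption of this sketch, not documented behaviour.

```python
from typing import Any, NamedTuple, Optional


class DemoKeyword(NamedTuple):
    """Hypothetical keyword that carries its own Brat type."""
    label: str
    brat_type: str


def resolve_brat_type(
    keyword: Any,
    keyword_attr: Optional[str] = None,
    brat_type: Optional[str] = None,
) -> str:
    """Pick the Brat entity type for the keyword of one annotation."""
    if keyword_attr is not None:
        # Per-keyword type: read it from the named attribute.
        return str(getattr(keyword, keyword_attr))
    if brat_type is not None:
        # One type shared by all annotations.
        return brat_type
    raise ValueError("Either keyword_attr or brat_type must be provided.")


keyword = DemoKeyword(label="North America", brat_type="Location")
```

Storing the type on the keyword is convenient when one terminology mixes several entity types; a single brat_type is enough when all keywords belong to the same type.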
- add_entity(brat_type: str, offsets: str, text: str) None[source]
Add a Brat Entity.
- Parameters
brat_type – A Brat entity type (see Brat documentation).
offsets – a list of (start,end) annotation offsets. See IOffsets. A list is expected since the tokens can be discontinuous.
text – document substring using (start,end) offsets (not the document itself).
- Returns
None
- entities_to_string() str[source]
Brat entities in the Brat format ready to be serialized to ‘.ann’ text file.
- get_entities() Iterable[BratEntity][source]
An iterable of Brat entities.
BratEntity
- class iamsystem.BratEntity(entity_id: str, brat_type: str, offsets: str, text: str)[source]
Bases:
object
Class representing a Brat Entity. From https://brat.nlplab.org/standoff.html: ‘Each entity annotation has a unique ID and is defined by type (e.g. Person or Organization) and the span of characters containing the entity mention (represented as a “start end” offset pair).’
Format: ID TYPE START END[;START END]* TEXT.
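As an illustration of this format, the sketch below writes and reads one entity line. The function names are hypothetical; the tab between ID, type-and-offsets, and text follows the Brat standoff format, and the regular expression only covers text-bound (‘T’) annotations.

```python
import re
from typing import Tuple

# ID <TAB> TYPE START END[;START END]* <TAB> TEXT
ENTITY_LINE = re.compile(r"^(T\d+)\t(\S+) (\d+ \d+(?:;\d+ \d+)*)\t(.*)$")


def format_entity(entity_id: str, brat_type: str, offsets: str, text: str) -> str:
    """Serialize one entity as a line of an '.ann' file (no trailing newline)."""
    return f"{entity_id}\t{brat_type} {offsets}\t{text}"


def parse_entity(line: str) -> Tuple[str, str, str, str]:
    """Split an entity line back into (id, type, offsets, text)."""
    match = ENTITY_LINE.match(line)
    if match is None:
        raise ValueError(f"Not a Brat entity line: {line!r}")
    return match.group(1), match.group(2), match.group(3), match.group(4)


line = format_entity("T1", "Location", "0 5;16 23", "North America")
```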
BratNote
- class iamsystem.BratNote(note_id: str, ref_id: str, note: str)[source]
Bases:
object
Class representing a Brat Note. See https://brat.nlplab.org/standoff.html. Brat notes are used to store additional information on a detected entity. Format: #ID TYPE REFID NOTE
- __init__(note_id: str, ref_id: str, note: str)[source]
Create a Brat Note.
- Parameters
note_id – a unique ID (^#[0-9]+$)
ref_id – a unique ID. For a BratEntity, the format is (^T[0-9]+$)
note – any string comment.
- TYPE = 'IAMSYSTEM'
BratNote type. Replace it with ‘AnnotatorNotes’ to make notes editable by a human in the Brat interface.
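A note line in the ‘#ID TYPE REFID NOTE’ format can be sketched like this. format_note is a hypothetical helper; the two ID patterns are the ones given in the parameter descriptions above, and the tab separators are assumed from the Brat standoff format.

```python
import re

NOTE_ID = re.compile(r"^#[0-9]+$")
ENTITY_ID = re.compile(r"^T[0-9]+$")


def format_note(
    note_id: str, ref_id: str, note: str, note_type: str = "IAMSYSTEM"
) -> str:
    """Serialize one note attached to an entity (no trailing newline)."""
    if not NOTE_ID.match(note_id):
        raise ValueError(f"Invalid note ID: {note_id!r}")
    if not ENTITY_ID.match(ref_id):
        raise ValueError(f"Invalid entity ID: {ref_id!r}")
    return f"{note_id}\t{note_type} {ref_id}\t{note}"


default_note = format_note("#1", "T1", "North America (NA)")
editable_note = format_note("#1", "T1", "North America (NA)", note_type="AnnotatorNotes")
```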
BratWriter
- class iamsystem.BratWriter[source]
Bases:
object
Utility class to write IAMsystem annotations in Brat format to a text file.
- __init__()
- classmethod saveEntities(brat_entities: Iterable[BratEntity], write: Callable[[str], Any]) None[source]
Write Brat entities.
- Parameters
brat_entities – an iterable of Brat entities.
write – a write function (e.g. f.write from ‘with open(filename, ‘w’) as f:’)
- Returns
None
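Because the write parameter is just a callable taking a string, the destination can be a file, an in-memory buffer, or a list. The sketch below uses a hypothetical stand-in, save_lines, to show the pattern; it is not BratWriter itself, and appending a newline after each entity is an assumption.

```python
import io
from typing import Any, Callable, Iterable


def save_lines(lines: Iterable[str], write: Callable[[str], Any]) -> None:
    """Hypothetical stand-in: send one line per entity through `write`."""
    for line in lines:
        write(line + "\n")


# An in-memory buffer instead of a file opened with open(filename, "w").
buffer = io.StringIO()
save_lines(["T1\tLocation 0 13\tNorth America"], buffer.write)

# Any callable that accepts a string works, e.g. list.append.
collected = []
save_lines(["T1\tLocation 0 13\tNorth America"], collected.append)
```

Passing the callable rather than a filename keeps the writer independent of where the output goes, which is handy in tests.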
spaCy
IAMsystemSpacy
- class iamsystem.spacy.IAMsystemSpacy(nlp: ~spacy.language.Language, name: str, keywords: ~typing.Iterable[~iamsystem.keywords.api.IKeyword], fuzzy_algos: ~typing.Iterable[~iamsystem.fuzzy.api.FuzzyAlgo], w: int = 1, remove_nested_annots: bool = True, stopwords: ~iamsystem.stopwords.api.IStopwords[~iamsystem.spacy.token.TokenSpacyAdapter] = None, norm_fun: ~typing.Callable[[str], str] = <function lower_no_accents>, attr: str = 'iamsystem')[source]
Bases:
BaseCustomComp
A stateful component. ‘Component factories are callables that take settings and return a pipeline component function. This is useful if your component is stateful and if you need to customize their creation’. See: https://spacy.io/usage/processing-pipelines#custom-components
- __init__(nlp: ~spacy.language.Language, name: str, keywords: ~typing.Iterable[~iamsystem.keywords.api.IKeyword], fuzzy_algos: ~typing.Iterable[~iamsystem.fuzzy.api.FuzzyAlgo], w: int = 1, remove_nested_annots: bool = True, stopwords: ~iamsystem.stopwords.api.IStopwords[~iamsystem.spacy.token.TokenSpacyAdapter] = None, norm_fun: ~typing.Callable[[str], str] = <function lower_no_accents>, attr: str = 'iamsystem')[source]
Create a custom spaCy component. Matcher uses the spaCy tokenizer to tokenize the documents and the keywords.
- Parameters
nlp – a spacy Language.
name – the name of this spaCy component.
keywords – a list of IKeywords to detect in a document.
fuzzy_algos – a list of FuzzyAlgo.
w – Matcher’s window parameter.
remove_nested_annots – whether to remove nested annotations.
stopwords – IStopwords instance.
norm_fun – a function that normalizes the ‘norm_’ attribute of a spaCy token, attribute used by iamsystem.
attr – the attribute to store iamsystem’s annotation in a spaCy span instance.
- property matcher: IMatcher[TokenSpacyAdapter]
A matcher that uses spaCy tokenizer.
IAMsystemBuildSpacy
- class iamsystem.spacy.IAMsystemBuildSpacy(nlp: Language, name: str, build_params: Dict[Any, Any], serialized_kw: Dict[Any, Any] = None, attr: str = 'iamsystem', norm_fun: Callable[[str], str] = None)[source]
Bases:
BaseCustomComp
A serializable custom component.
- __init__(nlp: Language, name: str, build_params: Dict[Any, Any], serialized_kw: Dict[Any, Any] = None, attr: str = 'iamsystem', norm_fun: Callable[[str], str] = None)[source]
Create a custom spaCy component. Matcher uses the spaCy tokenizer to tokenize the documents and the keywords.
- Parameters
nlp – a spacy Language.
name – the name of this spaCy component.
attr – the attribute to store iamsystem’s annotation in a spaCy span instance.
serialized_kw – a way to import serialized keywords. A dictionary containing 3 fields:
‘module’: module name of the class to import, e.g. ‘iamsystem’.
‘class_name’: the Keyword class to import.
‘kw’: an iterable of dict created with the asdict() function.
If None, keywords are expected in ‘build_params’.
norm_fun – a function that normalizes the ‘norm_’ attribute of a spaCy token, attribute used by iamsystem. Defaults to lowercasing and removing accents.
build_params – build() parameters. The spaCy tokenizer is used regardless of the tokenizer value.
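The three fields expected in serialized_kw can be illustrated with a small round-trip sketch. DemoKeyword, serialize_keywords and deserialize_keywords are hypothetical names; the sketch assumes the keyword class is a dataclass (so that asdict() applies) and that it can be rebuilt from keyword arguments, which may not match how the component restores keywords internally.

```python
import importlib
from dataclasses import asdict, dataclass
from typing import Any, Dict, Iterable, List


@dataclass
class DemoKeyword:
    """Hypothetical keyword class used only for this illustration."""
    label: str
    code: str


def serialize_keywords(
    keywords: Iterable[Any], module: str, class_name: str
) -> Dict[str, Any]:
    """Build a dictionary with the 'module', 'class_name' and 'kw' fields."""
    return {
        "module": module,
        "class_name": class_name,
        "kw": [asdict(kw) for kw in keywords],
    }


def deserialize_keywords(serialized: Dict[str, Any]) -> List[Any]:
    """Import the class by name and rebuild one instance per dict."""
    module = importlib.import_module(serialized["module"])
    keyword_class = getattr(module, serialized["class_name"])
    return [keyword_class(**fields) for fields in serialized["kw"]]


payload = serialize_keywords(
    [DemoKeyword(label="North America", code="NA")],
    module=DemoKeyword.__module__,
    class_name="DemoKeyword",
)
```

Storing the module and class name next to the data is what lets a reloaded pipeline find the right keyword class without it being pickled.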
- property matcher: IMatcher[TokenSpacyAdapter]
A matcher that uses spaCy tokenizer.
TokenSpacyAdapter
- class iamsystem.spacy.TokenSpacyAdapter(spacy_token: ~spacy.tokens.token.Token, norm_fun: ~typing.Callable[[str], str] = <function lower_no_accents>)[source]
Bases:
IToken
A custom Token that wraps spaCy’s Token and implements iamsystem’s IToken interface.
- __init__(spacy_token: ~spacy.tokens.token.Token, norm_fun: ~typing.Callable[[str], str] = <function lower_no_accents>)[source]
Create an iamsystem token from a spaCy token.
- Parameters
spacy_token – a spaCy Token instance (spacy.tokens.token.Token).
norm_fun – a function that normalizes the ‘norm_’ attribute of a spaCy token, attribute used by iamsystem.
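A norm_fun is any function from string to string. The sketch below shows one way to lowercase and strip accents with the standard library; lower_no_accents_sketch is a hypothetical name, and the sketch is only assumed to behave like the default lower_no_accents, whose implementation is not shown here.

```python
import unicodedata


def lower_no_accents_sketch(text: str) -> str:
    """Lowercase a string and remove its combining accent marks."""
    # NFD splits an accented character into base letter + combining mark.
    decomposed = unicodedata.normalize("NFD", text.lower())
    # "Mn" is the Unicode category of nonspacing (combining) marks.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


normalized = lower_no_accents_sketch("Insuffisance Cardiaque Aiguë")
```

Normalizing keywords and document tokens with the same function is what lets ‘Aiguë’ in a document match ‘aigue’ in a terminology.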
IsStopSpacy
- class iamsystem.spacy.IsStopSpacy(*args, **kwargs)[source]
Bases:
IStopwords[TokenSpacyAdapter]
Stopwords that use spaCy’s ‘is_stop’ token attribute.
- __init__(*args, **kwargs)
- is_token_a_stopword(token: TokenSpacyAdapter) bool[source]
Return spaCy’s token attribute ‘is_stop’.
SpacyTokenizer
- class iamsystem.spacy.SpacyTokenizer(nlp: Language, norm_fun: Callable[[str], str])[source]
Bases:
ITokenizer[TokenSpacyAdapter]
A class that wraps spaCy’s tokenizer.
- __init__(nlp: Language, norm_fun: Callable[[str], str])[source]
Create a tokenizer for iamsystem algorithm that uses spaCy’s tokenizer.
- Parameters
nlp – a spacy Language.
norm_fun – a function that normalizes the ‘norm_’ attribute of a spaCy token, attribute used by iamsystem algorithm.
- tokenize(text: str) Sequence[TokenSpacyAdapter][source]
Tokenize a text. The matcher uses this function only to tokenize the keywords, since this custom component receives the document already tokenized from spaCy.
- Parameters
text – a string to tokenize with the spaCy tokenizer.
- Returns
an ordered sequence of tokens.