API Documentation
Documentation of classes and methods.
Matcher
- class iamsystem.Matcher(tokenizer: ~iamsystem.tokenization.api.ITokenizer = <iamsystem.tokenization.tokenize.TokenizerImp object>, stopwords: ~iamsystem.stopwords.api.IStopwords[~iamsystem.tokenization.api.TokenT] = None)[source]
Bases: IMatcher[TokenT]
Main public API to perform semantic annotation (aka entity linking) with the iamsystem algorithm.
- __init__(tokenizer: ~iamsystem.tokenization.api.ITokenizer = <iamsystem.tokenization.tokenize.TokenizerImp object>, stopwords: ~iamsystem.stopwords.api.IStopwords[~iamsystem.tokenization.api.TokenT] = None)[source]
Create an IAMsystem matcher to annotate documents.
- Parameters
tokenizer – an ITokenizer instance responsible for tokenizing and normalizing. Defaults to french_tokenizer().
stopwords – an IStopwords instance. If None, defaults to Stopwords.
- add_fuzzy_algo(fuzzy_algo: FuzzyAlgo[TokenT]) None[source]
Add a fuzzy algorithm that provides synonym(s) to help match a token of a document to a token of a keyword.
- Parameters
fuzzy_algo – a FuzzyAlgo instance.
- Returns
None.
- add_keyword(keyword: IKeyword) None[source]
Add a keyword to find in a document.
- Parameters
keyword – the IKeyword to search for in a document.
- Returns
None.
- add_keywords(keywords: Iterable[IKeyword]) None[source]
Utility function to add multiple keywords.
- Parameters
keywords – the IKeyword instances to search for in a document.
- Returns
None.
- add_labels(labels: Iterable[str]) None[source]
Utility function to call ‘add_keywords’ by providing a list of labels: IKeyword instances are created and added.
- Parameters
labels – the labels (keywords) to be searched in the document.
- Returns
None.
- add_stopwords(words: Iterable[str]) None[source]
Add words (tokens) to be ignored in IKeyword instances and in documents.
- Parameters
words – a list of words to ignore.
- Returns
None.
- annot_text(text: str, w: int = 1) List[Annotation[TokenT]][source]
Annotate a document.
- Parameters
text – the document to annotate.
w – Window. How discontinuous the tokens of a keyword may be in the document. By default, w=1 means the sequence must be continuous. w=2 means each token can be separated from the next by one other token.
- Returns
a list of Annotation.
- annot_tokens(tokens: Sequence[TokenT], w: int) List[Annotation[TokenT]][source]
Annotate a sequence of tokens.
- Parameters
tokens – an ordered or unordered sequence of tokens.
w – Window. How discontinuous the tokens of a keyword may be in the document. w=1 means the sequence must be continuous. w=2 means each token can be separated from the next by one other token.
remove_nested_annots – if two annotations overlap, remove the shorter one.
- Returns
a list of Annotation.
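The window parameter can be pictured with a small stand-alone sketch. It is not iamsystem’s algorithm: find_with_window is a hypothetical helper that works on plain strings and assumes that two consecutive keyword tokens may be at most w positions apart in the document.

```python
from typing import List, Optional, Sequence


def find_with_window(
    doc: Sequence[str], keyword: Sequence[str], w: int = 1
) -> Optional[List[int]]:
    """Return the document positions of the keyword's tokens, or None.

    Toy model of the window (an assumption, not the library's code):
    consecutive keyword tokens may be at most w positions apart.
    """

    def search(start: int, k: int, last: Optional[int]) -> Optional[List[int]]:
        if k == len(keyword):
            return []
        # The first token may be anywhere; later ones must fall in the window.
        stop = len(doc) if last is None else min(len(doc), last + w + 1)
        for i in range(start, stop):
            if doc[i] == keyword[k]:
                rest = search(i + 1, k + 1, i)
                if rest is not None:
                    return [i] + rest
        return None

    return search(0, 0, None)


doc = ["calcium", "blood", "level"]
keyword = ["calcium", "level"]
print(find_with_window(doc, keyword, w=1))  # None: "blood" breaks continuity
print(find_with_window(doc, keyword, w=2))  # [0, 2]
```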
- property fuzzy_algos: Iterable[FuzzyAlgo[TokenT]]
The fuzzy algorithms used by the algorithm.
- Returns
FuzzyAlgo instances responsible for finding possible synonyms for each token of a document.
- get_keywords_unigrams() Set[str][source]
Get all the unigrams (single words excluding stopwords) in the keywords.
- get_synonyms(tokens: Sequence[TokenT], i: int, w_states: List[List[IState]]) Iterable[Tuple[Tuple[str, ...], List[str]]][source]
Get synonyms of a token with configured fuzzy algorithms.
- Parameters
tokens – document’s tokens.
i – the ith token for which synonyms are expected.
w_states – algorithm’s states.
- Returns
tuples of synonyms and fuzzy algorithm’s names.
- is_token_a_stopword(token: TokenT) bool[source]
Check if a token is a stopword.
- Parameters
token – a generic token that implements IToken.
- Returns
True if the token is a stopword.
- property remove_nested_annots: bool
Matcher configuration: whether to remove nested annotations. Defaults to True.
Annotation
- class iamsystem.Annotation(tokens_states: Sequence[TransitionState[TokenT]])[source]
Bases: Span[TokenT]
Output class of Matcher storing information about linked entities.
- end: int
- get_tokens_algos() Iterable[Tuple[TokenT, List[str]]][source]
Get each token and the list of fuzzy algorithms that matched it.
- Returns
an iterator of tuples (token0, [‘algo1’,…]) where token0 is a token and [‘algo1’,…] a list of fuzzy algorithms.
- property keywords: Sequence[IKeyword]
The linked entities: IKeyword instances that matched a document’s tokens.
- label: str
- norm_label: str
- start: int
- to_brat_format() str
Get Brat offsets format. See https://brat.nlplab.org/standoff.html ‘The start-offset is the index of the first character of the annotated span in the text (“.txt” file), i.e. the number of characters in the document preceding it. The end-offset is the index of the first character after the annotated span.’
- Returns
a string format of tokens’ offsets
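The offset convention quoted above maps directly onto Python slicing. The sketch below is a stand-alone illustration; brat_offsets is a hypothetical helper, and joining discontinuous spans with ‘;’ follows the START END[;START END]* pattern of the Brat standoff format.

```python
from typing import List, Tuple


def brat_offsets(spans: List[Tuple[int, int]]) -> str:
    """Join (start, end) pairs the way Brat writes discontinuous spans."""
    return ";".join(f"{start} {end}" for start, end in spans)


text = "North and South America"
spans = [(0, 5), (16, 23)]  # "North" ... "America"

# The end offset is exclusive, so a plain slice recovers each fragment.
print([text[start:end] for start, end in spans])  # ['North', 'America']
print(brat_offsets(spans))  # 0 5;16 23
```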
- to_dict(text: str = None) Dict[str, Any][source]
Return a dictionary representation of this object.
- Parameters
text – the document this annotation comes from. Defaults to None.
- Returns
A dictionary of relevant attributes.
- to_string(text: str = None, debug=False) str[source]
Get a default string representation of this object.
- Parameters
text – the document this annotation comes from. Defaults to None. If set, the document substring text[first-token-start-offset : last-token-end-offset] is added.
debug – default to False. If True, add the sequence of tokens and fuzzyalgo names.
- Returns
a concatenated string of ‘keywords’ ‘start’ ‘end’ ‘substring’? ‘debug_info’?
rm_nested_annots
- iamsystem.rm_nested_annots(annots: List[Annotation], keep_ancestors=False)[source]
When two annotations are nested, remove the shorter one. For example, given the annotations “prostate” and “prostate cancer”, the “prostate” annotation is removed.
- Parameters
annots – a list of annotations.
keep_ancestors – Default to False. Whether to keep the nested annotations that are ancestors and remove only other cases.
- Returns
a filtered list of annotations.
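As a rough picture of the default behaviour (keep_ancestors=False), the sketch below filters plain (start, end, label) tuples. It is only an approximation under stated assumptions: drop_nested is a hypothetical function, and nesting is decided from character offsets alone, which may differ from how the library compares annotations.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end, label); end is exclusive


def drop_nested(spans: List[Span]) -> List[Span]:
    """Keep a span unless a strictly longer span fully contains it."""
    kept = []
    for start, end, label in spans:
        nested = any(
            o_start <= start and end <= o_end and (o_end - o_start) > (end - start)
            for o_start, o_end, _ in spans
        )
        if not nested:
            kept.append((start, end, label))
    return kept


spans = [(0, 8, "prostate"), (0, 15, "prostate cancer"), (20, 26, "asthma")]
print(drop_nested(spans))
# [(0, 15, 'prostate cancer'), (20, 26, 'asthma')]
```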
replace_annots
- iamsystem.replace_annots(text: str, annots: Sequence[Annotation], new_labels: Sequence[str])[source]
Replace each annotation in a document (text parameter) by a new label. Warning: an annotation is ignored if overlapped by another one.
- Parameters
text – the document the annotations come from.
annots – an ordered sequence of annotations.
new_labels – one new label per annotation, same length as annots expected.
- Returns
a new document.
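The replacement step can be sketched on bare offsets. replace_spans is a hypothetical stand-in that takes (start, end) pairs ordered by start; when two spans overlap it skips the later one, which is an assumption since the description above does not say which annotation is ignored.

```python
from typing import List, Sequence, Tuple


def replace_spans(
    text: str, spans: Sequence[Tuple[int, int]], new_labels: Sequence[str]
) -> str:
    """Replace each (start, end) span by its label, left to right."""
    out: List[str] = []
    cursor = 0
    for (start, end), label in zip(spans, new_labels):
        if start < cursor:
            # Overlaps a span that was already replaced: skip it.
            continue
        out.append(text[cursor:start])
        out.append(label)
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


text = "prostate cancer and asthma"
print(replace_spans(text, [(0, 15), (20, 26)], ["[D1]", "[D2]"]))
# [D1] and [D2]
```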
Keyword and subclasses
IKeyword
Keyword
Term
- class iamsystem.Term(label: str, code: str)[source]
Bases: Keyword
This class represents a term in a particular domain, where each keyword is associated with a unique identifier called a code.
Terminology
- class iamsystem.Terminology[source]
Bases: IStoreKeywords
A utility class to store a set of keywords.
- add_keyword(keyword: IKeyword) None[source]
Add a keyword.
- Parameters
keyword – an IKeyword or a subclass instance.
- Returns
None
- add_keywords(keywords: Iterable[IKeyword]) None[source]
Add multiple keywords.
- Parameters
keywords – an iterable of IKeyword (or subclass) instances.
- Returns
None
- get_unigrams(tokenizer: ITokenizer, stopwords: IStopwords) Set[str][source]
Get all the unigrams (single words excluding stopwords) in the keywords.
- property size: int
Get the number of keywords.
Tokenization
IOffsets
Offsets
IToken
Token
- class iamsystem.Token(start: int, end: int, label: str, norm_label: str)[source]
Store the label, normalized label, start and end offsets of a token.
- __init__(start: int, end: int, label: str, norm_label: str)[source]
Create a token.
- Parameters
start – start-offset is the index of the first character of the annotated span.
end – end-offset is the index of the first character after the annotated span.
label – the label as it is in the document.
norm_label – the normalized label (used by iamsystem’s algorithm to perform entity linking).
ITokenizer
- class iamsystem.ITokenizer(*args, **kwargs)[source]
Bases: Protocol[TokenT]
Tokenizer interface. Default implementation: TokenizerImp.
TokenizerImp
- class iamsystem.TokenizerImp(split: Callable[[str], Iterable[IOffsets]], normalize: Callable[[str], str])[source]
Bases: ITokenizer[Token]
An ITokenizer implementation. Class responsible for the tokenization and normalization of tokens. See also french_tokenizer(), english_tokenizer().
- __init__(split: Callable[[str], Iterable[IOffsets]], normalize: Callable[[str], str])[source]
Create a custom tokenizer that splits and normalizes a string.
- Parameters
split – a function that splits a text into (start,end) tuples. This function must return an iterable of IOffsets. See also split_find_iter_closure().
normalize – a function that normalizes a string. This function must return a string.
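The division of labour between split and normalize can be sketched without the library. Everything below is a hypothetical stand-in: Offsets mimics an object with start and end attributes, split_words uses a regular expression over word characters, and tokenize shows how the two callables might be combined.

```python
import re
from typing import List, NamedTuple, Tuple


class Offsets(NamedTuple):
    """Stand-in for an offsets object: end is exclusive."""
    start: int
    end: int


def split_words(text: str) -> List[Offsets]:
    """Split on runs of word characters and report their offsets."""
    return [Offsets(m.start(), m.end()) for m in re.finditer(r"\w+", text)]


def normalize(label: str) -> str:
    """Normalize a token label; here, lowercasing only."""
    return label.lower()


def tokenize(text: str) -> List[Tuple[int, int, str, str]]:
    """Build (start, end, label, norm_label) tuples from both callables."""
    return [
        (o.start, o.end, text[o.start:o.end], normalize(text[o.start:o.end]))
        for o in split_words(text)
    ]


print(tokenize("Heart Failure"))
# [(0, 5, 'Heart', 'heart'), (6, 13, 'Failure', 'failure')]
```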
english_tokenizer
- iamsystem.english_tokenizer() TokenizerImp[source]
An opinionated English tokenizer.
It splits the text by ‘word’ character. It normalizes by lowercasing.
- Returns
a TokenizerImp implementation.
french_tokenizer
- iamsystem.french_tokenizer() TokenizerImp[source]
An opinionated French tokenizer.
It splits the text by ‘word’ character. It normalizes by lowercasing and applying a Unicode normalization form.
- Returns
a TokenizerImp implementation.
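The description does not say which Unicode normalization form is applied. As one plausible illustration only, the sketch below lowercases, decomposes with NFKD, and drops combining marks so that accented and unaccented spellings share a normalized label; strip_accents_lower is a hypothetical name and this choice of form is an assumption.

```python
import unicodedata


def strip_accents_lower(label: str) -> str:
    """Lowercase, decompose (NFKD), then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", label.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))


print(strip_accents_lower("Insuffisance Rénale"))  # insuffisance renale
```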
Build a custom split function
Order tokens
- iamsystem.tokenize_and_order_decorator(tokenize: Callable[[str], Sequence[TokenT]]) Callable[[str], Sequence[TokenT]][source]
Decorate a tokenize function: the tokens are sorted alphabetically by their label.
- Parameters
tokenize – a tokenize function to decorate.
- Returns
the decorated tokenize function.
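A decorator of this kind can be sketched in a few lines. The names Tok, order_by_label and whitespace_tokenize are hypothetical; the point shown is that, once tokens are sorted by label, two documents with the same words in a different order yield the same token sequence.

```python
import re
from functools import wraps
from typing import Callable, List, NamedTuple, Sequence


class Tok(NamedTuple):
    """Minimal stand-in for a token with offsets and a label."""
    start: int
    end: int
    label: str


def order_by_label(
    tokenize: Callable[[str], Sequence[Tok]]
) -> Callable[[str], Sequence[Tok]]:
    """Wrap a tokenize function so its tokens come back sorted by label."""

    @wraps(tokenize)
    def ordered(text: str) -> Sequence[Tok]:
        return sorted(tokenize(text), key=lambda tok: tok.label)

    return ordered


@order_by_label
def whitespace_tokenize(text: str) -> List[Tok]:
    return [Tok(m.start(), m.end(), m.group()) for m in re.finditer(r"\S+", text)]


print([t.label for t in whitespace_tokenize("prostate cancer")])
# ['cancer', 'prostate']
```

Each token keeps its original offsets, so an annotation can still point back to the right place in the text.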
Stopwords classes
IStopwords
Stopwords
- class iamsystem.Stopwords(stopwords: Optional[Iterable[str]] = None)[source]
Bases: SimpleStopwords[TokenT]
A simple implementation of the IStopwords protocol.
- add(words: Iterable[str]) None[source]
Add stopwords.
- Parameters
words – a list of strings.
- Returns
None
- is_stopword(word: str) bool[source]
True if, after lowercasing, the word belongs to the stopwords set
- property stopwords
Get the set of stopwords.
NegativeStopwords
- class iamsystem.NegativeStopwords(words_to_keep: Optional[Iterable[str]] = None)[source]
Bases: IStopwords[TokenT]
Like a negative image (a total inversion, in which light areas appear dark and vice versa), every token is a stopword until proven otherwise.
- add_fun_is_a_word_to_keep(fun: Callable[[TokenT], bool]) None[source]
Add a function that checks if a word should be kept.
- Parameters
fun – a Callable that takes a token as a parameter and returns a boolean.
- Returns
None.
- add_words(words_to_keep: Iterable[str]) None[source]
Add words not to be ignored.
- Parameters
words_to_keep – a list of strings.
- Returns
None
- is_token_a_stopword(token: TokenT) bool[source]
Check whether the token is not a token to keep.
- Parameters
token – a token.
- Returns
False if the token’s lowercased form belongs to the set of words to keep, or if a function added with add_fun_is_a_word_to_keep() returns True.
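The inverted logic can be sketched with plain strings in place of tokens. KeepList and its methods are hypothetical names that mirror the description above: a word is a stopword unless it is in the keep set or one of the registered functions accepts it.

```python
from typing import Callable, Iterable, List, Optional, Set


class KeepList:
    """Toy 'negative' stopwords: everything is ignored unless kept."""

    def __init__(self, words_to_keep: Optional[Iterable[str]] = None) -> None:
        self._keep: Set[str] = {w.lower() for w in (words_to_keep or [])}
        self._funs: List[Callable[[str], bool]] = []

    def add_words(self, words_to_keep: Iterable[str]) -> None:
        self._keep.update(w.lower() for w in words_to_keep)

    def add_fun_is_a_word_to_keep(self, fun: Callable[[str], bool]) -> None:
        self._funs.append(fun)

    def is_stopword(self, word: str) -> bool:
        if word.lower() in self._keep:
            return False
        return not any(fun(word) for fun in self._funs)


keep = KeepList(["cancer"])
keep.add_fun_is_a_word_to_keep(str.isdigit)  # keep numeric tokens too
print(keep.is_stopword("the"), keep.is_stopword("Cancer"), keep.is_stopword("42"))
# True False False
```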
Fuzzy algorithms
Abstract Base classes
FuzzyAlgo
- class iamsystem.FuzzyAlgo(name: str)[source]
Bases: Generic[TokenT], ABC
Fuzzy algorithm base class.
- NO_SYN: Iterable[Tuple[str, ...]] = []
Default value to return by a fuzzy algorithm if no synonym found.
- abstract get_synonyms(tokens: Sequence[TokenT], i: int, w_states: List[List[IState]]) Iterable[Tuple[Tuple[str, ...], str]][source]
Main API function to retrieve all synonyms provided by a fuzzy algorithm.
- Parameters
tokens – the sequence of tokens of the document. Useful when the fuzzy algorithm needs context, namely the tokens around the token of interest given by ‘i’ parameter.
i – the ith token of this sequence for which synonyms are expected.
w_states – the states in which the algorithm currently is. Useful if the fuzzy algorithm needs to know the current states and the possible state transitions.
- Returns
0 to many synonyms (SynAlgo type).
- static word_to_syn(word: str) Tuple[str, ...][source]
Utility function to transform a string to expected SynType.
- Parameters
word – a word synonym produced by the algorithm. Ex: word=’insuffisance’ for token ‘ins’.
- Returns
SynType, the expected output format.
- static words_seq_to_syn(words: Sequence[str]) Tuple[str, ...][source]
Utility function to transform a sequence of string to the expected output type.
- Parameters
words – a sequence of words produced by the algorithm. Ex: words=[‘insuffisance’, ‘cardiaque’] for the token ‘ic’.
- Returns
SynType, the expected output format.
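Judging from the signatures alone, a synonym is a tuple of strings with one entry per word. The stand-ins below reuse the names of the two static methods for readability, but they are an assumption about the output shape, not the library’s code.

```python
from typing import Sequence, Tuple


def word_to_syn(word: str) -> Tuple[str, ...]:
    """Wrap a single-word synonym in a one-element tuple."""
    return (word,)


def words_seq_to_syn(words: Sequence[str]) -> Tuple[str, ...]:
    """Turn a multi-word synonym into a tuple, one entry per word."""
    return tuple(words)


print(word_to_syn("insuffisance"))  # ('insuffisance',)
print(words_seq_to_syn(["insuffisance", "cardiaque"]))
# ('insuffisance', 'cardiaque')
```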
ContextFreeAlgo
- class iamsystem.ContextFreeAlgo(name: str)[source]
Bases: FuzzyAlgo[TokenT], ABC
A FuzzyAlgo that doesn’t take context into account, only the current token.
NormLabelAlgo
- class iamsystem.NormLabelAlgo(name: str)[source]
Bases: ContextFreeAlgo[TokenT], INormLabelAlgo, ABC
A FuzzyAlgo that uses only the normalized label of a token. These fuzzy algorithms can be cached to avoid calling them multiple times. See CacheFuzzyAlgos.
CacheFuzzyAlgos
- class iamsystem.CacheFuzzyAlgos(name: str = 'Cache')[source]
Bases: FuzzyAlgo, Generic[TokenT]
A FuzzyAlgo that provides a cache for NormLabelAlgo algorithms. Since these algorithms don’t depend on context, their output can be cached to avoid calling them multiple times.
- add_algo(algo: INormLabelAlgo) None[source]
Add a NormLabelAlgo.
- get_synonyms(tokens: Sequence[IToken], i: int, w_states: List[List[IState]]) List[Tuple[Tuple[str, ...], str]][source]
Implements superclass abstract method.
- get_syns_of_word(word: str) List[Tuple[Tuple[str, ...], str]][source]
Retrieve all synonyms of fuzzy algorithms from cache or by calling them once.
- property max_nb_of_words
The maximum number of words to put in cache. Default: 100,000 words.
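The caching idea can be sketched with a dictionary keyed by the word. SynonymCache is a hypothetical class: algorithms are modelled as plain functions from a word to a list of synonyms, and the policy of simply not storing new words once the limit is reached is an assumption.

```python
from typing import Callable, Dict, List, Tuple

Syn = Tuple[str, ...]
SynAlgo = Tuple[Syn, str]  # (synonym, algorithm name)


class SynonymCache:
    """Call each context-free algorithm at most once per distinct word."""

    def __init__(self, max_nb_of_words: int = 100_000) -> None:
        self.max_nb_of_words = max_nb_of_words
        self._algos: List[Tuple[str, Callable[[str], List[Syn]]]] = []
        self._cache: Dict[str, List[SynAlgo]] = {}

    def add_algo(self, name: str, fun: Callable[[str], List[Syn]]) -> None:
        self._algos.append((name, fun))

    def get_syns_of_word(self, word: str) -> List[SynAlgo]:
        if word in self._cache:
            return self._cache[word]
        syns = [(syn, name) for name, fun in self._algos for syn in fun(word)]
        if len(self._cache) < self.max_nb_of_words:
            self._cache[word] = syns
        return syns
```

The second lookup of the same word is served from the dictionary, which is only safe because the output depends on the word alone and not on surrounding tokens.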
Abbreviations
- class iamsystem.Abbreviations(name: str, token_is_an_abbreviation: ~typing.Callable[[~iamsystem.tokenization.api.TokenT], bool] = <function Abbreviations.<lambda>>)[source]
Bases: ContextFreeAlgo[TokenT], INormLabelAlgo, ABC
A FuzzyAlgo to handle abbreviations. This class doesn’t take into account the context of a document to return a long form.
- __init__(name: str, token_is_an_abbreviation: ~typing.Callable[[~iamsystem.tokenization.api.TokenT], bool] = <function Abbreviations.<lambda>>)[source]
Create an instance to store abbreviations.
- Parameters
name – a name given to this algorithm. (ex: ‘medical abbs’)
token_is_an_abbreviation – a function that verifies whether a token is an abbreviation (ex: checks that all letters are uppercase). The function is called before the dictionary look-up that retrieves long forms. Default: no check is performed, the function always returns True.
- add(short_form: str, long_form: str, tokenizer: ITokenizer) None[source]
Add an abbreviation.
- Parameters
short_form – an abbreviation short form (ex: CHF).
long_form – an abbreviation long form. (ex: congestive heart failure).
tokenizer – an ITokenizer to tokenize the long form. It is recommended to use your Matcher tokenizer.
- Returns
None.
- add_tokenized_long_form(short_form, long_form: Sequence[str]) None[source]
Add an abbreviation already tokenized.
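The look-up described above (check the token first, then fetch tokenized long forms) can be sketched as follows. AbbreviationTable is a hypothetical class operating on plain strings, with str.split standing in for a tokenizer.

```python
from typing import Callable, Dict, List, Sequence, Tuple


class AbbreviationTable:
    """Toy short-form to long-form dictionary with a pre-check."""

    def __init__(self, is_abbreviation: Callable[[str], bool] = lambda t: True) -> None:
        self._is_abbreviation = is_abbreviation
        self._table: Dict[str, List[Tuple[str, ...]]] = {}

    def add(
        self,
        short_form: str,
        long_form: str,
        tokenize: Callable[[str], Sequence[str]] = str.split,
    ) -> None:
        # The long form is stored tokenized, one tuple entry per word.
        self._table.setdefault(short_form, []).append(tuple(tokenize(long_form)))

    def get_synonyms(self, token: str) -> List[Tuple[str, ...]]:
        # The check runs before the dictionary look-up.
        if not self._is_abbreviation(token):
            return []
        return self._table.get(token, [])


abbs = AbbreviationTable(is_abbreviation=str.isupper)
abbs.add("CHF", "congestive heart failure")
print(abbs.get_synonyms("CHF"))  # [('congestive', 'heart', 'failure')]
print(abbs.get_synonyms("chf"))  # []
```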
FuzzyRegex
- class iamsystem.FuzzyRegex(algo_name: str, pattern: str, pattern_name: str)[source]
Bases: ContextFreeAlgo, INormLabelAlgo
A FuzzyAlgo to handle regular expressions. Useful when one or multiple tokens of a keyword need to be matched to a regular expression.
- get_syns_of_token(token: TokenT) Iterable[Tuple[str, ...]][source]
Return the pattern_name if this token matches the regular expression.
- get_syns_of_word(word: str) Iterable[Tuple[str, ...]][source]
Return the pattern_name if this word matches it.
- replace_pattern_in_keyword(keyword: IKeyword, tokenizer: ITokenizer) IKeyword[source]
Utility function to replace keyword’s tokens that match the pattern by the pattern name.
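The idea is that a keyword contains a placeholder (the pattern name) and any document token matching the pattern is offered that placeholder as a synonym. The sketch below uses hypothetical names (regex_synonyms, NUMVALUE) and assumes the whole word must match, which the entries above do not specify.

```python
import re
from typing import List, Tuple


def regex_synonyms(word: str, pattern: str, pattern_name: str) -> List[Tuple[str, ...]]:
    """Return the pattern name as a synonym when the word matches."""
    if re.fullmatch(pattern, word):
        return [(pattern_name,)]
    return []


NUMBER = r"[0-9]*[.,]?[0-9]+"
print(regex_synonyms("2.1", NUMBER, "NUMVALUE"))    # [('NUMVALUE',)]
print(regex_synonyms("blood", NUMBER, "NUMVALUE"))  # []
```

With a keyword such as “calcium NUMVALUE”, the token “2.1” in a document would then line up with the placeholder.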
WordNormalizer
- class iamsystem.WordNormalizer(name: str, norm_fun: Callable[[str], str])[source]
Bases: NormLabelAlgo
A FuzzyAlgo to handle normalization techniques such as stemming and lemmatization.
- add_words(words: Iterable[str]) None[source]
A list of possible word synonyms, in general all the tokens of your keywords. An easy way to provide these tokens is to call get_keywords_unigrams() of the matcher.
- Parameters
words – A list of words to normalize and store.
- Returns
None.
- get_syns_of_word(word: str) Iterable[Tuple[str, ...]][source]
Return all the words that have the same normalized form as this word.
For example, if the normalize function is an English stemmer and you called add_words with [“eating”], this instance stores the stem “eat” associated with the word “eating”. Then, if a document contains the token “eats”, since the stem is the same, this function returns the synonym “eating”.
- Parameters
word – a string, i.e. a word from a document.
- Returns
word synonyms and algorithm name.
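The “eating” / “eats” example can be reproduced with a tiny index from normalized form to stored words. Both NormIndex and toy_stem are hypothetical; toy_stem only strips two suffixes and stands in for a real stemmer.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Iterable, List, Set


def toy_stem(word: str) -> str:
    """Very rough stemmer: strip a trailing 'ing' or 's'."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


class NormIndex:
    """Map each normalized form to the keyword words that produce it."""

    def __init__(self, norm_fun: Callable[[str], str]) -> None:
        self._norm_fun = norm_fun
        self._index: DefaultDict[str, Set[str]] = defaultdict(set)

    def add_words(self, words: Iterable[str]) -> None:
        for word in words:
            self._index[self._norm_fun(word)].add(word)

    def get_syns_of_word(self, word: str) -> List[str]:
        return sorted(self._index.get(self._norm_fun(word), ()))


index = NormIndex(toy_stem)
index.add_words(["eating"])
print(index.get_syns_of_word("eats"))  # ['eating']
```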
SpellWise
SpellWiseWrapper
- class iamsystem.SpellWiseWrapper(spellwise_algo: ESpellWiseAlgo, max_distance: int, min_nb_char: int = 5, name: str = None)[source]
Bases: NormLabelAlgo
A FuzzyAlgo that wraps an algorithm from the spellwise library.
- add_words(words: Iterable[str], warn=False) None[source]
A list of possible word synonyms, in general all the tokens of your keywords. An easy way to provide these tokens is to call the get_keywords_unigrams() method after you have added your keywords to the matcher instance.
- Parameters
words – A list of possible synonyms.
warn – raise a warning if a word added is ignored. Default False.
- Returns
None.
- add_words_to_ignore(words: Iterable[str])[source]
Add words that the algorithm will ignore: no string distance will be computed.
- get_syns_of_word(word: str) Iterable[Tuple[str, ...]][source]
Return the closest words if the word is not a word to ignore.
- property max_distance
Maximum edit distance (see spellwise documentation).
- property min_nb_char
The minimum number of characters a word must have not to be ignored.
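How max_distance and min_nb_char interact can be shown with a hand-written Levenshtein distance. This sketch does not use spellwise; levenshtein and closest_words are hypothetical helpers, and the linear scan over the vocabulary is for illustration only.

```python
from typing import Iterable, List


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def closest_words(
    word: str, vocabulary: Iterable[str], max_distance: int, min_nb_char: int = 5
) -> List[str]:
    """Words within max_distance edits; short words are ignored."""
    if len(word) < min_nb_char:
        return []
    return [v for v in vocabulary if levenshtein(word, v) <= max_distance]


vocab = ["insuffisance", "cardiaque"]
print(closest_words("insufisance", vocab, max_distance=1))  # ['insuffisance']
print(closest_words("ins", vocab, max_distance=1))          # []
```

Ignoring short words is a common guard, since a distance of 1 on a three-letter token matches many unrelated words.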
ESpellWiseAlgo
- class iamsystem.ESpellWiseAlgo(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]
Bases: Enum
Enumerated list of spellwise library algorithms. See spellwise documentation for more information.
- CAVERPHONE_1 = <class 'spellwise.algorithms.caverphone_one.CaverphoneOne'>
- CAVERPHONE_2 = <class 'spellwise.algorithms.caverphone_two.CaverphoneTwo'>
- EDITEX = <class 'spellwise.algorithms.editex.Editex'>
- LEVENSHTEIN = <class 'spellwise.algorithms.levenshtein.Levenshtein'>
- SOUNDEX = <class 'spellwise.algorithms.soundex.Soundex'>
- TYPOX = <class 'spellwise.algorithms.typox.Typox'>
Brat
BratDocument
- class iamsystem.BratDocument[source]
Bases: object
Class representing a Brat Document containing Brat’s annotations, namely Brat Entity and Brat Note in this package. A BratDocument should be linked to a single text document. Entities and notes can be serialized in a text file with ‘ann’ extension, one per line. See https://brat.nlplab.org/standoff.html
- add_annots(annots: List[Annotation], text: str, keyword_attr: str = None, brat_type: str = None) None[source]
Add iamsystem annotations to convert them to Brat format.
- Parameters
annots – a list of Annotation, the Matcher output.
text – the document these annotations come from.
keyword_attr – the attribute name of an IKeyword that stores brat_type. Defaults to None. If None, the brat_type parameter must be used.
brat_type – a string, the Brat entity type for all these annotations. Defaults to None. If None, the keyword_attr parameter must be used.
- Returns
None
- add_entity(brat_type: str, offsets: List[IOffsets], text: str) None[source]
Add a Brat Entity.
- Parameters
brat_type – A Brat entity type (see Brat documentation).
offsets – a list of (start,end) annotation offsets. See IOffsets. A list is expected since the tokens can be discontinuous.
text – document substring using (start,end) offsets (not the document itself).
- Returns
None
- entities_to_string() str[source]
Brat entities in the Brat format ready to be serialized to ‘.ann’ text file.
- get_entities() Iterable[BratEntity][source]
An iterable of Brat entities.
BratEntity
- class iamsystem.BratEntity(entity_id: str, brat_type: str, offsets: Sequence[IOffsets], text: str)[source]
Bases: object
Class representing a Brat Entity. https://brat.nlplab.org/standoff.html: ‘Each entity annotation has a unique ID and is defined by type (e.g. Person or Organization) and the span of characters containing the entity mention (represented as a “start end” offset pair).’
Format: ID TYPE START END[;START END]* TEXT.
- __init__(entity_id: str, brat_type: str, offsets: Sequence[IOffsets], text: str)[source]
Create a Brat Entity.
- Parameters
entity_id – a unique ID (^T[0-9]+$).
brat_type – A Brat entity type (see Brat documentation).
offsets – (start,end) annotation offsets. See IOffsets.
text – document substring using (start,end) offsets.
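The entity line format can be sketched as a single formatting function. format_entity is a hypothetical helper, not a method of BratEntity; it follows the Brat standoff convention of separating the ID, the “TYPE offsets” group and the text with tab characters.

```python
from typing import Sequence, Tuple


def format_entity(
    entity_id: str, brat_type: str, offsets: Sequence[Tuple[int, int]], text: str
) -> str:
    """Build one line of a '.ann' file: ID<TAB>TYPE START END[;START END]*<TAB>TEXT."""
    spans = ";".join(f"{start} {end}" for start, end in offsets)
    return f"{entity_id}\t{brat_type} {spans}\t{text}"


line = format_entity("T1", "Disease", [(0, 15)], "prostate cancer")
print(repr(line))  # 'T1\tDisease 0 15\tprostate cancer'
```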
BratNote
- class iamsystem.BratNote(note_id: str, ref_id: str, note: str)[source]
Bases: object
Class representing a Brat Note. https://brat.nlplab.org/standoff.html Brat notes are used to store additional information on a detected entity. Format: #ID TYPE REFID NOTE
- __init__(note_id: str, ref_id: str, note: str)[source]
Create a Brat Note.
- Parameters
note_id – a unique ID (^#[0-9]+$)
ref_id – a unique ID. For a BratEntity, the format is (^T[0-9]+$)
note – any string comment.
- TYPE = 'IAMSYSTEM'
BratNote type. Replace it with ‘AnnotatorNotes’ to make the note human-writable in the Brat interface.
BratWriter
- class iamsystem.BratWriter[source]
Bases: object
Utility class to write IAMsystem annotations in Brat format to a text file.
- classmethod saveEntities(brat_entities: Iterable[BratEntity], write: Callable[[str], Any]) None[source]
Write Brat entities.
- Parameters
brat_entities – an iterable of Brat entities.
write – a write function (ex: f.write from ‘with open(filename, ‘w’) as f:’)
- Returns
None