spaCy
With a list of words
This package provides a spaCy component that adds the iamsystem algorithm to a spaCy pipeline.
from spacy.lang.fr import French

# Imported for its side effect: it makes the "iamsystem_matcher" factory available.
from iamsystem.spacy.component import IAMsystemBuildSpacy  # noqa

nlp = French()
nlp.add_pipe(
    "iamsystem_matcher",
    name="iamsystem",
    last=True,
    config={
        "build_params": {
            "keywords": [
                "North America",
                "South America",
            ],
            "abbreviations": [("amer", "America")],
            "stopwords": ["and"],
            "w": 2,
            "remove_nested_annots": True,
            "spellwise": [dict(max_distance=1, measure="Levenshtein")],
        },
    },
)
doc = nlp("Northh and South Amer.")
assert len(doc.spans["iamsystem"]) == 2
spans = doc.spans["iamsystem"]
for span in spans:
    print(span._.iamsystem)
# Northh Amer 0 6;17 21 North America
# South Amer 11 21 South America
build_params expects JSON-serializable Matcher build parameters. See IAMsystemBuildSpacy for the configuration options of this component.
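One way to check in advance whether a set of build parameters is serializable is to run it through the standard json module. This is only a sketch: the is_json_serializable helper is hypothetical and not part of iamsystem or spaCy, and it assumes that JSON serializability is the relevant criterion.

```python
import json


def is_json_serializable(value) -> bool:
    """Hypothetical helper: True if `value` can be dumped to JSON."""
    try:
        json.dumps(value)
    except (TypeError, ValueError):
        return False
    return True


# Same parameters as in the example above.
build_params = {
    "keywords": ["North America", "South America"],
    "abbreviations": [("amer", "America")],
    "stopwords": ["and"],
    "w": 2,
    "remove_nested_annots": True,
    "spellwise": [dict(max_distance=1, measure="Levenshtein")],
}
params_ok = is_json_serializable(build_params)

# An arbitrary Python object in the same position is rejected.
bad_params = {"keywords": [object()]}
bad_ok = is_json_serializable(bad_params)

print(params_ok, bad_ok)
```

JSON has no tuple type, so a tuple such as ("amer", "America") comes back as a list after a round trip through json.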
With a list of keywords
Because the Keyword implementation is not JSON serializable, passing Keyword instances to the keywords parameter raises an error. You have three options:
Create your own IKeyword implementation that is JSON serializable.
Pass a registered function:
from typing import Iterable

import spacy
from iamsystem import Entity, IKeyword, Terminology


@spacy.registry.misc("umls_ents.v1")
def get_termino_umls() -> Iterable[IKeyword]:
    """An imaginary set of umls ents."""
    termino = Terminology()
    ent1 = Entity("Insuffisance Cardiaque", "I50.9")
    ent2 = Entity("Insuffisance Cardiaque Gauche", "I50.1")
    termino.add_keywords(keywords=[ent1, ent2])
    return termino

"build_params": {
    "keywords": {"@misc": "umls_ents.v1"},
}
Note that if you call nlp.to_disk, your keywords will not be serialized, so the registered function must be available again when the pipeline is loaded.
Convert each Keyword to a dictionary with its asdict() method, and pass the module and class name of the Keyword dataclass:
config={
    "serialized_kw": {
        "module": "iamsystem",
        "class_name": "Keyword",
        "kws": [Keyword(label="insuffisance cardiaque").asdict()],
    },
    "build_params": {"w": 1},
},
Note that with this option, your keywords will be serialized if you call nlp.to_disk.
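The difference between passing an instance and passing its dictionary form can be seen with the standard library alone. The sketch below uses a stand-in dataclass, DemoKeyword, which is hypothetical and not iamsystem's Keyword. It shows only the general idea: the instance fails JSON serialization, while its dictionary form survives a round trip and can be used to recreate the object. How the component itself rebuilds keywords from module and class_name is not covered here.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class DemoKeyword:
    """Hypothetical stand-in for a keyword dataclass (not iamsystem's Keyword)."""

    label: str


kw = DemoKeyword(label="insuffisance cardiaque")

# A dataclass instance cannot be dumped to JSON as-is.
try:
    json.dumps(kw)
    instance_ok = True
except TypeError:
    instance_ok = False

# Its dictionary form can, so it is safe to store in a config.
payload = json.dumps({"class_name": "DemoKeyword", "kws": [asdict(kw)]})

# After loading, the dictionaries are enough to recreate the keywords.
loaded = json.loads(payload)
restored = [DemoKeyword(**fields) for fields in loaded["kws"]]

print(instance_ok, restored == [kw])
```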