spaCy

With a list of words

This package provides a spaCy component that adds the iamsystem algorithm to a spaCy pipeline.

from spacy.lang.fr import French

# Importing the component module registers the "iamsystem_matcher" factory.
from iamsystem.spacy.component import IAMsystemBuildSpacy  # noqa

nlp = French()
nlp.add_pipe(
    "iamsystem_matcher",
    name="iamsystem",
    last=True,
    config={
        "build_params": {
            "keywords": [
                "North America",
                "South America",
            ],
            "abbreviations": [("amer", "America")],
            "stopwords": ["and"],
            "w": 2,
            "remove_nested_annots": True,
            "spellwise": [dict(max_distance=1, measure="Levenshtein")],
        },
    },
)
doc = nlp("Northh and South Amer.")
# Annotations are stored in the span group named after the component.
spans = doc.spans["iamsystem"]
assert len(spans) == 2
for span in spans:
    print(span._.iamsystem)
# Northh Amer	0 6;17 21	North America
# South Amer	11 21	South America

The build_params entry expects JSON-serializable Matcher build parameters. See IAMsystemBuildSpacy for how to configure this component.
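The serialization constraint can be checked before the configuration is handed to spaCy. The following is a minimal sketch using only the standard library, assuming that a JSON round trip is a reasonable proxy for what the component accepts; spaCy's own config handling may differ in detail.

```python
import json

# Same parameters as the pipeline example above, written as plain Python data.
build_params = {
    "keywords": ["North America", "South America"],
    "abbreviations": [("amer", "America")],
    "stopwords": ["and"],
    "w": 2,
    "remove_nested_annots": True,
    "spellwise": [dict(max_distance=1, measure="Levenshtein")],
}

# json.dumps raises TypeError if any value is not JSON serializable.
encoded = json.dumps(build_params)
restored = json.loads(encoded)

# JSON has no tuple type, so the abbreviation pair comes back as a list.
print(restored["abbreviations"])  # [['amer', 'America']]
```

Strings, numbers, booleans, lists and dictionaries survive the round trip; only the tuple changes type.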

With a list of keywords

Since the Keyword implementation is not JSON serializable, passing Keyword instances to the keywords parameter raises an error. You have three options:

  • Create your own IKeyword implementation that is JSON serializable.

  • Pass a registered function:

from typing import Iterable

import spacy

from iamsystem import Entity, IKeyword, Terminology


@spacy.registry.misc("umls_ents.v1")
def get_termino_umls() -> Iterable[IKeyword]:
    """An imaginary set of umls ents."""
    termino = Terminology()
    ent1 = Entity("Insuffisance Cardiaque", "I50.9")
    ent2 = Entity("Insuffisance Cardiaque Gauche", "I50.1")
    termino.add_keywords(keywords=[ent1, ent2])
    return termino

"build_params": {
    "keywords": {"@misc": "umls_ents.v1"},
}

Note that with this option, calling nlp.to_disk does not serialize your keywords.

  • Convert each Keyword to a dictionary with its asdict() method, and pass the module and class name of the Keyword dataclass:

config={
    "serialized_kw": {
        "module": "iamsystem",
        "class_name": "Keyword",
        "kws": [Keyword(label="insuffisance cardiaque").asdict()],
    },
    "build_params": {"w": 1},
},

Note that with this option, calling nlp.to_disk does serialize your keywords.
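The constraint behind these three options can be reproduced with the standard library alone. The sketch below uses DemoKeyword, a hypothetical stand-in dataclass rather than iamsystem's Keyword, and dataclasses.asdict in place of the asdict() method. It illustrates the general pattern only and makes no claim about how iamsystem rebuilds keywords internally.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class DemoKeyword:
    """Hypothetical stand-in for a keyword dataclass (not iamsystem's Keyword)."""

    label: str


kw = DemoKeyword(label="insuffisance cardiaque")

# A dataclass instance is rejected by the default JSON encoder.
try:
    json.dumps({"keywords": [kw]})
    instance_ok = True
except TypeError:
    instance_ok = False

# Its dictionary form only holds plain strings, so it serializes cleanly.
payload = json.dumps({"kws": [asdict(kw)]})

# The dictionary is enough to rebuild an equal instance later.
rebuilt = [DemoKeyword(**d) for d in json.loads(payload)["kws"]]
print(instance_ok, rebuilt == [kw])  # False True
```

Storing the module and class name next to the dictionaries, as in the third option, is what would tell a loader which class to instantiate when the pipeline is read back from disk.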