@polm
Created October 12, 2021 10:59
Check differences with degree tokenization changes
# Script to test changes related to degree tokenization.
# https://github.com/explosion/spaCy/pull/9155
import spacy
langs = ("af am ar az bg bn ca cs da de el en es et eu fa fi fr ga grc gu he hi "
"hr hu hy id is it ja kn ko ky lb lij lt lv mk ml mr nb ne nl pl pt ro "
"ru sa si sk sl sq sr sv ta te th ti tl tn tr tt uk ur vi xx yo zh").split()
check = ("°c °f °k °C °F °K °c. °f. °k. °C. °F. °K. 1°c 1°f 1°k 1°C 1°F 1°K 1°c. "
"1°f. 1°k. 1°C. 1°F. 1°K.").split()
for lang in langs:
    try:
        # Build a blank pipeline so only the language's tokenizer rules apply.
        nlp = spacy.blank(lang)
        for ex in check:
            toks = "|".join([tok.text for tok in nlp(ex)])
            print(lang.upper(), toks)
    except Exception:
        # this can happen if we don't have a dependency
        print(f"Skipping {lang}")
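
To actually check differences, one option is to save the script's output once before and once after the tokenization change, then compare the two runs. The following is a minimal sketch of that comparison using the standard library's `difflib`; the helper name `diff_tokenizations` and the sample strings are hypothetical and are not real spaCy output.

```python
import difflib


def diff_tokenizations(before, after):
    """Return unified-diff lines between two lists of 'LANG tok|tok' strings.

    Hypothetical helper: `before` and `after` are assumed to be the printed
    lines of the script above, captured from two different spaCy checkouts.
    """
    return list(
        difflib.unified_diff(
            before, after, fromfile="before", tofile="after", lineterm=""
        )
    )


# Made-up example lines for illustration only (not real spaCy output).
before = ["EN 1|°|C", "EN 1|°F|."]
after = ["EN 1|°C", "EN 1|°F|."]

for line in diff_tokenizations(before, after):
    print(line)
```

Lines prefixed with `-` appear only in the first run and lines prefixed with `+` only in the second; identical runs produce an empty list.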