Skip to content

Instantly share code, notes, and snippets.

@adosib
Created October 22, 2024 18:09
Show Gist options
  • Save adosib/67ca24e46c9e4293656332a0f9a77d06 to your computer and use it in GitHub Desktop.
Save adosib/67ca24e46c9e4293656332a0f9a77d06 to your computer and use it in GitHub Desktop.
Regular expressions for validating ICD-10 PCS and ICD-10 CM codes. There aren't enough codes to really need a regex (~80k codes for each code set), BUT if latency is more important than being 100% correct (false positives possible), these regexes will do better than a database or cache lookup.

ICD-10 PCS

\b[0-9BCDFGHX][0-9A-HJ-NP-Z]{6}\b

False positives exist! E.g. “G0BBLDGˮ would be identified as an ICD10 PCS code with the expression even though it isnʼt one.

But no false negatives (tested against all 2023 CMS-approved codes):

In [13]: import re
In [14]: with open("icd10pcs_codes_2023.txt") as pcs:
    ...:     pcs_codes = pcs.readlines()
    ...:
In [15]: len(pcs_codes)
Out[15]: 78530
In [16]: matches = 0
In [17]: for line in pcs_codes:
    ...:     if re.search(r'\b[0-9BCDFGHX][0-9A-HJ-NP-Z]{6}\b',
...: ...:
In [18]: matches
Out[18]: 78530
matches += 1

ICD-10 CM

\b[A-TV-Z]\d[A-Z\d]\.?[A-Z\d]{0,4}\b This follows the specification*:

  • 3 - 7 characters
  • Character 1 is alpha (all letters except U are used) Character 2 is numeric
  • Characters 3  7 are alpha or numeric
  • Use of decimal after 3 characters

Brute force validated as well, though there were 3 false negatives:

In [1]: import re
In [2]: with open("icd10cm_codes_2024.txt") as cm:
   ...:     icd10cm = cm.readlines()
   ...:
In [3]: len(icd10cm)
Out[3]: 74044
In [5]: matches = 0
In [6]: for line in icd10cm:
   ...:     if re.search(r'\b[A-TV-Z]\d[A-Z\d]\.?[A-Z\d]{0,4}\b
   ...:         matches += 1
   ...:
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment