Created
January 19, 2016 16:49
Better handling of English contractions in lunr.
lunr.contractionTrimmer = function (token) {
  return token.replace(/('ve|n't|'d|'ll|'s|'re)$/, "")
}

// Register the trimmer itself (not lunr.stopWordFilter) under its own label
lunr.Pipeline.registerFunction(lunr.contractionTrimmer, 'contractionTrimmer')

// Plugin: run the contraction trimmer immediately after lunr's built-in trimmer
var englishContractions = function (idx) {
  idx.pipeline.after(lunr.trimmer, lunr.contractionTrimmer)
}
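Why the plugin places the function after lunr.trimmer can be sketched without lunr itself. The pattern is anchored with `$`, so trailing punctuation has to be gone before a contraction suffix can match. Everything below is a hypothetical stand-in: `trimmerSketch` only approximates what the built-in trimmer is assumed to do (strip non-word characters from both ends), and `runPipeline` is an illustrative helper, not a lunr API.

```javascript
// Assumed approximation of lunr's built-in trimmer: strip non-word
// characters from both ends of a token. Not copied from lunr.
function trimmerSketch(token) {
  return token.replace(/^\W+/, "").replace(/\W+$/, "");
}

// Same idea as the gist's contractionTrimmer.
function contractionTrimmerSketch(token) {
  return token.replace(/('ve|n't|'d|'ll|'s|'re)$/, "");
}

// Hypothetical helper: pass a token through the functions in order.
function runPipeline(token, fns) {
  return fns.reduce(function (acc, fn) { return fn(acc); }, token);
}

// Trimmer first: the comma is removed, then 've matches at the end.
var trimThenContract = runPipeline("they've,", [trimmerSketch, contractionTrimmerSketch]);

// Contraction trimmer first: the comma blocks the $ anchor, so 've survives.
var contractThenTrim = runPipeline("they've,", [contractionTrimmerSketch, trimmerSketch]);

console.log(trimThenContract); // "they"
console.log(contractThenTrim); // "they've"
```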
I took a bit of a blunderbuss approach to this:
token.replace(/[^A-Za-z é]/g, "");
I had an issue where the possessive form of the surname "Burns" had been misspelt as "Burn's" in the corpus, and I wanted to add tolerance for that kind of misspelling.
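To make the trade-off concrete, here is a small sketch comparing the strip-everything approach with suffix trimming on a few tokens. The function names and the sample tokens are made up for illustration; only the two regular expressions come from this thread.

```javascript
// Strip every character that is not an ASCII letter, a space, or "é".
function stripNonLetters(token) {
  return token.replace(/[^A-Za-z é]/g, "");
}

// Remove a trailing contraction suffix only.
function trimContraction(token) {
  return token.replace(/('ve|n't|'d|'ll|'s|'re)$/, "");
}

// "na\u00efve" is "naïve" with a precomposed i-diaeresis.
var samples = ["Burn's", "don't", "mp3", "na\u00efve"];

samples.forEach(function (token) {
  console.log(token, "->", stripNonLetters(token), "|", trimContraction(token));
});
// Burn's -> Burns | Burn
// don't  -> dont  | do
// mp3    -> mp    | mp3
// naïve  -> nave  | naïve
```

Stripping keeps "Burn's" and "Burns" together, which is the goal here, but it also drops digits and any accented letter other than "é", so whether it fits depends on the corpus.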
I'm considering using this in our production environment.
Questions:
1. Why n't, and not just 't?
2. return token.replace(/('m|'ve|'t|'d|'ll|'ve|'s|'re)$/, "") also replaces "I'm" - seems to work alright. Is there a downside? Did you leave it out on purpose?
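The difference between the two patterns is easy to check in isolation. This is only a sketch: `trimWith` is a hypothetical helper, the tokens are lowercase on the assumption that they have already been lowercased before reaching this step, and the duplicate 've is dropped from both patterns since it has no effect.

```javascript
// Pattern from the gist: removes the whole "n't" suffix.
var original = /('ve|n't|'d|'ll|'s|'re)$/;

// Variant from the question: removes only "'t", and also handles "'m".
var variant = /('m|'ve|'t|'d|'ll|'s|'re)$/;

// Hypothetical helper for the comparison.
function trimWith(pattern, token) {
  return token.replace(pattern, "");
}

["don't", "can't", "isn't", "i'm"].forEach(function (token) {
  console.log(token, "->", trimWith(original, token), "|", trimWith(variant, token));
});
// don't -> do    | don
// can't -> ca    | can
// isn't -> is    | isn
// i'm   -> i'm   | i
```

With n't, "isn't" reduces to "is" and "don't" to "do", at the cost of "can't" becoming "ca". With 't, "can't" becomes "can", but "isn't" and "don't" leave "isn" and "don" behind. Which set of leftovers is less harmful probably depends on the stop word list and stemmer that run afterwards.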