Created October 29, 2022 15:05
One sentence per line
-- Replace each Space that follows a Str ending in a period with a
-- SoftBreak, so every sentence starts on its own line.
local function sentence_lines (el)
  local inlines = el.content
  for i = 2, #inlines do
    if inlines[i].t == 'Space' and
       inlines[i-1].t == 'Str' and
       inlines[i-1].text:match '%.$' then
      inlines[i] = pandoc.SoftBreak()
    end
  end
  return el
end

return {
  -- First pass: turn existing soft line breaks into plain spaces.
  {SoftBreak = function () return pandoc.Space() end},
  -- Later passes: re-break paragraphs and plain blocks at sentence ends.
  {Para = sentence_lines},
  {Plain = sentence_lines},
}
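
The boundary test in the filter can be tried without pandoc. The following is only a sketch: the `Str`, `Space`, and `SoftBreak` helpers are hypothetical stand-ins that mimic the two fields the filter reads (`t` and `text`), not the real pandoc constructors. It also shows the kind of false positive an abbreviation such as "e.g." produces.

```lua
-- Hypothetical stand-ins for pandoc inlines; only `t` and `text` are modeled.
local function Str(text) return {t = 'Str', text = text} end
local function Space() return {t = 'Space'} end
local function SoftBreak() return {t = 'SoftBreak'} end

-- Same test as the filter: a Space preceded by a Str that ends in '.'.
local function sentence_lines(inlines)
  for i = 2, #inlines do
    if inlines[i].t == 'Space' and
       inlines[i-1].t == 'Str' and
       inlines[i-1].text:match '%.$' then
      inlines[i] = SoftBreak()
    end
  end
  return inlines
end

local inlines = sentence_lines{
  Str 'One.', Space(), Str 'Two', Space(), Str 'e.g.', Space(), Str 'three.'
}

-- Collect the element tags to see where breaks were inserted.
local tags = {}
for i, el in ipairs(inlines) do tags[i] = el.t end
print(table.concat(tags, ' '))
-- prints: Str SoftBreak Str Space Str SoftBreak Str
```

The break after `One.` is wanted; the break after `e.g.` is not, because the pattern `%.$` cannot tell an abbreviation from the end of a sentence.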
@bpj Yes, of course: with the much more aggressive approach of not leaving sentences on the table, there will be false positives (I know I'll have to tackle some abbreviation issues at some point), but with the original I was getting hundreds of paragraphs in a book that had 2-10 sentences not split up. I'll definitely be looking into `lpeg.utfR`, because better locale-dependent case detection will be important.

@tarleb Yes, definitely, I had that in mind already. At the moment, though, I'm going to be rolling it out to a few dozen book projects in two languages over the next few weeks/month, and it will be easier to iterate on in conjunction with other normalization stuff I use. When it gets a little more mature and can move at its own pace, it definitely should land in its own project.