@tarleb
Created October 29, 2022 15:05
One sentence per line
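-- Replace the space after a word ending in a period with a soft line break.
local function sentence_lines (el)
  local inlines = el.content
  for i = 2, #inlines do
    if inlines[i].t == 'Space' and
       inlines[i-1].t == 'Str' and
       inlines[i-1].text:match '%.$' then
      inlines[i] = pandoc.SoftBreak()
    end
  end
  return el
end

return {
  -- Normalize first: turn every existing soft break into a plain space.
  {SoftBreak = function () return pandoc.Space() end},
  -- Then insert soft breaks at sentence ends in paragraphs and plain blocks.
  {Para = sentence_lines},
  {Plain = sentence_lines},
}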
alerque commented Aug 16, 2024

@bpj Yes, of course: with the much more aggressive approach of not leaving sentences on the table there will be false positives (I know I'll have to tackle some abbreviation issues at some point), but with the original I was getting hundreds of paragraphs in a book that had 2-10 sentences not split up. I'll definitely be looking into lpeg.utfR, because better locale-dependent case detection will be important.

@tarleb Yes, definitely, I had that in mind already. At the moment I'm going to be rolling it out to a few dozen book projects in two languages over the next few weeks/month, and it will be easier to iterate on in conjunction with the other normalization stuff I use. When it gets a little more mature and can move at its own pace, it definitely should land in its own project.
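
The abbreviation false positives mentioned above could be reduced with a small lookup before a word is treated as a sentence end. The following is only a rough sketch in plain Lua, not the approach used in anyone's filter in this thread: the `abbreviations` table, `ends_sentence`, and `split_sentences` are hypothetical names, the list is English-only, and an abbreviation that really does end a sentence (e.g. a trailing "etc.") will be missed.

```lua
-- Hypothetical abbreviation list (an assumption); extend per language/project.
local abbreviations = {
  ['e.g.'] = true,
  ['i.e.'] = true,
  ['etc.'] = true,
  ['vs.']  = true,
  ['Dr.']  = true,
  ['Mr.']  = true,
  ['Mrs.'] = true,
}

--- Returns true if `word` plausibly ends a sentence.
local function ends_sentence (word)
  -- Ignore trailing closing brackets and quotes, as in `end.)`.
  local core = word:gsub('[%)%]"\']+$', '')
  -- Accept '.', '?' and '!' as terminators.
  if not core:match('[.?!]$') then
    return false
  end
  -- A known abbreviation is not treated as a sentence end.
  return not abbreviations[core]
end

--- Splits plain text into one sentence per list entry (demo helper).
local function split_sentences (text)
  local lines, current = {}, {}
  for word in text:gmatch('%S+') do
    current[#current + 1] = word
    if ends_sentence(word) then
      lines[#lines + 1] = table.concat(current, ' ')
      current = {}
    end
  end
  -- Keep any trailing words that lack a terminator.
  if #current > 0 then
    lines[#lines + 1] = table.concat(current, ' ')
  end
  return lines
end
```

In the filter above, a check like `ends_sentence(inlines[i-1].text)` could stand in for the `text:match '%.$'` test; whether that trade-off between missed splits and false positives is right depends on the texts being processed.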
