- 🐮🤠 Welcome to the (Derivative) Rodeo 🤠🐮
Presented on <2023-05-03 Wed> at Samvera Virtual Connect 2023
- Name: Jeremy Friesen
- Pronouns: he/him
- Employer: Software Services by Scientist.com
- Job Title: Senior Lead Engineer
- Email: [email protected]
- Personal Blog: https://takeonrules.com
“This ain’t my first rodeo.”1
In this talk I’ll go over:
- The Problem Statement
- History
- The Rodeo
- Wrap Up
- What?
- Where?
- Why?
Given that I have a million billion objects
When I ingest those objects
Then things are really slow
Also
Given that I have a million billion objects
And I already have a mixture of derivatives
When I ingest those objects
Then I really don’t want to recreate things I already have
Time why you punish me
Like a wave bashing into the shore
You wash away my dreams
Time why you walk away
Like a friend with somewhere to go
You left me crying
Can you teach me 'bout tomorrow
And all the pain and sorrow running free
'Cause tomorrow's just another day
And I don't believe in time
The Gems:
- Hydra::Derivatives
- Hyrax::DerivativeService
- NewspaperWorks
- Extending/Overriding NewspaperWorks
- IiifPrint
- DerivativeRodeo
Said in the voice of Sophia from Golden Girls:
“Picture it: Minneapolis, 2013. A younger Justin Coyne creates a repository.”
The Hydra::Derivatives is a venerable and long-used gem for generating derivatives for the Samvera community. It’s very configurable and extensible.
The Hyrax::DerivativeService implements the interface for generating derivatives for a FileSet. It uses the registered services to find the first valid one and then uses that to create the derivatives.
# @api public
#
# Get the first valid registered service for the given file_set.
#
# @param file_set [#uri, #file_set]
# @return [#cleanup_derivatives, #create_derivatives, #derivative_url]
def self.for(file_set, services: Hyrax.config.derivative_services)
  services.map { |service| service.new(file_set) }.find(&:valid?) ||
    new(file_set)
end
The Hyrax::FileSetDerivativesService class leverages Hydra::Derivatives and, by default, is registered as the one and only entry in .services. It is the long-standing approach for creating derivatives.
For each original file, we create derivatives of that original file based on its mime type.
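As a rough sketch (not Hyrax code), the "first valid service wins" lookup can be tried out in plain Ruby. FileSetStub, PdfSplitService, FallbackService, and service_for are all made-up stand-ins for illustration; only the shape of the lookup comes from the method shown earlier.

```ruby
# Hypothetical stand-ins; none of these classes exist in Hyrax.
FileSetStub = Struct.new(:mime_type)

class PdfSplitService
  def initialize(file_set)
    @file_set = file_set
  end

  # Only claims file sets whose original file is a PDF.
  def valid?
    @file_set.mime_type == "application/pdf"
  end
end

class FallbackService
  def initialize(file_set)
    @file_set = file_set
  end

  # The fallback accepts every file set.
  def valid?
    true
  end
end

# Same shape as the lookup above: instantiate each registered service, in
# order, and return the first one that reports itself valid.
def service_for(file_set, services:, fallback: FallbackService)
  services.map { |service| service.new(file_set) }.find(&:valid?) ||
    fallback.new(file_set)
end

pdf = service_for(FileSetStub.new("application/pdf"), services: [PdfSplitService])
tiff = service_for(FileSetStub.new("image/tiff"), services: [PdfSplitService])
puts pdf.class  # PdfSplitService
puts tiff.class # FallbackService
```

The order of the registered services matters: a service that answers valid? for everything shadows whatever comes after it.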
Created by the Boston Public Library and University of Utah, the NewspaperWorks gem introduced quite a few concepts:
- models for Title, Issue, Page, and Article
- batch ingest via command line
- OCR and ALTO creation
- newspaper-specific metadata fields
- full-text search
- calendar-based issue browsing
- advanced search
- OCR keyword match highlighting
- viewer with page navigation and deep zooming
It does some of this by creating a new derivative service and registering that in the aforementioned Hyrax.config.derivative_services.
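To make "registering" concrete, here is a minimal sketch of putting a new service at the front of the list so it gets the first chance to claim a file set. ConfigStub stands in for the Hyrax configuration, and the two service class names and register_first are hypothetical. Whether a given application prepends, appends, or replaces the list is an assumption on my part.

```ruby
# Stand-in for the Hyrax configuration object; only the
# derivative_services accessor named in the text is modeled here.
ConfigStub = Struct.new(:derivative_services)

# Hypothetical service names, used only for illustration.
class DefaultDerivativeService; end
class NewspaperDerivativeService; end

# Put the given service first, without registering it twice.
def register_first(config, service)
  config.derivative_services = [service] + (config.derivative_services - [service])
  config
end

config = register_first(ConfigStub.new([DefaultDerivativeService]), NewspaperDerivativeService)
p config.derivative_services # [NewspaperDerivativeService, DefaultDerivativeService]
```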
For the NNP we leveraged NewspaperWorks, with several modifications and omissions.
Fundamentally we wanted to:
- Rip PDFs apart, one image per page
- Run OCR on those images
- Index the image text as part of the parent PDF
All in service of a more pleasant and responsive IIIF Viewer Experience for the PDFs.
IiifPrint: The woefully incorrect name of a gem SoftServ has been working on.
It is a subset of features extracted from the NewspaperWorks gem: the features we see as common requests from our clients.
It is guided by the use cases of NNP and other Hyku installations (e.g. British Library, Adventist, University of Tennessee Knoxville).
- Splitting a PDF into constituent pages, with a parent/child relationship.
- Returning parent works when children match the search criteria.
- IIIF Manifest includes parent/child relationships.
- Auto-assignment of parent/child relationship when splitting a PDF into constituent Pages.
- Text extraction, via tesseract, of text within an image.
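A toy, in-memory sketch of the parent/child idea in the list above: split a parent into pages, then answer a search with the parents whose pages match. The Page struct and both helper methods are hypothetical; IiifPrint works with Hyrax works and a search index, not Structs, and the "split" here is just an array of strings standing in for per-page OCR text.

```ruby
# Hypothetical in-memory record for one page split from a parent PDF.
Page = Struct.new(:id, :parent_id, :text)

# Pretend each string is the extracted text of one page of the parent.
# Every page is assigned its parent at the moment it is created.
def split_into_pages(parent_id, page_texts)
  page_texts.each_with_index.map do |text, index|
    Page.new("#{parent_id}-page-#{index + 1}", parent_id, text)
  end
end

# Return the parents (not the pages) whose child pages mention the term.
def parents_matching(pages, term)
  pages
    .select { |page| page.text.downcase.include?(term.downcase) }
    .map(&:parent_id)
    .uniq
end

pages = split_into_pages("pdf-1", ["Rodeo results", "Weather report"]) +
        split_into_pages("pdf-2", ["Crop prices"])
p parents_matching(pages, "rodeo") # ["pdf-1"]
```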
It does some of this by creating a new derivative service and registering that in the aforementioned Hyrax.config.derivative_services.
Finally, the actual thing I’m here to talk about!
🐮🤠 The DerivativeRodeo is a further decomposition of IiifPrint. 🤠🐮
In the future, IiifPrint will:
- depend on the DerivativeRodeo
- be renamed to something rodeo adjacent
- provide the parent/child relationship management
- provide the search/indexing behavior
First, we want to do the PDF splitting and text extracting in a distributed environment (e.g. AWS Lambdas).
And given that we’re generating some derivatives in AWS Lambda, we want to be able to generate other derivatives in that space.
We also want to have our Lambda functions use the same code as our Monolith (but we definitely don’t want the monolith loaded in a lambda).
In the previous diagram, the preprocess and import represent the AWS Lambdas and the Hyrax monolith. The primary idea is that each environment knows how, via the DerivativeRodeo, to find or create the requisite derivatives for the original file’s mime type.
At its core, the DerivativeRodeo orchestrates the following:
- Checking if something already exists “here”…
- Or fetching something when it exists “elsewhere” and putting it “here”…
- Or generating it “here”
What is “here”? It depends on the place where things are running.
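The three steps above can be sketched with local file paths standing in for “here” and “elsewhere”. This is not the DerivativeRodeo API; find_or_create_derivative and its keyword arguments are hypothetical, and a real “elsewhere” could just as well be remote storage rather than a path on disk.

```ruby
require "tmpdir"
require "fileutils"

# Hypothetical helper, not part of the DerivativeRodeo.
# here:      path where the derivative should end up
# elsewhere: path where a pre-built derivative may already exist (or nil)
# The block is called with `here` only when the derivative must be generated.
def find_or_create_derivative(here:, elsewhere: nil)
  # 1. It already exists "here".
  return :found_here if File.exist?(here)

  FileUtils.mkdir_p(File.dirname(here))

  # 2. It exists "elsewhere"; copy it "here".
  if elsewhere && File.exist?(elsewhere)
    FileUtils.cp(elsewhere, here)
    return :fetched
  end

  # 3. Generate it "here".
  yield here
  :generated
end

Dir.mktmpdir do |dir|
  here = File.join(dir, "here", "page-1.txt")
  elsewhere = File.join(dir, "elsewhere", "page-1.txt")
  generator = ->(path) { File.write(path, "generated text") }

  puts find_or_create_derivative(here: here, elsewhere: elsewhere, &generator) # generated
  puts find_or_create_derivative(here: here, elsewhere: elsewhere, &generator) # found_here
end
```

The expensive step (the block) only runs when both cheaper checks come up empty, which is the whole point when you already have a mixture of derivatives.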
Today’s existing Hyrax implementation does not handle the case where we already have some (or all) of the desired derivatives. And if you’re looking to rip apart PDFs, the processing within Hyrax is slow.
If you’re not minting new gems, what are you doing?
SoftServ is iterating on these concepts. The GitHub repositories I’m referencing are:
- samvera-labs/newspaper_works: Other community member foundational work
- scientist-softserv/nnp: Our first iteration
- scientist-softserv/iiif_print: Our second iteration
- scientist-softserv/derivative-rodeo: An exploratory repository (to archive)
- scientist-softserv/derivative_rodeo: The place to look
Rob and I have been playing a game of tennis, in which we write code to demonstrate our understanding of the problem.
We then respond to the code:
- with questions
- diagrams
- conversations
- refactoring
- proposing alternate approaches
We do much of this asynchronously so we can work within Rob’s particularly challenging calendar constraints.
In our synchronous conversations, we include another developer to ensure that we’re delivering the most accessible code.
The “dash” rodeo was one exploration through code extraction, naming, and working through the process flow. The “underscore” rodeo is a further distillation and simplification of the “dash” rodeo.
In our conversations we reviewed RubyGems’s “Name your gem” guide; the underscore is more idiomatic for Ruby.2
Hence we’re settling on derivative_rodeo.
I encourage you all to ask Rob how those naming conventions were established. He was there in the days of yore.
This is all in-progress work, with some of it running in production across different Hykus. Our plan is to get both derivative_rodeo and iiif_print into a suitable state for Samvera Labs and transfer them once they’ve stabilized.
These notes will become a blog post; just need to wrangle up some time to do that.
1 An idiomatic American slang phrase meaning “I’m prepared for what comes next.”
2 Personally, I like dashes, which are also a more universal word boundary with regard to search engines and assistive technology.
https://github.com/scientist-softserv/derivative_rodeo/blob/f8e2173fc907a2f24db37479679a4a84c840e00c/artifacts/derivative_rodeo-generator_storage_lifecycle.png