
@rpdelaney
Created April 17, 2025 19:04
$ cat reed/content.py
"""Provide content handling."""

import io

import deal
import readability
from bs4 import BeautifulSoup as Bs
from pdftotext import PDF


class ContentRenderer:
    """Provide rendering for various document types."""

    @deal.pure
    def __init__(self, data: bytes) -> None:
        """Initialize a renderer."""
        self._data: bytes = data

    @property
    @deal.pure
    def data(self) -> bytes:
        """Provide read-only access to the data."""
        return self._data

    @deal.pure
    def render_text(self) -> str:
        """Render text from... text."""
        return self._data.decode(encoding="utf-8", errors="replace")

    @deal.pure
    def render_html(self, *, make_readable: bool = True) -> str:
        """Render text from HTML.

        If make_readable is True, try to extract the text content with
        python-readability.
        """
        if not make_readable:
            soup = Bs(self._data, "html.parser")
            return soup.get_text()

        readable_doc = readability.parse(self.render_text())
        title: str = readable_doc.title
        summary: str = readable_doc.text_content
        return f"{title}\n{summary}"

    @deal.pure
    def render_pdf(self) -> str:
        """Render text from PDF."""
        return "\n\n".join(PDF(io.BytesIO(self._data)))
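
The decoding step in `render_text` can be tried on its own with the standard library. This is a minimal sketch, not part of `reed/content.py`: the helper name `decode_lossy` and the sample bytes are hypothetical, chosen only for illustration. It shows that `errors="replace"` substitutes the Unicode replacement character U+FFFD for bytes that are not valid UTF-8.

```python
def decode_lossy(data: bytes) -> str:
    """Decode bytes as UTF-8 the same way render_text does.

    Hypothetical helper for illustration; the call itself mirrors the source.
    """
    return data.decode(encoding="utf-8", errors="replace")


# Hypothetical sample: valid UTF-8 for "café", a space, then a stray 0xFF byte.
raw = b"caf\xc3\xa9 \xff"
text = decode_lossy(raw)

# The invalid byte is replaced with U+FFFD rather than raising UnicodeDecodeError.
print(repr(text))
```

Because the replacement is lossy, the original bytes cannot be recovered from the decoded string, which may matter if the input is not actually UTF-8.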

$ cat reed/content.py | llm prompt "Explain what this code does." -m dsa
<think>
Alright, I'm looking at the user's Python code and trying to understand how it works step by step.

First, they've imported several modules like io, deal (which seems to be a custom or maybe an experimental library), bs4, readability, and PDF from pdftotext. So there are a few things here: some standard libraries, possibly a different data handling library deal, the bs4 for HTML parsing, and pdftotext which likely converts PDFs to text.

Then they define a class called ContentRenderer with several @pure methods because it's supposed to handle rendering various document types. The main __init__ method takes bytes as input and assigns them to self._data.

Looking at the data property, it returns the encoded bytes, so no issues there—it just provides read-only access since it's pure (no side effects).

The render_text method converts the bytes to UTF-8 with errors replacement and decodes back. That makes sense for handling different byte encodings commonly used in URLs.

Next is render_html which first checks if make_readable is True. If so, it uses bs4 to parse HTML as a parser. It gets the text using get_text(), then extracts title and summary from readability. Combines them with a newline. Interesting approach—trying readability first might be better but seems they're combining both for more comprehensive results.

If make_readable is False, they read the data as HTML and return its text content. That makes sense for basic parsing without extra processing.

Then there's render_pdf which uses PDF from pdftotext to convert the bytes into a string and joins it with newlines. Seems straightforward but maybe could be optimized or have better error handling since PDF conversion can be tricky.

The code also includes some markdown examples at the end, showing how each method would generate different content types like headings, paragraphs, etc. This is helpful for users to see potential use cases.

Putting it all together, the ContentRenderer class seems designed to handle various document types by implementing methods that parse or render HTML, PDFs, and possibly basic Markdown text from bytes directly. The code uses a mix of standard libraries, custom libraries (deal), and third-party tools like bs4 and pdftotext, which probably offers some API for converting different formats.

Potential areas to consider: error handling in rendering methods, especially with make_readable—what happens if the readability module fails? Also, how efficient is the PDF conversion using io.BytesIO? Since it's directly reading from the bytes without processing, maybe there are better ways or more robust methods available in pdftotext.

Another thing is that the code doesn't handle different document structures well. For example, a multi-page PDF might not be parsed correctly if each page isn't read properly. But since the user provided it and included examples at the end, perhaps they're aware of limitations and expect users to have additional checks or error handling in their usage.

Overall, the code seems functional for rendering different document types but could benefit from more robust testing, proper error handling in reading methods, and maybe better documentation on how each specific feature works.
</think>

The provided code is designed to handle various document types by implementing a `ContentRenderer` class that can work with bytes (like text files or PDFs) and convert them into readable HTML or Markdown content. Here's a step-by-step explanation of what the code does:

1. **Initialization (`__init__` method):**
   - Takes in raw bytes (`self._data`) which could be from a file, PDF, or any other source.
   - Provides read-only access to the data through the `data` property.

2. **Data Access and Conversion:**
   - The `_data` bytes are decoded into UTF-8 using `decode()` and errors replaced with spaces (`errors="replace"`).
   - Converts the raw bytes back to bytes for further processing if needed.

3. **HTML Rendering (`render_text` method):**
   - Reads the bytes as UTF-8.
   - Uses `BS` (bs4) to parse HTML content, which is then converted to text using readability' `get_text()`.
   - Extracts and combines title and summary information from readability for a polished output.

4. **PDF Rendering (`render_pdf` method):**
   - Directly converts the bytes into Markdown or HTML-like strings by joining PDF objects with newlines.
   - Useful for converting PDFs to readable text format without additional processing.

5. **Markdown Examples:**
   - Includes examples of how each rendering method can produce different content:
     - Headings (`h1`, `h2`)
     - Paragraph Content
     - Summary Text

The code also includes markdown examples showing the output from each method, which helps users understand potential use cases and outputs.

Potential improvements or considerations:

- **Error Handling:** The code doesn't handle errors during parsing (like exceptions when reading PDFs). Better error handling would be needed in such scenarios.
- **Documentation:** While the code is functional, it could benefit from better documentation explaining how each method works and examples of usage.
- **Performance:** Converting bytes directly might not handle large files efficiently. Additional processing or streaming methods could improve performance for big data.
- **Third-party Libraries:** The use of `pdftotext` suggests that the code might be part of a specific library, so better integration with official documentation is recommended.

Overall, this code demonstrates how different document types can be processed 