@zurawiki
Last active October 9, 2024 17:09
A spec for AI generated code. Examples showcased in JS/TSX, Justfiles, Markdown, Python and Rust

Readme Example

This is a proposal for labeling AI-generated code so that it can be distinguished from human-authored code.

These annotations are useful for human-driven code review, giving reviewers more context on where the code comes from. They are also useful for AI and code applications: AIs can find places where generated code can be optimized, and future models can better distinguish generated code from human code.

Core Spec

Any file with annotated AI code must contain @ai-generated in a comment within the file. AI-generated sections are delimited by a preceding line and a following line, taking inspiration from projects like HackCodegen.

  • The preceding line comment must begin with BEGIN AI SECTION and the following line comment must begin with END AI SECTION.

  • The comments use the line comment format native to the given programming language. Shell languages and Python use # followed by a space, and C-style languages including JavaScript, Java, and Rust use // followed by a space.
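As a rough illustration of the rules above, the following Python sketch scans a source string for sections delimited by BEGIN AI SECTION and END AI SECTION line comments. The names AISection, find_ai_sections, and has_file_marker are hypothetical and not part of this proposal, and the sketch assumes sections are not nested and that each marker sits on its own comment line.

```python
from dataclasses import dataclass

BEGIN = "BEGIN AI SECTION"
END = "END AI SECTION"


@dataclass
class AISection:
    start_line: int  # 1-based line number of the BEGIN comment
    end_line: int    # 1-based line number of the END comment
    header: str      # text following BEGIN AI SECTION on the same line
    body: list[str]  # lines between the two markers, unmodified


def has_file_marker(source: str) -> bool:
    """The spec requires @ai-generated somewhere in a comment in the file."""
    return "@ai-generated" in source


def find_ai_sections(source: str, comment_prefix: str = "# ") -> list[AISection]:
    """Collect regions delimited by BEGIN/END AI SECTION line comments."""
    sections: list[AISection] = []
    current = None
    for number, line in enumerate(source.splitlines(), start=1):
        stripped = line.strip()
        if not stripped.startswith(comment_prefix):
            # Ordinary code line: keep it only if we are inside a section.
            if current is not None:
                current.body.append(line)
            continue
        comment = stripped[len(comment_prefix):]
        if comment.startswith(BEGIN):
            if current is not None:
                raise ValueError(f"line {number}: nested BEGIN AI SECTION")
            current = AISection(number, -1, comment[len(BEGIN):].strip(), [])
        elif comment.startswith(END):
            if current is None:
                raise ValueError(f"line {number}: END without BEGIN")
            current.end_line = number
            sections.append(current)
            current = None
        elif current is not None:
            # A comment inside a section (for example a metadata line).
            current.body.append(line)
    if current is not None:
        raise ValueError("unterminated AI section")
    return sections


sample = "\n".join([
    '# @ai-generated model = "gpt-4"',
    "# BEGIN AI SECTION",
    '# prompt = "Write an add function"',
    "def add(a, b):",
    "    return a + b",
    "# END AI SECTION",
])
sections = find_ai_sections(sample)
```

For a C-style language the same function could be called with comment_prefix="// ". Block comments, as used in the TSX example, are not handled by this sketch.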

Metadata

AI-generated sections can also contain metadata about how the code was generated. Common properties include the model used, such as gpt-4, and the prompt.

  • Properties are cascading. Properties defined in the @ai-generated section apply to all code sections, unless overridden by properties defined in the BEGIN AI SECTION comment.

  • Properties are key-value assignments in TOML format. This format allows for multi-line strings, as shown in the examples here.

FAQ

What is the purpose of labeling AI-generated code?

Labeling AI-generated code helps in the code review process by providing more context on the source of the code. This information can be useful for both human reviewers and AI applications. Human reviewers can better understand the code's origin, while AI applications can use these labels to optimize generated code or improve the distinction between human and AI-generated code.

How do I label AI-generated code?

To label AI-generated code, you must include an @ai-generated comment within the file. You should also mark the beginning and end of the AI-generated section using line comments. For example:

# BEGIN AI SECTION
# @ai-generated model="gpt-4" prompt="Write a simple function to add two numbers."
def add(a, b):
    return a + b
# END AI SECTION

Can I include metadata in the AI-generated code labels?

Yes, you can include metadata related to the AI-generated code. Common properties to include are the model (e.g., gpt-4) and the prompt used to generate the code. These properties are defined using key-value assignments in TOML format.

What is the TOML format?

TOML (Tom's Obvious, Minimal Language) is a simple and easy-to-read configuration file format. It uses key-value assignments and can handle multi-line strings. In the context of labeling AI-generated code, TOML is used to define properties such as model and prompt.

How do properties cascade in AI-generated code labels?

Properties defined in the @ai-generated section apply to all AI-generated code sections within the file, unless they are overridden by properties defined in the BEGIN AI SECTION comment. This means that if a property is set in the @ai-generated section, it will be used for all AI-generated code sections unless specifically overridden for a particular section.

/**
 * This file contains AI generated code. Generated code exists
 * between the BEGIN AI SECTION and END AI SECTION designators.
 *
 * @ai-generated model=openai.gpt-4
 */
import React from "react";

interface Person {
  name: string;
  url: string;
}

export function Avatar({ person }: { person: Person }): React.ReactNode {
  /* BEGIN AI SECTION model = "gpt-3.5-turbo"
prompt = """
You are an expert programmer.
Write a React Component function body that renders a data table.
"""
*/
  return <img className="avatar" src={person.url} alt={person.name} />;
  /* END AI SECTION */
}
# This file contains AI generated code. Generated code exists
# between the BEGIN AI SECTION and END AI SECTION designators.
#
# @partially-generated model=openai.gpt-4
# more code above...
# BEGIN AI SECTION
# prompt = "Write an add function"
def add(a, b):
    return a + b
# END AI SECTION
# more code below...
# This file contains AI generated code. Generated code exists
# between the BEGIN AI SECTION and END AI SECTION designators.
#
# @ai-generated model=openai.gpt-4
# prompt = """
# You are an expert programmer. Write a Justfile goal that runs a linter in fix mode.
# """
# set positional-arguments
# set dotenv-load := true
help:
    @just --list --unsorted
# BEGIN AI SECTION
fix:
    poetry run ruff check --fix .
alias f := fix
# END AI SECTION
# This file contains AI generated code. Generated code exists
# between the BEGIN AI SECTION and END AI SECTION designators.
#
# @partially-generated model=openai.gpt-4
[tool.poetry]
name = "middle-man"
version = "0.1.0"
# BEGIN AI SECTION
# prompt = 'Write an project description in toml of the form: description = "..."'
description = "Python project that foos the bar"
# END AI SECTION
authors = ["Roger Zurawicki <[email protected]>"]
readme = "README.md"
packages = [{include = "middle_man"}]
[tool.poetry.dependencies]
python = "3.10.9"
flask = "2.2.2"
// This file contains AI generated code. Generated code exists
// between the BEGIN AI SECTION and END AI SECTION designators.
//
// @partially-generated model=openai.gpt-4
pub(crate) trait SplitPrefixInclusive {
    fn split_prefix_inclusive<'a>(&'a self, prefix: &str) -> Vec<&'a str>;
}

impl SplitPrefixInclusive for str {
    /// Split string by prefix, including the prefix in the result.
    fn split_prefix_inclusive<'a>(&'a self, prefix: &str) -> Vec<&'a str> {
        let matches = self.match_indices(prefix).map(|(idx, _)| idx);
        let mut start = 0;
        let mut substrings = Vec::new();
        for idx in matches {
            if idx != start {
                substrings.push(&self[start..idx]);
                start = idx;
            }
        }
        substrings.push(&self[start..]);
        substrings
    }
}

// BEGIN AI SECTION prompt = "Write test cases for the module above"
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_split_prefix_inclusive() {
        let string = include_str!("../tests/data/example_1.diff");
        let pattern = "diff --git ";
        assert_eq!(string.split_prefix_inclusive(pattern).len(), 5);
    }
}
// END AI SECTION
@eMPee584

This was the only sensible guide I could find on how to label AI-generated code on the whole interwebz, but seemingly it has not exactly taken off (yet?).. I could not find any usage of labels like this or even any other on github nor on reddit.. might be even a bit late now as i'm sure a lot of generated code has already entered the pool 🫠
Anyway, one important recommendation I would have with this is adding a date of generation because that usually is a one-off event and these models do develop in time so just putting the name of the model there is somewhat insufficient.
Are you aware of any other proposal or practice regarding this? Have you posted it on hacker.news? didn't find anything concerning the topic there either..

@cktang88
cktang88 commented Oct 9, 2024

This isn't super useful because line author origins are not so clear cut - they may be human generated + ai fixed, or ai generated + human fixed. Additionally, moving and refactoring lines is a hassle with this convention.
