This is a proposal for how to label AI-generated code so that it can be distinguished from human-authored code.
These annotations are useful for human-driven code review, giving reviewers more context on where the code comes from. They are also useful for AI and code applications: AIs can find places where generated code can be optimized, and future models can better distinguish generated code from human code.
Any file with annotated AI code must contain `@ai-generated` in a comment within the file. AI-generated sections are labelled with a preceding line and a following line, taking inspiration from projects like HackCodegen.
- The preceding line comment must begin with `BEGIN AI SECTION` and the following line comment must begin with `END AI SECTION`.
- The comment format uses the line comment format native to the given programming language. Shell languages and Python use `#` followed by a space, and C-style languages including JavaScript, Java, and Rust use `//` followed by a space.
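The marker rules above can be sketched in a few lines of Python. This is only an illustration: the `COMMENT_PREFIX` table and the `wrap_ai_section` helper are hypothetical names, and the only parts taken from the proposal are the marker text and the `#` / `//` prefixes.

```python
# Hypothetical lookup table; the proposal only names these language families.
COMMENT_PREFIX = {
    "python": "#",
    "shell": "#",
    "javascript": "//",
    "java": "//",
    "rust": "//",
}


def wrap_ai_section(code: str, language: str) -> str:
    """Surround `code` with BEGIN/END AI SECTION line comments."""
    prefix = COMMENT_PREFIX[language]
    begin = f"{prefix} BEGIN AI SECTION"  # prefix, one space, then the marker
    end = f"{prefix} END AI SECTION"
    return "\n".join([begin, code.rstrip("\n"), end])


print(wrap_ai_section("def add(a, b):\n    return a + b", "python"))
```

A tool that emits generated code could call a helper like this right before writing the snippet into a file.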
AI-generated sections can also contain metadata about how the code was generated. Common properties include the `model` used, such as `gpt-4`, and the `prompt`.
- Properties are cascading. Properties defined in the `@ai-generated` section apply to all code sections, unless overridden by properties defined in the `BEGIN AI SECTION` comment.
- Properties are key-value assignments in TOML format. This format allows for multi-line strings, shown in the examples here.
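One way to picture the cascading rule is as a dictionary merge in which section-level values win. The sketch below assumes the properties have already been parsed into plain dictionaries; `effective_properties` is a hypothetical helper, not part of the proposal.

```python
def effective_properties(file_props: dict, section_props: dict) -> dict:
    """Return file-level properties, overridden by section-level ones."""
    # Later keys win in a dict merge, so section values take precedence.
    return {**file_props, **section_props}


# File-level defaults, as they might appear next to @ai-generated.
file_props = {"model": "gpt-4", "prompt": "Write utility functions."}
# One section overrides only the prompt.
section_props = {"prompt": "Write a simple function to add two numbers."}

merged = effective_properties(file_props, section_props)
print(merged)
```

Here the section inherits `model` from the file level and replaces only `prompt`.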
Labeling AI-generated code helps in the code review process by providing more context on the source of the code. This information can be useful for both human reviewers and AI applications. Human reviewers can better understand the code's origin, while AI applications can use these labels to optimize generated code or improve the distinction between human and AI-generated code.
To label AI-generated code, you must include an `@ai-generated` comment within the file. You should also mark the beginning and end of the AI-generated section using line comments. For example:
# BEGIN AI SECTION
# @ai-generated model="gpt-4" prompt="Write a simple function to add two numbers."
def add(a, b):
    return a + b
# END AI SECTION
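A reviewer tool could locate such sections with a simple line scan. The following is a minimal sketch, assuming `#`-style comments and assuming that sections do not nest (the proposal does not say either way); `find_ai_sections` is a hypothetical name.

```python
def find_ai_sections(source: str, prefix: str = "#") -> list[tuple[int, int]]:
    """Return 1-indexed (begin_line, end_line) pairs for each AI section."""
    begin_marker = f"{prefix} BEGIN AI SECTION"
    end_marker = f"{prefix} END AI SECTION"
    sections = []
    start = None  # line number of the currently open BEGIN marker, if any
    for number, line in enumerate(source.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith(begin_marker):
            if start is not None:
                # Assumption: nested sections are treated as an error.
                raise ValueError(f"nested BEGIN AI SECTION at line {number}")
            start = number
        elif stripped.startswith(end_marker):
            if start is None:
                raise ValueError(f"END AI SECTION without BEGIN at line {number}")
            sections.append((start, number))
            start = None
    if start is not None:
        raise ValueError(f"unclosed AI section starting at line {start}")
    return sections


example = """\
# BEGIN AI SECTION
# @ai-generated model="gpt-4" prompt="Write a simple function to add two numbers."
def add(a, b):
    return a + b
# END AI SECTION
"""
print(find_ai_sections(example))
```

The returned line ranges could then be highlighted in a diff view or passed to another tool.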
Yes, you can include metadata related to the AI-generated code. Common properties to include are the `model` (e.g., `gpt-4`) and the `prompt` used to generate the code. These properties are defined using key-value assignments in TOML format.
TOML (Tom's Obvious, Minimal Language) is a simple and easy-to-read configuration file format. It uses key-value assignments and can handle multi-line strings. In the context of labeling AI-generated code, TOML is used to define properties such as `model` and `prompt`.
Properties defined in the `@ai-generated` section apply to all AI-generated code sections within the file, unless they are overridden by properties defined in the `BEGIN AI SECTION` comment. This means that if a property is set in the `@ai-generated` section, it will be used for all AI-generated code sections unless specifically overridden for a particular section.
This was the only sensible guide I could find on how to label AI-generated code on the whole interwebz, but seemingly it has not exactly taken off (yet?).. I could not find any usage of labels like this or even any other on github nor on reddit.. might be even a bit late now as i'm sure a lot of generated code has already entered the pool 🫠
Anyway, one important recommendation I would have with this is adding a date of generation because that usually is a one-off event and these models do develop in time so just putting the name of the model there is somewhat insufficient.
Are you aware of any other proposal or practice regarding this? Have you posted it on hacker.news? didn't find anything concerning the topic there either..