Tom Bailey (tombailey)

rain-1 / llama-home.md
Last active June 24, 2025 11:12
How to run Llama 13B with a 6GB graphics card

This worked on 14/May/23. The instructions will probably require updating in the future.

LLaMA is a text prediction model similar to GPT-2, or to the version of GPT-3 that has not yet been fine-tuned. It should also be possible to run fine-tuned versions with this (such as Alpaca or Vicuna, which are more focused on answering questions).

Note: I have been told that this does not support multiple GPUs. It can only use a single GPU.

It is now possible to run LLaMA 13B with a 6GB graphics card (e.g. an RTX 2060), thanks to the amazing work on llama.cpp. The latest change adds CUDA/cuBLAS support, which lets you pick an arbitrary number of transformer layers to run on the GPU. This is perfect for low VRAM (see the sketch after the setup step below).

  • Clone llama.cpp from git; I am on commit 08737ef720f0510c7ec2aa84d7f70c691073c35d.
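
As a rough sketch of the layer-offload idea: the following uses the llama-cpp-python bindings rather than the llama.cpp CLI the gist drives directly, and the model path and layer count are illustrative assumptions, not values from the gist.

# Minimal sketch, assuming llama-cpp-python is installed and was built
# with cuBLAS support (pip install llama-cpp-python).
from llama_cpp import Llama

# n_gpu_layers controls how many transformer layers are offloaded to the
# GPU; tune it down until the model fits in your 6GB of VRAM.
llm = Llama(model_path="models/13B/ggml-model-q4_0.bin", n_gpu_layers=18)

out = llm("Building a website can be done in 10 simple steps:", max_tokens=64)
print(out["choices"][0]["text"])
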
rain-1 / LLM.md
Last active July 27, 2025 05:20
LLM Introduction: Learn Language Models

Purpose

Bootstrap knowledge of LLMs ASAP, with a bias/focus toward GPT.

Avoid being a link dump; try to provide only valuable, well-tuned information.

Prelude

Neural network links before starting with transformers.

whjms / kobold-8bit.md
Last active April 7, 2023 16:35
Instructions for running KoboldAI in 8-bit mode

Running KoboldAI in 8-bit mode

tl;dr: use Linux, install bitsandbytes (either globally or in KAI's conda env), then add load_in_8bit=True, device_map="auto" to model pipeline creation calls.

Many people are unable to load models due to their GPU's limited VRAM. These models contain billions of parameters (model weights and biases), each of which is a 32 (or 16) bit float. Thanks to the hard work of some researchers [1], it's possible to run these models using 8-bit numbers, which halves the required amount of VRAM compared to running in half-precision. E.g. if a model requires 16GB of VRAM, running with 8-bit inference only requires 8GB.
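
As a minimal sketch of what the tl;dr above describes, assuming the standard transformers + bitsandbytes route (the model name is an illustrative assumption; KoboldAI's own loading code differs):

from transformers import AutoModelForCausalLM, AutoTokenizer

# Model name is an assumption for illustration; any causal LM works.
model_name = "KoboldAI/OPT-6.7B-Nerys-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,   # quantize weights to int8 via bitsandbytes
    device_map="auto",   # let accelerate place layers across GPU/CPU
)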

This guide was written for KoboldAI 1.19.1, and tested with Ubuntu 20.04. These instructions are based on work by Gmin in KoboldAI's Discord server, and Huggingface's efficient LM inference guide.

Requirements

moyix / killbutmakeitlooklikeanaccident.sh
Created February 5, 2022 22:51
Script to inject an exit(0) syscall into a running process. NB: only x86_64 for now!
#!/bin/bash
# Patch the instruction at RIP to 0x050f ("syscall", written little-endian), set
# rax=231 (exit_group) and rdi=0 (exit status), then continue so the process exits on its own.
gdb -p "$1" -batch -ex 'set {short}$rip = 0x050f' -ex 'set $rax=231' -ex 'set $rdi=0' -ex 'cont'
sts10 / rust-command-line-utilities.markdown
Last active July 25, 2025 10:55
A curated list of command-line utilities written in Rust

Note: I have moved this list to a proper repository. I'll leave this gist up, but it won't be updated. To submit an idea, open a PR on the repo.

Note that I have not tried all of these personally, and cannot and do not vouch for all of the tools listed here. In most cases, the descriptions here are copied directly from their code repos. Some may have been abandoned. Investigate before installing/using.

The ones I use regularly include: bat, dust, fd, fend, hyperfine, miniserve, ripgrep, just, cargo-audit and cargo-wipe.

  • atuin: "Magical shell history"
  • bandwhich: Terminal bandwidth utilization tool

Estimation

This document is an attempt to pin down all the things you don't think about when quoting for a project, and hopefully to provide a starting point for a framework that makes quoting, working on, and delivering small-to-medium jobs more predictable and less stressful.

Contents

kinoc / j6b_train_hf_ds.py
Last active September 17, 2024 18:53
So now you want to finetune that GPT-J-6B on a 3090/TITAN GPU ... okay, using HF and DeepSpeed too
# So now you want to finetune that GPT-J-6B on a 3090/TITAN GPU ... okay
# More exploratory coding. It uses the Huggingface model port and DeepSpeed, and reads all text/md files from a target directory
# It is a fragment of a larger system with remote editing, but that's another story
# This is the raw, training tester. Items to look out for:
# - uses DeepSpeed and has a DS config
# - to save space uses SGD instead of ADAM
# - uses gradient checkpointing
# - freezes 25% of the layers to fit
# Assumes you can already run https://gist.github.com/kinoc/2d636a68876cd3de7b6e9c9452b61089
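
The memory-saving tricks the comments above list, as a minimal standalone sketch (model name, freeze fraction, and learning rate are illustrative assumptions; the gist's full script wires this into DeepSpeed):

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")
model.gradient_checkpointing_enable()  # recompute activations instead of storing them

# Freeze the first 25% of transformer blocks so they need no gradients
# or optimizer state.
blocks = model.transformer.h
for block in blocks[: len(blocks) // 4]:
    for p in block.parameters():
        p.requires_grad = False

# SGD (without momentum) stores no per-parameter state, unlike Adam's
# two moment buffers, which saves several GB at 6B parameters.
optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad), lr=5e-5
)
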
CurtisASmith / gptj-finetune-guide.md
Created July 12, 2021 12:32
Unfinished guide to fine-tuning GPT-J

How to Fine Tune GPT-J - The Basics

Before anything else, you'll likely want to apply for access to the TPU Research Cloud (TRC). Combined with a Google Cloud free trial, that should allow you to do everything here for free. Once you're in TRC, create a project, then fill out the form that was emailed to you with the name of the new project. Use create_tfrecords.py from the GPT-NEO repo to prepare your data as tfrecords; I might do a separate guide on that. You might also want to fork the mesh-transformer-jax repo to make it easier to add and modify the config files.

  1. Install the Google Cloud SDK. We'll need it later.

  2. If you haven't yet made a project and activated TPU access through TRC (or if you plan on paying out of pocket), make one now.

  3. TPUs use Google Cloud buckets for storage, so go ahead and create one.

kinoc / jserv_hf_fast.py
Created June 21, 2021 10:54
Run HuggingFace converted GPT-J-6B checkpoint using FastAPI and Ngrok on local GPU (3090 or Titan)
# So you want to run GPT-J-6B using HuggingFace+FastAPI on a local rig (3090 or TITAN) ... tricky.
# special help from the Kolob Colab server https://colab.research.google.com/drive/1VFh5DOkCJjWIrQ6eB82lxGKKPgXmsO5D?usp=sharing#scrollTo=iCHgJvfL4alW
# Conversion to HF format (12.6GB tar image) found at https://drive.google.com/u/0/uc?id=1NXP75l1Xa5s9K18yf3qLoZcR6p4Wced1&export=download
# Uses GDOWN to get the image
# You will need 26 GB of space, 12+GB for the tar and 12+GB expanded (you can nuke the tar after expansion)
# Near Simplest Language model API, with room to expand!
# runs GPT-J-6B on a 3090 or TITAN and serves it using FastAPI
# change "seq" (which is the context size) to adjust footprint
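
A minimal sketch of the serving pattern the comments describe (the endpoint shape and request schema are my assumptions; the gist's full script adds Ngrok tunneling and more generation options):

import torch
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B", torch_dtype=torch.float16
).to("cuda")

app = FastAPI()

class Prompt(BaseModel):
    text: str
    max_new_tokens: int = 64

@app.post("/generate")
def generate(req: Prompt):
    # Tokenize the prompt, sample a continuation, and return the decoded text.
    ids = tokenizer(req.text, return_tensors="pt").input_ids.to("cuda")
    out = model.generate(ids, max_new_tokens=req.max_new_tokens, do_sample=True)
    return {"completion": tokenizer.decode(out[0], skip_special_tokens=True)}

Launch with uvicorn jserv_hf_fast:app --port 8000, then POST JSON like {"text": "..."} to /generate.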