
InstructLLaMa.cpp

Fast inference of instruct-tuned LLaMA on your personal devices.

Discord: https://discord.gg/peBU7yWa

Inference of the LLaMA model with instruct fine-tuning, using LoRA fine-tunable adapter layers.

Dev notes: we are switching away from our own C++ implementation of LLaMA to the more recent llama.cpp by @ggerganov, which now offers nearly the same performance (and output quality) on MacBook, as well as support for Linux and Windows.

Supported platforms: macOS, Linux, Windows (via CMake)

License: MIT

If you use the LLaMA weights, they may only be used for non-commercial research purposes.

Description & Usage

Here is a typical run, using the adapter weights published by tloen/alpaca-lora-7b under the MIT license:

make -j && ./main -m ./models/7B/ggml-model-q4_0.bin --instruction "Write an email to your friend about your plans for the weekend." -t 8 -n 128
make -j && ./main -m ./models/7B/ggml-model-q4_0.bin --instruction "Calculate the area of the a circle given its radius." --input "radius = 3" -t 8 -n 128

These follow Stanford Alpaca's instruction-prompt format (https://github.com/tatsu-lab/stanford_alpaca#data-release).
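For reference, the Alpaca format wraps the instruction (and the optional input, when one is given) in a fixed prompt template along these lines before it reaches the model:

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response: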

Setup

Here are the steps for the LLaMA-7B model (same as for llama.cpp); conversion defaults to the adapter weights published by tloen/alpaca-lora-7b under the MIT license:

# build this repo
git clone https://github.com/NolanoOrg/InstructLLaMa.cpp
cd InstructLLaMa.cpp
make

# obtain the original LLaMA model weights and place them in ./models
ls ./models
65B 30B 13B 7B tokenizer_checklist.chk tokenizer.model

# install Python dependencies
python3 -m pip install torch numpy sentencepiece transformers

# convert the 7B model to ggml FP16 format
python3 convert-pth-to-ggml.py models/7B/ 1

# quantize the model to 4-bits
./quantize.sh 7B

# run the inference
./main -m ./models/7B/ggml-model-q4_0.bin -t 8 -n 128 --instruction "<instruction>" --input "<input_to_instruction>"
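The quantize step packs each weight into 4 bits, keeping one floating-point scale per small block of values. A minimal Python sketch of block-wise 4-bit quantization, illustrative only and not the exact ggml q4_0 storage layout:

import numpy as np

def quantize_blockwise_4bit(w, block_size=32):
    # Illustrative block-wise 4-bit quantization (not the exact ggml q4_0 format).
    w = w.reshape(-1, block_size)
    # One scale per block, chosen so the largest-magnitude value maps into [-7, 7].
    amax = np.abs(w).max(axis=1, keepdims=True)
    scale = np.where(amax == 0, 1.0, amax / 7.0)  # avoid divide-by-zero on all-zero blocks
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover approximate weights; the rounding error per value is at most scale / 2.
    return q.astype(np.float32) * scale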

How this differs from the original llama.cpp:

  • convert-pth-to-ggml.py has been updated to download the LoRA adapter weights and merge them into the base model (see the sketch below).
  • utils.h and utils.cpp have been modified to build input prompts in the style of Alpaca.
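For context, LoRA represents the fine-tuning update to a weight matrix as a low-rank product, so folding an adapter into the base weights is a single matrix multiply. A minimal sketch of that merge, with illustrative names rather than the script's actual variables:

import numpy as np

def merge_lora(base_weight, lora_A, lora_B, lora_alpha, rank):
    # Fold a LoRA adapter into a base weight matrix: W' = W + (alpha / r) * B @ A.
    #   base_weight: (out_features, in_features)
    #   lora_A:      (rank, in_features)
    #   lora_B:      (out_features, rank)
    scaling = lora_alpha / rank
    return base_weight + scaling * (lora_B @ lora_A)

If the adapters are merged once at conversion time like this, inference afterwards is identical to running a plain LLaMA checkpoint.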
