LLM fine-tuning with MacBook Pro

Matthew Leung
2 min read · Jan 12, 2024


For a long time, ML training and inference could only be done on Nvidia GPUs. The game has changed with the release of the ML framework “MLX”, which enables people to run ML training and inference on Apple Silicon CPUs/GPUs. So I tried LLM fine-tuning on my MacBook Pro.

The MLX GitHub repository comes with many examples, including LLM inference and LoRA fine-tuning. I followed the example but used a different dataset.

1. Download a public LLM model. I chose Mistral-7B.
curl -O https://files.mistral-7b-v0-1.mistral.ai/mistral-7B-v0.1.tar
tar -xf mistral-7B-v0.1.tar

2. Convert the model into MLX format. I used the -q option to apply 4-bit quantization, which makes model inference faster.

python convert.py \
--torch-path <path_to_torch_model> \
--mlx-path <path_to_mlx_model> \
-q
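
With the directory extracted in step 1 and the output name that the training command below uses, the concrete call would look roughly like this (the exact paths are an assumption, not part of the MLX docs):

python convert.py \
--torch-path mistral-7B-v0.1 \
--mlx-path mlx_model \
-q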

3. Prepare the data. I downloaded a public fake-news-detection dataset from Kaggle, which contains the title, author, and content of each news article, plus a label of 0 or 1 indicating whether the article is reliable or not.
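
Before formatting anything, it helps to load the CSV and check the label distribution. The sketch below assumes the Kaggle file is train.csv with title, author, text, and label columns, which is how the “Fake News” competition data is laid out.

import pandas as pd

# Load the Kaggle fake-news CSV (assumed columns: id, title, author, text, label).
df = pd.read_csv("train.csv")

# Drop rows with a missing title or body so every prompt can be built later.
df = df.dropna(subset=["title", "text"])

# In this dataset, 1 marks an unreliable article and 0 a reliable one.
print(df["label"].value_counts())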

4. I incorporated the data into the LLM prompt and formatted it into the JSON format that MLX LoRA fine-tuning expects.

prompt = "Please determine whether the following article is correct or not.  If it is correct, please answer: 0.  Otherwise, please answer: 1.\n"
text = f"{prompt}The title is: {title}\nThe article is: {content}\nThe answer is: {label}"
output = {"text": text }
json.dumps(output)

Because my MacBook Pro has only 16 GB of memory with an M2 Pro chip, I truncated each article's content to 1,600 words and 20,480 characters.
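
Putting steps 3 and 4 together, the whole preparation can be a short script like the sketch below. The truncation limits are the ones just mentioned; the data/fake_news/train.jsonl and valid.jsonl file names follow the layout the MLX LoRA example expects for its --data directory, and the 90/10 split is my own choice.

import json
import os

import pandas as pd

PROMPT = "Please determine whether the following article is correct or not.  If it is correct, please answer: 0.  Otherwise, please answer: 1.\n"

def truncate(content, max_words=1600, max_chars=20480):
    # Keep at most 1,600 words and 20,480 characters so one sample fits in 16 GB of RAM.
    return " ".join(str(content).split()[:max_words])[:max_chars]

def to_record(row):
    # Same prompt template as above, serialized as one JSON object per line.
    text = (f"{PROMPT}The title is: {row['title']}\n"
            f"The article is: {truncate(row['text'])}\n"
            f"The answer is: {row['label']}")
    return json.dumps({"text": text})

df = pd.read_csv("train.csv").dropna(subset=["title", "text"])
split = int(len(df) * 0.9)  # simple 90/10 train/validation split

os.makedirs("data/fake_news", exist_ok=True)
with open("data/fake_news/train.jsonl", "w") as f:
    f.write("\n".join(to_record(row) for _, row in df.iloc[:split].iterrows()) + "\n")
with open("data/fake_news/valid.jsonl", "w") as f:
    f.write("\n".join(to_record(row) for _, row in df.iloc[split:].iterrows()) + "\n")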

5. I ran the LoRA fine-tuning locally on the MacBook Pro described above. Because the hardware is limited, I used a small number of LoRA layers (4) and a small number of iterations (200).

python lora.py --model mlx_model --data data/fake_news --train --batch-size 1 --lora-layers 4 --iters 200

The model has 1,242.763M total parameters, of which 0.426M are trainable. Training throughput was about 140 tokens/sec and 0.164 iterations/sec (around 2 hours in total).

6. I ran inference with the fine-tuned model on 100 unseen news articles. The input followed the same prompt format as the training data, and the fine-tuned model output the answer 0 or 1.
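
To score such a run, each held-out article can be pushed through the example's generation mode and the first 0/1 in the completion compared to the true label. The sketch below shells out to lora.py; the --adapter-file, --prompt, --max-tokens, and --temp flags are taken from the mlx-examples LoRA script, so check python lora.py --help for the exact names in your version, and the choice of the last 100 rows as the test set is only illustrative.

import subprocess

import pandas as pd

PROMPT = "Please determine whether the following article is correct or not.  If it is correct, please answer: 0.  Otherwise, please answer: 1.\n"

def predict(title, content):
    # Build the training prompt without the answer and let the fine-tuned model complete it.
    prompt = f"{PROMPT}The title is: {title}\nThe article is: {content}\nThe answer is: "
    out = subprocess.run(
        ["python", "lora.py", "--model", "mlx_model",
         "--adapter-file", "adapters.npz",
         "--prompt", prompt, "--max-tokens", "3", "--temp", "0.0"],
        capture_output=True, text=True).stdout
    # Keep only the first 0/1 that appears after the answer marker.
    tail = out.split("The answer is:")[-1]
    return next((ch for ch in tail if ch in "01"), "")

# 100 unseen articles; here simply the last 100 rows of the Kaggle CSV.
df = pd.read_csv("train.csv").dropna(subset=["title", "text"]).tail(100)
correct = sum(predict(r["title"], r["text"][:20480]) == str(r["label"])
              for _, r in df.iterrows())
print(f"accuracy = {correct / len(df):.0%}")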

The fine-tuned model correctly determined whether the article was reliable in 60 out of 100 cases (accuracy = 60%).

For comparison, the same model (Mistral-7B) without fine-tuning got only 52 of the same 100 cases right (accuracy = 52%).

Of course, 100 test cases are not enough to draw any firm conclusion. Still, this toy example shows that we can play with small LLMs on a personal notebook computer.
