In a previous article I wrote about an experiment where I trained a neural network to play a card game. As a follow-up to this project, I figured it would be fun to see if I could get an LLM to play the game instead. This post is a write-up of how I arrived at a successful LLM-based implementation of the original card game.
Card Game
Briefly summarized, the game I am playing is called Palace, a simple but fun game where the objective is to get rid of all your cards by playing cards of equal or greater face value than your opponent's last card. Suits don’t matter, but 2s and 10s have special properties that allow them to be played on top of any card. In case you are interested in reading the original article, it can be found here.
One common strategy when playing the game is to preserve your high cards and focus on playing the card with the lowest possible face value first. This reduces the risk of not being able to make a play later when your opponent plays a high card. When following this strategy, picking the next card breaks down to picking the smallest card of equal or greater face value than the card that was just played by your opponent.
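To make that concrete, the deterministic version of the rule is only a few lines of Python (this helper is my own illustration, not code from the game):

```python
def pick_lowest_playable(hand: list[int], current: int) -> int | None:
    """Pick the smallest card value that is equal to or greater than the
    card just played; return None when no such card exists.
    (Special handling of 2s and 10s is left out for brevity.)"""
    candidates = [value for value in hand if value >= current]
    return min(candidates) if candidates else None


pick_lowest_playable([3, 7, 11, 14], 6)  # -> 7
```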
In the following section I will discuss how I integrated a local LLM (Llama 3.1 8B) with the game to implement the next-card selection algorithm. Any LLM will do, but I decided to go with Llama 3.1 8B since I needed something that would run on my local desktop and play nicely with my modest GPU and its 8GB of VRAM.
Picking Cards with LLMs
Since suits don’t matter in this game, we can reduce the problem space to picking numbers instead of cards. My view is that this simplifies the integration with the LLM since the LLM doesn’t even need to know that it’s participating in a card game. All the LLM needs to take care of is selecting the smallest valid numeric value from an array of numbers (i.e. values 2-14) that can then be mapped back to cards in the UI (e.g. 13 = King, 14 = Ace, etc.).
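For reference, the mapping between values and cards is the obvious one (the exact table in my UI code may differ slightly):

```python
# Assumed mapping between numeric values and card labels in the UI.
CARD_LABELS = {11: "Jack", 12: "Queen", 13: "King", 14: "Ace"}

def label(value: int) -> str:
    return CARD_LABELS.get(value, str(value))  # 2-10 map to themselves
```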
First Experiment - Prompting with RAG
My initial thought was that I would be able to implement this by relying on regular prompting and LLM tool calling. Tool calling would be used to enforce a schema on the generated response, which would be useful for integrating the LLM with the game programmatically. Note: In case you are unfamiliar with tool calling, you might want to check out one of my other articles here.
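The original tool definition isn't shown here, but an OpenAI-style function schema for this use case looks roughly like the sketch below; the play_card name and parameter layout are illustrative assumptions:

```python
# Hypothetical tool definition used to force the LLM to return a structured pick.
pick_card_tool = {
    "type": "function",
    "function": {
        "name": "play_card",
        "description": "Play the selected card value from the list of playable values.",
        "parameters": {
            "type": "object",
            "properties": {
                "value": {
                    "type": "integer",
                    "description": "The numeric value (2-14) of the card to play.",
                }
            },
            "required": ["value"],
        },
    },
}
```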
In the prompt I defined the array (e.g. [3, 4, 5, 6]) and the current “card” as a numerical value (e.g. 4), and asked the LLM to select the smallest possible valid value from the array. In addition to the numeric values from the UI, I enriched the prompt with fixed RAG content behind the scenes telling the LLM how to compare the numbers.
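Conceptually the prompt combined the dynamic game state with that fixed instruction block, roughly along these lines (the wording is illustrative, not the exact prompt):

```python
RULES_CONTEXT = (
    "You are selecting a number from a list. "
    "Pick the smallest number that is greater than or equal to the current number. "
    "Only pick numbers that exist in the list."
)

def build_selection_prompt(playable: list[int], current: int) -> str:
    # Fixed instructions plus the numbers coming from the game UI.
    return (
        f"{RULES_CONTEXT}\n\n"
        f"List of numbers: {playable}\n"
        f"Current number: {current}\n"
        "Respond by calling the play_card tool with the selected number."
    )
```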
My initial results were actually pretty good since the LLM generally did a good job picking the correct value and feeding it into the tool function. However, I noticed on multiple occasions that the LLM would struggle with the numerical comparisons and select the wrong number from the array. A common scenario was picking a higher value than necessary, which would result in playing a high card where you expected a low card. I also noticed cases where the LLM would pick values that were too low, or values that weren't even present in the playable list. The latter category of issues directly interferes with the game experience since it results in invalid plays from the LLM.
To improve on the results, I tried implementing some common “prompt engineering” techniques like few-shot prompting with multiple examples of correct selections. I also tried making the prompt very specific and even added a few self-check instructions.
While these techniques resulted in noticeable improvements, it was still very challenging to complete a full game without errors since the LLM would consistently make a few illegal plays during each game.
After multiple iterations I ended up with a very complex prompt in the tool function, as seen below. (See my other article for details on how tool functions are called by LLMs and integrated with the top-level prompt.)
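The listing below is a reconstruction of roughly what the final iteration looked like, with few-shot examples and self-check instructions baked in (the wording is representative rather than the exact prompt):

```python
SELECTION_PROMPT = """You must pick ONE number from the list of playable numbers.

Rules:
1. The number you pick must be greater than or equal to the current number.
2. Among all valid numbers, pick the SMALLEST one.
3. You may only pick a number that is present in the list.

Examples:
- List: [3, 5, 9], current: 4 -> pick 5
- List: [2, 6, 6, 13], current: 6 -> pick 6
- List: [7, 8, 14], current: 10 -> pick 14

Before answering, double-check that your number is in the list and that no
smaller number in the list also satisfies rule 1.

List: {playable}
Current number: {current}
"""
```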
The prompt started out simple but grew in complexity with every iteration. I suspect I could have improved the results further by continuing to work on the prompt, but I was struggling to get it to perform well enough to complete a full game.
Second Experiment - Fine-Tuning the LLM
After struggling with the initial prompt-based implementation for a while, I decided to shift gears and try something different. Instead of relying solely on prompting, I wanted to see if fine-tuning the LLM would be an appropriate solution here. My hope was that fine-tuning would improve the LLM's ability to compare numbers and prevent the incorrect selections.
There are a few different ways to fine-tune an LLM. The most extreme version is a full retraining of the entire network, which would be unrealistic for my personal project, even for the 8B version of Llama. A more realistic approach is to use LoRA or QLoRA, where only a small set of additional low-rank weights is trained while the original weights are kept frozen. LoRA and QLoRA are similar, the main difference being that QLoRA also quantizes the base model. A key benefit of this is that quantization reduces the memory footprint of the fine-tuning since the precision of the frozen weights is reduced to 4 bits.
Given my limited hardware I decided to go with QLoRA using the Unsloth framework.
Fine-tuning data
The first step is coming up with a dataset for fine-tuning. For this particular project I decided to go with the Alpaca format, which consists of an array of JSON objects with three properties: instruction, input and output.
See the example below as a reference:
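(The entry below is representative of the format rather than an exact sample from my dataset.)

```json
{
  "instruction": "Select the smallest number from the list that is greater than or equal to the current number.",
  "input": "List: [3, 6, 9, 12], Current number: 5",
  "output": "6"
}
```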
The instruction property tells the LLM how to compare the data defined in the input property. Finally, the output property defines the expected outcome. You can think of this as supervised fine-tuning since you are training on labeled pairs of defined input and expected output.
Below is the full fine-tuning script:
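(The listing is a representative sketch following Unsloth's standard QLoRA recipe for Llama 3.1 8B; the dataset path, hyperparameters and output names are illustrative.)

```python
import torch
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments

MAX_SEQ_LENGTH = 2048

# Load Llama 3.1 8B in 4-bit so it fits within 8GB of VRAM.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-bnb-4bit",
    max_seq_length=MAX_SEQ_LENGTH,
    load_in_4bit=True,
)

# Attach the LoRA adapters; only these low-rank matrices are trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=42,
)

# Alpaca-style prompt template used during training (and later at inference).
ALPACA_PROMPT = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token

def format_samples(examples):
    texts = []
    for instruction, input_, output in zip(
        examples["instruction"], examples["input"], examples["output"]
    ):
        texts.append(ALPACA_PROMPT.format(instruction, input_, output) + EOS_TOKEN)
    return {"text": texts}

# The 300-sample Alpaca dataset described above (file name is illustrative).
dataset = load_dataset("json", data_files="palace_dataset.json", split="train")
dataset = dataset.map(format_samples, batched=True)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=MAX_SEQ_LENGTH,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=3,
        learning_rate=2e-4,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=10,
        optim="adamw_8bit",
        output_dir="outputs",
    ),
)

trainer.train()

# Export a GGUF file that can be imported into Ollama.
model.save_pretrained_gguf("custom_llama_model", tokenizer, quantization_method="q4_k_m")
```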
The script above is based on the code from this excellent article from Hugging Face. I had to make a few tweaks to implement my own requirements, but conceptually the implementation is still very similar.
My local desktop is pretty modest in terms of resources (8GB VRAM), but fine-tuning with this dataset is still pretty fast (roughly 10 minutes end-to-end). I had to experiment with the size of the dataset, but in the end I landed on 300 samples in total.
Once the fine-tuning is done, how do we use the new model in the application?
The output of the fine-tuning process is a GGUF file that can be imported into Ollama by running the following command against the docker-hosted Ollama process:
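The Modelfile only needs a FROM line pointing at the exported GGUF (e.g. FROM ./custom_llama_model/model.gguf); the container name and paths below are assumptions:

```
docker exec -it ollama ollama create custom_llama_model -f /models/Modelfile
```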
Once the model is created in Ollama, the LLM can be loaded just like any other Ollama-supported LLM. See the code listing below as a reference for how to call the new model, named custom_llama_model.
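(A minimal sketch using Ollama's HTTP API; host, port and error handling are simplified.)

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama port

def generate(prompt: str) -> str:
    """Send a prompt to the fine-tuned model through Ollama's HTTP API."""
    response = requests.post(
        OLLAMA_URL,
        json={
            "model": "custom_llama_model",  # the model created from the GGUF above
            "prompt": prompt,
            "stream": False,
        },
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["response"]
```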
The prompt I am using when generating responses with the final fine-tuned model is identical to the Alpaca prompt used during fine-tuning, as seen below:
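(Same template as in the training sketch above, with the response section left empty for the model to complete; the ALPACA_PROMPT and generate helpers come from the earlier sketches, and the instruction/input wording is illustrative.)

```python
instruction = ("Select the smallest number from the list that is "
               "greater than or equal to the current number.")
game_input = "List: [3, 6, 9, 12], Current number: 5"

# The response section is left empty so the fine-tuned model completes it.
prompt = ALPACA_PROMPT.format(instruction, game_input, "")

print(generate(prompt))  # expected output: "6"
```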
For more information on how to run Ollama and Llama locally, please check out my other article here:
Conclusion
After fine-tuning the model, I started seeing really good results and was able to complete multiple games back-to-back without any “cheating” from the LLM. I have also added a pretty comprehensive test suite of play scenarios to further build confidence in the accuracy of the fine-tuned model.
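To give a flavor of what such a scenario test can look like, here is a sketch that checks the model's pick against the expected lowest valid value (the scenario data is illustrative, and the generate/ALPACA_PROMPT helpers come from the sketches above):

```python
import pytest

# Hypothetical play scenarios: (playable values, current value, expected pick).
SCENARIOS = [
    ([3, 6, 9, 12], 5, 6),
    ([2, 7, 7, 13], 7, 7),
    ([4, 5, 14], 11, 14),
]

@pytest.mark.parametrize("playable, current, expected", SCENARIOS)
def test_model_picks_lowest_valid_card(playable, current, expected):
    instruction = ("Select the smallest number from the list that is "
                   "greater than or equal to the current number.")
    game_input = f"List: {playable}, Current number: {current}"
    answer = generate(ALPACA_PROMPT.format(instruction, game_input, ""))
    assert int(answer.strip()) == expected
```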
My theory for why fine-tuning works in this scenario is that LLMs in general have a hard time comparing numeric values reliably, but by fine-tuning the LLM on a numeric dataset for a very specific use case, overall accuracy can improve greatly.
Screenshot of the game below:

Helpful References: