AIBOX-K3 LLM large model usage

799959745 · Posted at 5/26/2026 15:27:27

Last edited by 799959745 In 5/26/2026 15:34 Editor

Step 1:
Download the Bianbu firmware and flash it to the AIBOX-K3.
Firmware flashing tutorial link: https://wiki.t-firefly.com/en/AI ... tmode_spacemit.html

Step Two:
2.1 Connect the machine to an HDMI monitor. Open the terminal. Install dependencies. The inference tool llama.cpp-tools-spacemit needs to be installed.

sudo apt-get update
# Install the accelerated version of llama.cpp inference tool from Spacemit
sudo apt install llama.cpp-tools-spacemit

Copy the code

2.2. Downloading the Model
When using llama.cpp, a GGUF format model is required. It is recommended to download the model to the default path ~/.cache/models/llm for easy management. For quick verification, you can download the model from the Spacemit server.

Model source:
Spacemit mirror (recommended): https://archive.spacemit.com/spacemit-ai/model_zoo/llm/
Multiple pre-installed GGUF models (such as Qwen2.5, Qwen3, Deepseek, etc.) can be downloaded directly to the default directory:

mkdir -p ~/.cache/models/llm
cd ~/.cache/models/llm
# Example: Download the qwen2.5 0.5b model
https://archive.spacemit.com/spacemit-ai/model_zoo/llm/qwen2.5-0.5b-instruct-q4_0.gguf
# Example: Download the qwen2.5 1.5b model
https://archive.spacemit.com/spacemit-ai/model_zoo/llm/qwen2.5-1.5b-instruct-q4_0.gguf
# Example: Download the glm-edge 1.5b model
wget https://archive.spacemit.com/spacemit-ai/model_zoo/llm/glm-edge-1.5b-chat-q4_0.gguf
# Example: Download the qwen2.5 3b model
wget https://archive.spacemit.com/spacemit-ai/model_zoo/llm/qwen2.5-3b-instruct-q4_0.gguf
# Example: Download the qwen3-30B-A3B model
https://archive.spacemit.com/spacemit-ai/model_zoo/llm/Qwen3-30B-A3B-Q4_0.gguf

Copy the code

Hugging Face: Search for GGUF models on Hugging Face and download the .gguf file to ~/.cache/models/llm or a custom path.

2.3. Testing the Model

Verifying the Dialogue

llama-cli -m ~/.cache/models/llm/qwen2.5-0.5b-instruct-q4_0.gguf -t 8 -p "Hello, please introduce yourself."

Copy the code

If you want real-time dialogue in the terminal, you can remove the -p parameter. The command is as follows:

llama-cli -m ~/.cache/models/llm/qwen2.5-0.5b-instruct-q4_0.gguf -t 8

Copy the code

Browser-based web interface dialogue

llama-server -m ~/.cache/models/llm/qwen2.5-0.5b-instruct-q4_0.gguf -t 8 --port 8080 &

Copy the code

Open the Chromium browser. Enter the URL.

http://127.0.0.1:8080

Copy the code

Performance Verification
Performance testing is performed in the SDK root directory using Qwen3-30B-A3B-Instruct-2507-04_0.gguf.

llama-bench -m ~/.cache/models/llm/Qwen3-30B-A3B-Instruct-2507-04_0.gguf -t 8 -p 128 -n 128 -mmp 0 -fa 1

Copy the code

Note:
Running the 30B large model requires 32GB of RAM. If you encounter insufficient memory, try disabling desktop display to reduce memory usage. Connect to a serial terminal to execute commands.

(The command to close the terminal is generally not needed with a large memory configuration.)

systemctl stop sddm.service

Copy the code