The original post: /r/localllama by /u/EmilPi on 2025-01-06 16:55:49.
My company rig is described in https://www.reddit.com/r/LocalLLaMA/comments/1gjovjm/4x_rtx_3090_threadripper_3970x_256_gb_ram_llm/
0: set up CUDA 12.x
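A quick sanity check that the driver and the 12.x toolkit are actually in place (standard install assumed):
# all four 3090s should show up here
nvidia-smi
# the compiler llama.cpp will build against; should report release 12.x
nvcc --version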
1: set up llama.cpp:
git clone https://github.com/ggerganov/llama.cpp/
cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_F16=ON
cmake --build build --config Release --parallel $(nproc)
Your llama.cpp build with the recently merged DeepSeek V3 support (https://github.com/ggerganov/llama.cpp/) is ready!
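A quick check that the CUDA build produced the binary used below (default build directory layout assumed):
# the server binary used in step 3 lands here
ls -lh ./build/bin/llama-server
# should print the commit/build the binary was compiled from
./build/bin/llama-server --version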
2: Now download the model:
cd ../
mkdir DeepSeek-V3-Q3_K_M
cd DeepSeek-V3-Q3_K_M
for i in {1..8} ; do wget "https://huggingface.co/bullerwins/DeepSeek-V3-GGUF/resolve/main/DeepSeek-V3-Q3_K_M/DeepSeek-V3-Q3_K_M-0000$i-of-00008.gguf?download=true" -O DeepSeek-V3-Q3_K_M-0000$i-of-00008.gguf ; done
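Before moving on, it is worth making sure all 8 shards actually arrived and are not truncated error pages:
# every shard should be present and tens of GB in size
ls -lh DeepSeek-V3-Q3_K_M-0000*-of-00008.gguf
If you prefer, the huggingface_hub CLI (if you have it installed) can fetch the same folder in one go:
huggingface-cli download bullerwins/DeepSeek-V3-GGUF --include "DeepSeek-V3-Q3_K_M/*" --local-dir .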
3: Now run it on localhost on port 1234:
cd ../
./llama.cpp/build/bin/llama-server --host localhost --port 1234 --model ./DeepSeek-V3-Q3_K_M/DeepSeek-V3-Q3_K_M-00001-of-00008.gguf --alias DeepSeek-V3-Q3-4k --temp 0.1 -ngl 15 --split-mode layer -ts 3,4,4,4 -c 4096 --numa distribute
Done!
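Before pointing any client at it, you can confirm the server is up (recent llama-server builds expose a /health endpoint and the OpenAI-style model list):
# returns a small JSON status once the model has finished loading
curl http://localhost:1234/health
# lists the alias the server advertises
curl http://localhost:1234/v1/models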
When you ask it something, e.g. using time curl ...:
time curl 'http://localhost:1234/v1/chat/completions' -X POST -H 'Content-Type: application/json' -d '{"model_name": "DeepSeek-V3-Q3-4k","messages":[{"role":"system","content":"You are an AI coding assistant. You explain as minimum as possible."},{"role":"user","content":"Write prime numbers from 1 to 100, no coding"}], "stream": false}'
you get output like
{"choices":[{"finish_reason":"stop","index":0,"message":{"content":"2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97.","role":"assistant"}}],"created":1736179690,"model":"DeepSeek-V3-Q3-4k","system_fingerprint":"b4418-b56f079e","object":"chat.completion","usage":{"completion_tokens":75,"prompt_tokens":29,"total_tokens":104},"id":"chatcmpl-gYypY7Ysa1ludwppicuojr1anMTUSFV2","timings":{"prompt_n":28,"prompt_ms":2382.742,"prompt_per_token_ms":85.09792857142858,"prompt_per_second":11.751167352571112,"predicted_n":75,"predicted_ms":19975.822,"predicted_per_token_ms":266.3442933333333,"predicted_per_second":3.754538862030308}}
real    0m22.387s
user    0m0.003s
sys     0m0.008s
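If you only want the model's reply rather than the whole JSON envelope, piping the same request through jq works (assuming jq is installed; the prompt is just the example above):
time curl -s 'http://localhost:1234/v1/chat/completions' -X POST -H 'Content-Type: application/json' -d '{"messages":[{"role":"user","content":"Write prime numbers from 1 to 100, no coding"}], "stream": false}' | jq -r '.choices[0].message.content'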
Alternatively, in journalctl -f you will see something like:
Jan 06 18:01:42 hostname llama-server[1753310]: slot release: id 0 | task 5720 | stop processing: n_past = 331, truncated = 0
Jan 06 18:01:42 hostname llama-server[1753310]: slot print_timing: id 0 | task 5720 |
Jan 06 18:01:42 hostname llama-server[1753310]: prompt eval time = 1292.85 ms / 12 tokens ( 107.74 ms per token, 9.28 tokens per second)
Jan 06 18:01:42 hostname llama-server[1753310]: eval time = 89758.14 ms / 318 tokens ( 282.26 ms per token, 3.54 tokens per second)
Jan 06 18:01:42 hostname llama-server[1753310]: total time = 91050.99 ms / 330 tokens
Jan 06 18:01:42 hostname llama-server[1753310]: srv update_slots: all slots are idle
Jan 06 18:01:42 hostname llama-server[1753310]: request: POST /v1/chat/completions 200 172.17.0.2
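The journalctl lines above suggest llama-server running as a systemd service; a minimal unit sketch, where the unit name, user and paths are assumptions to adapt to your own setup:
# /etc/systemd/system/llama-server.service (hypothetical name and paths)
[Unit]
Description=llama.cpp server with DeepSeek-V3-Q3_K_M
After=network.target

[Service]
User=youruser
WorkingDirectory=/home/youruser
ExecStart=/home/youruser/llama.cpp/build/bin/llama-server --host localhost --port 1234 --model /home/youruser/DeepSeek-V3-Q3_K_M/DeepSeek-V3-Q3_K_M-00001-of-00008.gguf --alias DeepSeek-V3-Q3-4k --temp 0.1 -ngl 15 --split-mode layer -ts 3,4,4,4 -c 4096 --numa distribute
Restart=on-failure

[Install]
WantedBy=multi-user.target
After systemctl daemon-reload and systemctl enable --now llama-server, journalctl -u llama-server -f shows output like the above.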
Good luck, fellow rig-builders!