
@awni
Last active April 23, 2025 20:30
Run DeepSeek R1 or V3 with MLX Distributed

Setup

On every machine in the cluster install openmpi and mlx-lm:

conda install conda-forge::openmpi
pip install -U mlx-lm

Next, download the pipeline-parallel run script to the same path on every machine:

curl -O https://raw.githubusercontent.com/ml-explore/mlx-examples/refs/heads/main/llms/mlx_lm/examples/pipeline_generate.py

Make a hosts.json file on the machine you plan to launch the generation from. For two machines it should look like this:

[
  {"ssh": "hostname1"},
  {"ssh": "hostname2"}
]

Also make sure you can ssh from every machine to every other machine using the hostnames in hosts.json. Check out the MLX documentation for more information on setting up and testing MPI.
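
To sanity check that MPI is working before downloading a large model, you can launch a small script (call it check.py; a minimal sketch using the mx.distributed API) the same way as the generation script below:

import mlx.core as mx

group = mx.distributed.init()
# Every rank contributes 1; after all_sum every rank should see the total host count.
total = mx.distributed.all_sum(mx.array(1.0))
mx.eval(total)
print(f"Rank {group.rank()} of {group.size()} sees {int(total.item())} hosts")

For example: mlx.launch --hostfile path/to/hosts.json --backend mpi check.py. Every rank should report the same host count.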

Raise the wired memory limit on each machine so the GPU can use more of the RAM. For example, on a 192 GB M2 Ultra set this:

sudo sysctl iogpu.wired_limit_mb=180000

Run

Run the generation with a command like the following:

mlx.launch \
  --hostfile path/to/hosts.json \
  --backend mpi \
  path/to/pipeline_generate.py \
  --prompt "What number is larger 6.9 or 6.11?" \
  --max-tokens 128 \
  --model mlx-community/DeepSeek-R1-4bit

For DeepSeek R1 quantized to 3-bit you need 350 GB of RAM in aggregate across the cluster of machines, e.g. two 192 GB M2 Ultras. To run the model quantized to 4-bit you need 450 GB of RAM in aggregate, e.g. three 192 GB M2 Ultras.

@stockeh commented Jan 29, 2025

This is great! Is there a simple way to use mlx-lm's server with MPI like this to prompt it remotely?

@awni (Author) commented Jan 29, 2025

Good question. It should be very doable, but perhaps not totally simple. One way would be to have the rank==0 machine be the main "server" listening for requests. That would require some code to make all the server-specific parts run only on rank==0. The rest should be pretty much as it is in the pipeline_generate.py script.
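
Something along these lines might work as a starting point (an untested sketch, not code from mlx-lm; the model name, port, and fixed maximum prompt length are placeholders, and it assumes mlx_lm's load/stream_generate API plus the same model.model.pipeline(group) sharding that pipeline_generate.py uses):

import json
from http.server import BaseHTTPRequestHandler, HTTPServer

import mlx.core as mx
from mlx_lm import load, stream_generate

MAX_PROMPT = 4096  # fixed buffer size so every rank can receive the prompt tokens

group = mx.distributed.init()
model, tokenizer = load("mlx-community/DeepSeek-R1-4bit", lazy=True)
model.model.pipeline(group)  # shard layers across ranks, as pipeline_generate.py does
mx.eval(model.parameters())

def share_prompt(tokens):
    # Rank 0 contributes [length, token ids..., zero padding]; the other ranks
    # contribute zeros, so after all_sum every rank holds rank 0's prompt.
    buf = mx.zeros((MAX_PROMPT + 1,), dtype=mx.int32)
    if group.rank() == 0:
        ids = mx.array([len(tokens)] + list(tokens), dtype=mx.int32)
        buf = mx.concatenate([ids, mx.zeros((MAX_PROMPT + 1 - ids.size,), dtype=mx.int32)])
    buf = mx.distributed.all_sum(buf)
    mx.eval(buf)
    return buf[1 : int(buf[0]) + 1].tolist()

def run_generation(prompt_tokens, max_tokens=512):
    return "".join(r.text for r in stream_generate(model, tokenizer, prompt_tokens, max_tokens=max_tokens))

class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        tokens = share_prompt(tokenizer.encode(body["prompt"]))
        text = run_generation(tokens)
        self.send_response(200)
        self.end_headers()
        self.wfile.write(json.dumps({"text": text}).encode())

if group.rank() == 0:
    HTTPServer(("0.0.0.0", 8080), Handler).serve_forever()
else:
    while True:  # non-zero ranks block in all_sum and then join each generation
        run_generation(share_prompt([]))

You'd then POST {"prompt": "..."} to port 8080 on the rank 0 machine. A real version would want streaming responses, error handling, and the per-rank weight handling that pipeline_generate.py does, but the key point is gating the server on group.rank() == 0.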

@arn4 commented Jan 29, 2025

I am trying to run the model on two 192 GB M2 Ultras, but I keep getting [METAL] Command buffer execution failed: Caused GPU Timeout Error. Do you have any idea what could be happening? I have connected the two with a Thunderbolt cable.

@awni (Author) commented Jan 29, 2025

Try using a pre-release MLX LM. I pushed some fixes for that.

@arn4 commented Jan 30, 2025

I am already running the version from GitHub with the "Hack to avoid time-outs during prompt processing", but the only way I could get it to run is with dist_stream = mx.cpu, which runs all the layers on the CPU and is super slow (0.2 tokens-per-sec).

@awni (Author) commented Jan 30, 2025

  1. Did you set your sysctl properly? You need to make sure sufficient memory is allowed to be wired.
  2. Make sure you are using macOS 15+. If it is <15 it will probably be very slow because we don't use residency sets to keep RAM wired. (A quick check for both is sketched below.)
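
Something like this untested snippet, run on each host, checks both at once (it assumes Apple silicon, where the iogpu.wired_limit_mb sysctl exists):

import platform
import subprocess

# macOS major version: 15+ keeps model weights wired via residency sets.
print("macOS version:", platform.mac_ver()[0])

# Current wired limit in MB (0 means the system default limit).
out = subprocess.run(["sysctl", "-n", "iogpu.wired_limit_mb"], capture_output=True, text=True)
print("iogpu.wired_limit_mb:", out.stdout.strip() or "unavailable")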

@arn4 commented Jan 31, 2025

The problem was that one of the two Macs was running macOS 14. Thanks, it now works fine at 14 tokens/sec.

@mintisan commented Feb 8, 2025

Nice job, bro

@dakecrazy commented:

Only for R1 & V3? Can this not be applied to the distilled models?

@awni (Author) commented Feb 9, 2025

Also works for DeepSeek v2. But not for other models.

@Tmoss11 commented Feb 11, 2025

Thank you for the interesting and helpful writeup! I'm excited to run this on a large number of hosts if possible.

I've got 31 M1 hosts, each with 16 GB RAM, configured with iogpu.wired_limit_mb=12000, running macOS 15.2, openmpi 5.0.6, mlx 0.22.1, and mlx-lm 0.21.4, and I've validated that all hosts can reach one another via SSH key.

I've tested both DeepSeek R1 3-bit and 2-bit but end up with the same MLX error and stack trace from multiple hosts after the shards are downloaded. I was just wondering if you had any suggestions for troubleshooting here?

Traceback (most recent call last):
  File "/Users/administrator/mlx-distributed/pipeline_generate.py", line 112, in <module>
    for response in stream_generate(
  File "/opt/homebrew/Caskroom/miniconda/base/lib/python3.12/site-packages/mlx_lm/utils.py", line 529, in stream_generate
    for n, (token, logprobs) in enumerate(token_generator):
  File "/opt/homebrew/Caskroom/miniconda/base/lib/python3.12/site-packages/mlx_lm/utils.py", line 307, in generate_step
    y, logprobs = _step(y)
                  ^^^^^^^^
  File "/opt/homebrew/Caskroom/miniconda/base/lib/python3.12/site-packages/mlx_lm/utils.py", line 279, in _step
    logits = model(y[None], cache=prompt_cache)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniconda/base/lib/python3.12/site-packages/mlx_lm/models/deepseek_v3.py", line 474, in __call__
    out = self.model(inputs, cache, mask)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniconda/base/lib/python3.12/site-packages/mlx_lm/models/deepseek_v3.py", line 435, in __call__
    mask = create_attention_mask(h, cache)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniconda/base/lib/python3.12/site-packages/mlx_lm/models/base.py", line 50, in create_attention_mask
    if cache is not None and cache[0] is not None:
                             ~~~~~^^^
IndexError: list index out of range

@leozusa commented Mar 5, 2025

Can two of the new Mac Studios with M3 Ultra and a max of 512 GB of unified memory, networked using Thunderbolt 5, run the non-quantized R1 version? (Saw the news and got curious.)

@awni (Author) commented Mar 5, 2025

You could run the 8-bit model with 1 TB of RAM. That's quantized, but performance should be about the same as the original fp8.

@jundaz commented Mar 8, 2025

Is there a limit on how much RAM can be made available to the GPU? For the coming 512 GB Mac Studio I'm wondering how much I can squeeze out for the GPU alone. If I can leave only something like 16 GB for the OS on each of two machines and get 496 GB x 2 of VRAM for DeepSeek R1, could I run the full version with fp16 for the core attention and fp8 for the rest of the parameters?
Also, can MLX utilize the bandwidth of multiple Thunderbolt 5 connections? Since the Mac Studio comes with multiple TB5 ports, it would be nice if we could use all of them.

@fengyy0111 commented Mar 10, 2025

I used two 192 GB Mac Studios to run the DeepSeek R1 3-bit model, using the following command: mpirun -np 2 --hostfile hosts.txt python3 pipeline_generate.py --prompt "What number is larger 6.9 or 6.11?" --model mlx-community/DeepSeek-R1-3bit
I have set the wired memory limit on both devices, but the remote device crashed due to high memory usage. How can I solve this problem?
Uploading 截屏2025-03-10 13.54.00.png…

@awni (Author) commented Mar 10, 2025

but the remote device crashed due to high memory usage

Could you share the error message? I don't see it in your post.

Did you set the sysctl like so for both machines?

sudo sysctl iogpu.wired_limit_mb=180000

@fengyy0111 commented:

Yes, I have set the sysctl on both devices, and I am able to have the remote device download the models. However, the downloads usually encounter errors and report network issues. I am in China; could it be due to regional restrictions?
mpirun -np 3 --hostfile hosts.txt python3 pipeline_generate.py --prompt "What number is larger 6.9 or 6.11?" --model mlx-community/DeepSeek-R1-3bit
/Users/zhangchi/Library/Python/3.9/lib/python/site-packages/urllib3/__init__.py:35: NotOpenSSLWarning: urllib3 v2 only supports OpenSSL 1.1.1+, currently the 'ssl' module is compiled with 'LibreSSL 2.8.3'. See: urllib3/urllib3#3020
warnings.warn(
/Users/zhangchi/Library/Python/3.9/lib/python/site-packages/urllib3/__init__.py:35: NotOpenSSLWarning: urllib3 v2 only supports OpenSSL 1.1.1+, currently the 'ssl' module is compiled with 'LibreSSL 2.8.3'. See: urllib3/urllib3#3020
warnings.warn(
Fetching 5 files: 100%|██████████| 5/5 [00:00<00:00, 22525.80it/s]
Fetching 5 files: 100%|██████████| 5/5 [00:00<00:00, 46707.17it/s]
Fetching 70 files: 17%|█▋ | 12/70 [06:54<37:14, 38.52s/it]--------------------------------------------------------------------------
PRTE has lost communication with a remote daemon.

HNP daemon : [prterun-Mac-Studio-6-41295@0,0] on node Mac-Studio-6
Remote daemon: [prterun-Mac-Studio-6-41295@0,1] on node Mac-Studio-8

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------

@zengqingfu1442 commented Mar 17, 2025

Can the new Mac Studio with M3 Ultra and up to 512 GB of unified memory run the native FP8 DeepSeek-R1 model? Does it support FP8?

@awni (Author) commented Mar 17, 2025

You can run a 4-bit quantized model on the 512GB machine. 8-bit is too big.
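
The rough arithmetic behind this (DeepSeek R1/V3 have roughly 671B parameters; this counts weights only and ignores the KV cache and runtime overhead, so the real requirement is higher):

params = 671e9  # approximate parameter count of DeepSeek R1 / V3
print("4-bit weights:", params * 4 / 8 / 1e9, "GB")  # ~335 GB, fits in 512 GB
print("8-bit weights:", params * 8 / 8 / 1e9, "GB")  # ~671 GB, does not fit in 512 GB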

@zengqingfu1442 commented:

You can run a 4-bit quantized model on the 512GB machine. 8-bit is too big.

If I have 4 such Mac Studios, each with 512 GB, how can I run the 8-bit model distributed across the 4 machines?

@awni (Author) commented Mar 18, 2025

  • If each machine has 512 GB you only need 2 of them (and I wouldn't recommend using more because it will be slower).
  • The above setup should work, though we'd need to make an 8-bit quant for that.

@awni (Author) commented Mar 18, 2025

You can make the 8-bit quant like so once ml-explore/mlx-lm#32 lands.

mlx_lm.convert --hf-path deepseek-ai/DeepSeek-R1 -q --q-bits 8 --upload-repo mlx-community/DeepSeek-R1-8bit

@zengqingfu1442 commented Mar 18, 2025

  • If each machine has 512 GB you only need 2 of them (and I wouldn't recommend using more because it will be slower).
  • The above setup should work, though we'd need to make an 8-bit quant for that.

Why are 4 machines slower than 2 machines? Is it because the Thunderbolt 5 connection is too slow? I thought more machines would mean more KV-cache space and it would be faster.

@awni (Author) commented Mar 18, 2025

With pipeline parallelism (which is used here), if you have enough RAM to fit the model then you are only adding communication latency as you add more machines.
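
A purely illustrative back-of-envelope (made-up numbers, not measurements) of why extra hosts don't help once the model fits:

compute_ms = 50.0  # hypothetical total per-token compute across the pipeline, roughly fixed once the model fits
hop_ms = 1.0       # hypothetical per-hop communication latency between machines
for hosts in (2, 3, 4):
    print(hosts, "hosts:", compute_ms + (hosts - 1) * hop_ms, "ms per token")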

@zengqingfu1442 commented:

With pipeline parallelism (which is used here), if you have enough RAM to fit the model then you are only adding communication latency as you add more machines.

Does mlx-lm support tensor parallelism?

@awni (Author) commented Mar 18, 2025

Sort of. There is a PR for it, but it's too slow right now to be practical. It's something we are working on and hoping to improve in the future.

@jiyzhang commented:

The download URL above now returns a 404:
https://raw.githubusercontent.com/ml-explore/mlx-examples/refs/heads/main/llms/mlx_lm/examples/pipeline_generate.py

The run script can be downloaded at
https://github.com/ml-explore/mlx-lm/blob/main/mlx_lm/examples/pipeline_generate.py

@Basten7 commented Apr 11, 2025

Good News

pipeline_generate.py works very well with another DeepSeek model, DeepSeek-V2.5-1210-3bit:

mlx.launch --hosts mac1,mac2 --backend mpi "pipeline_generate.py" --max-tokens 12800 --model mlx-community/DeepSeek-V2.5-1210-3bit --prompt "Generate a python script"

==========
Prompt: 21 tokens, 85.378 tokens-per-sec
Generation: 776 tokens, 17.794 tokens-per-sec
Peak memory: 55.234 GB

mlx.launch --hosts mac1,mac2 --backend mpi "pipeline_generate.py" --max-tokens 12800 --model mlx-community/DeepSeek-V2.5-1210-4bit --prompt "Generate a python script"

==========
Prompt: 21 tokens, 80.473 tokens-per-sec
Generation: 901 tokens, 17.410 tokens-per-sec
Peak memory: 70.257 GB

Less good News

1°) When I run mlx_distributed_deepseek.py I get an error because the except statement is broken in "distributed_run.py".

The fix: edit distributed_run.py around line 175, find "except e:", and replace it with "except Exception as e:".

2°) And when I run this command: mlx.distributed_config --verbose --hosts
I get this error message:

/miniconda3/envs/mlxmpi/lib/python3.11/site-packages/mlx/distributed_run.py", line 507, in prepare_tb_ring
connected_to = items[0]["domain_uuid_key"]
~~~~~~~~^^^^^^^^^^^^^^^^^^^
KeyError: 'domain_uuid_key'

@zengqingfu1442 commented:

Does mlx support gguf format?
