awni/mlx_distributed_deepseek.md

Last active April 23, 2025 20:30

Star (71) You must be signed in to star a gist
Fork (13) You must be signed in to fork a gist

Learn more about clone URLs
Clone this repository at <script src="https://gist.github.com/awni/ec071fd27940698edd14a4191855bba6.js"></script>
Save awni/ec071fd27940698edd14a4191855bba6 to your computer and use it in GitHub Desktop.

Download ZIP

Run DeepSeek R1 or V3 with MLX Distributed

Raw

mlx_distributed_deepseek.md

Setup

On every machine in the cluster install openmpi and mlx-lm:

conda install conda-forge::openmpi
pip install -U mlx-lm

Next download the pipeline parallel run script. Download it to the same path on every machine:

curl -O https://raw.githubusercontent.com/ml-explore/mlx-examples/refs/heads/main/llms/mlx_lm/examples/pipeline_generate.py

Make a hosts.json file on the machine you plan to launch the generation. For two machines it should look like this:

[
  {"ssh": "hostname1"},
  {"ssh": "hostname2"}
]

Also make sure you can ssh hostname from every machine to every other machine. Check-out the MLX documentation for more information on setting up and testing MPI.

Set the wired limit on the machines to use more memory. For example on a 192GB M2 Ultra set this:

sudo sysctl iogpu.wired_limit_mb=180000

Run

Run the generation with a command like the following:

mlx.launch \
  --hostfile path/to/hosts.json \
  --backend mpi \
  path/to/pipeline_generate.py \ 
  --prompt "What number is larger 6.9 or 6.11?" \
  --max-tokens 128 \
  --model mlx-community/DeepSeek-R1-4bit

For DeepSeek R1 quantized in 3-bit you need in aggregate 350GB of RAM accross the cluster of machines, e.g. two 192 GB M2 Ultras. To run the model quantized to 4-bit you need 450GB in aggregate RAM or three 192 GB M2 Ultras.

zengqingfu1442 commented Mar 17, 2025 •

edited

Loading

Does the new Mac Studios with M3 Ultra and max 512Gb of unified memory can run native FP8 deepseek-r1 model? Does the new Mac Studios with M3 Ultra and max 512Gb of unified memory support FP8?

Author

awni commented Mar 17, 2025

You can run a 4-bit quantized model on the 512GB machine. 8-bit is too big.

zengqingfu1442 commented Mar 18, 2025

You can run a 4-bit quantized model on the 512GB machine. 8-bit is too big.

If i hvae 4 such Mac Studios, each with 512GB, how can i run the 8-bit model distributed on the 4 machines?

Author

awni commented Mar 18, 2025

If each machine has 512 you only need 2 of them (and I wouldn't recommend using more because it will be slower)
The above setup should work.. though we'd need to make an 8-bit quant for that.

Author

awni commented Mar 18, 2025

You can make the 8-bit quant like so once ml-explore/mlx-lm#32 lands.

mlx_lm.convert --hf-path deepseek-ai/DeepSeek-R1 -q --q-bits 8 --upload-repo mlx-community/DeepSeek-R1-8bit

zengqingfu1442 commented Mar 18, 2025 •

edited

Loading

If each machine has 512 you only need 2 of them (and I wouldn't recommend using more because it will be slower)

The above setup should work.. though we'd need to make an 8-bit quant for that.

Why does 4 machines slower than 2 machines? Because the connection speed of Thunderbolt 5 is too slow? I thought that more machines mean more kvcache space and it would be faster.

Author

awni commented Mar 18, 2025

With pipeline parallelism (which is used here).. if you have enough RAM to fit the model then you only are adding communication latency as you add more machines.

zengqingfu1442 commented Mar 18, 2025

With pipeline parallelism (which is used here).. if you have enough RAM to fit the model then you only are adding communication latency as you add more machines.

Does mlx-lm support tensor parallelism?

Author

awni commented Mar 18, 2025

Sort of. There is a PR for it.. but it's too slow right now to be practical. Something we are working on / hoping to improve for the future.

jiyzhang commented Mar 31, 2025

404 for the url
https://raw.githubusercontent.com/ml-explore/mlx-examples/refs/heads/main/llms/mlx_lm/examples/pipeline_generate.py

The run script can be downloaded at
https://github.com/ml-explore/mlx-lm/blob/main/mlx_lm/examples/pipeline_generate.py

Basten7 commented Apr 11, 2025 •

edited

Loading

Good News

pipeline_generate.py work very well with other DeepSeek model "DeepSeek-V2.5-1210-3bit "

mlx.launch --hosts mac1,mac2 --backend mpi "pipeline_generate.py" --max-tokens 12800 --model mlx-community/DeepSeek-V2.5-1210-3bit --prompt "Generate a python script"

==========
Prompt: 21 tokens, 85.378 tokens-per-sec
Generation: 776 tokens, 17.794 tokens-per-sec
Peak memory: 55.234 GB

mlx.launch --hosts mac1,mac2 --backend mpi "pipeline_generate.py" --max-tokens 12800 --model mlx-community/DeepSeek-V2.5-1210-4bit --prompt "Generate a python script"

==========
Prompt: 21 tokens, 80.473 tokens-per-sec
Generation: 901 tokens, 17.410 tokens-per-sec
Peak memory: 70.257 GB

Less good News

1°) When I run mlx_distributed_deepseek.py
error message :

except statement is broken in "distributed_run.py"

Edit around line 175. Find:
in the file "except e:"
replace with
"except Exception as e:"

2°) And when I run this command: mlx.distributed_config --verbose --hosts
error message :

/miniconda3/envs/mlxmpi/lib/python3.11/site-packages/mlx/distributed_run.py", line 507, in prepare_tb_ring
connected_to = items[0]["domain_uuid_key"]
~~~~~~~~^^^^^^^^^^^^^^^^^^^
KeyError: 'domain_uuid_key'

zengqingfu1442 commented Apr 15, 2025

Does mlx support gguf format?

awni/mlx_distributed_deepseek.md

Setup

Run

zengqingfu1442 commented Mar 17, 2025 • edited Loading

awni commented Mar 17, 2025

zengqingfu1442 commented Mar 18, 2025

awni commented Mar 18, 2025

awni commented Mar 18, 2025

zengqingfu1442 commented Mar 18, 2025 • edited Loading

awni commented Mar 18, 2025

zengqingfu1442 commented Mar 18, 2025

awni commented Mar 18, 2025

jiyzhang commented Mar 31, 2025

Basten7 commented Apr 11, 2025 • edited Loading

zengqingfu1442 commented Apr 15, 2025

zengqingfu1442 commented Mar 17, 2025 •

edited

Loading

zengqingfu1442 commented Mar 18, 2025 •

edited

Loading

Basten7 commented Apr 11, 2025 •

edited

Loading