On every machine in the cluster, install openmpi and mlx-lm:
conda install conda-forge::openmpi
pip install -U mlx-lm
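To confirm that both installs are visible from the environment you will launch from, you can run a quick check like this on each machine. It is a minimal sketch that assumes openmpi puts an mpirun launcher on PATH and that mlx-lm installs the importable package mlx_lm. The function name check_install is made up.

```python
import importlib.util
import shutil


def check_install() -> dict[str, bool]:
    """Report whether the packages installed above are visible from this environment."""
    return {
        # mlx-lm provides the importable package 'mlx_lm'
        "mlx_lm": importlib.util.find_spec("mlx_lm") is not None,
        # assumption: openmpi puts the 'mpirun' launcher on PATH
        "mpirun": shutil.which("mpirun") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_install().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```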
Next, download the pipeline-parallel run script to the same path on every machine:
curl -O https://raw.githubusercontent.com/ml-explore/mlx-examples/refs/heads/main/llms/mlx_lm/examples/pipeline_generate.py
Make a hosts.json file on the machine you plan to launch the generation from. For two machines it should look like this:
[
{"ssh": "hostname1"},
{"ssh": "hostname2"}
]
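For more than two machines, it can be easier to generate the hostfile than to type it. This is a minimal sketch that writes the same structure as above, a JSON list with one {"ssh": hostname} object per machine. The helper names are made up, and the hostnames are placeholders.

```python
import json
from pathlib import Path


def hostfile_entries(hostnames: list[str]) -> list[dict[str, str]]:
    """Build the hostfile structure: one {"ssh": hostname} object per machine."""
    return [{"ssh": name} for name in hostnames]


def write_hostfile(hostnames: list[str], path: str = "hosts.json") -> Path:
    """Write the entries as JSON to `path` and return that path."""
    out = Path(path)
    out.write_text(json.dumps(hostfile_entries(hostnames), indent=1) + "\n")
    return out


if __name__ == "__main__":
    # Placeholder names: use the names you can actually ssh to.
    print(json.dumps(hostfile_entries(["hostname1", "hostname2"]), indent=1))
```

Call write_hostfile([...]) to save the file. The indentation differs from the example above, but the JSON content is the same.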
Also make sure you can ssh hostname from every machine to every other machine. Check out the MLX documentation for more information on setting up and testing MPI.
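"Every machine to every other machine" means every ordered pair of distinct hosts. The sketch below enumerates those pairs and tests each one by hopping to the source host and running ssh to the target from there. It assumes key-based ssh is already set up. The option BatchMode=yes is a standard OpenSSH option that makes ssh fail instead of prompting for a password. The function names are hypothetical.

```python
import itertools
import subprocess


def ssh_check_commands(hosts: list[str]) -> list[list[str]]:
    """One command per ordered (source, target) pair of distinct hosts.

    Each command connects to `source` and from there runs `ssh target hostname`.
    BatchMode=yes makes ssh fail rather than prompt, so a missing key shows up
    as a failure instead of a hang.
    """
    commands = []
    for src, dst in itertools.permutations(hosts, 2):
        commands.append(["ssh", "-o", "BatchMode=yes", src,
                         "ssh", "-o", "BatchMode=yes", dst, "hostname"])
    return commands


def run_checks(hosts: list[str], timeout: float = 30.0) -> bool:
    """Run every check, print one status line per pair, return True if all pass."""
    all_ok = True
    for cmd in ssh_check_commands(hosts):
        src, dst = cmd[3], cmd[7]
        try:
            ok = subprocess.run(cmd, capture_output=True, timeout=timeout).returncode == 0
        except subprocess.TimeoutExpired:
            ok = False
        all_ok = all_ok and ok
        print(f"{src} -> {dst}: {'ok' if ok else 'FAILED'}")
    return all_ok
```

Usage: run_checks(["hostname1", "hostname2"]). For n hosts, this runs n × (n − 1) checks.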
Set the wired limit on each machine so that more memory can be used. For example, on a 192 GB M2 Ultra, set:
sudo sysctl iogpu.wired_limit_mb=180000
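The value is in MB. On a 192 GB machine, 180000 leaves 192 × 1024 − 180000 = 16608 MB (about 16 GB) outside the limit. The helper below reproduces that arithmetic and derives a limit for other machine sizes. Both the 1 GB = 1024 MB convention and the 16 GB default reserve are my own assumptions, not documented recommendations.

```python
def headroom_mb(total_gb: int, limit_mb: int) -> int:
    """Memory (MB) left outside the wired limit, assuming 1 GB = 1024 MB."""
    return total_gb * 1024 - limit_mb


def wired_limit_mb(total_gb: int, reserve_gb: int = 16) -> int:
    """Hypothetical rule of thumb: wire everything except `reserve_gb` for macOS."""
    return (total_gb - reserve_gb) * 1024


if __name__ == "__main__":
    print(headroom_mb(192, 180000))  # 16608
    print(f"sudo sysctl iogpu.wired_limit_mb={wired_limit_mb(192)}")
```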
Run the generation with a command like the following:
mlx.launch \
--hostfile path/to/hosts.json \
--backend mpi \
path/to/pipeline_generate.py \
--prompt "What number is larger 6.9 or 6.11?" \
--max-tokens 128 \
--model mlx-community/DeepSeek-R1-4bit
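If you prefer to start runs from Python, the same command can be assembled as an argument list. This sketch uses only the flags shown in the command above. The helper names launch_command and launch are made up, and launch() needs mlx.launch on PATH.

```python
import shlex
import subprocess


def launch_command(hostfile: str, script: str, prompt: str, model: str,
                   max_tokens: int = 128) -> list[str]:
    """Build the same mlx.launch invocation as the shell command above."""
    return [
        "mlx.launch",
        "--hostfile", hostfile,
        "--backend", "mpi",
        script,
        "--prompt", prompt,
        "--max-tokens", str(max_tokens),
        "--model", model,
    ]


def launch(cmd: list[str]) -> None:
    """Run the command; raises CalledProcessError if mlx.launch exits non-zero."""
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    cmd = launch_command("path/to/hosts.json", "path/to/pipeline_generate.py",
                         "What number is larger 6.9 or 6.11?",
                         "mlx-community/DeepSeek-R1-4bit")
    print(shlex.join(cmd))  # the equivalent shell command, with quoting
```

Passing a list instead of a single string means the prompt needs no extra shell quoting.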
For DeepSeek R1 quantized to 3-bit, you need 350 GB of RAM in aggregate across the cluster, e.g. two 192 GB M2 Ultras. To run the model quantized to 4-bit, you need 450 GB of aggregate RAM, e.g. three 192 GB M2 Ultras.
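The machine counts follow from rounding the aggregate requirement up to whole machines: 350 / 192 ≈ 1.8 gives 2, and 450 / 192 ≈ 2.3 gives 3. This is a minimal sketch of that arithmetic, assuming identical machines. The same counts hold if you only count the 180000 MB wired limit per machine (about 175.8 GB).

```python
import math


def machines_needed(aggregate_gb: float, per_machine_gb: float) -> int:
    """Smallest number of identical machines whose combined memory covers aggregate_gb."""
    return math.ceil(aggregate_gb / per_machine_gb)


if __name__ == "__main__":
    for bits, need_gb in [(3, 350), (4, 450)]:
        print(f"{bits}-bit: {machines_needed(need_gb, 192)} x 192 GB M2 Ultra")
    # Counting only the wired limit (180000 MB) per machine:
    print(machines_needed(450, 180000 / 1024))
```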
Good News
pipeline_generate.py works very well with another DeepSeek model, "DeepSeek-V2.5-1210-3bit", and with its 4-bit version:
mlx.launch --hosts mac1,mac2 --backend mpi "pipeline_generate.py" --max-tokens 12800 --model mlx-community/DeepSeek-V2.5-1210-3bit --prompt "Generate a python script"
==========
Prompt: 21 tokens, 85.378 tokens-per-sec
Generation: 776 tokens, 17.794 tokens-per-sec
Peak memory: 55.234 GB
mlx.launch --hosts mac1,mac2 --backend mpi "pipeline_generate.py" --max-tokens 12800 --model mlx-community/DeepSeek-V2.5-1210-4bit --prompt "Generate a python script"
==========
Prompt: 21 tokens, 80.473 tokens-per-sec
Generation: 901 tokens, 17.410 tokens-per-sec
Peak memory: 70.257 GB
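To compare runs like the two above, you can parse the three summary lines. Dividing generated tokens by tokens-per-sec gives the approximate decode time: 776 / 17.794 ≈ 43.6 s for 3-bit and 901 / 17.410 ≈ 51.8 s for 4-bit, at a peak memory of 55.234 GB vs 70.257 GB. This is a sketch that assumes the summary format is exactly as printed above. The function names are hypothetical.

```python
import re

# Matches lines such as "Generation: 776 tokens, 17.794 tokens-per-sec".
STAT_LINE = re.compile(r"^(Prompt|Generation): (\d+) tokens, ([\d.]+) tokens-per-sec$")


def parse_stats(text: str) -> dict:
    """Extract token counts, throughput and peak memory from the printed summary."""
    stats = {}
    for raw in text.splitlines():
        line = raw.strip()
        match = STAT_LINE.match(line)
        if match:
            stats[match.group(1).lower()] = (int(match.group(2)), float(match.group(3)))
        elif line.startswith("Peak memory:"):
            # "Peak memory: 55.234 GB" -> 55.234
            stats["peak_memory_gb"] = float(line.split(":", 1)[1].split()[0])
    return stats


def generation_seconds(stats: dict) -> float:
    """Approximate decode time: generated tokens divided by tokens per second."""
    tokens, tps = stats["generation"]
    return tokens / tps


if __name__ == "__main__":
    sample = """Prompt: 21 tokens, 85.378 tokens-per-sec
Generation: 776 tokens, 17.794 tokens-per-sec
Peak memory: 55.234 GB"""
    s = parse_stats(sample)
    print(s["generation"], round(generation_seconds(s), 2))  # (776, 17.794) 43.61
```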
Less Good News
1°) When I run mlx_distributed_deepseek.py, I get an error: an except statement in "distributed_run.py" is broken. To fix it, edit the file around line 175: find "except e:" and replace it with "except Exception as e:".
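Why the fix matters: "except e:" is valid syntax, so the file still loads. But Python treats e as the exception class to match and only looks the name up when an exception is actually raised. At that point, e is undefined, and the handler itself fails with a NameError. The sketch below reproduces this in isolation. The functions broken and fixed are illustrative stand-ins, not the code in distributed_run.py.

```python
def broken(fn):
    try:
        return fn()
    except e:  # noqa: F821  BUG: 'e' is undefined; only evaluated when fn() raises
        return f"caught: {e}"


def fixed(fn):
    try:
        return fn()
    except Exception as e:  # match any Exception and bind it to 'e'
        return f"caught: {type(e).__name__}: {e}"


if __name__ == "__main__":
    print(broken(lambda: 42))  # 42 (the happy path never evaluates 'e')
    print(fixed(lambda: 1 / 0))
    try:
        broken(lambda: 1 / 0)
    except NameError as err:
        print(f"broken handler failed: {err}")
```

This is why the bug only shows up when something else has already gone wrong, and it also hides the original error.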
2°) And when I run this command: mlx.distributed_config --verbose --hosts
I get this error message:
/miniconda3/envs/mlxmpi/lib/python3.11/site-packages/mlx/distributed_run.py", line 507, in prepare_tb_ring
connected_to = items[0]["domain_uuid_key"]
~~~~~~~~^^^^^^^^^^^^^^^^^^^
KeyError: 'domain_uuid_key'
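The traceback shows prepare_tb_ring reading items[0]["domain_uuid_key"] directly, so the KeyError means the first entry it collected has no such key. I have not checked where those entries come from or why the key is missing. The function below is only a hypothetical sketch of a more informative lookup that could help narrow the problem down. It is not a patch for mlx itself, and the sample dictionaries in the usage line are made up.

```python
def first_domain_uuid(items: list[dict]) -> str:
    """Hypothetical defensive version of items[0]["domain_uuid_key"].

    Keeps the original behaviour of reading only the first entry, but fails
    with a message listing the keys that entry actually has.
    """
    if not items:
        raise RuntimeError("no entries to read 'domain_uuid_key' from")
    first = items[0]
    if "domain_uuid_key" not in first:
        raise RuntimeError(
            "first entry has no 'domain_uuid_key'; its keys are: "
            + ", ".join(sorted(map(str, first)))
        )
    return first["domain_uuid_key"]
```

Usage: first_domain_uuid([{"domain_uuid_key": "abc"}]) returns "abc", while first_domain_uuid([{"_name": "x"}]) raises an error that names the keys actually present.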