{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 10 Minutes to cuDF and CuPy\n",
"\n",
"This notebook provides introductory examples of how you can use cuDF and CuPy together to take advantage of CuPy array functionality (such as advanced linear algebra operations)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Converting a cuDF DataFrame or Series to a CuPy Array\n",
"\n",
"If we want to convert a cuDF `DataFrame` to a CuPy `ndarray`, the best way is to use the [DLPack interface](https://github.com/dmlc/dlpack)."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import time\n",
"\n",
"import numpy as np\n",
"import cupy as cp\n",
"import cudf\n",
"from numba import cuda"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 0 ns, sys: 0 ns, total: 0 ns\n",
"Wall time: 812 µs\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/conda/envs/rapids/lib/python3.6/site-packages/cudf/io/dlpack.py:83: UserWarning: WARNING: cuDF to_dlpack() produces column-major (Fortran order) output. If the output tensor needs to be row major, transpose the output of this function.\n",
" return cpp_dlpack.to_dlpack(gdf_cols)\n"
]
},
{
"data": {
"text/plain": [
"cupy.core.core.ndarray"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"nelem = 10000\n",
"df = cudf.DataFrame({'a': range(nelem),\n",
"                     'b': range(500, nelem + 500),\n",
"                     'c': range(1000, nelem + 1000)})\n",
"\n",
"%time arr_cupy = cp.fromDlpack(df.to_dlpack())\n",
"type(arr_cupy)"
]
},
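{
"cell_type": "markdown",
"metadata": {},
"source": [
"As the warning above notes, `to_dlpack()` produces a column-major (Fortran order) array. If downstream code expects row-major data, a minimal sketch is to make a C-ordered copy with `cp.ascontiguousarray` (the variable name `arr_cupy_c` is just for illustration):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Make a row-major (C order) copy of the Fortran-ordered DLPack output.\n",
"arr_cupy_c = cp.ascontiguousarray(arr_cupy)\n",
"cp.isfortran(arr_cupy_c)"
]
},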
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The best way to convert a cuDF `Series` to a CuPy `ndarray` is to either pass the underlying Numba `DeviceNDArray` to `cupy.asarray` to leverage the [CUDA Array Interface](https://numba.pydata.org/numba-doc/dev/cuda/cuda_array_interface.html) or leverage the `dlpack` interface for conversions. We can also pass the `Series` itself, but this will be far slower. We're working on adding the `__cuda_array_interface__` attribute to `Series`, so eventually you'll be able to pass the Series directly with low latency (see [this issue](https://github.com/rapidsai/cudf/issues/2433) to track our progress)."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 0 ns, sys: 0 ns, total: 0 ns\n",
"Wall time: 463 µs\n",
"CPU times: user 0 ns, sys: 0 ns, total: 0 ns\n",
"Wall time: 311 µs\n",
"CPU times: user 2.49 s, sys: 56 ms, total: 2.54 s\n",
"Wall time: 2.55 s\n"
]
},
{
"data": {
"text/plain": [
"cupy.core.core.ndarray"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"col = 'a'\n",
"\n",
"%time cola_cupy = cp.asarray(df[col].data.mem)\n",
"%time cola_cupy = cp.fromDlpack(df[col].to_dlpack())\n",
"%time cola_cupy = cp.asarray(df[col])\n",
"type(cola_cupy)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"From here, we can proceed with normal CuPy workflows, such as reshaping the array, getting the diagonal, or calculating the norm."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 0, 1, 2, ..., 197, 198, 199],\n",
" [ 200, 201, 202, ..., 397, 398, 399],\n",
" [ 400, 401, 402, ..., 597, 598, 599],\n",
" ...,\n",
" [9400, 9401, 9402, ..., 9597, 9598, 9599],\n",
" [9600, 9601, 9602, ..., 9797, 9798, 9799],\n",
" [9800, 9801, 9802, ..., 9997, 9998, 9999]])"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"reshaped_arr = cola_cupy.reshape(50, 200)\n",
"reshaped_arr"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([ 0, 201, 402, 603, 804, 1005, 1206, 1407, 1608, 1809, 2010,\n",
" 2211, 2412, 2613, 2814, 3015, 3216, 3417, 3618, 3819, 4020, 4221,\n",
" 4422, 4623, 4824, 5025, 5226, 5427, 5628, 5829, 6030, 6231, 6432,\n",
" 6633, 6834, 7035, 7236, 7437, 7638, 7839, 8040, 8241, 8442, 8643,\n",
" 8844, 9045, 9246, 9447, 9648, 9849])"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"reshaped_arr.diagonal()"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(577306.967739)"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cp.linalg.norm(reshaped_arr)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Converting a CuPy Array to a cuDF DataFrame or Series\n",
"\n",
"We can also convert a CuPy `ndarray` to a cuDF `DataFrame` or `Series`. We can use the same `dlpack` interface from above, or rely on the `__cuda_array_interface__` and use cuDF's `from_gpu_matrix`. Either way, we'll need to make sure that our CuPy array is Fortran contiguous in memory (if it's not already). We can either transpose the array or simply coerce it to be Fortran contiguous beforehand.\n",
"\n",
"We can check whether our array is Fortran contiguous by using `cp.isfortran` or looking at the [flags](https://docs-cupy.chainer.org/en/stable/reference/generated/cupy.ndarray.html#cupy.ndarray.flags) of the array."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"False"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cp.isfortran(reshaped_arr)"
]
},
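{
"cell_type": "markdown",
"metadata": {},
"source": [
"Equivalently, a quick sketch of the second option mentioned above: printing the array's `flags` shows its contiguity directly."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# The F_CONTIGUOUS flag mirrors the result of cp.isfortran above.\n",
"print(reshaped_arr.flags)"
]
},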
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this case, we'll need to convert it before going to cuDF."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/conda/envs/rapids/lib/python3.6/site-packages/cudf/io/dlpack.py:36: UserWarning: WARNING: cuDF from_dlpack() assumes column-major (Fortran order) input. If the input tensor is row-major, transpose it before passing it to this function.\n",
" res, valids = cpp_dlpack.from_dlpack(pycapsule_obj)\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
" 0 1 2 3 4 5 6 ... 199\n",
"0 0 1 2 3 4 5 6 ... 199\n",
"1 200 201 202 203 204 205 206 ... 399\n",
"2 400 401 402 403 404 405 406 ... 599\n",
"3 600 601 602 603 604 605 606 ... 799\n",
"4 800 801 802 803 804 805 806 ... 999\n",
"[192 more columns]\n"
]
}
],
"source": [
"reshaped_arr = cp.asfortranarray(reshaped_arr)\n",
"reshaped_df = cudf.from_dlpack(reshaped_arr.toDlpack())\n",
"print(reshaped_df.head())"
]
},
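{
"cell_type": "markdown",
"metadata": {},
"source": [
"Alternatively, as mentioned above, we can rely on the `__cuda_array_interface__` and use cuDF's `from_gpu_matrix` to get the same result."
]
},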
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 0 1 2 3 4 5 6 ... 199\n",
"0 0 1 2 3 4 5 6 ... 199\n",
"1 200 201 202 203 204 205 206 ... 399\n",
"2 400 401 402 403 404 405 406 ... 599\n",
"3 600 601 602 603 604 605 606 ... 799\n",
"4 800 801 802 803 804 805 806 ... 999\n",
"[192 more columns]\n"
]
}
],
"source": [
"reshaped_df = cudf.DataFrame.from_gpu_matrix(reshaped_arr)\n",
"print(reshaped_df.head())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If the array is not already contiguous, we'll need to create a contiguous array with `ascontiguousarray`."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 0\n",
"1 201\n",
"2 402\n",
"3 603\n",
"4 804\n",
"5 1005\n",
"6 1206\n",
"7 1407\n",
"8 1608\n",
"9 1809\n",
"[40 more rows]\n",
"dtype: int64\n"
]
}
],
"source": [
"diag_data = cp.ascontiguousarray(reshaped_arr.diagonal())\n",
"print(cudf.Series(diag_data))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Interweaving cuDF and CuPy for Smooth PyData Workflows\n",
"\n",
"RAPIDS libraries and the entire GPU PyData ecosystem are developing quickly, but sometimes one library may not have the functionality you need. One example might be taking the row-wise sum (or mean) of a DataFrame, which is trivial in pandas. cuDF's support for row-wise operations isn't mature, so you'd need to either transpose the DataFrame or write a UDF and explicitly calculate the sum across each row. Depending on your data's shape, transposing could lead to hundreds of thousands of columns (which cuDF wouldn't perform well with), and writing a UDF can be time intensive.\n",
"\n",
"By leveraging the interoperability of the GPU PyData ecosystem, this operation becomes very easy. Let's take the row-wise sum of our previously reshaped cuDF DataFrame."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 0 1 2 3 4 5 6 ... 199\n",
"0 0 1 2 3 4 5 6 ... 199\n",
"1 200 201 202 203 204 205 206 ... 399\n",
"2 400 401 402 403 404 405 406 ... 599\n",
"3 600 601 602 603 604 605 606 ... 799\n",
"4 800 801 802 803 804 805 806 ... 999\n",
"[192 more columns]\n"
]
}
],
"source": [
"print(reshaped_df.head())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can just transform it to a CuPy array via `dlpack` and use the `axis` argument of `sum`."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/conda/envs/rapids/lib/python3.6/site-packages/cudf/io/dlpack.py:83: UserWarning: WARNING: cuDF to_dlpack() produces column-major (Fortran order) output. If the output tensor needs to be row major, transpose the output of this function.\n",
" return cpp_dlpack.to_dlpack(gdf_cols)\n"
]
},
{
"data": {
"text/plain": [
"array([ 19900, 59900, 99900, 139900, 179900, 219900, 259900,\n",
" 299900, 339900, 379900, 419900, 459900, 499900, 539900,\n",
" 579900, 619900, 659900, 699900, 739900, 779900, 819900,\n",
" 859900, 899900, 939900, 979900, 1019900, 1059900, 1099900,\n",
" 1139900, 1179900, 1219900, 1259900, 1299900, 1339900, 1379900,\n",
" 1419900, 1459900, 1499900, 1539900, 1579900, 1619900, 1659900,\n",
" 1699900, 1739900, 1779900, 1819900, 1859900, 1899900, 1939900,\n",
" 1979900])"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"new_arr = cp.fromDlpack(reshaped_df.to_dlpack())\n",
"new_arr.sum(axis=1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With just a line or two, we're able to move seamlessly between data structures in this ecosystem, giving us enormous flexibility without sacrificing speed."
]
},
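{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we want the row sums back in cuDF (say, to attach them to a DataFrame), a minimal sketch reuses the `cudf.Series` pattern from above. (The variable name `row_sums` is just for illustration.)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# ascontiguousarray is a safe no-op if the 1-D result is already C-contiguous.\n",
"row_sums = cudf.Series(cp.ascontiguousarray(new_arr.sum(axis=1)))\n",
"print(row_sums.head())"
]
},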
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Converting a cuDF DataFrame to a CuPy Sparse Matrix\n",
"\n",
"We can also convert a `DataFrame` or `Series` to a CuPy sparse matrix. We might want to do this if downstream processes expect CuPy sparse matrices as inputs.\n",
"\n",
"The sparse matrix data structure is defined by three dense arrays, which we could create manually from an existing cuDF `DataFrame` or `Series`. Luckily, we don't need to do that: we can simply leverage `dlpack` again. We'll define a small helper function for cleanliness."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"def cudf_to_cupy_sparse_matrix(data, sparseformat='column'):\n",
"    \"\"\"Convert a cuDF DataFrame or Series to a CuPy sparse matrix\n",
"    in CSC ('column', the default) or CSR ('row') format.\n",
"    \"\"\"\n",
"    if sparseformat not in ('row', 'column'):\n",
"        raise ValueError(\"sparseformat must be 'row' or 'column'\")\n",
"\n",
"    # CSC by default; switch to CSR when row format is requested.\n",
"    _sparse_constructor = cp.sparse.csc_matrix\n",
"    if sparseformat == 'row':\n",
"        _sparse_constructor = cp.sparse.csr_matrix\n",
"\n",
"    return _sparse_constructor(cp.fromDlpack(data.to_dlpack()))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can define a sparsely populated DataFrame to illustrate this conversion to either sparse matrix format."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"df = cudf.DataFrame()\n",
"nelem = 10000\n",
"nonzero = 1000\n",
"for i in range(20):\n",
"    arr = cp.random.normal(5, 5, nelem)\n",
"    arr[cp.random.choice(arr.shape[0], nelem-nonzero, replace=False)] = 0\n",
"    df['a' + str(i)] = arr"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" a0 a1 a2 a3 a4 a5 a6 ... a19\n",
"0 0.0 0.0 0.5816000025446924 0.0 0.0 0.0 0.0 ... 0.0\n",
"1 3.741213474871467 0.0 0.0 0.6527538036321419 0.0 0.0 0.0 ... 0.0\n",
"2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0\n",
"3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0\n",
"4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0\n",
"[12 more columns]\n"
]
}
],
"source": [
"print(df.head())"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/conda/envs/rapids/lib/python3.6/site-packages/cudf/io/dlpack.py:83: UserWarning: WARNING: cuDF to_dlpack() produces column-major (Fortran order) output. If the output tensor needs to be row major, transpose the output of this function.\n",
" return cpp_dlpack.to_dlpack(gdf_cols)\n"
]
}
],
"source": [
"sparse_data = cudf_to_cupy_sparse_matrix(df)"
]
},
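{
"cell_type": "markdown",
"metadata": {},
"source": [
"As noted above, the sparse matrix is defined by three dense arrays. A minimal sketch of inspecting them through the scipy-style `data`, `indices`, and `indptr` attributes, and of requesting CSR format instead:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# The three dense arrays that back the CSC matrix.\n",
"print(sparse_data.data[:5])     # nonzero values\n",
"print(sparse_data.indices[:5])  # row index of each nonzero value\n",
"print(sparse_data.indptr[:5])   # offsets where each column starts\n",
"\n",
"# The same helper can produce a CSR (row) matrix instead.\n",
"sparse_csr = cudf_to_cupy_sparse_matrix(df, sparseformat='row')"
]
},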
{
"cell_type": "markdown",
"metadata": {},
"source": [
"From here, we could continue our workflow with a CuPy sparse matrix.\n",
"\n",
"For a full list of the functionality built into these libraries, we encourage you to check out the API docs for [cuDF](https://docs.rapids.ai/api/cudf/nightly/) and [CuPy](https://docs-cupy.chainer.org/en/stable/index.html)."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
}
},
"nbformat": 4,
"nbformat_minor": 4
} |