Tianao Ge getianao

pip install 'ipydrawio[all]' jupyterlab_widgets jupyterlab_image_editor jlab-enhanced-cell-toolbar \
  jlab-enhanced-launcher jupyterlab_latex jupyter-server-proxy jupyter-vscode-proxy

sudo systemctl restart jupyterhub.service

Install jupyter-remote-desktop-proxy:

https://github.com/jupyterhub/jupyter-remote-desktop-proxy

Ampere (GA10x GPU): 6144 KB of L2 cache across 12 32-bit memory controllers (384-bit total); each 32-bit memory controller is paired with 512 KB of L2 cache.

Each SM has 128 CUDA Cores, four 3rd-generation Tensor Cores, a 256 KB Register File, and 128 KB of L1/Shared Memory. An SM is split into four partitions, each with a 64 KB Register File, one 3rd-generation Tensor Core, an L0 instruction cache, one warp scheduler, one dispatch unit, and sets of math and other units. The four partitions share a combined 128 KB L1 data cache/shared memory subsystem.
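
The totals in the Ampere note can be cross-checked from the per-unit figures. A minimal sketch in Python, using only the numbers quoted in the note; the variable names are illustrative and not taken from any NVIDIA tool.

```python
# Per-unit figures quoted in the GA10x note; names are illustrative.
MEMORY_CONTROLLERS = 12
CONTROLLER_WIDTH_BITS = 32
L2_PER_CONTROLLER_KB = 512

PARTITIONS_PER_SM = 4
REGISTER_FILE_PER_PARTITION_KB = 64
TENSOR_CORES_PER_PARTITION = 1

# Chip-level totals derived from the memory controllers.
bus_width_bits = MEMORY_CONTROLLERS * CONTROLLER_WIDTH_BITS
l2_total_kb = MEMORY_CONTROLLERS * L2_PER_CONTROLLER_KB

# SM-level totals derived from the four partitions.
register_file_per_sm_kb = PARTITIONS_PER_SM * REGISTER_FILE_PER_PARTITION_KB
tensor_cores_per_sm = PARTITIONS_PER_SM * TENSOR_CORES_PER_PARTITION

print(f"memory bus width:     {bus_width_bits} bit")
print(f"L2 cache:             {l2_total_kb} KB")
print(f"register file per SM: {register_file_per_sm_kb} KB")
print(f"tensor cores per SM:  {tensor_cores_per_sm}")
```

The derived values match the 384-bit bus, 6144 KB L2, 256 KB register file, and four Tensor Cores stated in the note.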

Turing and Volta SMs support concurrent execution of FP32 and INT32 operations

Linux Commands

Disable GUI

Switch to "text mode": sudo systemctl isolate multi-user.target

Switch to "graphical mode": sudo systemctl isolate graphical.target

Create User

sudo useradd -m -s /bin/bash tge  # optional: add -G sudo to put the user in the sudo group
sudo passwd tge

ncu --list-sets  # List the available sets; a set is a group of sections.
ncu --list-sections  # List the available sections; a section is a group of metrics.
ncu --query-metrics  # List all individual metrics.
ncu --query-metrics-mode suffix --metrics <metrics list>  # List the valid suffixes for the given base metric names.
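
Once the metric names are known, they are passed to `ncu` as a comma-separated list. A minimal sketch of a helper that assembles such a command line without running it; `build_ncu_command` and `./my_app` are hypothetical, and the two metric names are only examples that should be checked against the `ncu --query-metrics` output for your GPU.

```python
import shlex


def build_ncu_command(app, metrics=None, set_name=None, output=None):
    """Assemble an ncu argument list (hypothetical helper; does not run ncu)."""
    cmd = ["ncu"]
    if set_name:
        cmd += ["--set", set_name]              # a named set of sections
    if metrics:
        cmd += ["--metrics", ",".join(metrics)]  # comma-separated metric names
    if output:
        cmd += ["-o", output]                   # report file name
    cmd += list(app)                            # application and its arguments
    return cmd


# Example metric names; verify them with `ncu --query-metrics`.
cmd = build_ncu_command(
    ["./my_app"],
    metrics=["sm__cycles_elapsed.avg", "dram__bytes_read.sum"],
)
print(shlex.join(cmd))
```

The list form can be handed to `subprocess.run(cmd)` directly, which avoids shell quoting issues.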

ncu_cli

v2rayA Setup

# download image
docker pull mzz2017/v2raya

# run v2raya
docker run -d \
  --restart=always \
  --privileged \

GPU Terminology

Nvidia/CUDA	AMD OpenCL	Note
Task	Task

Docker Cheatsheet

Show docker info

List images
```
docker image ls
```
List containers

	# Show route
	netstat -nr

	# Show DNS
	cat /etc/resolv.conf

	# Default connection
	sudo /sbin/route change default 10.6.0.1 # 10.0.0.1

	# Forward server address via VPN



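	#define BLOCK_SIZE 1024
	#define STRIDE 16

	// A and B must be device pointers; B needs at least BLOCK_SIZE / STRIDE elements.
	__global__ void kernel(const float *A, float *B) {
		int idx = blockIdx.x * blockDim.x + threadIdx.x;
		// Strided read: consecutive threads read addresses STRIDE * 4 bytes apart (4-byte floats).
		if (idx * STRIDE < BLOCK_SIZE)
			B[idx] = A[idx * STRIDE];
	}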
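
The cost of the strided read in the kernel can be estimated by counting how many memory sectors one warp touches. A minimal sketch in Python, assuming a warp of 32 threads, 4-byte floats, a 32-byte sector as the transaction granularity, and a sector-aligned base address; `sectors_touched` is an illustrative helper, not part of any CUDA tool.

```python
WARP_SIZE = 32
FLOAT_BYTES = 4
SECTOR_BYTES = 32  # assumed memory transaction granularity


def sectors_touched(stride, warp_size=WARP_SIZE):
    """Count distinct sectors read when thread t loads element t * stride."""
    addresses = [t * stride * FLOAT_BYTES for t in range(warp_size)]
    return len({addr // SECTOR_BYTES for addr in addresses})


for stride in (1, 2, 8, 16):
    print(f"stride {stride:2d}: {sectors_touched(stride)} sectors per warp")
```

Under these assumptions a unit-stride warp reads 4 sectors, while a stride of 16 floats (64 bytes) puts every thread in its own sector, so the warp reads 32 sectors for the same amount of useful data.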

	### AMDGPU

	cmake -G 'Ninja' \
	-DCMAKE_CXX_FLAGS="-D_GLIBCXX_USE_CXX11_ABI=0 -std=c++11" \
	-DCMAKE_BUILD_TYPE=Release \
	-DLLVM_ENABLE_PROJECTS="clang;lld;compiler-rt;libcxx;libcxxabi;" \
	-DLLVM_TARGETS_TO_BUILD="AMDGPU;X86;NVPTX" \
	-DLLVM_ENABLE_ASSERTIONS=On \
	../llvm