1. Chongzhi is a GPU server with 3 Nvidia RTX A5000 GPUs and 2
AMD CPUs.

2. Arnold is a GPU server with two Intel Xeon Gold 5220R
processors (each with 24 cores, 48 threads, 2.20GHz, 35.75MB
cache), 2TB total RAM, around 8TB of disk, and 10 Nvidia
Quadro RTX 8000 48GB GDDR6 GPUs.

3. Majda is a GPU server with 4 Nvidia A100 80GB GPUs, 512GB total
RAM, and 2 Intel Xeon Silver 4310 CPUs (each with 12 cores, 24
threads).

The following uses Arnold as an example; the steps are the same for
Chongzhi and Majda.

If you have a Linux desktop in the math department, simply use ssh (assume your math account ID is dave72 and your Linux desktop's hostname is euler):

euler ~ % ssh arnold

dave72@arnold's password:

Suppose we want to connect to arnold from an off-campus computer. From a Linux or Apple computer, open the terminal and connect to banach first (assume you have a MacBook and your username is dave72):

MacBook-Pro:~ dave% ssh dave72@banach.math.purdue.edu

dave72@banach.math.purdue.edu's password:

Then connect to arnold (you cannot ssh to arnold.math.purdue.edu directly from an off-campus computer):

banach ~ % ssh arnold

dave72@arnold's password:
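If you go through banach often, the two hops can be automated with OpenSSH's ProxyJump option on your own machine. A sketch of a ~/.ssh/config entry (the username dave72 and the alias arnold are carried over from the example above; adjust to yours):

```
# ~/.ssh/config on your off-campus laptop
Host arnold
    HostName arnold.math.purdue.edu
    User dave72
    ProxyJump dave72@banach.math.purdue.edu
```

With this entry, "ssh arnold" from off campus tunnels through banach automatically; you will still be asked for both passwords.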

If you have a Windows computer, you need to install an SSH client
such as PuTTY.

There is no job scheduler installed, so try to avoid using up all
the GPUs. Also, avoid running any intensive CPU jobs on arnold.

Use top to check current CPU usage.

arnold ~ % top

To check the current usage of the GPUs, use nvidia-smi:

arnold ~ % nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.57 Driver Version: 515.57 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Quadro RTX 8000 Off | 00000000:1A:00.0 Off | Off |
| 33% 24C P8 24W / 260W | 3MiB / 49152MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Quadro RTX 8000 Off | 00000000:1B:00.0 Off | Off |
| 33% 25C P8 22W / 260W | 3MiB / 49152MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Quadro RTX 8000 Off | 00000000:1C:00.0 Off | Off |
| 33% 27C P8 31W / 260W | 3MiB / 49152MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Quadro RTX 8000 Off | 00000000:1D:00.0 Off | Off |
| 33% 27C P8 24W / 260W | 3MiB / 49152MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 4 Quadro RTX 8000 Off | 00000000:1E:00.0 Off | Off |
| 33% 27C P8 33W / 260W | 3MiB / 49152MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 5 Quadro RTX 8000 Off | 00000000:3D:00.0 Off | Off |
| 33% 24C P8 28W / 260W | 3MiB / 49152MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 6 Quadro RTX 8000 Off | 00000000:3E:00.0 Off | Off |
| 33% 27C P8 25W / 260W | 3MiB / 49152MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 7 Quadro RTX 8000 Off | 00000000:3F:00.0 Off | Off |
| 33% 24C P8 22W / 260W | 3MiB / 49152MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 8 Quadro RTX 8000 Off | 00000000:40:00.0 Off | Off |
| 33% 27C P8 20W / 260W | 3MiB / 49152MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 9 Quadro RTX 8000 Off | 00000000:41:00.0 Off | Off |
| 33% 26C P8 23W / 260W | 3MiB / 49152MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+

The module command is used on arnold to manage installed software.

arnold ~ % module avail

You will see a list of installed software packages. Use module load
to load them. For example, if magma is needed,

arnold ~ % module load magma/2.20-10

Download the
testing code. Matlab 2023 is needed; it is available on
Arnold and Majda. This is an example of accelerating a simple 3D
Poisson solver on Majda. See Section 2.8 in the MA
615 notes for details of the simple eigenvector method to
invert the Laplacian, which has N^{4/3} complexity for a 3D problem
with N unknowns. See also this paper for more
details. Beware that GPU acceleration can be observed only for
large enough problems; e.g., a 100^3 grid might be too small to see
any acceleration.
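The eigenvector (fast diagonalization) method itself is easy to sketch. Below is a minimal NumPy version for the 2nd order scheme with zero Dirichlet boundary conditions; this is a sketch for illustration, not the downloadable demo code, and runs on CPU, but the same tensor contractions are what the Matlab/Jax demos run on GPU. Each solve costs O(n^4) = O(N^{4/3}) for N = n^3 unknowns:

```python
import numpy as np

def poisson3d_eig(f, h):
    """Solve -Laplacian u = f (zero Dirichlet BC) on a uniform n^3 grid by
    diagonalizing the 1D second-difference matrix K = S diag(lam) S, where
    S is the (symmetric, orthogonal) discrete sine transform matrix."""
    n = f.shape[0]
    k = np.arange(1, n + 1)
    S = np.sqrt(2.0 / (n + 1)) * np.sin(np.outer(k, k) * np.pi / (n + 1))
    lam = (4.0 / h**2) * np.sin(k * np.pi / (2 * (n + 1))) ** 2

    def transform(g):  # apply S along each of the three axes: O(n^4) work
        for ax in range(3):
            g = np.moveaxis(np.tensordot(S, g, axes=(1, ax)), 0, ax)
        return g

    fhat = transform(f)                 # into the eigenbasis
    fhat /= lam[:, None, None] + lam[None, :, None] + lam[None, None, :]
    return transform(fhat)              # back to physical space

# Manufactured solution u = sin(pi x) sin(pi y) sin(pi z), so f = 3 pi^2 u
n = 32
h = 1.0 / (n + 1)
x = h * np.arange(1, n + 1)
u_exact = np.einsum('i,j,k->ijk', np.sin(np.pi * x), np.sin(np.pi * x),
                    np.sin(np.pi * x))
u = poisson3d_eig(3 * np.pi**2 * u_exact, h)
print(np.max(np.abs(u - u_exact)))  # 2nd order accurate: error is O(h^2)
```

On a GPU, the three tensordot contractions become dense matrix multiplications, which is exactly the operation GPUs accelerate best.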

First, always remember to check which GPU device is available,
since there is no job queue and everything runs
interactively.

majda ~ % nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.106.00 Driver Version: 460.106.00 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 A100 80GB PCIe Off | 00000000:17:00.0 Off | 0 |
| N/A 47C P0 92W / 300W | 12858MiB / 81251MiB | 26% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 A100 80GB PCIe Off | 00000000:65:00.0 Off | 0 |
| N/A 55C P0 106W / 300W | 3224MiB / 81251MiB | 34% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 A100 80GB PCIe Off | 00000000:CA:00.0 Off | 0 |
| N/A 48C P0 92W / 300W | 3478MiB / 81251MiB | 30% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 A100 80GB PCIe Off | 00000000:E3:00.0 Off | 0 |
| N/A 38C P0 67W / 300W | 50936MiB / 81251MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1631494 C python 2641MiB |
| 0 N/A N/A 1631585 C python 2641MiB |
| 0 N/A N/A 1631629 C python 2641MiB |
| 0 N/A N/A 1631718 C python 2639MiB |
| 0 N/A N/A 1634638 C python 2291MiB |
| 1 N/A N/A 1634639 C python 3221MiB |
| 2 N/A N/A 1634637 C python 3475MiB |
| 3 N/A N/A 1555380 C ...r2023a/bin/glnxa64/MATLAB 50933MiB |
+-----------------------------------------------------------------------------+

In this case, GPU number 3 looks available while the other three
are being used. In Matlab, this device's number would be 4: GPU 0
is labeled as 1 in Matlab, so Matlab device numbers are the
nvidia-smi ids plus one. The demo code sets the default device ID
to 1.
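This check can also be scripted. A sketch: nvidia-smi has a CSV query mode (the flags --query-gpu=index,utilization.gpu,memory.used --format=csv,noheader,nounits) that emits one line per GPU, and a few lines of Python can pick the least-loaded device. The helper below only parses text, demonstrated here on the numbers from the Majda snapshot above:

```python
def least_loaded_gpu(csv_text):
    """Return the index of the GPU with the lowest utilization (ties broken
    by used memory), given the output of:
      nvidia-smi --query-gpu=index,utilization.gpu,memory.used \
                 --format=csv,noheader,nounits
    """
    rows = []
    for line in csv_text.strip().splitlines():
        idx, util, mem = (int(v) for v in line.split(","))
        rows.append((util, mem, idx))
    return min(rows)[2]

# Numbers taken from the Majda nvidia-smi snapshot above
sample = """\
0, 26, 12858
1, 34, 3224
2, 30, 3478
3, 0, 50936"""
print(least_loaded_gpu(sample))  # GPU 3 is idle, so it is picked
```

In practice you would feed it the live output, e.g. via subprocess.run(["nvidia-smi", ...]).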

Open Matlab in command-line mode:

majda ~ % matlab -nodisplay

< M A T L A B (R) >

Copyright 1984-2023 The MathWorks, Inc.

R2023a Update 2 (9.14.0.2254940) 64-bit (glnxa64)

April 17, 2023

Warning: X does not support locale C.UTF-8

To get started, type doc.

For product information, visit www.mathworks.com.

>> run ('Poisson3Ddemo.m')

This is a code solving 3D Poison on a grid of size 200 by 200 by 200

scheme is 2nd order centered difference

GPU computation: starting to load matrices/data

GPU computation: loading finished and GPU computing started

The ell-2 norm residue is 7.009260e-11

The GPU online computation time is 1.805100e-02

**On Majda, for a 1000^3 grid, the online computation costs
about 0.8 seconds of GPU computing time:**

~ % matlab -nodisplay

< M A T L A B (R) >

Copyright 1984-2023 The MathWorks, Inc.

R2023a Update 2 (9.14.0.2254940) 64-bit (glnxa64)

April 17, 2023

>> run ('Poisson3Ddemo.m')

This is a code solving 3D Poison on a grid of size 1000 by 1000 by 1000

scheme is 2nd order centered difference

GPU computation: starting to load matrices/data

GPU computation: loading finished and GPU computing started

The ell-2 norm residue is 4.851762e-09

The GPU online computation time is 7.683490e-01

The same method also applies to very high order finite element
methods on Cartesian meshes. See this page.

**Keep in mind that you should NOT run large CPU jobs on GPU
servers. Test large CPU jobs on your own desktop or on CPU
servers.** If running the demo code on a computer without any
GPU device, the code will do the computation on the CPU (you can
also simply set **Param.device = 'cpu'** in the demo code):

~ % matlab -nodisplay

< M A T L A B (R) >

Copyright 1984-2023 The MathWorks, Inc.

R2023a Update 2 (9.14.0.2254940) 64-bit (glnxa64)

April 17, 2023

>> run ('Poisson3Ddemo.m')

This is a code solving 3D Poison on a grid of size 200 by 200 by 200

scheme is 2nd order centered difference

The ell-2 norm residue is 6.990211e-11

The CPU online computation time is 1.212430e-01

On each GPU machine, e.g., Majda, install Jax in your local
account via conda, a tool for managing software environments.

First, create an environment named "myenv" (you can replace myenv with any other name). Then activate the environment "myenv" and install Jax inside it.

~ % conda create -n myenv

....

Proceed ([y]/n)? y

Preparing transaction: done

Verifying transaction: done

Executing transaction: done

~ % conda activate myenv

(myenv) % pip install --upgrade "jax[cuda12]"

Next, download two Python Jax demo codes for solving a 3D Poisson equation with second order finite differences: Jax_double.py is for double precision computing and Jax_single.py is for single precision.

For double precision, a problem size as large as 1000^3 should be
fine.

(myenv) % python Jax_double.py

Available GPUs:

W external/xla/xla/service/platform_util.cc:206] unable to create StreamExecutor for CUDA:0: failed..

[CudaDevice(id=1), CudaDevice(id=2), CudaDevice(id=3)]

Choosing to use GPU id= 2

Solving Poisson of size n^3 with n= 1000

precision: float64

Computational Time is 1.3882017135620117

ell 2 error: 2.2818226843766946e-05

**Remark**: Be aware that a GPU id can be out of range in Python
for various reasons. For example, on Majda there are supposed to
be four GPUs in Python: id=0, id=1, id=2, id=3. **In the
example above**, GPU id=0 was being used intensively, so the
available devices became "[CudaDevice(id=1), CudaDevice(id=2),
CudaDevice(id=3)]"; in this case
"jax.default_device=jax.devices("gpu")[3]" would raise a device
index out of range error. The remedy is to use
"jax.default_device=jax.devices("gpu")[2]" instead, i.e., id=3
sits at list index 2.
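The safe pattern is to look a physical id up in the visible-device list instead of indexing by id. A minimal sketch (the helper name is ours; with Jax you would pass [d.id for d in jax.devices("gpu")] as the first argument):

```python
def device_list_index(visible_ids, physical_id):
    """Map a physical GPU id (as printed by nvidia-smi) to its position in
    the list of devices that survived initialization."""
    try:
        return visible_ids.index(physical_id)
    except ValueError:
        raise RuntimeError(f"GPU {physical_id} is not visible") from None

# In the remark above, GPU 0 failed to initialize, so only ids 1,2,3 remain:
print(device_list_index([1, 2, 3], 3))  # id=3 sits at list index 2
```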

For single precision, we can push to a problem size as large as
1300^3.
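A quick sanity check on these sizes: one dense n^3 array costs n^3 times the bytes per entry, against 80GB per A100. A back-of-the-envelope sketch (how many such arrays the solver keeps resident at once is not counted here):

```python
def grid_gib(n, bytes_per_entry):
    """Memory of one dense n^3 array, in GiB."""
    return n**3 * bytes_per_entry / 2**30

print(f"1000^3 in float64: {grid_gib(1000, 8):.2f} GiB")  # about 7.45 GiB
print(f"1300^3 in float32: {grid_gib(1300, 4):.2f} GiB")  # about 8.18 GiB
```

So one 1300^3 single-precision array takes slightly more memory than one 1000^3 double-precision array, which is why halving the precision buys roughly a 1.3x larger grid per dimension.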

(myenv) % python Jax_single.py

Available GPUs:

[CudaDevice(id=0), CudaDevice(id=1), CudaDevice(id=2), CudaDevice(id=3)]

Choosing to use GPU id= 2

Solving Poisson of size n^3 with n= 1300

The preparation computation precision

precision: float64

The Poisson solver computation precision

precision: float32

Computational Time is 1.3602116107940674

ell 2 error: 0.00050684914

To exit the environment:

(myenv) % conda deactivate

Author: Xiangxiong Zhang.