Microsecond Latency, Real-Time, Multi-Input/Output Control using GPU Processing
Nikolaus Rath
March 20th, 2013
Outline
1 Motivation
2 GPU Control System Architecture
3 Performance
Fusion keeps the Sun Burning
Nuclear fusion is the process that keeps the sun burning.
Very hot hydrogen atoms (the "plasma") collide to form helium, releasing lots of energy:
²H + ³H → ⁴He (3.5 MeV) + n (14.1 MeV)
Would be great to replicate this on earth. Plenty of fuel available, and no risk of nuclear meltdown.
Challenges: heat things to millions of degrees (not so hard), and keep them confined (very hard).
At Millions of Degrees, Small Plasmas Evaporate Away
Magnetic Fields Constrain Plasma Movement to One Dimension
Closed Magnetic Fields Can Confine Plasmas
Tokamaks Confine Plasmas Using Magnetic Fields
Orange, magenta, green: magnetic-field-generating coils
Violet: plasma; blue: a single magnetic field line (example)
1 m radius, 1 million °C, 15,000 A current
Self-Generated Fields Cause Instabilities
Electric currents (which generate magnetic fields) flow not just
in the coils, but also in the plasma itself
The plasma thus modifies the fields that confine it
... sometimes in a self-amplifying way – instability
Typical shape: rotating, helical deformation. Timescale: 50
microseconds.
Only High-Speed Feedback Control Can Preserve Confinement
Sensors detect deformations due to plasma currents
Control coils dynamically push back – “feedback control”
Real-Time Performance is Determined by Latency and Sampling Period
[Diagram: sample packets flow from the digitizer through the GPU processing pipelines to the analog output; latency and sampling period are marked on the packet stream]
Latency is the response time of the feedback system
Sampling period determines smoothness
Algorithmic complexity limits latency, not sampling period
Need both latency and sampling period on the order of a few microseconds
Control Algorithm is Implemented in One Kernel
[Diagram: CPU and GPU timelines for the two approaches]
Traditional approach (one kernel launch per computation):
CPU: read input data, process data, send data to GPU memory, start GPU kernel A, wait for kernel A, read results from GPU memory, process results, send new data to GPU memory, start GPU kernel B, wait for kernel B, read results from GPU memory, ..., write output data
GPU: compute result a, compute result b, ... (idle between launches)
Single-kernel approach:
CPU: send parameters to GPU memory, start GPU kernel, wait for GPU kernel
GPU: read data, process data, compute result a, process results, write output data; read data, ..., compute result b, ... (loops for the duration of the shot)
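To illustrate the single-kernel approach, here is a minimal CUDA sketch of a persistent control kernel. The buffer layout, the sample-counter handshake, and the per-sample computation are assumptions for illustration, not the actual HBT-EP algorithm.

// Minimal persistent-kernel sketch (illustrative; names and the
// sample-counter handshake are assumptions).  in_buf and out_buf
// live in GPU memory that the A/D and D/A modules access directly
// via peer-to-peer DMA.
__global__ void control_loop(volatile short *in_buf,
                             volatile short *out_buf,
                             volatile int *sample_counter,
                             int n_in, int n_out, int n_samples)
{
    __shared__ int last_seen;
    if (threadIdx.x == 0)
        last_seen = 0;
    __syncthreads();

    for (int s = 0; s < n_samples; s++) {
        // Spin until the digitizer has pushed the next sample packet
        if (threadIdx.x == 0) {
            while (*sample_counter <= last_seen)
                ;
            last_seen = *sample_counter;
        }
        __syncthreads();

        // Placeholder computation: map each output channel to an
        // input channel (the real system applies its control algorithm)
        if (threadIdx.x < n_out)
            out_buf[threadIdx.x] = in_buf[threadIdx.x % n_in];
        __syncthreads();
    }
}

A single launch of this kernel replaces the per-sample launch/wait/copy cycle on the CPU side.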
Redundant PCIe Transfers Have to be Avoided to Reduce Latency
Traditional approach:
Data bounces through host RAM
The PCIe bus has multi-GB/s throughput
Transfer setup takes several µs
Okay if data chunks are big and transfer and processing take long
Bad if the setup latency exceeds the actual transfer time, as it does for small data chunks
New approach:
Peer-to-peer transfers eliminate the need for a bounce buffer in host RAM
Good performance even for small amounts of data
Can be implemented in software (kernel module)
The required peer-to-peer capable root complex is present in most mid- to high-end mainboards
Peer-to-Peer PCIe Transfers are Set Up by Sharing BARs
[Diagram: GPU, GPU memory, A/D module, and D/A module on the PCIe bus; each device exposes BARs initialized from the BIOS by the CPU, the A/D module's DMA controller writes into the GPU's BAR, and the D/A module's DMA controller reads from it]
PCIe devices communicate via "BARs" (base address registers) in the PCI address space
The GPU can map part of its memory into a BAR
The A/D and D/A modules can transfer to/from arbitrary PCI addresses
The CPU establishes communication by telling the A/D and D/A modules about the GPU BAR
This required some trickery in the past, but is officially supported as of CUDA 5
Example: Userspace
/* Allocate buffer with extra space for 64 KiB alignment */
CUdeviceptr dev_addr;
cuMemAlloc(&dev_addr, size + 0xFFFF);

/* Prepare mapping: retrieve the P2P tokens for this allocation */
CUDA_POINTER_ATTRIBUTE_P2P_TOKENS tokens;
cuPointerGetAttribute(&tokens, CU_POINTER_ATTRIBUTE_P2P_TOKENS,
                      dev_addr);

/* Round up to the next 64 KiB boundary */
dev_addr += 0xFFFF;
dev_addr &= ~(CUdeviceptr)0xFFFF;

/* Call custom kernel module to get the bus address;
 * @fd refers to the open device file */
struct rdma_info s;
s.dev_addr     = dev_addr;
s.p2pToken     = tokens.p2pToken;
s.vaSpaceToken = tokens.vaSpaceToken;
s.size         = size;
ioctl(fd, RDMA_TRANSLATE_TOKEN, &s);
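The snippet above assumes the CUDA driver API has already been initialized and that each call is checked for errors. A minimal sketch of that setup is shown below; the CHECK_CU macro and the init_cuda helper are my own illustration, not part of the original system.

#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>

/* Abort on any driver-API error (hypothetical helper) */
#define CHECK_CU(call)                                        \
    do {                                                      \
        CUresult err_ = (call);                               \
        if (err_ != CUDA_SUCCESS) {                           \
            fprintf(stderr, "%s failed: %d\n", #call, err_);  \
            exit(1);                                          \
        }                                                     \
    } while (0)

/* One-time driver-API setup that the snippet above relies on */
static CUcontext init_cuda(void)
{
    CUdevice  dev;
    CUcontext ctx;
    CHECK_CU(cuInit(0));
    CHECK_CU(cuDeviceGet(&dev, 0));
    CHECK_CU(cuCtxCreate(&ctx, 0, dev));
    return ctx;
}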
Example: Kernelspace
long rtm_t_dma_ioctl(struct file *filp, unsigned int cmd,
                     unsigned long arg) {
    nvidia_p2p_page_table_t *page_table;
    // ....
    switch (cmd) {
    case RDMA_TRANSLATE_TOKEN: {
        COPY_FROM_USER(&rdma_info, varg, sizeof(struct rdma_info));
        /* Pin the GPU pages and obtain their bus addresses */
        nvidia_p2p_get_pages(rdma_info.p2pToken, rdma_info.vaSpaceToken,
                             rdma_info.dev_addr, rdma_info.size,
                             &page_table, rdma_free_callback, tdev);
        /* Bus address of the first page: this is what the
         * A/D and D/A DMA engines will target */
        rdma_info.bus_addr = page_table->pages[0]->physical_address;
        COPY_TO_USER(varg, &rdma_info, sizeof(struct rdma_info));
        return 0;
    }
    // Other ioctls
    }
}
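The rdma_free_callback passed to nvidia_p2p_get_pages above is not shown on the slides; under the GPUDirect RDMA interface it is invoked when the pinned GPU memory goes away and should release the page table. A rough sketch, assuming the driver keeps the page table pointer in its device structure (struct rtm_t_dev and its fields are hypothetical):

static void rdma_free_callback(void *data)
{
    /* Hypothetical device structure holding the saved page table */
    struct rtm_t_dev *tdev = data;

    /* The GPU backing memory is being revoked: make sure no DMA
     * still targets it, then release the pinned page table. */
    nvidia_p2p_free_page_table(tdev->page_table);
    tdev->page_table = NULL;
}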
Userspace Continued
/* Continuing after the RDMA_TRANSLATE_TOKEN ioctl from the previous
 * example: the kernel module has filled in the bus address */

/* Retrieve bus address */
uint64_t bus_addr;
bus_addr = s.bus_addr;

/* Send bus address to digitizer */
init_rtm_t(bus_addr, other, stuff, here);

/* Start GPU kernel -- it polls the input buffer directly,
 * so no further CPU involvement is needed per sample */

/* Wait for GPU kernel to complete */
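What "start GPU kernel" and "wait for GPU kernel" might look like with the CUDA driver API is sketched below; the function handle, argument names, and launch dimensions are placeholders rather than the actual HBT-EP code, and CHECK_CU is the helper from the earlier sketch.

/* control_loop_func is a CUfunction obtained via cuModuleGetFunction();
 * in_dev, out_dev, counter_dev are device pointers set up as above. */
void *args[] = { &in_dev, &out_dev, &counter_dev,
                 &n_in, &n_out, &n_samples };

/* Launch a single block once per shot; the persistent kernel then
 * polls the input buffer written by the digitizer via P2P DMA. */
CHECK_CU(cuLaunchKernel(control_loop_func,
                        1, 1, 1,     /* grid dimensions   */
                        64, 1, 1,    /* block dimensions  */
                        0, NULL,     /* shared mem, stream */
                        args, NULL));

/* Block until the kernel has processed the whole shot. */
CHECK_CU(cuCtxSynchronize());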
The HBT-EP Plasma Control System was Built with Commodity Hardware
Hardware:
Workstation PC
NVIDIA GeForce GTX 580
D-TACQ ACQ196 A/D converter (96 channels, 16 bit)
2× D-TACQ AO32CPCI D/A converters (2 × 32 channels, 16 bit)
Standard Linux host system (no real-time kernel required!)
P2P Transfers Reduce Latency by 50%
[Plot: latency [µs] vs. sampling period [µs], comparing buffers in host RAM and GPU RAM]
Optimal latency when using host memory: 16 µs
Optimal latency when using GPU memory: 10 µs
The roughly 50% difference does not merely mean having to wait a bit longer; it can be the difference between the plasma blowing up or not.
GPU Beats CPU in Computational and Real-Time Performance even in the Microsecond Regime
Performance tested with repeated matrix application
GPU beats CPU down to 5 µs sampling periods
Missed samples counted in 1000 runs
Missed samples with GPU: none; with CPU: up to 2.5%
[Plot: achievable sampling period [µs] vs. matrix size for GPU and CPU]
[Histogram: count vs. missed samples [%] for GPU and CPU]
Summary
1 The advantages of GPUs are not restricted to large problems requiring long calculations.
2 Even when processing kB-sized batches under microsecond latency constraints, GPUs can be faster than CPUs, while at the same time offering better real-time performance.
3 In these regimes, data transfer overhead becomes the dominating factor, and using peer-to-peer transfers improves performance by more than 50%.
4 A GPU-based real-time control system has been developed at Columbia University and tested for the control of magnetically confined plasmas. Contact us for details.
Outline
4 Backup Slides
Latency and Sampling Period are Measured Experimentally by Copying Square Waves
[Oscilloscope traces, shot 76504: control input, control output, and sample clock vs. time [µs]]
Control algorithm set up to copy input to output 1:1
Blue trace is the input square wave
Green trace is the output square wave
Output lags behind the input by the control system latency
Red trace is the sample clock (sampling on the downward edge)
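A 1:1 passthrough is essentially a copy; a trivial sketch of such a kernel (channel layout and names are assumptions, not the measurement code itself) could be:

// Trivial 1:1 passthrough used for latency measurements:
// each output channel mirrors the corresponding input channel.
__global__ void passthrough(const volatile short *in_buf,
                            volatile short *out_buf,
                            int n_channels)
{
    int ch = blockIdx.x * blockDim.x + threadIdx.x;
    if (ch < n_channels)
        out_buf[ch] = in_buf[ch];
}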
Plasma Physics Results: Dominant Mode Amplitude Reduced by up to 60%
[Plot: mode amplitude vs. frequency [kHz] for no feedback and for feedback gains g = 144 and g = 577]
Feedback Control Uses Measurements to Determine Control Signals
[Block diagram: controller → actuators (control signal / control output) → physical system → sensors (measurements / control input) → back to controller]
Goal: keep the system in a specific state
If the system is perfectly known, the required control signals can be calculated in advance (open-loop control)
In practice, measurements are needed to determine the actual effect and update the signals: feedback control
A control system acquires measurements, performs computations, and generates control output to manipulate the system state.
Data Passthrough Establishes 8 µs Lower Latency Limit
[Plot: latency [µs] vs. sampling period [µs] for the passthrough configuration, host RAM vs. GPU RAM]
Control system uses same buffer to write input and read output
No GPU processing, so no difference between host and GPU
memory
Jump: 4 µs required for A-D conversion and data push
Offset: 4 µs required for data pull and D-A conversion