Microsecond Latency, Real-Time, Multi-Input/Output Control using GPU Processing
Nikolaus Rath
March 20th, 2013
Outline
1 Motivation
2 GPU Control System Architecture
3 Performance
Fusion keeps the Sun Burning
Nuclear fusion is the process that keeps the sun burning.
Very hot hydrogen atoms (the "plasma") collide to form helium, releasing lots of energy:
²H + ³H → ⁴He (3.5 MeV) + n (14.1 MeV)
Would be great to replicate this on earth. Plenty of fuel available, and no risk of nuclear meltdown.
Challenges: heat things to millions of degrees (not so hard), and keep them confined (very hard).
At Millions of Degrees, Small Plasmas Evaporate Away
Magnetic Fields Constrain Plasma Movement to One Dimension
Closed Magnetic Fields Can Confine Plasmas
Tokamaks Confine Plasmas Using Magnetic Fields
Orange, magenta, green: magnetic-field-generating coils
Violet: plasma; blue: a single magnetic field line (example)
1 m radius, 1 million °C, 15,000 A current
Self-Generated Fields Cause Instabilities
Electric currents (which generate magnetic fields) flow not just
in the coils, but also in the plasma itself
The plasma thus modifies the fields that confine it
... sometimes in a self-amplifying way – instability
Typical shape: rotating, helical deformation. Timescale: 50
microseconds.
Only High-Speed Feedback Control Can Preserve Confinement
Sensors detect deformations due to plasma currents
Control coils dynamically push back – “feedback control”
Real-Time Performance is Determined by Latency and Sampling Period
[Diagram: sample packets flow from the digitizer through the GPU processing pipelines to the analog output; latency and sampling period are marked on the packet stream]
Latency is the response time of the feedback system
Sampling period determines smoothness
Algorithmic complexity limits latency, not sampling period
Need both latency and sampling period on the order of a few microseconds
Control Algorithm is Implemented in One Kernel
[Diagram: CPU and GPU timelines for the two approaches]
Traditional approach (one kernel launch per computation):
CPU: read input data, process data, send data to GPU memory, start GPU kernel A, wait for kernel A, read results from GPU memory, process results, send new data to GPU memory, start GPU kernel B, wait for kernel B, read results from GPU memory, ..., write output data
GPU: compute result a, compute result b, ... (idle between launches)
Single-kernel approach:
CPU: send parameters to GPU memory, start GPU kernel, wait for GPU kernel
GPU: read data, process data, compute result a, process results, write output data; read data, ..., compute result b, ... (loops for the duration of the shot)
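To illustrate the single-kernel approach, here is a minimal CUDA sketch of a persistent control kernel. The buffer layout, the sample-counter handshake, and the per-sample computation are assumptions for illustration, not the actual HBT-EP algorithm.

// Minimal persistent-kernel sketch (illustrative; names and the
// sample-counter handshake are assumptions).  in_buf and out_buf
// live in GPU memory that the A/D and D/A modules access directly
// via peer-to-peer DMA.
__global__ void control_loop(volatile short *in_buf,
                             volatile short *out_buf,
                             volatile int *sample_counter,
                             int n_in, int n_out, int n_samples)
{
    __shared__ int last_seen;
    if (threadIdx.x == 0)
        last_seen = 0;
    __syncthreads();

    for (int s = 0; s < n_samples; s++) {
        // Spin until the digitizer has pushed the next sample packet
        if (threadIdx.x == 0) {
            while (*sample_counter <= last_seen)
                ;
            last_seen = *sample_counter;
        }
        __syncthreads();

        // Placeholder computation: map each output channel to an
        // input channel (the real system applies its control algorithm)
        if (threadIdx.x < n_out)
            out_buf[threadIdx.x] = in_buf[threadIdx.x % n_in];
        __syncthreads();
    }
}

A single launch of this kernel replaces the per-sample launch/wait/copy cycle on the CPU side.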
Redundant PCIe Transfers Have to be Avoided to Reduce Latency
Traditional approach:
Data bounces through host RAM
The PCIe bus has multi-GB/s throughput
Transfer setup takes several µs
Okay if data chunks are big and transfer and processing take long
Bad if the setup latency exceeds the actual transfer time, as it does for small data chunks
New approach:
Peer-to-peer transfers eliminate the need for a bounce buffer in host RAM
Good performance even for small amounts of data
Can be implemented in software (kernel module)
The required peer-to-peer capable root complex is present in most mid- to high-end mainboards
Peer-to-Peer PCIe Transfers are Set Up by Sharing BARs
[Diagram: GPU, GPU memory, A/D module, and D/A module on the PCIe bus; each device exposes BARs initialized from the BIOS by the CPU, the A/D module's DMA controller writes into the GPU's BAR, and the D/A module's DMA controller reads from it]
PCIe devices communicate via "BARs" (base address registers) in the PCI address space
The GPU can map part of its memory into a BAR
The A/D and D/A modules can transfer to/from arbitrary PCI addresses
The CPU establishes communication by telling the A/D and D/A modules about the GPU BAR
This required some trickery in the past, but is officially supported as of CUDA 5
Example: Userspace
/* Allocate buffer with extra space for 64 KiB alignment */
CUdeviceptr dev_addr;
cuMemAlloc(&dev_addr, size + 0xFFFF);

/* Prepare mapping: retrieve the P2P tokens for this allocation */
CUDA_POINTER_ATTRIBUTE_P2P_TOKENS tokens;
cuPointerGetAttribute(&tokens, CU_POINTER_ATTRIBUTE_P2P_TOKENS,
                      dev_addr);

/* Round up to the next 64 KiB boundary */
dev_addr += 0xFFFF;
dev_addr &= ~(CUdeviceptr)0xFFFF;

/* Call custom kernel module to get the bus address;
 * @fd refers to the open device file */
struct rdma_info s;
s.dev_addr     = dev_addr;
s.p2pToken     = tokens.p2pToken;
s.vaSpaceToken = tokens.vaSpaceToken;
s.size         = size;
ioctl(fd, RDMA_TRANSLATE_TOKEN, &s);
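The snippet above assumes the CUDA driver API has already been initialized and that each call is checked for errors. A minimal sketch of that setup is shown below; the CHECK_CU macro and the init_cuda helper are my own illustration, not part of the original system.

#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>

/* Abort on any driver-API error (hypothetical helper) */
#define CHECK_CU(call)                                        \
    do {                                                      \
        CUresult err_ = (call);                               \
        if (err_ != CUDA_SUCCESS) {                           \
            fprintf(stderr, "%s failed: %d\n", #call, err_);  \
            exit(1);                                          \
        }                                                     \
    } while (0)

/* One-time driver-API setup that the snippet above relies on */
static CUcontext init_cuda(void)
{
    CUdevice  dev;
    CUcontext ctx;
    CHECK_CU(cuInit(0));
    CHECK_CU(cuDeviceGet(&dev, 0));
    CHECK_CU(cuCtxCreate(&ctx, 0, dev));
    return ctx;
}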
Example: Kernelspace
long rtm_t_dma_ioctl(struct file *filp, unsigned int cmd,
                     unsigned long arg) {
    nvidia_p2p_page_table_t *page_table;
    // ....
    switch (cmd) {
    case RDMA_TRANSLATE_TOKEN: {
        COPY_FROM_USER(&rdma_info, varg, sizeof(struct rdma_info));
        /* Pin the GPU pages and obtain their bus addresses */
        nvidia_p2p_get_pages(rdma_info.p2pToken, rdma_info.vaSpaceToken,
                             rdma_info.dev_addr, rdma_info.size,
                             &page_table, rdma_free_callback, tdev);
        /* Bus address of the first page: this is what the
         * A/D and D/A DMA engines will target */
        rdma_info.bus_addr = page_table->pages[0]->physical_address;
        COPY_TO_USER(varg, &rdma_info, sizeof(struct rdma_info));
        return 0;
    }
    // Other ioctls
    }
}
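The rdma_free_callback passed to nvidia_p2p_get_pages above is not shown on the slides; under the GPUDirect RDMA interface it is invoked when the pinned GPU memory goes away and should release the page table. A rough sketch, assuming the driver keeps the page table pointer in its device structure (struct rtm_t_dev and its fields are hypothetical):

static void rdma_free_callback(void *data)
{
    /* Hypothetical device structure holding the saved page table */
    struct rtm_t_dev *tdev = data;

    /* The GPU backing memory is being revoked: make sure no DMA
     * still targets it, then release the pinned page table. */
    nvidia_p2p_free_page_table(tdev->page_table);
    tdev->page_table = NULL;
}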
Userspace Continued
/* Continuing after the RDMA_TRANSLATE_TOKEN ioctl from the previous
 * example: the kernel module has filled in the bus address */

/* Retrieve bus address */
uint64_t bus_addr;
bus_addr = s.bus_addr;

/* Send bus address to digitizer */
init_rtm_t(bus_addr, other, stuff, here);

/* Start GPU kernel -- it polls the input buffer directly,
 * so no further CPU involvement is needed per sample */

/* Wait for GPU kernel to complete */
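What "start GPU kernel" and "wait for GPU kernel" might look like with the CUDA driver API is sketched below; the function handle, argument names, and launch dimensions are placeholders rather than the actual HBT-EP code, and CHECK_CU is the helper from the earlier sketch.

/* control_loop_func is a CUfunction obtained via cuModuleGetFunction();
 * in_dev, out_dev, counter_dev are device pointers set up as above. */
void *args[] = { &in_dev, &out_dev, &counter_dev,
                 &n_in, &n_out, &n_samples };

/* Launch a single block once per shot; the persistent kernel then
 * polls the input buffer written by the digitizer via P2P DMA. */
CHECK_CU(cuLaunchKernel(control_loop_func,
                        1, 1, 1,     /* grid dimensions   */
                        64, 1, 1,    /* block dimensions  */
                        0, NULL,     /* shared mem, stream */
                        args, NULL));

/* Block until the kernel has processed the whole shot. */
CHECK_CU(cuCtxSynchronize());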
The HBT-EP Plasma Control System was Built with Commodity Hardware
Hardware:
Workstation PC
NVIDIA GeForce GTX 580
D-TACQ ACQ196 A/D converter (96 channels, 16 bit)
2× D-TACQ AO32CPCI D/A converters (2 × 32 channels, 16 bit)
Standard Linux host system (no real-time kernel required!)
P2P Transfers Reduce Latency by 50%
[Plot: latency [µs] vs. sampling period [µs], comparing buffers in host RAM and GPU RAM]
Optimal latency when using host memory: 16 µs
Optimal latency when using GPU memory: 10 µs
The roughly 50% difference does not merely mean having to wait a bit longer; it can be the difference between the plasma blowing up or not.
GPU Beats CPU in Computational and Real-Time Performance even in the Microsecond Regime
Performance tested with repeated matrix application
GPU beats CPU down to 5 µs sampling periods
Missed samples counted in 1000 runs
Missed samples with GPU: none; with CPU: up to 2.5%
[Plot: achievable sampling period [µs] vs. matrix size for GPU and CPU]
[Histogram: count vs. missed samples [%] for GPU and CPU]
Summary
1 The advantages of GPUs are not restricted to large problems requiring long calculations.
2 Even when processing kB-sized batches under microsecond latency constraints, GPUs can be faster than CPUs, while at the same time offering better real-time performance.
3 In these regimes, data transfer overhead becomes the dominating factor, and using peer-to-peer transfers improves performance by more than 50%.
4 A GPU-based real-time control system has been developed at Columbia University and tested for the control of magnetically confined plasmas. Contact us for details.
Outline
4 Backup Slides
Latency and Sampling Period are Measured Experimentally by Copying Square Waves
[Oscilloscope traces, shot 76504: control input, control output, and sample clock vs. time [µs]]
Control algorithm set up to copy input to output 1:1
Blue trace is the input square wave
Green trace is the output square wave
Output lags behind the input by the control system latency
Red trace is the sample clock (sampling on the downward edge)
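A 1:1 passthrough is essentially a copy; a trivial sketch of such a kernel (channel layout and names are assumptions, not the measurement code itself) could be:

// Trivial 1:1 passthrough used for latency measurements:
// each output channel mirrors the corresponding input channel.
__global__ void passthrough(const volatile short *in_buf,
                            volatile short *out_buf,
                            int n_channels)
{
    int ch = blockIdx.x * blockDim.x + threadIdx.x;
    if (ch < n_channels)
        out_buf[ch] = in_buf[ch];
}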
Plasma Physics Results: Dominant Mode Amplitude Reduced by up to 60%
[Plot: mode amplitude vs. frequency [kHz] for no feedback and for feedback gains g = 144 and g = 577]
Feedback Control Uses Measurements to Determine Control Signals
[Block diagram: controller → actuators (control signal / control output) → physical system → sensors (measurements / control input) → back to controller]
Goal: keep the system in a specific state
If the system is perfectly known, the required control signals can be calculated in advance (open-loop control)
In practice, measurements are needed to determine the actual effect and update the signals: feedback control
A control system acquires measurements, performs computations, and generates control output to manipulate the system state.
Data Passthrough Establishes 8 µs Lower Latency Limit
[Plot: latency [µs] vs. sampling period [µs] for the passthrough configuration, host RAM vs. GPU RAM]
Control system uses same buffer to write input and read output
No GPU processing, so no difference between host and GPU
memory
Jump: 4 µs required for A-D conversion and data push
Offset: 4 µs required for data pull and D-A conversion