The release notes for the CUDA Toolkit can be found online at http://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html.
- Improved kernel launch latency (using the <<< >>> syntax and the cudaLaunchKernel API) for both multithreaded and multi-GPU code by up to a factor of 2 compared to CUDA 9.0.
- Added support for unified memory with address translation services (ATS) on IBM POWER9.
- Added arithmetic operators for the __half2 data type and a volatile assignment operator for the __half data type.
- Added version 6.2 of the Parallel Thread Execution instruction set architecture (ISA). For details about new instructions (activemask, FP16, and atomics) and deprecated instructions, see Parallel Thread Execution ISA Version 6.2 in the PTX documentation.
- IPC functionality is now supported on Windows.
- Added P2P write and read bandwidth and latency metrics to the p2pBandwidthLatencyTest sample.
- Thrust now uses CUB v1.7.5.
- Added performance optimizations in Thrust for the templated complex type.
- Added support for new operating systems. For a list of operating systems supported by CUDA, see the following information in the installation guides:
- Changed the behavior of CUDA_DEVICE_ORDER=FASTEST_FIRST so that GPUs are enumerated in descending order of performance.
- Added a new driver API cuStreamGetCtx to retrieve the context associated with a stream. This API is primarily used by the multidevice cooperative launch runtime API to ensure that the specified function's module is loaded in the right context.
- Added support for full core dump generation on Linux, using named pipes, for both MPS-based and non-MPS CUDA applications.
- Added these new helper APIs for cooperative groups:
  - grid_dim() to get the 3-dimensional grid size
  - block_dim() to get the 3-dimensional block size
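
The kernel launch latency item above refers to two launch paths: the <<< >>> syntax and the cudaLaunchKernel API. As an illustration only, the following minimal sketch launches the same kernel through both paths. The kernel add_delta and the host function run_launch_example are hypothetical names invented for this example; they are not part of the toolkit. The sketch assumes that a CUDA-capable device is present and keeps error checking to a minimum.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: adds `delta` to every element.
__global__ void add_delta(int *data, int n, int delta)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] += delta;
    }
}

// Hypothetical host helper: launches add_delta once per launch path and
// returns 0 if every element ends up equal to 1 + 2 = 3, or 1 otherwise.
int run_launch_example()
{
    const int n = 256;
    int host[n] = {0};
    int *dev = nullptr;

    if (cudaMalloc(&dev, sizeof(host)) != cudaSuccess) {
        return 1;
    }
    if (cudaMemcpy(dev, host, sizeof(host), cudaMemcpyHostToDevice) != cudaSuccess) {
        cudaFree(dev);
        return 1;
    }

    dim3 block(128);
    dim3 grid((n + block.x - 1) / block.x);

    // Path 1: the <<< >>> launch syntax.
    add_delta<<<grid, block>>>(dev, n, 1);

    // Path 2: the cudaLaunchKernel API. Each entry of `args` points to
    // the storage of one kernel argument, in declaration order.
    int count = n;
    int delta = 2;
    void *args[] = { &dev, &count, &delta };
    cudaError_t err = cudaLaunchKernel((const void *)add_delta,
                                       grid, block, args, 0, nullptr);

    // The blocking copy on the default stream waits for both kernels.
    if (err == cudaSuccess) {
        err = cudaMemcpy(host, dev, sizeof(host), cudaMemcpyDeviceToHost);
    }
    cudaFree(dev);
    if (err != cudaSuccess) {
        return 1;
    }

    for (int i = 0; i < n; ++i) {
        if (host[i] != 3) {
            return 1;
        }
    }
    return 0;
}
```

To try the sketch, call run_launch_example() from a main function and compile the file with nvcc. Both launches go to the default stream, so they execute in order and no explicit synchronization is needed between them.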