The following checklist will help you determine how to improve the performance of your driver:
Avoid using I/O ranges in your hardware design. Use Memory mapped ranges instead as they are accessed significantly faster.
If the problem persists, then there is a hardware design problem. You will not be able to increase performance by using any software design method, writing a Kernel PlugIn, or even by writing a full kernel driver.
PCI target access is usually slower than PCI master access (bus-master DMA). For large data transfers, bus-master DMA access is preferable. Read 11.2. Performing Direct Memory Access (DMA) to learn how to use WinDriver to implement bus master DMA.
There are two PCI transfer options: reading/writing to FIFO (string transfers), or reading/writing to different memory blocks.
In the case of reading/writing to memory blocks, the data is transferred to/from a memory address in the PC from/to a memory address in the card, and then both the card’s memory address and the PC’s memory address are incremented for the next transfer. This way, the data is transferred between the address in the PC and the same relative address in the card.
In the case of reading/writing to FIFO, the data is transferred to/from a memory address in the PC from/to a single address in the card, and then only the PC’s memory address is incremented for the next transfer. This way the data is transferred between incremented memory addresses in the PC and the same FIFO in the card’s memory.
For more information on PCI transfers, please refer to the description of WinDriver’s WD_Transfer()
function, which executes a read/write instruction to an I/O port or to a memory address.
As a general rule, transfers to memory-mapped regions are faster than transfers to I/O-mapped regions, because WinDriver enables you to access memory-mapped regions directly from the user mode, without the need for a function call, as explained in 11.1.3.1. Using Direct Access to Memory-Mapped Regions.
In addition, the WinDriver APIs enable you to improve the performance of your I/O and memory data transfers by using block (string) transfers and by grouping several data transfers into a single function call, as explained in 11.1.3.2. Block Transfers and Grouping Multiple Transfers (Burst Transfer).
When using the low-level WD_xxx() APIs, the user-mode and kernel-mode mappings of the card’s physical memory regions are returned by WD_CardRegister() within the pTransAddr and pUserDirectAddr fields of the pCardReg->Card.Item[i]
card resource item structures. The pTransAddr result should be used as a base address in calls to WD_Transfer() or WD_MultiTransfer() or when accessing memory directly from a Kernel PlugIn driver (Chapter 12: Understanding the Kernel PlugIn).
To access the memory directly from your user-mode process, use pUserDirectAddr as a regular pointer. Whatever the method you select to access the memory on your card, it is important to align the base address according to the size of the data type, especially when issuing string transfer commands. Otherwise, the transfers are split into smaller portions.
The easiest way to align data is to use basic types when defining a buffer, i.e.:
Block Transfers and Grouping Multiple Transfers (Burst Transfer)
The main methods for transferring large amounts of data between PCI-device memory addresses and a host machine’s random-access memory (RAM) are block transfers (which may or may not result in PCI burst transfers), and bus-master DMA.
Block transfers are easier to implement, but DMA is much more effective and reliable for transferring large amounts of data, as explained in 11.2. Performing Direct Memory Access (DMA).
You can use the WinDriver WDC_ReadAddrBlock() and WDC_WriteAddrBlock() functions (or the low-level WD_Transfer() function with a string command) to perform block (string) transfers — i.e., transfer blocks of data from the device memory (read) or to the device memory (write). You can use also WDC_MultiTransfer() (or the low-level WD_MultiTransfer() function) to group multiple block transfers into a single function call. This is more efficient than performing multiple single transfers. The WinDriver block-transfer functions use assembler string instructions (such as REP MOVSD
, or a 64-bit MMX instruction for 64-bit transfers) to move a block of memory between PCI-mapped memory on the device and the host’s RAM. From a software perspective, this is the most that can be done to attempt to initiate PCI burst transfers.
The hardware uses PCI burst mode to perform burst transfers — i.e., transfer the data in “bursts” of block reads/writes, resulting in a small performance improvement compared to the alternative of single WORD transfers. Some host controllers implement burst transfers by grouping access to successive PCI addresses into PCI bursts.
The host-side software has no way to control whether a target PCI transfer is issued as a burst transfer. The most the host can do is initiate transfers using assembler string instructions — as done by the WinDriver block-transfer APIs — but there’s no guarantee that this will translate into burst transfers, as this is entirely up to the hardware. Most PCI host controllers support PCI burst mode for write transfers. It is generally less common to find similar burst-mode support for PCI readtransfers.
To sum it up, to transfer large amounts of data to/from memory addresses or I/O addresses (which by definition cannot be accessed directly, as opposed to memory addresses), use the following methods to improve performance by reducing the function calls overhead and context switches between the user and kernel modes:
Performing 64-Bit Data Transfers
The ability to perform actual 64-bit transfers is dependent on the existence of support for such transfers by the hardware, CPU, bridge, etc., and can be affected by any of these factors or their specific combination.
WinDriver supports 64-bit PCI data transfers on the supported 64-bit platforms, as well as on Windows and Linux 32-bit x86 platforms. If your PCI hardware (card and bus) is 64-bit, the ability to perform 64-bit data transfers on 32-bit platforms will enable you to utilize your hardware’s broader bandwidth, even if your host operating system is only 32-bit.
However, note that:
- The ability to perform actual 64-bit transfers requires that such transfers be supported by the hardware — including the CPU, the PCI card, the PCI host controller, and the PCI bridge — and it can be affected by any of these components or their specific combination.
- The conventional wisdom among hardware engineers is that performing two 32-bit DWORD transfers is more efficient than performing a single 64-bit QWORD transfer; the reason is that the 64-bit transfer requires an additional CPU cycle to negotiate a 64-bit transfer mode, and this cycle can be used, instead, to perform a second 32-bit transfer. Therefore, performing 64-bit transfers is generally more advisable if you wish to transfer more than 64 bits of data in a single burst.
This innovative technology makes possible data transfer rates previously unattainable on 32-bit platforms. Drivers developed using WinDriver will attain significantly better performance results than drivers written with the WDK or other driver development tools.
To date, such tools do not enable 64-bit data transfer on x86 platforms running 32-bit operating systems. Jungo’s benchmark performance testing results for 64-bit data transfer indicate a significant improvement of data transfer rates compared to 32-bit data transfer, guaranteeing that drivers developed with WinDriver will achieve far better performance than 32-bit data transfer normally allows.
You can perform 64-bit data transfers using any of the following methods:
You can also perform 64-bit transfers to/from the PCI configuration space using WDC_PciReadCfg64() and WDC_PciReadCfgBySlot64() / WDC_PciWriteCfgBySlot64() .
The best way to improve the performance of large PCI memory data transfers is by using bus-master direct memory access (DMA), and not by performing block transfers (which as explained above, may or may not result in PCI burst transfers).
Most PCI architectures today provide DMA capability, which enables data to be transferred directly between memory-mapped addresses on the PCI device and the host’s RAM, freeing the CPU from involvement in the data transfer and thus improving the host’s performance.
DMA data-buffer sizes are limited only by the size of the host’s RAM and the available memory.
Performing Direct Memory Access (DMA)
This section describes how to use WinDriver to implement bus-master Direct Memory Access (DMA) for devices capable of acting as bus masters. Such devices have a DMA controller, which the driver should program directly.
DMA is a capability provided by some computer bus architectures — including PCI and PCIe — which allows data to be sent directly from an attached device to the memory on the host, freeing the CPU from involvement with the data transfer and thus improving the host’s performance.
A DMA buffer can be allocated in two ways:
- Contiguous buffer — A contiguous block of memory is allocated.
- Scatter/Gather — The allocated buffer can be fragmented in the physical memory and does not need to be allocated contiguously. The allocated physical memory blocks are mapped to a contiguous buffer in the calling process’s virtual address space, thus enabling easy access to the allocated physical memory blocks.
The programming of a device’s DMA controller is hardware specific. Normally, you need to program your device with the local address (on your device), the host address (the physical memory address on your PC) and the transfer count (the size of the memory block to transfer), and then set the register that initiates the transfer.
WinDriver provides you with API for implementing both contiguous-buffer DMA and Scatter/Gather DMA (if supported by the hardware) — see the description of WDC_DMAContigBufLock(), and WDC_DMABufUnlock() .
The following sections include code samples that demonstrate how to use WinDriver to implement Scatter/Gather DMA and contiguous-buffer DMA, and an explanation on how to preallocate contiguous DMA buffers on Windows.
The sample routines demonstrate using either an interrupt mechanism or a polling mechanism to determine DMA completion.
The sample routines allocate a DMA buffer and enable DMA interrupts (if polling is not used) and then free the buffer and disable the interrupts (if enabled) for each DMA transfer. However, when you implement your actual DMA code, you can allocate DMA buffer(s) once, at the beginning of your application, enable the DMA interrupts (if polling is not used), then perform DMA transfers repeatedly, using the same buffer(s), and disable the interrupts (if enabled) and free the buffer(s) only when your application no longer needs to perform DMA.
Implementing Scatter/Gather DMA
Following is a sample routine that uses WinDriver’s WDC API to allocate a Scatter/Gather DMA buffer and perform bus-master DMA transfers.
C Example
A more detailed example, which is specific to the enhanced support for PLX chipsets can be found in the
WinDriver/samples/c/plx/lib/plx_lib.c
library file and
WinDriver/samples/c/plx/diag_lib/plx_diag_lib.c
diagnostics library file (which utilizes the plx_lib.c
DMA API).
BOOL DMARoutine(WDC_DEVICE_HANDLE hDev, DWORD dwDMABufSize,
UINT32 u32LocalAddr, BOOL fPolling, BOOL fToDev)
{
PVOID pBuf;
WD_DMA *pDma = NULL;
BOOL fRet = FALSE;
/* Allocate a user-mode buffer for Scatter/Gather DMA */
pBuf = malloc(dwDMABufSize);
if (!pBuf)
return FALSE;
/* Lock the DMA buffer and program the DMA controller */
if (!DMAOpen(hDev, pBuf, u32LocalAddr, dwDMABufSize, fToDev, &pDma))
goto Exit;
/* Enable DMA interrupts (if not polling) */
if (!fPolling)
{
if (!MyDMAInterruptEnable(hDev, MyDmaIntHandler, pDma))
goto Exit; /* Failed enabling DMA interrupts */
}
/* Flush the CPU caches (see documentation of WDC_DMASyncCpu()) */
WDC_DMASyncCpu(pDma);
/* Start DMA - write to the device to initiate the DMA transfer */
MyDMAStart(hDev, pDma);
/* Wait for the DMA transfer to complete */
MyDMAWaitForCompletion(hDev, pDma, fPolling);
/* Flush the I/O caches (see documentation of WDC_DMASyncIo()) */
WDC_DMASyncIo(pDma);
fRet = TRUE;
Exit:
DMAClose(hDev, pDma, fPolling);
free(pBuf);
return fRet;
}
/* DMAOpen: Locks a Scatter/Gather DMA buffer */
BOOL DMAOpen(WDC_DEVICE_HANDLE hDev, PVOID pBuf, UINT32 u32LocalAddr, DWORD dwDMABufSize, BOOL fToDev, WD_DMA **ppDma)
{
DWORD dwStatus, i;
DWORD dwOptions = fToDev ? DMA_TO_DEVICE : DMA_FROM_DEVICE;
/* Lock a Scatter/Gather DMA buffer */
dwStatus = WDC_DMASGBufLock(hDev, pBuf, dwOptions, dwDMABufSize, ppDma);
if (WD_STATUS_SUCCESS != dwStatus)
{
printf("Failed locking a Scatter/Gather DMA buffer. Error 0x%lx - %s\n",
dwStatus, Stat2Str(dwStatus));
return FALSE;
}
/* Program the device's DMA registers for each physical page */
MyDMAProgram((*ppDma)->Page, (*ppDma)->dwPages, fToDev, u32LocalAddr);
return TRUE;
}
/* DMAClose: Unlocks a previously locked Scatter/Gather DMA buffer */
void DMAClose(WDC_DEVICE_HANDLE hDev, WD_DMA *pDma, BOOL fPolling)
{
/* Disable DMA interrupts (if not polling) */
if (!fPolling)
MyDMAInterruptDisable(hDev);
/* Unlock and free the DMA buffer */
WDC_DMABufUnlock(pDma);
C# Example
A more detailed example, which is specific to the enhanced support for PLX chipsets can be found in the WinDriver/samples/c/plx/dotnet/lib/
library and WinDriver/samples/c/plx/dotnet/diag/
diagnostics library.
bool DMARoutine(PCI_Device dev, DWORD dwDMABufSize, UINT32 u32LocalAddr, bool fPolling, bool fToDev, IntHandler MyDmaIntHandler)
{
bool fRet = false;
IntPtr pBuf = IntPtr.Zero;
Log log = new Log(new Log.TRACE_LOG(TraceLog),
new Log.ERR_LOG(ErrLog));
WD_DMA wdDma = new WD_DMA();
DmaBuffer dmaBuf = new DmaBufferSG(dev, log);
/* Allocate user mode buffer for Scatter/Gather DMA */
pBuf = Marshal.AllocHGlobal((int)dwDMABufSize);
if (pBuf == IntPtr.Zero)
{
fRet = false;
goto Exit;
}
/* Lock the DMA buffer and program the DMA controller */
if (!DMAOpen(ref dmaBuf, pBuf, u32LocalAddr, dwDMABufSize, fToDev, ref wdDma))
{
fRet = false;
goto Exit;
}
/* Enable DMA interrupts (if not polling) */
if (!fPolling)
{
if (!MyDMAInterruptEnable(dev, MyDmaIntHandler, ref wdDma))
{
fRet = false;
goto Exit; /* Failed enabling DMA interrupts */
}
}
/* Flush the CPU caches (see documentation of WDC_DMASyncCpu()) */
wdc_lib_decl.WDC_DMASyncCpu(dmaBuf.pWdDma);
/* Start DMA - write to the device to initiate the DMA transfer */
MyDmaStart(dev, ref wdDma);
/* Wait for the DMA transfer to complete */
MyDMAWaitForCompletion(dev, ref wdDma, fPolling);
/* Flush the I/O caches (see documentation of WDC_DMASyncIo()) */
wdc_lib_decl.WDC_DMASyncIo(dmaBuf.pWdDma);
fRet = true;
Exit:
DMAClose(dev, dmaBuf, fPolling);
if (pBuf != IntPtr.Zero)
Marshal.FreeHGlobal(pBuf);
return fRet;
}
/* OpenDMA: Locks a Scatter/Gather DMA buffer */
public bool DMAOpen(ref DmaBuffer dmaBuf, IntPtr pBuf, UINT32 u32LocalAddr, DWORD bufSize, bool fToDev, ref WD_DMA wdDma)
{
IntPtr pDma = dmaBuf.pWdDma;
WDC_DEVICE_HANDLE hDev = dmaBuf.DeviceHandle;
DWORD dwOptions = fToDev ? (DWORD)WD_DMA_OPTIONS.DMA_TO_DEVICE :
(DWORD)WD_DMA_OPTIONS.DMA_FROM_DEVICE;
wdDma = MarshalDMA(pDma);
/* Lock a Scatter/Gather DMA buffer */
if (wdc_lib_decl.WDC_DMASGBufLock(hDev, pBuf, dwOptions, bufSize, ref pDma) != (DWORD)wdc_err.WD_STATUS_SUCCESS)
return false;
/* Program the device's DMA registers for each physical page */
MyDMAProgram(fToDev, wdDma.Page[0], wdDma.dwPages, u32LocalAddr);
return true;
}
/* DMAClose: Unlocks a previously locked Scatter/Gather DMA buffer */
public void DMAClose(PCI_Device dev, DmaBuffer dmaBuf, bool fPolling)
{
/* Disable DMA interrupts (if not polling) */
if (!fPolling)
MyDMAInterruptDisable(dev);
/* Unlock and free the DMA buffer */
wdc_lib_decl.WDC_DMABufUnlock(dmaBuf.pWdDma);
}
Notice the difference between ref WD_DMA wdDma and IntPtr pWdDma. The former is used by C# user functions, while the latter is used by WDC API C functions, that can only take a pointer, not a reference.
What Should You Implement?
In the code sample above, it is up to you to implement the following MyDMAxxx() routines, according to your device’s specification:
- MyDMAProgram(): Program the device’s DMA registers. Refer the device’s data sheet for the details.
- MyDMAStart(): Write to the device to initiate DMA transfers.
- MyDMAInterruptEnable() and MyDMAInterruptDisable(): Use WDC_IntEnable() / WDC_IntDisable() (respectively) to enable/disable the software interrupts and write/read the relevant register(s) on the device in order to physically enable/disable the hardware DMA interrupts
(see 11.3. Performing Direct Memory Access (DMA) transactions for details regarding interrupt handling with WinDriver.)
- MyDMAWaitForCompletion(): Poll the device for completion or wait for “DMA DONE” interrupt.
- MyDmaIntHandler: The device’s interrupt handler. IntHandler is a pointer to a function prototype of your choice.
When using the basic WD_xxx API to allocate a Scatter/Gather DMA buffer that is larger than 1MB, you need to set the [DMA_LARGE_BUFFER] (DMA_LARGE_BUFFER) flag in the call to WD_DMALock() and allocate memory for the additional memory pages.
However, when using WDC_DMASGBufLock() to allocate the DMA buffer, you do not need any special implementation for allocating large buffers, since the function handles this for you.
Implementing Contiguous-Buffer DMA
Following is a sample routine that uses WinDriver’s WDC API to allocate a contiguous DMA buffer and perform bus-master DMA transfers.
For more detailed, hardware-specific, contiguous DMA examples, refer to the following enhanced-support chipset sample library files:
- PLX —
WinDriver/samples/c/plx/lib/plx_lib.c
and
WinDriver/samples/c/plx/diag_lib/plx_diag_lib.c
(which utilizes the plx_lib.c
DMA API)
- Xilinx Bus Master DMA (BMD) design —
WinDriver/samples/c/xilinx/bmd_design/bmd_lib.c
C Example
BOOL DMARoutine(WDC_DEVICE_HANDLE hDev, DWORD dwDMABufSize,
UINT32 u32LocalAddr, BOOL fPolling, BOOL fToDev)
{
PVOID pBuf = NULL;
WD_DMA *pDma = NULL;
BOOL fRet = FALSE;
/* Allocate a DMA buffer and open DMA for the selected channel */
if (!DMAOpen(hDev, &pBuf, u32LocalAddr, dwDMABufSize, fToDev, &pDma))
goto Exit;
/* Enable DMA interrupts (if not polling) */
if (!fPolling)
{
if (!MyDMAInterruptEnable(hDev, MyDmaIntHandler, pDma))
goto Exit; /* Failed enabling DMA interrupts */
}
/* Flush the CPU caches (see documentation of WDC_DMASyncCpu()) */
WDC_DMASyncCpu(pDma);
/* Start DMA - write to the device to initiate the DMA transfer */
MyDMAStart(hDev, pDma);
/* Wait for the DMA transfer to complete */
MyDMAWaitForCompletion(hDev, pDma, fPolling);
/* Flush the I/O caches (see documentation of WDC_DMASyncIo()) */
WDC_DMASyncIo(pDma);
fRet = TRUE;
Exit:
DMAClose(hDev, pDma, fPolling);
return fRet;
}
/* DMAOpen: Allocates and locks a contiguous DMA buffer */
BOOL DMAOpen(WDC_DEVICE_HANDLE hDev, PVOID *ppBuf, UINT32 u32LocalAddr,
DWORD dwDMABufSize, BOOL fToDev, WD_DMA **ppDma)
{
DWORD dwStatus;
DWORD dwOptions = fToDev ? DMA_TO_DEVICE : DMA_FROM_DEVICE;
/* Allocate and lock a contiguous DMA buffer */
dwStatus = WDC_DMAContigBufLock(hDev, ppBuf, dwOptions, dwDMABufSize, ppDma);
if (WD_STATUS_SUCCESS != dwStatus)
{
printf("Failed locking a contiguous DMA buffer. Error 0x%lx - %s\n",
dwStatus, Stat2Str(dwStatus));
return FALSE;
}
/* Program the device's DMA registers for the physical DMA page */
MyDMAProgram((*ppDma)->Page, (*ppDma)->dwPages, fToDev, u32LocalAddr);
return TRUE;
}
/* DMAClose: Frees a previously allocated contiguous DMA buffer */
void DMAClose(WDC_DEVICE_HANDLE hDev, WD_DMA *pDma, BOOL fPolling)
{
/* Disable DMA interrupts (if not polling) */
if (!fPolling)
MyDMAInterruptDisable(hDev);
/* Unlock and free the DMA buffer */
WDC_DMABufUnlock(pDma);
}
C# Example
bool DMARoutine(PCI_Device dev, DWORD dwDMABufSize, UINT32 u32LocalAddr,
bool fPolling, bool fToDev, IntHandler MyDmaIntHandler)
{
bool fRet = false;
IntPtr pBuf = IntPtr.Zero;
Log log = new Log(new Log.TRACE_LOG(TraceLog),
new Log.ERR_LOG(ErrLog));
WD_DMA wdDma = new WD_DMA();
DmaBuffer dmaBuf = new DmaBufferContig(dev, log);
/* Lock the DMA buffer and program the DMA controller */
if (!DMAOpen(ref dmaBuf, pBuf, u32LocalAddr, dwDMABufSize,
fToDev, ref wdDma))
{
fRet = false;
goto Exit;
}
/* Enable DMA interrupts (if not polling) */
if (!fPolling)
{
if (!MyDMAInterruptEnable(dev, MyDmaIntHandler, ref wdDma))
{
fRet = false;
goto Exit; /* Failed enabling DMA interrupts */
}
}
/* Flush the CPU caches (see documentation of WDC_DMASyncCpu()) */
wdc_lib_decl.WDC_DMASyncCpu(dmaBuf.pWdDma);
/* Start DMA - write to the device to initiate the DMA transfer */
MyDmaStart(dev, ref wdDma);
/* Wait for the DMA transfer to complete */
MyDMAWaitForCompletion(dev, ref wdDma, fPolling);
/* Flush the I/O caches (see documentation of WDC_DMASyncIo()) */
wdc_lib_decl.WDC_DMASyncIo(dmaBuf.pWdDma);
fRet = true;
Exit:
DMAClose(dev, dmaBuf, fPolling);
return fRet;
}
/* OpenDMA: Locks a Contiguous DMA buffer */
public bool DMAOpen(ref DmaBuffer dmaBuf, IntPtr pBuf, UINT32 u32LocalAddr,
DWORD bufSize, bool fToDev, ref WD_DMA wdDma)
{
IntPtr pDma = dmaBuf.pWdDma;
WDC_DEVICE_HANDLE hDev = dmaBuf.DeviceHandle;
DWORD dwOptions = fToDev ? (DWORD)WD_DMA_OPTIONS.DMA_TO_DEVICE :
(DWORD)WD_DMA_OPTIONS.DMA_FROM_DEVICE;
wdDma = MarshalDMA(pDma);
/* Lock a Scatter/Gather DMA buffer */
if (wdc_lib_decl.WDC_DMAContigBufLock(hDev, ref pBuf, dwOptions,
bufSize, ref pDma) != (DWORD)wdc_err.WD_STATUS_SUCCESS)
return false;
/* Program the device's DMA registers for each physical page */
MyDMAProgram(fToDev, wdDma.Page[0], wdDma.dwPages, u32LocalAddr);
return true;
}
/* DMAClose: Unlocks a previously locked Scatter/Gather DMA buffer */
public void DMAClose(PCI_Device dev, DmaBuffer dmaBuf, bool fPolling)
{
/* Disable DMA interrupts (if not polling) */
if (!fPolling)
MyDMAInterruptDisable(dev);
/* Unlock and free the DMA buffer */
wdc_lib_decl.WDC_DMABufUnlock(dmaBuf.pWdDma);
}
What Should You Implement?
In the code sample above, it is up to you to implement the following MyDMAxxx() routines, according to your device’s specification:
- MyDMAProgram(): Program the device’s DMA registers. Refer the device’s data sheet for the details.
- MyDMAStart(): Write to the device to initiate DMA transfers.
- MyDMAInterruptEnable() and MyDMAInterruptDisable(): Use WDC_IntEnable() / WDC_IntDisable() (respectively) to enable/disable the software interrupts and write/read the relevant register(s) on the device in order to physically enable/disable the hardware DMA interrupts
(see 11.3. Performing Direct Memory Access (DMA) transactions for details regarding interrupt handling with WinDriver.)
- MyDMAWaitForCompletion(): Poll the device for completion or wait for “DMA DONE” interrupt.
- MyDmaIntHandler: The device’s interrupt handler. IntHandler is a pointer to a function prototype of your choice.
Preallocating Contiguous DMA Buffers on Windows
WinDriver doesn’t limit the size of the DMA buffer that can be allocated using its DMA APIs. However, the success of the DMA allocation is dependent on the amount of available system resources at the time of the allocation. Therefore, the earlier you try to allocate the buffer, the better your chances of succeeding.
WinDriver for Windows allows you to configure your device INF file to preallocate contiguous DMA buffers at boot time, thus increasing the odds that the allocation(s) will succeed. You may preallocate a maximum of 512 buffers: — 256 host-to-device buffers and/or 256 device-to-host buffers.
There are 2 ways to preallocate contiguous DMA buffers on Windows: directly from the DriverWizard, or manually via editing the INF file.
Directly from DriverWizard:
- In DriverWizard, start a new project, select a device from the list and click Generate .INF file.
- Check Preallocate Host-To-Device DMA Buffers and/or Preallocate Device-To-Host DMA Buffers to enable the text boxes under each checkbox.
- Adjust the Size, Count and Flags parameters as desired. The supported WinDriver DMA flags are documented in the [WD_DMA_OPTIONS] (WD_DMA_OPTIONS) enum.
- Click Next, and you will then be prompted to choose a filename for your .INF file. After choosing a filename, the INF file will be created and ready to use, with your desired parameters.
⚠ Attention: The Size and Flags fields must be hexadecimal numbers, formatted with the “0x” prefix, as shown below.
DriverWizard INF File Information
Manually by editing an existing INF file:
- Add the required configuration under the
UpdateRegistryDevice
registry key in your device INF file, as shown below.
The examples are for configuring preallocation of eight DMA buffers but you may, of-course, select to preallocate just one buffer (or none at all). To preallocate unidirectional buffers, add these lines:
; Host-to-device DMA buffer:
; Host-to-device DMA buffer:
HKR,, “DmaToDeviceCount”,0x00010001,0x04 ; Number of preallocated
; DMA_TO_DEVICE buffers
HKR,, “DmaToDeviceBytes”,0x00010001,0x100000 ; Buffer size, in bytes
HKR,, “DmaToDeviceOptions”,0x00010001,0x41 ; DMA flags (0x40=DMA_TO_DEVICE
; + 0x1=DMA_KERNEL_BUFFER_ALLOC
; Device-to-host DMA buffer:
HKR,, “DmaFromDeviceCount”,0x00010001,0x04 ; Number of preallocated
; DMA_FROM_DEVICE buffers
HKR,, “DmaFromDeviceBytes”,0x00010001,0x100000 ; Buffer size, in bytes
HKR,, “DmaFromDeviceOptions”,0x00010001,0x21 ; DMA flags (0x20=DMA_FROM_DEVICE
; + 0x1=DMA_KERNEL_BUFFER_ALLOC)
- Edit the buffer sizes and add flags to the options masks in the INF file, as needed. Note, however, that the direction flags and the [DMA_KERNEL_BUFFER_ALLOC] (DMA_KERNEL_BUFFER_ALLOC) flag must be set as shown in Step 1.
The supported WinDriver DMA flags for the dwOptions
field of the WD_DMA struct are documented in [WD_DMA_OPTIONS] (WD_DMA_OPTIONS).
DmaFromDeviceCount
and DmaFromDeviceCount
values are supported from version 12.4.0. If those values aren’t set value of 1 will be assumed.
The Wizard-generated and relevant sample WinDriver device INF files already contain the unidirectional buffers configuration lines, so you only need to remove the comment indicator ; at the start of each line. The examples are for configuring preallocation of 8 DMA buffers (4 for each direction), but you may, of-course, select to preallocate just one buffer (or none at all, by leaving the above code commented out).
In your code, the first n calls (if you configured the INF file to preallocate n DMA buffers) to the contiguous-DMA-lock function — WDC_DMAContigBufLock()
— should set parameter values that match the buffer configurations in the INF file:
⚠ Attention: For a device-to-host buffer, the DMA-options mask parameter (dwOptions
/ pDma
->dwOptions
) should contain the same DMA flags set in the DmaFromDeviceOptions
registry key value, and the buffer-size parameter (dwDMABufSize
/ pDma
->dwBytes
) should be set to the value of the DmaFromDeviceBytes
registry key value. For a host-to-device buffer, the DMA-options mask parameter (dwOptions
/ pDma
->dwOptions
) should contain the same flags set in the DmaToDeviceOptions
registry key value, and the buffer-size parameter (dwDMABufSize
/ pDma
->dwBytes
) should be set to the value of the DmaToDeviceBytes
registry key value.
In calls to WDC_DMAContigBufLock()
, the DMA configuration information is provided via dedicated function parameters — dwDMABufSize
and dwOptions
. In calls to WD_DMALock()
, the information is provided within the fields of the WD_DMA
struct pointed to by the pDma
parameter — pDma
->dwBytes
and pDma
->dwOptions
.
When using WDC_DMAContigBufLock()
you don’t need to explicitly set the [DMA_KERNEL_BUFFER_ALLOC] (DMA_KERNEL_BUFFER_ALLOC) flag (which must be set in the INF-file configuration) because the function sets this flag automatically.
When using the low-level WinDriver WD_DMALock()
function, the DMA options are set in the function’s pDma
->dwOptions
parameter — which must also include the [DMA_KERNEL_BUFFER_ALLOC] (DMA_KERNEL_BUFFER_ALLOC) flag — and the buffer size is set in the pDma
->dwBytes
parameter.
If the buffer preallocation fails due to insufficient resources, you may need to increase the size of the non-paged pool (from which the memory is allocated).
Performing Direct Memory Access (DMA) transactions
This section describes how to use WinDriver to implement bus-master Direct Memory Access (DMA) transactions for devices capable of acting as bus masters. Such devices have a DMA controller, which the driver should program directly.
DMA is a capability provided by some computer bus architectures — including PCI and PCIe — which allows data to be sent directly from an attached device to the memory on the host, freeing the CPU from involvement with the data transfer and thus improving the host’s performance.
To understand the use of WinDriver’s DMA transaction API, you must be familiar with the following concepts:
DMA transaction
A DMA transaction is a complete I/O operation, such as a single read or write request from an application.
DMA transfer
A DMA transfer is a single hardware operation that transfers data from computer memory to a device or from the device to computer memory.
A single DMA transaction always consists of at least one DMA transfer, but a transaction can consist of many transfers.
A DMA transaction buffer can be allocated in two ways:
- Contiguous buffer — A contiguous block of memory is allocated.
- Scatter/Gather (SG) — The allocated buffer can be fragmented in the physical memory and does not need to be allocated contiguously. The allocated physical memory blocks are mapped to a contiguous buffer in the calling process’s virtual address space, thus enabling easy access to the allocated physical memory blocks.
The programming of a device’s DMA controller is hardware specific. Normally, you need to program your device with the local address (on your device), the host address (the physical memory address on your PC) and the transfer count (the size of the memory block to transfer), and then set the register that initiates the transfer. WinDriver provides you with an API for implementing both Contiguous-Buffer DMA transactions and Scatter/Gather DMA transactions (if supported by the hardware) — see the description of WDC_DMATransactionContigInit(), WDC_DMATransactionSGInit(), WDC_DMATransferCompletedAndCheck(), WDC_DMATransactionRelease() and WDC_DMATransactionUninit().
DMA transaction diagram
The following sections include code samples that demonstrate how to use WinDriver to implement a Scatter/Gather DMA transaction (see 11.2.1. Implementing Scatter/Gather DMA) and a Contiguous-Buffer DMA transaction (see 11.2.2. Implementing Contiguous-Buffer DMA).
The sample routines demonstrate using either an interrupt mechanism or a polling mechanism to determine DMA completion.
Implementing Scatter/Gather DMA transactions
Following is a sample routine that uses WinDriver’s WDC API to initialize a Scatter/Gather DMA transaction buffer and perform bus-master DMA transfers.
For more detailed, hardware-specific, Scatter Gather DMA transaction examples, refer to the following enhanced-support chipset (Chapter 9: Enhanced Support for Specific Chipsets) sample library files:
- PLX —
WinDriver/samples/c/plx/lib/plx_lib.c
, WinDriver/samples/c/plx/diag_lib/plx_diag_lib.c
(which utilizes the plx_lib.c DMA API) and WinDriver/samples/c/plx/9656/p9656_diag.c
- XDMA design —
WinDriver/samples/c/xilinx/xdma/xdma_lib.c
C Example
#define MAX_TRANSFER_SIZE 0xFFFF /* According to the device limits */
typedef struct {
UINT32 u32PADR; /* PCI address */
UINT32 u32LADR; /* Local address */
UINT32 u32SIZ; /* Transfer size */
UINT32 u32DPR; /* Next descriptor pointer */
} MY_DMA_TRANSFER_ELEMENT; /* DMA transfer element */
BOOL DMATransactionRoutine(WDC_DEVICE_HANDLE hDev, DWORD dwDMABufSize,
UINT32 u32LocalAddr, BOOL fPolling, BOOL fToDev)
{
PVOID pBuf;
WD_DMA *pDma = NULL;
BOOL fRet = FALSE;
/* Allocate a user-mode buffer for a Scatter/Gather DMA transaction */
pBuf = malloc(dwDMABufSize);
if (!pBuf)
return FALSE;
/* Initialize the DMA transaction */
if (!DMATransactionInit(hDev, pBuf, u32LocalAddr, dwDMABufSize, fToDev, &pDma,
fPolling))
{
goto Exit;
}
if (!DMATransactionExecute(hDev, pDma, fPolling))
{
goto Exit;
}
/* Wait for the DMA transaction to complete */
MyDMAWaitForCompletion(hDev, pDma, fPolling);
if (!DMATransactionRelease(hDev, pDma))
{
goto Exit;
}
/* The DMA transaction can be reused as many times as you like
* Notice: after each call to DMATransactionExecute
* (WDC_DMATransactionExecute) there must be a call to
* DMATransactionRelease (WDC_DMATransactionRelease) */
fRet = TRUE;
Exit:
DMAClose(hDev, pDma, fPolling);
free(pBuf);
return fRet;
}
/* DMATransactionInit: Initializes a Scatter/Gather DMA transaction buffer */
BOOL DMATransactionInit(WDC_DEVICE_HANDLE hDev, PVOID pBuf, UINT32 u32LocalAddr,
DWORD dwDMABufSize, BOOL fToDev, WD_DMA **ppDma, BOOL fPolling)
{
DWORD dwStatus, dwNumCmds, i;
DWORD dwOptions = fToDev ? DMA_TO_DEVICE : DMA_FROM_DEVICE;
WD_TRANSFER *pTrans;
WDC_INTERRUPT_PARAMS interruptsParams;
BOOL fRet = FALSE;
/* Enable DMA interrupts (if not polling) */
if (!fPolling)
{
pTrans = GetMyTransCmds(&dwNumCmds);
interruptsParams.pTransCmds = pTrans;
interruptsParams.dwNumCmds = dwNumCmds;
interruptsParams.dwOptions = INTERRUPT_CMD_COPY;
interruptsParams.funcIntHandler = MyDmaIntHandler;
interruptsParams.pData = hDev;
interruptsParams.fUseKP = FALSE;
}
/* Initialize a Scatter/Gather DMA transaction buffer */
dwStatus = WDC_DMATransactionSGInit(hDev, pBuf, dwOptions, dwDMABufSize,
ppDma, &interruptsParams, MAX_TRANSFER_SIZE,
sizeof(MY_DMA_TRANSFER_ELEMENT));
if (WD_STATUS_SUCCESS != dwStatus)
{
printf("Failed to initialize a Scatter/Gather DMA transaction buffer. Error "
"0x%lx - %s\n", dwStatus, Stat2Str(dwStatus));
goto Exit;
}
fRet = TRUE;
Exit:
return fRet;
}
/* DMATransactionExecute: Executes the DMA transaction */
BOOL DMATransactionExecute(WDC_DEVICE_HANDLE hDev, WD_DMA *pDma, BOOL fPolling)
{
BOOL fRet = FALSE;
DWORD dwPage;
UINT64 u64Offset;
dwStatus = WDC_DMATransactionExecute(pDma, DMATransactionProgram, hDev);
if (dwStatus != WD_STATUS_SUCCESS)
{
printf("Failed to execute DMA transaction. Error 0x%lx - %s\n",
dwStatus, Stat2Str(dwStatus));
goto Exit;
}
fRet = TRUE;
Exit:
return fRet;
}
/* DMATransactionProgram: Programs the device's DMA registers and starts the
* DMA */
void DMATransactionProgram(WDC_DEVICE_HANDLE hDev)
{
PDEV_CTX pDevCtx = (PDEV_CTX)WDC_GetDevContext(hDev);
WD_DMA *pDma = pDevCtx->pDma;
BOOL fToDev = pDevCtx->fToDev;
DWORD dwPage;
UINT64 u64Offset;
/* The value of u64Offset is the required offset on the device */
u64Offset = pDevCtx->u32LocalAddr + pDma->dwBytesTransferred;
for (dwPage = 0; dwPage < pDma->dwPages; dwPage++)
{
/* Use pDma->Page array, u64Offset, fToDev,...
* to program the device's DMA registers for each physical page */
u64Offset += pDma->Page[dwPage].dwBytes;
}
/* Enable interrupts on hardware */
MyDMAInterruptHardwareEnable(hDev);
/* Start DMA - write to the device to start the DMA transfer */
MyDMAStart(hDev, pDma);
}
/* MyDmaIntHandler: Interrupt handler */
void DLLCALLCONV MyDmaIntHandler(PVOID pData)
{
PDEV_CTX pDevCtx = (PDEV_CTX)WDC_GetDevContext(hDev);
WD_DMA *pDma = pDevCtx->pDma;
DWORD dwStatus;
dwStatus = WDC_DMATransferCompletedAndCheck(pDma, TRUE);
if (dwStatus == WD_STATUS_SUCCESS)
{
/* DMA transaction completed */
}
else if (dwStatus != WD_MORE_PROCESSING_REQUIRED)
{
/* DMA transfer failed */
}
else /* dwStatus == WD_MORE_PROCESSING_REQUIRED */
{
/* DMA transfer completed (but the transaction is not completed) */
/* If the fRunCallback parameter given to
* WDC_DMATransferCompletedAndCheck() is TRUE, then the
* DMATransactionProgram function is automatically called
* synchronously by WDC_DMATransferCompletedAndCheck */
}
}
/* DMATransactionRelease: Releases the DMA transaction */
BOOL DMATransactionRelease(WD_DMA *pDma)
{
DWORD dwStatus = WDC_DMATransactionRelease(pDma);
return (dwStatus == WD_STATUS_SUCCESS) ? TRUE : FALSE;
}
/* DMAClose: Deletes a previously initiated Scatter/Gather DMA transaction */
void DMAClose(WDC_DEVICE_HANDLE hDev, WD_DMA *pDma, BOOL fPolling)
{
/* Disable DMA interrupts (if not polling) */
if (!fPolling)
MyDMAInterruptDisable(hDev);
WDC_DMATransactionUninit(pDma);
}
C# Example
A detailed example, which is specific to the enhanced support for PLX chipsets can be found in the WinDriver/samples/c/plx/dotnet/lib/
library and WinDriver/samples/c/plx/dotnet/diag/
diagnostics library.
What Should You Implement?
In the code sample above, it is up to you to implement the following MyDMAxxx() routines and MY_DMA_TRANSFER_ELEMENT structure according to your device’s specification and your needs:
- MY_DMA_TRANSFER_ELEMENT: A DMA transfer element structure according to your device’s specification.
- MyDMAStart(): Write to the device to initiate DMA transfers.
- MyDMAInterruptHardwareEnable(): Write/read the relevant register(s) on the device in order to physically enable the hardware DMA interrupts (see Section 10.4 for details regarding interrupt handling with WinDriver).
- MyDMAInterruptDisable(): Use WDC_IntDisable() to disable the software interrupts and write/read the relevant register(s) on the device in order to physically disable the hardware DMA interrupts (see Section 11.4 for details regarding interrupt handling with WinDriver).
- MyDMAWaitForCompletion(): Poll the device for completion or wait for a “DMA transfer done” interrupt.
- MyDmaIntHandler: The device’s interrupt handler. IntHandler is a pointer to a function prototype of your choice.
Implementing Contiguous-Buffer DMA transactions
Following is a sample routine that uses WinDriver’s WDC API to initialize a Contiguous- Buffer DMA transaction and perform bus-master DMA transfers.
For more detailed, hardware-specific, contiguous DMA examples, refer to the following enhanced-support chipset sample library files:
- PLX —
WinDriver/samples/c/plx/lib/plx_lib.c
and WinDriver/samples/c/plx/diag_lib/plx_diag_lib.c
(which utilizes the plx_lib.c
DMA API)
- XDMA design —
WinDriver/samples/c/xilinx/xdma/xdma_lib.c
C Example
#define DMA_ALIGNMENT_REQUIREMENT 0xFF /* According to the device requirements */
BOOL DMATransactionRoutine(WDC_DEVICE_HANDLE hDev, DWORD dwDMABufSize,
UINT32 u32LocalAddr, BOOL fPolling, BOOL fToDev)
{
PVOID pBuf = NULL;
WD_DMA *pDma = NULL;
BOOL fRet = FALSE;
/* Initialize the DMA transaction */
if (!DMATransactionInit(hDev, &pBuf, u32LocalAddr, dwDMABufSize, fToDev,
&pDma, fPolling))
{
goto Exit;
}
if (!DMATransactionExecute(hDev, pDma, fPolling))
{
goto Exit;
}
/* Wait for the DMA transaction to complete */
MyDMAWaitForCompletion(hDev, pDma, fPolling);
if (!DMATransactionRelease(hDev, pDma))
{
goto Exit;
}
/* The DMA transaction can be reused as many times as you like
* Notice: after each call to DMATransactionExecute
* (WDC_DMATransactionExecute) there must be a call to
* DMATransactionRelease (WDC_DMATransactionRelease) */
fRet = TRUE;
Exit:
DMAClose(hDev, pDma, fPolling);
free(pBuf);
return fRet;
}
/* DMATransactionInit: Initializes a Contiguous-Buffer DMA transaction buffer */
BOOL DMATransactionInit(WDC_DEVICE_HANDLE hDev, PVOID *ppBuf, UINT32 u32LocalAddr,
DWORD dwDMABufSize, BOOL fToDev, WD_DMA **ppDma, BOOL fPolling)
{
DWORD dwStatus, dwNumCmds, i;
DWORD dwOptions = fToDev ? DMA_TO_DEVICE : DMA_FROM_DEVICE;
WD_TRANSFER *pTrans;
WDC_INTERRUPT_PARAMS interruptsParams;
BOOL fRet = FALSE;
/* Enable DMA interrupts (if not polling) */
if (!fPolling)
{
pTrans = GetMyTransCmds(&dwNumCmds);
interruptsParams.pTransCmds = pTrans;
interruptsParams.dwNumCmds = dwNumCmds;
interruptsParams.dwOptions = INTERRUPT_CMD_COPY;
interruptsParams.funcIntHandler = MyDmaIntHandler;
interruptsParams.pData = hDev;
interruptsParams.fUseKP = FALSE;
}
/* Initialize a Contiguous-Buffer DMA transaction buffer */
dwStatus = WDC_DMATransactionContigInit(hDev, ppBuf, dwOptions, dwDMABufSize,
ppDma, &interruptsParams, DMA_ALIGNMENT_REQUIREMENT);
if (WD_STATUS_SUCCESS != dwStatus)
{
printf("Failed to initialize a Contiguous-Buffer DMA transaction buffer. Error "
"0x%lx - %s\n", dwStatus, Stat2Str(dwStatus));
goto Exit;
}
fRet = TRUE;
Exit:
return fRet;
}
/* DMATransactionExecute: Executes the DMA transaction */
BOOL DMATransactionExecute(WDC_DEVICE_HANDLE hDev, WD_DMA *pDma, BOOL fPolling)
{
BOOL fRet = FALSE;
DWORD dwPage;
UINT64 u64Offset;
dwStatus = WDC_DMATransactionExecute(pDma, DMATransactionProgram, hDev);
if (dwStatus != WD_STATUS_SUCCESS)
{
printf("Failed to execute DMA transaction. Error 0x%lx - %s\n",
dwStatus, Stat2Str(dwStatus));
goto Exit;
}
fRet = TRUE;
Exit:
return fRet;
}
/* DMATransactionProgram: Programs the device's DMA registers and starts the
* DMA */
void DMATransactionProgram(WDC_DEVICE_HANDLE hDev)
{
PDEV_CTX pDevCtx = (PDEV_CTX)WDC_GetDevContext(hDev);
WD_DMA *pDma = pDevCtx->pDma;
BOOL fToDev = pDevCtx->fToDev;
UINT64 u64Offset;
/* The value of u64Offset is the required offset on the device */
u64Offset = pDevCtx->u32LocalAddr + pDma->dwBytesTransferred;
/* Use pDma->Page[0], u64Offset, fToDev,...
* to program the device's DMA registers */
/* Enable interrupts on hardware */
MyDMAInterruptHardwareEnable(hDev);
/* Start DMA - write to the device to start the DMA transfer */
MyDMAStart(hDev, pDma);
}
/* MyDmaIntHandler: Interrupt handler */
void DLLCALLCONV MyDmaIntHandler(PVOID pData)
{
PDEV_CTX pDevCtx = (PDEV_CTX)WDC_GetDevContext(hDev);
WD_DMA *pDma = pDevCtx->pDma;
DWORD dwStatus;
dwStatus = WDC_DMATransferCompletedAndCheck(pDma, TRUE);
if (dwStatus == WD_STATUS_SUCCESS)
{
/* DMA transaction completed */
}
else if (dwStatus == WD_MORE_PROCESSING_REQUIRED)
{
/* DMA transfer completed (but the transaction is not completed) */
/* If the fRunCallback parameter given to
* WDC_DMATransferCompletedAndCheck() is TRUE, then the
* DMATransactionProgram function is automatically called
* synchronously by WDC_DMATransferCompletedAndCheck */
}
else
{
/* DMA transfer failed */
}
}
/* DMATransactionRelease: Releases the DMA transaction */
BOOL DMATransactionRelease(WD_DMA *pDma)
{
DWORD dwStatus = WDC_DMATransactionRelease(pDma);
return (dwStatus == WD_STATUS_SUCCESS) ? TRUE : FALSE;
}
/* DMAClose: Deletes a previously initiated Contiguous-Buffer DMA transaction */
void DMAClose(WDC_DEVICE_HANDLE hDev, WD_DMA *pDma, BOOL fPolling)
{
/* Disable DMA interrupts (if not polling) */
if (!fPolling)
MyDMAInterruptDisable(hDev);
WDC_DMATransactionUninit(pDma);
}
C# Example
A detailed example, which is specific to the enhanced support for PLX chipsets can be found in the WinDriver/samples/c/plx/dotnet/lib/
library and WinDriver/samples/c/plx/dotnet/diag/
diagnostics library.
What Should You Implement?
In the code sample above, it is up to you to implement the following MyDMAxxx() routines and MY_DMA_TRANSFER_ELEMENT structure according to your device’s specification and your needs:
- MY_DMA_TRANSFER_ELEMENT: DMA transfer element structure according to your device’s specification.
- MyDMAStart(): Write to the device to initiate DMA transfers.
- MyDMAInterruptHardwareEnable(): Write/read the relevant register(s) on the device in order to physically enable the hardware DMA interrupts (see Section 11.4 for details regarding interrupt handling with WinDriver).
- MyDMAInterruptDisable(): Use WDC_IntDisable() to disable the software interrupts and write/read the relevant register(s) on the device in order to physically disable the hardware DMA interrupts (see Section 11.4 for details regarding interrupt handling with WinDriver).
- MyDMAWaitForCompletion(): Poll the device for completion or wait for a “DMA transfer done” interrupt.
- MyDmaIntHandler: The device’s interrupt handler. IntHandler is a pointer to a function prototype of your choice.
DMA between PCI devices and NVIDIA GPUs with GPUDirect (Linux only)
What is GPUDirect for RDMA?
GPUDirect for RDMA is a feature available on selected NVIDIA GPUs that allows performing Direct Memory Access (DMA) between GPUs and PCI Express devices. Check out NVIDIA’s website to make sure your GPU does support GPUDirect for this purpose.
NVIDIA GPUDirect RDMA is currently supported only on Linux. We strive to add support for further GPUs and OSes in the future, please contact [email protected] for further inquiries and suggestions.
System Requirements
The following system requirements:
- NVIDIA GPU that supports GPUDirect.
- PCIe device controlled by WinDriver.
Software Prerequisites
The following software prerequisities:
- Installed NVIDIA kernel drivers for your GPU.
- Installed CUDA version that supports GPUDirect (we have tested with version 10) along with the NVCC compiler.
WinDriver installation
Unpack WinDriver from the tar file.
cd WinDriver/redist
Make sure your WinDriver kernel module is linked with NVIDIA’s kernel module to allow GPUDIRECT.
./configure –with-gpudirect-source=<>/kernel
sudo make && sudo make install
Moving DMA from CPU to GPU
We strongly recommend making sure you have already implemented and tested a “regular” DMA routine between your device and the computer RAM before moving on to implementing a GPUDirect DMA routine.
See previous sections for more info (11.2. Performing Direct Memory Access (DMA) or
11.3. Performing Direct Memory Access (DMA) transactions).
Assuming you have already implemented a DMA routine in your WinDriver-based code, perform the following steps to perform DMA to the GPU memory instead of the main memory.
Modify Compilation
If compiling using Make/CMake, follow this instructions and see a detailed example below:
- Change your makefile to compile your app with the CUDA compiler (
nvcc
) instead of gcc
.
- Change your makefile to link your app with with the CUDA compiler (
nvcc
) instead of ld
.
- Add
-lcuda
to your linker flags(LFLAGS
) in order to link with the CUDA shared libraries.
- Remove
-fno-pie and -m$(USER_BITS)
from your linker flags.
Modify your code
Add to your code.
#include {cuda.h}
#include {cuda_runtime.h}
Instead of using regular malloc()
to allocate the memory for the pBuf parameter in the function WDC_DMASGBufLock()
, use cudaMalloc()
.
Make sure that the dwOptions parameter of WDC_DMASGBufLock()
contains the [DMA_GPUDIRECT] (DMA_GPUDIRECT) flag.
Add the following code to enable synchronous memory operations with your DMA buffer (pDma->pBuf
in this example)
int flag = 1;
if (CUDA_SUCCESS != cuPointerSetAttribute(&flag,
CU_POINTER_ATTRIBUTE_SYNC_MEMOPS, (CUdeviceptr)pDma->pBuf))
{
printf("cuDeviceGet failed\n");
return;
}
Use CUDA functions to access the GPU memory such as: cudaMemcpy(), cudaFree() etc. Using regular memory management functions on pointers to GPU memory might lead to crashes.
⚠ Attention: Calling WDC_DMASGBufLock()
with [DMA_GPUDIRECT] (DMA_GPUDIRECT) flag with a buffer allocated with regular memory buffer (not allocated using cudaMalloc()
) will result in an Internal System Error (Status 0x20000007).
CMake Example
If using CMake to compile your code – use the following recipe as a guide.
cmake_minimum_required(VERSION 3.5)
set(WD_BASEDIR ~/WinDriver) #change according to your installation path
project(my_wd_gpudirect_project C)
include(${WD_BASEDIR}/include/wd.cmake)
include_directories(
${WD_BASEDIR}
${WD_BASEDIR}/include
)
set(SRCS my_wd_gpudirect_project.c)
add_executable(my_wd_gpudirect_project ${SRCS} ${SAMPLE_SHARED_SRCS})
#link with and libwdapi1640 and libcuda
target_link_libraries(my_wd_gpudirect_project wdapi${WD_VERSION} cuda)
set_target_properties(my_wd_gpudirect_project PROPERTIES
RUNTIME_OUTPUT_DIRECTORY "${ARCH}/"
)
#remove definitions to allow compilation with nvcc
remove_definitions("-Wno-unused-result -Wno-write-strings ")
set(CMAKE_SHARED_LIBRARY_LINK_C_FLAGS "")
#change compiler to nvcc
set(CMAKE_C_COMPILER /usr/local/cuda-10.2/bin/nvcc)
#add GPUDIRECT definition to compilation
target_compile_definitions(my_wd_gpudirect_project PRIVATE GPUDIRECT)