
Understanding your memory

Ketxxx

Heedless Psychic
Joined
Mar 4, 2006
Messages
11,521 (1.75/day)
Location
Kingdom of gods
System Name Ravens Talon
Processor AMD R7 3700X @ 4.4GHz 1.3v
Motherboard MSI X570 Tomahawk
Cooling Modded 240mm Coolermaster Liquidmaster
Memory 2x16GB Klevv BoltX 3600MHz & custom timings
Video Card(s) Powercolor 6800XT Red Devil
Storage 250GB Asgard SSD, 1TB Integral SSD, 2TB Seagate Barracuda
Display(s) 27" BenQ Mobiuz
Case NZXT Phantom 530
Audio Device(s) Asus Xonar DX 7.1 PCI-E
Power Supply 1000w Supernova
Software Windows 10 x64
Benchmark Scores Fast. I don't need epeen.
It's been a while since I churned out any information to really help the overclocker, so I figured I'd jot down some details to help people understand their RAM and how it works. The info is taken directly out of my portfolio, which holds various works of my own along with things of general interest I filed away for reference. The hope is that by knowing in detail how your RAM works, you will come to understand what causes it to error, so what may seem like a memory limitation right now may suddenly become clear by the end of reading this, and you will have a good idea of what to try to push further. Do remember you will need access to advanced tweaking options, though, and the information contained herein is not particularly straightforward; most will probably need to reference and re-read what's in here at least a few times before it sticks. Some info contained within really isn't needed, but for the sake of completeness I left it in anyway. Anyhoo, off we go.


As valid signal windows shrink, signal integrity (SI) becomes a dominant factor in ensuring that memory interfaces perform flawlessly.

Chip and PCB-level design techniques can improve Simultaneous Switching Output (SSO) characteristics, making it easier to achieve the signal integrity required in wider memory interfaces. Features that are integrated on the FPGA silicon die, such as Digitally Controlled Impedance (DCI), simplify the PCB layout design and enhance performance.

Optimizing timing in DDR SDRAM interfaces

Shrinking data periods and significant memory timing uncertainties are making timing closure a real challenge in today's higher performance electronic systems. Several design practices help in preserving the opening of the valid data window. For example, in the case of interfaces with DDR2 SDRAM devices, the JEDEC standard allows the memory device suppliers to have a substantial amount of skew on the data transmitted to the memory controller. There are several components to this skew factor, including output access time, package and routing skew, and data-to-strobe skew. In the case where the memory controller is using a fixed phase-shift to register data across the entire interface, the sum of the skew uncertainties must be accounted for in the timing budget. If the worst-case sum of skew uncertainties is high, it reduces the data valid window and thereby limits the guaranteed performance for the interface.

Assuming that data capture is based on the timing of the DQS signals, leveraging the source-synchronous nature of the interface, it is possible to compute the memory valid data window across the entire data bus as follows:

T_MEMORY_VDW = T_DATA_PERIOD - T_DQSCK - T_DQSQ - T_QHS
             = 1687 ps - 900 ps - 300 ps - 400 ps
             = 87 ps

This equation sets the first condition for the data capture timing budget: the memory valid data window must be larger than the sampling window of the memory controller receiver, including setup and hold times and all the receiver timing uncertainties. Capturing data using a fixed delay or a fixed phase shift across the entire data bus is no longer sufficient to register data reliably. However, there are several other methods available. Among all the different options, the "direct clocking" data capture method is a very efficient way to register data bits and transfer them into the memory controller clock domain. This method consists of detecting transitions on DQS signals and implementing the appropriate delay on data bits to center-align DQ signals with the memory controller clock. This technique also has the advantage of making the clock domain transfers from the memory to the controller efficient and reliable. When the calibration is performed on (a quick numeric sketch follows the list):

* The entire interface: the delay on the data bits is adjusted automatically, independently of the system parameters.
* One byte of data: the DQS-to-DQ skew, TDQSQ, and the data hold skew factor, TQHS, are removed from the timing budget equation.
* One bit of data: the clock-to-strobe uncertainty, TDQSCK, is removed in addition to TDQSQ and TQHS. In this case, the valid data window is equal to the data period as provided by the memory device.
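To make the budget arithmetic concrete, here is a minimal Python sketch of the three calibration scopes, using the worked-example values above. The function and scope labels are just illustrative shorthand for this article's scenarios, not any vendor API.

```python
# Minimal sketch of the DDR2 read-capture timing budget described above.
# Values are the article's worked example; the scope labels are shorthand
# for the three calibration cases in the list, nothing more.

T_DATA_PERIOD = 1687  # ps
T_DQSCK = 900         # ps, clock-to-strobe uncertainty
T_DQSQ = 300          # ps, DQS-to-DQ skew
T_QHS = 400           # ps, data hold skew factor

def valid_data_window(scope: str) -> int:
    """Memory valid data window (ps) after subtracting the skew terms
    that the given calibration scope does NOT remove."""
    removed = {
        "fixed": [],                          # fixed phase shift, no calibration
        "per-byte": [T_DQSQ, T_QHS],          # per-byte calibration
        "per-bit": [T_DQSCK, T_DQSQ, T_QHS],  # per-bit calibration
    }[scope]
    remaining = [t for t in (T_DQSCK, T_DQSQ, T_QHS) if t not in removed]
    return T_DATA_PERIOD - sum(remaining)

for scope in ("fixed", "per-byte", "per-bit"):
    print(f"{scope:>8}: {valid_data_window(scope)} ps")
# fixed: 87 ps, per-byte: 787 ps, per-bit: 1687 ps (the full data period)
```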

In high data rate systems that are subject to variations in voltage and temperature, dynamic calibration is required. In leading edge interfaces, performing the calibration sequence periodically makes this scheme independent of voltage and temperature variations at all times.

Increasing bandwidth with wider data buses: distributed power and ground pins

Meeting higher bandwidth requirements can be achieved by widening the data buses; interfaces of 144 or 288 bits are not uncommon nowadays. Memory controller device packages with many I/Os are required to achieve those wide buses. Numerous bits switching simultaneously can create signal integrity problems. The SSO limit is specified by the device vendor and represents the number of outputs that can switch simultaneously per bank of the device. This limit is higher for devices that are architected to support a large number of I/Os and contain distributed power and ground pins. Such packages offer better immunity against crosstalk.

Two such devices, one with a conventional package and one with distributed power and ground pins, were used in an experiment emulating a 72-bit memory interface. Using SSTL 1.8V (the standard for DDR2 interfaces) with external terminations, the worst-case noise level on a user pin is six times smaller in the package with the distributed power and ground pins. For wide interfaces, crosstalk and data-dependent jitter can be major contributors to the timing budget and cause setup and hold time violations. The data-dependent jitter depends on the transitions on the data bus (examples of possible data transitions: "Z-0-1-Z", "1-1-1-1" or "0-1-0-1"). For designs using I/Os extensively, distributed power and ground packages and careful PCB design add to the stability and robustness of the electronic system; the sketch below shows the kind of worst-case patterns involved.
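The worst-case transitions above are the kind of stimulus SI engineers drive over aggressor lines surrounding a quiet victim line. A tiny hedged sketch of such a pattern generator follows (the helper is hypothetical; real crosstalk analysis happens in an IBIS or SPICE simulator, not in Python):

```python
# Sketch: worst-case crosstalk stimulus on a data bus. Every aggressor bit
# toggles 0-1-0-1 in lockstep while the victim is held quiet, maximising
# coupled noise on the victim. Purely illustrative.

def crosstalk_patterns(width: int, victim: int, cycles: int = 8):
    """Yield one bus word (list of bits) per cycle."""
    for cycle in range(cycles):
        word = [cycle & 1] * width   # aggressors toggle together: 0,1,0,1,...
        word[victim] = 0             # quiet victim line picks up the noise
        yield word

for word in crosstalk_patterns(width=8, victim=3):
    print("".join(map(str, word)))
```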

The challenge of capacitive loading on the bus

When designing a large memory interface system, cost, density, throughput, and latency are key factors in determining the choice of interface architecture. One solution for achieving the desired results is to use multiple devices driven by a common bus for address and command signals; this corresponds, for example, to a dense unbuffered DIMM interface. One interface with two 72-bit unbuffered DIMMs can have a load of up to 36 receivers on the address and command buses, assuming that each single-rank DIMM has 18 components. The maximum load recommended by JEDEC standards and encountered in common systems is two unbuffered DIMMs. The resulting capacitive loading on the bus is extremely large; it causes these signals to have edges that take more than one clock period to rise and fall, resulting in setup and hold violations at the memory device. The capacitive loads range from 2 for the registered DIMM to 36 for the unbuffered DIMM.

These eye diagrams clearly show the effect of loading on the address bus: the registered DIMMs offer a wide-open valid window on the ADDRCMD bus. The eye opening for one DIMM still appears good at 267 MHz; however, with 32 loads, the ADDRCMD valid window collapses, and the conventional implementation is no longer sufficient to interface reliably with the two unbuffered DIMMs.

The timing characteristics of falling edges on the same signal under the same loading conditions, obtained from IBIS simulations, are shown in Fig 3.

This simple test case illustrates that the loading causes the edges to slow down significantly and the eye to close past a certain frequency. In systems where the load on the bus cannot be reduced, lowering the frequency of operation is one way to keep the integrity of the signals acceptable.

Each load adds a small capacitance to the bus, while the driver has a fixed or limited current drive strength. The rate at which the driver can slew the bus voltage is inversely proportional to the capacitive load, as the following relation shows; once the driver is saturated, rising and falling edges become slower:

dV/dt = I_DRIVE / C_LOAD
The result is a limit on the maximum clock frequency that can be achieved with a fixed configuration: at some point the edges become slow enough to limit the performance of the interface. This limitation is presented in Fig 4, which plots the maximum possible clock rate against address bus loading for Xilinx FPGAs with DDR2 SDRAM devices. A rough numeric sketch of the effect follows.
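As a back-of-the-envelope illustration, the sketch below treats the driver as a constant-current source charging the lumped load capacitance, so the rise time is roughly t = C * dV / I. The per-receiver capacitance and drive current are round numbers assumed for illustration, not values from the article.

```python
# Back-of-the-envelope: slew-limited rise time of a bus driver charging N
# receiver loads (t = C * dV / I). C_PER_LOAD and I_DRIVE are assumed round
# numbers for illustration, not figures from the article.

C_PER_LOAD = 2e-12   # F, assumed input capacitance per receiver
I_DRIVE = 16e-3      # A, assumed saturated driver current
V_SWING = 1.8        # V, SSTL_18 rail used by DDR2

def rise_time_ps(loads: int) -> float:
    """Slew-limited rise time in picoseconds for a lumped capacitive load."""
    return loads * C_PER_LOAD * V_SWING / I_DRIVE * 1e12

for loads in (2, 18, 36):   # registered DIMM vs one / two unbuffered DIMMs
    print(f"{loads:2d} loads: {rise_time_ps(loads):6.0f} ps")
# ~450 ps at 2 loads, ~8100 ps at 36 loads: more than two full clock
# periods at 267 MHz (3750 ps), matching the behaviour described above.
```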

There are several ways to resolve the capacitive loading issue:

* Duplicate the signals that have an excessive load across the interface. For example, replicating address and command signals every 4 or 8 loads can be very efficient in ensuring high quality signal integrity characteristics on these signals.
* In applications where adding one clock cycle of latency on the interface is applicable, using Registered DIMMs can be a good option. These DIMMs use a register to buffer heavily loaded signals like address and command signals. In exchange for one additional latency cycle in the address and command signals, these modules drastically reduce the load on control and address signals by a factor of 4 to 18, thereby helping with the capacitive loading problem.
* Use the design technique based on two clock periods on address and command signals. More details on this method are presented in the following section.

Using two clock periods
The use of unbuffered DIMMs can be required:

* When latency is the preponderant performance factor. If the memory accesses are short and hit random locations in the memory array, adding one clock cycle of latency degrades the data bus utilization and the overall performance of the interface. In this case, a slower interface with minimal cycles of latency can be more efficient than an interface running faster with one more clock cycle of latency. However, when the memory is accessed in several bursts of adjacent locations, the faster clock rate compensates for the initial latency, and the overall performance increases.
* When the design is cost-sensitive and the additional premium for a registered DIMM is not an option.
* When hardware is already available but has to support a deeper memory interface or a faster clock rate.
* When the number of pins for the memory controller is fixed by an existing PCB or a feature set, and the additional pinout for registered DIMMs is not available.

In these cases, the design technique that uses two clock periods to transmit signals on heavily loaded buses can effectively resolve the capacitive loading on the address and command bus. However, the controller will only be able to present a new address and command to the memory every two clock cycles, reducing the efficiency of the interface in certain applications.

The principle of two-period clocking on the address and command buses is to pre-launch the command and address signals (ADDRCMD) one clock period early and keep them valid for two clock periods. This leaves more time for the address and command signals to rise and meet the setup and hold time requirements of the memory. The control signals, such as Chip Select, whose load is limited to the components of one rank of a DIMM, indicate the correct time for the memory to load the address and command signals. The design technique has been successfully tested and characterized in multiple systems; a small sketch of the timing arithmetic follows.
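This is essentially the idea behind the 2T command rate familiar from BIOS memory settings: the address/command bus gets a whole extra clock period to settle. A minimal sketch of the arithmetic, assuming a 267 MHz clock and a round-number setup time (both illustrative):

```python
# Sketch: settling-time budget for the ADDRCMD bus with 1T vs 2T signalling.
# With 2T the controller launches address/command one cycle early and holds
# them for two cycles, so a slow edge gets a full extra clock period to
# settle before the DRAM samples it. Numbers are illustrative assumptions.

T_CLK_PS = 3750    # one clock period at 267 MHz
T_SETUP_PS = 400   # assumed address setup time at the DRAM

def settle_budget_ps(command_rate: int) -> int:
    """Time an ADDRCMD edge has to settle before the sampling clock edge."""
    return command_rate * T_CLK_PS - T_SETUP_PS

print("1T:", settle_budget_ps(1), "ps")  # 3350 ps
print("2T:", settle_budget_ps(2), "ps")  # 7100 ps, enough headroom for the
# slow edges of a heavily loaded bus, at the cost of a new command only
# every other clock cycle
```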

Reduce the BOM and simplify PCB layout: use on-chip terminations!
One way to reduce the design complexity and bill-of-materials (BOM) for a board is to use on-chip terminations. The JEDEC industry standard for DDR2 SDRAM defines the On-Die Termination (ODT) feature. This feature provides embedded termination on the die of the device, eliminating the need for external PCB terminations on the data, strobe, and data mask (DQ, DQS, and DM) signals. However, the other signals still require external termination resistors on the PCB.

For DDR2, memory vendors have also improved the signal integrity of DIMM modules compared to the original DDR1 devices. For example, they match all flight times and loading on a given signal, reducing the effect of stubs and reflections on the transmission lines. Using ODT, however, requires running additional signal integrity simulations, because the configuration of terminations and loading on the bus changes with the number of populated sockets. Based on the JEDEC matrix for write and read operations, the memory controller should be able to turn the ODT terminations on or off depending on how the memory bus is loaded. For example, JEDEC's termination reference matrix recommends that when both sockets are loaded with 2-rank DIMMs, only the front side of the DIMM in the second slot have ODT enabled. IBIS simulations are the safest way to determine which ODT terminations need to be turned on.

Running the interface twice as fast as the internal memory controller
Feature-rich IC devices can make it easier to meet timing for memory interfaces. For example, in high-speed interfaces, reducing the operating frequency of the controller by 50 percent can save design time or make timing closure possible for a complex controller state machine. This can be done using dedicated SERDES features in FPGAs, for example. This design technique is very advantageous when a large burst of contiguous locations is accessed. Depending on the SERDES configuration, one clock cycle of latency may be inserted into the interface. Although the internal bus width is doubled, the logic design can leverage the inherent parallelism of the FPGA device and take advantage of a slower switching rate to meet timing more easily and consume less dynamic power. The state machine of the controller runs slower, allowing more complex controller logic that can increase overall efficiency and optimize bus utilization. This makes the logic design easier to place and route, and the end result is a more flexible and robust system that is less susceptible to timing changes. The bandwidth bookkeeping is sketched below.
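The bookkeeping behind this trick is straightforward: doubling the internal bus width while halving the fabric clock keeps throughput constant. A quick sketch, with illustrative widths and rates (a 64-bit DDR2 interface at a 267 MHz clock; none of these numbers come from the article):

```python
# Sketch: throughput bookkeeping for running the memory controller at half
# the interface clock via a 2:1 SERDES. Widths and rates are illustrative.

def bandwidth_gbps(bus_bits: int, clock_mhz: float, beats_per_clock: int) -> float:
    """Raw bandwidth in Gb/s for a bus moving beats_per_clock words per cycle."""
    return bus_bits * clock_mhz * 1e6 * beats_per_clock / 1e9

pins   = bandwidth_gbps(64,  267.0, 2)   # DDR pins: two beats per clock
direct = bandwidth_gbps(128, 267.0, 1)   # controller clocked at the full rate
serdes = bandwidth_gbps(256, 133.5, 1)   # 2:1 SERDES: twice the width, half the clock

print(pins, direct, serdes)   # all ~34.2 Gb/s: same throughput, but the
# SERDES-fed fabric logic only has to close timing at 133.5 MHz
```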

Conclusion
With rising clock rates and shrinking valid windows, parallel memory interface buses are becoming more challenging for designers. All stages of design and implementation should be considered carefully to tune the interface parameters and determine the optimal settings. Signal integrity simulations and simultaneous switching output checks are key to tailoring the interface. The feature-rich silicon resources of the devices on which memory controllers are implemented, such as process, voltage, and temperature compensated delays; dedicated routing and registers; and specific clocking resources, can also help in meeting or improving the memory interface performance targets.

There are diagrams to go with all this, but for now I don't think they're needed. I'll answer any questions people may have, as long as they don't require me writing a mini essay to answer :p

Anyway, for those that managed to follow that, hopefully you can see the purpose of the info, and hopefully it's led you to a further understanding of your memory, whether it be DDR or DDR2.
 

pt

not a suicide-bomber
Joined
Mar 11, 2006
Messages
8,956 (1.36/day)
Location
Portugal
Processor AMD Turion 64 X2 Mobile TL-60 (Trinidad)
Motherboard ASUS F3Ka (ATI RS690M)
Cooling stock
Memory Nanya 2x1GB ddr2 667@5-5-5-15-2T
Video Card(s) ATI Mobility Radeon HD2600 512MB DDR2@ 580mhz/486mhz
Storage 160GB on laptop+250GB external
Display(s) ASUS 15.4
Case Asus Laptop F3Ka chassis
Audio Device(s) on-board
Power Supply 1:30minutes battery
Software "genui xp", 'cause i hated vista
nice :)
thanks
 

Ketxxx

Heedless Psychic
Be honest, how much of that did you manage to understand? :p
 

Zebbo

Mushkin Tech Rep
Joined
Sep 11, 2004
Messages
131 (0.02/day)
Location
End of the Redline
Hey Dave, nice bit of info and nice work with this, but to be honest, none of the "regular" users, or should I say overclockers, really need this much info ;)

It really makes it easier for people to understand how the technology works, but in practice... well, you get the point :laugh:
 

Ketxxx

Heedless Psychic
Oh, there is a point to it ;) if people can fully understand it :D (or learn to). As I said, to really make use of the gained knowledge you need access to advanced BIOS configuration options (such as DQS drive strength, CKE/ODT fine delay, etc.).
 

pt

not a suicide-bomber
Be honest, how much of that did you manage to understand? :p

Just skimmed through it (a bit busy today), but honestly, not much :p
 
Joined
Nov 10, 2005
Messages
1,540 (0.23/day)
Location
Athens - Hellas
Processor C2D E7600
Motherboard GA EG41M ES2L
Memory 2GB ADATA 800MHZ
Video Card(s) ASUS HD4350
Storage OCZ VERTEX TURBO 32GB + 2 MORE..
Display(s) SM226BW
Audio Device(s) 7.1 HD / X-540
Software 7X86 ULT
REALLY nice job Ketxxx..
Let's see if we can get a bit more "juice" out of our sticks.
 

Canuto

New Member
Joined
Jul 8, 2006
Messages
2,153 (0.33/day)
Location
Portugal
Processor Pentium D 930 @3.6Ghz
Motherboard Biostar 945P-A7A(8.0)
Cooling Stock cooling, 1x92mm outtake fan
Memory Infineon 2x512Mb DDR2-533 @641Mhz 4-4-4-12
Video Card(s) Powercolor X550 512Mb Hipermemory @510/261Mhz
Storage Seagate Barracuda 7200.7 200Gb
Display(s) iMax 17" LCD
Audio Device(s) Realtek AC'97 ALC655
Power Supply LC Power 550w Silent Giant GREEN POWER
Software Windows Vista Ultimate x86
How about adding this to the wiki Ket?
 

Ketxxx

Heedless Psychic
If I did that I'd have to re-write it. As I said in my first post, it's just taken from my portfolio, which has both my own work and stuff I found and pasted into a file for reference. I suppose I could tidy up my first post to make the paste job easier to follow and stick the diagrams in. I'll admit that if I re-wrote it, it would be much easier to understand, but I don't have time to re-write it at the mo with uni, learning C++, Java, etc., plus a mass of paperwork for grants.
 

Wile E

Power User
Joined
Oct 1, 2006
Messages
24,318 (3.81/day)
System Name The ClusterF**k
Processor 980X @ 4Ghz
Motherboard Gigabyte GA-EX58-UD5 BIOS F12
Cooling MCR-320, DDC-1 pump w/Bitspower res top (1/2" fittings), Koolance CPU-360
Memory 3x2GB Mushkin Redlines 1600Mhz 6-8-6-24 1T
Video Card(s) Evga GTX 580
Storage Corsair Neutron GTX 240GB, 2xSeagate 320GB RAID0; 2xSeagate 3TB; 2xSamsung 2TB; Samsung 1.5TB
Display(s) HP LP2475w 24" 1920x1200 IPS
Case Technofront Bench Station
Audio Device(s) Auzentech X-Fi Forte into Onkyo SR606 and Polk TSi200's + RM6750
Power Supply ENERMAX Galaxy EVO EGX1250EWT 1250W
Software Win7 Ultimate N x64, OSX 10.8.4
Awesome f-in post Ket! Add another vote for "add to wiki". lol
 