Forum: Poser - OFFICIAL


Subject: Shocking Disappointment or Joy delivered?

Photopium opened this issue on Apr 05, 2004 · 36 posts


jwbaer posted Mon, 05 April 2004 at 6:37 PM

As who3D indicates, the memory bottleneck is one of the biggest performance hits in any machine. This is particularly true in graphics apps, which tend to be very memory hungry.

Instructions are executed in a pipeline in a CPU. The number of pipeline stages varies between architectures, but a grossly simplified version is something like: fetch instruction, decode instruction, fetch data, perform instruction, store data, update program counter. Each stage takes some number of clock cycles. Modern CPUs can execute multiple instructions at once because they stagger the instructions by pipeline stage: when the first instruction moves from the fetch stage to the decode stage, the next instruction can enter the fetch stage, and so on. Thus, the more deeply pipelined the architecture, the more instructions can be in flight at once.

However, sometimes a stage like fetch data can take a LARGE number of clock cycles because of the relatively slow access to memory. When this happens, the entire pipeline winds up stalling until the fetch is completed and that instruction moves on to the next stage (the first sketch at the end of this post shows this stall in action). This is a major reason that CPU makers try to squeeze larger and larger on-chip caches onto CPUs. The cache stores the most recently accessed data, so that hopefully the next time the CPU needs a piece of data, it will be in the fast-to-access cache and the CPU won't have to go out to main memory for it. Generally, a system has multiple caches, organized in levels: the fastest to access are the smallest, and they get progressively larger and slower. So there may be up to 3 levels of cache before you get to main memory.

Unfortunately, most people's use of modern graphical applications overwhelms on-chip (and even off-chip) caches pretty quickly, so there is a lot of main-memory access. This is where algorithm designers can do a little to help. If you can design algorithms that have good locality of memory access, then caches will help more (the second sketch below shows how much the access pattern alone matters). In graphics apps, you are still going to have to access much more memory than will fit in a cache. But if you have multiple operations that will hit a particular memory location, you want to design the algorithm to do all of those operations at once to each piece of data, rather than doing one operation to all the data, then the second operation to all the data, and so on.

For example, in 2D image processing, suppose you are performing a simple emboss operation and converting to grayscale. You could do that faster by iterating over all the pixels in the image once and, for each pixel, calculating the emboss result and then the grayscale conversion, rather than iterating over the pixels twice and doing the emboss operation in the first pass and the grayscale operation in the second. There are a couple of reasons: you avoid paying the loop overhead twice, but more importantly, you will probably have to load each pixel's data from main memory only once instead of twice (assuming the image is larger than will fit in the cache, which it likely is). The third sketch below shows what that fused single pass might look like.

OK, wow, I just rambled on a great deal about that. Hope it was interesting to someone :) -Jeremy
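First sketch: to make the stall concrete, here is a minimal C program (the list size, the node padding, and the xorshift shuffle are all choices I made up for illustration). Traversing a randomly ordered linked list serializes the memory fetches, because each load of p->next depends on the address the previous load produced; the CPU cannot overlap them, so nearly every hop is a full pipeline stall waiting on main memory. Compare it with the sequential traversal in the second sketch.

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical list size for illustration: big enough that the
   nodes can't all sit in any level of cache. */
#define N (1u << 20)

/* Padding so each node spans roughly one 64-byte cache line,
   making every hop a fresh line from memory. */
typedef struct Node { struct Node *next; long pad[7]; } Node;

/* Tiny xorshift PRNG, just to shuffle without relying on rand(). */
static uint64_t xs = 88172645463325252ull;
static uint64_t xorshift64(void)
{
    xs ^= xs << 13; xs ^= xs >> 7; xs ^= xs << 17;
    return xs;
}

int main(void)
{
    Node *nodes = malloc(sizeof(Node) * N);
    uint32_t *order = malloc(sizeof(uint32_t) * N);
    if (!nodes || !order) return 1;

    /* Visit the nodes in a random order, so consecutive hops land
       at unrelated addresses and nearly every one misses cache. */
    for (uint32_t i = 0; i < N; i++) order[i] = i;
    for (uint32_t i = N - 1; i > 0; i--) {
        uint32_t j = (uint32_t)(xorshift64() % (i + 1));
        uint32_t t = order[i]; order[i] = order[j]; order[j] = t;
    }
    for (uint32_t i = 0; i + 1 < N; i++)
        nodes[order[i]].next = &nodes[order[i + 1]];
    nodes[order[N - 1]].next = NULL;

    /* Each load of p->next depends on the value just loaded, so the
       CPU cannot overlap the fetches: the pipeline stalls for a full
       trip to main memory on nearly every hop. */
    long hops = 0;
    for (Node *p = &nodes[order[0]]; p != NULL; p = p->next)
        hops++;
    printf("%ld hops\n", hops);

    free(nodes); free(order);
    return 0;
}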
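Second sketch: a minimal demonstration of locality of access, again with sizes I picked arbitrarily (anything much larger than the cache shows the effect). The two functions do identical arithmetic on identical data; the only difference is the order the memory is touched in, and with it the cache hit rate.

#include <stdio.h>
#include <stdlib.h>

#define ROWS 4096
#define COLS 4096

/* Row-major traversal: walks memory sequentially, so every byte of
   each cache line fetched from main memory actually gets used. */
long sum_row_major(const int *a)
{
    long sum = 0;
    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++)
            sum += a[r * COLS + c];
    return sum;
}

/* Column-major traversal of the same data: consecutive accesses are
   a full row apart, so each one lands on a different cache line and
   the cache helps far less. */
long sum_col_major(const int *a)
{
    long sum = 0;
    for (int c = 0; c < COLS; c++)
        for (int r = 0; r < ROWS; r++)
            sum += a[r * COLS + c];
    return sum;
}

int main(void)
{
    int *a = malloc(sizeof(int) * ROWS * COLS);
    if (!a) return 1;
    for (long i = 0; i < (long)ROWS * COLS; i++)
        a[i] = (int)(i & 0xFF);

    /* Same arithmetic, same result -- only the access order differs. */
    printf("row-major: %ld\n", sum_row_major(a));
    printf("col-major: %ld\n", sum_col_major(a));
    free(a);
    return 0;
}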
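Third sketch: a rough version of the fused single-pass emboss-plus-grayscale. The Image struct, the particular emboss kernel (difference from the upper-left neighbour plus a 128 bias), and the luminance weights are all assumptions for illustration; the point is just that each source pixel is loaded from memory once, where the two-pass version would stream the whole image through the cache twice.

#include <stdlib.h>

/* Hypothetical image layout for illustration: 8-bit RGB,
   three bytes per pixel, rows stored contiguously. */
typedef struct {
    int w, h;
    unsigned char *rgb;   /* w * h * 3 bytes */
} Image;

static unsigned char clamp255(int v)
{
    return (unsigned char)(v < 0 ? 0 : v > 255 ? 255 : v);
}

/* Fused single pass: emboss each pixel and immediately convert the
   result to grayscale while its data is still hot in the cache. */
void emboss_and_gray(const Image *src, unsigned char *gray_out)
{
    for (int y = 0; y < src->h; y++) {
        for (int x = 0; x < src->w; x++) {
            /* Upper-left neighbour; clamp at the image border. */
            int ny = y > 0 ? y - 1 : 0;
            int nx = x > 0 ? x - 1 : 0;
            const unsigned char *p = src->rgb + 3 * (y * src->w + x);
            const unsigned char *n = src->rgb + 3 * (ny * src->w + nx);

            /* Emboss each channel: difference from the neighbour,
               biased to mid-gray. */
            int r = clamp255(p[0] - n[0] + 128);
            int g = clamp255(p[1] - n[1] + 128);
            int b = clamp255(p[2] - n[2] + 128);

            /* Grayscale in the same pass, using the standard
               integer luminance weights. */
            gray_out[y * src->w + x] =
                (unsigned char)((299 * r + 587 * g + 114 * b) / 1000);
        }
    }
}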