
CISC vs RISC - Does it affect cooling?

There you go, no example necessary.

At this point I am fairly convinced you never had an example in mind to begin with.

Implementing an instruction takes transistors. CISC has more instructions.

As I explained above, the difference this makes in real world is practically zero. Half or more of most CPU's transistor budget these days is just the cache for crying out loud.

EPYC 7742, a 64-core x86 CPU, is 32 billion transistors, and Graviton2, an ARM-based CPU, also 64 cores, is 30 billion transistors. Not quite what you'd expect, is it?
 
Does anyone know if that's actually true or not?
No. Most Intel Atoms are passively cooled, for example.

All processors have thermal/power design envelopes to meet. It just happens that most CISC processors target high power applications.

Superscalar processors (like Ryzen and Core) have CISC frontends with RISC execution units.


The fundamental difference between CISC and RISC is that CISC takes care of a lot of memory operations internally where RISC doesn't. That allows complex operations to be micromanaged in integrated circuits to accelerate them. Because of those circuits, for example, CPUs can do a much better job at hardware-transcoding video streams than the ASICs which RISC architectures like ARM use; those tend to sacrifice accuracy for performance.
 
Only on the front end for decoding though, the back end is RISC (K.)

I'd say RISC won the battle between CISC and RISC when the back end of an x86 processor decodes to an internal RISC ISA.

Again, the waters are muddied to oblivion. But still, implementing that takes extra transistors: the point stands.

At this point I am fairly convinced you never had an example in mind to begin with.

You dismissed the only example I could provide besides logical ones as "pointless without power consumption figures"

So until I establish that fewer instructions cost less power than more, showing you assembly instruction listings is pointless.

As I explained above, the difference this makes in real world is practically zero.

...

Alrighty then. Sure. Never mind. ISA doesn't take much die space at all. Ignore Google.
 
So until I establish that fewer instructions cost less power than more, showing you assembly instruction listings is pointless.

Fewer instructions does generally mean less power, but that's meaningless if you want to talk about power consumption in real-world use cases. When you write something, the purpose is to get something done; that something needs a minimum amount of computation, computation which isn't fundamentally different depending on the ISA used.

In other words, if something needs, say, X instructions on a CISC processor, there is very little chance it would need fewer than X instructions on a RISC architecture, because by definition the ISA is less capable and on average you'll need either the same number or more instructions to get the same thing done. RISC is supposedly more efficient when the computation is less complicated, but life's a bitch and that's not always the case; in fact most of the time it's not, computers do all sorts of complicated shit nowadays. As I keep saying, this was true once but not anymore. And by the way, RISC doesn't really mean fewer instructions are needed.

The reason you can't convince me that something like ARM is somehow intrinsically more power efficient is because of this:

Code:
void func(double &a, double &b)
{
    b = a * a;
}

The assembler output for the function above looks like this for x86 and ARM64 on GCC with no flags:

Code:
push rbp
mov rbp, rsp
mov QWORD PTR [rbp-8], rdi
mov QWORD PTR [rbp-16], rsi
mov rax, QWORD PTR [rbp-8]
movsd xmm1, QWORD PTR [rax]
mov rax, QWORD PTR [rbp-8]
movsd xmm0, QWORD PTR [rax]
mulsd xmm0, xmm1
mov rax, QWORD PTR [rbp-16]
movsd QWORD PTR [rax], xmm0
nop
pop rbp
ret
Code:
sub sp, sp, #16
str x0, [sp, 8]
str x1, [sp]
ldr x0, [sp, 8]
ldr d1, [x0]
ldr x0, [sp, 8]
ldr d0, [x0]
fmul d0, d1, d0
ldr x0, [sp]
str d0, [x0]
nop
add sp, sp, 16
ret

If for something this basic there is almost no difference in terms of the instructions required, how could you still argue one is more efficient than the other in any real manner?

Alrighty then. Sure. Never mind. ISA doesn't take much die space at all. Ignore Google.

I just gave you an example of two comparable CPUs, one x86 and one ARM, that are within 6% of each other in terms of transistors used. That's as close as you can ever get to a fair comparison; I did my best to provide evidence that the ISA has little impact on the way CPUs are actually built, Google and all. What does make a difference is what they're built for, where they're going to be used, and what the constraints are.

I guess you could tell me that this 6%, incidentally, must be because of the different ISAs ...
 
I just gave you an example of two comparable CPUs, one x86 and one ARM, that are within 6% of each other in terms of transistors used.

Yes. Clock speeds aside, I have been repeatedly stating this debate is purely conceptual, as the lines between CISC and RISC are nowadays purely academic. Nearly everything but MIPS and a few odd ducks blurs the lines now.

Also this. RISC ain't what it used to be. POWER is nearly at CISC-level instruction counts. Back in the day, some RISCs lacked a hardware integer multiply. That obviously changed.
 
Ahh, but with a CISC processor, it wouldn't have to break up the multiply:

Code:
# function(x, y, *lower, *higher)
movq %rx, %rax      # store x into %rax
mulq %y             # multiply %rax by %y
                    # mulq stores the high and low halves into rdx and rax
movq %rax, (%r8)    # move the low half into *lower
movq %rdx, (%r9)    # move the high half into *higher

That's a 64-bit multiply, in x86-64 code.
The Fmul instruction is 7 clocks max, IIRC. (Last time it mattered to me was ~90's, lol)

A sequential-add approach scales with the size of the operand.

That's the complex part of CISC: dedicated math units, multiply instructions, and more.

Now, doing it as ADD instructions might be faster in some architectures, but it probably isn't.

Smaller multiply ops are easier.

EDIT: I see we're saying the same thing, in the code.

EDIT2: The ARM processors are no longer what I'd consider a RISC processor at all; I've missed all the added functionality with the later updates to the core technology.
I found the newer opcode listings; the list I posted above is no longer definitive.

I think we're down to arguing the same argument as Intel vs M68k, from 30 years ago. :)

The architecture is different; some like one, others like others.

Switching assembler languages was always like switching to Spanish or German, in my brain; ARM is like learning French, by comparison.

PIC for me was like learning a trade language; few words, but you can get the important stuff through. (You giva me money, I giva you beer.) :)

If you write in C or other languages, it really doesn't matter anyway; the compiler is the only one who knows where it goes. :D
 
Well, I dug out an LG G Stylo phone with a cracked screen. Updating the system software so I can hook it up to the PC, (hopefully) get files off it, and go from there.
Will obtain system specs of the phone and get some temp readouts, then configure a way to cool it. The battery is in the way, so this will take some modifications.
Right up my alley, will be a fun side project. Android is upgrading 1 of 54 items..... it's gonna take a while lol.

I don't know what you guys are talking about and the effects of writing the letter C and how it affects cooling situations with these types of processors.
 
The problem is that most of the assembly is for the function itself (prologue and epilogue)...
Code:
x86 SSE                           ARM                 Description
push rbp                          sub sp, sp, #16     loading the function  (#16 is 16 bytes, 8 for each double parameter)
mov rbp, rsp                                          points to top of stack which is effectively the same as #16 for ARM above
mov QWORD PTR [rbp-8], rdi        str x0, [sp, 8]     a pointer
mov QWORD PTR [rbp-16], rsi       str x1, [sp]        b pointer
mov rax, QWORD PTR [rbp-8]        ldr x0, [sp, 8]     move the value at (a) pointer to register
movsd xmm1, QWORD PTR [rax]       ldr d1, [x0]        move the register value to the floating point register (1)
mov rax, QWORD PTR [rbp-8]        ldr x0, [sp, 8]     move the value at (a) pointer to register
movsd xmm0, QWORD PTR [rax]       ldr d0, [x0]        move the register value to the floating point register (0)
mulsd xmm0, xmm1                  fmul d0, d1, d0     multiply the two registers (0*1), with the result being stored in (0)
mov rax, QWORD PTR [rbp-16]       ldr x0, [sp]        move the value at (b) pointer to register
movsd QWORD PTR [rax], xmm0       str d0, [x0]        move the floating point register (0) result to (b) address
nop                               nop                 no operation...not sure why GCC is adding this
pop rbp                           add sp, sp, 16      retiring the function
ret                               ret                 call return
You'd have to add more code that actually does work (especially things that aren't basic math) to see x86's advantage. A good example would be an AVX instruction.
 
A good example would be an AVX instruction.

AVX is used almost exclusively for basic math, well, let's call it just math. Here's a function that adds floats with NEON and with AVX:

Code:
void add_float(float* dst, float* src1, float* src2, int count)
{
     for (int i = 0; i < count; i += 4)
     {
         float32x4_t in1, in2, out;
         in1 = vld1q_f32(src1);
         src1 += 4;
         in2 = vld1q_f32(src2);
         src2 += 4;
         out = vaddq_f32(in1, in2);
         vst1q_f32(dst, out);
         dst += 4;
     }
}
Code:
void add_float(float* dst, float* src1, float* src2, int count)
{
     for (int i = 0; i < count; i += 8)
     {
         __m256 in1 = _mm256_loadu_ps(src1);
         src1 += 8;
         __m256 in2 = _mm256_loadu_ps(src2);
         src2 += 8;
         __m256 out = _mm256_add_ps(in1, in2);
         _mm256_storeu_ps(dst, out);  
         dst += 8;
     }
}
Code:
add_float(float*, float*, float*, int):
        sub     sp, sp, #176
        str     x0, [sp, 24]
        str     x1, [sp, 16]
        str     x2, [sp, 8]
        str     w3, [sp, 4]
        str     wzr, [sp, 172]
.L6:
        ldr     w1, [sp, 172]
        ldr     w0, [sp, 4]
        cmp     w1, w0
        bge     .L7
        ldr     x0, [sp, 16]
        str     x0, [sp, 32]
        ldr     x0, [sp, 32]
        ldr     q0, [x0]
        str     q0, [sp, 144]
        ldr     x0, [sp, 16]
        add     x0, x0, 16
        str     x0, [sp, 16]
        ldr     x0, [sp, 8]
        str     x0, [sp, 40]
        ldr     x0, [sp, 40]
        ldr     q0, [x0]
        str     q0, [sp, 128]
        ldr     x0, [sp, 8]
        add     x0, x0, 16
        str     x0, [sp, 8]
        ldr     q0, [sp, 144]
        str     q0, [sp, 64]
        ldr     q0, [sp, 128]
        str     q0, [sp, 48]
        ldr     q1, [sp, 64]
        ldr     q0, [sp, 48]
        fadd    v0.4s, v1.4s, v0.4s
        str     q0, [sp, 112]
        ldr     x0, [sp, 24]
        str     x0, [sp, 104]
        ldr     q0, [sp, 112]
        str     q0, [sp, 80]
        ldr     x0, [sp, 104]
        ldr     q0, [sp, 80]
        str     q0, [x0]
        ldr     x0, [sp, 24]
        add     x0, x0, 16
        str     x0, [sp, 24]
        ldr     w0, [sp, 172]
        add     w0, w0, 4
        str     w0, [sp, 172]
        b       .L6
.L7:
        nop
        add     sp, sp, 176
        ret
Code:
add_float(float*, float*, float*, int):
        push    rbp
        mov     rbp, rsp
        and     rsp, -32
        sub     rsp, 200
        mov     QWORD PTR [rsp-80], rdi
        mov     QWORD PTR [rsp-88], rsi
        mov     QWORD PTR [rsp-96], rdx
        mov     DWORD PTR [rsp-100], ecx
        mov     DWORD PTR [rsp+196], 0
.L6:
        mov     eax, DWORD PTR [rsp+196]
        cmp     eax, DWORD PTR [rsp-100]
        jge     .L7
        mov     rax, QWORD PTR [rsp-88]
        mov     QWORD PTR [rsp-72], rax
        mov     rax, QWORD PTR [rsp-72]
        vmovups ymm0, YMMWORD PTR [rax]
        vmovaps YMMWORD PTR [rsp+136], ymm0
        add     QWORD PTR [rsp-88], 32
        mov     rax, QWORD PTR [rsp-96]
        mov     QWORD PTR [rsp-64], rax
        mov     rax, QWORD PTR [rsp-64]
        vmovups ymm0, YMMWORD PTR [rax]
        vmovaps YMMWORD PTR [rsp+104], ymm0
        add     QWORD PTR [rsp-96], 32
        vmovaps ymm0, YMMWORD PTR [rsp+136]
        vmovaps YMMWORD PTR [rsp-24], ymm0
        vmovaps ymm0, YMMWORD PTR [rsp+104]
        vmovaps YMMWORD PTR [rsp-56], ymm0
        vmovaps ymm0, YMMWORD PTR [rsp-24]
        vaddps  ymm0, ymm0, YMMWORD PTR [rsp-56]
        vmovaps YMMWORD PTR [rsp+72], ymm0
        mov     rax, QWORD PTR [rsp-80]
        mov     QWORD PTR [rsp+64], rax
        vmovaps ymm0, YMMWORD PTR [rsp+72]
        vmovaps YMMWORD PTR [rsp+8], ymm0
        vmovaps ymm0, YMMWORD PTR [rsp+8]
        mov     rax, QWORD PTR [rsp+64]
        vmovups YMMWORD PTR [rax], ymm0
        nop
        add     QWORD PTR [rsp-80], 32
        add     DWORD PTR [rsp+196], 8
        jmp     .L6
.L7:
        nop
        leave
        ret

Again, the differences are small; modern x86 and ARM are very alike, to the point it's not worth saying one definitely has an advantage. Of course the AVX version is probably more power efficient, since the loop needs to execute fewer instructions as it processes 8 instead of 4 floats at a time, but that's not because of an ISA philosophy difference; ARM just hasn't implemented 256-bit instructions.
 
NEON kind of defeats the point of RISC :)
It adds instructions as well as hardware to back them up.
 
Of course the AVX version is probably more power efficient since the loops needs to go over less instructions as they process 8 instead of 4 floats at a time but that's not because of an ISA philosophy difference, ARM just hasn't implemented 256bit instructions.
But that's exactly the point: x86 will get it done faster using less power. ARM will take substantially more time to do it.
 