I understand what a Branch Predictor is supposed to do fundamentally, but since I have never programmed software I don't know how much it affects IPC.
To understand why branch predictors are important you need to know a little more about CPUs, in particular pipelined, superscalar CPUs. Pretty much every modern core is pipelined: instructions are processed in a series of stages, and as soon as one instruction moves from one stage to the next, the instruction behind it moves in, so the next instruction starts getting processed even though the last one isn't finished and is still working its way through the pipeline. On top of that, most cores today are also superscalar, meaning they can feed more than one instruction into the pipeline per clock cycle.
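To make the overlap idea concrete, here's a hypothetical C sketch (not tied to any particular CPU): the first function is a chain where every step needs the previous result, so the stages can't overlap much, while the second is independent work that a pipelined, superscalar core can keep several pieces of in flight at once.

```c
#include <stdint.h>

/* Dependent chain: each step needs the previous result, so the core
 * can't start the next multiply until the current one finishes. */
uint64_t dependent_chain(uint64_t x) {
    x = x * 3 + 1;
    x = x * 3 + 1;
    x = x * 3 + 1;
    x = x * 3 + 1;
    return x;
}

/* Independent work: these four computations don't depend on each
 * other, so a pipelined (and superscalar) core can have several of
 * them in different stages at the same time. */
uint64_t independent_work(uint64_t a, uint64_t b, uint64_t c, uint64_t d) {
    a = a * 3 + 1;
    b = b * 5 + 1;
    c = c * 7 + 1;
    d = d * 9 + 1;
    return a + b + c + d;
}
```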
...but what in the world does this have to do with branch prediction?! ...a lot.
When you have a bunch of instructions lined up in the pipeline, it's really important to know exactly which instruction comes next. The branch predictor's job is to guess whether a branch will be taken or not. A branch is a conditional jump. For example, in HCS12 assembly the BNE instruction stands for "branch if not equal": after comparing accumulator A and accumulator B, if they are not equal, execution jumps to the memory location given in the instruction's operand; otherwise it falls through to the next instruction. So you constantly run into cases where whether you branch or not depends on data that may not even have been calculated yet.
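At the source level, a branch is just an `if` or a loop condition. A minimal C sketch (hypothetical, just for illustration):

```c
/* The if() below typically compiles to a compare followed by a
 * conditional branch, much like CBA followed by BNE on the HCS12.
 * Which way the branch goes depends on x, which the CPU may not
 * have finished computing when the branch enters the pipeline. */
int abs_value(int x) {
    if (x < 0)      /* conditional branch: taken or not, based on data */
        x = -x;
    return x;
}
```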
When a branch prediction misses, you get a pipeline flush. Basically the CPU has to toss away everything it speculatively put into the pipeline after the branch and start over from the mispredicted branch. As a result, you take a performance hit for every misprediction, and the deeper the pipeline (cough, NetBurst and Bulldozer, cough), the bigger the hit.
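You can see the cost from ordinary software. The classic demonstration (a rough sketch; exact timings vary by CPU, and you'd want to build with low optimization like -O1, since at higher levels the compiler may replace the branch with a branchless conditional move and hide the effect) runs the exact same data-dependent branch over random data and then over sorted data. The work is identical, but with sorted data the predictor is right almost every time:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 20)

static int data[N];

static int cmp_int(const void *a, const void *b) {
    return *(const int *)a - *(const int *)b;
}

int main(void) {
    for (int i = 0; i < N; i++)
        data[i] = rand() % 256;

    for (int pass = 0; pass < 2; pass++) {
        /* Second pass: sort the data so the branch below becomes
         * predictable (a long run of "not taken" followed by a long
         * run of "taken"); the amount of work stays exactly the same. */
        if (pass == 1)
            qsort(data, N, sizeof data[0], cmp_int);

        clock_t start = clock();
        long long sum = 0;
        for (int rep = 0; rep < 100; rep++) {
            for (int i = 0; i < N; i++) {
                if (data[i] >= 128)   /* data-dependent branch */
                    sum += data[i];
            }
        }
        double secs = (double)(clock() - start) / CLOCKS_PER_SEC;
        printf("%s data: %.2f s (sum=%lld)\n",
               pass == 0 ? "random" : "sorted", secs, sum);
    }
    return 0;
}
```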
Say I were to remove the Branch Predictor from a Haswell core. By what factor would IPC drop? What would be the effect on die size?
How badly you would hurt performance depends on the workload. Branch-heavy workloads would suffer enormously; straight-line, compute-heavy workloads much less so.
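A rough illustration of the difference (hypothetical C, just a sketch): the first loop takes a hard-to-predict, data-dependent branch on every element, so it leans heavily on the predictor; the second is straight-line arithmetic whose only branch is the loop-back branch, which is trivial to predict.

```c
#include <stddef.h>

/* Branch heavy: one hard-to-predict, data-dependent branch per element.
 * With no branch predictor the pipeline would stall on every one. */
long branchy_sum(const int *data, size_t n) {
    long acc = 0;
    for (size_t i = 0; i < n; i++) {
        if (data[i] & 1)    /* odd or even? the predictor has to guess */
            acc += data[i];
        else
            acc -= data[i];
    }
    return acc;
}

/* Compute heavy: only the loop-back branch, which is trivially
 * predictable (taken n-1 times in a row, then not taken once). */
long squares_sum(const int *data, size_t n) {
    long acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (long)data[i] * data[i];
    return acc;
}
```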
Say I were to take the Branch Predictor in Haswell and magically make it perfect. By what factor would IPC increase?
It depends on the workload; the improvement is entirely conditional on the code that's running, so there is no single meaningful number that could be given here.
Do GPU cores use Branch Predictors as well?
GPU cores tend to be much more basic than CPU cores. As far as I know they don't do branch prediction at all: instead of speculating, a GPU hides latency by juggling a huge number of threads, so when one group of threads hits a branch it just works on other threads in the meantime, and divergent branches within a group are handled by running both paths with some lanes masked off. Nothing is fetched speculatively, so there's nothing to predict.
The branch predictor will never be perfect; it can only predict the outcome, because the data it needs to know may not even have been computed yet. The only way to get a "perfect" branch predictor is a pipeline of depth 1, or in other words, no pipeline. With no pipeline, every branch condition is fully resolved before the next instruction is fetched, so there is nothing to guess and every branch goes the right way.
Edit: Mispredictions also hurt more when the pipeline is really long. Every extra stage means more speculative work in flight that can turn out to be wrong, and the longer the pipeline, the more cycles it takes to flush and refill after a miss.
Edit 2: Branch prediction happens in the CPU itself. Even a driver developer doesn't have to think about it, because it's handled entirely in hardware without any software intervention.