"Slapping on a lot of L3" is one of the most scalable answers AMD currently has. L1 cache is one of the most expensive in terms of die size to increase usually due to its width and L2 isnt much better either. Also L1 and L2 are a per core change where as the L3 is something either every core can share or be dedicated to a single thread on the fly so in single/lightly threaded workloads you may be commiting a lot of die size for only a small performance improvement.
That's not how it works at all.
Even though L3 is cheaper per MB, L2 and L3 work very differently: L3 only contains recently discarded cache lines. Most workloads see no benefit beyond a certain L3 size (or even a performance penalty), because at that point you'll almost only get hits on instruction cache lines, which is why we mostly see the benefit in e.g. low-resolution gaming and certain other edge cases. Adding even more L3 on top of that would see even smaller benefits.
A larger L2, on the other hand, lets the prefetcher work completely differently, and going from e.g. 1 MB to 2 MB of L2 would have a larger impact for most workloads than adding 64 MB of L3. But L2 is much more deeply connected to the pipeline, so adding more isn't trivial.
In regards to branch prediction, it is always an area being improved upon, but there are always going to be prediction failures causing a cache miss.
You are mixing up the terms a bit here.
A branch misprediction doesn't necessarily cause a cache miss, though it certainly can indirectly. If the block of code is dense and contains no function calls or pointers needing to be dereferenced, the penalty will only be the stalled and flushed pipeline plus the decoding overhead. There can, however, be a function call there that isn't in cache, leading to a whole sequence of cache misses, so in that sense it can cause a chain reaction.
And obviously, cache misses can happen without any misprediction if the front-end simply can't keep up. This is a case where a larger L2 will help somewhat, but more L3 will not, as L3 is just discarded L2; if the front-end is the bottleneck, more L3 isn't going to make you prefetch more.
While the front-end is always being improved upon, AMD has stayed a couple of steps behind Intel for the entire Zen family. Their back-end is much better though, which is why we see Zen excel with large batch loads, while Intel retains an edge in responsiveness in user-interactive applications.
IMO, until there is a major rework of IF with something like the patent I posted earlier, this will always be THE key limitation of the design, and it will only become more of an issue as core counts per CCD increase.
I fail to see how the performance data and specs point to the IF being the key limitation in the design. Zen 5 scales brilliantly with heavy AVX-512 loads on all cores, which is very data intensive but easy on the front-end, yet it comes up short in logic-heavy code. And while every tiny reduction in latency will help a tiny bit, improving the front-end and its associated cache feeding the pipeline (L2) will have several orders of magnitude more impact. It doesn't just improve the cache hits in L2 (cache misses in L1); it also helps when the front-end doesn't fail, since the front-end can then keep up the prefetching so no cache miss happens at all, which is something L3 doesn't help with.