Video encoding is really complicated, and hard to follow without doing some research.
Most encoders compress information by taking a reference frame and then using motion vectors to predict blocks of the next frame. The difference between the predicted frame and the actual frame (the residual) is encoded instead of the frame itself. Some frames (bidirectional, or B-frames) even take information from the next frame...
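A toy sketch of that idea (real encoders search for the best motion vector per block; here the vector is simply given, and frames are plain lists of sample values):

```python
def predict_block(ref, x, y, mv, size=4):
    """Predict a size x size block at (x, y) by copying from the
    reference frame, displaced by motion vector mv = (dx, dy)."""
    dx, dy = mv
    return [[ref[y + dy + j][x + dx + i] for i in range(size)]
            for j in range(size)]

def residual(current, predicted):
    """What actually gets encoded: current block minus prediction.
    If the prediction is good, this is mostly small numbers (or zeros),
    which compress far better than raw pixel values."""
    return [[c - p for c, p in zip(crow, prow)]
            for crow, prow in zip(current, predicted)]
```

If the scene just panned by one pixel, the residual is all zeros and costs almost nothing to store.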
Now, a frame itself is encoded in some colour space. A very naive colour space would be RGB, where the three values per pixel each carry roughly the same amount of information.
We know a pixel is very likely to be similar in colour to its neighbours, and we humans perceive the black/white (luminance) picture in more detail than the colour itself. Thus it's wise to spend more bits on the black/white picture than on the colour.
Take the YUV colour space for example: Y contains the picture in black/white, while U and V (chroma) just add the colour on top.
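As a rough sketch of that split, here's an RGB to YUV conversion using the BT.601 coefficients (one common choice; other standards like BT.709 use slightly different weights):

```python
def rgb_to_yuv(r, g, b):
    # Luma: a weighted sum of R, G, B. Green dominates because
    # it contributes most to perceived brightness.
    y = 0.299 * r + 0.587 * g + 0.114 * b
    # Chroma: scaled differences between blue/red and the luma.
    # For a grey pixel (r == g == b) both come out as zero.
    u = 0.492 * (b - y)
    v = 0.877 * (r - y)
    return y, u, v
```

Note that for grey input the chroma channels are exactly zero, which is why they can be stored coarsely without hurting the picture much.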
You often see a colour space like Y'CbCr followed by a number, for example 4:2:0, when encoding. This is chroma subsampling. It means that in a block 4 pixels wide and 2 rows tall, you keep all 4 Y samples in each row, but only 2 Cb/Cr samples on the first row and 0 Cb/Cr samples on the second row. You get quite high compression, but still high perceived image quality.
As you can see, for this subsampling method to work the frame dimensions need to be a multiple of the block size.
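A minimal sketch of the chroma downsampling step (averaging each 2x2 block is one common choice; real encoders may use different filters and chroma siting):

```python
def subsample_420(chroma):
    """Downsample a chroma plane 2x horizontally and vertically
    by averaging each 2x2 block, as in 4:2:0. Assumes even
    width and height, matching the fixed-block requirement."""
    h, w = len(chroma), len(chroma[0])
    out = []
    for y in range(0, h, 2):
        row = []
        for x in range(0, w, 2):
            total = (chroma[y][x] + chroma[y][x + 1]
                     + chroma[y + 1][x] + chroma[y + 1][x + 1])
            row.append(total / 4)
        out.append(row)
    return out
```

The luma plane is left at full resolution; only Cb and Cr go through this, so the frame shrinks from 3 samples per pixel to 1.5.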
Now, everything is tied to blocks and resolution. A common way to lower bitrate is to lower the resolution, since you keep the 'optimal' parameters of the rest of the system (block size, motion estimation, etc.) for the specific content of the scene.
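To put rough numbers on that, here's the raw (uncompressed) data rate of 4:2:0 video at two resolutions; halving both dimensions cuts the raw rate to a quarter, which is the headroom the encoder gets to work with:

```python
def raw_bitrate_420(width, height, fps, bits=8):
    # 4:2:0: one full-resolution luma plane plus two quarter-resolution
    # chroma planes = 1.5 samples per pixel on average.
    samples_per_frame = width * height * 1.5
    return samples_per_frame * bits * fps  # bits per second

full = raw_bitrate_420(1920, 1080, 30)  # 1080p30
half = raw_bitrate_420(960, 540, 30)    # same content, half resolution
```

The compressed bitrate won't scale exactly 4x, but the principle is the same: fewer pixels means fewer bits for the same quality per pixel.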
I agree it doesn't have to be the resolution that gets lowered, but it's the easiest knob to turn.
My apologies if I got anything wrong; it's been a while since I've worked with this.
