I have necro posted because I feel this question is still tossed around every few months as of 2020.
I am a former biochemist and currently a solo game developer who has studied the background processes for over 5 years.
Frames Per Second is a direct reflection of the time required to produce 1 frame of a rendered image: it is the number of frames that can be finished in 1 second. To render is to produce a result through a collection of sub-processes. Whether it is a [still image] or a [Real Time render], the Frames Per Second is determined by all the steps required to produce each pixel on the screen. The objects present within a given three-dimensional coordinate system remain invisible unless the area between each set of 3 vertices is commanded to be drawn. The area between 3 vertices is called a triangle. Every 2D and 3D object is made up of a collection of triangles, some overlapping and some sharing adjacent vertices, to produce a surface area. Each surface area is then mapped onto a new 2D system labeled UV (UVL for 3D; the third axis is more commonly written W).
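To make the triangle idea concrete, here is a minimal sketch of a flat square built from 2 triangles that share 2 vertices, with a UV coordinate per vertex. The names (`vertices`, `uvs`, `triangles`, `shared_vertices`, `triangle_area`) are made up for illustration and do not come from any particular engine.

```python
import math

# A unit square in XYZ space: 4 vertices, each with a matching UV coordinate.
vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 0.0)]
uvs = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]

# Each triangle is 3 indices into the vertex list.
triangles = [(0, 1, 2), (0, 2, 3)]


def shared_vertices(tri_a, tri_b):
    """Indices of the vertices that two triangles have in common."""
    return sorted(set(tri_a) & set(tri_b))


def triangle_area(a, b, c):
    """Area of a 3D triangle: half the length of the cross product of two edges."""
    ab = [b[i] - a[i] for i in range(3)]
    ac = [c[i] - a[i] for i in range(3)]
    cross = (
        ab[1] * ac[2] - ab[2] * ac[1],
        ab[2] * ac[0] - ab[0] * ac[2],
        ab[0] * ac[1] - ab[1] * ac[0],
    )
    return 0.5 * math.sqrt(sum(x * x for x in cross))


surface_area = sum(
    triangle_area(vertices[i], vertices[j], vertices[k]) for i, j, k in triangles
)

print(shared_vertices(triangles[0], triangles[1]))  # [0, 2]
print(surface_area)  # 1.0
```

The two triangles reuse vertices 0 and 2 instead of storing them twice, which is the "sharing adjacent vertices" part.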
Much like the XYZ system, the UVL system is a coordinate system, but it is used specifically to describe a surface area. The UVL coordinates go through many shader calculations to produce a still image or even a scripted image. This final image is laid over the surface area of a 3D object in the XYZ coordinate system. Each sample of the image laid over the object's collection of triangles is called a texel (texture element). In the simplest case, a single triangle carries a single texel. This texel is then converted to one or more pixels on screen. Depending on the size of this single triangle in view of the Render Camera, this single texel can cover up to 100% of the pixels available across the height and width of the screen (monitor resolution). The responsibility of the GPU (aka the graphics card) is to quickly read the saved information about the texels of an object, find out its position in 3D space, then render it using however many pixels the object covers in 2D space (screen space). If this real-time object contains several thousand texels, the object's size and distance directly impact the time to render a 2D image on screen.
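A rough sketch of the texel-to-pixel relationship, assuming a texture stored as a plain 2D list and the simplest possible lookup (nearest texel). `sample_nearest` and `texels_per_pixel` are hypothetical helper names, not a real graphics API.

```python
def sample_nearest(texture, u, v):
    """Return the texel at UV coordinate (u, v), with u and v in [0, 1]."""
    height = len(texture)
    width = len(texture[0])
    # Scale UV to texel indices and clamp so u == 1.0 stays inside the texture.
    x = min(int(u * width), width - 1)
    y = min(int(v * height), height - 1)
    return texture[y][x]


def texels_per_pixel(tex_width, tex_height, pixels_covered):
    """How many texels compete for each screen pixel the object covers."""
    return (tex_width * tex_height) / pixels_covered


texture = [
    ["red", "green"],
    ["blue", "white"],
]

print(sample_nearest(texture, 0.25, 0.25))  # red
print(sample_nearest(texture, 0.75, 0.25))  # green

# A 1000 x 1000 texture squeezed into 1 pixel versus stretched over 4 million.
print(texels_per_pixel(1000, 1000, 1))          # 1000000.0
print(texels_per_pixel(1000, 1000, 4_000_000))  # 0.25
```

A ratio far above 1 means many texels land in the same pixel; a ratio below 1 means one texel is spread over several pixels.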
For example, let's say we have a rainbow image used to describe the surface area of a single object with 1 million texels. The further away this object is placed from the Rendering Camera, the fewer pixels it needs to render a 2D result on screen; however, the time to produce the result increases dramatically. I call this pixel fighting. If the rainbow had to be described in 1 pixel on screen, what color would the GPU pick: red, blue, white? In other words, if each of the 1 million texels held a single color and all of them had to fit in 1 pixel, what color would the GPU choose to represent that rainbow object? The time required is a direct reflection of how fast the GPU can calculate a result for 1 million texels jammed into 1 pixel. As this object moves closer to the rendering camera, it becomes bigger in 2D space, covers more pixels, and becomes easier to render. Likewise, the fewer the texels, the quicker it is to produce the "correct" result within X many pixels. If an object is a single color, red, then regardless of where it is in 3D space, it will always be red. The GPU will know to render red regardless of the object's size, its position in relation to the Render Camera, and the number of texels it contains. This is why when an entire game is 1 shade of grey, or in our case 1 shade of red, the Frames Per Second is lightning quick. The Frames Per Second is a direct reflection of the number of texels fighting for however many pixels are on screen.
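One way to picture the rainbow question is to average every texel that lands in the pixel. This is only a sketch of a plain box filter written for illustration; real GPUs use their own filtering hardware, and `average_texels` is a made-up name.

```python
def average_texels(texels):
    """Collapse a list of (r, g, b) texels into a single pixel color by averaging."""
    n = len(texels)
    return tuple(round(sum(t[c] for t in texels) / n) for c in range(3))


# 3000 texels of "rainbow" (equal parts red, green, blue) fighting for 1 pixel.
rainbow = [(255, 0, 0), (0, 255, 0), (0, 0, 255)] * 1000

# 3000 texels that are all the same red.
solid_red = [(255, 0, 0)] * 3000

print(average_texels(rainbow))    # (85, 85, 85)
print(average_texels(solid_red))  # (255, 0, 0)
```

In this sketch the rainbow collapses into a flat grey that matches none of its texels, while the solid red object gives the same answer no matter how many texels are read.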
If a black cloud moves over a red ball, consider how many pixels are being used to view this interaction (in other words, how close are these 2 objects to the camera?), then how many texels are fighting to represent a particular color. Next, take this reality and expand it to an entire game. How often are objects pixel fighting? Don't forget to include the influence of the lights and their hue. Are they casting shadows? If so, add in the time it takes to calculate the shadow and its color. Is this shadow pixel fighting with whatever object it is cast on? Also consider post-process effects such as blurs, glows, and distortions. Are the objects transparent? Are they changing in shape? The GPU handles the result of each pixel. Determining what each pixel should represent depends entirely on how fast the GPU can sort out all the colors in a given pixel. Is an object overlapping or intersecting another? What color is it?
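The cloud-over-ball case can be sketched for a single pixel as follows, assuming the common approach of sorting surfaces back to front and blending each one over what is already there. The function names and the depth/alpha numbers are invented for the example.

```python
def blend_over(src, alpha, dst):
    """Standard 'over' blend: alpha of the source color on top of the destination."""
    return tuple(round(alpha * s + (1 - alpha) * d) for s, d in zip(src, dst))


def resolve_pixel(fragments, background=(0, 0, 0)):
    """Decide one pixel's color from (depth, color, alpha) entries.

    Larger depth means farther from the camera, so we draw far-to-near.
    """
    color = background
    for _depth, frag_color, alpha in sorted(fragments, key=lambda f: f[0], reverse=True):
        color = blend_over(frag_color, alpha, color)
    return color


red_ball = (5.0, (255, 0, 0), 1.0)     # opaque, farther away
black_cloud = (2.0, (0, 0, 0), 0.6)    # 60% opaque, in front of the ball
solid_cloud = (2.0, (0, 0, 0), 1.0)    # fully opaque version of the cloud

print(resolve_pixel([black_cloud, red_ball]))  # (102, 0, 0)
print(resolve_pixel([solid_cloud, red_ball]))  # (0, 0, 0)
```

Every extra overlapping or transparent surface adds another pass through the loop for that pixel, which is the cost the questions above are pointing at.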
In short,
Frames Per Second is the result of pixel fighting and the Time required for the CPU to tell the GPU what to do.
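As a back-of-the-envelope sketch of that summary: if you know roughly how many milliseconds the CPU and the GPU each spend per frame, you can estimate the FPS. The two cases below (the CPU and GPU working at the same time, or one waiting for the other) are simplifying assumptions, and `fps_estimate` is a hypothetical helper.

```python
def fps_estimate(cpu_ms, gpu_ms, overlapped=True):
    """Estimate FPS from per-frame CPU and GPU times in milliseconds.

    overlapped=True assumes the CPU prepares the next frame while the GPU
    draws the current one, so the slower of the two sets the pace.
    overlapped=False assumes they run one after the other.
    """
    frame_ms = max(cpu_ms, gpu_ms) if overlapped else cpu_ms + gpu_ms
    return 1000.0 / frame_ms


print(fps_estimate(4.0, 10.0))                    # 100.0
print(fps_estimate(4.0, 16.0, overlapped=False))  # 50.0
```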
Memory is just a way to shortcut work that has already been done. For example, every time this 1 object appears on screen, the CPU doesn't have to keep telling the GPU that this object is blue. Instead, it stores that info in memory so the result can be looked up quickly. The computer has its own memory to remember the scripts it is reading, and the GPU has its own memory to remember the data describing the object.
I'm afraid this is all I am going to give (which is a SHIT TON); any more and I would be giving away the secret sauce of how to develop games better. Not all things can be taught.