How does the GPU work?
The speed of a game is the responsibility of every team member, whatever their role. We 3D programmers have plenty of ways to control GPU performance: we can optimize shaders, trade image quality for speed, use smarter rendering techniques, and so on. There is one aspect, however, that we cannot fully control: the game's art assets.
We hope that artists will create assets that not only look good but also render efficiently. If artists learn a little more about what happens inside the GPU, it can have a big impact on the game's frame rate. If you are an artist and want to understand why things like draw calls, LODs and mipmaps matter for performance, read on. To account for the impact your assets have on performance, you need to know how polygon meshes get from the 3D editor to the game screen. That means understanding the GPU (graphics processing unit), the chip that drives the graphics card and is responsible for real-time 3D rendering. Armed with this knowledge, we will look at the most common performance problems, see why they are problems, and explain how to deal with them.
Before we begin, I want to stress that I will deliberately simplify a lot for the sake of brevity and clarity. In many cases I generalize, describe only the typical case, or leave concepts out entirely. In particular, the idealized GPU described in this article most closely resembles the previous generation (the DX9 era). When it comes to performance, however, all of the considerations below still apply to modern PC hardware and consoles (though perhaps not to all mobile GPUs). Once you understand everything in this article, it will be much easier to handle the variations and complications you encounter later if you want to dig deeper.
Part 1: The rendering pipeline from a bird's eye view
To display a polygon mesh on the screen, it must pass through the GPU to be processed and rendered. Conceptually the path is very simple: the mesh is loaded, the vertices are grouped into triangles, the triangles are converted into pixels, each pixel is assigned a color, and the final image is ready. Let's take a closer look at what happens at each stage.
After the mesh is exported from the 3D editor (Maya, Max, etc.), the geometry is usually loaded into the game engine in two parts: a vertex buffer (VB), which contains a list of the mesh's vertices and their associated properties (position, UV coordinates, normal, color, etc.), and an index buffer (IB), which lists which vertices from the VB are connected to form triangles.
Along with these geometry buffers, the mesh is also assigned a material that determines how it looks and how it behaves under different lighting conditions. For the GPU, this material takes the form of specially written shaders: programs that determine how the vertices are processed and what color the final pixels will be. When choosing a material for the mesh, you set various material parameters (for example, a base color value, or the textures to use for the different maps: albedo, roughness, normal map, etc.). All of these are passed to the shader programs as inputs.
The mesh and material data are processed by the various stages of the GPU pipeline to produce pixels in the final render target (the image that the GPU writes to). This render target can later be used as a texture in subsequent shaders and/or displayed on the screen as the final image of the frame.
For the purposes of this article, these are the important parts of the GPU pipeline, from top to bottom:
Input Assembly. The GPU reads the vertex and index buffers from memory, determines how the vertices are connected to form triangles, and feeds them into the rest of the pipeline.
Vertex Shading. The vertex shader is executed for each vertex of the mesh, processing one vertex at a time. Its main task is to transform the vertex: it takes the vertex position and uses the current camera and viewport settings to calculate where the vertex ends up on the screen.
Rasterization. Once the vertex shader has run for every vertex of a triangle and the GPU knows where the triangle appears on the screen, the triangle is rasterized, that is, converted into a set of individual pixels. The per-vertex values (UV coordinates, vertex color, normal, etc.) are interpolated across the triangle's pixels. So if one vertex of a triangle is black and another is white, a pixel rasterized halfway between them gets the interpolated color, gray.
Pixel Shading. Next, a pixel shader is executed for each rasterized pixel (technically, at this stage it is not yet a pixel but a "fragment", which is why the pixel shader is sometimes called a fragment shader). This shader computes the pixel's color, combining material properties, textures, light sources and other parameters to produce a particular look. There are a great many pixels (a 1080p render target contains more than two million) and each of them must be shaded at least once, so the GPU usually spends a large part of its time in the pixel shader.
Render Target Output. Finally, the pixel is written to the render target, but first it goes through some tests to make sure it is valid. The depth test discards pixels that are farther away than the pixel already in the render target. If the pixel passes all the tests (depth, alpha, stencil, etc.), it is written to the render target in memory.
There is much more to it than this, but that is the basic flow: the vertex shader is executed for each vertex in the mesh, each three-vertex triangle is rasterized into pixels, a pixel shader is executed for each rasterized pixel, and the resulting colors are written to the render target.
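The interpolation that happens during rasterization can be sketched in a few lines of Python. This is only an illustration: the function name `interpolate` and the use of barycentric weights (three weights that sum to 1 and say how close the pixel is to each vertex) are my assumptions about one common way to do it, not a description of any particular hardware.

```python
def interpolate(weights, values):
    """Blend one per-vertex attribute using barycentric weights (which sum to 1)."""
    return tuple(
        sum(w * v[i] for w, v in zip(weights, values))
        for i in range(len(values[0]))
    )

# Hypothetical triangle: one black vertex and two white ones (RGB in 0..1).
colors = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0), (1.0, 1.0, 1.0)]

# A pixel halfway along the edge between vertex 0 (black) and vertex 1 (white).
midpoint = interpolate((0.5, 0.5, 0.0), colors)
print(midpoint)  # (0.5, 0.5, 0.5), i.e. gray
```

The same blend is applied to every per-vertex value (UVs, normals and so on), which is how the pixel shader receives smooth inputs across the triangle.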
The shader programs that define the look of a material are written in a shader programming language such as HLSL. Shaders run on the GPU in much the same way as ordinary programs run on the CPU: they take in data, execute a series of simple instructions that transform it, and output the result. But whereas CPU programs can work with any kind of data, shader programs are designed specifically to work with vertices and pixels. They are written to give the rendered object the look of the material you want: plastic, metal, velvet, leather, etc.
Here is a concrete example: a simple pixel shader that computes Lambert lighting (i.e., simple diffuse only, no specular highlights) for a material color and a texture. This is about as simple as shaders get, and you do not need to understand it; just get a sense of what shaders look like in general.
float3 MaterialColor;
Texture2D MaterialTexture;
SamplerState TexSampler;
float3 LightDirection;
float3 LightColor;
float4 MyPixelShader( float2 vUV : TEXCOORD0, float3 vNorm : NORMAL0 ) : SV_Target
{
    float3 vertexNormal = normalize(vNorm);
    float3 lighting = LightColor * dot( vertexNormal, LightDirection );
    float3 material = MaterialColor * MaterialTexture.Sample( TexSampler, vUV ).rgb;
    float3 color = material * lighting;
    float alpha = 1;
    return float4(color, alpha);
}
A simple pixel shader that calculates basic lighting. Inputs such as MaterialTexture and LightColor are supplied by the CPU, while vUV and vNorm are vertex properties interpolated across the triangle during rasterization.
Here are the generated shader instructions:
The shader compiler takes the program above and generates the following instructions, which are what actually runs on the GPU. The longer the program, the more instructions, and so the more work for the GPU.
dp3 r0.x, v1.xyzx, v1.xyzx
rsq r0.x, r0.x
mul r0.xyz, r0.xxxx, v1.xyzx
dp3 r0.x, r0.xyzx, cb0.xyzx
mul r0.xyz, r0.xxxx, cb0.xyzx
sample_indexable(texture2d)(float,float,float,float) r1.xyz, v0.xyxx, t0.xyzw, s0
mul r1.xyz, r1.xyzx, cb0.xyzx
mul o0.xyz, r0.xyzx, r1.xyzx
mov o0.w, l(1.000000)
Note in passing how isolated shaders are: each one works on a single vertex or pixel and needs to know nothing about the neighboring vertices or pixels. This is intentional, because it lets the GPU process huge numbers of independent vertices and pixels in parallel, which is one of the reasons GPUs process graphics so much faster than CPUs.
We will return to the pipeline shortly to see where things can slow down, but first we need to step back and look at how the mesh and material get to the GPU in the first place. This is also where we meet our first performance hurdle: the draw call.
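To make that isolation concrete, here is a rough Python analogue of the Lambert shader above. It is a sketch for illustration only: the helper names, the light values and the two sample pixels are all made up, and a real GPU obviously does not run Python. The point is that `lambert_pixel` depends only on its own inputs, so each call could run on a different core.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def lambert_pixel(normal, albedo, light_dir, light_color):
    """Shade one pixel; no other pixel's data is read or written."""
    intensity = dot(normalize(normal), light_dir)
    return tuple(a * l * intensity for a, l in zip(albedo, light_color))

# Hypothetical inputs: a light shining along +Z and two unrelated pixels.
LIGHT_DIR = (0.0, 0.0, 1.0)
LIGHT_COLOR = (1.0, 1.0, 1.0)
pixels = [
    ((0.0, 0.0, 2.0), (1.0, 0.5, 0.25)),  # (interpolated normal, albedo)
    ((0.0, 3.0, 4.0), (1.0, 1.0, 1.0)),
]

# The order of evaluation does not matter, which is what allows parallelism.
shaded = [lambert_pixel(n, a, LIGHT_DIR, LIGHT_COLOR) for n, a in pixels]
```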
The CPU and draw calls
The GPU cannot work alone: it relies on the game code running on the computer's main processor, the CPU, to tell it what to render and how. The CPU and GPU are (usually) separate chips that run independently and in parallel. To hit the target frame rate, usually 30 frames per second, both the CPU and the GPU must finish all the work needed to produce a frame within the time available (at 30 fps, that is only about 33 milliseconds per frame).
To achieve this, frames are often pipelined: the CPU takes a whole frame to do its work (processing AI, physics, user input, animation, etc.) and sends its instructions to the GPU at the end of the frame, so that the GPU can work on them during the next frame. This gives each processor the full 33 milliseconds to do its job, at the cost of adding a frame of latency (delay). That can be a problem for very latency-sensitive games such as first-person shooters (the Call of Duty series, for example, runs at 60 fps to reduce the delay between player input and rendering), but usually the player does not notice the extra frame.
Every 33 ms, the final render target is copied and displayed on the screen at VSync, the interval at which the display looks for a new frame to show. If the GPU needs more than 33 ms to render the frame, it misses this window and the monitor has no new frame to display. The result is stuttering or pauses on the screen and a lower frame rate, which we want to avoid. The same thing happens if the CPU takes too long: frames are skipped because the GPU does not receive its commands soon enough to finish its work in time. In short, a stable frame rate depends on both processors, the CPU and the GPU, performing well.
Here the CPU took too long to create the rendering commands for the second frame, so the GPU starts rendering late and misses VSync.
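The frame budget arithmetic is easy to check for yourself. The sketch below is a deliberately simplified model of my own: it assumes every frame starts exactly on a VSync boundary, and the frame times are invented for the example.

```python
import math

def frame_budget_ms(fps):
    """Time available to produce one frame at the given frame rate."""
    return 1000.0 / fps

def vsync_intervals_used(frame_times_ms, fps=30):
    """How many VSync intervals each frame occupies (1 means it was on time)."""
    budget = frame_budget_ms(fps)
    return [max(1, math.ceil(t / budget)) for t in frame_times_ms]

print(round(frame_budget_ms(30), 1))  # 33.3
print(round(frame_budget_ms(60), 1))  # 16.7

# Hypothetical GPU frame times in milliseconds; the third one is too slow.
times = [30.0, 32.5, 36.0, 31.0]
print(vsync_intervals_used(times))  # [1, 1, 2, 1]
```

Note that the slow frame is only a few milliseconds over budget, yet it stays on screen for twice as long, which is why a small overrun is so visible.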
To render a mesh, the CPU issues a draw call, which is simply a series of commands that tells the GPU what to draw. As the draw call passes through the GPU pipeline, the GPU uses the various configurable settings specified with it (mostly determined by the mesh's material and its parameters) to decide how the mesh is rendered. These settings, called GPU state, affect every aspect of rendering and comprise everything the GPU needs to know to render the object. Most importantly for us, GPU state includes the current vertex/index buffers, the current vertex/pixel shader programs, and all the shader inputs (for example, MaterialTexture or LightColor from the shader code example above).
This means that to change a piece of GPU state (for example, to swap a texture or switch shaders), a new draw call must be issued. This matters because draw calls are not free for the CPU. It takes time to set up the required GPU state changes and then to issue the draw call. On top of the work the game engine has to do for every draw call, there is the cost of extra error checking and bookkeeping of intermediate results added by the graphics driver. The driver is an intermediate layer of code, written by the GPU vendor (NVIDIA, AMD, etc.), that translates the draw call into low-level hardware instructions. Too many draw calls put a heavy load on the CPU and cause serious performance problems.
Because of this overhead, we usually have to set an upper limit on the number of draw calls allowed per frame. If the limit is exceeded during gameplay testing, steps must be taken, such as reducing the number of objects or the draw distance. In console games, the number of draw calls is usually kept in the range of 2000-3000 (on Far Cry Primal, for example, we aimed for no more than 2500 per frame). That sounds like a lot, but it also includes special rendering techniques: cascaded shadows, for example, can easily double the number of draw calls in a frame.
As mentioned above, GPU state can only be changed by issuing a new draw call. This means that even if you created a single mesh in the 3D editor, if one half of the mesh uses one texture for its albedo map and the other half uses a different texture, the mesh will be rendered as two separate draw calls. The same applies when a mesh uses several materials: different shaders are needed, so several draw calls are issued.
In practice, a very common source of state changes, and therefore of extra draw calls, is switching texture maps. Typically the whole mesh uses the same material (and therefore the same shaders), but different parts of the mesh have different sets of albedo/roughness maps. In a scene with hundreds or even thousands of objects, using several draw calls per object wastes a significant amount of CPU time and has a large impact on the game's frame rate.
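A toy model shows how quickly this adds up. The sketch below assumes, purely for illustration, that an engine issues one draw call per distinct combination of shader and texture set used by a mesh; the data layout and all the names are hypothetical.

```python
def count_draw_calls(mesh_sections):
    """One draw call per distinct (shader, textures) state used by a mesh."""
    return len({(s["shader"], s["textures"]) for s in mesh_sections})

# A hypothetical house mesh: one material, but two different texture sets.
house = [
    {"shader": "standard", "textures": ("wall_albedo", "wall_rough")},
    {"shader": "standard", "textures": ("roof_albedo", "roof_rough")},
    {"shader": "standard", "textures": ("wall_albedo", "wall_rough")},
]
print(count_draw_calls(house))  # 2

# A thousand such houses, each drawn separately, double the scene's draw calls.
print(1000 * count_draw_calls(house))  # 2000
```

With a budget of 2000-3000 draw calls per frame, a second texture set on a commonly used object can use up most of it.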
A common way to avoid this is to combine all the texture maps used by a mesh into one large texture, often called an atlas. The mesh's UV coordinates are then adjusted so that they look up the right part of the atlas, and the whole mesh (or even several meshes) can be rendered in a single draw call. When building an atlas, take care that adjacent textures do not bleed into each other at the lower mip levels, but this issue is minor compared with the performance benefits of the approach.
A texture atlas from the Infiltrator demo for Unreal Engine
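The UV adjustment itself is a simple scale and offset. Here is a minimal sketch, assuming the original UVs lie in the 0..1 range (no tiling) and that each texture occupies a rectangular tile of the atlas; the function name and the 2x2 layout are made up for the example.

```python
def remap_uv_to_atlas(uv, tile_origin, tile_size):
    """Map a 0..1 UV on the original texture into its tile inside the atlas."""
    u, v = uv
    origin_u, origin_v = tile_origin
    size_u, size_v = tile_size
    return (origin_u + u * size_u, origin_v + v * size_v)

# Hypothetical 2x2 atlas: this texture sits in the tile starting at (0.5, 0.0).
tile_origin, tile_size = (0.5, 0.0), (0.5, 0.5)

print(remap_uv_to_atlas((0.0, 0.0), tile_origin, tile_size))  # (0.5, 0.0)
print(remap_uv_to_atlas((0.5, 0.5), tile_origin, tile_size))  # (0.75, 0.25)
print(remap_uv_to_atlas((1.0, 1.0), tile_origin, tile_size))  # (1.0, 0.5)
```

In practice the remapped UVs are usually kept slightly inside the tile edges, which is one way of dealing with the mip bleeding mentioned above.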
Many engines support instancing, also known as batching or clustering. This is the ability to use a single draw call to render multiple objects that are almost identical in terms of shaders and state and differ only in limited ways (usually their position and rotation in the world). The engine usually detects when several identical objects can be rendered with instancing, so wherever possible try to use the same object several times in a scene rather than several different objects that have to be rendered with separate draw calls.
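Conceptually, the engine groups objects by everything they have in common and keeps only the per-instance differences. The sketch below is a simplified illustration with invented names; real engines use their own criteria for what can share a draw call.

```python
from collections import defaultdict

def build_instanced_draws(objects):
    """Group objects sharing a mesh and material into one draw per group."""
    batches = defaultdict(list)
    for obj in objects:
        # Only the per-instance data (here just a position) is kept per object.
        batches[(obj["mesh"], obj["material"])].append(obj["position"])
    return dict(batches)

# Hypothetical scene: three copies of one rock and a single tree.
scene = [
    {"mesh": "rock", "material": "stone", "position": (0, 0, 0)},
    {"mesh": "rock", "material": "stone", "position": (5, 0, 2)},
    {"mesh": "tree", "material": "bark", "position": (3, 0, 9)},
    {"mesh": "rock", "material": "stone", "position": (8, 0, 1)},
]

draws = build_instanced_draws(scene)
print(len(scene), "objects ->", len(draws), "draw calls")  # 4 objects -> 2 draw calls
```

Had the three rocks been three different meshes, nothing could have been grouped, and the scene would need four draw calls.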
Another popular technique for reducing draw calls is to manually merge several different objects that share a material into a single mesh. This can be effective, but take care not to merge too much, because that can hurt performance by increasing the amount of work for the GPU. Before any draw calls are issued, the engine's visibility system can determine whether an object is on the screen at all. If it is not, it is much cheaper to skip it at this early stage and spend no draw calls or GPU time on it (this technique is also known as visibility culling). It is usually implemented by checking whether the object's bounding volume is visible from the camera's point of view and whether it is completely hidden (occluded) by other objects.
However, when several meshes are merged into one object, their individual bounding volumes are combined into one large volume that is big enough to enclose all of them. This makes it more likely that the visibility system will see some part of the volume and therefore consider the whole collection of meshes visible. A draw call is then issued, and the vertex shader must run for every vertex of the object, even if only a few of them end up on the screen. That can waste a lot of GPU time, because most of the vertices contribute nothing to the final image. For these reasons, merging works best for groups of small objects that are close together, since they are likely to be on screen at the same time anyway.
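The effect of merged bounding volumes can be seen with a flat, two-dimensional stand-in for the real thing. In this sketch the view and the bounding boxes are axis-aligned rectangles given as (min_x, min_y, max_x, max_y); real engines test 3D volumes against the camera frustum, so treat the names and numbers as assumptions for illustration.

```python
def union(boxes):
    """Smallest box enclosing all the given boxes."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))

def overlaps(a, b):
    """True if two boxes intersect."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

view = (0, 0, 10, 10)  # what the camera can see
bushes = [(1, 1, 2, 2), (20, 1, 21, 2), (40, 1, 41, 2)]  # three distant bushes

# Separate objects: only the bush that is actually in view gets drawn.
visible_separately = sum(overlaps(b, view) for b in bushes)
print(visible_separately)  # 1

# Merged object: one big box that touches the view, so all three are processed.
merged = union(bushes)
print(merged, overlaps(merged, view))  # (1, 1, 41, 2) True
```

Merging saved two draw calls here, but at the price of running the vertex shader for two bushes that can never appear on screen.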
A frame from XCOM 2, captured in RenderDoc. In the wireframe (bottom), gray shows all the extra geometry that is sent to the GPU but lies outside the game camera's view.
As an illustrative example, take a frame from XCOM 2, one of my favorite games of the last couple of years. The wireframe shows the entire scene as submitted to the GPU by the engine, and the black area in the middle is the geometry actually visible to the game camera. All the surrounding geometry (gray) is not visible and will be culled after the vertex shader has run, which is wasted GPU time. In particular, look at the geometry highlighted in red. It consists of several bush meshes, merged and rendered in just a few draw calls. The visibility system determined that at least some of the bushes are on screen, so all of them are rendered and have their vertex shader executed before the ones that can be culled (which turns out to be most of them) are identified.
Don't get me wrong, I am not singling out XCOM 2; I just happened to be playing it a lot while writing this article! Every game has this problem, and it is always a balancing act between the time spent on more accurate visibility checks, the cost of culling invisible geometry, and the cost of having more draw calls.
However, things are changing when it comes to the cost of draw calls. As mentioned above, a major reason for that cost is the overhead of the driver's translation and error checking. This has been a problem for a very long time, but most modern graphics APIs (for example, Direct3D 12 and Vulkan) have been restructured to avoid much of this unnecessary work. Although this makes the game's rendering engine more complex, it results in cheaper draw calls, which lets us render many more objects than was previously possible. Some engines (most notably the latest version of the Assassin's Creed engine) have even gone in a completely different direction and use the capabilities of modern GPUs to drive rendering themselves, effectively getting rid of draw calls.
A large number of draw calls mostly hurts CPU performance. Almost every other graphics-related performance problem lies with the GPU. We will now look at what the bottlenecks are, where they occur, and how to deal with them.
Part 2: Common GPU bottlenecks
The very first step in optimization is to find the current bottleneck, so that you can then reduce its impact or eliminate it. A bottleneck is the part of the pipeline that slows everything else down. In the example above, where there were too many costly draw calls, the bottleneck was the CPU. Even if we made optimizations that sped up the GPU, the frame rate would not change, because the CPU would still be too slow to produce a frame in the required time.
Four draw calls pass through the pipeline, each rendering a whole mesh containing many triangles. The stages overlap, because as soon as one piece of work is finished it can be handed to the next stage immediately (for example, once three vertices have been processed by the vertex shader, the triangle can be passed on for rasterization).
You can think of the GPU pipeline as an assembly line. As each stage finishes with its data, it passes the results on to the next stage and starts on its next piece of work. Ideally every stage is busy all the time and the hardware is used fully and efficiently, as in the figure above: the vertex shader is constantly processing vertices, the rasterizer is constantly rasterizing pixels, and so on. But imagine what happens if one stage takes much longer than the others:
Here, a costly vertex shader cannot feed data to the following stages fast enough and so becomes the bottleneck. If you have a draw call like this, making the pixel shader faster will do little to change the total time taken to render the draw call. The only way to speed things up is to reduce the time spent in the vertex shader. How to do that depends on what exactly is causing the holdup in the vertex shader stage.
Keep in mind that there will almost always be a bottleneck of some kind: if you get rid of one, another simply takes its place. The trick is to know when you can do something about it and when you just have to live with it because it is the price of the rendering. When optimizing, we aim to get rid of the unnecessary bottlenecks. But how do you find out what the bottleneck is?
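A crude model makes the point. If we assume the stages overlap perfectly, the time a draw call takes is roughly set by its slowest stage. The stage costs below are invented numbers in arbitrary units, and the model ignores everything that makes real GPUs complicated.

```python
def bottleneck(stage_costs):
    """Return the slowest stage and its cost; it limits the whole pipeline."""
    name = max(stage_costs, key=stage_costs.get)
    return name, stage_costs[name]

# Hypothetical per-draw-call costs with an expensive vertex shader.
costs = {
    "input_assembly": 0.2,
    "vertex_shader": 3.0,
    "rasterizer": 0.5,
    "pixel_shader": 1.0,
    "render_target_output": 0.3,
}
print(bottleneck(costs))  # ('vertex_shader', 3.0)

# Halving the pixel shader cost changes nothing: the limit is still 3.0.
faster_pixels = dict(costs, pixel_shader=0.5)
print(bottleneck(faster_pixels))  # ('vertex_shader', 3.0)

# Halving the vertex shader cost helps, but it is still the slowest stage.
faster_vertices = dict(costs, vertex_shader=1.5)
print(bottleneck(faster_vertices))  # ('vertex_shader', 1.5)
```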
Profiling
Profiling tools are absolutely essential for working out where all the GPU's time is going. The best ones can even point out what needs to change to speed things up. They do this in different ways: some simply list the bottlenecks, while others let you experiment and observe the effect (for example, "how does the draw time change if all the textures are tiny?", which helps you tell whether you are limited by memory bandwidth or cache usage).
Unfortunately, this is where things get complicated, because some of the best profiling tools are available only for consoles and therefore fall under NDA. If you are developing a game for Xbox or PlayStation, ask a graphics programmer to show you these tools. We programmers love it when artists want to help with performance, and we are happy to answer questions or even write guides on using the tools effectively.
The basic built-in GPU profiler of the Unity engine
On PC, there are some quite good (albeit hardware-specific) profiling tools available from the GPU vendors, for example NVIDIA's Nsight, AMD's GPU PerfStudio and Intel's GPA. There is also RenderDoc, the best tool for debugging graphics on PC, although it lacks advanced profiling features. Microsoft has started releasing its excellent Xbox profiling tool PIX for Windows as well, albeit only for D3D12 applications. Assuming the company intends to provide the same bottleneck analysis tools as in the Xbox version (which is difficult given the huge variety of hardware), it will be an excellent resource for PC developers.
These tools can tell you everything about the performance of your graphics. They also give many clues about how a frame is put together in your engine and let you debug it.
It is important to get comfortable with them, because artists should be responsible for the performance of their art. But you should not be expected to figure everything out on your own: any good engine should have its own performance analysis tools, ideally providing metrics and guidelines that help you determine whether your assets fit within the performance budget. If you want more influence over performance but feel you lack the necessary tools, talk to your programming team. Chances are such tools already exist, and if they do not, they should be written!
Now that you know how the GPU works and what a bottleneck is, we can finally get to the interesting part. Let's dig into the bottlenecks that occur most often in practice in the pipeline, and learn how they arise and what can be done about them.
Shader Instructions
Since most of the GPU's work is done by shaders, they are often the source of bottlenecks. When a bottleneck is described as shader instructions, it simply means that the vertex or pixel shader is doing too much work and the rest of the pipeline has to wait for it to finish.
Often the vertex or pixel shader program is simply too complex: it contains many instructions and takes a long time to execute. Or the vertex shader may be perfectly reasonable, but the mesh being rendered has too many vertices, so running the vertex shader still takes too long. Or the draw call covers a large area of the screen and touches a lot of pixels, which takes a long time in the pixel shader.
Unsurprisingly, the best way to relieve a shader instruction bottleneck is to execute fewer instructions! For pixel shaders, that means choosing a simpler material with fewer features to reduce the number of instructions executed per pixel. For vertex shaders, it means simplifying the mesh to reduce the number of vertices processed, and using LODs (Levels Of Detail, simplified versions of the mesh used when the object is far away and takes up little space on the screen).
Sometimes, however, a shader instruction bottleneck merely points to a problem somewhere else. Issues such as too much overdraw, a poorly performing LOD system and many others can make the GPU do far more work than necessary. These problems can come from the engine side or from the content side. Careful profiling, close examination and experience will help you work out what is going on.
One of the most common of these problems is overdraw: the same pixel on the screen has to be shaded several times because it is touched by several draw calls. Overdraw is a problem because it reduces the time the GPU can spend on each pixel. If every screen pixel has to be shaded twice, the GPU can spend only half as much time on each shade to maintain the same frame rate.
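Overdraw is often summarized as a single number: the average number of times each covered pixel was shaded. The sketch below computes it from a small grid of per-pixel shade counts, similar in spirit to what an overdraw visualization shows; the grid values and the function name are made up for the example.

```python
def overdraw_factor(shade_counts):
    """Average number of times each covered pixel was shaded."""
    covered = [c for row in shade_counts for c in row if c > 0]
    if not covered:
        return 0.0
    return sum(covered) / len(covered)

# Hypothetical 2x2 screen: how many times each pixel was shaded this frame.
counts = [
    [1, 2],
    [2, 3],
]
factor = overdraw_factor(counts)
print(factor)  # 2.0

# With a fixed frame budget, each shade gets this share of the per-pixel time.
print(1.0 / factor)  # 0.5
```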
A frame captured in PIX with the overdraw visualization mode
Sometimes overdraw is unavoidable, for example when rendering translucent objects such as particles or grass: the object in the background is visible through the object in the foreground, so both must be rendered. For opaque objects, however, overdraw is entirely unnecessary, because the pixel left in the buffer at the end of rendering is the only one that actually needed to be processed. In that case, every overdrawn pixel is simply wasted GPU time.
The GPU takes steps to reduce overdraw for opaque objects. The early depth test (which runs before the pixel shader; see the pipeline diagram at the beginning of the article) skips pixel shading if it determines that the pixel is hidden behind another object. To do this, it compares the pixel being shaded against the depth buffer, a render target in which the GPU stores the depth of the whole frame so that objects can occlude each other correctly. But for the early depth test to be effective, the other object must already be in the depth buffer, that is, already rendered. This means that the order in which objects are rendered is very important.
Ideally, every scene would be rendered front to back (that is, with the objects closest to the camera first), so that only the frontmost pixels are shaded and the rest are rejected by the early depth test, eliminating overdraw entirely. In the real world this is not always possible, because the order of the triangles within a draw call cannot be changed during rendering. Complex meshes can overlap themselves several times, and merging meshes can create many overlapping objects that are rendered in the "wrong" order and cause overdraw. There is no simple answer to these problems, and this is another thing to take into account when deciding whether to merge meshes.
To help the early depth test, some games perform a partial depth prepass. This is a preliminary pass in which certain large objects that are likely to occlude many others (large buildings, terrain, the main character, etc.) are rendered with a simple shader that writes only to the depth buffer. This is relatively fast because no pixel shader work is done for lighting or texturing. It "primes" the depth buffer and increases the amount of pixel shader work that can be skipped in the full rendering pass. The downside is that rendering the occluding objects twice (first in the depth-only pass, then in the main pass) increases the number of draw calls, and there is always a chance that the time taken by the depth pass exceeds the time saved by the more effective early depth test. Only detailed profiling can tell you whether the approach is worthwhile for a particular scene.
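The effect of draw order on the early depth test can be simulated in a few lines. In this sketch each draw is a flat object at a single depth covering a set of pixel ids, and smaller depth means closer to the camera; both the representation and the scene are assumptions for illustration.

```python
def shaded_pixels(draws):
    """Count pixel shader runs for (depth, pixels) draws with an early depth test."""
    depth_buffer = {}
    shaded = 0
    for depth, pixels in draws:
        for p in pixels:
            # Shade only if nothing closer has already been written here.
            if depth < depth_buffer.get(p, float("inf")):
                depth_buffer[p] = depth
                shaded += 1
    return shaded

# Hypothetical scene: a near wall covering four pixels, a far crate behind it.
wall = (1.0, {0, 1, 2, 3})
crate = (5.0, {2, 3})

print(shaded_pixels([wall, crate]))  # 4: front to back, the crate is rejected
print(shaded_pixels([crate, wall]))  # 6: back to front, two pixels shaded twice
```

The final image is the same in both cases; only the amount of pixel shader work differs.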
Overdraw visualization of explosion particles in Prototype 2
Overdraw matters especially for particles, given that particles are transparent and often overlap. Artists working on particle effects should always keep overdraw in mind. A thick cloud effect can be created by emitting many small overlapping particles, but that significantly increases the rendering cost of the effect. It is better to emit fewer, larger particles and rely more on the textures and texture animation to convey density. The result is often more visually effective too, because software such as FumeFX and Houdini can usually create much more interesting effects through texture animation than real-time simulation of individual particles can.
The engine can also take steps to avoid unnecessary GPU work on particles. Every rendered pixel that ends up completely transparent is wasted time, so a common optimization is particle trimming: instead of rendering the particle as two triangles, a polygon is generated that minimizes the empty areas of the texture being used.
The particle cutout tool in Unreal Engine 4
The same can be done for other partially transparent objects, such as vegetation. In fact, for vegetation it is even more important to use custom geometry that eliminates large areas of empty texture space, because vegetation often uses alpha testing. Alpha testing uses the texture's alpha channel to decide whether a pixel should be discarded in the pixel shader, making it transparent. This is a problem because alpha testing has the side effect of completely disabling the early depth test (it invalidates the assumptions the GPU makes about the pixel), which leads to much more unnecessary pixel shader work. In addition, vegetation often has a lot of overdraw (think of all the overlapping leaves on a tree), and if you are not careful it quickly becomes very expensive to render.
Closely related to overdraw in its effect is overshading, which is caused by small or thin triangles. It can seriously hurt performance by wasting a significant amount of GPU time. Overshading is a consequence of how the GPU processes pixels during pixel shading: not one at a time but in "quads", blocks of four pixels arranged in a 2x2 square. This is done so that the hardware can perform tasks such as comparing UVs between neighboring pixels to calculate the appropriate mipmap level.
This means that if a triangle touches only one pixel of a quad (because the triangle is tiny or very thin), the GPU still processes the whole quad and simply throws away the other three pixels, wasting 75% of the work. This wasted time can add up, and it is particularly painful for forward (i.e., not deferred) renderers, which do all the lighting and shading calculations in a single pass in the pixel shader. The cost can be reduced by using properly tuned LODs: besides saving vertex shader processing, they also greatly reduce overshading, because on average the triangles cover more of each quad.
A 10x8 pixel buffer with a 5x4 grid of quads. Two triangles make poor use of the quads: the left one is too small, the right one too thin. The 10 red quads touched by the triangles must be fully shaded, even though only the 12 green pixels actually need shading. Overall, 70% of the GPU's work is wasted.
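The waste is easy to compute once you know which pixels a triangle covers. The sketch below takes the covered pixels of a single triangle as (x, y) coordinates and assumes quads are aligned to even coordinates; the function name and the sliver triangle are made up for the example.

```python
def quad_shading_cost(covered_pixels):
    """Return (quads touched, pixels shaded, wasted fraction) for one triangle."""
    quads = {(x // 2, y // 2) for x, y in covered_pixels}
    shaded = 4 * len(quads)  # every touched quad is shaded in full
    wasted = 1 - len(covered_pixels) / shaded
    return len(quads), shaded, wasted

# A hypothetical thin sliver: four covered pixels, each in a different quad.
sliver = {(0, 0), (2, 1), (4, 2), (6, 3)}
print(quad_shading_cost(sliver))  # (4, 16, 0.75)

# A triangle that exactly fills one quad wastes nothing.
full_quad = {(0, 0), (1, 0), (0, 1), (1, 1)}
print(quad_shading_cost(full_quad))  # (1, 4, 0.0)
```

The figure's numbers follow from the same arithmetic: 10 quads means 40 pixels shaded, of which only 12 are needed, so 28 of 40, or 70%, is wasted.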
(Additional information: quad overshading is also the reason you will often see full-screen post effects cover the screen with one big triangle instead of two.)