The whole web at 60+ FPS: how the new renderer in Firefox got rid of jank and slowdowns
The release of Firefox Quantum is getting close. It will bring many performance improvements, including the super-fast CSS engine that we borrowed from Servo.
But there is one more big piece of Servo technology that is not yet part of Firefox Quantum, though it will be soon. This is WebRender, which is being added to Firefox as part of the Quantum Render project.
WebRender is known for being extremely fast. But its main goal is not so much to make rendering faster as to make it smoother.
When developing WebRender, we set the goal that all applications should run at 60 frames per second (FPS) or better, regardless of the size of the display or how much of the page changes from frame to frame. And it works. Pages that chug along at 15 FPS in Chrome or in today's Firefox run at 60 FPS with WebRender.
How does WebRender do it? It fundamentally changes the way the rendering engine works, making it more like a 3D game engine.
Let's look at what this means. But first…
What does the renderer do?
In the article on Stylo, I explained how the browser goes from parsing HTML and CSS to pixels on the screen, and how most browsers do this in five stages.
These five stages can be divided into two parts. The first part is, in effect, drawing up a plan. To make this plan, the browser combines the parsed HTML and CSS with information such as the viewport size to work out exactly how each element should look: its width, height, color, and so on. The end result is what is called a "frame tree" or "render tree".
In the second part, painting and compositing, the renderer comes into play. It takes this plan and turns it into pixels on the screen.
But the browser does not do this just once. It has to repeat the operation again and again for the same web page. Every time something changes on the page (for example, a div is toggled open), the browser has to go through these steps again.
Even if nothing on the page really changes (for example, you simply scroll or select text), the browser still has to perform rendering operations to draw new pixels on the screen.
For scrolling and animation to be smooth, they must be updated at 60 frames per second.
You may have heard this phrase, frames per second (FPS), without being sure what it means. I think of it as a flipbook: a book of static pictures that you flip through quickly, so that it creates the illusion of animation.
For the animation in such a flipbook to look smooth, you need to flip through 60 pages per second.
The pages of this flipbook are made of graph paper. There are lots and lots of little squares, and each square can contain only one color.
The renderer's job is to fill in the squares on the graph paper. Once they are all filled in, the frame is finished rendering.
Of course, your computer does not have real graph paper. Instead, it has an area of memory called a frame buffer. Each memory address in the frame buffer is like a square on the graph paper: it corresponds to a pixel on the screen. The browser fills each slot with numbers that represent the color as RGBA values (red, green, blue, and alpha).
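As a rough illustration of the graph-paper picture (not of how any browser actually lays out its memory), a frame buffer can be modeled as a flat byte array with four bytes per pixel. The helper names `make_frame_buffer`, `pixel_offset`, `set_pixel`, and `get_pixel` are made up for this sketch, and the row-major RGBA layout is an assumption.

```python
def make_frame_buffer(width, height):
    """Allocate width * height pixels, 4 bytes each (R, G, B, A), all zero."""
    return bytearray(width * height * 4)


def pixel_offset(x, y, width):
    """Byte offset of pixel (x, y) in a row-major RGBA buffer."""
    return (y * width + x) * 4


def set_pixel(buf, width, x, y, rgba):
    """Write one pixel's four channel values into its slot."""
    off = pixel_offset(x, y, width)
    buf[off:off + 4] = bytes(rgba)


def get_pixel(buf, width, x, y):
    """Read one pixel back as an (R, G, B, A) tuple."""
    off = pixel_offset(x, y, width)
    return tuple(buf[off:off + 4])


WIDTH, HEIGHT = 4, 3
fb = make_frame_buffer(WIDTH, HEIGHT)
set_pixel(fb, WIDTH, 2, 1, (255, 0, 0, 255))  # one opaque red square
print(len(fb), get_pixel(fb, WIDTH, 2, 1))
```

A frame counts as finished only when every slot has been written, which is why the total number of pixels matters so much later in the article.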
When the display needs to refresh itself, it reads from this area of memory.
Most computer displays refresh 60 times per second. That is why browsers try to deliver 60 frames per second. It means the browser has only 16.67 milliseconds (1000 ms / 60) for all the work: CSS styling, layout, painting, and filling every slot in the frame buffer with the numbers that correspond to colors. This interval between two frames (16.67 ms) is called the frame budget.
You may have heard people talk about dropped frames. A dropped frame is what happens when the system does not finish within the budget. The display tries to fetch a new frame from the frame buffer before the browser has finished filling it in. In that case, the display shows the old version of the frame again.
A dropped frame is like a page torn out of the flipbook. The animation seems to stutter and jump because the transition between the previous page and the next one is missing.
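The frame budget arithmetic is easy to check. The sketch below computes the budget for a few refresh rates and estimates how many refreshes show a stale frame when the work runs long. The function names are hypothetical, and the model assumes that work on a frame starts exactly at a refresh boundary.

```python
import math


def frame_budget_ms(refresh_hz):
    """Time available per frame, in milliseconds."""
    return 1000.0 / refresh_hz


def refreshes_missed(work_ms, refresh_hz):
    """Refreshes that show the old frame while one new frame is produced.

    Assumes the work starts right at a refresh boundary.
    """
    budget = frame_budget_ms(refresh_hz)
    return max(0, math.ceil(work_ms / budget) - 1)


for hz in (60, 90, 120):
    print(f"{hz} Hz -> {frame_budget_ms(hz):.2f} ms per frame")

print(refreshes_missed(10, 60))  # fits in the budget
print(refreshes_missed(20, 60))  # slightly over the budget
```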
So you need to put all the pixels in the frame buffer before the display checks it again. Let's see how browsers used to cope with this and how the technology changed over time. Then we can understand how to accelerate this process.
A brief history of painting and compositing
Note: painting and compositing are where browser rendering engines differ from each other the most. Single-platform browsers (Edge and Safari) work a little differently than multi-platform browsers (Firefox and Chrome).
Even the very first browsers used some optimizations to render pages faster. For example, when you scrolled a page, the browser moved the parts that were already rendered and then painted new pixels only in the space that was freed up.
This process of working out what has changed and then updating only the changed elements or pixels is called invalidation.
Over time, browsers started using more advanced invalidation techniques, such as rectangle invalidation. Here, the browser works out the smallest rectangle around each changed area of the screen and then updates only the pixels inside those rectangles.
This greatly reduces the amount of work when only a little changes on the page, for example only a blinking cursor.
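The minimal-rectangle idea can be sketched in a few lines. This is only an illustration under simple assumptions (axis-aligned rectangles, one bounding rectangle for all changes). The `Rect` class and the `union` and `dirty_rect` helpers are hypothetical names, not part of any browser.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    x: int
    y: int
    w: int
    h: int

    @property
    def area(self):
        return self.w * self.h


def union(a, b):
    """Smallest rectangle that contains both a and b."""
    left = min(a.x, b.x)
    top = min(a.y, b.y)
    right = max(a.x + a.w, b.x + b.w)
    bottom = max(a.y + a.h, b.y + b.h)
    return Rect(left, top, right - left, bottom - top)


def dirty_rect(changed):
    """Bounding rectangle of all changed regions, or None if nothing changed."""
    result = None
    for r in changed:
        result = r if result is None else union(result, r)
    return result


viewport = Rect(0, 0, 1920, 1080)
cursor = Rect(300, 200, 2, 18)  # a text cursor blinking in place
dirty = dirty_rect([cursor])
print(dirty.area, "of", viewport.area, "pixels need repainting")
```

With only a cursor changing, a few dozen pixels are repainted instead of roughly two million. If two changes sit in opposite corners of the page, though, a single bounding rectangle grows to cover almost everything, which hints at why this technique stops helping when large parts of the page change.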
But it does not help much when large parts of the page change. For those cases, browsers had to come up with new techniques.
The arrival of layers and compositing
Using layers helps a lot when large parts of the page change, at least in some cases.
Layers in browsers are similar to layers in Photoshop, or to the sheets of thin translucent paper that used to be used for hand-drawn cartoons. Basically, you paint different elements of the page on different layers, and then you stack those layers on top of each other.
Browsers had used layers for a long time, but not always to speed things up. At first, layers were used only to make sure elements were drawn correctly. They implemented what is called a "stacking context".
For example, if you have a semi-transparent element on the page, it has to be in its own stacking context. This means it gets its own layer, so that its color can be blended with the color of the element beneath it. These layers were thrown away as soon as the frame was finished. On the next frame, all the layers had to be painted again.
But often the contents of these layers do not change from frame to frame. Think of a traditional animation, for example: the background does not change even while the characters in the foreground move. It is much more efficient to keep the background layer around and simply reuse it.
So that is what browsers did. They started retaining layers and repainting only the ones that had changed. And in some cases the layers did not change at all; they only needed to be moved, for example when an animation moves an element across the screen or when something is scrolled.
This process of arranging layers together is called compositing. The compositor starts with the following objects:
Source bitmaps: the background (including an empty box where the scrollable content should go) and the scrollable content itself.
A destination bitmap, which is what gets displayed on the screen.
First, the compositor copies the background to the destination bitmap.
Then it works out which part of the scrollable content should be showing, and copies that part on top of the destination bitmap.
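The two copy steps just described can be sketched with small character grids standing in for bitmaps. This is a toy model, not browser code. The function name `composite` and its parameters (`box_top`, `box_left`, `box_height`, `scroll_y`) are assumptions made for the example.

```python
def composite(background, content, box_top, box_left, box_height, scroll_y):
    """Copy the background, then the visible slice of the scrolled content."""
    # Step 1: copy the background into a fresh destination bitmap.
    dest = [row[:] for row in background]
    # Step 2: work out which rows of the content are currently visible.
    visible = content[scroll_y:scroll_y + box_height]
    # Step 3: copy those rows on top of the destination, inside the scroll box.
    for dy, row in enumerate(visible):
        for dx, pixel in enumerate(row):
            dest[box_top + dy][box_left + dx] = pixel
    return dest


background = [["." for _ in range(6)] for _ in range(4)]
content = [[str(i)] * 4 for i in range(8)]  # 8 rows tall, 4 pixels wide

frame = composite(background, content,
                  box_top=1, box_left=1, box_height=2, scroll_y=3)
for row in frame:
    print("".join(row))
```

Scrolling only changes `scroll_y`. Neither source bitmap has to be repainted, which is exactly the saving that retained layers provide.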
This reduces the amount of painting the main thread has to do. But the main thread still spends a lot of time on compositing, and there are many other tasks competing for time on the main thread.
But there is another piece of hardware that is sitting there with almost nothing to do, and it was built specifically for graphics. This is the GPU, which games have used since the 90s to render frames quickly. And GPUs have been getting bigger and more powerful ever since.
Hardware-accelerated compositing
So browser developers started moving work over to the GPU.
In theory, two tasks could be moved to the GPU:
Painting the layers.
Compositing the layers together.
Painting is hard to move to the GPU, so multi-platform browsers usually keep this task on the CPU.
Compositing, however, is something the GPU can do very quickly, and it is easy to hand over.
This takes all of the compositing work off the main thread. But it still leaves a lot of work there. Every time a layer needs to be repainted, the main thread does the painting and then transfers the layer to the GPU.
Some browsers have moved painting to a separate thread as well (we are working on this in Firefox now). But it is faster still to move this last piece of computation, painting, directly to the GPU.
Hardware-accelerated painting
So browsers started moving painting to the GPU, too.
This transition is still under way. Some browsers do all of their painting on the GPU, while in others it is possible only on certain platforms (for example, only on Windows or only on mobile devices).
But maintaining the separation between painting and compositing still has a cost, even when both run on the GPU. The separation also limits the kinds of optimizations you can use to make the GPU do its work faster.
This is where WebRender comes in. It fundamentally changes the way rendering is done by removing the distinction between painting and compositing. This lets us tailor the renderer's performance to the demands of the modern web and prepare it for use cases that will appear in the future.
In other words, we did not just want to render frames faster; we wanted them to render more consistently, without jank and slowdowns. And even when there are lots of pixels to draw, as on WebVR headsets with 4K resolution, we still want the experience to be smooth.
Why is animation so slow in modern browsers?
The optimizations above have helped pages render faster in some cases. When very little changes on the page (for example, only a blinking cursor), the browser does the least amount of work possible.
Splitting pages into layers increased the number of these "ideal" scenarios. If you can paint a few layers once and then just move them around relative to each other, the painting + compositing architecture works very well.
But layers have drawbacks. They take up a lot of memory, and they can sometimes make rendering slower. Browsers need to combine layers where it makes sense, but it is hard to tell exactly where it makes sense and where it does not.
So if there are many different things moving on the page, you can end up with too many layers. These layers fill up memory, and transferring them to the compositor takes too long.
In other cases, you end up with one layer where there should be several. That single layer is repainted continuously and transferred to the compositor, which then composites it without changing anything.
That means the drawing work is doubled: each pixel is touched twice for no benefit. It would be faster to render the page directly and skip the compositing step.
And there are many cases where layers simply do not help. For example, if you have an animated background, the whole layer has to be repainted anyway. Layers only help with a small number of CSS properties.
Even if most of your frames hit the best-case scenario (that is, they take up only a small part of the frame budget), motion can still look choppy. For the eye to perceive jank, it only takes a couple of frames that fall into the worst-case scenario.
These scenarios are called performance cliffs. The application seems to run fine until it hits one of these worst cases (such as an animated background), and then the frame rate suddenly falls off a cliff.
But we can get rid of these cliffs.
How? By following the example of 3D game engines.
Using the GPU like a game engine
What if we stopped trying to guess which layers we need? What if we removed this intermediate step between painting and compositing and simply went back to painting every pixel on every frame?
This may sound like an absurd idea, but there is precedent for it. Modern video games repaint every pixel, and they hold 60 frames per second more reliably than browsers do. They do it in an unexpected way: instead of creating invalidation rectangles and layers to minimize the area that has to be repainted, they simply repaint the whole screen.
Wouldn't rendering a web page this way be much slower?
If we paint on the CPU, yes. But GPUs are designed for exactly this kind of work.
GPUs are built for extreme parallelism. I talked about parallelism in my last article, about Stylo. With parallelism, the computer does several things at the same time. The number of things it can do at once is limited by the number of cores it has.
A CPU usually has 2 to 8 cores, while a GPU has at least a few hundred, and often more than 1,000.
These cores work a little differently, though. They cannot act completely independently the way CPU cores can. Instead, they usually work on something together, running the same instruction on different pieces of data.
This is exactly what we need when filling in pixels. The pixels can be distributed across the cores. Because the GPU works on hundreds of pixels at a time, it fills in pixels much faster than the CPU does, but only if all of the cores have work to do.
Because the cores have to work on the same thing at the same time, GPUs go through a fairly rigid set of steps, and their programming interfaces are quite constrained. Let's see how this works.
The first step is to tell the GPU what to draw. This means giving it the shapes of the objects and instructions for how to fill them in.
To do this, you break the whole picture down into simple shapes (usually triangles). These shapes live in 3D space, so some of them can be behind others. Then you take the corners (vertices) of all the triangles and put their x, y, and z coordinates into an array.
Then you issue a command telling the GPU to draw those shapes (a draw call).
From this point on, the GPU takes over. All of the cores work on the same thing at the same time. They do the following:
Work out where all the corners of the shapes are. This is called vertex shading.
Work out the lines that connect those corners. From this, you can tell which pixels each shape covers. This is called rasterization.
Now that we know which pixels each shape covers, go through each pixel and assign it a color. This is called pixel shading.
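The three steps above can be imitated on the CPU with a tiny software rasterizer. This is only a sketch of the idea: a real GPU runs these stages in parallel hardware, and the function names, the translation-only vertex shader, and the pixel-center coverage rule are all assumptions chosen to keep the example short.

```python
def vertex_shader(vertex, offset):
    """Vertex shading: place a corner on screen (here, a simple translation)."""
    x, y = vertex
    return (x + offset[0], y + offset[1])


def edge(a, b, p):
    """Signed test telling which side of the line a -> b the point p is on."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def rasterize(tri, width, height):
    """Rasterization: yield pixels whose centers fall inside the triangle."""
    a, b, c = tri
    for y in range(height):
        for x in range(width):
            p = (x + 0.5, y + 0.5)  # sample at the pixel center
            w0, w1, w2 = edge(a, b, p), edge(b, c, p), edge(c, a, p)
            inside = (w0 >= 0 and w1 >= 0 and w2 >= 0) or \
                     (w0 <= 0 and w1 <= 0 and w2 <= 0)
            if inside:
                yield (x, y)


def pixel_shader(x, y):
    """Pixel shading: the simplest shader, one flat 'color' for every pixel."""
    return "#"


def draw(triangle, offset, width, height):
    frame = [["." for _ in range(width)] for _ in range(height)]
    tri = [vertex_shader(v, offset) for v in triangle]
    for x, y in rasterize(tri, width, height):
        frame[y][x] = pixel_shader(x, y)
    return ["".join(row) for row in frame]


for row in draw([(0, 0), (4, 0), (0, 4)], (1, 1), 6, 6):
    print(row)
```

Every pixel in the inner loop is independent of the others, which is why this work spreads so well over hundreds of GPU cores.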
This last step can be done in different ways. To tell the GPU how to do it, you give it a program called a "pixel shader". Pixel shading is one of the few parts of the GPU's functionality that you can program.
Some pixel shaders are very simple. For example, if the whole shape is a single color, the shader just has to return that color for each pixel in the shape.
Others are more complex, for example when there is a background image. Here you need to work out which part of the image corresponds to each pixel. You can do this the same way an artist scales a picture up or down: place a grid over the picture, with one square for each pixel. Then sample the colors inside each square and decide what the final color of the pixel should be. This is called texture mapping, because it maps an image (called a texture) onto the pixels.
The GPU calls the pixel shader for every pixel. Different cores work on different pixels in parallel, but they all have to use the same pixel shader. When you tell the GPU to draw your shapes, you also tell it which pixel shader to use.
On almost every web page, different parts of the page need different pixel shaders.
Since a shader runs for every pixel covered by a draw call, you usually have to split your draw calls into several groups. These are called batches. To keep all of the cores as busy as possible, you want a small number of batches with a large number of shapes in each.
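To make the effect of batching concrete, here is a small sketch that groups primitives by the shader they need. It is not WebRender's batching algorithm. The primitive names are invented, and reordering by shader is only valid under the assumption that the reordered shapes do not overlap (or that something else, such as a depth test, resolves the overlap).

```python
from itertools import groupby


def build_batches(primitives):
    """Merge neighboring primitives that share a shader into one batch.

    Only neighbors are merged, so the original drawing order is preserved.
    """
    return [
        (shader, [p["name"] for p in group])
        for shader, group in groupby(primitives, key=lambda p: p["shader"])
    ]


def build_batches_sorted(primitives):
    """Sort by shader first, then batch.

    Assumes draw order does not matter for these primitives.
    """
    ordered = sorted(primitives, key=lambda p: p["shader"])  # stable sort
    return build_batches(ordered)


primitives = [
    {"name": "card border", "shader": "border"},
    {"name": "title", "shader": "text"},
    {"name": "photo border", "shader": "border"},
    {"name": "caption", "shader": "text"},
    {"name": "photo", "shader": "image"},
    {"name": "logo", "shader": "image"},
]

print(len(build_batches(primitives)), "batches in document order")
print(len(build_batches_sorted(primitives)), "batches after sorting by shader")
```

Fewer batches means fewer draw calls and fewer shader switches, which is the goal stated above.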
That is how the GPU splits the work across hundreds or thousands of cores. It is this extreme parallelism that makes it feasible to render everything on every frame. But even with extreme parallelism, it is still a lot of work. You have to be smart about how you set up the work to get decent performance. This is where WebRender comes in…
How WebRender works with the GPU
Let's go back to the steps the browser takes to render a page. Two things change.
There is no longer a distinction between painting and compositing: both are part of a single step. The GPU does them at the same time, based on the commands it receives through the graphics API.
Layout now gives us a different data structure to render. Before, it was something called a frame tree (or a render tree in Chrome). Now, it passes on a display list.
A display list is a set of high-level drawing instructions. It says what needs to be drawn without being tied to a specific graphics API.
Whenever there is something new to draw, the main thread hands the display list to the RenderBackend, which is WebRender code that runs on the CPU.
The RenderBackend's job is to take this list of high-level drawing instructions and convert it into the draw calls the GPU needs, grouped into batches so that they run faster.
Then the RenderBackend passes these batches to the compositor thread, which passes them on to the GPU.
The RenderBackend wants the draw calls it hands over to run on the GPU as fast as possible. It uses several different techniques to achieve this.
Removing unnecessary shapes from the list (early culling)
The best way to save time is not to do the work at all.
First, the RenderBackend cuts down the display list. It works out which items in the list will actually be on the screen. To do this, it looks at things like how far the content has been scrolled.
If any part of a shape falls within the visible area, the shape stays in the display list. If no part of it would be visible, it is removed from the list. This process is called early culling.
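Early culling boils down to a rectangle intersection test against the visible area. The sketch below assumes a single vertically scrolled page and display items carrying an `(x, y, w, h)` rectangle in page coordinates. The item names and the `cull` helper are hypothetical, and a real display list has nested scroll boxes that this ignores.

```python
def intersects(a, b):
    """True if rectangles a and b, each given as (x, y, w, h), overlap."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def cull(display_list, viewport_size, scroll_y):
    """Keep only the items that have some part inside the visible area."""
    vw, vh = viewport_size
    visible_area = (0, scroll_y, vw, vh)
    return [item for item in display_list
            if intersects(item["rect"], visible_area)]


display_list = [
    {"name": "header", "rect": (0, 0, 800, 100)},
    {"name": "article", "rect": (0, 100, 800, 2000)},
    {"name": "footer", "rect": (0, 2100, 800, 200)},
]

for scroll_y in (0, 500, 1700):
    kept = cull(display_list, (800, 600), scroll_y)
    print(scroll_y, [item["name"] for item in kept])
```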
Minimizing the number of intermediate textures (the render task tree)
Now we have a tree that contains only the shapes we will actually use. This tree is organized into the stacking contexts we talked about earlier.
Effects such as CSS filters and stacking contexts make things a little complicated. For example, say you have an element with an opacity of 0.5, and it has children. You might think that each child is transparent on its own, but in fact it is the whole group that is transparent.
Because of this, you first have to render the group to a texture, with each box at full opacity. Then, when you place it in the parent, you can change the opacity of the whole texture.
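A small worked example shows why the intermediate texture is needed. It looks at a single pixel where two opaque children overlap, uses gray values between 0.0 (black) and 1.0 (white), and applies the usual "source over" blend. The function names and the sample values are assumptions made for this sketch.

```python
def blend(src, dst, alpha):
    """'Source over' blend of one gray value onto another."""
    return src * alpha + dst * (1.0 - alpha)


def per_child_opacity(background, children, opacity):
    """Wrong: apply the opacity to each child as it is drawn (back to front)."""
    color = background
    for child in children:
        color = blend(child, color, opacity)
    return color


def group_opacity(background, children, opacity):
    """Right: draw the group at full opacity into an intermediate texture,
    then blend that texture into the parent once."""
    texture = 0.0
    for child in children:
        texture = blend(child, texture, 1.0)  # the topmost opaque child wins
    return blend(texture, background, opacity)


background = 1.0         # white page
children = [0.0, 0.25]   # a black box with a dark gray box on top of it

print(per_child_opacity(background, children, 0.5))
print(group_opacity(background, children, 0.5))
```

With per-child opacity, the black box underneath shows through the gray box and darkens the result. With group opacity, only the gray box contributes at that pixel, which is the behavior described above.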
Stacking contexts can be nested, and the parent itself may belong to another stacking context. That means it has to be rendered to yet another intermediate texture, and so on.
Allocating space for these textures is expensive. As much as possible, we want to fit things onto the same intermediate texture.
To help the GPU with this, we create a render task tree. It tells us which textures have to be created before other textures. Any textures that do not depend on others can be created in the first pass, which means they can be grouped together on the same intermediate texture.
So for the semi-transparent boxes in the example above, the first pass would paint just one corner of a box's shadow. (It is actually a little more complicated than this, but this is the gist.)
The second pass can mirror this corner all the way around the box and paint the box itself. Then we can render the group of boxes at full opacity.
Finally, all that is left is to change the opacity of that texture and place it where it belongs in the final texture that is output to the screen.
By building this render task tree, we find out the minimum number of offscreen render targets we need before outputting to the screen. This is good because, as mentioned above, allocating space for these textures is expensive.
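One simple way to turn such a dependency tree into passes is to give each task a pass number one higher than the latest pass among the tasks it reads from. The sketch below does exactly that. It illustrates the scheduling idea only and is not WebRender's implementation. The task names are invented, and the input is assumed to contain no cycles.

```python
def schedule_passes(tasks):
    """Assign render tasks to passes.

    `tasks` maps a task name to the list of tasks whose textures it reads.
    Returns a list of passes, each a list of task names. Tasks in the same
    pass do not depend on each other, so they could share one target.
    """
    pass_of = {}

    def resolve(name):
        if name not in pass_of:
            deps = tasks[name]
            pass_of[name] = 1 + max((resolve(d) for d in deps), default=0)
        return pass_of[name]

    for name in tasks:
        resolve(name)

    passes = [[] for _ in range(max(pass_of.values(), default=0))]
    for name in tasks:
        passes[pass_of[name] - 1].append(name)
    return passes


tasks = {
    "screen": ["group"],               # final output reads the group texture
    "group": ["corner_a", "corner_b"],
    "corner_a": [],                    # independent, so first pass
    "corner_b": [],
}

for number, names in enumerate(schedule_passes(tasks), start=1):
    print("pass", number, names)
```

Here four tasks need only three passes, because the two independent tasks land in the same pass.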
The render task tree also helps us group work into batches.
Grouping draw calls together (batching)
As we said, we want a small number of batches with a large number of shapes in each.
Forming batches carefully makes rendering much faster. You want to fit as many shapes into a batch as you can, for a couple of reasons.
First, whenever the CPU tells the GPU to make a draw call, the CPU has a lot of work to do. It has to take care of things like setting up the GPU, uploading the shader program, and checking for various hardware bugs. All this work adds up, and while the CPU is doing it, the GPU may be sitting idle.
Second, changing state has a cost. Say you need to change the shader state between batches. On a typical GPU, you have to wait until all of the cores have finished their work with the current shader. This is called draining the pipeline. Until the pipeline is drained, the cores that have already finished sit idle.
Because of this, you want to pack batches as full as possible. For a typical desktop PC, a good rule of thumb is to stay under 100 draw calls per frame and to put thousands of vertices into each call. That gets the most out of the parallelism.
We look at each pass in the render task tree and work out which tasks can be grouped into the same batch.
At the moment, each type of primitive requires a different shader. For example, there is a border shader, a text shader, and an image shader.
We think many of these shaders can be combined, which would allow even bigger batches, although the work is already batched fairly well.
At this point the work is almost ready to send to the GPU. But there is a little more work we can eliminate first.
Reducing pixel shading with opaque and alpha passes (Z-culling)
Most web pages have many shapes that overlap each other. For example, a text field sits on top of a div (with a background), which sits on top of the body (with another background).
When working out the color of a pixel, the GPU could compute the color of that pixel in every shape. But only the top layer will be shown. This is called overdraw, and it wastes GPU time.
So one thing you can do is render the top shape first. When it comes time to shade the same pixel for the next shape, you check whether that pixel already has a value. If it does, the extra work is skipped.
There is one small problem, though. If a shape is translucent, the colors of the two shapes have to be blended. And for that to look right, rendering has to happen from back to front.
So we split the work into two passes. First comes the opaque pass: we render all of the opaque shapes from front to back, skipping any pixels that are hidden behind others.
Then we move on to the translucent shapes. These are rendered from back to front. If a translucent pixel lies on top of an opaque one, their colors are blended. If it lies behind an opaque one, it is not computed at all.
Splitting the work into opaque and alpha passes, and then skipping the pixels that are not needed, is called Z-culling.
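The two-pass scheme can be sketched for a single pixel. Each shape covering the pixel is a `(z, color, alpha)` tuple, where a larger `z` means closer to the viewer and colors are gray values. The function counts how many times the pixel shader would actually run. The data layout and the name `render_pixel` are assumptions for this example; on a real GPU the skipping is done by the depth buffer.

```python
def render_pixel(shapes, background):
    """Return (final_color, shader_invocations) for one pixel."""
    shaded = 0
    color = background
    opaque_depth = float("-inf")

    # Opaque pass, front to back: the nearest opaque shape is shaded first,
    # and every opaque shape behind it fails the depth test and is skipped.
    opaque = sorted((s for s in shapes if s[2] == 1.0),
                    key=lambda s: s[0], reverse=True)
    for z, shape_color, _ in opaque:
        if z > opaque_depth:
            color, opaque_depth = shape_color, z
            shaded += 1

    # Alpha pass, back to front: blend translucent shapes that are in front
    # of the opaque surface, and skip the ones hidden behind it.
    translucent = sorted((s for s in shapes if s[2] < 1.0),
                         key=lambda s: s[0])
    for z, shape_color, alpha in translucent:
        if z > opaque_depth:
            color = shape_color * alpha + color * (1.0 - alpha)
            shaded += 1

    return color, shaded


shapes = [
    (0, 0.2, 1.0),  # body background, opaque
    (1, 0.6, 0.5),  # translucent shape hidden behind the div
    (2, 1.0, 1.0),  # div background, opaque
    (3, 0.0, 0.5),  # translucent overlay on top
]

color, shaded = render_pixel(shapes, background=1.0)
print(color, shaded, "of", len(shapes), "shapes shaded")
```

In this example only two of the four shapes are shaded, and the final color is the same as drawing everything from back to front.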
This may look like a simple optimization, but it brings big wins. On a typical web page, it significantly reduces the number of pixels that have to be processed. We are now looking at ways to move even more work into the opaque pass.
At this point, we have prepared the frame. We have done everything we can to remove unnecessary work.
… And we are ready to draw!
The GPU is ready to be set up and to render the batches.
A caveat: not everything is on the GPU yet
The CPU still does part of the painting work. For example, we still render the characters in blocks of text (called glyphs) on the CPU. It is possible to do this on the GPU, but it is hard to get a pixel-for-pixel match with the glyphs that the computer renders in other applications, so people may find GPU-rendered fonts disorienting. We are experimenting with moving glyph rendering to the GPU as part of the Pathfinder project.
For now, though, these things are painted into bitmaps on the CPU. They are then uploaded to the texture cache on the GPU. This cache is kept from frame to frame, because its contents usually do not change.
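The benefit of keeping that cache between frames can be shown with a toy model. Here a plain dictionary stands in for the GPU texture cache and `fake_rasterize` stands in for a real CPU glyph rasterizer. Both, along with the `GlyphCache` class and its key of `(font, size, char)`, are assumptions made for illustration.

```python
class GlyphCache:
    """Keeps rasterized glyph bitmaps between frames."""

    def __init__(self, rasterize):
        self._rasterize = rasterize
        self._bitmaps = {}
        self.misses = 0

    def get(self, font, size, char):
        key = (font, size, char)
        if key not in self._bitmaps:
            # Only the first request for a glyph pays for rasterization.
            self.misses += 1
            self._bitmaps[key] = self._rasterize(font, size, char)
        return self._bitmaps[key]


def fake_rasterize(font, size, char):
    """Stand-in rasterizer: returns a size x size bitmap of one gray value."""
    return [[ord(char) % 256] * size for _ in range(size)]


cache = GlyphCache(fake_rasterize)
for frame_number in range(3):   # draw the same text on three frames
    for ch in "hello":
        cache.get("sans", 16, ch)

print(cache.misses, "glyphs rasterized for 15 lookups")
```

The text has four distinct characters, so only four glyphs are rasterized. Every later frame reuses the cached bitmaps.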
Even though this painting stays on the CPU, it can still be made faster. For example, when painting the characters of a font, we split the different characters across all of the cores. We do this with the same technique Stylo uses to parallelize style computation: work stealing.
The future of WebRender
We plan to land WebRender in Firefox in 2018 as part of Quantum Render, a few releases after the initial release of Firefox Quantum. This will make existing web pages run more smoothly. It will also make Firefox ready for the new generation of high-resolution 4K displays, because rendering performance becomes more critical as the number of pixels on the screen grows.
But WebRender is not just useful for Firefox. It is also essential to our work on WebVR, where you need to render a different frame for each eye at 90 FPS at 4K resolution.
An early version of WebRender is already available in Firefox if you manually enable the corresponding flag. Integration work is still in progress, so performance is not yet as good as it will be in the final release. If you want to follow WebRender development, watch the GitHub repository or follow Firefox Nightly on Twitter, where weekly updates on the whole Quantum Render project are published.