So many problems with wrangling data, particularly processing data for games, are reducible to a simple clustering method that eventually you start to see them everywhere once you know how to view the problem. Since it also seems to be a relatively obscure skill in the games industry, I’d like to share an overview with you today, as well as an example of how I applied this technique recently to improve performance and reduce memory for a AAA title just last week.
First of all, let’s understand the scope of the problem. Data clustering is an NP-hard problem. That means if there is a single best solution, it will require brute force to find it, and it will take non-polynomial time to do so–in other words, it would be impossible to find the ideal solution for any non-trivial number of elements in your data set. That means we will rely on heuristics to guide the algorithm toward a good solution, and accept that a better solution might exist and might not be found. As long as that better solution is only a fraction of a percent better, that’s fine. Squeezing that last fraction of a percent out of the data would take too much CPU time anyway.
Second, let’s define some situations where clustering is used in today’s games. Smooth character skinning is the most obvious place. Most shaders (and even the Wii’s fixed function pipeline) take in a limited number of matrices and have a set of matrices and weights per vertex that reference these matrices. For example, a 3-bone weighted vertex is computed thus, where Mn are the animated bone matrices, Wn are the skinning weights or influences per vertex, Vb is the bind-pose vertex position and Vf is the final, rendered vertex position drawn to the screen:
Vf = (M0*w0 + M1*w1 + M2*w2)*Vb
Skinning is one of the cases where hardware limitations create a need for clustering. We usually can’t render a whole character with only 10 bones, so we need to determine which bones should be sent into the GPU with each batch. Now, we could just pick the first 10 bones, then select all the triangles that can be drawn with them, then pick the next 10 bones for the next batch, etc. The problem is, the GPU renders triangles not vertices, so there may be some triangles that need bones from batch 1 and batch 2, but don’t fit in either. Thinking about this a bit, it’s important to realize that the needs of the hardware shape the problems we solve, and recognizing the proper constraints on a problem is key to solving it effectively.
So, let’s state the problem clearly: We need to break up a mesh into groups of triangles that share the same exact set of matrices. Then we need to combine these groups until we are left with a minimum number of groups, with at most 10 matrices. Certain hardware will have additional requirements, such as making each group have roughly the same number of vertices or triangles (to balance the performance of the processors handling them, be they SPU, CPU, or GPU), or a maximum number of bytes for each group (to fit within some DMA packet size), etc.
Taking a page from the notebooks of the research community involved with triangle stripping, I tend to design NP-hard algorithms with a lot of the same ideas that have proven valuable for wrangling the stripping problem. One specific idea is to use a greedy approach to finding content to merge. By greedy, I mean always try to make the best choice and continue working on the current group until it’s full. There are some algorithms that try to backtrack and second-guess, but it invariable complicates the algorithm and unless the greedy version is doing a bad job in the first place, there won’t be a huge amount of gain.
The problem is, where do you start? There are potentially hundreds of starting points, how do you pick which one is the right one? With greedy algorithms, this is a critical decision, because it completely shapes the output. Think of it like painting the floor with a roller brush… you are not allowed to cross paint stripes, and you want to minimize the number of times you lift the brush. That’s triangle stripping in a nutshell. Of course, you want to paint in a regular pattern, in strips along a rectangle. But a computer can’t ‘see’ the way a human does, so we have to provide intuition as rules in code.
So, the rule for greedy algorithms like this is start with the least desirable, or loneliest element. That basically means, the group that is least likely to be chosen by the greedy algorithm. Why? Because if you choose anything else, in the end, there will be a lot of disconnected, ‘lonely’ groups that have nothing in common with each other. This is intuitive, because the loneliest part of a floor you’re painting is a corner, which is where you’d naturally begin painting from. It’s also important to note that ‘loneliest’ changes after each iteration. Continuing the paint example, after you paint a strip along the wall, there are still 4 corners in the unpainted area of the room.
Now that we’ve discussed the approach, here’s the algorithm in plain terms:
That’s it! Of course, the devil is in the details, and there are literally hundreds of variations that could be implemented in step 3 as selection criteria varies depending on the hardware, game, and art assets involved. For instance, you might want to minimize bounding spheres, or size in bytes, or prefer binning triangles by surface area per triangle, or weight them by the direction normals point (for culling/lighting purposes), etc. But once you have this technology, there are plenty of things you can do to preprocess your data.
JH
You must be logged in to post a comment.