DirectX batch sizes and Impostors

Hi,
I am working on an implementation of theHierarchical Image Cache by Shade et.al. In short, it uses a spatial hierarchy of dynamically updated impostors to avoid having to re-render the base objects for as long as possible: if possible, an impostor is updated by rendering its direct children (which are also impostors, but of a smaller subset of the scene) instead of the full geometry.
I have already written a working implementation in OpenGL (which I must admit I am more proficient with that DirectX), and for a variety of reasons I am now trying to port this code to DirectX.
However, I have a few performance issues with this algorithm. The main point (I think) is that it is quite CPU centric, and uses the GPU in a rather inefficient way. During traversal of the hierarchy, most rendering calls would have extremely small batch sizes - just the few quads required to render the impostors. Texture switches can be relatively easily minimized by tiling large textures and state sorting.
So:
What is the most efficient way to manage and render to a large number of textures? The total number of textures in a large scene is >100k, but only a very small number are in active use at any given time, and typically less than ~50 need to be updated per frame.
I assume that the basic considerations of OpenGL (minimize texture and other state changes as much as possible) are universally true, but should I render directly to the textures that I need to update, or is it faster to batch updates by rendering to a large surface first (each impostor update to a separate tile) and then copying them back to the textures, thus eliminating the context switches at the cost of extra copying? In OGL the first option was not really viable due to the large number of rendering contexts this would have required (although that might have changed now with the availability of FBO's).
Cheers,
-Stephan
[2136 byte] By [step] at [2008-1-12]
# 1
I wrote up some stuff about impostors in Games Programming Gems 2, and various other places (the DirectXDev list for example).
What I've always used for render-to-texture stuff is a few large rendertarget textures, and then a quadtree-based allocator. At the start of the frame, the allocator works out which of the large textures are least used, and then as you go along saying "I need a 32x32 texture", it allocates those as sub-regions of the large textures (using the quadtree idea for allocation). Using a quadtree is fast and can produce very little fragmentation. The allocator only switches to a different large texture if it runs out of space.
So as you go along rendering to your multiple "virtual" textures, you're actually just rendering to the same texture and selecting different scissor/viewport rectangles. When rendering from those targets, you can easily adjust texture coordinates to only select the region you need.
But this doesn't work if you need to use a rendertarget to create a rendertarget (i.e. the "hierarchical" part of your algorithm) - you can't reliably bind a single texture as both a source and a target at the same time. So you may need to modify the algorithm there.
In practical terms, I found that a single depth of impostors was more than sufficient. How many objects are there in a single scene - 10,000? 10k quads is a fairly trivial polygon load for today's graphics cards - there's no point optimising beloy that. More layers in the hierarchy just means more complex updating and smaller batches.
TomForsyth at 2007-9-9 > top of Msdn Tech,Game Technologies: DirectX, XNA, XACT, etc.,Game Technologies: Graphics...
# 2
Tiling textures is of course an efficient optimization; I've been playing with quadtree based allocations first but changed to a fixed allocation scheme (textures are assigned fixed chunk sizes - say one texture each for 8x8,16x16, 32x32, ...) which ensures zero fragmentation even with frequent element removals and re-allocations (which is integral to my algorithm). So that is already being done.
The problem is that what we're trying to do produces excessive polygon counts. For example, I've got a landscape with 1M trees at sufficient detail to be viewed quite closely - about 20k polygons each (there are only a handful of tree models which are heavily instanced for the entire scene). So updating a single impostor is fairly costly, and the hierarchical scheme does pay off.
I should probably add that I'm not working on a game project, but a PhD on rendering very large forests...
Anyway, thank you for your input. Something I've been thinking about but haven't really managed to try yet is to 'flatten' the hierarchy a bit by skipping levels of the quadtree - ie. have 256 (or so) children in each node instead of 4. I've got a lazy update scheme that does similar things but without being able to batch the calls (because of the hierarchy traversal taking place), but that might be an interesting option.
Best regards,
-stephan
step at 2007-9-9 > top of Msdn Tech,Game Technologies: DirectX, XNA, XACT, etc.,Game Technologies: Graphics...
# 3
I found that fragmentation was not much of a problem (which surprised me somewhat, in a good way). The interesting thing to note is that with impostors, there is a pretty hard limit to the amount of texture memory you need. It's just screen rez * overdraw * inefficiency. On today's cards, that sort of memory is easy to come by, and your inefficiency due to fragmentation can be quite high, and you still have "enough" of memory.
I was rendering scenes with animated objects and a fairly mobile camera though. Both those put a fairly low limit on how long an impostor is valid for, so it was rare that any one impostor stayed around for more than twenty frames. This puts a nice limit on how bad fragmentation can get. (the game was StarTopia by the way - http://www.eelpi.gotdns.org/startopia/startopia.html - you can get it from Amazon for about $3 these days).
I used a quadtree scheme not so much for the fragmentation, but mainly to avoid having to change rendertarget or texture too often. Doing either really hurts batching efficiency - the pipeline needs to fully or partially flush when you change them, and that's likely to be a big part of your cost. You might try simple tests like rendering without ever changing texture and/or rendertarget, and see what a speed difference that makes - the results may be interesting!
This is why I picked the most-empty texture at the start of the frame - it avoids changing rendertarget as much as possible. I was aiming at GeForce1s and Radeon1s, and although cards today are much better about changing rendertarget fast, they're still not stellar. As a side effect, because of the short lifetimes of most of my impostors, it also slightly reduced the number of texture changes, but they were less important, and you don't really have much choice about the order there - you have to render impostors back to front!
4 children per item sounds like a nice recipe for being totally limited by batch overhead. Flattening your hierarchy will increase overdraw for forest scenes though, so there's an obvious limit to how much you want to flatten them, even if the poly count doesn't rise that high.
Interesting problem! I'd be fascinated to hear the results.
TomForsyth at 2007-9-9 > top of Msdn Tech,Game Technologies: DirectX, XNA, XACT, etc.,Game Technologies: Graphics...