Performance, above 60hz all the time

August 29, 2019

Hello,

I have some performance issues, more precisely with keeping the frame rate constantly above 60 Hz. We are using Unigine 2.5 (but I also had a look at Unigine 2.8, and the code that is causing the performance issues seems unchanged). This is a wall of text, but please bear with me so I can give all the details...

We have our own streaming system (from an external format) which brings in a lot of data. We make sure we do all the updates in background threads (including creating the vertex buffers and textures, since we've initialized the DX11 context without the D3D11_CREATE_DEVICE_SINGLETHREADED flag).

I've inserted my own timing and I cannot detect any stalls inside our own update loop (which runs in the main thread), but I still get lots of stalls, some from within Unigine and some reported as render stalls (the orange line in the overlay?) by the engine profiler.
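For readers who want to reproduce this kind of hand-inserted timing, here is a minimal sketch of a scoped timer that logs any scope running longer than a threshold. The names StallLog and ScopedStallTimer are hypothetical and are not part of the Unigine API; the log is not thread-safe, so it is assumed to be used from the main thread only.

```cpp
#include <cassert>
#include <chrono>
#include <string>
#include <utility>
#include <vector>

// Frame budget for a 60 Hz target, in milliseconds (1000 / 60).
constexpr double kFrameBudgetMs = 1000.0 / 60.0;

// One recorded measurement: which scope was timed and how long it took.
struct StallRecord {
    std::string scope;
    double ms;
};

// Keeps every measurement that exceeded the threshold.
// Hypothetical helper, not thread-safe: use it from one thread only.
class StallLog {
public:
    explicit StallLog(double threshold_ms) : threshold_ms_(threshold_ms) {
        assert(threshold_ms >= 0.0);
    }

    void report(const std::string& scope, double ms) {
        if (ms > threshold_ms_) records_.push_back({scope, ms});
    }

    const std::vector<StallRecord>& records() const { return records_; }

private:
    double threshold_ms_;
    std::vector<StallRecord> records_;
};

// RAII timer: measures from construction to destruction and reports
// the elapsed time to the log.
class ScopedStallTimer {
public:
    ScopedStallTimer(StallLog& log, std::string scope)
        : log_(log), scope_(std::move(scope)),
          start_(std::chrono::steady_clock::now()) {}

    ~ScopedStallTimer() {
        const auto end = std::chrono::steady_clock::now();
        const std::chrono::duration<double, std::milli> elapsed = end - start_;
        log_.report(scope_, elapsed.count());
    }

    ScopedStallTimer(const ScopedStallTimer&) = delete;
    ScopedStallTimer& operator=(const ScopedStallTimer&) = delete;

private:
    StallLog& log_;
    std::string scope_;
    std::chrono::steady_clock::time_point start_;
};
```

Wrapping a block in `{ ScopedStallTimer t(log, "update"); ... }` with a log constructed as `StallLog log(kFrameBudgetMs);` records only the scopes that blow the 60 Hz budget on their own.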

I want to tackle the Unigine stalls first, and maybe you can help here (with some tips, what the actual causes of the stalls are, etc.). So, I've tested several cases:

1) Only rendering data that is already loaded, with no camera movement or looking around:

Most of the stalls come from several RenderRenderer::setShaderParameters calls, which take a lot of time for some of the objects. It is not the same objects every time, but in general the bigger the object (more surfaces?), the more likely it is to cause a stall. These add up to a stall that occurs about once every second or so. It is almost unnoticeable to the eye, but my requirement is “absolutely above 60 Hz, all the time”.

More details: most of the CPU time is spent in the first “if (is_changed)” block, where material parameters are looked up and set. Most notably, this sometimes takes a lot of time (above 20-30 ms):

       VectorStack<int, 128> numbers;
       material->fillSharedParameters(numbers);
       for (int i = 0; i < numbers.size(); i++)
       {
           int parameter = numbers[i];
           if (parameter_material->checkParameterPass(parameter, pass) == 0)
               continue;
           // ... more code follows here, causing stalls ...
       }

They also happen very frequently after this:

       if (use_old_transform)
           setShaderTransformParameter(surface->getWorldTransform(), surface->getOldWorldTransform());
       else
           setShaderTransformParameter(surface->getWorldTransform());

More rarely, they happen after:

   if (state->getShader() != shader || material == NULL)
   {
       is_changed = true;
       state->setShader(shader);

       if (shader)
       {
           render->addShaders(1);
           setShaderParameters(pass, is_screen_space);

           if (pass == RENDER_PASS_POST ||
               pass == RENDER_PASS_OBJECT_POST ||
               (pass >= RENDER_PASS_LIGHT_ENVIRONMENT_PROBE && pass <= RENDER_PASS_LIGHT_WORLD))
           {
               int id = shader->findParameter("s_constant_data", SHADER_MATERIAL_CONSTANT_ARRAY);
               if (id != -1)
                   shader->setParameterFloatArray(id, stored_constant_parameters, 4, RENDER_NUM_DEFERRED_CONSTANTS / 4);
           }

       }

       old_material = NULL;
   }

Even more rarely, they happen outside this function, inside the engine.world->update() call made from Engine::do_update (visibility computation? new nodes being inserted?).

So I wonder what is causing this. Could getting/setting parameters cause memory allocations which, from time to time, push execution onto a slow path in the memory manager? From what I can tell this should not happen, especially on the fast path, when the parameter id is already known.

Also, another case:

2) Besides the above, when I start to look around I also get stalls (i.e. long processing, over 20 ms or so) inside RenderState::setMaterial, when the material textures are bound. Probably a new material coming into view? What are your suggestions to avoid this? I am already trying to merge as much of our geometry as possible by shared textures. Can anything else be done here?

They also happen, more rarely, in lights->setShaderLightParameters calls inside RenderRenderer::setShaderParameters and in World::update().

Moving around without loading any new data (but hiding/showing nodes) does not cause any more stalls than looking around alone.

Another case:

3) When reading new data. This is where the big stalls happen. Basically, all of the above happen more often and are usually more severe. The setShaderParameters stalls happen for more objects per frame, sometimes adding up to a single-frame stall of over 100 ms. The stalls are now noticeable even to the naked eye. Again, none of these stalls are inside my processing; they are inside Unigine and seem more likely when a node is encountered for the first time or when many different materials are rendered?

My render call count is around 3000-4500 per frame. The frame rate is usually above 60 Hz (vsync on), but I do have these severe stalls, especially when new data is being inserted. And again, yes, I am creating the vertex buffers and textures in background threads; there is no GPU data creation involved during rendering (when objects are bound). I am using very fast Nvidia cards and plenty of RAM.

Many thanks,
Adrian
Vstep

PS: I don't know if this is the best place to ask this; maybe it should be moved to the rendering thread?

Edited August 29, 2019 by adrian.licuriceanu

August 30, 2019

Hello Adrian,

Thank you for the detailed explanation of the issue.

Based on your description, I would suggest that the main problem is in shader cache and hash generation. We improved this process in the newer SDK versions.

Every time your camera encounters a new material (never rendered before), the shader cache is generated. As a rule, this causes a visible render hiccup. In 2.7 and later versions we've added semi-automatic shader cache generation, and this helped a lot. However, in 2.5 the shader cache has to be precompiled manually.

To solve this in the Superposition benchmark we used a "warm-up" technique. The idea is to put all static objects in front of the player to force shader cache creation, then show all dynamic objects (I mean the ones that move) in the same way and "shake" them, because a different shader is used for them.

You can reuse our code from Superposition (the demo is available in the SDK Browser).
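As an illustration only (this is not the Superposition code), the warm-up idea can be sketched as a small queue that remembers which materials have already been shown and warms a bounded number of new ones per frame. The class name ShaderWarmup and the warm callback are hypothetical; the callback stands in for whatever actually forces the engine to render an object with that material once.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <set>
#include <string>
#include <utility>

// Hypothetical helper: tracks which materials were already warmed up and
// processes a limited number of new ones per frame.
class ShaderWarmup {
public:
    // Stand-in for "render something with this material once".
    using WarmFn = std::function<void(const std::string& material)>;

    explicit ShaderWarmup(WarmFn warm) : warm_(std::move(warm)) {
        assert(static_cast<bool>(warm_));
    }

    // Queue a material unless it was queued or warmed before.
    // Returns true if the material is new.
    bool request(const std::string& material) {
        if (!seen_.insert(material).second) return false;
        pending_.push_back(material);
        return true;
    }

    // Call once per frame: warms at most max_per_frame materials and
    // returns how many were processed.
    std::size_t update(std::size_t max_per_frame) {
        std::size_t done = 0;
        while (done < max_per_frame && !pending_.empty()) {
            warm_(pending_.front());
            pending_.pop_front();
            ++done;
        }
        return done;
    }

    std::size_t pending() const { return pending_.size(); }

private:
    WarmFn warm_;
    std::set<std::string> seen_;
    std::deque<std::string> pending_;
};
```

Bounding the work per frame trades a longer warm-up period for smaller individual hiccups, which may matter when materials arrive continuously rather than all at load time.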

We can help you find the root of the problem if you provide a sample that we can compile and test. This will require some time to research: 2.5 was released quite a while ago, and our developers are fully immersed in the current SDK branch.

Thank you.

September 2, 2019

OK, I will look at that benchmark (but note that I don't have the luxury of warming up objects, as I stream them live all the time).

Besides this, how do you explain that I get most of the stalls even when the objects were previously rendered (actually, I get these stalls from time to time even if I don't stream and don't move/rotate the camera at all)?

Providing a sample is very hard to do: we are dealing with a huge amount of data (TB) that is hard to separate and that also has confidentiality terms attached to it.

Kind Regards,

Adrian

September 2, 2019

Unfortunately, it's not enough info to fully diagnose these stalls.

I suggested shader compilation because it's a well-studied behavior. You also mentioned a custom data streaming solution. Could you be more specific: is it a modification at the engine level?

1 hour ago, adrian.licuriceanu said:

Providing a sample is very hard to do: we are dealing with a huge amount of data (TB) that is hard to separate and that also has confidentiality terms attached to it.

It looks like you've already isolated the code. Can you build a simple scene that loads nodes and causes the stalling?

Thanks.

September 23, 2019

Hi,

If I am to provide a demo, can I do it in binary form? I mean, I can provide a plugin DLL that you can load in, for example, Unigine 2.9; is this enough? You can then debug it and see which calls I am making to Unigine and that the stalls are basically inside the engine. I am asking because I cannot provide source code for the entire plugin, and even if I did, you would need additional SDKs that are not ours.

Kind Regards,

Adrian L.

September 23, 2019

Hi Adrian,

Yes, we'd like to take a look at the demo. Please provide instructions to reproduce the issue.

Thanks.

October 14, 2019

Hello Adrian,

I've taken a look at the sample and I think the spikes are pretty much random. We encountered such a pattern when we were experimenting with asynchronous GPU resource streaming. When several threads upload GPU resources, that may cause spikes in DX calls on the main thread. So we are trying to use one specific background thread for resource uploading, and everything goes more or less smoothly.
As I understand it, your plugin does resource uploading in another thread (or even several of them). I propose using Microprofile to investigate the issue. You could use Profiler::begin/endMicro to see resource uploading on the plugin's side in Microprofile, and mark suspicious places in the engine with macros from EngineMicroprofile.h. Once that's done, I believe we'll get a clearer picture of what is going on.
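The "one specific background thread" arrangement described above can be sketched as a single worker draining a job queue, so that every upload call is issued from the same thread. This is a generic sketch under that assumption, not engine code; the class name UploadThread is hypothetical, and each job would wrap whatever upload call the plugin makes.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical helper: a single worker thread that runs upload jobs in
// submission order. Any thread may call submit().
class UploadThread {
public:
    using Job = std::function<void()>;

    UploadThread() : worker_([this] { run(); }) {}

    // Runs all jobs that are still queued, then joins the worker.
    ~UploadThread() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        wake_.notify_one();
        worker_.join();
    }

    UploadThread(const UploadThread&) = delete;
    UploadThread& operator=(const UploadThread&) = delete;

    void submit(Job job) {
        assert(static_cast<bool>(job));
        {
            std::lock_guard<std::mutex> lock(mutex_);
            jobs_.push(std::move(job));
        }
        wake_.notify_one();
    }

private:
    void run() {
        for (;;) {
            Job job;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wake_.wait(lock, [this] { return stopping_ || !jobs_.empty(); });
                if (jobs_.empty()) return;  // stopping and nothing left to do
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();  // run outside the lock so submit() never waits on a job
        }
    }

    std::mutex mutex_;
    std::condition_variable wake_;
    std::queue<Job> jobs_;
    bool stopping_ = false;
    std::thread worker_;  // declared last: starts after the other members exist
};
```

Worker threads that decode or prepare data would then call `submit()` with the final upload step instead of performing the upload themselves.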

October 14, 2019

Hi Andrey, and thank you for taking a look at the demo.

Indeed I noticed that in 2.9 you have a special thread to upload DirectX data. Is this the reason that it is a single thread and not multiple ones doing the update?

What I do in my code is call flushMesh in background threads (multiple of them). Do I need to use a special, single thread for all these flushes (and the texture uploads)? Do I need to pace the uploads (e.g. do a certain amount of flushes per frame)?

Also, this doesn't directly explain why the stalls are inside RenderRenderer::setShaderParameters. Is DirectX maybe deciding to flush some internal pending operations at that point? But how about the stalls there when nothing is streaming (everything is in memory) and I just render (these are very rare in the demo I sent you, but very frequent in our final data)? Is it because too many inherited materials and parameters are used, and the setShaderParameters cost adds up across all of them? Should I cut the surface count?

Next, I will also try to use the profiler but I suspect it will just show the stalls I've manually detected with my own timings inserted in the code.

Kind Regards,

Adrian L.

October 14, 2019

Quote

Do I need to use a special, single thread for all these flushes (and the texture uploads)? Do I need to pace the uploads (e.g. do a certain amount of flushes per frame)?

I believe a single thread would be better in any case, due to less synchronization on the driver side. Limiting uploads may decrease spikes too (or may not).
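Pacing the uploads could look roughly like the sketch below: pending uploads wait in a queue and each frame releases only as many as fit into a byte budget. The names UploadPacer and PendingUpload and the byte-based budget are assumptions made for illustration; a count-based or time-based budget would follow the same pattern. The class is not thread-safe and is assumed to be driven from one thread.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

// A queued upload: its approximate size and the call that performs it.
struct PendingUpload {
    std::size_t bytes;
    std::function<void()> flush;
};

// Hypothetical helper: releases queued uploads under a per-frame byte budget.
class UploadPacer {
public:
    explicit UploadPacer(std::size_t bytes_per_frame) : budget_(bytes_per_frame) {
        assert(bytes_per_frame > 0);
    }

    void enqueue(std::size_t bytes, std::function<void()> flush) {
        queue_.push_back({bytes, std::move(flush)});
    }

    // Call once per frame. Runs uploads in FIFO order until the next one
    // would exceed the budget. The first upload of a frame always runs, so
    // an item larger than the whole budget cannot block the queue forever.
    std::size_t tick() {
        std::size_t spent = 0;
        std::size_t ran = 0;
        while (!queue_.empty()) {
            const std::size_t cost = queue_.front().bytes;
            if (ran > 0 && spent + cost > budget_) break;
            PendingUpload item = std::move(queue_.front());
            queue_.pop_front();
            item.flush();
            spent += cost;
            ++ran;
        }
        return ran;
    }

    std::size_t pending() const { return queue_.size(); }

private:
    std::size_t budget_;
    std::deque<PendingUpload> queue_;
};
```

Whether this actually reduces the spikes is an open question in the thread, so the budget is best treated as a tunable to experiment with.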

Quote

But how about the stalls there when nothing is streaming (everything is in memory) and I just render (these are very rare in the demo I sent you, but very frequent in our final data)?

If you send a sample for this case we will be able to investigate it too.

Quote

I suspect it will just show the stalls I've manually detected with my own timings inserted in the code.

The point is to see the stalls and the resource uploading on the same overall graph and check whether the spikes correlate with background thread activity.
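Microprofile shows this correlation visually. If the timings are also exported as plain intervals, the same check can be sketched in code: count how many slow frames overlap at least one background upload. The struct and function names below are hypothetical, and an overlap only hints at a relationship; it does not prove that the upload caused the spike.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A time span on a common clock, in milliseconds.
struct Interval {
    double begin_ms;
    double end_ms;
};

// True if the two half-open intervals [begin, end) share any time.
inline bool overlaps(const Interval& a, const Interval& b) {
    return a.begin_ms < b.end_ms && b.begin_ms < a.end_ms;
}

struct SpikeStats {
    std::size_t spikes;                 // frames longer than the threshold
    std::size_t spikes_during_uploads;  // of those, frames overlapping an upload
};

// Counts slow frames and how many of them coincide with upload activity.
SpikeStats correlate(const std::vector<Interval>& frames,
                     const std::vector<Interval>& uploads,
                     double spike_ms) {
    SpikeStats stats{0, 0};
    for (const Interval& frame : frames) {
        if (frame.end_ms - frame.begin_ms <= spike_ms) continue;
        ++stats.spikes;
        for (const Interval& upload : uploads) {
            if (overlaps(frame, upload)) {
                ++stats.spikes_during_uploads;
                break;
            }
        }
    }
    return stats;
}
```

If most spikes coincide with uploads, moving to a single upload thread or pacing the uploads is worth testing first; if they do not, the cause is more likely elsewhere.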

October 18, 2019

Hi,

I am currently using Microprofile and gathering all the points where I can see massive stalls. There are several of them, so let's take them one by one...

One of them was MeshStatic::create_nodes. It turns out that some of our big meshes (e.g. terrain chunks) were being picked (from scripting). I don't really need to pick them, so my solution was to call setIntersectionMask with a mask of 0 for all surfaces. This seems to avoid creating the spatial tree (which, by the way, gets extremely big for big meshes, and the leaf size seems to be hardcoded in the engine). Also, even if I create this tree in a background thread, it seems to be invalidated at some point, so the creation still ends up being called later on the main thread.

So my question is: can I safely assume (seems that way from the source) that the intersection masks are only used for picking / collision? So if my streamed meshes are just for visuals, can I safely call setIntersectionMask with zero and avoid calls to create_nodes?

Regards,

Adrian

October 21, 2019

Hello Adrian,

Yes, intersection masks are used for intersections exclusively (see Intersection mask).
Collisions are handled by a separate mask. You may want to disable collisions too, as they can lead to the same spikes.

October 22, 2019

Hello Andrey and thank you for the response.

I am currently in the middle of microprofiling, as you suggested, and the results indicate multiple causes for the stalls. But one of the things I notice very frequently in the GPU profiling is render_importance_smapling_mipmaps:

Any idea why this may take so long from time to time? What is the exact purpose of this? Seems to increase sometimes when new objects are inserted in the scene. Is there something I can test further (like inserting further microprofile scopes inside it)? Or do you think this can be caused by driver spikes (video driver being busy with other update work I am providing on background threads)?

Kind Regards,

Adrian_L

October 22, 2019

Hello Adrian,

It's a GPU-side spike you are looking at. I suggest leaving it for the moment and looking at the CPU-side spike in the main thread, which I believe causes this GPU-side stall. That CPU stall may be caused by background thread activity. It will be easier to say whether that's true once we can see the overall picture with those background threads.
