Rendering/processing related stalls



Hello,

We have a project that does some custom background streaming (creation of static meshes, loading of images, and then feeding them into Unigine). Most of the processing is done in background threads, except for a few things that are not thread safe (like setting node parent/child relations, enabling nodes, etc.). The system works and is pretty stable, but from time to time we do have some stalls (the frame rate drops below the required 60 Hz, 60 frames per second).

Here are some snapshots from the microprofiler on Unigine 2.13.0.1 (DirectX 11):

How can I avoid this Environment generate mipmaps stall? It appears quite often and creates spikes in the profiler:

[screenshot: microprofile capture of the Environment generate mipmaps spike]

I usually have all materials/shaders ready in the first few frames, but I still get these update_material_conditions calls, since I inherit from the materials in C++ code so I can set up custom things (like per-instance textures). The following one is quite strange, since it is called so many times in the same frame:

[screenshots: microprofile captures of the repeated update_material_conditions calls]

I also get these (more rarely, though) - probably related to inserting new nodes into the scene?

[screenshot: microprofile capture]

I also have no idea why this appears (I am not using landscape or grass):

[screenshot: microprofile capture of the update grass stall]

 

Maybe you can indicate some possible causes for these. I do take care to update video data (textures, vertex buffers) at a certain pace so I don't get video driver spikes.

One thing that I do freely, without any pacing, is calling setParent and enable in the main thread for any nodes that became available (fully created) that frame. Can this be a source of stalls inside the engine? Should I pace the activation of nodes, e.g. enable one, check the time passed, and enable another only if there is time left in the current frame? Or, if the engine's associated actions are done lazily (in future frames), should I cap the rate of node processing (number of nodes enabled per second)?
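For illustration, here is a minimal sketch (generic C++, not Unigine API) of what such per-frame pacing could look like, assuming each pending node's setParent/enable work is wrapped in an `activate` callback:

```cpp
#include <chrono>
#include <deque>
#include <functional>

// Hypothetical sketch: pace node activation so the main thread never spends
// more than a fixed time budget per frame on setParent()/enable() calls.
// "PendingNode" and "activate" are illustrative stand-ins, not Unigine API.
struct PendingNode {
    std::function<void()> activate; // e.g. wraps setParent() + setEnabled(1)
};

class NodeActivationQueue {
public:
    void push(PendingNode n) { pending_.push_back(std::move(n)); }

    // Call once per frame from the main thread; stops once the budget is spent.
    // Returns the number of nodes activated this frame.
    int processFrame(std::chrono::microseconds budget) {
        using clock = std::chrono::steady_clock;
        const auto start = clock::now();
        int activated = 0;
        while (!pending_.empty() && clock::now() - start < budget) {
            pending_.front().activate();
            pending_.pop_front();
            ++activated;
        }
        return activated;
    }

    size_t pendingCount() const { return pending_.size(); }

private:
    std::deque<PendingNode> pending_;
};
```

Whatever does not fit into the budget simply carries over into the next frame's call.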

Before you ask: it is pretty hard to provide a test sample, as we depend on very large IP-protected data.

Kind Regards,

Adrian L.


I would also like to know how to avoid this when a shader is first encountered:
[screenshot: microprofile capture of the shader compilation spike]

Should I just warm up the scene (render something offscreen with these shaders)?


Hi Adrian,

We've fixed a couple of spikes related to Environment generate mipmaps in the 2.14.x releases, so you should see them less often. They were caused by incorrect VSync handling in the DX11 API.

Regarding the other strange spikes: without seeing the actual scene and debugging on our side, it's pretty much impossible to say what is going on. The GPU driver can be very unstable in terms of multi-threaded usage (and as I can see, you are heavily using background resource loading as well). Sometimes changing the GPU results in totally different driver behavior and a whole new spectrum of microprofile messages. Maybe signing an NDA between our companies would allow you to send us your project for debugging?

To eliminate the various shader spikes when a new material appears, you need to do a manual warm-up procedure. Basically, all you need to do is spawn nodes with your custom materials in front of the camera and render them (while doing this, you can show a splash screen). You can check the warm-up procedure in the Superposition demo as a reference. You can do it once (before shipping the final application) or simply do it before each start (the first start will be slower, but subsequent starts will be faster because the cache will already be generated).
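As a rough illustration of this warm-up idea (a generic C++ sketch, not the actual Superposition code; the hook names are assumptions), the loop just forces each material through one rendered frame so the driver compiles its shaders up front:

```cpp
#include <string>
#include <vector>
#include <functional>

// Hypothetical warm-up pass: spawn a small proxy object for each custom
// material in front of the camera and render one frame, so shader
// compilation happens during loading instead of during gameplay.
// The callbacks are illustrative stand-ins for engine calls.
struct WarmupHooks {
    std::function<void(const std::string &)> spawnProxy;  // place a quad with the material
    std::function<void()> renderFrame;                    // draw (splash screen on top)
    std::function<void(const std::string &)> removeProxy; // clean up the proxy
};

// Returns how many materials were warmed up.
int warmupMaterials(const std::vector<std::string> &materials,
                    const WarmupHooks &hooks)
{
    int warmed = 0;
    for (const auto &name : materials) {
        hooks.spawnProxy(name);  // the material is "first encountered" here...
        hooks.renderFrame();     // ...so the compile stall happens now, not later
        hooks.removeProxy(name);
        ++warmed;
    }
    return warmed;
}
```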

Thanks!

How to submit a good bug report
---
FTP server for test scenes and user uploads:


Hi and thank you for the support!

Glad to hear that the Environment generate mipmaps issue is fixed. Since we are stuck with 2.13.0.1 for now, is this an easy fix? Maybe you can provide a source hotpatch.

I will also see if we can arrange some testing application.

Some quick followup questions:

- Related to the update grass stall: any idea why this may happen? I am not using landscape or grass at all. Maybe some GPU bottleneck (caused by some sort of flush)?

- Related to Unigine::cpu shader wait: I noticed this is caused by the scene intersection code and always happens after I stream new nodes in. Does it depend on the number of newly enabled nodes? Would it help to pace the setParent/enable calls to a certain cap (max new nodes per frame)?

Regards,

Adrian


You can try to rebuild the engine on your side with a modified source/engine/framework/direct3d11/D3D11RenderContext.cpp:

// instead of:
swap_chain->Present(parent_app->getVSync() ? 1 : 0, 0);

// use:
if (parent_app->getVSync())
{
	context->Flush();
	swap_chain->Present(1, 0);
}
else
{
	swap_chain->Present(0, 0);
}

That will help if you are not using the AppWall or AppSeparate plugins (i.e. with the default main application). If this helps, we can send you a patched 2.13 version from our build system (but it will be exactly the same) :)

 

Quote

- Related to the update grass stall: any idea why this may happen? I am not using landscape or grass at all. Maybe some GPU bottleneck (caused by some sort of flush)?

It shouldn't happen if there is no grass object in the scene. We need to check this on a real scene with a debugger attached.

 

Quote

- Related to Unigine::cpu shader wait: I noticed this is caused by the scene intersection code and always happens after I stream new nodes in. Does it depend on the number of newly enabled nodes? Would it help to pace the setParent/enable calls to a certain cap (max new nodes per frame)?

You can try to experiment and reduce the number of added nodes per frame, but we can only give you more accurate information after we reproduce the same behavior on our side.

Thanks!




Hi, ok, I will try the Present fix and let you know. I will also try the node pacing.

Regards,
Adrian


About the Present fix: I didn't notice much difference, plus the Environment generate mipmaps stall actually appeared when I didn't have vsync activated!

On another note, as I am working on pacing the node insertions, I have a question: does the order of enabling nodes (parent vs. children) matter, performance-wise? To give an example:

Is this:

1) create parent
parent.enable(true)
gradually insert children into parent and enable them

faster than:

2) create parent
gradually insert children into parent and enable them
parent.enable(true)

I am thinking 1) is faster, since we don't enable an entire branch already filled with children all at once (like we do in 2). Instead, in 1) I insert and enable the children gradually into an already enabled parent. Does it make a difference?
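To make the comparison concrete, here is a toy model (not Unigine's actual implementation) where setEnabled() recursively visits children, counting how many node visits each ordering costs:

```cpp
#include <vector>

// Toy model of a scene-graph node: setEnabled() recursively touches all
// children, so enabling an already-populated parent visits the whole branch.
struct Node {
    bool enabled = false;
    std::vector<Node *> children;
    int visits = 0; // how many times setEnabled() touched this node

    void setEnabled(bool state) {
        enabled = state;
        ++visits;
        for (Node *c : children)
            c->setEnabled(state);
    }
    void addChild(Node *c) { children.push_back(c); }
};

// Option 1: enable the empty parent first, then insert + enable children.
// Returns the total number of node visits.
int option1(Node &parent, std::vector<Node> &kids) {
    parent.setEnabled(true); // branch is empty: touches 1 node
    for (auto &k : kids) { parent.addChild(&k); k.setEnabled(true); }
    int total = parent.visits;
    for (auto &k : kids) total += k.visits;
    return total;
}

// Option 2: insert + enable all children first, then enable the full branch.
int option2(Node &parent, std::vector<Node> &kids) {
    for (auto &k : kids) { parent.addChild(&k); k.setEnabled(true); }
    parent.setEnabled(true); // touches the parent plus every child again
    int total = parent.visits;
    for (auto &k : kids) total += k.visits;
    return total;
}
```

With 3 children the model gives 4 visits for option 1 and 7 for option 2, matching the intuition that option 2 re-touches every child when the filled branch is enabled at the end.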

Kind Regards,
Adrian L.

Quote

I didn't notice much difference, plus the Environment generate mipmaps actually appeared when I didn't have vsync activated

And if you enable vsync, does this help? Please keep in mind that the fix will only work for the default main app (with no AppWall / AppSeparate plugins enabled).

 

Quote

I have a question: does the order of enabling nodes (parent vs children) matters, performance wise?

The first option should, in theory, be slightly faster, since setEnabled() also updates all the child nodes when called. In practice, however, some benchmarks are needed :)

Thanks!


  • 2 weeks later...

Hello again,

 

Related to the Present patch: I didn't notice any significant improvements, no matter if vsync is on or off. I still have that Environment mipmap stall appearing (maybe a little less often).

 

About the node insertions: it turns out insertion itself is pretty fast. But my theory is that it may still create stalls later, when the first visibility check or first rendering is done; it is very hard to link one to the other (node insertion versus what the engine does later with the new nodes). What I can tell is that, overall, the spikes are bigger after inserting nodes than when not changing the scene graph at all. Also, inserting one node per frame (and even waiting a few milliseconds before inserting another one in a later frame) doesn't help much.

 

That being said, we also have stalls when deleting objects. Some even take 50-60 ms. Deletion of ObjectMeshStatic is especially problematic. It probably doesn't help that we have objects with many (hundreds or thousands of) surfaces (and some surfaces may hold a lot of geometry). But they are built this way to avoid other stalls related to having too many objects during rendering/visibility checks (we try to merge geometry as much as possible).

So what are my options when deleting those? Since I need to do this in the main thread, I don't see any option but to somehow break up the individual deletions. I already delete one, check the time passed, and bail out of the current frame's deletions if it takes too long, so I take care not to occupy the main thread too much (which, by the way, the engine doesn't do itself with its deleteLater pending list: it just deletes everything it finds in that frame). Maybe:

- take the mesh out of the object (grab the pointer, set a null pointer) and then delete the mesh in a background thread? Is this safe?

- delete individual surfaces in the main thread, one per frame. But I don't think it is currently possible to delete individual surfaces via the API.

How do you deal with object deletion in your background-threading demos/applications?

Kind Regards,
Adrian


Hi Adrian,

Quote

Related to the Present patch: I didn't notice any significant improvements, no matter if vsync is on or off. I still have that Environment mipmap stall appearing (maybe a little less often).

That's interesting. However, we still don't know what exactly triggers that behavior in your particular case.

 

Quote

How are you dealing with object deletion in your background threading demos / applications?

We try not to create/delete objects at runtime. If you need to delete an object, you can simply disable it (setEnabled(0)) and move it out of the frustum, so it does not take part in the current spatial update process.

If you need to create/delete objects too often, maybe you should create some object pools and re-use objects from them instead of constructing/deleting on the fly. That would be much faster.
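A minimal sketch of such a pool (generic C++; `Handle` and the callbacks are stand-ins for engine node pointers and setEnabled(), not actual Unigine API):

```cpp
#include <deque>
#include <functional>
#include <cstddef>

// Hypothetical node pool: instead of deleting an object, disable it and park
// it for later reuse, so expensive construction/destruction is avoided.
template <typename Handle>
class NodePool {
public:
    NodePool(std::function<Handle()> create,
             std::function<void(Handle &, bool)> setEnabled)
        : create_(std::move(create)), set_enabled_(std::move(setEnabled)) {}

    // Get a node: reuse a parked one if available, otherwise construct fresh.
    Handle acquire() {
        if (free_.empty())
            return create_(); // pool miss: construct a new node
        Handle h = free_.front();
        free_.pop_front();
        set_enabled_(h, true);
        return h;
    }

    // "Release" instead of delete: disable the node (and, in the engine,
    // move it out of the frustum so it drops out of the spatial update).
    void release(Handle h) {
        set_enabled_(h, false);
        free_.push_back(std::move(h));
    }

    size_t freeCount() const { return free_.size(); }

private:
    std::function<Handle()> create_;
    std::function<void(Handle &, bool)> set_enabled_;
    std::deque<Handle> free_;
};
```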

Thanks!



Hi,

About object deletion: we do need to delete objects as the player moves, and we have quite a lot of data to deal with. I can try object pooling, but I am afraid that unless Unigine offers some specialized pool, I would still need to delete the old content (is this even possible via the Unigine API?) to fill in the new one. And I don't think releasing the actual memory allocations is the issue, but rather removing video data, removing all the other associated objects kept alive by smart pointers, removing nodes from the Unigine scene graph, etc.

At least, does it help to disable objects before trying to delete them (disable the static mesh object, wait a frame, delete it)?

I'm trying to find some proven approaches before implementing all sorts of memory pools that may not even help in the end.

Kind Regards,
Adrian L.


Hi Adrian,

Quote

At least, does it help to disable objects first before trying to delete them (disable static mesh object, wait a frame, delete it)?

No, it doesn't matter.

Quote

Take the mesh out of the object (take the pointer, set a null pointer) and then delete the mesh in a background thread? Is this safe?

No, you must not delete nodes in background threads. It will cause undefined behavior and crashes.

Quote

Delete individual surfaces in the main thread, one by one, per each frame. But I don't think it is currently possible to delete individual surfaces via API.

Yes, you can't do this via the API.

Quote

Deletion of ObjectMeshStatic is especially problematic. Probably it doesn't help that we have objects with many (hundreds or thousands) surfaces (and some surfaces maybe have a lot of geometry). But they are this way to try to avoid other stalls related to too many objects when rendering/visibility checks are done (we try to merge geometry as much as possible).

Thousands of surfaces... This is sad. At the moment, the engine can't remove meshes separately, in parts, in background threads. We are thinking about it.

I don't know what else to advise you. All you can do now:
1) Call deleteLater() on only a part of the nodes that you want to delete, for example only 10-100 per frame.
2) Make the nodes simpler: use fewer surfaces per ObjectMeshStatic.
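Point 1 could be sketched like this (generic C++; `DeleteFn` is an illustrative stand-in for a node's deleteLater() call, not actual Unigine API):

```cpp
#include <deque>
#include <functional>
#include <cstddef>

// Paced deletion: queue doomed nodes and issue at most `cap` deletions per
// frame, spreading the deletion cost over several frames.
class DeletionQueue {
public:
    using DeleteFn = std::function<void()>;

    void scheduleDelete(DeleteFn fn) { doomed_.push_back(std::move(fn)); }

    // Call once per frame from the main thread.
    // Returns how many deletions were issued this frame.
    int processFrame(int cap) {
        int deleted = 0;
        while (deleted < cap && !doomed_.empty()) {
            doomed_.front()(); // e.g. node->deleteLater()
            doomed_.pop_front();
            ++deleted;
        }
        return deleted;
    }

    size_t pendingCount() const { return doomed_.size(); }

private:
    std::deque<DeleteFn> doomed_;
};
```

With 25 queued nodes and a cap of 10, the work drains over three frames (10, 10, 5) instead of one big spike.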

Best regards,
Alexander


Hi Alexander, and thank you for the advice. I will try deleting only a few nodes per frame (calling deleteLater only for a few), keeping the others in a pending list, basically pacing the deletion across multiple frames.

Now that I think about it, we have a draw call per surface anyway, so having fewer objects with that many surfaces doesn't help with draw calls, right? I mean, it may help with other things like frustum culling and inserting nodes into the scene graph (overall loading time), but it doesn't help with the CPU cost of the DirectX rendering calls. So maybe I should put a limit on surfaces per static mesh and just create multiple static meshes when needed? It may help in some cases to delete those thinner static meshes faster, but I believe we would still have cases where a few fat objects create stalls due to the amount of geometry inside (we also try to keep as few surfaces per object as possible by collapsing any geometry we can).

 

On another note, related to the microprofiler-reported stall for "update grass" (even though I don't have any grass in the scene): I inserted my own timings into the code and broke into the execution when it happened, and it seems related to waiting for a mutex to unlock, more specifically the first call in this block:

        mutex.lock();
        for (auto &it : iterators)
        {
            if (frame == (counter++ % render_streaming_destroy_duration))
                resources_update.append(it.key);
        }
        mutex.unlock();

inside the update for the grass RenderManagerResources. This block is reached from the main thread, while the mutex is passed in as a parameter from RenderManager. So I believe the mutex is somehow kept locked for too long by one of the background threads (maybe during deletion/creation of other resources?).

Kind Regards,
Adrian L.


Yes, after some tests, it seems that the deletion duration of a static mesh has a direct relation to the number of surfaces (and no correlation with the total number of triangles):

duration=10.90(ms): surfaces no=42, trigsNo=1564
duration=17.05(ms): surfaces no=85, trigsNo=63564
duration=24.35(ms): surfaces no=99, trigsNo=52170
duration=111.47(ms): surfaces no=507, trigsNo=57821

So, if there are no other possible slowdowns, I would like to impose a limit of around 20-30 surfaces per object and create many objects rather than a few with many surfaces. Do you think this will create other slowdowns in other parts of the system (like frustum culling, rendering calls, etc.)?
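For illustration, here is a hypothetical sketch of that batching step: splitting a flat surface list into groups of at most N, each of which would become its own object (the integer surface ids are placeholders, not Unigine types):

```cpp
#include <vector>
#include <algorithm>
#include <cstddef>

// Hypothetical sketch: partition a list of surface ids into batches of at
// most `maxSurfaces`, so each batch can become its own ObjectMeshStatic and
// the per-object deletion cost stays bounded.
std::vector<std::vector<int>> batchSurfaces(const std::vector<int> &surfaceIds,
                                            size_t maxSurfaces)
{
    std::vector<std::vector<int>> batches;
    for (size_t i = 0; i < surfaceIds.size(); i += maxSurfaces) {
        size_t end = std::min(i + maxSurfaces, surfaceIds.size());
        batches.emplace_back(surfaceIds.begin() + i, surfaceIds.begin() + end);
    }
    return batches;
}
```

Applied to the 507-surface case measured above with a cap of 30, this yields 17 objects, the last one holding 27 surfaces.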


Hello Adrian,
 

Quote

So I believe the mutex is somehow kept locked for too long by one of the background threads

At first glance, this mutex shouldn't be locked from background threads if you don't create objects from them. You can try to localize which thread is keeping the mutex to understand how to deal with it. You may use ReentrantMutex to debug this; it saves the owner thread id.
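If rebuilding with Unigine's ReentrantMutex is inconvenient, a generic debug wrapper (plain C++, not engine code) can record the owning thread in the same way, so a stalled waiter can report who is holding the lock:

```cpp
#include <mutex>
#include <thread>
#include <atomic>

// Debug mutex wrapper: remembers which thread currently owns the lock.
// std::thread::id is trivially copyable, so std::atomic works on it.
class OwnedMutex {
public:
    void lock() {
        mutex_.lock();
        owner_.store(std::this_thread::get_id(), std::memory_order_relaxed);
    }
    void unlock() {
        owner_.store(std::thread::id(), std::memory_order_relaxed);
        mutex_.unlock();
    }
    // Id of the thread holding the lock (default-constructed id if unlocked).
    std::thread::id owner() const {
        return owner_.load(std::memory_order_relaxed);
    }

private:
    std::mutex mutex_;
    std::atomic<std::thread::id> owner_{};
};
```

When the main thread blocks on lock(), logging `owner()` just before waiting identifies the background thread responsible for the stall.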

Quote

Do you think this will create other slowdowns in other parts of the system (like frustum culling, rendering calls, etc.)?

It's hard to say without benchmarking. On one side, more nodes create more load on the spatial tree. On the other side, the surface list is just a linear list, which leads to linear complexity in some cases, and 500 may simply be too much, so subdividing may even lead to some improvements. The number of draw calls should be the same, though.
