
Rendering performance issue using multiple applications



We are currently in the final stages of a simulator project and are optimizing to get the system performance to where we want it. We have already made significant progress but have hit the figurative wall where rendering times are concerned.

The target hardware is a multi-headed Linux server with a 10-core 7900X CPU, 32 GB of DDR4 RAM and four NVIDIA 1080 Ti graphics cards. X is currently set up with independent X server/screen layouts, one for each graphics card. Each card is configured to output two 1920x1080 signals, with each screen being 1920x2160.
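For reference, each of the four X displays is driven by its own Device/Screen/ServerLayout triple, roughly along the lines of the sketch below, with each X server started against its own layout. This is only a sketch, not our actual config: identifiers, bus IDs and connector names are placeholders.

    # one Device/Screen/ServerLayout triple per card (placeholder identifiers)
    Section "Device"
        Identifier "GPU0"
        Driver     "nvidia"
        BusID      "PCI:23:0:0"
    EndSection

    Section "Screen"
        Identifier "Screen0"
        Device     "GPU0"
        # two 1920x1080 outputs stacked into one 1920x2160 screen; connector names are placeholders
        Option     "MetaModes" "DP-0: 1920x1080 +0+0, DP-2: 1920x1080 +0+1080"
    EndSection

    Section "ServerLayout"
        Identifier "Layout0"
        Screen     "Screen0"
    EndSection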

The Heaven 4.0 benchmark on the most extreme settings, including AppWall 1x2, runs at ~41 frames/sec when running four simultaneous instances on this hardware. Running fewer instances results in higher framerates: 45 for three, 49 for two and 53 for a single instance. This is not yet troublesome, but it is noteworthy, since each instance has its own dedicated GPU and there should be enough CPU cores to keep the instances from competing for CPU time.

Our application setup consists of a master application instance and three slave application instances, each using one of the four X server/graphics card combos. Communication between the instances is done through shared memory. The main thread of each instance is responsible for rendering, while the simulation itself and other CPU-heavy work is performed in other threads in the master instance, so frame rate depends almost directly on Unigine rendering speed. We use an optimized AppProjection plugin, so only a single render pass is done for the two displays.
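To make that structure concrete, here is a heavily simplified sketch of the master/slave frame sync (POSIX shared memory assumed, all names invented for illustration; this is not our actual code):

    // Hypothetical sketch of the shared-memory frame sync described above.
    #include <atomic>
    #include <cstdint>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    struct FrameSync {
        std::atomic<uint64_t> master_frame;    // bumped by the master once per simulation step
        std::atomic<uint64_t> slave_frame[3];  // each slave publishes the last frame it rendered
        // ... shared simulation state consumed by the slaves would live here ...
    };

    // Map the shared block; the master creates it, the slaves just attach.
    static FrameSync *map_shared(bool create) {
        int fd = shm_open("/sim_framesync", create ? (O_CREAT | O_RDWR) : O_RDWR, 0600);
        if (fd < 0) return nullptr;
        if (create && ftruncate(fd, sizeof(FrameSync)) != 0) { close(fd); return nullptr; }
        void *p = mmap(nullptr, sizeof(FrameSync), PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);
        return p == MAP_FAILED ? nullptr : static_cast<FrameSync *>(p);
    }

    // Slave render loop: wait for the master to publish a newer frame, render it, report back.
    void slave_loop(FrameSync *sync, int slave_index) {
        uint64_t rendered = 0;
        for (;;) {
            uint64_t current;
            while ((current = sync->master_frame.load(std::memory_order_acquire)) <= rendered)
                ;  // spin on shared memory; the real loop would also pump the engine main loop
            // engine->render(...);  // per-instance Unigine rendering happens on this (main) thread
            rendered = current;
            sync->slave_frame[slave_index].store(rendered, std::memory_order_release);
        }
    }

The sync itself is that lightweight, which matches the observation below that removing it entirely does not change the numbers.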

Run as a single instance, our application currently runs at 45-75 frames/sec in general. At a specific test location it runs at 60 frames/sec (all ms numbers below are from this location).

Now for the conundrum: each time an instance is added the frame rate drops considerably. Adding one drops it below 50, two below 30, and with three it is crawling along at 15 frames/sec.

Using the microprofiler we confirmed that rendering (Update + Render) accounts for 95% of the time spent. With four instances about 60 ms is spent in Update + Render on the master instance; with three it's 45 ms, with two 30 ms and with just the one 15 ms. It actually looks like something is scaling perfectly, aside from the fact that there are four independent GPUs. The slave instances have shorter times (but still too long compared to the single-instance rate of 60 Hz) and spend the remainder waiting on the master. Note that the drops are not caused by the shared memory syncing; four independent instances without any syncing exhibit the same performance degradation.

Note as well:

- a development laptop (MSI GS63VR) with a 4-core CPU and a mobile 1060 is able to run all four instances at about 18 frames/sec. So a couple of frames faster!

- a development PC with a 6-core CPU and a 1070 is able to run all four instances at roughly 30 frames/sec. So considerably faster.

- disabling AppProjection (eliminating the warping/blending shader and halving the total resolution from 1920x2160 to 1920x1080) has absolutely no effect on framerate. Frame time (Update + Render) remains at 60 ms (!)

- the GPUs are idling. In fact the utilisation is so low that the fans don't even spin up, because of the low core temperatures of around 55 degrees (!). Minimal Present times confirm the GPUs aren't the bottleneck. Running the Heaven benchmark, by contrast, keeps temperatures over 85 degrees constantly.

- from a visual estimation (using xosview) the CPU utilisation is 25-50% at most, with not even a single core being maxed constantly.

- if we enable a mirror (complete extra render pass) in a slave instance, this seems to have the same effect on overall frame rate as adding a slave instance. So one slave with mirror equals two slaves with no mirrors.

- the master instance is always started first, so its Unigine is 'first'.

- the application is started with vsync disabled (__GL_SYNC_TO_VBLANK=0)

- changing the X server configuration to a single head with four screens makes no difference; updating to the latest NVIDIA driver, no difference either.

 

We're now officially stumped. All our heavy lifting is out of the render time-critical path, so we cannot optimize this away in our code. And even though it seems unlikely, both the perfect scaling (15->30->45->60) and the fact that halving the GPU (pixel) workload has no effect whatsoever lead us to the thought that Unigine is possibly spending time doing nothing somewhere. A busy wait? Unigine instances interfering with each other somehow, causing side effects? As I said, we're pretty stumped.

We have Unigine source, so we can add profiling to narrow things down, but it would be very helpful to know where the most likely candidates/problem spots are.

In early December this simulator will be on display at a trade show, and this is putting a serious dent in our progress. Any insights or advice, as to either the cause or how to proceed in finding it, would be greatly appreciated!

 

 

 


Hi Esger,

In this case the bottleneck might be PCI Express bus bandwidth. The deferred renderer is constantly transferring a lot of GBuffer textures (you can check their size via the render_info console command). Effects like TAA can also significantly increase the bus load. Because of that there is no way to gain any performance with SLI (even 2-way SLI) - the bus load is too high. Multi-GPU always creates additional overhead, so the better results with a single GPU are understandable.

Heaven is built on top of the outdated Unigine 1 engine (mostly forward rendering), so its GBuffer is much, much smaller.

Switching to PCIe 4.0 (or even 5.0 in the near future) should give back some performance.

But I'm not sure this is a bus issue; it could be almost anything, since the setup is not really a common one and we have never tested such things on our side.

Is building a multi-PC config not an option in your case? A single GPU per PCIe bus would be a better choice than a multi-GPU setup.

Thanks!


Hi Silent,

Multi-PC is the way we did it years ago. Having everything on one PC has reliability benefits (fewer parts), cost benefits (fewer parts) and technical benefits (fast syncing through shared memory, one hardware clock for all GPUs so all video signals are synchronised), etc. Going with multiple PCs is the current backup option, yes, but that loses the above benefits and incurs additional hardware costs. And there is something here that does not make sense, but I'll get back to that further below.

I know what you are saying regarding SLI and its diminishing returns. The last time we investigated, we found absolutely zero performance improvement using SLI on Linux (that might have improved in recent years, though). But note that we are not using an SLI setup. The four cards are not bridged; each is linked to its own X server, and each application instance in turn connects to its own X server. So we have four parallel 'streams':

Master Instance -> Xserver 1 -> GPU 1
Slave Instance 1 -> Xserver 2 -> GPU 2
Slave Instance 2 -> Xserver 3 -> GPU 3
Slave Instance 3 -> Xserver 4 -> GPU 4

Regarding bus bandwidth: the 7900X is a 44-lane CPU, so the GPUs are connected in x8, x8, x16, x8 fashion. As far as I am aware not even a 1080 Ti can fill the bandwidth of a PCIe 3.0 x16 link; we would have to check x8.
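As a rough back-of-the-envelope sanity check (taking PCIe 3.0 at roughly 985 MB/s per lane after 128b/130b encoding):

  • x16 gives about 15.8 GB/s and x8 about 7.9 GB/s per card
  • a full 1920x2160 RGBA8 framebuffer is roughly 16.6 MB, so even pushing one complete frame per card across the bus at 60 Hz would be around 1 GB/s, comfortably within an x8 link

Of course the engine moves a lot more than a final framebuffer (GBuffer targets, texture streaming), so this is only a sanity check, not proof either way.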

Heaven being Unigine 1 is a good point, and one we overlooked. So we did some additional testing with the Superposition demo. Test results below:

  • Superposition single instance 1080p extreme (score, then min/avg/max fps)
    • 1080Ti connected @x8:  5070 32/38/43 fps
    • 1080Ti connected @x16: 5100 32/38/43 fps
    • 1060M dev laptop: 379 2.4/2.5/3.1 fps
    • 1070 dev pc: 2735 17/21/24 fps
  • Superposition three instances 1080p extreme
    • #1 1080Ti @x8: 5038 27/37/43 fps
    • #2 1080Ti @x8: 5026 24/37/43 fps
    • #3 1080Ti @x16: 5117 32/38/43 fps
  • Superposition four instances 1080p extreme
    • #1 1080Ti @x8: 4986 23/37/43 fps
    • #2 1080Ti @x8: 4974 25/37/42 fps
    • #3 1080Ti @x16: 5025 24/37/42 fps
    • #4 1080 @x8: 3329 21/24/29 fps

Two conclusions can be drawn. Performance-wise, a 1080 Ti really is faster than a 1080, which is faster than a 1070, which is faster than a 1060M, which is all within expected parameters. Secondly, while there is a small performance difference between the x16 and the x8 GPUs, the difference is for all intents and purposes negligible here. Importantly, there is almost no slowdown going from one instance to four, so I do not think we are bus I/O bound. As all GPUs were at 100% utilisation, we also ran tests at the lowest settings to try to make it CPU bound:

  • Superposition one instance 720p all settings lowest/off
    • 1080Ti @x8: 17803 89/133/179 fps
  • Superposition two instances 720p all settings lowest/off
    • #1 1080Ti @x8: 15306 78/114/158 fps
    • #2 1080Ti @x8: 15034 66/112/166 fps
  • Superposition three instances 720p all settings lowest/off
    • #1 1080Ti @x8: 13175 65/99/142 fps
    • #2 1080Ti @x8: 13391 59/100/145 fps
    • #3 1080Ti @x16: 13232 61/99/138 fps
  • Superposition four instances 720p all settings lowest/off
    • #1 1080Ti @x8: 12430 60/92/137 fps
    • #2 1080Ti @x8: 12470 47/93/167 fps
    • #3 1080Ti @x16: 12233 52/91/150 fps
    • #4 1080 @x8: 10868 52/81/123 fps

GPU utilisation for all runs was 41%. Multiple instances do exhibit a performance drop now; however, this drop is still only around 30% going from one instance to four, whereas in our application rendering suffers a 75% performance loss. From xosview observation it seems that with one or two instances, one or two CPU cores actually do get maxed constantly. Starting from three there are no cores being constantly maxed, and according to i7z no cores are running over 4 GHz (the 7900X base is 3.4 GHz, boost is 4.3 GHz, turbo is 4.5 GHz). So with four instances we are not GPU bound here (41% utilisation) and visually it does not seem we are CPU bound (no cores maxed, not running at max clocks). That leaves being I/O bound. As this server has pretty fast memory (DDR4 at 3.3 GHz, 8.4 ns), we tested again with XMP switched off, meaning the memory ran at 2.1 GHz and probably higher CAS timings, or at roughly two-thirds of its original speed.

  • Superposition four instances 720p all settings lowest/off XMP off
    • #1 1080Ti @x8: 10190 56/76/105 fps
    • #2 1080Ti @x8: 9326 48/69/94 fps
    • #3 1080Ti @x16: 10019 50/74/105 fps
    • #4 1080 @x8: 8697 46/65/101 fps
  • Superposition four instances 1080p extreme XMP off
    • #1 1080Ti @x8: 5023 avg 37 fps
    • #2 1080Ti @x8: 5017 avg 37 fps
    • #3 1080Ti @x16: 5056 avg 37 fps
    • #4 1080 @x8: 3339 avg 37 fps

So eliminating one third of the memory bandwidth results in a <20% drop in performance. Extrapolating from here to our application seems fraught with danger and possibly quite nonsensical, but assuming for a moment linear scaling of bandwidth vs performance: a 33% bandwidth cut costing under 20% of performance works out to roughly 0.6% of performance per 1% of bandwidth, so for a 50% performance drop (going from one instance to two) memory bandwidth would need to decrease by around 80%. Which seems silly... but still, the combined picture here does not seem to imply that losing 50% of the bandwidth (to the other instance) would halve performance. Note that at the extreme setting there is no performance loss at all.

So we're not GPU bound, we're not CPU bound, and it is questionable whether we are memory I/O bound.

We also noticed that the content dev PC (980 Ti) does about 250 allocations when using the editor, whereas the same scene in our application does about 1000-1500 on the server and, interestingly, about 800 on average on my laptop. What are these allocations and what causes them?

With all this testing and data there is still one thing that has us completely stumped:

For the last test we removed three cards from the server (7900X, 10-core, 4.3 GHz) so it had just one 1080 Ti (x16) and one X server configured. Running our application with four instances results in 16 fps. (Note that there is absolutely no performance drop from the one card having to do all the work; with four cards it was 16 as well.)

My laptop with a 4-core 2.6 GHz 6700HQ and a 1060M does this at 20 fps!

How the hell can a 6700HQ at 2.6 GHz plus a 1060M not only match but actually beat a 7900X plus a 1080 Ti???

 

This is the part that really does not make sense to us and has us pulling our hair out. Somewhere in Unigine, OpenGL or the driver, time is being spent... doing nothing?? Remember, we weren't GPU bound or CPU bound. If we are memory bound, the laptop must hit the same bound; yet even though everything on the laptop is slower, including the memory itself, its performance is 30% higher.

We have not repeated the same test with the 1070 dev PC, but if I remember correctly it does it at 30 fps, so it is also faster.

 

Is there any way to get some metrics from Unigine about idle time, memory use, etc.? Are there things we could perhaps disable temporarily for testing, to try to locate what is causing this?

 

In the meantime I will see if I can get Intel VTune working to get some memory I/O numbers.

 

PS: while typing a post, the forum regularly refreshes automatically, presenting the main forum page again. Fortunately, after going 'back' and clicking the reply box all the text is still there, but it is a bit disconcerting.


Hi Esger,

Quote

PS: while typing a post, the forum regularly refreshes automatically, presenting the main forum page again. Fortunately, after going 'back' and clicking the reply box all the text is still there, but it is a bit disconcerting.

That's a known issue for now. A fix has not been found yet, so the redirection issue will be with us for some time.

Regarding your performance results - they are very strange. Almost identical scores on different hardware are the first sign of a CPU-bound application. As you can see with Superposition, on presets higher than Low and Medium the behaviour is totally as expected (it uses as little CPU as possible) and there is no CPU bottleneck.

It's hard to say what is going on in your application (especially on Linux, where so few profiling tools are available). Out of the box the engine doesn't do anything that could drop the framerate that much.

For debugging, I guess it's better to first get a decent framerate on a single-GPU setup and then try to add more hardware. Maybe starting with an empty scene and then uncommenting the parts of the code that have more CPU influence will give you a hint as to where a lot of the allocations happen, so you can find the sweet spot and optimize it.

Without seeing the whole picture it's hard to predict what could possibly be going wrong, sorry.

 


Hi Silent,

I understand the amount of crystal-ball gazing this asks of you. Regarding your remark about CPU-boundness: I can understand a 1080 Ti not providing any benefit over a 1060 if the application is CPU bound in a major way. What I don't understand is how a CPU-bound application can run faster on a slower CPU.

Using VTune on a very simple version of our application shows that 80-90% of CPU time is spent inside Unigine. The extra work the normal version does is almost entirely in other threads, so I would expect a similar percentage for the full application's main thread.

To give you a little bit more information:

  • Our scene consists of 4 km by 4 km of ObjectTerrain tiles, with up to several hundred vehicles being moved around and up to a few dozen animations. (For testing purposes all vehicles were disabled.)
  • Traffic logic is handled completely in a separate thread
  • The vehicle model is handled completely in a separate thread
  • Education logic is partly still in the main thread but is computationally negligible
  • Our application/Unigine integration is based on the GLAppQt sample

I will ask Rene to post some screenshots here. But from a main thread PoV it is mostly Unigine running the show, aside from some bookkeeping.

If there is anything we can provide to give you better insight, just let us know.

One thing we will try is to see what happens when one application instance drives multiple cards, with one big window containing multiple viewports.

 

A coworker also made an important observation: Superposition is Unigine 2.6 based, we assume? We are still on Unigine 2.3. Have there been any changes between 2.3 and 2.6 that would significantly alter performance?

We are planning to upgrade, but would not normally do so this close to the end of a project. However, if it can be shown that upgrading would go a long way towards solving this issue, that is a position we might be willing to reconsider. ;)

 


Hi Esger,

Quote

 What I don't understand is how a CPU-bound application can run faster on a slower CPU.

Slower in multi-threaded workloads, perhaps? Some CPUs that are good at multi-threaded workloads are bad at single-core workloads, and in some cases a cheap Pentium can outperform an expensive Core i7 (but only by a couple of frames per second or less):

[attached image]

Superposition is mostly built on top of the 2.4 / 2.5 SDKs. We currently do not recommend a 2.6 upgrade for existing projects that are close to release. Another major difference is that OpenGL 4.5 is enabled in Superposition versus OpenGL 4.4 in the 2.3 SDK; I am not sure if that is relevant in any way.

The reason we didn't enable OpenGL 4.5 in 2.5 was a broken ObjectMeshSkinned; if you don't have such objects you can try to recompile the engine with the opengl_45=1 flag set to see if it changes anything.

For now I would recommend starting with an empty scene and no logic, then increasing its complexity by adding more code and objects to find out when exactly you get this performance drop.

If you can send us a project (with sources) that shows around 15 fps on a GTX 1080 Ti on Linux, we can test it on our single-GPU PC.

Thanks!

2 hours ago, esger.abbink said:

I will ask Rene to post some screenshots here. But from a main thread PoV it is mostly Unigine running the show, aside from some bookkeeping.

As Esger already explained, we generated the terrain using the Landscape plugin as it was available in Unigine 2.3.
I've attached images of the settings we used at the time of creation (and, as a result, are still using).
On my workstation I can get the application to run at a solid 60+ fps with our content, without any edits to the landscape settings (LODding etc.).

[attached screenshots: Landscape plugin settings]

If you need any more information from my side (content), I'll be glad to supply it.


Is Unigine capable of dealing with multiple GPUs on Linux out of the box? On Windows we had to alter the source code to make Unigine use the second video card (we injected an extra command-line parameter that was forwarded to device creation), as it would always use the first one even when running multiple instances.


Yes, we have not needed to make any source changes there. Everything is set up by X and the (NVIDIA) driver: each Unigine instance only 'sees' the card belonging to the X server the application connects to. Using Xinerama or SLI Mosaic it should be possible to have a single instance use all cards present; we're not using such configurations though, so I cannot confirm they would scale well.

Let me know if you'd like more information.
