
2.14.1 Infinite Loop in RenderContext::endRenderWindow() ?




Dear Unigine Team,

Two of our 2.14.1-based projects are "freezing" after a random amount of time. For one of them I was able to catch the issue in the debugger. It seems that Timer::getTime() returns a negative value, which causes an infinite loop in RenderContext::endRenderWindow(). Please see the attached short video.

Digging out my little bit of assembly knowledge, the following seems to be happening:

  1. Timer::getTime() writes a 64-bit integer timer value into the RAX register.
  2. The cvtsi2sd instruction converts the signed integer into a double value, which is written into xmm0 (screenshot: stuck_xmm0.png).
  3. After that it is multiplied by a positive value stored in xmm6.
  4. Finally it is compared with another positive value in xmm7.
  5. If xmm0 * xmm6 is less than xmm7, execution jumps back to Timer::getTime(), which again returns a negative value.

Of course, I cannot be 100 percent sure that Timer::getTime() isn't delivering some positive value after a certain amount of time (hence the "?" in the title) - but intuitively it seems quite strange that a timer returns a negative value.
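
For readers without the video, here is a minimal C++ sketch of what the disassembly seems to boil down to (wait_for_frame_end, scale and target are my guesses for illustration, not the engine's actual code):

// Hypothetical reconstruction of the wait loop in RenderContext::endRenderWindow();
// 'scale' and 'target' stand in for the positive constants held in xmm6 and xmm7,
// and getTime() stands in for Timer::getTime().
long long getTime(); // imagine this keeps returning a negative value

void wait_for_frame_end(double scale, double target)
{
	double t;
	do
	{
		t = double(getTime()); // cvtsi2sd: signed 64-bit integer -> double (xmm0)
	} while (t * scale < target); // never exits while getTime() stays negative
}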

Could the issue be related to the render_max_fps setting? It is set to 150 in the world file, which is never reached. In this particular case the software had been running for 4 days at around 30 fps on Windows 10.

I should also mention that we need to fix that quite urgently, because we have to ship the software soon.

Many thanks

Helmut

 

stuck_xmm6.png

stuck_xmm7.png

2022-04-04 10-36-17.mkv

Link to comment

Hi Helmut,

Thanks for the detailed description. Unfortunately, based on this information it's still pretty hard to guess what could possibly cause this.

As an option, you can try changing Max FPS from 150 to 0 to avoid entering this particular code path that controls the max FPS. Since your app is running at around 30 fps anyway, there should be no issues, I guess?
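
(For reference, that's the render_max_fps setting mentioned above; assuming you want to override the world file value from the engine console, it should be something like:)

render_max_fps 0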

In the meantime we will try to reproduce this behavior on our test stand. I hope that leaving the engine running for 4 days will result in something similar to your case (otherwise, I'm afraid we would need to get your project for testing).

Thanks!

How to submit a good bug report
---
FTP server for test scenes and user uploads:

Link to comment

Hi Silent,

Just to add: we are using Max FPS to steady out the scene and cap it at 30 fps, not only on this but also on other projects.

Maybe we could do that via VSync as well, but the Max FPS function is very useful because we do not have to rely on the driver doing it.

Best. Werner

Link to comment

I'm not quite sure how to connect these two messages together:

Quote

It is set to 150 in the world file, which is never reached. In this particular case the software had been running for 4 days at around 30 fps on Windows 10.

Quote

we are using Max FPS to steady out the scene and cap it at 30 fps, not only on this but also on other projects.

So are you using Max FPS 30 or not? :)

If you are using it on this project that you need to ship right now, there is no other way to fix it than setting Max FPS to zero and using the driver's VSync.

Thanks!

How to submit a good bug report
---
FTP server for test scenes and user uploads:

Link to comment

Sorry, that was a bit misleading.

The first project, which produced the halt, had Max FPS 150 turned on.

But basically we use Max FPS (30 or 60 most of the time) in a lot of projects to keep all the elements constant.

For previous projects we always used VSync for exactly those reasons, but also to prevent frame tearing on large public displays like LED walls and projectors.

So that always helped. 

For some reason, in some setups (GFX card, driver, Unigine version, project, etc.) VSync didn't work, so we were happy to see that we can rely on Unigine's internal FPS cap.

That's just for explanation.

 

Regarding the latest issue: Helmut suspects some issue with the combination of CPU clocking and Max FPS, and we wanted to report it to you and also ask whether there are known issues or suspects in that direction.

We had two projects stopping after a couple of hours/days, both of which used Max FPS.

 

So the counter-test now would be to give up Max FPS and see what happens.

We will do so.

Sorry for the confusion.

Best. Werner

Link to comment

Hello Silent,

thank you for looking into that.

7 minutes ago, silent said:

I'm not quite sure how to connect these two messages together:

In this project we can deactivate the frame limiter, but we have others in production with an irregular workload per frame, and we are using render_max_fps there to make the framerate more stable.

Diving a little bit deeper: Timer::getTime() calls Windows' QueryPerformanceCounter(), which internally uses the RDTSCP instruction on that machine. According to the Microsoft docs (https://docs.microsoft.com/en-us/windows/win32/sysinfo/acquiring-high-resolution-time-stamps) this is not always the case, so the issue might be machine-dependent.
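
For reference, the values involved can be inspected with a few lines of standalone Win32 code (my own snippet, not engine code):

#include <windows.h>
#include <cstdio>

int main()
{
	LARGE_INTEGER frequency, counter;
	QueryPerformanceFrequency(&frequency); // ticks per second, constant at runtime
	QueryPerformanceCounter(&counter);     // current tick count
	printf("frequency: %lld Hz, counter: %lld ticks\n",
	       frequency.QuadPart, counter.QuadPart);
	return 0;
}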

The application is still running; I will see if I can find the source of the negative value (i.e. whether the offset is greater than the timestamp).

Regarding the infinite loop: Visual Studio allowed me to insert a breakpoint, and indeed it is still in that loop according to that test.

Thanks

Helmut

Link to comment

Little update: I think it is an overflow in Timer::getTime(), which happens because the value returned by QueryPerformanceCounter() is multiplied by 1,000,000 (screenshot: overflow_qpc_times_100000.png).

I think the following is happening:

  • QueryPerformanceCounter() has just finished executing; the result is in [tval]. It returned TRUE in the RAX register, which is just 1.
  • For normalisation the result is multiplied by 1,000,000 immediately afterwards; this is the place where the overflow happens.
  • The cqo instruction (https://www.felixcloutier.com/x86/cwd:cdq:cqo) following the imul instruction sign-extends the 64-bit value into a 128-bit value located in the RDX:RAX register pair, which is negative (needed for the following idiv instruction (https://www.felixcloutier.com/x86/idiv), which operates on a 128-bit dividend).
  • The idiv instruction therefore uses a negative number as input (it always uses the RDX:RAX pair as input; screenshot: overflow_cqo.png).

Please observe that RAX starts with 8 in hex, which indicates that RAX was already negative before the execution of cqo. Also, RDX is 0xFFFFFFFFFFFFFFFF, which is the sign extension of the negative RAX.
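
For completeness, here is a standalone illustration of mine of the same overflow (the tick count is made up; since signed overflow is undefined behaviour in C++, the demo wraps via unsigned arithmetic, which produces the same bit pattern imul leaves in RAX):

#include <cstdio>

int main()
{
	// Any tick count above (2^63 - 1) / 1000000 (~9.22e12) makes the
	// microsecond conversion exceed the signed 64-bit range.
	unsigned long long ticks = 9300000000000ULL;
	long long microseconds = (long long)(ticks * 1000000ULL);
	printf("%lld\n", microseconds); // prints a negative value
	return 0;
}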

All the best

Helmut

Link to comment

We will try that, but it will take some time to find out.

I should mention that all other places where Timer::getTime() is called (and all other instances, in case it is not static) are affected as well, irrespective of the offset value. And i have seen that it is at least invoked by numerous worker threads - all of them will read negative values.

So I think it is something which should be fixed ;-)

Thanks

Helmut

Link to comment

In the meantime you can implement your own Max FPS analogue that should not cause any issues. Something like this:

LARGE_INTEGER prev_frame_end;

int AppWorldLogic::swap()
{
	const auto max_fps = 32.0f;

	static int initialized = 0;
	static LARGE_INTEGER frequency;
	
	if (initialized == 0)
	{
		initialized = 1;
		QueryPerformanceFrequency(&frequency);
		QueryPerformanceCounter(&prev_frame_end);
	}

	// When the max FPS value is provided, we may need to wait some amount of time
	// at the end of each frame to enforce the requested frame rate.
	{
		float frame_duration = 1.0f / max_fps;
		const auto get_elapsed = []()
		{
			LARGE_INTEGER tval;
			LARGE_INTEGER  elapsed;
			QueryPerformanceCounter(&tval);
			elapsed.QuadPart = tval.QuadPart - prev_frame_end.QuadPart;
			elapsed.QuadPart *= 1000000;
			elapsed.QuadPart /= frequency.QuadPart;
			return double(elapsed.QuadPart) / 1000000.0;
		};

		while (get_elapsed() < frame_duration)
		{

			// Do nothing, just spin and wait.
		}
	}
	QueryPerformanceCounter(&prev_frame_end);

	return 1;
}

We will see what we can do to fix this behavior in the upcoming SDK versions.

How to submit a good bug report
---
FTP server for test scenes and user uploads:

Link to comment

To add more information to this:

I am using similar spin locks to cap the frame rate at 25, 50 and 60 fps, using Timer::getFloatTime().

Never had an issue. 

Quote

double dInterval = get_Target_FPS_Interval(); // (1 / 50)

auto CurrentTime = Timer::getFloatTime();
auto NextFrameTime = m_dLastFrameTime + dInterval;

while (CurrentTime < NextFrameTime)
{
	CurrentTime = Timer::getFloatTime();
}

m_dLastFrameTime = CurrentTime;

In one of the checks we ran our application for 10 days straight using 2.14.1.

@helmut.bressler Still, the problem is interesting.

Link to comment

@silent Thank you very much for the suggested workaround, I think that should be easy to implement.

Yeah, it is. It is even more mysterious on the second machine: it got stuck in a loop after approximately 3 weeks (but the machine has been switched on for much longer).

(screenshot: stuck_snake.png)

On this machine the debugger shows the local variables of the timer. If the debugger is correct, the frequency here is only 1000000. It is an old Core i5 4000-something CPU (the other one was a quite new 11600K). The strange thing here is that the offset is negative (I don't know when the offset value is sampled). Maybe this is because the machine has been running for so long already.

Anyway, the issue is the same: the value returned by QueryPerformanceCounter(), multiplied by 1,000,000, exceeds the range of a signed 64-bit integer and overflows into the negative range.

And it is actually quite easy to see that the code is problematic. Take, for example, an imaginary CPU with a clock frequency of 4.2 GHz (or 2^32 Hz for the sake of simplicity), where QueryPerformanceCounter() uses the RDTSC instruction, which simply returns the value of a clock counter. In other words, the counter increases by 2^32 per second.

The timer implementation multiplies that value by 1,000,000 (I suppose in order to translate it into microseconds) BEFORE it divides it by the frequency.

For our hypothetical CPU that would mean that the overflow happens, in the best case, after around 2.1 million seconds, which is something between 3 and 4 weeks. (The best case is when the counter is zero when the application starts.)

Cheers

Helmut

 

Link to comment

Dear Helmut,

GetTickCount() will run continuously for 49.7 days, since it stores milliseconds in 32 bits and wraps around zero at 2^32 - 1.

But QueryPerformanceCounter(), which returns a LONGLONG, can count up to 9,223,372,036,854,775,807 on the positive side before turning negative with an overflow.

For your hypothetical CPU of 4.2 GHz, there will be 2^32 - 1 ticks per second.

Then the seconds that can be stored are 9,223,372,036,854,775,807 / 4,294,967,295 ~ much, much more than 3-4 weeks.

If you replaced the timing code with spin loops and even used GetTickCount64(), I suppose there should not be this small limitation of 3-4 weeks. Even with GetTickCount() and its milliseconds precision, the program should run for at least 7 weeks.

Did I miss anything or misunderstand the calculations?

Rohit

Link to comment

Hello Rohit,

you are right, both GetTickCount() and QueryPerformanceCounter() run for a very long time (I think QueryPerformanceCounter() is specified to work for at least 100 years - which actually makes my imaginary CPU example incorrect, though only by a factor of 2 or something). To clarify what's happening inside Timer::getTime(), I'll try to translate it into C code (it's easier for me to explain things in code, I'm not a text writer :-)

static LARGE_INTEGER frequency;
static long long offset; ///< in microseconds
static int initialized = 0;

static UNIGINE_API long long Timer::getTime()
{
	if (!initialized)
	{
		// initialise frequency and offset here
		return 0;
	}
	else
	{
		LARGE_INTEGER tval;
		LARGE_INTEGER elapsed;
		QueryPerformanceCounter(&tval);
		elapsed.QuadPart = tval.QuadPart;
		elapsed.QuadPart *= 1000000;
		elapsed.QuadPart /= frequency.QuadPart;

		return (long long)elapsed.QuadPart - offset;
	}
}

As you can see, the elapsed time is multiplied by 1,000,000 (in order to translate the result into microseconds) before it is divided by the frequency. The multiplication is done before the division because the division is an integer division and the frequency is in hertz, but we need microsecond precision.
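
To put numbers on that (my own back-of-the-envelope check, not engine code): the multiplication wraps as soon as the raw counter exceeds (2^63 - 1) / 1000000 ~ 9.22 * 10^12 ticks, so the time until overflow depends only on the reported frequency:

#include <cstdint>
#include <cstdio>

int main()
{
	// Largest raw counter value that survives the *= 1000000 step.
	const int64_t max_ticks = INT64_MAX / 1000000; // ~9.22e12 ticks

	// Seconds until overflow: 1 MHz matches the old Core i5 machine
	// mentioned earlier, 2^32 Hz is the hypothetical TSC-backed counter.
	const int64_t frequencies[] = { 1000000LL, 1LL << 32 };
	for (int64_t f : frequencies)
		printf("freq %lld Hz -> overflow after ~%lld s\n", f, max_ticks / f);
	return 0;
}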

This is problematic because a high frequency can easily eat up all the range gained from using 64-bit integers. Instead, I believe something like the following could be done:

static LARGE_INTEGER frequency;
static long long offset; ///< in microseconds
static int initialized = 0;

static UNIGINE_API long long Timer::getTime()
{
	if (!initialized)
	{
		// initialise frequency and offset here
		return 0;
	}
	else
	{
		LARGE_INTEGER tval;
		LARGE_INTEGER elapsed;
		QueryPerformanceCounter(&tval);
		elapsed.QuadPart = tval.QuadPart;

		if (frequency.QuadPart > THRESHOLD)
		{
			// High frequency: pre-divide the frequency by 1000 so the
			// intermediate product stays 1000x smaller (costs precision).
			elapsed.QuadPart *= 1000;
			elapsed.QuadPart /= (frequency.QuadPart / 1000);
		}
		else
		{
			elapsed.QuadPart *= 1000000;
			elapsed.QuadPart /= frequency.QuadPart;
		}

		return (long long)elapsed.QuadPart - offset;
	}
}

or something similar - but this sacrifices some precision.

I think the problem is not so trivial to solve with 64 bits - the Windows performance counter API lets you either measure time very precisely but only over a short time span, or over a longer time span but less precisely. Maybe a solution could be to use 128-bit integers instead?
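
For illustration, a minimal sketch of the 128-bit idea, assuming MSVC's x64 _mul128/_div128 intrinsics (available in newer MSVC versions); ticks_to_us() is a hypothetical helper of mine, not engine code:

#include <intrin.h>

long long ticks_to_us(long long ticks, long long frequency)
{
	long long high;
	long long low = _mul128(ticks, 1000000LL, &high); // full 128-bit product
	long long remainder;
	// 128-bit dividend / 64-bit divisor (the quotient must fit in 64 bits).
	return _div128(high, low, frequency, &remainder);
}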

Another possibility would be to implement render_max_fps the way @silent proposed - subtracting the previous frame's timestamp keeps the elapsed value in that code always small - but this wouldn't solve the issue for Timer::getTime() in general.

Cheers

Helmut

Link to comment

Hi @helmut.bressler,

I will also do a test with Timer::getTime() and will let you know my results. 

@silent I was checking the diff between the 2.15 and 2.14.1 documentation. It seems there are changes in the Timer and Time classes in the new SDK, but the upgrading section does not mention any of this.

Rohit

Link to comment
6 minutes ago, rohit.gonsalves said:

@silent I was checking the diff between the 2.15 and 2.14.1 documentation. It seems there are changes in the Timer and Time classes in the new SDK, but the upgrading section does not mention any of this.

Hm, that's strange. We'll see if we can fix this :) Thanks!

How to submit a good bug report
---
FTP server for test scenes and user uploads:

Link to comment