
Asynchronous copy of unigine::texture in a unigine::image (PBO?)



Hi.

We need to transfer rendered data (just the pixels) over a network. For this we have a "double-buffered" offscreen viewport with two textures we render into, and we would like to capture the data asynchronously (as with OpenGL PBOs or DX11 staging resources), so that the download of texture#1 into an image (whose data will be sent asynchronously over TCP/IP) happens while we render to texture#2.

Is this possible with Unigine 2.14.1?

The download/copy from GPU to RAM is done with Texture::getImage(). We cannot use renderImage2D() because we have our own post-process steps.

Link to comment

Dear @gr7.76,

I have gained a lot of experience with this: uploading textures from RAM to the GPU, and staging/mapping them from the GPU back to RAM.

The problem is that, by specification, the DirectX Map API waits for the complete transfer to happen; it is synchronous. So GPU drivers probably cannot offer asynchronous behavior for this functionality.

But there are a few architectural things that may help, and if you buy a specialized GPU it becomes much easier to make the transfer asynchronous.

Please refer to this explanation:

Rohit

Edited by rohit.gonsalves
Link to comment

Dear @gr7.76,

I am uploading 10-12 high-definition textures in the UYVY color space, and mapping two high-definition UYVY textures to system memory, all in real time at up to 60 FPS.

This works flawlessly with 2.14.1. In your case, if it is only one texture, even an RGBA one, you can still send it in real time. You may use the ring buffer approach.

I hope the resolution is not 4K or more. If the resolution is higher than expected, convert the texture to the UYVY or NV12 color space with a shader on the GPU before sending it, and then transfer the smaller texture over PCIe. These transfers are the bottleneck. On the CPU you can then convert from UYVY back to the required format. But if you are sending over a network, it is also a good idea to send the data as UYVY and convert to RGBA (or whatever format the display requires) only at the final stop.

Saving bandwidth on the network is a good idea as well.
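To put rough numbers on the bandwidth argument, here is a minimal sketch, assuming tightly packed pixels (32 bits per pixel for RGBA8, 16 for UYVY, 12 for NV12) and no row padding. The helper names frameBytes and bytesPerSecond are made up for illustration and are not part of any Unigine API.

```cpp
#include <cassert>
#include <cstdint>

// Size of one tightly packed frame in bytes.
// bitsPerPixel: 32 for RGBA8, 16 for UYVY (4:2:2), 12 for NV12 (4:2:0).
inline std::uint64_t frameBytes(std::uint64_t width,
                                std::uint64_t height,
                                std::uint64_t bitsPerPixel) {
    const std::uint64_t bits = width * height * bitsPerPixel;
    assert(bits % 8 == 0 && "frame size must be a whole number of bytes");
    return bits / 8;
}

// Sustained readback (or network) rate needed for a given frame rate.
inline std::uint64_t bytesPerSecond(std::uint64_t width,
                                    std::uint64_t height,
                                    std::uint64_t bitsPerPixel,
                                    std::uint64_t fps) {
    return frameBytes(width, height, bitsPerPixel) * fps;
}
```

For a 3840x2160 frame, frameBytes(3840, 2160, 32) gives 33,177,600 bytes for RGBA8 versus 16,588,800 for UYVY and 12,441,600 for NV12, so at 60 FPS the readback drops from roughly 2.0 GB/s to roughly 1.0 GB/s (UYVY) or 0.75 GB/s (NV12).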

Rohit

 

Link to comment

@rohit.gonsalves

Hi, Thanks for taking the time to answer. In 

You write:

Quote

There is another thread that pops up the stacked texture from ring buffer and maps for the GPU to CPU transfer. texture->getImage()

and

Quote

To solve your problem at this time, move the map calls in a separate thread. It is not difficult.

That's basically something we tried, but our app crashes in glBindTextures (I use Linux). Very likely because we currently have no OpenGL context in the second thread. I am trying to fix that but I do not know if this is needed for DirectX11 (we also have a Windows build). Is there something I specifically need to pay attention to?

The rest (resolution, pixel format, upload, etc.) is irrelevant since it is not a Unigine API issue and also since our users are free to choose resolution and format (YUV422 is one of many options).

Edited by gr7.76
Link to comment

Dear @gr7.76,

You should not call glBindTextures in a separate thread. By "map calls in a separate thread" I did not mean the DirectX Map calls.

To make things easier, please check the following Intel RealSense D455 capture, which uses the approach discussed above but with a ring buffer of size 1.

Please create a basic world and use these files. Take a look at the process method in D455-RGB. It continuously produces frames at 30 FPS. The main thread calls the capture method and uploads the image from CPU to GPU. This blocks the main thread whenever the D455 thread is holding the mutex in the process method.

Please try this approach for your output texture first (mine is a capture). Your first test should be successful.

Then, instead of blocking there, you may use a structure like this:

#include <queue>

// One ring-buffer slot: a GPU texture paired with its CPU-side image.
struct VideoData
{
	TexturePtr textureOnGPU;
	ImagePtr   imageOnCPU;
};

// Queue shared between the main thread and the worker thread.
std::queue<VideoData> m_queueVideoTransferUpdateSurface;

// Ring size (number of slots).
int iQueueSize = 2;

  1. In the init method, fill the queue with default textures and images up to the ring size.
  2. Protect enqueue and dequeue with a mutex.
  3. In the capture method, dequeue data from the queue and use it.
  4. In the process method, enqueue data and check against the ring size.
  5. Continue until the process ends.
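The steps above can be sketched roughly as follows. This is only a minimal illustration, assuming a bounded, mutex-protected queue; VideoRing and its methods are hypothetical names, and the VideoData here uses plain placeholders (an int and a byte vector) instead of TexturePtr and ImagePtr so that it compiles without the engine.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <queue>
#include <utility>
#include <vector>

// Placeholder slot; a real app would hold TexturePtr / ImagePtr instead.
struct VideoData {
    int textureId = 0;                 // stands in for textureOnGPU
    std::vector<std::uint8_t> pixels;  // stands in for imageOnCPU
};

// Bounded FIFO shared by a producer thread and a consumer thread.
class VideoRing {
public:
    explicit VideoRing(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }

    // Non-blocking enqueue: returns false when the ring is full,
    // so the caller can drop the frame instead of stalling.
    bool tryPush(VideoData item) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (queue_.size() >= capacity_) return false;
        queue_.push(std::move(item));
        notEmpty_.notify_one();
        return true;
    }

    // Blocking dequeue: waits until at least one slot is available.
    VideoData pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        notEmpty_.wait(lock, [this] { return !queue_.empty(); });
        VideoData item = std::move(queue_.front());
        queue_.pop();
        return item;
    }

    std::size_t size() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return queue_.size();
    }

private:
    const std::size_t capacity_;
    mutable std::mutex mutex_;
    std::condition_variable notEmpty_;
    std::queue<VideoData> queue_;
};
```

With a ring size of 2, the producing side would call tryPush() once per frame and skip the frame if it returns false, while the consuming thread loops on pop(). Note that this only decouples the two threads on the CPU side; it does not make the GPU readback itself asynchronous.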

If you really want asynchronous behavior, then you need to buy an NVIDIA RTX Quadro card, which supports Direct 2 GPU. These cards have specialized firmware and drivers, and the card itself creates an OpenGL context internally once you use the Direct2GPU API. It takes the buffer from the map or bind functions and returns from the call immediately. NVIDIA hooked the DirectX map and OpenGL bind functions with a special driver implementation. The buffer is then queued to an extra OpenGL ring buffer, and the driver takes care of sending the queued buffers to the CPU with special synchronization. Your thread, however, is now asynchronous.

A long time ago I tried to imitate something like this with an extra OpenCL process that would accept the buffers, but it was difficult for me to talk to two processes simultaneously on the GPU. The NVIDIA engineers have written special firmware and drivers for it; they have access to everything.

Hope this helps. In any case, with a consumer GPU this is as much as we can do; everything is synchronous.

Rohit

AppWorldLogic.cpp AppWorldLogic.h D455-RGB.cpp D455-RGB.h

Link to comment

@rohit.gonsalves Thanks again for your time.

If I understand your example properly, texture::setImage() (which uploads the data from CPU to GPU) is done sequentially with the rendering, and in the main thread.

I would like to do something else: be able to render to texture#2 while texture#1 is being downloaded into the image. But it seems (apparently confirmed by other sources) that this is generally not possible (and not Unigine-specific), unless I use the "NV Dual Copy Engine" that you mention. This is not an option.

What we did before we switched to Unigine was to have one render target and two PBOs: while glReadPixels was transferring the data into one PBO, we would memcpy the previous frame from the other PBO into the buffer to be processed (the actual processing of the data occurs in a separate thread). It seems this cannot be done either.

Well, I need to find something else to speed things up (at 4K: about 25 ms to render and 9 ms to capture).

 

Closed for now.

 

 

Edited by gr7.76
Link to comment