Does this CUDA scan kernel only work within a single block, or across multiple blocks?


I am doing homework and was given a CUDA kernel that performs a primitive scan operation. Am I correct in saying that this kernel will only complete the scan if a single block is used (due to int id = threadIdx.x)? Is this true?

  // Hillis & Steele: Kernel Function
  // Jake Heath, October 8, 2013 (c)
  // - Modified by KD: input array changed to unsigned int instead of int
  __global__ void scanKernel(unsigned int *in_data, unsigned int *out_data, size_t numElements)
  {
      // We create an extra space for every element, so the array needs to be 2 * numElements in size.
      // CUDA does not like dynamic arrays in shared memory, so the size of this
      // allocation must be stated explicitly.
      __shared__ int temp[1024 * 2];

      // instantiate variables
      int id = threadIdx.x;
      int pout = 0, pin = 1;

      // load input into shared memory
      // exclusive scan: shift right by one and set the first element to 0
      temp[id] = (id > 0) ? in_data[id - 1] : 0;
      __syncthreads();

      // for each thread, loop through each of the steps;
      // each step, the offset is doubled
      for (int offset = 1; offset < numElements; offset <<= 1)
      {
          // swap the buffer indices so the data can bounce back and forth
          // between the two halves of the shared array
          pout = 1 - pout;
          pin = 1 - pout;

          // IF: this element needs something added to it; add the contents of the
          //     element offset places away, then store it in its corresponding space
          // ELSE: this element just drops down; simply copy its contents
          //       to the corresponding space
          if (id >= offset)
          {
              temp[pout * numElements + id] = temp[pin * numElements + id] + temp[pin * numElements + id - offset];
          }
          else
          {
              temp[pout * numElements + id] = temp[pin * numElements + id];
          }
          __syncthreads();
      }

      // write the output
      out_data[id] = temp[pout * numElements + id];
  }

I would like to modify this kernel to work across several blocks, by changing int id = threadIdx.x to int id = threadIdx.x + blockDim.x * blockIdx.x. But shared memory is private to each block, which means the scan running in one block cannot share its intermediate results with the others.

So, am I correct in saying that this kernel will only complete the scan if a single block is used (due to int id = threadIdx.x)? Is this true?

Not exactly. This kernel will work no matter how many blocks you launch, but all blocks will receive the same input and compute the same output, because of how id is calculated:

  int id = threadIdx.x;   

This id is independent of blockIdx, and is therefore the same in every block, regardless of how many are launched.


If I were to make a multi-block version of this scan without changing too much code, I would introduce an auxiliary array to store the per-block sums. Then, run the same scan on that array to calculate the per-block increments. Finally, run a last kernel to add each block's increment to its elements. If memory serves, the CUDA SDK samples contain a similar kernel.

On Kepler and later hardware, this code could be written much more efficiently, notably through the use of warp shuffle intrinsics. Additionally, changing the algorithm to work per-warp instead of per-block would get rid of the __syncthreads() calls and improve performance. A combination of these improvements would let you get rid of shared memory entirely and work only with registers for maximum performance.
