Does this CUDA scan kernel only work within a single block, or across multiple blocks?


I am doing homework and was given a CUDA kernel that performs a primitive scan operation. Am I correct in saying that this kernel will only complete the scan if a single block is used (due to int id = threadIdx.x)? Is this true?

  // Hillis & Steele: Kernel Function
  // Jake Heath, October 8, 2013 (c)
  // - Modified by KD: input array changed to unsigned int instead of int
  __global__ void scanKernel(unsigned int *in_data, unsigned int *out_data, size_t numElements)
  {
      // We create an extra space for every element, so the array needs to be 2 * numElements in size.
      // CUDA does not like dynamic arrays in shared memory, so the size of this
      // allocation must be stated explicitly.
      __shared__ int temp[1024 * 2];

      // instantiate variables
      int id = threadIdx.x;
      int pout = 0, pin = 1;

      // load input into shared memory
      // exclusive scan: shift right by one and set the first element to 0
      temp[id] = (id > 0) ? in_data[id - 1] : 0;
      __syncthreads();

      // for each thread, loop through each of the steps;
      // each step, the offset is doubled
      for (int offset = 1; offset < numElements; offset <<= 1)
      {
          // swap the buffer indices so the data can bounce back and forth
          // between the two halves of the shared array
          pout = 1 - pout;
          pin = 1 - pout;

          // IF: this element needs something added to it; add the contents of the
          //     element offset places away, then store it in its corresponding space
          // ELSE: this element just drops down; simply copy its contents
          //       to the corresponding space
          if (id >= offset)
          {
              temp[pout * numElements + id] = temp[pin * numElements + id] + temp[pin * numElements + id - offset];
          }
          else
          {
              temp[pout * numElements + id] = temp[pin * numElements + id];
          }
          __syncthreads();
      }

      // write the output
      out_data[id] = temp[pout * numElements + id];
  }

I would like to modify this kernel to work across several blocks, by changing int id = threadIdx.x to int id = threadIdx.x + blockDim.x * blockIdx.x. But shared memory is private to each block, which means the scan running in one block cannot share its intermediate results with the others.

So, am I correct in saying that this kernel will only complete the scan if a single block is used (due to int id = threadIdx.x)? Is this true?

Not exactly. This kernel will work no matter how many blocks you launch, but all blocks will receive the same input and compute the same output, because of how id is calculated:

  int id = threadIdx.x;   

This id is independent of blockIdx, and is therefore the same in every block, regardless of how many are launched.


If I were to make a multi-block version of this scan without changing too much code, I would introduce an auxiliary array to store the per-block sums. Then, run the same scan on that array to calculate the per-block increments. Finally, run a last kernel to add each block's increment to its elements. If memory serves, the CUDA SDK samples contain a similar kernel.

On Kepler and later hardware, this code could be written much more efficiently, notably through the use of warp shuffle intrinsics. Additionally, changing the algorithm to work per-warp instead of per-block would get rid of the __syncthreads() calls and improve performance. A combination of these improvements would let you get rid of shared memory entirely and work only with registers for maximum performance.
