Micro-architectural support for improving synchronization and efficiency of SIMD execution on GPUs