I needed to use 16-bit logic to achieve this. I also moved the dx*2 and dy*2 calculations out of the loop for better performance.
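
For illustration only, here is a minimal C sketch of the idea. It assumes the loop in question is a Bresenham-style line routine with 8-bit coordinates, which the original code may or may not match. The `Point` type and the `line_points` function are hypothetical names invented for this example. With coordinates in 0..255, dx*2 and dy*2 can reach 510 and the error term can go negative, so a signed 16-bit type is used for them. Both doubled values are computed once before the loop.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical output record; not taken from the original code. */
typedef struct {
    uint8_t x, y;
} Point;

/* Writes the pixels of the line (x0,y0)-(x1,y1) into out (at most cap
 * entries) and returns how many were written. Coordinates are 8-bit,
 * but the deltas and the error term are kept in 16 bits. */
size_t line_points(uint8_t x0, uint8_t y0, uint8_t x1, uint8_t y1,
                   Point *out, size_t cap)
{
    int16_t dx = (int16_t)abs((int)x1 - (int)x0);   /* 0..255 */
    int16_t dy = (int16_t)abs((int)y1 - (int)y0);   /* 0..255 */
    int sx = (x0 < x1) ? 1 : -1;                    /* step direction in x */
    int sy = (y0 < y1) ? 1 : -1;                    /* step direction in y */

    /* Hoisted out of the loop: these never change while drawing.
     * Each can be as large as 510, which does not fit in 8 bits. */
    int16_t dx2 = (int16_t)(dx * 2);
    int16_t dy2 = (int16_t)(dy * 2);

    uint8_t x = x0, y = y0;
    size_t n = 0;

    if (dx >= dy) {
        /* Shallow line: step x every iteration, y when the error is positive. */
        int16_t err = (int16_t)(dy2 - dx);
        for (int16_t i = 0; i <= dx; i++) {
            if (n < cap) out[n++] = (Point){x, y};
            if (err > 0) {
                y = (uint8_t)(y + sy);
                err = (int16_t)(err - dx2);
            }
            err = (int16_t)(err + dy2);
            x = (uint8_t)(x + sx);
        }
    } else {
        /* Steep line: same logic with the roles of x and y swapped. */
        int16_t err = (int16_t)(dx2 - dy);
        for (int16_t i = 0; i <= dy; i++) {
            if (n < cap) out[n++] = (Point){x, y};
            if (err > 0) {
                x = (uint8_t)(x + sx);
                err = (int16_t)(err - dy2);
            }
            err = (int16_t)(err + dx2);
            y = (uint8_t)(y + sy);
        }
    }
    return n;
}
```

Calling `line_points(0, 0, 255, 255, buf, 256)` exercises the worst case, where dx2 and dy2 are both 510 and the error term swings between -255 and 255. An 8-bit error term would overflow there.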