noticing that the three loops over i, j, and k in any #dense #linear #solver can be placed in any of the six possible orders, and that only some of those orders yield the data reuse that lets #vector #architectures show their stuff.
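To make the loop-order point concrete, here is a minimal sketch of two of the six orderings. It assumes the solver kernel is LU factorization without pivoting (my assumption; the post does not name a specific algorithm), and the function names `lu_kij` and `lu_jki` are made up for illustration. Both compute the same packed factors, but the innermost loop sweeps along a row in one and down a column in the other, which is the kind of difference that decides how well the data access pattern suits a given machine.

```python
import copy

def lu_kij(a):
    """kij order: k outermost; the innermost j loop sweeps along row i."""
    a = copy.deepcopy(a)            # leave the caller's matrix untouched
    n = len(a)
    for k in range(n - 1):
        for i in range(k + 1, n):
            a[i][k] /= a[k][k]      # multiplier l_ik (assumes a nonzero pivot)
            for j in range(k + 1, n):
                a[i][j] -= a[i][k] * a[k][j]
    return a                        # L strictly below the diagonal, U on and above

def lu_jki(a):
    """jki order: j outermost; the innermost i loop sweeps down column j."""
    a = copy.deepcopy(a)
    n = len(a)
    for j in range(n):
        # Apply every earlier elimination step k < j to column j.
        for k in range(j):
            for i in range(k + 1, n):
                a[i][j] -= a[i][k] * a[k][j]
        # Scale the subdiagonal of column j to form the multipliers.
        for i in range(j + 1, n):
            a[i][j] /= a[j][j]
    return a
```

Calling both functions on the same small, diagonally dominant matrix should give matching results; only the order in which the updates are applied differs. A real solver would add pivoting, which this sketch leaves out.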
paper "In-Place Transposition of Rectangular Matrices" by Fred G. Gustavson and Tadeusz Swirszcz, which provides a solution to the problem.
An online version of the paper can be found here:
http://www.orcca.on.ca/conferences/cca2008/papers/gustavson.pdf
Cycles of Permutation Related to Rectangular Matrix Transposition