cs61c_lab7

Exercise 1:

Scenario 1: 1. Because (Step Size) * 4 in bytes (32 bytes) is exactly equal to Cache Size in bytes. 2. hit rate will always be zero. because the index portion is always 00, and the address we access is always mapped to the first set.However the tag portion is always different, so the first cache line is always evicted so that the hit rate will always be zero. 3. Change Step Size to 1 Scenario 2 1. There are two memory accesses. 2. One miss followed by three hit. Because every four accesses will increase index portion by one which lead one miss, and the next there accesses will not change the flag and index portion. 3. The hit rate is 75%. 4. When Rep Count goes to infinity, the hit rate will be very closed to 100%. Because all data in array has been cached. 5. We can access part of the array which size is equal to the size of cache size at a time and apply all of the function to the part of the array so we can be completely done with it before moving on, thereby keeping that the data of portion hot in the cache and not having to circle back to it later on.

Scenario 3 1. The hit rate of our L1 cache is 50% and the hit rate of L2 cache is 0%. Overall hit rate is 50%. 2. We have 32 accesses to L1 cache total. 16 of them misses. 3. We have 16 accesses to L2 cache total. They are the accesses which missed in L1 cache. 4. We can increase Rep Count to increase our L2 hit rate.Because the first iteration cache all the data to L2 cache, so L2 hit rate has increased. However the tag portion changed in L1 cache and the data are not cached, so our L1 hit rate will not change. 5. When we increase the number of blocks in L1, the hit rates for L1 and L2 will not change. When We increase the L1 block size, the hit rate for L1 will increase and for L2 will not change.

Exercise 2:

ijk:    n = 1000, 1.702 Gflop/s
ikj:    n = 1000, 0.160 Gflop/s
jik:    n = 1000, 1.647 Gflop/s
jki:    n = 1000, 9.107 Gflop/s
kij:    n = 1000, 0.155 Gflop/s
kji:    n = 1000, 8.787 Gflop/s

jki loop order performs best. Because in the innermost iteration, it moves through B with stride 0 and moves through A and C with stride 1.
kij loop order performs worst. Because in the innermost iteration, it moves through B and C with stride n.
We stride through the matrices with respect to the innermost loop continuous can reduce the time to multiply the matrices.

Exercise 3:

代码实现如下：

void transpose_blocking(int n, int blocksize, int *dst, int *src) {
    // YOUR CODE HERE
    for (int bx = 0; bx + blocksize <= n; bx += blocksize) {
        for (int by = 0; by + blocksize <= n; by += blocksize) {
            for (int x = 0; x < blocksize; x ++) {
                for (int y = 0; y < blocksize; y ++) {
                    dst[by + y + (bx + x) * n] = src[bx + x + (by + y) * n];
                }
            }
        }
    }
    if (n % blocksize == 0) return;
    int ex = (n / blocksize) * blocksize;
    int ey = ex;
    for (int x = ex; x < n; x ++) {
        for (int y = 0; y < ey; y ++) {
            dst[y + x * n] = src[x + y * n];
        }
    }
    for (int x = 0; x < ex; x ++) {
        for (int y = ey; y < n; y ++) {
            dst[y + x * n] = src[x + y * n];
        }
    }
    for (int x = ex; x < n; x ++) {
        for (int y = ey; y < n; y ++) {
            dst[y + x * n] = src[x + y * n];
        }
    }
}

Part 1

blocksize = 20, n = 100:    0.004 milliseconds, 0.023 milliseconds
blocksize = 20, n = 1000:   0.898 milliseconds, 1.816 milliseconds
blocksize = 20, n = 2000:   28.081 milliseconds, 4.874 milliseconds
blocksize = 20, n = 5000:   207.785 milliseconds, 30.871 milliseconds
blocksize = 20, n = 10000:  1103.16 milliseconds, 157.027 milliseconds

N is big.
Because there are more operations in the blocked version.

Part 2

blocksize = 50, n = 10000:      1405.15 milliseconds, 161.008 milliseconds
blocksize = 100, n = 10000:     888.638 milliseconds, 147.19 milliseconds
blocksize = 500, n = 10000:     951.018 milliseconds, 131.87 milliseconds
blocksize = 1000, n = 10000:    974.174 milliseconds, 174.938 milliseconds
blocksize = 5000, n = 10000:    938.069 milliseconds, 891.538 milliseconds

First it gets better, then it gets worse.