NIST PCA Example


# A Principal Components Analysis example
#
# walk through the NIST/Sematech PCA example data section 6.5
# http://www.itl.nist.gov/div898/handbook/
#
# This file is part of the Hume La Package
# (C)Copyright 2001, Hume Integration Services
#
# These calculations use the Hume Linear Algebra package, La
#
package require La
catch { namespace import La::*}


# Look at the bottom of this file to see the console output of this procedure.

proc nist_pca {} {
    puts {Matrix of observation data X[10,3]}
    set X {2 10 3 7 4 3 4 1 8 6 3 5 8 6 1 8 5 7 7 2 9 5 3 3 9 5 8 7 4 5 8 2 2}
    puts [show $X]
    puts {Compute column norms}
    mnorms_br X Xmeans Xsigmas	
    puts "Means=[show $Xmeans]"
    puts "Sigmas=[show $Xsigmas %.6f]"
    puts {Normalize observation data creating Z[10,3]}
    puts [show [set Z [mnormalize $X $Xmeans $Xsigmas]] %10.6f]\n
    puts {Naive computation method that is sensitive to round-off error}
    puts {ZZ is (Ztranspose)(Z) }
    puts "ZZ=\n[show [set ZZ [mmult [transpose $Z] $Z]] %10.6f]"
    puts {the correlation matrix, R = (Zt)(Z)/(n-1)}
    puts "R=\n[show [set R [mscale $ZZ [expr 1/9.0]]] %8.4f]"
    puts {Continuing the naive example use R for the eigensolution}
    set V $R
    mevsvd_br V evals
    puts "eigenvalues=[show $evals %8.4f]"
    puts "eigenvectors:\n[show $V %8.4f]\n"
    puts "Good computation method- perform SVD of Z"
    set U $Z
    msvd_br U S V
    puts "Singular values, S=[show $S]"
    puts "square S and divide by (number of observations)-1 to get the eigenvalues"
    set SS [mprod $S $S]
    set evals [mscale $SS [expr 1.0/9.0]]
    puts "eigenvalues=[show $evals %8.4f]"
    puts "eigenvectors:\n[show $V %8.4f]"
    puts "(the 3rd vector/column has inverted signs, thats ok the math still works)"
    puts "\nverify V x transpose = I"
    puts "As computed\n[show [mmult $V [transpose $V]] %20g]\n"
    puts "Rounded off\n[show [mround [mmult $V [transpose $V]]]]"
    #puts "Diagonal matrix of eigenvalues, L"
    #set L [mdiag $evals]
    #
    
    puts "\nsquare root of eigenvalues as a diagonal matrix"
    proc sqrt x { return [expr sqrt($x)] }
    set evals_sr [mat_unary_op $evals sqrt]
    set Lhalf [mdiag $evals_sr]
    puts [show $Lhalf %8.4f]

    puts "Factor Structure Matrix S"
    set S [mmult $V $Lhalf]
    puts [show $S %8.4f]\n
    set S2 [mrange $S 0 1]
    puts "Use first two eigenvalues\n[show $S2 %8.4f]"
    # Nash discusses better ways to compute this than multiplying SSt
    # the NIST method is sensitive to round-off error
    set S2S2 [mmult $S2 [transpose $S2]]
    puts "SSt for first two components, communality is the diagonal"
    puts [show $S2S2 %8.4f]\n


    # define reciprocal square root function
    proc recipsqrt x { return [expr 1.0/sqrt($x)] }
    # use it for Lrhalf calculation
    set Lrhalf [mdiag [mat_unary_op $evals recipsqrt]]

    set B [mmult $V $Lrhalf]
    puts "Factor score coefficients B Matrix:\n[show $B %12.4f]"

    puts "NIST shows B with a mistake, -.18 for -1.2 (-1.18 with a typo)"

    puts "\nWork with first normalized observation row"
    set z1 {2 3 0 0.065621795889 0.316227766017 -0.7481995314}
    puts [show $z1 %.4f]

    puts "compute the \"factor scores\" from multiplying by B"
    set t [mmult [transpose $z1] $B]
    puts [show $t %.4f]
    puts "NIST implies you might chart these"

    #set T2 [dotprod $t $t]
    #puts "Hotelling T2=[format %.4f $T2]"
    
    puts "Compute the scores using V, these are more familiar"
    set t [mmult [transpose $z1] $V]
    puts [show $t %.4f]
    puts "Calculate T2 from the scores, sum ti*ti/evi"

    set T2 [msum [transpose [mdiv [mprod $t $t] [transpose $evals]]]]
    puts "Hotelling T2=[format %.4f $T2]"
 
    puts "Fit z1 using first two principal components"
    set p [mrange $V 0 1]
    puts p=\n[show $p %10.4f]
    set t [mmult [transpose $z1] $p]
    set zhat [mmult $t [transpose $p]]
    puts "  z1=[show $z1 %10.6f]"
    puts "zhat=[show $zhat %10.6f]"
    set diffs [msub [transpose $z1] $zhat]
    puts "diff=[show $diffs %10.6f]"
    puts "Q statistic - distance from the model measurement for the first observation"
    set Q [dotprod $diffs $diffs]
    puts Q=[format %.4f $Q]
    puts "Some experts would compute a corrected Q that is larger since the 
observation was used in building the model.  The Q calculation just used
properly applies to the analysis of new observations."
    set corr [expr 10.0/(10.0 - 2 -1)]  ;# N/(N - Ncomp - 1)
    puts "Corrected Q=[format %.4f [expr $Q * $corr]] (factor=[format %.4f $corr])"
    }

return
##########################################################################
Console Printout
##########################################################################
Matrix of observation data X[10,3]
7 4 3
4 1 8
6 3 5
8 6 1
8 5 7
7 2 9
5 3 3
9 5 8
7 4 5
8 2 2
Compute column norms
Means=6.9 3.5 5.1
Sigmas=1.523884 1.581139 2.806738
Normalize observation data creating Z[10,3]
  0.065622   0.316228  -0.748200
 -1.903032  -1.581139   1.033228
 -0.590596  -0.316228  -0.035629
  0.721840   1.581139  -1.460771
  0.721840   0.948683   0.676942
  0.065622  -0.948683   1.389513
 -1.246814  -0.316228  -0.748200
  1.378058   0.948683   1.033228
  0.065622   0.316228  -0.035629
  0.721840  -0.948683  -1.104485

Naive computation method that is sensitive to round-off error
ZZ is (Ztranspose)(Z) 
ZZ=
  9.000000   6.017916  -0.911824
  6.017916   9.000000  -2.591349
 -0.911824  -2.591349   9.000000
the correlation matrix, R = (Zt)(Z)/(n-1)
R=
  1.0000   0.6687  -0.1013
  0.6687   1.0000  -0.2879
 -0.1013  -0.2879   1.0000
Continuing the naive example use R for the eigensolution
eigenvalues=  1.7688   0.9271   0.3041
eigenvectors:
  0.6420   0.3847  -0.6632
  0.6864   0.0971   0.7207
 -0.3417   0.9179   0.2017

Good computation method- perform SVD of Z
Singular values, S=3.98985804571 2.88854344819 1.65449373616
square S and divide by (number of observations)-1 to get the eigenvalues
eigenvalues=  1.7688   0.9271   0.3041
eigenvectors:
  0.6420   0.3847   0.6632
  0.6864   0.0971  -0.7207
 -0.3417   0.9179  -0.2017
(the 3rd vector/column has inverted signs, thats ok the math still works)

verify V x transpose = I
As computed
                   1        -5.55112e-017         1.38778e-016
       -5.55112e-017                    1         5.55112e-017
        1.38778e-016         5.55112e-017                    1

Rounded off
1.0 0.0 0.0
0.0 1.0 0.0
0.0 0.0 1.0

square root of eigenvalues as a diagonal matrix
  1.3300   0.0000   0.0000
  0.0000   0.9628   0.0000
  0.0000   0.0000   0.5515
Factor Structure Matrix S
  0.8538   0.3704   0.3658
  0.9128   0.0935  -0.3975
 -0.4544   0.8838  -0.1112

Use first two eigenvalues
  0.8538   0.3704
  0.9128   0.0935
 -0.4544   0.8838
SSt for first two components, communality is the diagonal
  0.8662   0.8140  -0.0606
  0.8140   0.8420  -0.3321
 -0.0606  -0.3321   0.9876

Factor score coefficients B Matrix:
      0.4827       0.3995       1.2026
      0.5161       0.1009      -1.3069
     -0.2569       0.9533      -0.3657
NIST shows B with a mistake, -.18 for -1.2 (-1.18 with a typo)

Work with first normalized observation row
0.0656 0.3162 -0.7482
compute the "factor scores" from multiplying by B
0.3871 -0.6552 -0.0608
NIST implies you might chart these
Compute the scores using V, these are more familiar
0.5148 -0.6308 -0.0335
Calculate T2 from the scores, sum ti*ti/evi
Hotelling T2=0.5828
Fit z1 using first two principal components
p=
    0.6420     0.3847
    0.6864     0.0971
   -0.3417     0.9179
  z1=  0.065622   0.316228  -0.748200
zhat=  0.087847   0.292075  -0.754958
diff= -0.022225   0.024153   0.006758
Q statistic - distance from the model measurement for the first observation
Q=0.0011
Some experts would compute a corrected Q that is larger since the 
observation was used in building the model.  The Q calculation just used
properly applies to the analysis of new observations.
Corrected Q=0.0016 (factor=1.4286)