NIST PCA Example

# A Principal Components Analysis example
#
# walk through the NIST/Sematech PCA example data section 6.5
# http://www.itl.nist.gov/div898/handbook/
#
# This file is part of the Hume La Package
# (C)Copyright 2001, Hume Integration Services
#
# These calculations use the Hume Linear Algebra package, La
#
package require La
catch { namespace import La::*}

# Look at the bottom of this file to see the console output of this procedure.

proc nist_pca {} {
puts {Matrix of observation data X[10,3]}
set X {2 10 3 7 4 3 4 1 8 6 3 5 8 6 1 8 5 7 7 2 9 5 3 3 9 5 8 7 4 5 8 2 2}
puts [show \$X]
puts {Compute column norms}
mnorms_br X Xmeans Xsigmas
puts "Means=[show \$Xmeans]"
puts "Sigmas=[show \$Xsigmas %.6f]"
puts {Normalize observation data creating Z[10,3]}
puts [show [set Z [mnormalize \$X \$Xmeans \$Xsigmas]] %10.6f]\n
puts {Naive computation method that is sensitive to round-off error}
puts {ZZ is (Ztranspose)(Z) }
puts "ZZ=\n[show [set ZZ [mmult [transpose \$Z] \$Z]] %10.6f]"
puts {the correlation matrix, R = (Zt)(Z)/(n-1)}
puts "R=\n[show [set R [mscale \$ZZ [expr 1/9.0]]] %8.4f]"
puts {Continuing the naive example use R for the eigensolution}
set V \$R
mevsvd_br V evals
puts "eigenvalues=[show \$evals %8.4f]"
puts "eigenvectors:\n[show \$V %8.4f]\n"
puts "Good computation method- perform SVD of Z"
set U \$Z
msvd_br U S V
puts "Singular values, S=[show \$S]"
puts "square S and divide by (number of observations)-1 to get the eigenvalues"
set SS [mprod \$S \$S]
set evals [mscale \$SS [expr 1.0/9.0]]
puts "eigenvalues=[show \$evals %8.4f]"
puts "eigenvectors:\n[show \$V %8.4f]"
puts "(the 3rd vector/column has inverted signs, thats ok the math still works)"
puts "\nverify V x transpose = I"
puts "As computed\n[show [mmult \$V [transpose \$V]] %20g]\n"
puts "Rounded off\n[show [mround [mmult \$V [transpose \$V]]]]"
#puts "Diagonal matrix of eigenvalues, L"
#set L [mdiag \$evals]
#

puts "\nsquare root of eigenvalues as a diagonal matrix"
proc sqrt x { return [expr sqrt(\$x)] }
set evals_sr [mat_unary_op \$evals sqrt]
set Lhalf [mdiag \$evals_sr]
puts [show \$Lhalf %8.4f]

puts "Factor Structure Matrix S"
set S [mmult \$V \$Lhalf]
puts [show \$S %8.4f]\n
set S2 [mrange \$S 0 1]
puts "Use first two eigenvalues\n[show \$S2 %8.4f]"
# Nash discusses better ways to compute this than multiplying SSt
# the NIST method is sensitive to round-off error
set S2S2 [mmult \$S2 [transpose \$S2]]
puts "SSt for first two components, communality is the diagonal"
puts [show \$S2S2 %8.4f]\n

# define reciprocal square root function
proc recipsqrt x { return [expr 1.0/sqrt(\$x)] }
# use it for Lrhalf calculation
set Lrhalf [mdiag [mat_unary_op \$evals recipsqrt]]

set B [mmult \$V \$Lrhalf]
puts "Factor score coefficients B Matrix:\n[show \$B %12.4f]"

puts "NIST shows B with a mistake, -.18 for -1.2 (-1.18 with a typo)"

puts "\nWork with first normalized observation row"
set z1 {2 3 0 0.065621795889 0.316227766017 -0.7481995314}
puts [show \$z1 %.4f]

puts "compute the \"factor scores\" from multiplying by B"
set t [mmult [transpose \$z1] \$B]
puts [show \$t %.4f]
puts "NIST implies you might chart these"

#set T2 [dotprod \$t \$t]
#puts "Hotelling T2=[format %.4f \$T2]"

puts "Compute the scores using V, these are more familiar"
set t [mmult [transpose \$z1] \$V]
puts [show \$t %.4f]
puts "Calculate T2 from the scores, sum ti*ti/evi"

set T2 [msum [transpose [mdiv [mprod \$t \$t] [transpose \$evals]]]]
puts "Hotelling T2=[format %.4f \$T2]"

puts "Fit z1 using first two principal components"
set p [mrange \$V 0 1]
puts p=\n[show \$p %10.4f]
set t [mmult [transpose \$z1] \$p]
set zhat [mmult \$t [transpose \$p]]
puts "  z1=[show \$z1 %10.6f]"
puts "zhat=[show \$zhat %10.6f]"
set diffs [msub [transpose \$z1] \$zhat]
puts "diff=[show \$diffs %10.6f]"
puts "Q statistic - distance from the model measurement for the first observation"
set Q [dotprod \$diffs \$diffs]
puts Q=[format %.4f \$Q]
puts "Some experts would compute a corrected Q that is larger since the
observation was used in building the model.  The Q calculation just used
properly applies to the analysis of new observations."
set corr [expr 10.0/(10.0 - 2 -1)]  ;# N/(N - Ncomp - 1)
puts "Corrected Q=[format %.4f [expr \$Q * \$corr]] (factor=[format %.4f \$corr])"
}

return
##########################################################################
Console Printout
##########################################################################
Matrix of observation data X[10,3]
7 4 3
4 1 8
6 3 5
8 6 1
8 5 7
7 2 9
5 3 3
9 5 8
7 4 5
8 2 2
Compute column norms
Means=6.9 3.5 5.1
Sigmas=1.523884 1.581139 2.806738
Normalize observation data creating Z[10,3]
0.065622   0.316228  -0.748200
-1.903032  -1.581139   1.033228
-0.590596  -0.316228  -0.035629
0.721840   1.581139  -1.460771
0.721840   0.948683   0.676942
0.065622  -0.948683   1.389513
-1.246814  -0.316228  -0.748200
1.378058   0.948683   1.033228
0.065622   0.316228  -0.035629
0.721840  -0.948683  -1.104485

Naive computation method that is sensitive to round-off error
ZZ is (Ztranspose)(Z)
ZZ=
9.000000   6.017916  -0.911824
6.017916   9.000000  -2.591349
-0.911824  -2.591349   9.000000
the correlation matrix, R = (Zt)(Z)/(n-1)
R=
1.0000   0.6687  -0.1013
0.6687   1.0000  -0.2879
-0.1013  -0.2879   1.0000
Continuing the naive example use R for the eigensolution
eigenvalues=  1.7688   0.9271   0.3041
eigenvectors:
0.6420   0.3847  -0.6632
0.6864   0.0971   0.7207
-0.3417   0.9179   0.2017

Good computation method- perform SVD of Z
Singular values, S=3.98985804571 2.88854344819 1.65449373616
square S and divide by (number of observations)-1 to get the eigenvalues
eigenvalues=  1.7688   0.9271   0.3041
eigenvectors:
0.6420   0.3847   0.6632
0.6864   0.0971  -0.7207
-0.3417   0.9179  -0.2017
(the 3rd vector/column has inverted signs, thats ok the math still works)

verify V x transpose = I
As computed
1        -5.55112e-017         1.38778e-016
-5.55112e-017                    1         5.55112e-017
1.38778e-016         5.55112e-017                    1

Rounded off
1.0 0.0 0.0
0.0 1.0 0.0
0.0 0.0 1.0

square root of eigenvalues as a diagonal matrix
1.3300   0.0000   0.0000
0.0000   0.9628   0.0000
0.0000   0.0000   0.5515
Factor Structure Matrix S
0.8538   0.3704   0.3658
0.9128   0.0935  -0.3975
-0.4544   0.8838  -0.1112

Use first two eigenvalues
0.8538   0.3704
0.9128   0.0935
-0.4544   0.8838
SSt for first two components, communality is the diagonal
0.8662   0.8140  -0.0606
0.8140   0.8420  -0.3321
-0.0606  -0.3321   0.9876

Factor score coefficients B Matrix:
0.4827       0.3995       1.2026
0.5161       0.1009      -1.3069
-0.2569       0.9533      -0.3657
NIST shows B with a mistake, -.18 for -1.2 (-1.18 with a typo)

Work with first normalized observation row
0.0656 0.3162 -0.7482
compute the "factor scores" from multiplying by B
0.3871 -0.6552 -0.0608
NIST implies you might chart these
Compute the scores using V, these are more familiar
0.5148 -0.6308 -0.0335
Calculate T2 from the scores, sum ti*ti/evi
Hotelling T2=0.5828
Fit z1 using first two principal components
p=
0.6420     0.3847
0.6864     0.0971
-0.3417     0.9179
z1=  0.065622   0.316228  -0.748200
zhat=  0.087847   0.292075  -0.754958
diff= -0.022225   0.024153   0.006758
Q statistic - distance from the model measurement for the first observation
Q=0.0011
Some experts would compute a corrected Q that is larger since the
observation was used in building the model.  The Q calculation just used
properly applies to the analysis of new observations.
Corrected Q=0.0016 (factor=1.4286)