Recent Results
Japanese Version
Return to Institute of Statistical Mathematics
Return to English page
Return to Japanese page
Algebraic structure of non-identifiable learning machines :
Questions and Comments are welcome.
E-mail: swatanab@pi.titech.ac.jp
(1) A Problem:
Let us consider a learning machine represented by a parametric probability
distribution p(x,w), where x is an input and w is a parameter.
Here p(x,w) estimates the true distribution q(x),
using a set of empirical samples {xi;i=1,2,3,...,n }
independetly taken from q(x).
Let p( w | {xi} ) be the Bayesian a posteriori distribution of
the parameter w under the condititon that samples {xi} are given.
The estimated probability density p( x | {xi}) is defined by
p( x | {xi} ) = \int p( x | w) p ( w | {xi} ) dw.
Our main purpose is to clarify
how fast the estimated learner p( x | {xi} ) attains the true distribution q(x).
We define the average Kullback distance or the average relative entropy,
K(n) = E { \int q(x) \log [ q(x) / p(x | {xi} ) ] dx },
where E represents the expectation value over all sets of empirical samples.
We want to know the asymptotic expansion of $K(n)$ for the case n tends to infinity.
Remarks:
(1) Note that K(n)=0 if and only if p(x | {xi} ) = q(x).
(2) In statistical learning theory, K(n) is often called `Learning Curve'
or `Generalization Error'.
(3) I heard that physists often omit E by chanting a spell "Self-Averaging !".
(2) Stochastic Complexity and Kullback Distance :
The stochastic complexity or the Free energy of a learning machine is
defined by
F(n) = - E { log \int exp( - H_{n}(w) ) r(w)dw },
where H_{n}(w) is the empirical Kullback information or the
random Hamiltonian,
H_{n}(w)=\sum_{i=1}^{n} \log [ q(xi) / p(xi,w) ].
The stochastic complexity F(n) plays a central
role in information theory, Bayesian statistics, statistical physics,
and statistical learning theory. For example,
Amari and Murata [1] showed that the average Kullback distance
is equal to the increase of the stochastic complexity,
K(n)=F(n+1)-F(n),
and proved that, if the learning model p(x,w) satisfies
the regularity condition for the asymptotic normality of
the maximum likelihood estimator, then
Regular Models: K(n)=d/2n + o(1/n),
where d is the number of parameters and n is the number of training samples.
However, it has been left unclear how fast K(n) goes to zero for the
case when p(x,w) is not a regular model. This problem was left unknown
even in mathematical statistics and mathematical physics.
Dr. Hagiwara and Dr. Fukumizu proved that
a lot of complex learning machines are not regular models, in general.
For example, multi-layer perceptrons,
radial basis functions, and mixtuies of normal distributions
do not satisfy the regularity condition for the asymptotic normality of the
maximum likelihood estimator. The sets of true paramters
in such models are analytic sets or anlgebraic varieties with singularities.
It should be emphasized that H_{n}(w) can not be approximated by any quadratic form.
(3) Our recent results:
Assume that p(x,w) is analytic for w.
We obtained the complete asymptotic form of F(n)
F(n)= A log n - (B-1) log log n + const.,
where A is a positive rational number and B is a natural number.
(A is not larger than d/2, and B is not larger than d,
where d is the number of parameters.)
Our result claims that, if K(n) can be asymptotically expanded, then
K(n)=A/n - (B-1)/ ( n log n ) .
Let H(w) be the Kullback information for a parameter w,
H(w)= \int q(x) log [ q(x)/ p(x,w)] dx.
In order to prove the asymptotic expansion of F(n), we need
the Sato-Bernstein's b-function [M2].
The b-function is the polynomial b(z) that satisfies the relation
P(z,w,dw)H(w)^{z+1}=b(z)H(w)^{z},
where P(z,w,dw) is a finite order differential operator for w with
coefficients of holomorphic functions and a polynoimial for z.
The existence of b-function was proved by Sato, Bernstein, Bjork, and
Kashiwara. An algorithm to calculate b-function was developed by Oaku [M3].
The constants -A and B are important in real world computing.
They are characterized as the maximum pole and
its multiplicity of the complex function
J(z) = \int H(w)^{z} r(w)dw.
The algorithm to calculte A and B is also established
by employing resolution of singularities in algebraic geometry,
the famous theorem by Hironaka [M1]. In fact, based on the resolution theorem,
we can find a d-dimensional real manifold U and
a resolution map g: U -> W, such that locally near each point in U,
H(g(u))=a(u)u_{1}^{k1}u_{2}^{k2} *** u_{d}^{kd},
where a(u) is an invertible function, and k1,k2,...,kd are non-negative
even integers. The resolution of singularites enables us to
calculate A and B. The resolution map g(u) can be found by using the
finite times blowing ups of the non-singular manifolds in
singularities. I heard that Hironaka's theorem not only
proves the existence of the desingularization map but also
gives an algorithm to compute it constructively.
Also I heard that we can find the Grobner basis firstly in his proof,
which is referred to as Hironaka's standard basis.
Here, we consider a learning machine with high complexity,
and its learning can be clarified based on some results
in modern mathematics.
We expect that this result is the mathematical foundation
in the research of computational intelligence.
(4) Problem for the future:
Our result shows that the Bayesian estimation case is
clarified essentially. However,
the maximum likelihood case is not yet clarified.
Dr. Hagiwara in Mie University proved that the average expectation value
is larger than (log n)/n, if the input distrubution is given as the
finite sum of delta functions [3]. Dr. Fukumizu in Institute of Mathematical
Statistics proved that the average
expectation value of the reduced rank approximation
is characterized as the average of the eigen values of random matrices
subject to Wishart distribution [2]. The maximum likelihood case for
the general complex learning machines is the important problem for the future.
Related Papers:
[1] Amari, S. and Murata, N., (1993)
Statistical Theory of learning curves under
entropic loss criterion. Neural Computation, 5, 140-153.
[2] Fukumizu, K., (1999) Generalization error of linear neural networks
in unidentifiable cases. Lecture notes on computer sciences, 1720, 51-62.
[3] Hagiwara, K. Kuno, K, and Usui, S., (1998) Degenerate Fisher information
matrix and average expectation error of neural networks,
Symposium on statistical estimation and information theory, 95-102.
[4] Watanabe, S., (1998) On the generalization error of layered statistical
models with Bayesian estimation. IEICE Trans. J-81-A, 1442-1552. The English
version is to appear in Electronics and Communications in Japan, John Wiley
and Sons.
[5] Watanabe, S., (1999) Algebraic analysis for singular statistical
estimation. Lecture Notes in Computer Sciences, Vol.1720, pp.39-50.
[6] Watanabe,S., (2000) Algebraic analysis for non-regular learning machines.
Neural Information Processing Systems. Vol.12, pp.356-362, MIT Press.
Article, Postscript file, gzipped
Correction of mistake, Postscript file, gzipped
[7] Watanabe, S., (2000) Algebraic analysis for non-idetifiable
learning machines. to appear in Neural Computation.
Article, Postscript file, gzipped
[8] Watanabe, S.,(2001) Algebraic Information Geometry for
Learning Machines with Singularities. to appear in Advances
in Neural Information Processing Systems.
Article, Postscript file, gzipped
Related Pure Mathematics:
[M1] Hironaka, H., (1964) Resolution of singularities of an algebraic
variety over a field of characteristic zero. Ann. of Math. 79, 109-326.
[M2] Kashiwara, M., (1976) B-functions and holonomic systems. Inventions
Math., 38, 33-53.
[M3] Oaku, T., (1997) Algorithms for the b-function and D-modules
associated with a polynomial. J. Pure and Appl. Alg., 117, 495-518.