Theoretical Explanation of Deep Neural Network (DNN) and Analysis of its Nonlinear Dynamics
In this paper, Deep Neural Network (DNN) is considered as a discrete-time dynamical system due to its layered structure. The complexity provided by the nonlinearity in the DNN dynamics is analyzed in terms of topological entropy and chaos characterized by Lyapunov exponents.
by Husheng Li
Published on Aug 31, 2024
Abstract
The theoretical explanation for the great success of deep neural networks (DNNs) is still an open problem. In this paper, DNN is considered as a discrete-time dynamical system due to its layered structure. The complexity provided by the nonlinearity in the DNN dynamics is analyzed in terms of topological entropy and chaos characterized by Lyapunov exponents. The properties revealed for the dynamics of DNN are applied to analyze the corresponding capabilities of classification and generalization. In particular, for both the hyperbolic tangent function and rectified linear units (ReLU), the Lyapunov exponents are positive given proper DNN parameters, which implies chaotic behavior of the dynamics. Moreover, the Vapnik-Chervonenkis (VC) dimension of DNN is also analyzed, based on the layered and recursive structure. The conclusions from the viewpoint of dynamical systems are expected to open a new dimension for the understanding of DNN.
Introduction
Deep neural network (DNN, a.k.a. deep learning) has been the most prominent development in information science over the last decade [1]. It has brought a paradigm shift to artificial intelligence and many cross-disciplinary areas. Despite its great success in applications, the theoretical explanation and mathematical framework of DNN are still open problems. There have been substantial efforts on the mathematical analysis of DNN, such as using the wavelet framework to understand DNN [2], applying the framework of function approximation [3], enumerating the linear regions after the nonlinear mapping of DNN [4], analyzing the shape deformation in the transient chaos of DNN [5], and information theoretic analysis [6], to name a few. Although these excellent studies have made significant progress toward a deeper understanding of DNN, they are still insufficient to fully describe the behavior, quantitatively analyze the performance, and thus systematically design DNNs.
In this paper, we propose a systematic framework to analyze DNN by considering it as a dynamical system. A prominent feature of DNN is the layered structure, usually having many hidden layers (e.g., hundreds of layers). Each layer is a mapping having similar structures, namely a linear mapping followed by a nonlinear operation. One can consider each layer as a nonlinear transformation for the input vector. Therefore, the DNN with the multi-layer structure can be considered as a discrete-time nonlinear dynamical system, where the initial value is the input vector to the DNN. The transformation is specified by the DNN parameters, and the final value of the transformations can be easily handled using simple classifiers such as linear ones.
The view of DNN as a dynamical system brings a novel approach for the theoretical study, due to the powerful framework of dynamical systems developed over the past 100 years. On one hand, a DNN should be sufficiently complex to classify different objects (e.g., images). The topological entropy, developed in the study of dynamical systems in the 1960s [7], can be used to characterize the complexity of dynamics. On the other hand, for complex objects, a DNN may need to distinguish two feature vectors that are very close to each other, pulling them apart for the final linear classification. This implies that the behavior of DNN near these points needs to be chaotic; i.e., the distance between the two points becomes significant after the nonlinear mappings even though they are initially very close to each other. This can also be understood from the viewpoint of the classification surface. For the original inputs of different classes, the classification surface could be very complicated; however, the DNN can deform the complicated surface to a simple hyperplane. Therefore, the inverse transformation of DNN (e.g., linear transformation plus tanh⁻¹) should be able to warp a hyperplane to a very complex surface, which is illustrated in Fig. 1. The nonlinear warping requires chaotic behaviors characterized by Lyapunov exponents.
The analysis of dynamical behaviors in DNN can be applied to analyze various properties of DNN. On one hand, the complexity incurred by the nonlinearity in the dynamics of DNN can be used to quantify the classification capability of DNN, and predict the required number of layers. On the other hand, if the nonlinear dynamics provide too much complexity, the Vapnik-Chervonenkis (VC) dimension could be high and thus affect the generalization capability. Meanwhile, the layered structure of DNN also provides a layer-by-layer analysis for the VC-dimension. These applications will be discussed in this paper.
The remainder of this paper is organized as follows. The system model of DNN is introduced in Section 2. The topological entropy and the chaotic behavior of DNN, characterized by the Lyapunov exponent, are analyzed in Section 3. The applications of the nonlinear dynamics behavior to the capabilities of classification and generalization of DNN are discussed in Section 4. Numerical results are presented in Section 5. The conclusions are provided in Section 6.
System Model
In this section, we introduce the system model of DNN. We first explain the structure of DNN. Then, we model a DNN as a random dynamical system for facilitating the analysis.
Structure of DNN
For simplicity of analysis, we consider the problem of binary classification, which outputs a binary label in {−1, 1} given a d-dimensional input feature vector x ∈ R^d. We assume that the feature selection algorithm has been fixed and is beyond the scope of this paper. We further assume that all the features lie within the compact set [−1,1]^d.
We assume that there are D hidden layers in the DNN. The weights of layer i are denoted by Wi, which is a d×d matrix. Note that we assume that each Wi is a square matrix and all the vectors {xi}i=1,...,D have the same dimension. This is to simplify the analysis and can be extended to more generic cases. The input to layer i is denoted by xi−1, i=1,...,D. At layer i, the output xi (also the input of layer i+1) is given by
h_i = W_i x_{i−1} + b_i,   x_i = ϕ(h_i),   (1)
where ϕ is the elementwise nonlinear mapping from the intermediate result hi to the input xi for the next layer. In this paper, we consider the following two cases:
Hyperbolic tangent tanh: This operation is used in traditional neural networks. It confines the intermediate vectors x_i within [−1,1]^d. The advantages of tanh are its smoothness, which facilitates the analysis, and the compactness of the resulting state space.
Rectified linear units (ReLU) x = max{h, 0}: ReLU has been demonstrated to outperform the hyperbolic tangent tanh in recent progress on DNNs. The challenge of ReLU is the loss of differentiability at h = 0. We can treat the derivative as a piecewise continuous function:
dϕ(h)/dh = 0 for h < 0, and 1 for h > 0,   (2)
where the derivative is not defined at h = 0.
The d-dimensional output of the last hidden layer is fed into a linear classifier with coefficients w and offset b. The final decision is given by
decision = sign(w^T x_D + b).   (3)
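To make the layered mapping concrete, the forward pass in (1) and the final decision (3) can be written in a few lines of code. The following is a minimal sketch with arbitrarily chosen dimensions and randomly drawn parameters, not the setting of any experiment reported in this paper:

```python
import numpy as np

def forward(x, Ws, bs, phi=np.tanh):
    """Propagate the input through the D hidden layers in (1); return all layer outputs."""
    xs = [x]
    for W, b in zip(Ws, bs):
        h = W @ xs[-1] + b          # h_i = W_i x_{i-1} + b_i
        xs.append(phi(h))           # x_i = phi(h_i)
    return xs

# Hypothetical example: d = 4 features, D = 3 layers, tanh activation.
rng = np.random.default_rng(0)
d, D = 4, 3
Ws = [rng.normal(0.0, 1.0, (d, d)) for _ in range(D)]
bs = [np.zeros(d) for _ in range(D)]
w, b = rng.normal(0.0, 1.0, d), 0.0

x0 = rng.uniform(-1.0, 1.0, d)
xD = forward(x0, Ws, bs)[-1]        # output of the last hidden layer
print(np.sign(w @ xD + b))          # final affine decision (3)
```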
The overall structure of the DNN studied in this paper is summarized in Fig. 2. Note that we do not consider more specific structures for the DNN in this paper. For example, additional structure is embedded in convolutional neural networks (CNNs), recurrent neural networks (RNNs), and so on, which improves the performance over the generic structure of DNNs. It will be our future work to incorporate these special structures into the dynamical systems based study.
Random Dynamical System Modeling
As justified in the introduction, we can consider a DNN as a dynamical system. However, in traditional dynamical systems, the mapping of the system state, denoted by T, is identical throughout the dynamics, while the mappings in different layers of DNN are typically different. Therefore, we use the random dynamical system [1] to model the mappings of different layers in DNN. A random dynamical system consists of a series of possibly distinct transformations T1,...,Tn,..., namely
x_i = T_i(x_{i−1}).   (4)
In the context of DNN, Ti is determined by (Wi,bi,ϕ). A rigorous definition of random dynamical system is given in Appendix 7.
Topological Entropy and Chaos
In this section, we discuss the topological entropy and chaos, characterized by the Lyapunov exponent, of DNN. We first define these quantities in dynamical systems. Then, we analyze the chaotic behaviors in DNN.
Topological Entropy and Lyapunov Exponent
Topological Entropy
We first define the topological entropy in a space X equipped with metric m(⋅,⋅). Consider a mapping T: X → X. For time n ∈ N, a metric m_n between two points in X with respect to T is defined as

m_n(x, y) = max{m(T^i x, T^i y), i = 0, ..., n−1}.   (5)
We define the n-th order ball of point x, denoted by B_n(x, ϵ), as the ball around x with radius ϵ under the metric m_n in (5). A set of points in X, denoted by S, is said to be an (n, ϵ)-spanning orbit set if, for any x ∈ X, we can always find a point y ∈ S such that m_n(x, y) ≤ ϵ. Then we define
r(T, n, ϵ) = min{ |S| : S is an (n, ϵ)-spanning set }.   (6)
Then we define the spanning orbits based topological entropy as follows, where the subscript s means ‘spanning’.
Definition 1. The spanning orbits based topological entropy is defined as h_s(T) = lim_{ϵ→0} h_s(T, ϵ), where h_s(T, ϵ) = lim_{n→∞} (1/n) H_s(T, n, ϵ), and H_s(T, n, ϵ) = log₂ r(T, n, ϵ).
A detailed introduction to alternative definitions of topological entropy can be found in Appendix 8.
Note that the above definition is for traditional dynamical systems. In random dynamical systems, the topological entropy can be defined in the same manner as in Definition 1, simply by replacing the iterate T^i with the composition T_i ∘ T_{i−1} ∘ ... ∘ T_1.
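To illustrate the definition, the spanning-set count r(T, n, ϵ) in (6) can be estimated by brute force for a simple one-dimensional map: cover the state space with a fine grid of candidate initial points and greedily extract an (n, ϵ)-spanning subset. The map, grid resolution, and greedy covering below are our own illustrative choices and only give an upper bound on the minimum in (6):

```python
import numpy as np

def orbit(T, x, n):
    """Return the finite orbit (x, T(x), ..., T^{n-1}(x))."""
    xs = [x]
    for _ in range(n - 1):
        xs.append(T(xs[-1]))
    return np.array(xs)

def spanning_size(T, n, eps, grid):
    """Greedy upper bound on r(T, n, eps), with candidate points restricted to `grid`."""
    orbits = np.array([orbit(T, x, n) for x in grid])
    uncovered = np.ones(len(grid), dtype=bool)
    count = 0
    while uncovered.any():
        i = int(np.argmax(uncovered))                     # pick a still-uncovered point
        dn = np.max(np.abs(orbits - orbits[i]), axis=1)   # metric m_n over the first n iterates
        uncovered &= dn > eps                             # everything within eps is now covered
        count += 1
    return count

# Illustration: the tent map on [0, 1], whose topological entropy is log2(2) = 1 bit.
tent = lambda x: 2.0 * min(x, 1.0 - x)
grid = np.linspace(0.0, 1.0, 4001)
r4 = spanning_size(tent, 4, 0.05, grid)
r8 = spanning_size(tent, 8, 0.05, grid)
print((np.log2(r8) - np.log2(r4)) / 4)   # growth rate in bits per step, close to 1
```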
Ensemble Topological Entropy
The conventional topological entropy measures the number of distinct paths (up to diminishing errors) in the dynamical system. In this paper we extend it to the topological entropy of an ensemble of dynamics. For an integer L > 0, we define a dL-dimensional hyper-vector z = (x_1, ..., x_L), where each x_i is a d-vector. Then, for the sequence of transformations {T_t}, t = 1, 2, ..., that map from R^d to R^d, we define the transformation T̃_t^L: R^{dL} → R^{dL} as

T̃_t^L(z) = (T_t(x_1), ..., T_t(x_L)).   (7)
Considering M sets of transformations {T_{mt}}, m = 1, ..., M, t = 1, ..., n, we define r_e({T_{mt}}_{m,t}, n, ϵ, z) as the number of distinct paths (within distance ϵ in (5)) beginning from z under the M composite transformations {T̃_{mt}^L}, m = 1, ..., M, t = 1, ..., n. We define H_e({T_{mt}}_{m,t}, n, ϵ, z) = (1/n) log₂ r_e({T_{mt}}_{m,t}, n, ϵ, z). We can further define h_e({T_{mt}}_{m,t}, z) by letting n → ∞ and ϵ → 0, similarly to the topological entropy h_s. As will be seen, the concept of ensemble topological entropy will be used in the calculation of the VC-dimension of DNNs. Unfortunately, we are still unable to evaluate the ensemble topological entropy analytically, which will be our future study.
Lyapunov Exponent
As metrics characterizing the chaotic behavior of dynamical systems, the Lyapunov spectrum [8] quantifies how fast two nearby points depart from each other as the system state evolves with time, as illustrated in Fig. 3. Considering an initial point x0, we define
J_t(x_0) = J_x(T_t ∘ T_{t−1} ∘ ... ∘ T_1(x)) |_{x = x_0},   (8)

where J is the Jacobian matrix defined as J_x(y) = (∂y_i / ∂x_j)_{ij}. Obviously J_t(x_0) measures the sensitivity to the perturbation on the initial value x_0. Then, we define

L(x_0) = lim_{t→∞} (J_t(x_0) J_t^T(x_0))^{1/t},   (9)
whose convergence is guaranteed by the Multiplicative Ergodic Theorem proved by V. I. Oseledec [9]. Then, we define the i-th largest Lyapunov exponent λi as
λ_i(x_0) = log₂ Λ_i(x_0),   (10)
where Λ_i(x_0) is the i-th largest eigenvalue (obviously nonnegative) of L(x_0). The set of exponents {λ_1, ..., λ_n}, sorted in descending order, is called the Lyapunov spectrum of the random dynamical system {T_1, T_2, ..., T_i, ...} [10]. Note that, in the generic case, the Lyapunov spectrum depends on the initial value x_0. For ergodic dynamics, the Lyapunov spectrum is identical for all initial values, since each point in the state space will be visited sooner or later.
For the random dynamics in DNN, we have the following lemma characterizing the Lyapunov spectrum, which can be easily shown by calculus given the structure of DNN:
Lemma 1. For the random dynamics in DNN, we have
J_t(x_0) = ∏_{i=1}^{t} (S_i(x_i) W_i),   (11)
where when ϕ is the hyperbolic tangent tanh we have
S_i(x_i) = diag(1 − tanh²(x_{ij}))_{j=1,...,n},   (12)
and when ϕ is the ReLU function we have
S_i(x_i) = diag(sign(x_{ij}))_{j=1,...,n},   (13)
where sign(x) equals 1 if x>0 and 0 if x<0.
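Lemma 1 suggests a direct numerical recipe for the Lyapunov spectrum: accumulate the factors S_i(x_i)W_i of the Jacobian product (11), stabilizing the computation with repeated QR decompositions (the standard Benettin-type procedure). The sketch below is for the tanh case; the dimension, depth, and weight scale are arbitrary illustrative choices:

```python
import numpy as np

def lyapunov_spectrum(Ws, x0):
    """Lyapunov spectrum of x_i = tanh(W_i x_{i-1}) via QR iteration on the product (11)."""
    d = x0.size
    x, Q = x0, np.eye(d)
    exps = np.zeros(d)
    for W in Ws:
        x = np.tanh(W @ x)
        S = np.diag(1.0 - x ** 2)          # derivative of tanh at h_i, i.e. 1 - x_i^2
        Q, R = np.linalg.qr(S @ W @ Q)     # accumulate one Jacobian factor S_i W_i
        exps += np.log2(np.abs(np.diag(R)))
    return np.sort(exps / len(Ws))[::-1]   # lambda_1 >= lambda_2 >= ... (bits per layer)

# Hypothetical example: 20-dimensional state, 300 random layers.
rng = np.random.default_rng(0)
d, D = 20, 300
Ws = [rng.normal(0.0, 0.4, (d, d)) for _ in range(D)]
print(lyapunov_spectrum(Ws, rng.uniform(-1.0, 1.0, d))[:3])
```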
Relationship
The topological entropy can be derived from the Lyapunov spectrum through the following theorem (as a corollary of Theorem A in [11]):
Theorem 1 ([11]). Suppose that the random dynamics with transformations T1, T2, ..., are ergodic. Then, we have
h_s(T_1, T_2, ...) = ∑_{i=1}^{n} max{λ_i, 0},   (14)
where λi is identical for all initial points due to the ergodicity.
Remark 1. For non-ergodic dynamics, we also have similar conclusions [11]. We can calculate the topological entropy by evaluating the Lyapunov spectrum. In the following discussion, we will study the Lyapunov spectrum at different points in the state space, and thus obtain the conclusions on the topological entropy of DNNs.
Local Chaos with Deterministic Mapping
The Lyapunov spectrum could be dependent on the initial value if the dynamics are non-ergodic. Hence, the system could be chaotic at certain points, while being non-chaotic in the remainder of the state space. The following theorems show that the maximum Lyapunov exponent of the DNN dynamics could be arbitrarily large. This implies that a DNN, as a dynamical system, could be highly chaotic at certain regions, thus being able to form highly complicated classification surfaces.
Hyperbolic Tangent Function
The following theorem shows the existence of chaotic behavior when the DNN uses the hyperbolic tangent function. The proof is given in Appendix 9.
Theorem 2. Consider the hyperbolic tangent function ϕ=tanh. There exist parameters of DNN and inputs (namely the initial vector of the dynamical system) such that the maximum Lyapunov exponent λ1 of the corresponding random dynamics is arbitrarily large.
ReLU Function
For the ReLU function case, the challenge is the unboundedness of the state space (although many practical algorithms add constraints on the parameters, such as normalization). It has been shown that an n-dimensional linear dynamics x(t+1) = Ax(t) has a topological entropy of ∑_{i=1}^{n} log₂ max(|λ_i(A)|, 1), where λ_i(A) is the i-th eigenvalue of the matrix A [12]. Therefore, when any eigenvalue of A has a norm greater than 1, the topological entropy is positive. However, this does not imply that the dynamics are complex, since they are still linear. The positive topological entropy simply stems from the growing norm of the system state in the unbounded state space. Therefore, to characterize the essential complexity and chaotic behavior of the DNN dynamics subject to the ReLU function, we need to consider a compact space instead of the original unbounded state space.
To this end, we consider the angle of the system states in the state space, and omit the norm of the system states. This is reasonable, since the linear separability of the system state samples at the DNN output is determined by the angles, instead of the norms. The following theorem shows the existence of chaotic behavior in the angles of the system states for the 2-dimensional case. The proof is given in Appendix 10. Note that the 2-dimensional case can be easily extended to more generic cases.
Theorem 3. Consider a 2-dimensional DNN with the ReLU function, where the system state is x(t) = (x_1(t), x_2(t)). The angle of the system state is denoted by θ(t) = arctan(x_2(t)/x_1(t)). Then, for any number C > 0, one can always find a set of parameters of the DNN and a time T such that

‖∇_{x(0)} ρ(x(T))‖ > C.   (15)
Random Mapping
In the above analysis, the linear mappings are designed specifically to result in chaos. It demonstrates only that the nonlinearity brought by the operation tanh may result in chaos. However, the dynamics in real DNNs may not be the same as those in Theorem 2. Therefore, we study the ensemble of elementwise Gaussian distributed matrices in this subsection. Note that it is reasonable to assume that the elements in the matrices {W_i}, i = 1, ..., D, are normally distributed [13][14]. Therefore, the Gaussian ensemble of linear mappings is a good approximation to real DNNs. For simplicity, we analyze the hyperbolic tangent function tanh. The analysis for the ReLU function is more difficult due to the challenge of analyzing the highly nonlinear angle dynamics. It will be our future research to extend the conclusions on tanh to ReLU functions.
Due to the difficulty of mathematical analysis, we first carry out numerical calculations of the maximum Lyapunov exponent λ_1 for the random mapping. In the numerical calculation, we assume ϕ = tanh and that the elements of W_t are i.i.d. Gaussian random variables with zero expectation and variance σ². We set b = 0. The corresponding Lyapunov exponents, as a function of the variance σ² and dimension d, are shown in Fig. 20. The contour graph is given in Fig. 5. We observe that, fixing the variance σ², the higher the dimension d, the more likely the Lyapunov exponent is to be positive. Fixing the dimension d, a too small or too large σ² may make the Lyapunov exponent negative. In particular, when d = 1, namely when the dynamics are scalar, the maximum Lyapunov exponent is always negative, which implies that there is no chaotic behavior in the scalar dynamics. Since the input dimension d is usually very high in real DNN applications, these results imply that chaotic behaviors do exist in practical DNNs.
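The numerical experiment described above can be reproduced in spirit with a short Monte Carlo script; the layer count, dimensions, and variances below are our own illustrative choices rather than the exact grid used for the figures:

```python
import numpy as np

def max_lyapunov_tanh(d, sigma2, layers=500, seed=0):
    """Estimate lambda_1 for x_t = tanh(W_t x_{t-1}) with i.i.d. N(0, sigma2) weight entries."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, d)
    u = rng.normal(size=d)
    u /= np.linalg.norm(u)
    log2_growth = 0.0
    for _ in range(layers):
        W = rng.normal(0.0, np.sqrt(sigma2), (d, d))
        x = np.tanh(W @ x)
        u = (1.0 - x ** 2) * (W @ u)       # tangent update u_t = S_t W_t u_{t-1}, cf. (16)
        log2_growth += np.log2(np.linalg.norm(u))
        u /= np.linalg.norm(u)             # renormalize to avoid under/overflow
    return log2_growth / layers

# Sweep dimension and variance; compare the sign of the estimates across (d, sigma^2).
for d in (1, 5, 50):
    for sigma2 in (0.1, 1.0, 10.0):
        print(d, sigma2, round(max_lyapunov_tanh(d, sigma2), 3))
```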
The analysis on the Lyapunov spectrum of the random dynamics incurred by the Gaussian ensemble of mappings is difficult. The challenge is that there are two series of dynamics in the DNN, namely that of the tangent vector ut:
u_t = S_t(x_t) W_t u_{t−1},   (16)
and that of the system state xt:
x_t = ϕ(W_t x_{t−1} + b_t).   (17)
In Theorem 2, we design the mappings such that the dynamics of u_t and x_t are decoupled. However, in the case of Gaussian random mappings, the two dynamics are coupled, since the matrix S_t is determined by x_t. In order to obtain an analytic result, we apply the approach of mean field analysis in statistical mechanics by fixing the impact of x_t on u_t; i.e., we consider the average behaviors of the random variables. In more detail, we approximate the dynamics of the tangent vector u_t in (16) using
β_t = E[‖E[S_t(x_t)] W_t u_{t−1}‖²] / E[‖u_{t−1}‖²],   (18)

where the numerator is the expected squared norm of the tangent vector at time t, with S_t replaced by its expectation, and the denominator is the expected squared norm of the tangent vector at time t−1. The quantity β_t approximately characterizes, in a deterministic manner, the growth rate of ‖u_t‖. In particular, taking the expectation of S_t simplifies the analysis. Then, we make the following assumption on the mean field dynamics:
Assumption 1. We make the following assumptions for the dynamics of DNN with random linear mappings; their justification is given in Appendix 11.
When βt>1 for all sufficiently large t, the norm of ut is expected to increase exponentially, thus resulting in chaotic behavior.
When t and n are sufficiently large, we have ∑_{i=1}^{n} x_{ti}² = C, where C is a time-invariant constant.
Given Assumption 1, we have the following theorem providing a condition for chaotic behavior with random linear mappings. The proof is given in Appendix 11.
Theorem 4. Given Assumption 1, we have λ1>1 for the DNN dynamics with Gaussian random linear mappings, if
(1 − (1/4) tanh(σ√(2/π) · h(σ√(2/π)))) d σ² > 1,   (19)
where h(x) is the nonzero and positive solution to the equation z=tanh(xz) (where z is the unknown while x is the parameter).
Remark 2. The above inequality is based on the mean dynamics analysis and Assumption 1. Nevertheless, it shows that a higher dimensional dynamics is more likely to be chaotic, and that a too large or too small σ² suppresses the chaotic behavior. Both observations coincide with the numerical results. An explicit expression for the function h in terms of an infinite series, based on the Lagrange inversion formula [15], can also be found in Appendix 11.
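Although h has no closed form, it is easy to evaluate numerically, since z = tanh(xz) has a unique attracting positive fixed point whenever x > 1. A minimal fixed-point iteration:

```python
import numpy as np

def h(x, tol=1e-12, max_iter=10_000):
    """Nonzero positive solution of z = tanh(x*z); it exists only for x > 1."""
    if x <= 1.0:
        return 0.0                  # only the trivial solution z = 0 in this regime
    z = 1.0                         # any positive start converges to the attracting fixed point
    for _ in range(max_iter):
        z_new = np.tanh(x * z)
        if abs(z_new - z) < tol:
            break
        z = z_new
    return z

print(h(1.5), h(3.0))
```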
Capabilities of Classification and Generalization
In this section, we apply the metrics characterizing the dynamics of DNN to the analysis of classification and generalization capabilities.
Classification Capability
We first define the complexity of classification problem and then relate it to the topological entropy of DNN dynamics.
Complexity of Classification
We define the complexity of classification, which facilitates the subsequent analysis of DNN classification. The philosophy is to measure how difficult it is to separate the samples from different classes. There have been some studies on how to define the complexity of pattern recognition [16][17][18][19]. However, they are mainly focused on the nonlinearity of classifier and the overlap of features, which is significantly different from our proposed definition.
For defining the complexity of classification, we first need the following definition of set radius.
Definition 2. For a connected and compact set A in R^d, we define its radius as r(A) = sup{‖x − y‖ : x, y ∈ A}. Then, we define the radius of a partition P of A as r(P) = max{r(A) : A ∈ P}.
Then, we define the classification complexity as follows:
Definition 3. Given the sample set X, consisting of X_1 labeled by +1 and X_2 labeled by −1, we say that a Borel set A is hybrid with respect to X if A ∩ X_j ≠ ∅ for j = 1, 2. Suppose that X is contained in a compact set A (e.g., the convex hull of X). For a partition P of A, we define
N(P|X) = |{A : A ∈ P, A is hybrid w.r.t. X}|.   (20)
Then, we define the ϵ-complexity given the sample set X as
C(X, ϵ) = log₂ inf_{P: r(P) ≤ ϵ} N(P|X).   (21)
Remark 3. The complexity of classification in the above equation characterizes how the two classes of samples are mixed with each other. It also characterizes the size of the boundary surface between the two classes; the larger, the more complex. As will be seen, the definition of classification complexity will facilitate the analysis based on topological entropy. The links between the classification complexity and related metrics are discussed in Appendix 12.
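For low-dimensional sample sets, the ϵ-complexity in (21) can be approximated by taking P to be a regular grid whose cells have radius at most ϵ and counting the hybrid cells; since the grid is only one admissible partition, this yields an upper bound on the infimum in (21). The data below are synthetic and chosen purely for illustration:

```python
import numpy as np

def grid_complexity(X1, X2, eps):
    """Upper bound on C(X, eps): count grid cells containing samples of both classes."""
    d = X1.shape[1]
    side = eps / np.sqrt(d)                      # so that each cell has radius (Definition 2) at most eps
    cells1 = {tuple(c) for c in np.floor(X1 / side).astype(int)}
    cells2 = {tuple(c) for c in np.floor(X2 / side).astype(int)}
    hybrid = cells1 & cells2                     # cells hybrid w.r.t. X
    return np.log2(max(len(hybrid), 1))

# Synthetic 2-D example: a checkerboard labeling mixes the classes at scale eps,
# while an almost linearly separated labeling produces far fewer hybrid cells.
rng = np.random.default_rng(0)
pts = rng.uniform(-1.0, 1.0, (4000, 2))
checker = np.sin(8 * pts[:, 0]) * np.sin(8 * pts[:, 1]) > 0
print(grid_complexity(pts[checker], pts[~checker], eps=0.1))
linear = pts[:, 0] > 0.03
print(grid_complexity(pts[linear], pts[~linear], eps=0.1))
```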
Number of DNN Layers
Now, we apply the concept of topological entropy to the analysis of DNN, in terms of the capability of classification. We need the following definition:
Definition 4. Consider two sets of vectors Z_1 = {z_{11}, ..., z_{1N_1}} and Z_2 = {z_{21}, ..., z_{2N_2}}. We say that Z_1 and Z_2 are affine separable with an ϵ-margin if there exists an affine classifier (w, b) (where ‖w‖ = 1) such that
w^T z_1 − b ≥ ϵ/2 for all z_1 ∈ Z_1,   and   w^T z_2 − b ≤ −ϵ/2 for all z_2 ∈ Z_2.   (22)
Then, we have the following lower bound on the number of layers in DNN in the following theorem, whose proof is given in Appendix 13:
Theorem 5. Suppose that the output of DNN hidden layers is affine separable with ϵ-margin for the sample set X. Then, the number of layers D is lower bounded by
D ≥ C(X, ϵ) / H_s({T_n}_{n=1,...,D}, D, ϵ) + 1.   (23)
Remark 4. Recall that Hs({Tn}n=1,..,D,D,ϵ) is defined in Definition 1. Note that, when D is sufficiently large and ϵ is sufficiently small, Hs({Tn}n=1,..,D,D,ϵ) can be well approximated by hs({Tn}n=1,2,...), which becomes independent of D and ϵ.
Generalization Capability
An over-complex model for classification may fail to generalize to samples beyond the training set. Hence, it is important to study the generalization capability of DNNs, characterized by the VC-dimension [20]. Note that the VC-dimension is usually studied by quantifying the separated components in the parameter space [20]. Meanwhile, the VC-dimension of DNN has been substantially explored in [21], whose setup is slightly different from ours, since it addresses piecewise linear DNNs. Our results may not be as precise as those in [21]; however, we leverage the layered structure of DNN and may provide a new direction for studying the VC-dimension of DNN. Note that the generalization capability of DNN has been shown to be better than predicted by the VC-dimension argument in typical applications [22]. Many explanations have been given for the excellent generalization capability of DNN, e.g., [23]. However, our study provides a new recursive approach to explore the generalization of DNN.
VC-dimension
It is well known that the generalizability of a hypothesis set H is determined by its VC-dimension, which is defined as follows:
Definition 5. We say that a set S of m (≥ 1) points is fully shattered by a binary hypothesis set H if the hypotheses in H can realize all the 2^m possible labelings of S. The VC-dimension is defined as
d_VC(H) = max{|S| : S is fully shattered by H}.   (24)
The relationship between the VC-dimension and the generalization capability can be found in [24].
VC-dimension in DNN
In the subsequent analysis, we find both lower and upper bounds for the VC-dimension from the viewpoint of dynamical systems. Based on the ensemble topological entropy, we have the following upper bound for the VC-dimension of DNN. The proof is given in Appendix 14.
Theorem 6. The VC-dimension of DNN is upper bounded by
when the DNN output is affine separable with ϵ margin.
Remark 5. Again the quantity He can be approximated by he when D is sufficiently large and ϵ is sufficiently small. The above upper bound for the VC-dimension of DNN is obtained from the complexity analysis. Unfortunately, we are still unable to numerically or analytically evaluate the ensemble topological entropies He or he.
We can also analyze the details of shattering the samples by leveraging the nonlinearity of tanh in a layer-by-layer manner, thus resulting in the following lower bound of VC-dimension. The proof is given in Appendix 15.
Theorem 7. The number of distinct samples that can be shattered by a DNN is at least D+3, where D is the number of layers. Therefore, the VC-dimension of DNN is lower bounded by dVC≥D+3.
Remark 6. The conclusion shows that the VC-dimension increases at least linearly with the number of layers. The layer-by-layer analysis may be extended to find finer bounds, since we only analyzed the 2-dimensional case in the proof.
Numerical Results
In this section, we use the ResNet-101 classifier, together with the image set called ImageNet, to evaluate the largest Lyapunov exponents for practical DNNs and real images. For notational simplicity, we call the largest Lyapunov exponent the Lyapunov exponent.
ResNet-101
We used the pretrained ResNet-101 [25], which is a convolutional deep neural network and is available as Matlab code. The DNN of ResNet-101 has 101 layers. The input and output of ResNet-101 are images and the corresponding image categories. The coefficients of the DNN are trained using the celebrated ImageNet database [26]. The evaluation of the Lyapunov exponents is based on the image set 'Caltech 101', which contains 101 classes of images. Each image is normalized to the size of 224×224. In total, 8733 images are used in our numerical computations.
Estimation of Lyapunov Exponents
For traditional dynamical systems, the Lyapunov exponents can be estimated using the approaches in [27] (pp. 71–73). However, these algorithms are designed for dynamics having identical mappings in each round, while the DNN dynamics are random, with different nonlinear mappings in each layer due to different coefficients. Moreover, the number of layers of ResNet-101 (i.e., 101 layers) may not be large enough for the direction of maximum expansion to emerge spontaneously from an arbitrary initial direction. Therefore, we propose the algorithm in Procedure 1 to approximately estimate the Lyapunov exponents.
Algorithm 1.
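The detailed steps of Algorithm 1 are not reproduced here. One plausible way to approximate the largest Lyapunov exponent of a trained network at a given input, consistent with the perturbation picture in Fig. 3, is a finite-difference estimate of the expansion rate over a few random perturbation directions; since the depth may be too small for the dominant direction to emerge on its own, the largest rate over the sampled directions is kept. The sketch below illustrates this idea with a toy stand-in for the real network; it is our own approximation, not necessarily the procedure actually used for the reported results:

```python
import numpy as np

def finite_diff_lyapunov(forward_fn, x, num_layers, delta=1e-6, num_dirs=8, seed=0):
    """Finite-difference estimate of the largest Lyapunov exponent (bits per layer) at input x.

    forward_fn maps an input vector to the (flattened) output of the last hidden layer;
    num_layers is the network depth D.
    """
    rng = np.random.default_rng(seed)
    y = forward_fn(x)
    best = -np.inf
    for _ in range(num_dirs):
        u = rng.normal(size=x.shape)
        u *= delta / np.linalg.norm(u)                        # small perturbation of size delta
        growth = np.linalg.norm(forward_fn(x + u) - y) / delta
        best = max(best, np.log2(growth) / num_layers)
    return best

# Toy demonstration: a random tanh network stands in for a pretrained model such as ResNet-101.
rng = np.random.default_rng(1)
d, D = 64, 20
Ws = [rng.normal(0.0, 0.15, (d, d)) for _ in range(D)]

def toy_forward(x):
    for W in Ws:
        x = np.tanh(W @ x)
    return x

print(finite_diff_lyapunov(toy_forward, rng.uniform(-1.0, 1.0, d), num_layers=D))
```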
Numerical Results
Consider each image as a point in the 'image space', whose dimension is 50176. The Lyapunov exponent estimated using Procedure 1 can then be considered as the Lyapunov exponent at the corresponding point.
Trajectory
In Fig. 6, the trajectories beginning from an image point and a perturbed point are plotted. Since the trajectories are in a very high dimensional space, we plot only the trajectories along two dimensions. We observe that, although the perturbation is very small, the two trajectories are significantly different. Note that the trajectories return to the origin frequently, which is due to the ReLU function.
Histogram of Lyapunov Exponents
The histogram of Lyapunov exponents obtained from the 8733 images and the algorithm in Procedure 1 is given in Fig. 7. We observe that the distribution has a bell shape and is concentrated around the value of 0.005. Some are negative, which means that the perturbation diminishes. We notice that the Lyapunov exponents at very few points are close to 0.2. The images of the highest 4 Lyapunov exponents are given in Fig. 8. It is not clear why these 4 images have large Lyapunov exponents. However, the large Lyapunov exponents imply that the DNN classifier is substantially more sensitive at these images, which may imply that they are closer to other classes of images and are thus more difficult to classify. This possible implication will be tested in our future research.
Comparison of Different Classes
We also plot the expectation and variance of the Lyapunov exponents for the 101 classes in the plane, as shown in Fig. 9. The four classes of images having the greatest variance of Lyapunov exponents are shown in Fig. 10. We notice that the variances within the different image classes are around 6e-6, while the overall variance is 7e-6. Therefore, the Lyapunov exponents vary almost as much within each class as over the whole image set; they are not strongly tied to the image class.
Conclusions
In this paper, we have proposed to consider DNN as a random dynamical system, and thus study the properties of DNNs from the viewpoint of dynamical systems. To characterize the complexity of the DNN dynamical system, we have analyzed the corresponding topological entropy and Lyapunov exponents. In particular, we have demonstrated that there may exist chaotic behavior due to the nonlinearity of DNN, which explains why DNN can form complicated classification surfaces. In our future study, we will calculate the topological entropy either analytically (e.g., in the high dimensional regime) or numerically (e.g., using the generating function of symbolic dynamics).
Random Dynamical Systems
Here we provide a rigorous definition for random dynamical systems [28]. A measurable random dynamical system is a 4-tuple (Ω,F,P,(θ(t))t∈T) over a measurable space (X,B), where X is the system state set and B is the σ-algebra:
Ω: the sample space representing the random events in the random dynamical system;
F: the σ-algebra of Ω;
P: the distribution over Ω;
T: the set of time index;
θ: the time shift acting on the random event w.
The random dynamical system provides a mapping ψ:
ψ: T × Ω × X → X,   (26)
namely mapping the system state x∈X at time t to ψ(t,w,x), given the realization w∈Ω.
The dynamical system (Ω, F, P, (θ(t))_{t∈T}) is called ergodic if all sets in I have probability 0 or 1, where I ⊂ F is the sub-σ-algebra formed by all invariant sets. Note that a set A is called invariant with respect to θ if θ^{−1}A = A.
Definitions of Topological Entropy
There are three equivalent definitions for the topological entropy [29]. We will introduce them separately in the subsequent discussion. For the three definitions, we consider a topological dynamical system represented by a triple (X,T,S). Here X is a metric space, in which the metric is denoted by d. As will be seen, the metric is not necessary for one of the definitions. However, we still assume the structure of metric space, since it will be used in the first two definitions. We assume that X is compact. When the space X is not compact, an unbounded dynamical system such as x(t)=2x(t−1) may have positive topological entropy but is still not complex. We define the transformation T mapping from X to X. Moreover, we assume that T is continuous; i.e., T−1 maps from open sets to open sets in the topological space X.
Spanning Orbit Based Definition
We first define the topological entropy based on spanning orbits, namely a set of typical orbits of the dynamics, such that for any orbit one can always find a sufficiently close one in this set. To this end, we need to define the distance between two orbits, namely the metric dn defined as
d_n(x, y) = max{d(T^i x, T^i y), i = 0, ..., n−1},   (27)
for orbits with initial points x and y, respectively. Intuitively speaking, the distance between the two orbits is the maximum of the distances at all time slots. We define the n-th order ball of point x ∈ X as the ball around x with radius ϵ and denote it by B_n(x, ϵ). Fig. 11 shows a track beginning from x_2 that stays within B_n(x_1, ϵ).
Now we define the ‘typical set’ of orbits (or equivalently the initial points). We say a set F of points in X is (n,ϵ)-spanning if it intersects every (n,ϵ)-ball (i.e., intersecting Bn(x,ϵ) for every x). Intuitively speaking, this means that, for any point x in X (or equivalently the orbit beginning from x), one can always find a sufficiently close orbit (up to the error ϵ), or the corresponding initial point, to approximate the orbit beginning from x.
Then, we define
r(n, ϵ) = min{#F : F is an (n, ϵ)-spanning set},   (28)

namely the cardinality of the minimum (n, ϵ)-spanning set. One can consider each point, or the corresponding orbit, in the set as a 'codeword', in the context of source coding in communications. Then, r(n, ϵ) is the minimum number of codewords needed to represent all possible orbits, with errors up to ϵ.
The spanning orbits based topological entropy is then defined as follows, where the subscript s means ‘spanning’.
Definition 6. The topological entropy of the dynamical system (X, T, S) is defined as

h_s(T) = lim_{ϵ→0} lim_{n→∞} (1/n) H_s(T, n, ϵ),   where H_s(T, n, ϵ) = log₂ r(n, ϵ).   (29)
Here, Hs(T,n,ϵ) is the minimum number of bits for the source coding of the orbits with error up to ϵ. Finally the topological entropy hs(T) is the asymptotic coding rate when the time span tends to infinity while the tolerable error tends to 0.
Separated Orbits based Definition
The topological entropy defined from separated orbits is similar to that defined from spanning orbits. In both cases, we find a set of 'typical' initial points, or the corresponding orbits, in order to encode the possible orbits. In the spanning orbits case, we begin from a large set of typical orbits, which may be more than enough, and then find a minimum subset. In contrast, in the separated orbits based definition, we consider points with neighborhoods of a given radius and squeeze as many such orbits (or initial points) as possible into the space. As will be seen later, both selections of typical orbits yield the same topological entropy in the asymptotic sense.
To this end, we say that a subset F ⊂ X is (n, ϵ)-separated if d_n(x_i, x_j) > ϵ for all x_i, x_j ∈ F with x_i ≠ x_j. Intuitively speaking, the elements in F are separated by distances of at least ϵ. Since X is assumed to be compact, every (n, ϵ)-separated set has finitely many elements, as illustrated in Fig. 12. We denote the maximal cardinality of all (n, ϵ)-separated sets by
s(n, ϵ) = max{|F| : F is (n, ϵ)-separated},   (30)
which means the maximum number of points that one can ‘squeeze’ into the space X, given the required lower bound of distance ϵ.
Then, we define the metric based topological entropy as
Definition 7. We define the topological entropy as

h_m(T) = lim_{ϵ→0} lim_{n→∞} (1/n) H_m(n, ϵ),   where H_m(n, ϵ) = log₂ s(n, ϵ).   (31)
Here Hm(n,ϵ) is the number of bits needed to describe the maximal (n,ϵ)-separated set. Then, the topological entropy hm(T) is the coding rate (or equivalently, the exponent of increasing orbits) in the asymptotic case (the time span tends to infinity and the tolerable error tends to zero).
Cover Based Definition
As the third type of definition of topological entropy, the cover based definition does not need a metric. Hence it is still valid in spaces without a metric structure. The requirement is the topological structure, namely the definition of open sets. A cover of X is defined as a family of open sets that cover X. Consider a cover U. Due to the assumption of the compactness of X, we can always find a finite subset of open sets in U that covers X. For a transformation T and any integer n, T^{−n}(U) is also a cover of X. A cover V is a subcover of a cover U if V ⊂ U, namely the open sets in V are also in U while V still covers X. The minimum cardinality of a subcover of U is denoted by N(U).
We define the joint of two covers U and V as
U∨V={U∩V:U∈U,V∈V}.(32)
We further define
U^n = ∨_{i=0,...,n−1} T^{−i}(U),   (33)
namely the joint set of the covers obtained from tracing back the open sets in U along the transformation T.
Based on the above definitions on cover sets, we can now define the topological entropy:
Definition 8. The topological entropy of the dynamical system (X, T, S) is defined as

h_c(T) = sup_U h_c(T, U),   where h_c(T, U) = lim_{n→∞} (1/n) H_c(U^n) and H_c(U) = log₂ N(U).   (34)
Here Hc(U) is the number of bits needed to describe the open sets that cover X. hc(T,U) is the coding rate for describing the joint sets generated by tracing back the open sets in U.
Equivalence
The above three definitions of topological entropy look different, focusing on different aspects of the dynamical system. In particular, the cover based definition is significantly different from the first two. A surprising result is that they are actually equivalent in metric spaces:
Theorem 8 ([30]). In metric spaces, we have
h_s(T) = h_m(T) = h_c(T).   (35)
The proof is highly nontrivial. The details can be found in Chapter 6 of [31]. Due to their equivalence, we use only the notation hs(T) in the subsequent discussion.
Proof of Theorem 2

Here we consider the case of the hyperbolic tangent function for the DNN. Before giving the rigorous proof, we provide the intuition for the possibility of chaotic behavior in the DNN dynamics. The maximum Lyapunov exponent λ_1 is determined by the expansion rate of the tangent vector u_t = J_t(x_0) u_0. Only when ‖J_t(x_0) u_0‖ increases exponentially is λ_1 positive. We notice that the matrix S_i(x) always shrinks ‖J_t(x_0) u_0‖, since all its diagonal elements are positive but less than 1. Hence, to make λ_1 positive, the determinant of W_i should be larger than 1, namely enlarging the vector norm in an average sense. However, W_i operates on both u_t and x_t. Hence, if the determinant of W_i is too large, it will make the elements of x_t large, thus making the elements of S_i(x) small and shrinking ‖u_t‖. Therefore, there is a tradeoff for the determinant of W_i, to balance the enlarging effect of W_i and the shrinking effect of S_i.
For scalar dynamics, namely d = 1, we carry out numerical calculations to evaluate the Lyapunov exponents. We first tested the scalar dynamical system x_{t+1} = tanh(a x_t + b). In Fig. 13, we show the Lyapunov exponent with respect to various values of a and b, fixing the initial value x_0 = 0.2. We observe that the Lyapunov exponent λ_1 is negative. The value of b has much less impact on the Lyapunov exponent. In Fig. 14, we show the Lyapunov exponent with respect to various values of x_0 and b, fixing a = 3.12. We observe that, for all the nonzero values of x_0 ∈ [−1, 1] and b, the Lyapunov exponents are negative. Note that the Lyapunov exponent is different at different initial values, which implies that the scalar dynamical system is not ergodic.
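A short script reproducing the spirit of this scalar experiment is given below; the Lyapunov exponent of x_{t+1} = tanh(a x_t + b) is simply the orbit average of log₂ |a|(1 − x_{t+1}²), and the parameter values are our own illustrative choices:

```python
import numpy as np

def scalar_lyapunov(a, b, x0=0.2, steps=5000, burn_in=500):
    """Lyapunov exponent (bits per step) of x_{t+1} = tanh(a*x_t + b), started at x0."""
    x, total = x0, 0.0
    for t in range(steps):
        x = np.tanh(a * x + b)
        if t >= burn_in:
            total += np.log2(abs(a) * (1.0 - x ** 2))   # |dx_{t+1}/dx_t| = |a| (1 - x_{t+1}^2)
    return total / (steps - burn_in)

# In our runs the estimate stays negative for all tested parameters,
# consistent with the absence of chaos in the scalar (d = 1) case.
for a in (0.5, 1.5, 3.12, 10.0):
    print(a, round(scalar_lyapunov(a, b=0.1), 3))
```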
The reason for the lack of chaotic behavior in the scalar dynamics of DNN is because ut and xt are aligned in the same direction in R. Therefore, the enlarging effect of Wi will backfire at ut via St directly. However, in higher dimensional spaces, it is possible to tune the parameters Wi such that ∥xt∥ is prevented from being too large while ∥ut∥ will be enlarged exponentially, since xt and ut may not be aligned in the same subspace.
The following proof of Theorem 2 follows the above idea of decoupling the system state xt and tangent vector ut:
Proof. There are two approaches to generate chaos in the DNN dynamics by designing the DNN parameters carefully. We assume d=2, namely the dynamics of DNN is in R2. For higher dimensional dynamical systems, we can project the system state to 2-dimensional subspaces, thus not affecting the conclusion.
Approach 1: In this first approach, we force the system state x_t to have a constant norm, as illustrated in Fig. 15. Denote by u_0 ≠ 0 the initial direction and by x_0 = (x_{01}, x_{02}) the initial point of the successive mappings. We assume r = ‖x_0‖ < 1. We further assume that u_0 and x_0 are linearly independent. We also fix a positive number A > 1.
For the t-th layer, the output is xt=tanh(x~t), where x~t=Wtxt−1. The mapping maps the derivative ut−1 to ut in (16). We assume that ut−1 and xt−1 are linearly independent. We set the linear mapping Wt to be a matrix with real eigenvector ut−1 corresponding to eigenvalue A. We further assume that Wt satisfies
W_t x_{t−1} = argtanh(x_t),   (36)

such that ‖x_t‖ = r < 1. Due to these constraints, the matrix W_t can be obtained from the following equations:
W_t u_{t−1} = A u_{t−1},   W_t x_{t−1} = x̃_t,   (37)
which can be rewritten as
W_t B_t = C_t,   (38)
where Bt=(ut−1,xt−1) and Ct=(Aut−1,x~t). Due to the assumption that ut−1 and xt−1 are linearly independent, the matrix Bt is of full rank and is thus invertible. Therefore, we can obtain Wt by
W_t = C_t B_t^{−1}.   (39)
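The construction (37)-(39) is straightforward to implement. The sketch below builds one such W_t from u_{t−1}, x_{t−1}, a target point x_t on the circle of radius r, and the eigenvalue A; all numerical values are arbitrary illustrations:

```python
import numpy as np

def build_W(u_prev, x_prev, x_next, A):
    """Solve W B = C with B = [u_prev, x_prev] and C = [A*u_prev, argtanh(x_next)], cf. (37)-(39)."""
    B = np.column_stack([u_prev, x_prev])
    C = np.column_stack([A * u_prev, np.arctanh(x_next)])
    return C @ np.linalg.inv(B)

# One step of Approach 1 with r = 0.5 and A = 4.
A, r = 4.0, 0.5
u_prev = np.array([1.0, 0.3])                      # current tangent direction (nonzero)
x_prev = np.array([0.3, 0.4])                      # ||x_prev|| = r
x_next = r * np.array([np.cos(1.0), np.sin(1.0)])  # any point on the circle ||x|| = r
W = build_W(u_prev, x_prev, x_next, A)
print(np.allclose(W @ u_prev, A * u_prev),         # u_prev is an eigenvector with eigenvalue A
      np.allclose(np.tanh(W @ x_prev), x_next))    # the state is mapped back onto the circle
```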
Then, the derivative after the t-th round of mapping is given by

u_t = A · diag(1 − x_{t1}², 1 − x_{t2}²) u_{t−1}.   (40)
Now, we need to guarantee that u_t and x_t are linearly independent, for the purpose of induction. Notice that the only constraint on x_t is ‖x_t‖ = r. Suppose that, for a point x_t on the circle ‖x_t‖ = r, we have
u_{t1} x_{t2} = u_{t2} x_{t1},   (41)
namely det(Bt)=0. Then, substituting (40) into (41), we have
u_{t−1,1} x_{t2} (1 − x_{t1}²) = u_{t−1,2} x_{t1} (1 − x_{t2}²).   (42)
Unless u_{t−1,1} = u_{t−1,2} = 0, which is impossible due to the assumption u_0 ≠ 0 and the induction in (40), the equation in (42) defines a curve of degree 3. According to Bezout's theorem [32], there are at most six real intersection points of the curve in (42) and the circle ‖x_t‖ = r. Therefore, one can always find a point on ‖x_t‖ = r such that (42) does not hold, namely u_t and x_t are linearly independent.
Using induction on (40), we have
u_t = A^t ∏_{l=1}^{t} diag(1 − x_{l1}², 1 − x_{l2}²) u_0.   (43)
Hence, we have
‖u_t‖ / ‖u_0‖ ≥ (A(1 − r²))^t.   (44)
When we choose a sufficiently large A such that the right hand side of (44) is greater than 1, ‖u_t‖/‖u_0‖ increases exponentially, thus making the Lyapunov exponent positive. Moreover, the Lyapunov exponent can be made arbitrarily large due to the arbitrary selection of A.
Approach 2: In the second approach, the DNN dynamics force the system state to rational points in R². To this end, we consider a compact set Ω ⊂ [−1, 1]² with nonzero Lebesgue measure. We assume
max_{x∈Ω} ‖x‖_1 ≤ r < 1,   (45)
where the maximum is achieved due to the compactness of Ω.
We can enumerate all the rational points (i.e., the coordinates are rational numbers) in Ω, except for the points on the lines x1=x2 and x1=−x2, in the sequence X={xi}i=1,2,3,.... It is easy to verify that the set X is dense in Ω. We also choose the initial vector u0 such that
u_{01} / u_{02} ∉ Q,   (46)
namely the ratio of the two coordinates in u0 is an irrational number.
Then, we design the linear mappings in the same way as in Approach 1, except that the design of Wt in (37) is changed to
W_t u_{t−1} = A u_{t−1},   W_t x_{t−1} = argtanh(x_t),   (47)
which is W_t B_t = C_t. Here the major goal is to ensure the invertibility of the matrix B_t, namely that u_t and x_t are linearly independent. To this end, suppose for contradiction that u_t and x_t are linearly dependent, which implies
u_{t1} / u_{t2} = x_{t1} / x_{t2}.   (48)
According to (43), we have
u_{t1} / u_{t2} = (u_{01} / u_{02}) ∏_{l=1}^{t} (1 − x_{l1}²) / (1 − x_{l2}²),   (49)
which is irrational, since u_{01}/u_{02} is irrational and ∏_{l=1}^{t} (1 − x_{l1}²)/(1 − x_{l2}²) is rational (recall that both x_{l1} and x_{l2} are rational). However, x_{t1}/x_{t2} is rational. Hence, (48) does not hold, since the left hand side is irrational while the right hand side is rational. Therefore, u_t and x_t are linearly independent, which assures the existence of a solution to (47). This concludes the proof. ◻
Proof of Theorem 3

Proof. Similarly to the hyperbolic tangent case, we consider only the 2-dimensional case. Since the angle is given by θ(t) = arctan(x_2(t)/x_1(t)), we have
where, according to the chain rule, the partial derivative is the sum of the chains of partial derivatives along all possible paths from x_1(0) to x_1(T) or x_2(T), namely
Justification of Assumption 1 and Proof of Theorem 4

We first provide numerical results to justify the assumption in Theorem 4, namely that ∑_{i=1}^{d} x_{ti}² is a constant C when both d and t are sufficiently large. Figures 16 and 17 show the time evolution and the variance of the norm normalized by d, respectively. We observe that the normalized norm becomes more concentrated as the dimension becomes larger. Note that a sufficiently large t means that the distribution of x_t becomes stationary, while a large d implies the concentration of measure for x_t, thus resulting in an essentially constant norm of x_t.
To prove Theorem 4, we need the following simple lemma:
Lemma 2. There exists a positive solution to the following equation with unknown R:

R = E[tanh²(z)],   (57)

where z is a Gaussian random variable with zero expectation and variance σ²R, when σ² > 1.
Proof. It is easy to verify that the right-hand side of (57) is a continuous function of R. Let R be sufficiently small, so that tanh(z) is very close to a linear function (since tanh is close to a linear function when its argument is small), namely
E[tanh²(z)] = σ²R + o(R),   (58)
which is larger than the left hand side of (57).
We then let R tend to infinity. The right hand side of (57) is bounded by 1, while the left hand side grows without bound and is thus larger than the right hand side. Due to the continuity of both sides of (57), there exists at least one positive solution. ◻
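The fixed point in (57) is also easy to compute numerically, for instance by iterating R ← E[tanh²(z)] with z ~ N(0, σ²R) and a Monte Carlo estimate of the expectation; the sample size and σ² values below are arbitrary:

```python
import numpy as np

def solve_R(sigma2, n_samples=100_000, iters=100, seed=0):
    """Positive solution of R = E[tanh^2(z)], z ~ N(0, sigma2 * R), for sigma2 > 1."""
    rng = np.random.default_rng(seed)
    g = rng.normal(size=n_samples)      # reuse the same standard normal samples each iteration
    R = 1.0
    for _ in range(iters):
        R = np.mean(np.tanh(np.sqrt(sigma2 * R) * g) ** 2)
    return R

for sigma2 in (1.5, 3.0, 10.0):
    print(sigma2, round(float(solve_R(sigma2)), 3))
```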
We also need the following lemma for the proof of Theorem 4.
Lemma 3. We consider the following scalar dynamics
x(t+1) = tanh(α x(t)),   (59)
where α is a Gaussian random variable with zero expectation and variance σ2. For a fixed α>0, we consider the equation
x = tanh(αx),   (60)
whose nonzero solution is denoted by x=h(α). Then, we have
Var[x] ≤ (1/4) tanh(σ√(2/π) · h(σ√(2/π))).   (61)
Proof. The variance of x in the stationary stage is given by