# Theoretical Explanation of Deep Neural Network (DNN) and Analysis of its Nonlinear Dynamics

# Abstract

# Introduction

# System Model

## Structure of DNN

## Random Dynamical System Modeling

# Topological Entropy and Chaos

## Topological Entropy and Lyapunov Exponent

### Topological Entropy

### Ensemble Topological Entropy

### Lyapunov Exponent

### Relationship

## Local Chaos with Deterministic Mapping

### Hyperbolic Tangent Function

### ReLU Function

## Random Mapping

# Capabilities of Classification and Generalization

## Classification Capability

### Complexity of Classification

### Number of DNN Layers

## Generalization Capability

### VC-dimension

### VC-dimension in DNN

# Numerical Results

## ResNet-101

## Estimation of Lyapunov Exponents

## Numerical Results

### Trajectory

### Histogram of Lyapunov Exponents

### Comparison of Different Classes

# Conclusions

# Random Dynamical Systems

# Definitions of Topological Entropy

## Spanning Orbit Based Definition

## Separated Orbits Based Definition

## Cover Based Definition

## Equivalence

# Proof of Theorem 2 (for Hyperbolic Tangent Case)

# Proof of Theorem 3 (for the ReLU Case)

# Proof of Theorem 4 and Justification


Published on Aug 31, 2024


The theoretical explanation for the great success of deep neural network (DNN) is still an open problem. In this paper, DNN is considered as a discrete-time dynamical system due to its layered structure. The complexity provided by the nonlinearity in the DNN dynamics is analyzed in terms of topological entropy and chaos characterized by Lyapunov exponents. The properties revealed for the dynamics of DNN are applied to analyze the corresponding capabilities of classification and generalization. In particular, for both the hyperbolic tangent function and the rectified linear units (ReLU), the Lyapunov exponents are both positive given proper DNN parameters, which implies the chaotic behavior of the dynamics. Moreover, the Vapnik-Chervonenkis (VC) dimension of DNN is also analyzed, based on the layered and recursive structure. The conclusions from the viewpoint of dynamical systems are expected to open a new dimension for the understanding of DNN.

Deep neural network (DNN, a.k.a. deep learning) is the most shining star in the community of information science in the last decade [1]. It has brought a paradigm shift to artificial intelligence and many cross-disciplinary areas. Despite its great success in applications, the theoretical explanation and mathematical framework are still open problems for DNN. There have been substantial efforts on the mathematical analysis for DNN, such as using the wavelet framework to understand DNN [2], applying the framework of function approximation [3], enumerating the linear regions after the nonlinear mapping of DNN [4], analyzing the shape deformation in the transient chaos of DNN [5], and information theoretic analysis [6], to name a few. Although these excellent studies have made significant progress on a deeper understanding of DNN, they are still insufficient to fully describe the behavior, quantitatively analyze the performance and thus systematically design DNNs.

In this paper, we propose a systematic framework to analyze DNN by considering it as a dynamical system. A prominent feature of DNN is the layered structure, usually having many hidden layers (e.g., hundreds of layers). Each layer is a mapping having similar structures, namely a linear mapping followed by a nonlinear operation. One can consider each layer as a nonlinear transformation for the input vector. Therefore, the DNN with the multi-layer structure can be considered as a discrete-time nonlinear dynamical system, where the initial value is the input vector to the DNN. The transformation is specified by the DNN parameters, and the final value of the transformations can be easily handled using simple classifiers such as linear ones.

The view of DNN as a dynamical system brings a novel approach for the theoretical study, due to the powerful framework of dynamical systems developed in the past 100 years. On one hand, a DNN should be sufficiently complex to classify different objects (e.g., images). The topological entropy, developed in the study of dynamical systems in the 1960s [7], can be used to characterize the complexity of dynamics. On the other hand, for complex objects, a DNN may need to distinguish two feature vectors that are very close to each other, pulling them apart for the final linear classification. This implies that the behavior of DNN near these points needs to be chaotic; i.e., the distance between the two points becomes significant after the nonlinear mappings even though they are initially very close to each other. This can also be understood from the viewpoint of the classification surface. For the original inputs of different classes, the classification surface could be very complicated; however, the DNN can deform the complicated surface to a simple hyperplane. Therefore, the inverse transformation of DNN (e.g., linear transformation plus

The analysis of dynamical behaviors in DNN can be applied to analyze various properties of DNN. On one hand, the complexity incurred by the nonlinearity in the dynamics of DNN can be used to quantify the classification capability of DNN, and predict the required number of layers. On the other hand, if the nonlinear dynamics provide too much complexity, the Vapnik-Chervonenkis (VC) dimension could be high and thus affect the generalization capability. Meanwhile, the layered structure of DNN also provides a layer-by-layer analysis for the VC-dimension. These applications will be discussed in this paper.

The remainder of this paper is organized as follows. The system model of DNN is introduced in Section 2. The topological entropy and the chaotic behavior of DNN, characterized by the Lyapunov exponent, are analyzed in Section 3. The applications of the nonlinear dynamics to the capabilities of classification and generalization of DNN are discussed in Section 4. The numerical results are presented in Section 5. The conclusions are provided in Section 6.

In this section, we introduce the system model of DNN. We first explain the structure of DNN. Then, we model a DNN as a random dynamical system for facilitating the analysis.

For simplicity of analysis, we consider the problem of binary classification, which outputs a binary label in

We assume that there are

$\begin{aligned}
\left\{
\begin{array}{ll}
&\mathbf{h}_i=\mathbf{W}_i\mathbf{x}_{i-1}+\mathbf{b}_i\\
&\mathbf{x}_{i}=\phi(\mathbf{h}_i)
\end{array}
\right.,
\end{aligned} \tag{1}$

where

- Hyperbolic tangent $\tanh$ : This operation is used in traditional neural networks. It confines the intermediate vectors $\mathbf{x}_{i}$ within $[-1,1]^d$ . The advantages of $\tanh$ are its smoothness, which facilitates the analysis, and the compactness of the state space.
- Rectified linear units (ReLU) $x=\max\{h,0\}$ : ReLU has been demonstrated to outperform the hyperbolic tangent $\tanh$ in recent progress of DNNs. The challenge of ReLU is the loss of differentiability at the point $h=0$ . We can assume that the derivative is a piecewise continuous function:

$\begin{aligned} \frac{d\phi(h)}{dh}=\left\{ \begin{array}{ll} 0,\qquad h<0\\ 1,\qquad h>0 \end{array} \right., \end{aligned} \tag{2}$

where the derivative is not defined at $h=0$ .
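To make the dynamical-system view concrete, the recursion in Eq. (1) can be iterated directly. The following minimal sketch (with illustrative width, depth, and random Gaussian parameters, not a trained network) treats each layer as one step of a discrete-time map:

```python
import numpy as np

def dnn_layer(x, W, b, phi="tanh"):
    """One DNN layer of Eq. (1), viewed as one step of a discrete-time map."""
    h = W @ x + b
    if phi == "tanh":
        return np.tanh(h)           # confines the state to [-1, 1]^d
    return np.maximum(h, 0.0)       # ReLU: elementwise max(h, 0)

rng = np.random.default_rng(1)
d, depth = 4, 8                     # illustrative dimension and number of layers
x = rng.standard_normal(d)          # the DNN input is the initial value
for _ in range(depth):
    W = rng.standard_normal((d, d)) / np.sqrt(d)
    b = 0.1 * rng.standard_normal(d)
    x = dnn_layer(x, W, b, phi="tanh")
```

The final state `x` would then be fed to a simple affine classifier, as in Eq. (3).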

The

$\begin{aligned}
\mathrm{decision}=\mathrm{sign}(\mathbf{w}^T\mathbf{x}_D+b).
\end{aligned} \tag{3}$

The overall structure of DNN studied in this paper is summarized in Fig. 2. Note that we do not consider more specific structures for the DNN in this paper. For example, more structures are embedded into the convolutional neural networks (CNNs), recurrent neural networks (RNNs) and so on, which improve the performance over the generic structure of DNNs. It will be our future work to incorporate these special structures into the dynamical systems based study.

As justified in the introduction, we can consider a DNN as a dynamical system. However, in traditional dynamical systems, the mapping of the system state, denoted by

$\begin{aligned}
\mathbf{x}_i=T_{i}(\mathbf{x}_{i-1}).
\end{aligned} \tag{4}$

In the context of DNN,

In this section, we discuss the topological entropy and chaos, characterized by the Lyapunov exponent, of DNN. We first define these quantities in dynamical systems. Then, we analyze the chaotic behaviors in DNN.

We first define the topological entropy in a space

$\begin{aligned}
m^n(x_1,x_2)=\max_{i=1,...,n}m\left(T^{i-1}(x_1),T^{i-1}(x_2)\right).
\end{aligned} \tag{5}$

We define the

$\begin{aligned}
r(T,n,\epsilon)=\min_{S \text{ is an } (n,\epsilon)\text{-spanning set}}|S|.
\end{aligned} \tag{6}$

Then we define the spanning orbits based topological entropy as follows, where the subscript

**Definition 1**. *The spanning orbits based topological entropy is defined as $h_{s}(T)=\lim_{\epsilon\rightarrow 0}h_{s}(T,\epsilon)$, where $h_{s}(T,\epsilon)=\lim_{n\rightarrow \infty}\frac{1}{n}H_s(T,n,\epsilon)$, and $H_s(T,n,\epsilon)=\log_2 r(T,n,\epsilon)$.*

A detailed introduction to alternative definitions of topological entropy can be found in Appendix 8.

Note that the above definition is for traditional dynamical systems. In random dynamical systems, the topological entropy can be defined in the same manner as that in Definition 1, simply by replacing

The conventional topological entropy measures the number of distinct paths (up to diminishing errors) in the dynamical system. In this paper we extend it to the topological entropy of an ensemble of dynamics. For an integer

$\begin{aligned}
\tilde{T}^L_t(\mathbf{z})=(T_t(\mathbf{x}^1),...,T_t(\mathbf{x}^L)).
\end{aligned} \tag{7}$

Consider

As metrics characterizing the chaotic behavior of dynamical systems, the Lyapunov spectrum [8] quantifies how fast two nearby points depart from each other as the system state evolves with time, as illustrated in Fig. 3. Considering an initial point

$\begin{aligned}
\mathbf{J}_t(\mathbf{x}_0)=\mathbf{J}_{\mathbf{x}}(T_t\circ T_{t-1} \circ ... \circ T_1(\mathbf{x}))|_{\mathbf{x}=\mathbf{x_0}},
\end{aligned} \tag{8}$

where

$\begin{aligned}
\mathbf{L}(\mathbf{x}_0)=\lim_{t\rightarrow\infty} (\mathbf{J}_t(\mathbf{x}_0)\mathbf{J}_t^T(\mathbf{x}_0))^{\frac{1}{t}},
\end{aligned} \tag{9}$

whose convergence is guaranteed by the Multiplicative Ergodic Theorem proved by V. I. Oseledec [9]. Then, we define the

$\begin{aligned}
\lambda_i(\mathbf{x}_0)=\log_2 \Lambda_i(\mathbf{x}_0),
\end{aligned} \tag{10}$

where

For the random dynamics in DNN, we have the following lemma characterizing the Lyapunov spectrum, which can be easily shown by calculus given the structure of DNN:

**Lemma 1**. *For the random dynamics in DNN, we have *

$\begin{aligned}
\mathbf{J}_t(\mathbf{x}_0)=\prod_{i=1}^t \left( \mathbf{S}_i(\mathbf{x}_i)\mathbf{W}_i\right)
\end{aligned} \tag{11}$

* where when ** is the hyperbolic tangent ** we have *

$\begin{aligned}
\mathbf{S}_i(\mathbf{x}_i)=\mathrm{diag}\left(1-x_{ij}^2\right)_{j=1,...,n},
\end{aligned} \tag{12}$

* and when ** is the ReLU function we have *

$\begin{aligned}
\mathbf{S}_i(\mathbf{x}_i)=\mathrm{diag}\left(\mathrm{sign}\left(x_{ij}\right)\right)_{j=1,...,n},
\end{aligned} \tag{13}$

* where ** equals 1 if ** and 0 if **.*
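Lemma 1 suggests a direct numerical recipe: accumulate the Jacobian product of Eq. (11) along a trajectory and read the Lyapunov spectrum off Eqs. (9)–(10). The following sketch does this for a small random tanh network; the width, depth, and Gaussian weight statistics are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d, t, sigma = 5, 20, 1.5            # illustrative width, depth, weight scale

x = rng.standard_normal(d)
J = np.eye(d)
for _ in range(t):
    W = rng.normal(0.0, sigma / np.sqrt(d), size=(d, d))
    b = rng.normal(0.0, 0.1, size=d)
    x = np.tanh(W @ x + b)
    S = np.diag(1.0 - x**2)         # derivative of tanh at this layer
    J = S @ W @ J                   # Eq. (11): J_t = prod_i S_i(x_i) W_i

# Eqs. (9)-(10): the exponents follow from the singular values of J_t,
# since the eigenvalues of (J J^T)^{1/t} are (sigma_i^2)^{1/t}
sing = np.linalg.svd(J, compute_uv=False)
lyap = (2.0 / t) * np.log2(sing)    # bits per layer, descending order
```

For finite `t` this is only an approximation of the limit in Eq. (9); longer products require periodic re-orthonormalization (e.g., QR) to remain numerically stable.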

The topological entropy can be derived from the Lyapunov spectrum through the following theorem (as a corollary of Theorem A in [11]):

**Theorem 1** ([11]). *Suppose that the random dynamics with transformations **, **, ..., are ergodic. Then, we have *

$\begin{aligned}
h_s(T_1,T_2,...)=\sum_{i=1}^n \max\{\lambda_i,0\},
\end{aligned} \tag{14}$

* where ** is identical for all initial points due to the ergodicity.*

**Remark 1**. *For non-ergodic dynamics, we also have similar conclusions *[11]*. We can calculate the topological entropy by evaluating the Lyapunov spectrum. In the following discussion, we will study the Lyapunov spectrum at different points in the state space, and thus obtain the conclusions on the topological entropy of DNNs.*

The Lyapunov spectrum could be dependent on the initial value if the dynamics are non-ergodic. Hence, the system could be chaotic at certain points, while being non-chaotic in the remainder of the state space. The following theorems show that the maximum Lyapunov exponent of the DNN dynamics could be arbitrarily large. This implies that a DNN, as a dynamical system, could be highly chaotic at certain regions, thus being able to form highly complicated classification surfaces.

The following theorem shows the existence of chaotic behavior when the DNN uses the hyperbolic tangent function. The proof is given in Appendix 9.

**Theorem 2**. *Consider the hyperbolic tangent function **. There exist parameters of DNN and inputs (namely the initial vector of the dynamical system) such that the maximum Lyapunov exponent ** of the corresponding random dynamics is arbitrarily large.*

For the ReLU function case, the challenge is the unboundedness of the state space (although many practical algorithms add constraints on the parameters such as normalization). It has been shown that an

To this end, we consider the angle of the system states in the state space, and omit the norm of the system states. This is reasonable, since the linear separability of the system state samples at the DNN output is determined by the angles, instead of the norms. The following theorem shows the existence of chaotic behavior in the angles of the system states for the 2-dimensional case. The proof is given in Appendix 10. Note that the 2-dimensional case can be easily extended to more generic cases.

**Theorem 3**. *Consider a 2-dimensional DNN with the ReLU function, where the system state is **. The angle of the system state is denoted by **. Then, for any number **, one can always find a set of parameters of the DNN and a time ** such that *

$\begin{aligned}
\left\|\nabla_{\mathbf{x}(0)}\rho(\mathbf{x}(T))\right\|>C.
\end{aligned} \tag{15}$

In the above analysis, the linear mappings are designed specifically to result in chaos. It demonstrates only that the nonlinearity brought by the operation

Due to the difficulty of mathematical analysis, we first carry out numerical calculations for the maximum Lyapunov exponent

The analysis on the Lyapunov spectrum of the random dynamics incurred by the Gaussian ensemble of mappings is difficult. The challenge is that there are two series of dynamics in the DNN, namely that of the tangent vector

$\begin{aligned}
\mathbf{u}_t=\mathbf{S}_t(\mathbf{x}_t)\mathbf{W}_t\mathbf{u}_{t-1},
\end{aligned} \tag{16}$

and that of the system state

$\begin{aligned}
\mathbf{x}_t=\phi\left(\mathbf{W}_t\mathbf{x}_{t-1}+\mathbf{b}_t\right).
\end{aligned} \tag{17}$
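The coupled recursions (16) and (17) can be simulated jointly: renormalizing the tangent vector at each layer and averaging the log growth gives a Monte Carlo estimate of the maximum Lyapunov exponent. In the sketch below, the width, depth, and the weight scale $\sigma=2$ (chosen large enough to land in the chaotic regime) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
d, T, sigma = 50, 200, 2.0          # illustrative width, depth, weight scale

x = rng.standard_normal(d)
u = rng.standard_normal(d)
u /= np.linalg.norm(u)
log_growth = 0.0
for _ in range(T):
    W = rng.normal(0.0, sigma / np.sqrt(d), size=(d, d))
    b = rng.normal(0.0, 0.1, size=d)
    x = np.tanh(W @ x + b)          # state dynamics, Eq. (17)
    u = (1.0 - x**2) * (W @ u)      # tangent dynamics, Eq. (16)
    norm = np.linalg.norm(u)
    log_growth += np.log2(norm)     # accumulate growth, then renormalize
    u /= norm

lam_max = log_growth / T            # maximum Lyapunov exponent (bits/layer)
```

With these (assumed) parameters the estimate comes out positive, consistent with the chaotic behavior discussed below; shrinking $\sigma$ drives it negative.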

In Theorem 2, we design the mappings such that the dynamics of

$\begin{aligned}
\beta_t=\frac{E\left[\|E\left[\mathbf{S}_t(\mathbf{x}_t)\right]\mathbf{W}_t\mathbf{u}_{t-1}\|^2\right]}{E\left[\|\mathbf{u}_{t-1}\|^2\right]},
\end{aligned} \tag{18}$

where the numerator is the norm of the tangent at time

**Assumption 1**. *We make the following assumptions for the dynamics of DNN with random linear mappings; their justification is given in Appendix **11**.*

- *When* $\beta_t>1$ *for all sufficiently large* $t$ *, the norm of* $\mathbf{u}_t$ *is expected to increase exponentially, thus resulting in chaotic behavior.*
- *When* $t$ *and* $n$ *are sufficiently large, we have* $\sum_{i=1}^nx_{ti}^2=C$ *, where* $C$ *is a time-invariant constant.*

Given Assumption 1, we have the following theorem providing a condition for the chaotic behavior with random linear mappings. The proof is given in Appendix 11.

**Theorem 4**. *Given Assumption **1**, we have ** for the DNN dynamics with Gaussian random linear mappings, if *

$\begin{aligned}
\left(1-4\tanh \left(\sigma\sqrt{\frac{2}{\pi}}h\left(\sigma\sqrt{\frac{2}{\pi}}\right)\right)\right)d\sigma^2>1,
\end{aligned} \tag{19}$

* where ** is the nonzero and positive solution to the equation ** (where ** is the unknown while ** is the parameter).*

**Remark 2**. *The above inequality is based on the mean dynamics analysis and Assumption **1**. However, it shows that higher dimensional dynamics are more likely to be chaotic. Moreover, a too large or too small ** disables the chaotic behavior. Both coincide with the observations in the numerical results. An explicit expression for the function ** in terms of an infinite series, based on the Lagrange Inversion Formula *[15]*, can also be found in Appendix **11**.*

In this section, we apply the metrics characterizing the dynamics of DNN to the analysis of classification and generalization capabilities.

We first define the complexity of classification problem and then relate it to the topological entropy of DNN dynamics.

We define the complexity of classification, which facilitates the subsequent analysis of DNN classification. The philosophy is to measure how difficult it is to separate the samples from different classes. There have been some studies on how to define the complexity of pattern recognition [16][17][18][19]. However, they are mainly focused on the nonlinearity of classifier and the overlap of features, which is significantly different from our proposed definition.

For defining the complexity of classification, we first need the following definition of set radius.

**Definition 2**. *For a connected and compact set ** in **, we define its radius as **. Then, we define the radius of a partition ** of ** as *

Then, we define the classification complexity as follows:

**Definition 3**. *Given the sample set **, consisting of ** labeled by +1 and ** labeled by -1, we say that a Borel set ** is hybrid with respect to ** if **, **. Suppose that ** is contained in a compact set ** (e.g., the convex hull of **). For a partition ** of **, we define *

$\begin{aligned}
N(P|\mathbf{X})=\left|\{A \,|\, A\in P,\ A \text{ is hybrid w.r.t. } \mathbf{X}\}\right|.
\end{aligned} \tag{20}$

* Then, we define the **-complexity given the sample set ** as *

$\begin{aligned}
C(\mathbf{X},\epsilon)=\log_2 \inf_{P:r(P)\leq \epsilon}N(P|\mathbf{X}).
\end{aligned} \tag{21}$

**Remark 3**. *The complexity of classification in the above equation characterizes how the two classes of samples are mixed with each other. It also characterizes the size of the boundary surface between the two classes; the larger, the more complex. As will be seen, the definition of classification complexity will facilitate the analysis based on topological entropy. The links between the classification complexity and related metrics are discussed in Appendix **12**.*
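A regular grid partition gives a computable upper estimate of the complexity in Eq. (21), since the infimum runs over all partitions. The sketch below (with made-up two-dimensional sample sets) counts the hybrid cells of Eq. (20) on a grid whose cells have side $\epsilon$:

```python
import numpy as np

rng = np.random.default_rng(1)

def complexity_upper_bound(X, y, eps):
    """Upper estimate of the eps-complexity C(X, eps) of Eq. (21): partition
    the space by a regular grid with cells of side eps and count the 'hybrid'
    cells of Eq. (20), i.e. cells containing samples of both labels."""
    cells = {}
    for idx, label in zip(np.floor(X / eps).astype(int), y):
        cells.setdefault(tuple(idx), set()).add(int(label))
    n_hybrid = sum(1 for labels in cells.values() if len(labels) == 2)
    return np.log2(n_hybrid) if n_hybrid > 0 else 0.0

# well-separated clusters: no hybrid cells at this scale
X_easy = np.vstack([rng.normal(-3, 0.3, (100, 2)), rng.normal(3, 0.3, (100, 2))])
y_easy = np.array([1] * 100 + [-1] * 100)

# finely interleaved stripes: almost every cell is hybrid
X_hard = rng.uniform(-1, 1, (400, 2))
y_hard = np.where(np.floor(X_hard[:, 0] / 0.1).astype(int) % 2 == 0, 1, -1)

c_easy = complexity_upper_bound(X_easy, y_easy, eps=0.5)
c_hard = complexity_upper_bound(X_hard, y_hard, eps=0.5)
```

The separated clusters yield zero complexity at this scale, while the interleaved stripes make nearly all cells hybrid, illustrating how the measure captures the size of the boundary between the classes.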

Now, we apply the concept of topological entropy to the analysis of DNN, in terms of the capability of classification. We need the following definition:

**Definition 4**. *Consider two sets of vectors ** and **. We say that ** and ** are affine separable with an **-margin, if there exists an affine classifier ** (where **) such that *

$\begin{aligned}
\left\{
\begin{array}{ll}
&\mathbf{w}^T\mathbf{z}_1-b\geq \frac{\epsilon}{2}\\
&\mathbf{w}^T\mathbf{z}_2-b\leq -\frac{\epsilon}{2}
\end{array}
\right..
\end{aligned} \tag{22}$
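Definition 4 is straightforward to verify for a given affine classifier; a small sketch (with made-up sample points) checks the two inequalities of Eq. (22):

```python
import numpy as np

def affine_separable_with_margin(Z1, Z2, w, b, eps):
    """Check Eq. (22): every point of Z1 scores at least eps/2 and every
    point of Z2 at most -eps/2 under the affine classifier (w, b)."""
    s1 = Z1 @ w - b
    s2 = Z2 @ w - b
    return bool(np.all(s1 >= eps / 2) and np.all(s2 <= -eps / 2))

Z1 = np.array([[1.0, 0.0], [2.0, 1.0]])    # hypothetical class +1 outputs
Z2 = np.array([[-1.0, 0.0], [-2.0, 3.0]])  # hypothetical class -1 outputs
w, b = np.array([1.0, 0.0]), 0.0
```

For these points the classifier achieves a margin of 1 but not a margin of 3, so the property depends jointly on the sample set and the requested $\epsilon$.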

Then, we have the following lower bound on the number of layers in DNN in the following theorem, whose proof is given in Appendix 13:

**Theorem 5**. *Suppose that the output of DNN hidden layers is affine separable with **-margin for the sample set **. Then, the number of layers ** is lower bounded by *

$\begin{aligned}
D\geq \frac{C(\mathbf{X},\epsilon)+1}{H_s(\{T_n\}_{n=1,..,D},D,\epsilon)}.
\end{aligned} \tag{23}$

**Remark 4**. *Recall that ** is defined in Definition **1**. Note that, when ** is sufficiently large and ** is sufficiently small, ** can be well approximated by **, which becomes independent of ** and **.*

An over-complex model for classification may fail to generalize to samples beyond the training set. Hence, it is important to study the generalization capability of DNNs, characterized by the VC-dimension [20]. Note that the VC-dimension is usually studied by quantifying the separated components in the parameter space [20]. Meanwhile, the VC-dimension of DNN has been substantially explored in [21], whose setup is slightly different from ours, since it addresses piecewise-linear DNNs. Our results may not be as precise as those in [21]; however, we leverage the layered structure of DNN and may provide a new direction for studying the VC-dimension of DNN. Note that the generalization capability of DNN has been shown to be better than predicted by the VC-dimension argument in typical applications [22]. Many explanations have been given for the excellent generalization capability of DNN, e.g., [23]. However, our study shows a new recursive approach to explore the generalization of DNN.

It is well known that the generalizability of a hypothesis set

**Definition 5**. *We say that a set of ** points is fully shattered by a binary hypothesis set ** if the hypotheses in ** can realize all the ** possible labelings for **. The VC-dimension is defined as *

$\begin{aligned}
d_{VC}(H)=\max\{|S|: S \text{ is fully shattered by }H\}.
\end{aligned} \tag{24}$

The relationship between the VC-dimension and the generalization capability can be found in [24].
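Definition 5 can be probed numerically for small hypothesis sets. As a stand-in example (affine classifiers in the plane, not the DNN hypothesis set itself), the sketch below samples random classifiers and records which labelings they realize: three points in general position are shattered, while four points are not, reflecting the well-known VC-dimension of 3 for affine classifiers in $\mathbb{R}^2$:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(4)

def is_shattered(points, n_trials=20000):
    """Check Definition 5 by sampling random affine classifiers sign(w.x + b)
    and testing whether all 2^n labelings of `points` are realized."""
    realized = set()
    for _ in range(n_trials):
        w = rng.standard_normal(2)
        b = rng.standard_normal()
        lab = tuple(np.sign(points @ w + b).astype(int))
        if 0 not in lab:                    # skip degenerate zero scores
            realized.add(lab)
    return all(l in realized for l in product([-1, 1], repeat=len(points)))

triangle = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
square = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

Random sampling can certify shattering but only suggests non-shattering; for the square, the XOR labeling of the two diagonals is provably unrealizable by any affine classifier, so the negative answer is in fact exact.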

In the subsequent analysis, we find both lower and upper bounds for the VC-dimension from the view point of dynamical systems. Based on the ensemble topological entropy, we have the following upper bound for the VC-dimension of DNN. The proof is given in Appendix 14.

**Theorem 6**. *The VC-dimension of DNN is upper bounded by*

$\begin{aligned}
d_{VC}\leq \sup_{\{T_{mt}\}_{m,t},\mathbf{z}}H_e(\{T_{mt}\}_{m=1,...,2^{d_{VC}},t=1,...,D},D,\epsilon,\mathbf{z})D,
\end{aligned} \tag{25}$

*when the DNN output is affine separable with ** margin.*

**Remark 5**. *Again the quantity ** can be approximated by ** when ** is sufficiently large and ** is sufficiently small. The above upper bound for the VC-dimension of DNN is obtained from the complexity analysis. Unfortunately, we are still unable to numerically or analytically evaluate the ensemble topological entropies ** or **.*

We can also analyze the details of shattering the samples by leveraging the nonlinearity of

**Theorem 7**. *The number of distinct samples that can be shattered by a DNN is at least **, where ** is the number of layers. Therefore, the VC-dimension of DNN is lower bounded by *

**Remark 6**. *The conclusion shows that the VC-dimension increases at least linearly with the number of layers. The layer-by-layer analysis may be extended to find finer bounds, since we only analyzed the 2-dimensional case in the proof.*

In this section, we use the ResNet-101 classifier, together with the image set called *ImageNet*, to evaluate the largest Lyapunov exponents for practical DNNs and real images. For notational simplicity, we call the largest Lyapunov exponent the Lyapunov exponent.

We used the pretrained ResNet-101 [25], which is a convolutional deep neural network and is available in Matlab codes. The DNN of ResNet-101 has 101 layers. The input and output of ResNet-101 are images and the corresponding image categories. The coefficients of the DNN are trained using the celebrated ImageNet database [26]. The evaluation of the Lyapunov exponents is based on the image set ‘Caltech 101’, which contains 101 classes of images. Each image is normalized to the size of

For traditional dynamical systems, the Lyapunov exponents can be estimated using the approaches in [27] (pp. 71–73). However, these algorithms are designed for dynamics having identical mappings in each round, while the DNN dynamics are random, and have different nonlinear mappings in each layer due to different coefficients. Moreover, the number of layers of ResNet-101 (i.e., 101 layers) may not be sufficiently large such that the maximum increasing direction emerges spontaneously from any initial direction. Therefore, we propose the algorithm in Procedure 1 to approximately estimate the Lyapunov exponents.

**Algorithm 1**.

Consider each image as a point in the ‘image space’, whose dimension is 50176. The Lyapunov exponent estimated using Procedure 1 can then be considered as the Lyapunov exponent at the corresponding point in this space.
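The exact steps of Procedure 1 are not reproduced here, but a plausible finite-difference estimator in the same spirit can be sketched as follows: perturb the input point, propagate both points through the layers, and divide the log separation ratio by the number of layers. A small random ReLU network (He-scaled Gaussian weights) stands in for ResNet-101, and all sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)
d, depth = 64, 50                   # illustrative dimension and depth
layers = [(rng.normal(0.0, np.sqrt(2.0 / d), (d, d)), rng.normal(0.0, 0.1, d))
          for _ in range(depth)]

def forward(x):
    for W, b in layers:
        x = np.maximum(W @ x + b, 0.0)   # ReLU layers, as in Eq. (1)
    return x

def lyapunov_fd(x0, delta=1e-6):
    """Finite-difference estimate of the largest Lyapunov exponent at x0:
    log2 of the output separation per layer for a small input perturbation."""
    v = rng.standard_normal(len(x0))
    v *= delta / np.linalg.norm(v)       # perturbation of norm delta
    gap = np.linalg.norm(forward(x0 + v) - forward(x0))
    return np.log2(gap / delta) / depth

lam = lyapunov_fd(rng.standard_normal(d))
```

As in the finite-depth setting discussed above, a single random perturbation direction only lower-bounds the growth along the maximum direction; averaging over several directions, or taking the maximum, tightens the estimate.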

In Fig. 6, the trajectories beginning from an image point and a perturbed point are plotted. Since the trajectories are in a very high dimensional space, we plot only the trajectories on two dimensions. We observe that, although the perturbation is very small, the two trajectories are significantly different. Note that the trajectories return to the origin frequently, which is due to the function of ReLU.

The histogram of Lyapunov exponents obtained from the 8733 images and the algorithm in Procedure 1 is given in Fig. 7. We observe that the distribution has a bell shape and is concentrated around the value of 0.005. Some are negative, which means that the perturbation diminishes. We notice that the Lyapunov exponents at very few points are close to 0.2. The images of the highest 4 Lyapunov exponents are given in Fig. 8. It is not clear why these 4 images have large Lyapunov exponents. However, the large Lyapunov exponents imply that the DNN classifier is substantially more sensitive at these images, which may imply that they are closer to other classes of images and are thus more difficult to classify. This possible implication will be tested in our future research.

We also plot the expectation and variance of the Lyapunov exponents for the 101 classes in the plane, as shown in Fig. 9. The four classes of images having the greatest variance of Lyapunov exponents are shown in Fig. 10. We notice that the variances within the individual image classes are around 6e-6, while the overall variance is 7e-6. Therefore, the distribution of the Lyapunov exponents is essentially independent of the image class.

In this paper, we have proposed to consider DNN as a random dynamical system, and thus study the properties of DNNs from the viewpoint of dynamical systems. To characterize the complexity of the DNN dynamical system, we have analyzed the corresponding topological entropy and Lyapunov exponents. In particular, we have demonstrated that there may exist chaotic behavior due to the nonlinearity of DNN, which explains why DNN can form complicated classification surfaces. In our future study, we will calculate the topological entropy either analytically (e.g., in the high dimensional regime) or numerically (e.g., using the generating function of symbolic dynamics).

Here we provide a rigorous definition for random dynamical systems [28]. A measurable random dynamical system is a 4-tuple

- $\Omega$ : the sample space representing the random events in the random dynamical system;
- $\mathcal{F}$ : the $\sigma$ -algebra of $\Omega$ ;
- $\mathbb{P}$ : the distribution over $\Omega$ ;
- $\mathbb{T}$ : the set of time indices;
- $\theta$ : the time shift acting on the random event $w$ .

The random dynamical system provides a mapping

$\begin{aligned}
\psi:\mathbb{T}\times \Omega\times X\rightarrow X,
\end{aligned} \tag{26}$

namely mapping the system state

The dynamical system

There are three equivalent definitions for the topological entropy [29]. We will introduce them separately in the subsequent discussion. For the three definitions, we consider a topological dynamical system represented by a triple

We first define the topological entropy based on spanning orbits, namely a set of typical orbits of the dynamics, such that for any orbit one can always find a sufficiently close one in this set. To this end, we need to define the distance between two orbits, namely the metric

$\begin{aligned}
d^n(x,y)=\max\left\{d\left(T^ix,T^iy\right),i=0,...,n-1\right\},
\end{aligned} \tag{27}$

for orbits with initial points

Now we define the ‘typical set’ of orbits (or equivalently the initial points). We say a set

Then, we define

$\begin{aligned}
r(n,\epsilon)=\min\left\{\# F: F \text{ is an } (n,\epsilon)\text{-spanning set}\right\},
\end{aligned} \tag{28}$

namely the cardinality of the minimum

The spanning orbits based topological entropy is then defined as follows, where the subscript

**Definition 6**. *The topological entropy of the dynamical system ** is defined as *

$\begin{aligned}
\left\{
\begin{array}{lll}
H_s(T,n,\epsilon)&=\log_2 r(n,\epsilon)\\
h_{s}(T,\epsilon)&=\lim_{n\rightarrow \infty}\frac{1}{n}H_s(T,n,\epsilon)\\
h_{s}(T)&=\lim_{\epsilon\rightarrow 0}h_s(T,\epsilon)
\end{array}
\right..
\end{aligned} \tag{29}$

Here,

The topological entropy defined from separated orbits is similar to that defined from the spanning orbits. In both cases, we find a set of ‘typical’ initial points or the corresponding orbits, in order to encode the possible orbits. In the spanning orbits case, we begin from a large number of typical orbits, which may be more than enough, and then find the minimum set of orbits. In contrast, in the separated orbits based definition, we consider points with neighborhoods of a given radius. Then, we squeeze as many such orbits (or initial points) as possible into the space. As will be seen later, both selections of typical orbits yield the same topological entropy in the asymptotic sense.

To this end, we say that a subset

$\begin{aligned}
s(n,\epsilon)=\max\{|F|: F \text{ is } (n,\epsilon)\text{-separated}\},
\end{aligned} \tag{30}$

which means the maximum number of points that one can ‘squeeze’ into the space

Then, we define the metric based topological entropy as

**Definition 7**. *We define the topological entropy as *

$\begin{aligned}
\left\{
\begin{array}{lll}
H_m(n,\epsilon)&=\log_2 s(n,\epsilon)\\
h_m(T,\epsilon)&=\limsup_{n\rightarrow\infty} \frac{1}{n}H_m(n,\epsilon)\\
h_m(T)&=\lim_{\epsilon\rightarrow 0}h_m(T,\epsilon)
\end{array}
\right..
\end{aligned} \tag{31}$

Here
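As a toy illustration of Definition 7 (not taken from the paper), one can estimate $s(n,\epsilon)$ for the doubling map $T(x)=2x \bmod 1$ by greedily extracting a maximal $(n,\epsilon)$-separated set from a fine grid of initial points; the per-step increment of $\log_2 s(n,\epsilon)$ then approaches the known topological entropy of this map, 1 bit per iteration:

```python
import numpy as np

def orbit(x0, n):
    """First n points of the doubling-map orbit, T(x) = 2x mod 1."""
    xs = np.empty(n)
    x = x0
    for i in range(n):
        xs[i] = x
        x = (2.0 * x) % 1.0
    return xs

def separated_count(n, eps, grid=8000):
    """Greedy maximal (n, eps)-separated set over a grid of initial points,
    approximating s(n, eps) of Eq. (30) from below."""
    orbits = np.array([orbit(k / grid, n) for k in range(grid)])
    kept = []
    for i in range(grid):
        # keep the point if its orbit stays more than eps away (in the
        # d^n metric) from the orbits of all points kept so far
        if not kept or np.min(np.max(np.abs(orbits[kept] - orbits[i]), axis=1)) > eps:
            kept.append(i)
    return len(kept)

eps = 0.05
h_est = np.log2(separated_count(7, eps)) - np.log2(separated_count(6, eps))
```

The difference of consecutive $\log_2 s(n,\epsilon)$ values is a finite-$n$ surrogate for the limit in Definition 7 and should be close to 1 bit per step; grid resolution and the finite $\epsilon$ bias it slightly.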

As the third type of definition of topological entropy, the cover based definition of topological entropy does not need the definition of metric. Hence it is still valid in spaces without a metric structure. The requirement is the topological structure, namely the definition of open sets. A cover of

We define the joint of two covers

$\begin{aligned}
\mathcal{U}\vee \mathcal{V}=\left\{U\cap V: U\in \mathcal{U},V\in \mathcal{V}\right\}.
\end{aligned} \tag{32}$

We further define

$\begin{aligned}
\mathcal{U}^n=\vee_{i=0,...,n-1}T^{-i}(\mathcal{U}),
\end{aligned} \tag{33}$

namely the joint set of the covers obtained from tracing back the open sets in

Based on the above definitions on cover sets, we can now define the topological entropy:

**Definition 8**. *The topological entropy of the dynamical system ** is defined as *

$\begin{aligned}
\left\{
\begin{array}{lll}
H_c(\mathcal{U})&=\log_2 N(\mathcal{U})\\
h_c(T,\mathcal{U})&=\lim_{n\rightarrow\infty}\frac{1}{n}H_c\left(\mathcal{U}^n\right)\\
h_c(T)&=\lim_{\mathcal{U}}\uparrow h_c(T,\mathcal{U})
\end{array}
\right..
\end{aligned} \tag{34}$

Here

The above three definitions of topological entropy look different, focusing on different aspects of the dynamical system. In particular, the cover based definition is significantly different from the first two. A surprising result is that they are actually equivalent in metric spaces:

**Theorem 8** ([30]). *In metric spaces, we have *

$\begin{aligned}
h_s(T)=h_m(T)=h_c(T).
\end{aligned} \tag{35}$

The proof is highly nontrivial. The details can be found in Chapter 6 of [31]. Due to their equivalence, we use only the notation

Here we consider the case of hyperbolic tangent function for the DNN. Before we provide the rigorous proof, we provide the intuition for the possibility of the chaotic behavior in the DNN dynamics. The maximum Lyapunov exponent

For scalar dynamics, namely

The reason for the lack of chaotic behavior in the scalar dynamics of DNN is that

The following proof of Theorem 2 follows the above idea of decoupling the system state

*Proof.* There are two approaches to generate chaos in the DNN dynamics by designing the DNN parameters carefully. We assume a 2-dimensional system state.

**Approach 1**: In the first approach, we force the system state $\mathbf{x}_t$ to follow a prescribed trajectory, while the perturbation $\mathbf{u}_t$ is amplified by a constant factor $A>1$ in each layer.

For the $t$-th layer, we require

$\begin{aligned}
\mathbf{W}_t\mathbf{x}_{t-1}=\arg\tanh(\mathbf{x}_t),
\end{aligned} \tag{36}$

such that $\tanh\left(\mathbf{W}_t\mathbf{x}_{t-1}\right)=\mathbf{x}_t$. Letting $\tilde{\mathbf{x}}_t=\arg\tanh(\mathbf{x}_t)$, the requirements on $\mathbf{W}_t$ are

$\begin{aligned}
\left\{
\begin{array}{ll}
&\mathbf{W}_t\mathbf{u}_{t-1}=A\mathbf{u}_{t-1}\\
&\mathbf{W}_t\mathbf{x}_{t-1}=\tilde{\mathbf{x}}_t
\end{array}
\right.,
\end{aligned} \tag{37}$

which can be rewritten as

$\begin{aligned}
\mathbf{W}_t\mathbf{B}_t=\mathbf{C}_t,
\end{aligned} \tag{38}$

where $\mathbf{B}_t=\left(\mathbf{u}_{t-1},\mathbf{x}_{t-1}\right)$ and $\mathbf{C}_t=\left(A\mathbf{u}_{t-1},\tilde{\mathbf{x}}_t\right)$, the vectors being stacked as columns. When $\mathbf{u}_{t-1}$ and $\mathbf{x}_{t-1}$ are linearly independent, $\mathbf{B}_t$ is invertible and

$\begin{aligned}
\mathbf{W}_t=\mathbf{C}_t\mathbf{B}_t^{-1}.
\end{aligned} \tag{39}$

Then, the perturbation after the $t$-th layer is given by

$\begin{aligned}
\mathbf{u}_t&=&\begin{pmatrix}
1-x_{t1}^2& 0\\
0 & 1-x_{t2}^2
\end{pmatrix}
\mathbf{W}_t\mathbf{u}_{t-1}\\
&=&A\begin{pmatrix}
1-x_{t1}^2& 0\\
0 & 1-x_{t2}^2
\end{pmatrix}
\mathbf{u}_{t-1}
\end{aligned} \tag{40}$

Now, we need to guarantee that $\mathbf{u}_t$ and $\mathbf{x}_t$ are linearly independent, i.e., to exclude the degenerate case

$\begin{aligned}
u_{t1}x_{t2}=u_{t2}x_{t1},
\end{aligned} \tag{41}$

namely

$\begin{aligned}
u_{t-1,1}x_{t2}(1-x_{t1}^2)=u_{t-1,2}x_{t1}(1-x_{t2}^2).
\end{aligned} \tag{42}$

Unless both sides of (42) vanish simultaneously, this degenerate equality can be avoided by slightly perturbing the choice of $\mathbf{x}_t$.

Using induction on (40), we have

$\begin{aligned}
\mathbf{u}_t=A^{t}\prod_{l=1}^t\begin{pmatrix}
1-x_{l1}^2& 0\\
0 & 1-x_{l2}^2
\end{pmatrix}\mathbf{u}_0.
\end{aligned} \tag{43}$

Hence, we have

$\begin{aligned}
\frac{\|\mathbf{u}_t\|}{\|\mathbf{u}_0\|}\geq \left(A(1-r^2)\right)^t.
\end{aligned} \tag{44}$

When we choose a sufficiently large $A$ such that $A(1-r^2)>1$, the perturbation grows exponentially with $t$, which implies a positive maximum Lyapunov exponent.
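Approach 1 can be verified numerically. The sketch below (our own check; $A$, $r$, and the random trajectory are arbitrary choices) constructs $\mathbf{W}_t$ by (39), confirms that the state follows the prescribed trajectory, and checks the growth bound (44) for the linearized perturbation (40):

```python
import numpy as np

rng = np.random.default_rng(0)
A, r, T = 3.0, 0.5, 20                 # A * (1 - r^2) = 2.25 > 1

x = rng.uniform(-r, r, 2)              # x_0; all coordinates bounded by r
u = np.array([1.0, 0.7])               # initial perturbation u_0
u0_norm = np.linalg.norm(u)

for _ in range(T):
    x_next = rng.uniform(-r, r, 2)                     # prescribed next state x_t
    B = np.column_stack([u, x])                        # B_t = (u_{t-1}, x_{t-1})
    C = np.column_stack([A * u, np.arctanh(x_next)])   # C_t = (A u_{t-1}, arg tanh x_t)
    W = C @ np.linalg.inv(B)                           # W_t = C_t B_t^{-1}, eq. (39)
    assert np.allclose(np.tanh(W @ x), x_next)         # the state follows the trajectory
    u = (1.0 - x_next ** 2) * (W @ u)                  # linearized perturbation, eq. (40)
    x = x_next

growth = np.linalg.norm(u) / u0_norm
print(growth, (A * (1 - r ** 2)) ** T)                 # growth respects the bound (44)
```

Each coordinate of the perturbation is multiplied by at least $A(1-r^2)$ per layer, so the observed growth ratio dominates $\left(A(1-r^2)\right)^T$.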

**Approach 2**: In the second approach, the DNN dynamics force the system state to rational points in $\Omega$. Note that

$\begin{aligned}
\max_{\mathbf{x}\in \Omega}\|\mathbf{x}\|_1\leq r<1,
\end{aligned} \tag{45}$

where the maximum is achieved due to the compactness of $\Omega$.

We can enumerate all the rational points (i.e., the points whose coordinates are rational numbers) in $\Omega$ and let the system state visit them successively. We choose the initial perturbation $\mathbf{u}_0$ such that

$\begin{aligned}
\frac{u_{01}}{u_{02}}\notin \mathbb{Q},
\end{aligned} \tag{46}$

namely the ratio of the two coordinates in $\mathbf{u}_0$ is irrational.

Then, we design the linear mappings in the same way as in Approach 1, except that the design of $\mathbf{W}_t$ satisfies

$\begin{aligned}
\left\{
\begin{array}{ll}
&\mathbf{W}_t\mathbf{u}_{t-1}=A\mathbf{u}_{t-1}\\
&\mathbf{W}_t\mathbf{x}_{t-1}=\arg\tanh\left(\mathbf{x}_t\right)
\end{array}
\right.,
\end{aligned} \tag{47}$

The degenerate case of linear dependence between $\mathbf{u}_t$ and $\mathbf{x}_t$, which must be excluded, is

$\begin{aligned}
\frac{u_{t1}}{u_{t2}}=\frac{x_{t1}}{x_{t2}}.
\end{aligned} \tag{48}$

According to (43), we have

$\begin{aligned}
\frac{u_{t1}}{u_{t2}}=\frac{u_{01}}{u_{02}}\prod_{l=1}^t\left(\frac{1-x_{l1}^2}{1-x_{l2}^2}\right),
\end{aligned} \tag{49}$

which is irrational, since $u_{01}/u_{02}$ is irrational while each factor $\frac{1-x_{l1}^2}{1-x_{l2}^2}$ is rational (the coordinates $x_{l1}$ and $x_{l2}$ being rational). As $x_{t1}/x_{t2}$ is rational, the degenerate case (48) never occurs, and the argument of Approach 1 applies.

*Proof.* Similarly to the hyperbolic tangent case, we consider only the 2-dimensional case. Since the angle is given by $\rho(\mathbf{x})=\arctan\left(\frac{x_2}{x_1}\right)$, we have

$\begin{aligned}
\frac{\partial \rho(\mathbf{x}(T))}{\partial x_1(0)}&=&\frac{1}{1+\left(\frac{x_2(T)}{x_1(T)}\right)^2}\\
&\times&\left(-\frac{x_2(T)}{x^2_1(T)}\frac{\partial x_1(T)}{\partial x_1(0)}+\frac{1}{x_1(T)}\frac{\partial x_2(T)}{\partial x_1(0)}\right),
\end{aligned} \tag{50}$

where, according to the chain rule, each partial derivative is the sum of products of layer-wise partial derivatives along all possible paths from $x_1(0)$ to the output coordinate, namely

$\begin{aligned}
\frac{\partial x_1(T)}{\partial x_1(0)}=\sum_{i_1,...,i_{T-1}=1}^2\frac{\partial x_1(T)}{\partial x_{i_{T-1}}(T-1)}\times \cdots\times\frac{\partial x_{i_1}(1)}{\partial x_1(0)},
\end{aligned} \tag{51}$

and

$\begin{aligned}
\frac{\partial x_2(T)}{\partial x_1(0)}=\sum_{i_1,...,i_{T-1}=1}^2\frac{\partial x_2(T)}{\partial x_{i_{T-1}}(T-1)}\times \cdots\times\frac{\partial x_{i_1}(1)}{\partial x_1(0)},
\end{aligned} \tag{52}$

and

$\begin{aligned}
&&\frac{\partial x_i(t)}{\partial x_j(t-1)}\\
&=&\left\{
\begin{array}{ll}
W_{ij}^{t-1}, & \text{if } W_{i1}^{t-1}x_1(t-1)+W_{i2}^{t-1}x_2(t-1)>0\\
0, & \text{otherwise}
\end{array}
\right..
\end{aligned} \tag{53}$
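The path sums (51)-(52) are precisely the entries of the product of the layer-wise Jacobians (53). The sketch below (our own check, with random weights) compares this product against a finite-difference estimate for a small bias-free ReLU network:

```python
import numpy as np

rng = np.random.default_rng(1)
T = 4
Ws = [rng.normal(size=(2, 2)) for _ in range(T)]   # W^0, ..., W^{T-1}

def forward(x0):
    x = np.array(x0, dtype=float)
    for W in Ws:
        x = np.maximum(W @ x, 0.0)                 # ReLU layer
    return x

x0 = np.array([0.7, -0.3])

# product of layer Jacobians: rows zeroed where the ReLU is inactive, eq. (53)
J = np.eye(2)
x = x0.copy()
for W in Ws:
    pre = W @ x
    J = (np.diag((pre > 0).astype(float)) @ W) @ J
    x = np.maximum(pre, 0.0)

# finite-difference check of the sums (51) and (52)
eps = 1e-6
fd = (forward(x0 + [eps, 0.0]) - forward(x0 - [eps, 0.0])) / (2 * eps)
print(J[:, 0], fd)   # first column of the Jacobian vs finite differences
```

Expanding the matrix product entrywise reproduces exactly the sums over index paths $i_1,\ldots,i_{T-1}$ in (51) and (52).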

Now, we set the parameters of DNN in the following manner:

- $W_{i1}^{t-1}x_1(t-1)+W_{i2}^{t-1}x_2(t-1)<0$ for $t=1,t_0,2t_0,\ldots$ (or equivalently $x_2(t)=0$), where $t_0$ is a positive integer.
- $W_{21}^t=0$ and $|W_{22}|<C$.
- $W_{11}^t=a>1$ and $b_1=(1-a)x_0$.

Given the above parameters, we have

$\begin{aligned}
\frac{\partial x_2(T)}{\partial x_1(0)}=0,\qquad \forall T>1,
\end{aligned} \tag{54}$

and

$\begin{aligned}
\frac{\partial x_1(T)}{\partial x_1(0)}=a^T,
\end{aligned} \tag{55}$

and thus

$\begin{aligned}
\frac{\partial \rho(\mathbf{x}(T))}{\partial x_1(0)}=O(a^T),
\end{aligned} \tag{56}$

which tends to infinity since $a>1$, implying a positive Lyapunov exponent.
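The effect of the choice $W_{11}^t=a>1$ with $b_1=(1-a)x_0$ can be simulated directly: the first coordinate stays at the fixed point while its sensitivity to the initial condition grows like $a^T$, as in (55). The numerical values below are our own illustrative choices.

```python
a, x_star, T = 1.5, 0.5, 12

def run(x1_0):
    # scalar ReLU dynamics of the first coordinate: x1 <- relu(a*x1 + (1-a)*x_star)
    x1 = x1_0
    for _ in range(T):
        x1 = max(a * x1 + (1.0 - a) * x_star, 0.0)   # ReLU stays active near x_star
    return x1

eps = 1e-8
fd = (run(x_star + eps) - run(x_star - eps)) / (2 * eps)
print(fd, a ** T)   # the sensitivity equals a^T, cf. eq. (55)
```

In the active region the map is exactly affine, $x_1(t)=x_0+a^t(x_1(0)-x_0)$, so the finite difference matches $a^T$ up to rounding.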

We first provide numerical results to justify the assumption in Theorem 4, namely

To prove Theorem 4, we need the following simple lemma:

**Lemma 2**. *There exists a positive solution to the following equation with the unknown $R$: *

$\begin{aligned}
R=\sqrt{E[\tanh^2(z)]},
\end{aligned} \tag{57}$

*where $z$ is a Gaussian random variable with zero expectation and variance $\sigma^2R$, when $\sigma>0$.*

*Proof.* It is easy to verify that the right-hand side of (57) is a continuous function of $R$. When $R$ is sufficiently small, we have

$\begin{aligned}
E[\tanh^2(z)]=\sigma^2R+o(R),
\end{aligned} \tag{58}$

which is larger than $R^2$, the square of the left-hand side of (57), for sufficiently small $R$. Since the right-hand side of (57) is bounded by $1$, the intermediate value theorem guarantees a positive solution.

We let $h(\sigma)$ denote the positive solution of (57).
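The existence claimed in Lemma 2 can be illustrated by solving (57) with a fixed-point iteration, evaluating the Gaussian expectation by Gauss-Hermite quadrature; the choice $\sigma=2$ below is our own.

```python
import numpy as np

nodes, weights = np.polynomial.hermite.hermgauss(80)

def rhs(R, sigma):
    # sqrt(E[tanh^2(z)]) for z ~ N(0, sigma^2 * R), via Gauss-Hermite quadrature
    s = sigma * np.sqrt(R)
    z = np.sqrt(2.0) * s * nodes
    return np.sqrt(np.sum(weights * np.tanh(z) ** 2) / np.sqrt(np.pi))

sigma, R = 2.0, 1.0
for _ in range(200):
    R = rhs(R, sigma)          # fixed-point iteration R <- sqrt(E[tanh^2(z)])
print(R, abs(R - rhs(R, sigma)))   # a positive solution of (57)
```

Starting from $R=1$, the iterates decrease monotonically to the positive fixed point, consistent with the intermediate value argument in the proof.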

We also need the following lemma for the proof of Theorem 4.

**Lemma 3**. *We consider the following scalar dynamics *

$\begin{aligned}
x(t+1)=\tanh(\alpha x(t)),
\end{aligned} \tag{59}$

*where $\alpha$ is a Gaussian random variable with zero expectation and variance $\sigma^2$. For a fixed $\alpha$, we consider the equation*

$\begin{aligned}
x=\tanh(\alpha x),
\end{aligned} \tag{60}$

*whose nonzero solution is denoted by $h(\alpha)$. Then, we have*

$\begin{aligned}
Var[x]\leq 4\tanh \left(\sigma\sqrt{\frac{2}{\pi}}h\left(\sigma\sqrt{\frac{2}{\pi}}\right)\right).
\end{aligned} \tag{61}$

*Proof.* The variance of $x$ can be computed as

$\begin{aligned}
Var(x)=\int \tanh^2(uv)f_x(u)f_\alpha(v)dudv,
\end{aligned} \tag{62}$

where $f_x$ and $f_\alpha$ are the probability density functions of $x$ and $\alpha$, respectively.