\documentstyle[times,11pt]{article}
\pagestyle{empty}
%\input mssymb
%\def\reals{{\rm I\kern-.17em R}}
\input epsf
%\input amssym.def
\def\reals{{\rm I\kern-.17em R}}
\def\nats{{\rm I\kern-.17em N}}
\hsize=5.6in \hoffset=-0.85in \textheight=9.5in \textwidth=6.5in \voffset=-1.0in
\title{Geometric convergence of filters for hidden
Markov models
\thanks{Part of the work presented in this paper have
been obtained within the framework of the Belgian Program on
Interuniversity Attraction Poles, initiated by the Belgian State, Prime
Minister's Office, Science Policy Programming. The first author was
also supported by a grant from ... during a visit to the Department
of Systems Engineering, Australian National University, Canberra,
A.C.T. The scientific
responsibility for this work rests with its authors. R.K.
Boel is supported by the Belgian National Research Fund
(F.N.R.S.) as Research Associate.}}
\author{R. K. Boel\thanks{Corresponding author}\\
Vakgroep Elektrische Energietechniek,
Universiteit Gent \\ Technologiepark-Zwijnaarde, B-9052 Ghent,
Belgium\\
\vspace{5mm}
Tel: +32-9-2645658 Fax: +32-9-2645840
Email: Rene.Boel@rug.ac.be \\
John B Moore and Subhrakanti Dey \\
%\vspace{5mm}
Department of Systems Engineering, RSISE\\
Australian National University \\
G.P.O. Box 4, Canberra, A.C.T. 0200, Australia \\}
% ----- theorem-like environments
\newtheorem{theorem} {Theorem}[section]
\newtheorem{corollary} {Corollary}[section]
\newtheorem{proposition} {Proposition}[section]
\newtheorem{definition} {Definition}[section]
\newtheorem{example} {Example}[section]
\newtheorem{lemma} {Lemma}[section]
\newtheorem{sideremark} {Remark}[section]
\newtheorem{sidenote} {Note}[section]
% ----- environments
\newenvironment{remark} {\begin{sideremark}\rm}{\end{sideremark}}
\newenvironment{note} {\begin{sidenote}\rm}{\end{sidenote}}
% ----- typewriting commands
\newcommand{\frenchpar} {\hskip\parindent}
% ----- mathematical commands
\def\argmin {\mathop{\rm argmin}}
\def\argmax {\mathop{\rm argmax}}
\renewcommand{\baselinestretch}{1.5}
\newcommand{\alfa} {\alpha}
%\newcommand{\proof} {{\sc Proof.}\hspace{0.25cm}} %bug with \sc
%\newcommand{\proof} {{\bf Proof.}\hspace{0.25cm}}
\newenvironment{proof}{\noindent{\bf Proof}\,\,}{$\hfill\Box$\\}
%\newcommand{\Section}[1]{\section{#1}
% \setcounter{equ}{1}}
%\newcounter{equ}
%\renewcommand{\theequation}{\thesection.\theequ\addtocounter{equ}{1}}
%\newcommand{\reff}[1]{\ref{#1}\addtocounter{equ}{-1}}
%
\newcommand{\ole}{\stackrel{\triangle}{=}}
\newcommand{\uu} {d}
\newcommand{\E} {{\bf E}}
\newcommand{\cov} {{\bf cov}}
\newcommand{\s} {x}
\newcommand{\gama} {\gamma}
\newcommand{\x} {u}
\newcommand{\cq} {{\cal Q}}
\newcommand{\n} {\eta}
\newcommand{\F} {f}
\newcommand{\Q} {{\cal Q}}
\newcommand{\G} {{\cal L}}
\newcommand{\p} {\prime}
\newcommand{\beq} {\begin{equation}}
\newcommand{\eeq} {\end{equation}}
\newcommand{\beqa} {\begin{eqnarray}}
\newcommand{\eeqa} {\end{eqnarray}}
\begin{document}
\maketitle
\date{}
\abstract{\noindent
Hidden Markov models have proved suitable for many interesting
applications which can be modelled using some unobservable
finite state Markov process, influencing measured signals.
Examples include bursty telecommunications
traffic, faults in a complicated system, the activity in
neurons, speech patterns, etc. In all these
applications one has to estimate the unobservable underlying state of
the Markov process, using the observed signals. Optimal recursive
filters are well known for this estimation problem. Recently risk sensitive
filters for the same problem have also been obtained. An important
question in studying the quality of such filters is the rate at which
arbitrarily assigned initial conditions are forgotten. In this paper
we show that the effect of initial conditions on these filters dies
out geometrically fast under very reasonable observability
assumptions. This guarantees that the estimate will depend only on the
most recent observations. This in turn allows the numerically stable
implementation of the filter.
The proof is given in the simplest case of finite
state space and of a finite, quantised observations space. However the method
can be extended to more general models
by continuity arguments. }
%\newpage
\section{Introduction}
Consider a complex communication system such as the Deep Space
Network of JPL ( see e.g. Smyth \cite{PS}). There is obviously no way
this complex system can be inspected for detecting failures. Moreover
there can be very many complicated interacting malfunctions to be
detected. The actual ``state of health'' of this system has to be deduced
from the signals received by the earth stations. One way of
modelling this system is to represent various subsystems as Markov processes,
and to model the signals received by the earth stations as depending
on the ``state of the system'' on the one hand, and on noise
independent of the system state on the other hand.
Space scientists try to filter the signal in such a way as to
recover what is called ``noise'' above. For the sake of maintaining the
long term goals of the space mission, it is also useful to try to
estimate the "state of health" of the system, so that corrective
measures can be taken as soon as necessary.
Typically the ``noise''
will vary much more rapidly than the ``state of the system''. Hence it
is reasonable to model the system as a discrete time
finite state Markov process, with a given sampling interval. The
observations over the sampling interval can be
summarised as a function of the system state and of the noise. The
noise terms vary at a faster time scale than the sampling interval so that
the noise corrupting the system state observations is assumed to
be white.
Similar models are studied in speech recognition \cite{speech}.
The observed speech
signal is segmented into consecutive intervals corresponding to the
consecutive ``phonemes'' uttered by a speaker. Phonemes are elements of
a finite set of possible ``symbols'', together forming words and
sentences. The sequence of phonemes forms, to a first approximation, a
finite state Markov process, with a transition matrix which may
depend on the language spoken, or even on the speaker. This
transition matrix is in a sense the grammar, at the level of
phonemes, of the language. With each
phoneme there corresponds an ``ideal'' signal, which however is
corrupted by many noise terms, both from the speaker and from the
environment. This noise makes it difficult to distinguish the
different phonemes from each other. The next step in speech
processing, after segmentation has been achieved, is to estimate the
sequence of phonemes.
The filters used in practical applications for the above examples
will have to work in real time, in a recursive way. In \cite{Elliott
et al} the optimal filter - in the sense of minimum squared error - for the
hidden Markov model introduced
above, is derived and shown also to be of a recursive form,
consisting of a state transition update and a measurement update
(cf. the Kalman filter) for the conditional distribution of the state of the
hidden Markov model.
The conditional distribution of the state at time $k+1$ depends on the
conditional distribution of the state at time $k$ and on the new observation
at time
$k+1$. To start this filter one must make an initial guess at time
$0$. One important property of a good filter is that the effect of
this initial guess dies out geometrically fast. This ensures that a
mistake in the initial guess will not make the filter give wrong
results forever. Intuitively this also guarantees that the effect of
round-off error, outliers, or any other cause of failure of the filter at a
small number of points in time, will not corrupt the estimation
results forever. The proof of this geometric convergence property
is the goal of this paper.
In this paper we do not try to solve this problem in its general
form. Rather we try to give a very simple, in fact, trivial, proof of
this property for the optimal filter in the case the state space is
finite, and the measurements are quantised as elements of a finite set.
First of all we note that the conditional distributions form a Markov process
({\em Kunita} \cite{Kunita}). Moreover this Markov process is irreducible in a
certain sense, provided the hidden Markov process is irreducible and the
observations are informative, i.e. no two states generate the same probability
distribution over the set of observations. Next we consider the update from
the conditional distribution of the state at time $k$ to the conditional
distribution of the state at time $k+1$.
This is just a multiplication by the transition operator, a matrix
which has all eigenvalues strictly inside the unit circle, except for a
single eigenvalue $1$. The eigenvalue $1$ simply expresses the fact that the
conditional distribution is a vector of terms summing to $1$ before
and after the update. Within the subset of normalised vectors, the
state update is actually a strict contraction. Finally the
conditional
probability of the state at time $k+1$, given the observed signal up to
time $k$ is updated by using the new measurement $Y_{k+1}$. We
show that on the average this update operator is a contraction
operator, i.e. on the average the difference between two
conditional distributions at time $k$ is reduced at time $k+1$. The
average here is taken over all possible noise terms at time $k+1$.
The proof is a simple explicit calculation of the average of the
Jacobian of the operator.
Hence the filtering algorithm consists of a sequence of state
update steps which form a strict contraction, and measurement
update steps which each are contractions on the average. This is
sufficient to show that the Markov process with state space the
conditional distribution at time $k$ forms a geometrically ergodic
process. This is exactly the geometric forgetting property we want
to prove.
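The forgetting property claimed here is easy to illustrate numerically. The sketch below ( with arbitrarily chosen illustrative matrices $A$ and $C$, not taken from any of the applications above) runs the optimal filter from two very different initial guesses along one simulated observation path and records the distance between the two conditional distributions.

```python
import numpy as np

# Illustrative 2-state, 2-symbol model; all numerical values are assumptions.
A = np.array([[0.9, 0.2],
              [0.1, 0.8]])   # a_ij = P(X_{k+1}=e_i | X_k=e_j), columns sum to 1
C = np.array([[0.7, 0.3],
              [0.3, 0.7]])   # c_mn = P(Y_k=f_m | X_k=e_n)
rng = np.random.default_rng(0)

def filter_step(p, y):
    """State transition update followed by the Bayes measurement update."""
    q = A @ p            # conditional distribution of X_{k+1} given Y_0..Y_k
    q = C[y] * q         # multiply by diag(c_{y,n})
    return q / q.sum()   # renormalise

x = 0
p1 = np.array([0.99, 0.01])  # one initial guess
p2 = np.array([0.01, 0.99])  # a very different initial guess
dists = []
for _ in range(200):
    x = rng.choice(2, p=A[:, x])   # hidden state transition
    y = rng.choice(2, p=C[:, x])   # quantised observation
    p1, p2 = filter_step(p1, y), filter_step(p2, y)
    dists.append(abs(p1[0] - p2[0]))
print(dists[0], dists[-1])
```

On a typical run the recorded distance decays geometrically, in line with the result proved below.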
Similar stability results can be derived for the recently developed
risk-sensitive filters for hidden Markov models \cite{SM2}. Also
smoothing filters for HMM show the same exponential forgetting. The methods
to prove the result can also be applied to filters when one assumes that the
exact model parameters are not known, but are estimated on-line. For this
adaptive filter the conditional distribution is no longer a Markov process.
However one can show that the hidden Markov state, the conditional
distribution and the parameter estimate together form a Markov process which
is geometrically ergodic under certain reasonable conditions.
It is clear
that the recursive filter with a state update and a
measurement update as described above, is very similar to a Kalman
filter. For the Kalman filter a
similar exponential forgetting property has been proved
\cite{Anderson and Moore}, \cite{Bougerol}. The method of proof
there is different because the linear structure of Kalman filters
cannot directly be used here. Nevertheless we reduce the problem in a
sense to a linear problem by using unnormalised conditional
distributions in the proof.
In the next section of this paper we introduce the model, and the
optimal filter, in detail. In Section 3 we prove the geometric
convergence.
In section 4, we derive the
risk-sensitive recursive estimates and the
optimizing risk-sensitive filter for hidden Markov
models with finite-discrete states and finite-discrete
observations. We also present the ergodicity results for these
recursive estimates in Section 5. In Section 6 we discuss extensions to the case with
continuous observations, and to optimal smoothing.
\section{Hidden Markov model}
Consider a first-order homogeneous
finite-state Markov process $X_k$ with state space ${\cal
S}_X$, and transition matrix $A$ defined on a probability space
$(\Omega, {\cal F}, {\cal P})$.
Also define ${\cal F}_k^0 \ole \sigma (X_0, \ldots, X_k)$ and the corresponding complete filtration $\{{\cal F}_k\}$.
Without loss of generality we can
take $N = \sharp {\cal S}_X$
and denote the $N$ elements of ${\cal S}_X$ by $e_n = ( 0~\ldots 0 ~1
~0 \ldots 0 )$, the unit vectors in ${\bf R}^N$. The elements of the
transition matrix
$$a_{ij} = {\cal P}( X_{k+1} = e_i \mid X_k = e_j ), \;\;
i, j \in \{1,2, \ldots, N
\}$$
represent the probability of reaching the state $e_i$ at time $k+1$
given that the state at time $k$ is $e_j$.
Of course, $a_{ij} > 0$ and $\sum_{i=1}^N a_{ij} = 1,\; \forall j \in \{1,
\ldots, N\}$.
This leads to the
following simple representation of the process $X_k$
( see \cite{Segal}, \cite{Boel}, \cite{Elliott et al}):
\beq
X_{k+1} = A X_k ~+ ~ V_{k+1} \label{eq:renehmm1} \eeq
where $E ( V_{k+1} \mid {\cal F}_k ) ~=~ 0$, i.e. $V_k$
is a $({\cal P}, {\cal F}_k)$-martingale increment sequence. The linear form of
(\ref{eq:renehmm1})
is simply a result of the special structure of the space ${\cal
S}_X$: on the set $\{ 0, 1 \}$ all functions are linear functions.
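As a sanity check of the representation (\ref{eq:renehmm1}), one can simulate the chain and verify that the empirical mean of the increments $V_{k+1} = X_{k+1} - A X_k$ is close to zero; the transition matrix below is an arbitrary illustrative choice.

```python
import numpy as np

# Illustrative column-stochastic transition matrix (not from the paper).
A = np.array([[0.9, 0.2],
              [0.1, 0.8]])   # a_ij = P(X_{k+1}=e_i | X_k=e_j)
N = A.shape[0]
rng = np.random.default_rng(1)

eye = np.eye(N)
x = 0                        # start in state e_1
V_sum = np.zeros(N)
T = 100_000
for _ in range(T):
    x_next = rng.choice(N, p=A[:, x])
    # Martingale increment of the representation X_{k+1} = A X_k + V_{k+1}.
    V_sum += eye[x_next] - A @ eye[x]
    x = x_next
print(V_sum / T)             # close to the zero vector for large T
```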
To keep the derivation simple in the next section we assume that the
Markov process $X_k$ with transition matrix $A$ is aperiodic and
irreducible ( see e.g. Meyn and Tweedie \cite{MT}).
At each time instant $k$ a signal $Y_k$ is observed. This observation
takes values in the finite space ${\cal S}_Y ~=~ \{ f_1, ~f_2,
\ldots, ~f_M \}$ where without loss of generality the values $f_m$
can be represented as a unit vector ( all components $0$ except for a
single $1$) in ${\bf R}^M$. Below we will abuse notation and
sometimes write $Y_k = m$ when we mean $Y_k = f_m$. The value of the
random variable $Y_k$ depends on the state $X_k$ and on a noise
term $W_k$. This dependence can be expressed as follows
\beq
Y_k ~=~ C X_k ~+~ W_k \label{eq:renehmm2} \eeq
where $W_k$ is a noise sequence ( shown below to be a martingale increment), and the elements
of the matrix $C$ are defined as the conditional probabilities
$$c_{m n} ~=~ {\cal P}( Y_k ~=~ f_m \mid X_k ~=~ e_n )$$
It is obvious that $c_{mn}>0$, and $\sum_{m=1}^M c_{mn} = 1$.
Define $\{{\cal Y}_k\}$ to be the complete filtration generated by
$\sigma (Y_0, \ldots, Y_k)$ and $\{{\cal G}_k\}$ to be the complete filtration
generated by ${\cal G}_k^0 \ole \sigma(X_0, \ldots, X_k, Y_0, \ldots, Y_{k-1})$.
It is easy to see that $E[ Y_k \mid X_k] = CX_k$ and hence
$E[W_k \mid {\cal G}_k] =0$, so that $W_k$ is a $({\cal P}, {\cal G}_k)$-
martingale increment.
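The same kind of check applies to the observation equation (\ref{eq:renehmm2}): simulating $Y_k$ from the conditional probabilities $c_{mn}$ and averaging the increments $W_k = Y_k - C X_k$ should give approximately zero. The matrices below are again illustrative placeholders.

```python
import numpy as np

# Illustrative model: N = 2 states, M = 3 quantised observation symbols.
A = np.array([[0.9, 0.2],
              [0.1, 0.8]])             # columns sum to 1
C = np.array([[0.6, 0.1],
              [0.3, 0.3],
              [0.1, 0.6]])             # c_mn = P(Y_k=f_m | X_k=e_n)
N, M = 2, 3
rng = np.random.default_rng(2)

eyeN, eyeM = np.eye(N), np.eye(M)
x = 0
W_sum = np.zeros(M)
T = 100_000
for _ in range(T):
    y = rng.choice(M, p=C[:, x])
    # Noise term of Y_k = C X_k + W_k; its conditional mean is zero.
    W_sum += eyeM[y] - C @ eyeN[x]
    x = rng.choice(N, p=A[:, x])
print(W_sum / T)                       # close to the zero vector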
In order to calculate the conditional expectation $E ( f(X_k) \mid {\cal Y}_k)$
for any function $f$, we need the
$N$-vector ${\hat p}_k$ representing the conditional distribution
of the state $X_k$ given the observations
$Y_0, Y_1, \ldots, Y_k$:
\beq
\hat{p}_k (i) = {\cal P}( X_k = e_i \mid {\cal Y}_k) \label{eq:estip}
\eeq
It is known ( see e.g. {\em Elliott et al.} \cite{Elliott et al})
that these conditional distributions can be calculated
recursively, i.e. $\hat{p}_{k+1} = F( A.\hat{p}_k, Y_{k+1})$ for a well
defined function $F$.
This recursive transformation consists of two
steps. The function $F$ transforms the conditional probability
$A. \hat{p}_k$ of $X_{k+1}$ given the observations up to time $k$, into the
conditional probability of the same state $X_{k+1}$ given observations up
to time $k+1$. This measurement update is given explicitly by:
\beq
F ( q, Y_{k+1}) = {{ diag(c_{Y_{k+1}, n}). q} \over
{\sum_{n=1}^N c_{Y_{k+1}, n}. q(n)}} \label{eq:nonmap} \eeq
Note that this measurement update consists of a linear transformation
$c_{Y_{k+1}, n}. q(n)$ ( the numerator of (\ref{eq:nonmap})),
followed by a normalisation, expressed by the denominator.
The second step of the recursive transformation calculates the
conditional distribution of $X_{k+1}$ from the conditional
distribution of $X_k$, given the same set of observations $Y_0, Y_1,
\ldots, Y_k$. This is simply a multiplication by the transition
matrix $A$. Since $A$ is a stochastic matrix the updated probability
$\hat{p}_{k+1}$
is automatically normalised, if $\hat{p}_k$ is normalised to one.
This step is therefore a linear transformation.
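The two update steps combine into the recursion $\hat{p}_{k+1} = F( A \hat{p}_k, Y_{k+1})$, which can be sketched in a few lines; the model matrices and the observation sequence are placeholders.

```python
import numpy as np

A = np.array([[0.9, 0.2],
              [0.1, 0.8]])    # a_ij = P(X_{k+1}=e_i | X_k=e_j), columns sum to 1
C = np.array([[0.7, 0.3],
              [0.3, 0.7]])    # c_mn = P(Y_k=f_m | X_k=e_n)

def measurement_update(q, m):
    """F(q, f_m): multiply by diag(c_{m,n}) and renormalise."""
    num = C[m] * q
    return num / num.sum()

def filter_update(p_hat, m):
    """Transition update A p_hat followed by the measurement update."""
    return measurement_update(A @ p_hat, m)

p = np.array([0.5, 0.5])
for m in [0, 0, 1, 0]:                 # an arbitrary observation sequence
    # Probabilities of the M possible next values of p_hat: (C A p_hat)(m).
    obs_probs = C @ (A @ p)
    p = filter_update(p, m)
print(p, p.sum(), obs_probs.sum())     # p stays a normalised distribution
```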
We assume now that the elements of the matrices $A$ and $C$ are
known. Then {\em Kunita} \cite{Kunita} has shown that the stochastic
process $\hat{p}_k$ is a Markov process, with as state space
the subset of ${\bf R}^N$ of normalised vectors: ${\cal S}_p =
\{ q \in
{\bf R}^N : q(n) \geq 0, \; \sum_{n=1}^N q(n) = 1 \}$. The transition probabilities for
$\hat{p}_k$ can be written down explicitly, since there are only $M$
possible values in the continuous space ${\bf R}^N$ which
$\hat{p}_{k+1}$ can take, given a value of $\hat{p}_k$. These $M$
values correspond to the $M$ values which $Y_{k+1}$ can take.
\beq
\hat{p}_{k+1} = {{ diag(c_{mn}). A. \hat{p}_k} \over
{\sum_{n=1}^N c_{mn}.(A \hat{p}_k) (n)}} \label{eq:estip2} \eeq
with probability ${\cal P}( Y_{k+1} = f_m \mid \hat{p}_k ) =
(C.A. \hat{p}_k)(m) = \sum_{n,l} c_{mn}.a_{nl} \hat{p}_k(l)$.
The initial
condition for the recursive filter is given by the initial value
$\hat{p}_0$ of this Markov process. The question whether the filter
forgets initial conditions geometrically fast is thus equivalent to
the question whether $\hat{p}_k$ is a geometrically ergodic Markov
process.
Clearly the process cannot be ergodic on the whole state space ${\cal S}_p$.
Let the conditional distribution after a measurement update be such
that we are ( almost ) certain that the state is $e_n$. After the next state
update this becomes $A.e_n = A_{.,n}$, the $n$-th column of
$A$. Whatever the initial distribution $\hat{p}_0$, after one step
the state $\hat{p}_k$ will be inside the convex hull $co(A) = \{ \sum_n
\lambda_n A_{.,n} \mid \sum_n \lambda_n = 1\}$. All states
$\hat{p}_k$ outside this convex hull are transient, and physically
not meaningful. Hence it makes
sense to limit the state space ${\cal S}_p$ to its subset $F(co(A))$, the set
of points reachable from a vector inside $co(A)$ after one measurement update
step.
This reduced state space will be assumed from now on.
\section{Geometric ergodicity of the estimator}
Before checking the geometric ergodicity of the Markov process
$\hat{p}_k$, we first have to see whether the process is irreducible.
Since the state space is continuous, it cannot be irreducible in the
sense of reaching any state from any other state, with positive probability.
However consider any open subset $O$ within the
reduced state space ${\cal S}_p$, and
any initial state $\hat{p}_0$. As soon as all the columns of $C$ are
different, there exists a finite $k$ and a
sequence of observations $Y_0,
Y_1, \ldots, Y_k$ which occurs with non-zero probability, such that
$\hat{p}_k$ is in the open set $O$. This is called forward
accessibility in \cite{MT}, or strong irreducibility \cite{Bougerol}. Since
the state space ${\cal S}_p$ is
compact, this forward accessibility essentially guarantees ergodicity
of the Markov process $\hat{p}_k$.
\begin{remark}
Even if $A$ is not irreducible, the Markov process $\hat{p}_k$ may
have distributions converging to an equilibrium distribution independent of
the initial distribution. Consider as an example the case where $A = I$, i.e.
the underlying state remains constant. We are then actually using the HMM
filter as an identifier. It is known that the estimate converges w.p. 1 to
the correct state as soon as the columns of $C$ are all different. However as
explained in the
introduction we need that the estimate depends essentially only on the most
recent observations in order to obtain a good filter. To
prove this we need to show
geometrically fast forgetting.
\end{remark}
Consider now the effect of the state transition step. The
multiplication of the intermediate probability ${\cal P}(X_k \mid
Y_0, Y_1, \ldots, Y_k )$ by the matrix $A$ has as effect that
the probability is coming closer towards the equilibrium distribution
$\pi = A. \pi$ of the Markov process $X_k$. The eigenvalue $1$ of the
matrix $A$ has as left eigenvector the vector with all $1$'s, as right
eigenvector the equilibrium distribution. This ensures that the sum of
the elements of the distribution is always normalised ( sums to $1$).
All the other eigenvalues are strictly less than $1$ in absolute
value by the Frobenius theorem \cite{Feller} for positive matrices.
Hence within ${\cal S}_p$ the distance between two
distributions $\mid A.p - A.\tilde{p} \mid$ after a state transition
update is strictly less than the distance $\mid p - \tilde{p} \mid $
between the vectors before the update. This state transition update
is a strict contraction operator.
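This strict contraction is easy to confirm numerically: a difference of two distributions has components summing to zero, so it lies in the subspace spanned by the eigenvectors whose eigenvalues are strictly inside the unit circle. The matrix below is illustrative; for $N = 2$ that subspace is one-dimensional, so the contraction factor equals the second eigenvalue exactly.

```python
import numpy as np

A = np.array([[0.9, 0.2],
              [0.1, 0.8]])            # column-stochastic; eigenvalues 1 and 0.7

p  = np.array([0.8, 0.2])
pt = np.array([0.3, 0.7])
d  = p - pt                           # components sum to 0

lam2 = sorted(np.abs(np.linalg.eigvals(A)))[0]   # second largest eigenvalue
ratio = np.linalg.norm(A @ d) / np.linalg.norm(d)
print(ratio, lam2)                    # for N = 2 these coincide
```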
Consider now the measurement update step $F( q, Y_{k+1})$. We have to
show that the distance between two ( conditional ) distributions gets
reduced, on the average, by this transformation. Given the
conditional distribution $\hat{p}_k$ the distribution of the
observations $Y_{k+1}$ is ${\cal P}( Y_{k+1} = f_m \mid \hat{p}_k ) =
\sum_{n=1}^N \sum_{\ell = 1}^N c_{mn}. a_{n \ell} \hat{p}_k(\ell) =
(C.A.\hat{p}_k)(m)$. This is exactly
the normalising factor in the denominator of the measurement update
equation. However, it is still difficult to calculate $E \mid F(
p, Y_{k+1}) - F( \tilde{p}, Y_{k+1}) \mid$ and compare it to $\mid p
- \tilde{p} \mid$, because two different normalising factors are
involved. What can be calculated is the change in normed distance
when $\tilde{q} = q + \delta q $, i.e. the effect of a small
perturbation. To obtain this difference calculate the
Jacobian with respect to $q$ of the transformation $F( q, Y_{k+1})$,
and take the
average over all possible values of $Y_{k+1}$. This derivation leads
to the following expression for the $(i, j)$-th element:
$$ \sum_m \left[ c_{m i} . \delta_{ i j} - {{c_{mi}.c_{mj}. q(i)}
\over { \sum_l c_{ml}. q(l)}} \right]$$
In matrix form this gives an identity matrix minus a complicated
matrix.
To get some insight in the conditions for this matrix to be
contracting, take first the case of a hidden Markov model with only
two unobservable states ${\cal S}_X = \{ e_1, e_2\}$, and two signal values
( $M = 2$). Then the
normalised conditional distribution can be written as $\hat{p}_k = (
\hat{p}_k(1), 1 - \hat{p}_k(1) )^{\p}$. Hence the state space for the
Markov process of conditional distributions can be reduced to a
subset of the interval $[ 0, 1]$. Both the state transition update
and the measurement update can be reduced to one-dimensional
recursions, and the Jacobian to be calculated reduces to a simple
derivative ( rewriting $F$ as a function of $\hat{p}_k(1)$ only).
The derivative is explicitly calculated as
$$E( \Delta ) = 1
- {{(c_{11} - c_{12})^2. \hat{p}_k(1).( 1 - \hat{p}_k(1))} \over
{\prod_{m=1}^2 (c_{m1}.\hat{p}_k(1) + c_{m2}.( 1 -
\hat{p}_k(1)))}}$$
where $\Delta = \mid {{ \partial F} \over {\partial p}}(Y_{k+1},
\hat{p}_k(1)) \mid $.
This is always less than $1$ as soon as the observability condition
$c_{11} \not= c_{12}$ is satisfied. This condition is evidently
necessary since otherwise the observations would carry no information
whatsoever about the unobserved state.
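The averaged derivative $E(\Delta)$ admits a direct numerical check. In the sketch below the derivative of the measurement update for symbol $m$ is $c_{m1}c_{m2}/d_m^2$ with $d_m = c_{m1}\hat{p} + c_{m2}(1-\hat{p})$, the average uses the weights $d_m$, and the closed form has the product $d_1 d_2$ in its denominator; the numerical values of $c_{11}$, $c_{12}$ and $\hat{p}$ are arbitrary.

```python
import numpy as np

def check(c11, c12, p):
    """Compare E|dF/dp| computed two ways for a 2-state, 2-symbol model."""
    C = np.array([[c11, c12],
                  [1 - c11, 1 - c12]])           # columns sum to 1
    # Brute-force average over the two observation symbols.
    E = 0.0
    for m in range(2):
        d_m = C[m, 0] * p + C[m, 1] * (1 - p)    # P(Y = f_m | p)
        dF = C[m, 0] * C[m, 1] / d_m**2          # |dF/dp| for symbol m
        E += d_m * dF
    # Closed form: 1 - (c11-c12)^2 p(1-p) / prod_m (c_m1 p + c_m2 (1-p)).
    denom = np.prod([C[m, 0] * p + C[m, 1] * (1 - p) for m in range(2)])
    closed = 1 - (c11 - c12) ** 2 * p * (1 - p) / denom
    return E, closed

print(check(0.7, 0.3, 0.4))   # the two numbers agree and are below 1
```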
In the case with $N = 2$ and $M \geq 2$ we find a similar expression for the
magnitude of the contraction, involving products of the differences
$c_{m1} - c_{k1}$. In fact whenever $N = 2$ we can give a necessary and
sufficient condition for exponential stability of the HMM filter. This
condition ({\em Bitmead and Boel} \cite{BB}) states that
\beq
E\left[\log \left( \lambda_2(A) \left| {{ \partial F} \over {\partial p}}(Y_{k+1},
\hat{p}_k(1)) \right| \right)\right] < 0 \label{eq:nsuff} \eeq
where $\lambda_2(A)$ is the second largest eigenvalue of $A$ (
strictly less than $1$). Of course this condition is not directly applicable
since it is in general very difficult to evaluate the expectation. For $N >
2$ there is no such necessary and sufficient condition, because the Jacobian
of the transformation $F$, even reduced to a subspace of dimension $N-1$, is
a matrix. The necessary and sufficient condition above was derived using
commutativity of the different update steps.
However it is still possible to obtain geometric ergodicity of $\hat{p}_k$ by
simply showing that it is a contraction ( not necessarily strict) ( see e.g.
{\em Bougerol} \cite{Bougerol}). It suffices
e.g. to calculate the eigenvalues of $E(\Delta.\Delta^{\p})$,
and prove that there
is at most a simple eigenvalue $1$, while all the other eigenvalues are
strictly less than $1$. Simple observability conditions on $C$ guaranteeing
this property will be a topic of further research. The rate of forgetting
initial conditions ( or any other past data) is then at least as fast as
$\lambda_2(A)$. This may however be a pessimistic estimate.
\section{Risk-sensitive filtering for hidden Markov models}
In this section, we present the risk-sensitive filtering theory for hidden
Markov models with finite-discrete states and finite-discrete observations,
much along the same lines as presented in \cite{SM2} for the
continuous-range observation case. We briefly define the problem of
risk-sensitive estimation. A new probability measure is then defined
under which the risk-sensitive estimation problem is solved to obtain
results for recursive estimates and the optimizing risk-sensitive
filter.
Finally, we present the
stability results for the recursive risk-sensitive estimate
for a simple HMM with
$2$ states and $M$ output symbols.
\subsection{Problem Definition}
Consider the signal model defined by (\ref{eq:renehmm1}) and
(\ref{eq:renehmm2}).
Our problem objective is to find an estimate $\hat X_k$ of $X_k$,
where $\hat X_k \in \reals^N$, such that the
following criterion is satisfied,
\beq
\hat X_k = \argmin_{\zeta \in \reals^N} J_k(\zeta),\; J_k(\zeta) = E [\theta
\exp(\theta \Psi_{0,k}(\zeta)) | {\cal Y}_k],\;\;
\forall k = 0,1,\ldots \label{eq:chmm} \eeq
where $\theta ( > 0)$ is the risk-sensitive parameter and
\beq
\Psi_{0,k}(\zeta) = \hat \Psi_{0,k-1} + \frac{1}{2}( X_k - \zeta)^{\p} Q_k
(X_k - \zeta), \; Q_k \geq 0 \;\; \forall k
\label{eq:chmm1} \eeq
where
$$
\hat \Psi_{m,n} \ole \frac{1}{2} \sum_{i=m}^{n} ( X_i - \hat X_i)^{
\p} Q_i (X_i - \hat X_i) $$
\begin{remark}
In \cite{SM2}, it has been considered that $\hat X_k \in {\cal S}_X$.
To avoid a technical problem which will be explained in the next section,
we assume here that $\hat X_k \in \reals^N$.
\end{remark}
\subsection{Change of Measure and Reformulated Cost Index}
Define $Y_k^i = \langle Y_k, f_i \rangle$, where $Y_k = ( Y_k^1, \dots, Y_k^M)$
such that for each $k \in \nats$, exactly one
component is equal to $1$, the remainder being $0$.
Define a new measure $\bar {\cal P}$ under which $\{Y_k\}, k \in \nats$, is a
sequence of {\em i.i.d.} random variables and
$$
\bar {\cal P}(Y_k^j = 1) = \frac{1}{M}.$$
Let $c_k = CX_k$ and $c_k^i = \langle c_k , f_i \rangle$.
Also define
$$
\bar \lambda_k = \prod_{i=1}^M (M c_k^i)^{Y_k^i}, \;\;
\bar \Lambda_k = \prod_{l=0}^{k} \bar \lambda_l $$
If we set the Radon-Nikodym derivative $\frac{d{\cal P}}{d\bar {\cal P}}
\mid_{{\cal G}_k} = \bar \Lambda_k$, then under ${\cal P}$,
$$ E[Y_k \mid {\cal G}_k] = CX_k. $$
Using
a version of Bayes' Theorem, we have
\beq
E [\theta \exp(\theta \Psi_{0,k}(\zeta))| {\cal Y}_k] = \frac{\bar E [
\bar \Lambda_k \theta
\exp(\theta \Psi_{0,k}(\zeta))| {\cal Y}_k]}{\bar E [\bar \Lambda_k |
{\cal Y}_k ]}
\label{eq:cdef2s} \eeq
Hence, we work under $\bar {\cal P}$ where the modified
problem objective is to determine $\hat X_k (\in \reals^N)$ such that
\beq
\hat X_k = \argmin_{\zeta \in \reals^N} \bar E [\bar \Lambda_k \theta
\exp (\theta \Psi_{0,k}(\zeta)) | {\cal Y}_k ]
\label{eq:chmm2s} \eeq
\subsection{Recursive estimates}
\begin{definition}
Define the measure $\alpha_k(j)$ to be the unnormalised information state such
that
\beq
\alpha_k(j) = \bar E [\bar \Lambda_{k-1} \theta \exp (\theta \hat \Psi_{0,k-1}
)
\langle X_k, e_j \rangle | {\cal Y}_{k-1}] \label{eq:alphahmms} \eeq
\label{def:alphahmms} \end{definition}
\begin{remark}
Note that $\alpha_k(j)$ can be interpreted as an information state of
an augmented plant where the state includes the actual state of the system and
part of the risk-sensitive cost. For details, see \cite{jrs1}.
\end{remark}
\begin{lemma}
The information state $\alpha_k = (\alpha_k(1), \ldots, \alpha_k(N))^{\p}$
obeys the following recursion
\beq
\alpha_{k+1} = A {\cal D}_k^{\p} {\cal B}_k^{\p} \alpha_k
\label{eq:recalhmm} \eeq
where
$$
{\cal B}_k = diag \left\{ Mc_1(Y_k), \ldots, Mc_N(Y_k)\right\} $$
$$
{\cal D}_k = diag \left\{ \exp \left( \frac{\theta}{2} (e_1 - \hat X_k)^{\p}
Q_k (e_1 - \hat X_k) \right),
\ldots, \exp \left( \frac{\theta}{2} (e_N - \hat X_k)^{\p} Q_k (e_N - \hat
X_k)\right) \right\} $$
and $c_i(Y_k) = c_{ji} \:$ if $Y_k = f_j,\;\; \forall i \in \{1, \ldots, N\},\;
j \in \{1, \ldots, M\}$.
\label{lemmaalpha} \end{lemma}
\begin{proof}
The proof can be carried out in the same way as it has been done for
continuous-range observations in \cite{SM2}.
\end{proof}
\begin{remark}
Note here that the information state filter is linear and finite-dimensional.
\end{remark}
\begin{note}
{\bf Normalization}: \\
Denote the normalized information state by $\hat \alpha_k$. It can
be easily shown that
\beq
\hat \alpha_{k+1} = \frac{A {\cal D}_k^{\p} {\cal B}_k^{\p} \hat \alpha_k}
{\sum_{i=1}^N M c_i(Y_k)\exp \left( \frac{\theta}{2} (e_i - \hat X_k)^{\p}
Q_k (e_i - \hat X_k) \right)
\hat \alpha_k(i)} \label{eq:nalpha} \eeq
\end{note}
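One step of the normalised recursion (\ref{eq:nalpha}) can be sketched as follows. All numerical values are assumptions, and since the optimising $\hat X_k$ of the theorem below has no closed form, the sketch substitutes a placeholder estimate purely to make the recursion executable.

```python
import numpy as np

# Illustrative 2-state, 2-symbol model; all numerical values are assumptions.
A = np.array([[0.9, 0.2],
              [0.1, 0.8]])
C = np.array([[0.7, 0.3],
              [0.3, 0.7]])
N, M = 2, 2
theta = 0.5                         # risk-sensitive parameter (assumed value)
Q = np.eye(N)                       # Q_k >= 0

def rs_step(alpha_hat, y, X_hat):
    """One step of the normalised risk-sensitive recursion."""
    B = np.diag(M * C[y])           # B_k = diag(M c_i(Y_k))
    D = np.diag([np.exp(0.5 * theta * (np.eye(N)[i] - X_hat) @ Q
                        @ (np.eye(N)[i] - X_hat)) for i in range(N)])
    num = A @ D @ B @ alpha_hat
    den = (B @ D @ alpha_hat).sum() # sum_i M c_i(Y_k) exp(...) alpha_hat(i)
    return num / den

alpha = np.array([0.5, 0.5])
for y in [0, 1, 0]:
    X_hat = alpha                   # placeholder estimate, NOT the optimising one
    alpha = rs_step(alpha, y, X_hat)
print(alpha, alpha.sum())           # stays normalised: A is column-stochastic
```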
\begin{theorem}
The optimizing estimate $\hat X_k$ is given by
\beq
\hat X_k = \argmin_{\xi \in \reals^N} \sum_{j=1}^N \prod_{i=1}^M (Mc_{ij})^{Y_k^i} \exp \left( \frac{\theta}{2} (e_j - \xi)^{\p} Q_k (e_j - \xi) \right)
\alpha_k(j) \label{eq:esthmms} \eeq
\label{hmmtheos} \end{theorem}
\begin{proof}
Again, the proof is similar to the one given in \cite{SM2} and hence
not repeated here.
\end{proof}
\begin{remark}
It should be obvious from the convex nature of the expression on the R.H.S. of
(\ref{eq:esthmms}) that $\hat X_k$ exists and is unique.
\end{remark}
\section{Geometric ergodicity of risk-sensitive filters}
In this section, we present the results regarding the geometric ergodicity
or ``exponential forgetting of initial conditions'' for the
risk-sensitive filter for HMMs with finite-discrete states and finite-discrete
observations. As explained before, instead of considering a very general
$N$-state $M$-output symbol HMM, we consider a 2-state $M$-output symbol HMM,
so that we deal with a scalar derivative instead of a Jacobian. We
present the general condition for the geometric ergodicity of the
risk-sensitive filter followed by a simpler sufficient condition which
is observed to be too strict for the standard risk-neutral filter or the
conditional expectation filter, but not so for the risk-sensitive filter
under consideration.
\subsection{Geometric ergodicity of the recursive risk-sensitive filter
for a 2-state M-output symbol HMM}
From the results derived in the previous section, we see that the
normalized risk-sensitive estimates for a 2-state M-output
symbol HMM are given by the following recursion
\beq
\hat \alpha_{k+1} = A F_k(Y_k, \hat \alpha_k) \label{eq:rnalpha} \eeq
where $\hat \alpha_k = (\hat \alpha_k(1),\; 1-\hat \alpha_k(1))^{\p}$ and
$F_k(Y_k, \hat \alpha_k)$ is a nonlinear vector function given by
\beq
F_k(Y_k, \hat \alpha_k) = \frac{diag\left\{c_1(Y_k)\exp\left(\frac{\theta}{2}
(e_1 - \hat X_k)^{\p} Q_k (e_1 - \hat X_k)\right), c_2(Y_k)
\exp\left(\frac{\theta}{2} (e_2 -\hat X_k)^{\p} Q_k (e_2 -\hat X_k) \right)
\right\}\hat \alpha_k}{\sum_{i=1}^2 c_i(Y_k)\exp\left(\frac{\theta}{2}
(e_i - \hat X_k)^{\p} Q_k (e_i - \hat X_k)\right)
\hat \alpha_k(i)} \label{eq:nonvec} \eeq
This recursion can be broken into two steps of transformation, a nonlinear
mapping followed by a linear mapping. In Section 3, the linear transformation
has already been shown to be a strict contraction due to the fact that $A$ is a
transition probability matrix. Hence, we just deal with the nonlinear
transformation and derive the condition under which it will be a contraction
in an averaging sense.
\begin{theorem}
The necessary and sufficient condition for the nonlinear mapping $F_k :
\reals^M \times \reals^2 \rightarrow \reals^2$ to be a contraction $\forall k \in \nats$
in an
averaging sense is given by
\beq
\sum_{m=1}^M \left| g_m(\hat \alpha_k(1), \theta) \right| \; <\; M,\;\;
\forall k \label{eq:gencond} \eeq
where $$
g_m(\hat \alpha_k(1), \theta) = \frac{c_{m1}c_{m2} \exp (\sigma_k^2(1) +
\sigma_k^2(2)) [1 + \theta \hat \alpha_k(1)(1 - \hat \alpha_k(1)) \{(e_2 - e_1)
^{\p} Q_k \frac{\partial \hat X_k}{\partial \hat \alpha_k(1)}
\mid_{Y_k, \hat \alpha_k(1)}\}]}
{[c_{m1}\exp(\sigma_k^2(1))\hat \alpha_k(1) + c_{m2} \exp (\sigma_k^2(2))
(1 - \hat \alpha_k(1)) ]^2} $$
and $$
\sigma_k^2(i) = \frac{\theta}{2} (e_i - \hat X_k)^{\p} Q_k (e_i- \hat X_k) ,
\; i=1,2. $$
\label{theosd1} \end{theorem}
\begin{proof}
The proof technique is similar to the one in Section 3, except that
we start with the condition
\beq
\bar E \left[ \left| \frac{\partial F_k(Y_k, \hat \alpha_k(1))}{\partial
\hat \alpha_k(1)}\right|_{\hat \alpha_k(1), Y_k=f_m} \left. \right|\:
\hat \alpha_k(1),\: Y_k=f_m \right] \; < \; 1 \label{eq:ineqs} \eeq
which has to be satisfied for the mapping to be a contraction in an
averaging sense.
We work under $\bar P$ because it simplifies the derivation: under $\bar P$
the observations are i.i.d., independent of the state, and uniformly
distributed, with $\bar P(Y_k = f_m) = \frac{1}{M},\; \forall m \in \{1,\ldots, M\}$.
Hence, we have the equivalent condition
$$
\sum_{m=1}^M \left|\frac{\partial F_k(Y_k, \hat \alpha_k(1))}
{\partial \hat \alpha_k(1)}\right|_{\hat \alpha_k(1), Y_k=f_m} \; < \; M $$
The rest of the proof consists of straightforward algebraic manipulations and is omitted here.
\end{proof}
\begin{remark}
It is obvious from (\ref{eq:esthmms}) that $\hat X_k$ is a function
of $\hat \alpha_k(1)$, although the functional relationship is not
explicitly known. This prevents us from obtaining any further simplification
of the condition (\ref{eq:gencond}) given above. We assume that
for a given $Y_k$,
$\hat X_k = L(\hat \alpha_k(1))$ where $L \in C^1(\reals)$. We
also assume $\left| \frac{\partial \hat X_k}{\partial \hat \alpha_k(1)}\right|_{Y_k, \hat \alpha_k(1)} \; < \; \infty$. It is for this reason
that we chose $\hat X_k \in \reals^N$, rather than $\hat X_k \in \cal E$, as
mentioned before.
\end{remark}
\begin{corollary}
A sufficient condition for the nonlinear mapping $F_k :
\reals^M \times \reals^2 \rightarrow \reals^2$ to be a contraction
$\forall k \in \nats$ in an
averaging sense is
\beq
\left| g_m(\hat \alpha_k(1), \theta) \right| \; <\; 1,\;\;
\forall m \in \{1,\ldots, M\} \label{eq:sufcond} \eeq
where $\theta \ne 0$.
\label{cor1} \end{corollary}
\begin{proof}
The proof is immediate from Theorem \ref{theosd1}.
\end{proof}
\begin{remark}
It should be noted here that for $\theta = 0$, when the risk-neutral filter
for HMMs is obtained (see \cite{SM2}), this sufficient condition
implies observability and hence is much stricter than the observability
condition obtained in Section 3.
In fact, this condition might not be satisfied for all values of
$\hat \alpha_k(1) \in (0,\: 1)$ for a given set of parameters $A, C$.
\end{remark}
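As a quick check, setting $\theta = 0$ in Theorem \ref{theosd1} makes the exponential factors and the bracketed term equal to $1$, so that
$$
g_m(\hat \alpha_k(1), 0) = \frac{c_{m1}c_{m2}}
{[c_{m1} \hat \alpha_k(1) + c_{m2} (1 - \hat \alpha_k(1))]^2} .
$$
The denominator is monotone in $\hat \alpha_k(1)$, so the supremum of $g_m(\cdot, 0)$ over $[0,\:1]$ is $\max(c_{m1},c_{m2})/\min(c_{m1},c_{m2}) \; \geq \; 1$; the bound $|g_m| < 1$ can therefore fail when $\hat \alpha_k(1)$ approaches an endpoint of the interval.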
\begin{remark}
It should be also noted that for $\theta \ne 0$, none of the conditions in
Theorem \ref{theosd1} or Corollary \ref{cor1} implies
observability. In fact,
even if $c_{m1}=c_{m2},\;\forall m \;\in\; \{1,\ldots, M\}$, the
conditions (\ref{eq:gencond}) and (\ref{eq:sufcond}) can be satisfied
provided
$$
\left|(e_2 - e_1)^{\p} Q_k \frac{\partial \hat X_k}{\partial \hat \alpha_k(1)}
\left. \right|
_{Y_k, \hat \alpha_k(1)}\right| < \gamma^2(\theta, \hat \alpha_k(1)) $$
Hence,
observability is not a necessary condition
for the nonlinear mapping $F_k :
\reals^M \times \reals^2 \rightarrow \reals^2$ to be a contraction, as opposed
to the case of risk-neutral HMM filters in Section 3.
\end{remark}
\begin{remark}
Risk-sensitive filters for hidden Markov models with $N=2$ and a
continuous-range observation space $\reals^p$
have been derived in \cite{SM2}. The general
condition for the corresponding nonlinear mapping to be a contraction
can be similarly derived where the derivation
involves an integration over the range of the observation
process rather than a summation as in (\ref{eq:gencond}).
\label{re:contrisk}
\end{remark}
Also, it is not difficult to see that a necessary and sufficient condition
similar to (\ref{eq:nsuff}) can be obtained
for the exponential stability of the recursive risk-sensitive estimates.
However, the most fundamental observation to be made from the
above results is that for sufficiently large values of $\theta$,
$\left| g_m(\hat \alpha_k(1), \theta) \right|$ can be $\geq 1$ for all $m$,
and hence neither of the conditions (\ref{eq:gencond}) and (\ref{eq:sufcond})
would be satisfied. In other words, the risk-sensitive filter may become
unstable, i.e., a small change in the initial conditions may result in
chaotic behaviour of the risk-sensitive filter. This restriction
on $\theta$ has also been observed in \cite{SM1}, \cite{Speyer} for the
case of risk-sensitive filters for linear Gauss-Markov models: it has been
seen that sufficiently large values of $\theta$ may make a certain matrix
negative definite, resulting in the non-existence of the solution of a
certain Riccati equation. Simulation studies for risk-sensitive filters
for HMMs with continuous-range observations \cite{SM2} have also shown that
risk-sensitive filters lose their robustness against uncertain noise
environments for sufficiently large values of $\theta$.
However, there is as yet no general theory for choosing $\theta$
before starting the estimation process such that the stability of the
risk-sensitive filter is guaranteed throughout.
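The effect of large $\theta$ can be illustrated numerically. The sketch below (with assumed parameters $C$, with $Q_k = I$, and with the conditional mean standing in for the unspecified map $L$) approximates the left-hand side of the averaged-contraction condition by finite differences: for small $\theta$ the condition holds, while for a sufficiently large $\theta$ it fails.

```python
import numpy as np

# Assumed illustrative parameters; Q_k = I, and the conditional mean
# stands in for the paper's unspecified dependence X_hat = L(alpha(1)).
C = np.array([[0.8, 0.3],
              [0.2, 0.7]])            # C[m, i] = c_i(f_m)
M = C.shape[0]                        # number of output symbols

def F1(m, a1, theta):
    """First component of the nonlinear map F_k for Y_k = f_m."""
    alpha = np.array([a1, 1.0 - a1])
    x_hat = alpha                     # conditional-mean stand-in for X_hat
    e = np.eye(2)
    s = np.array([0.5 * theta * np.dot(e[i] - x_hat, e[i] - x_hat)
                  for i in range(2)])
    w = C[m] * np.exp(s) * alpha      # unnormalised weights
    return w[0] / w.sum()

def averaged_slope(a1, theta, h=1e-6):
    """Finite-difference estimate of sum_m |dF_k(1)/d alpha_k(1)|."""
    return sum(abs(F1(m, a1 + h, theta) - F1(m, a1 - h, theta)) / (2 * h)
               for m in range(M))

# Condition (eq:gencond) asks for the averaged slope to be < M.
print(averaged_slope(0.5, 0.0))       # small theta: condition holds
print(averaged_slope(0.5, 6.0))       # large theta: condition fails
```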
\section{ Extensions and applications}
As the quantisation gets finer and finer, and $M$ increases to infinity, the
geometric ergodicity obtained for the
conditional density of the state in Section 3 still holds,
i.e. the result remains true for a continuous observation space ${\cal
S}_Y$. The rate of forgetting may deteriorate as $M$ increases, but it is
certainly bounded by $\lambda_2(A)$.
See Remark \ref{re:contrisk} for extensions to hidden Markov models with
continuous-range observations in the risk-sensitive case.
It should be noted that in an actual implementation of the recursive filter,
the normalisation step expressed by the denominator need not be executed at
each time step. Only at those points in time when one actually has to
calculate a conditional expectation based on the conditional distribution
$\hat{p}_k$, or when there is a problem with overflow or underflow in the
calculation of the unnormalised probabilities, is it necessary to carry out
this normalisation step. It should be clear from the derivation above that
this does not affect the geometric ergodicity, even though the
unnormalised distributions may of course diverge.
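A minimal sketch of this intermittent-normalisation strategy for the standard filter, with assumed parameter values: the unnormalised recursion is rescaled only when its total mass threatens under- or overflow, and renormalising on demand recovers the same conditional distribution as normalising at every step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed illustrative parameters for the standard filter recursion.
A = np.array([[0.9, 0.2],
              [0.1, 0.8]])            # column-stochastic transition matrix
C = np.array([[0.8, 0.3],
              [0.2, 0.7]])            # C[m, i] = c_i(f_m)
ys = rng.integers(0, 2, size=100)     # an arbitrary observation record

p = np.array([0.5, 0.5])              # normalised at every time step
q = np.array([0.5, 0.5])              # normalised only on demand
for y in ys:
    p = A @ (C[y] * p)
    p /= p.sum()                      # normalisation at every step
    q = A @ (C[y] * q)                # unnormalised recursion
    if not 1e-150 < q.sum() < 1e150:  # rescale only near under/overflow
        q /= q.sum()

# Renormalising on demand recovers the conditional distribution.
print(np.allclose(p, q / q.sum()))    # True
```

Because the unnormalised recursion is linear, rescaling commutes with the updates, which is why the two trajectories agree up to normalisation.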
The above analysis assumes that the filter uses the correct parameter
values for $A$ and $C$. If this is not the case then, for
example in the case of standard HMM filtering, $\hat{p}_k$ is no
longer a Markov process. However, the joint process $(X_k, \hat{p}_k)$ is still
a Markov process; it is a multiplicative Markov process in the terminology of
Bougerol \cite{Bougerol}. Because the first component $X_k$ of the
process is not influenced by the second component, the method of
\cite{Bougerol} can be used to derive similar geometric ergodicity properties
for this enlarged process. In fact, one can hope that it will be possible to
consider the geometric ergodicity of an even more complicated process. Let
the filter run in parallel with any recursive estimator of the unknown
parameters of $A$ and $C$, and let the state of the recursive parameter estimator
be $\theta_k$. Then the joint process $(X_k, \hat{p}_k, \theta_k)$ is again a
multiplicative Markov process, and at least formally it is possible to give
conditions for geometric ergodicity. Whether these formal ergodicity
conditions can be transformed into easily verifiable conditions is a topic for
further research.
\begin{thebibliography}{99}
\bibitem{Anderson and Moore}
B.D.O. Anderson and J.B. Moore ?
\bibitem{Bitmead}
R. Bitmead and R. Boel: "On Stochastic Convergence of Infinite Products of
Random Matrices and its Role in Adaptive Estimation Theory", Proceedings IFAC
Symposium on Identification and System Parameter Estimation, York, 1985.
\bibitem{Bougerol}
P. Bougerol: "Th\'eor\`emes limite pour les syst\`emes lin\'eaires \`a
coefficients markoviens", Probability Theory and Related Fields, vol. 78,
1988, pp.193-211, and "Comparaisons des exposants de Lyapunov des processus
markoviens multiplicatifs", Annales de l'Institut Henri Poincar\'e, vol. 24,
1988, pp.439-489.
\bibitem{Berman}
A. Berman and R. Plemmons: Nonnegative Matrices in the Mathematical Sciences,
SIAM Classics in Applied Mathematics, 1994.
\bibitem{Kunita}
H. Kunita: "Asymptotic Behaviour of Nonlinear Filtering Errors of Markov
Processes", Journal of Multivariate Analysis, vol. 1, 1971, pp. 365-393.
\bibitem{MT}
S. Meyn and R. Tweedie: Markov Chains and Stochastic Stability,
Springer-Verlag, 1993.
\bibitem{speech}
L. Rabiner: "A Tutorial on Hidden Markov Models and Selected Applications in
Speech Recognition", Proceedings of the IEEE, vol. 77, 1989, pp. 257-285.
\bibitem{PS}
P. Smyth: "Markov Monitoring with Unknown States", IEEE Journal on
Selected Areas in Communications, vol. 12, no. 9, Dec. 1994, pp.
1600-1612.
\bibitem{SM2}
S. Dey and J. B. Moore, ``Risk-sensitive filtering and
smoothing for Hidden Markov Models,'' {\em Systems and Control Letters},
accepted for publication, 1994.
\bibitem{jrs1}
J. B. Moore, R.J. Elliott and S. Dey, ``Risk-sensitive
Generalizations of Minimum Variance Estimation and Control,'' {\em Proc.
of the IFAC Symposium on Nonlinear Control System Design (NOLCOS)}, California,
June 1995, to appear.
\bibitem{SM1}
S. Dey and J. B. Moore, ``Risk-sensitive filtering and
smoothing via Reference Probability Methods'', {\em Proc. of the American
Control Conference}, Seattle, June 1995, to appear.
\bibitem{Speyer}
J. L. Speyer, C. Fan and R. N. Banavar, ``Optimal
Stochastic Estimation with Exponential Cost Criteria,'' {\em Proceedings of the
31st Conference on Decision and Control}, Vol. 2, pp. 2293-2298, Dec. 1992.
\end{thebibliography}
\end{document}