# Area－Time Efficient Evaluation of Elementary Functions 

Yasuo OKABE and Shuzo YAJIMA<br>Faculty of Engineering，Kyoto University

## ABSTRACT


#### Abstract

This paper describes area－time efficient VLSI algorithms for evaluating elementary functions．For square rooting，a VLSI implementation of $A T^{2}$－optimal，i．e．，$A T^{2}=O\left(n^{2}\right)$ ， circuits are presented，where $A$ is the chip area and $T$ is the computation time in the range $\left[\Omega\left((\log n)^{1+\varepsilon}, O(\sqrt{n})\right]\right.$ for arbitrary $\varepsilon>0$ ．For logarithms and exponentials，an upper bound $A T^{2}=O\left(n^{2}(\log n)^{2}\right)$ is given for any $T \in\left[\Omega\left((\log n)^{2+\varepsilon}, O(\sqrt{n} \log n)\right]\right.$ ．VLSI circuits for these functions with minimum computation time $O(\log n)$ and almost optimal $A T^{2}$－preformance，i．e．，$A T^{2}=O\left(n^{2+\varepsilon}\right)$ ，are also exhibited．These are achieved by using an extension of the Beame－Cook－Hoover Method which we have proposed before．


## 1．Introduction

Much research has long been conducted on arithmetic operations，such as addition， multiplication，division，square rooting and evaluation of elementary functions．Recent advances in large－scale integration technology of circuits have especially motivated the research on hardware algorithms for arithmetic operations，suitable for VLSI implemen－ tations．

In VLSI circuits，the chip area is a more reasonable cost measure than the number of gates．Considering this，VLSI models are proposed as more practical computation models［1］［2］，and much attention has been paid about area－time tradeoffs for many basic operations such as arithmetic operations．

For multiplication，$A T^{2}$－optimal，i．e．，$A T^{2}=O\left(n^{2}\right)$ ，multipliers are known for all computation times in the range $[\Omega(\log n), O(\sqrt{n})]$［3］．For division，Mehlhorn and Preparata exhibited an $A T^{2}$－optimal design of dividers with computation times in the range $\left[\Omega\left((\log n)^{1+\varepsilon}\right), O(\sqrt{n})\right]$ for arbitrary constant $\varepsilon>0$［4］，utilizing Beame－Cook－ Hoover division technique．Mehlhorn also exhibited $A T^{2}$－optimal square rooting circuits for all computation time in the range $\left[\Omega\left((\log n)^{3}\right), O(\sqrt{n})\right][5]$ ．

In this paper，we present several fast and area－time efficient algorithms for evaluat－ ing elementary functions，and give some new upper bounds of the $A T^{2}$－complexity of these functions．First we describe $A T^{2}$－optimal square rooting circuits for all computa－ tion times in the range $\left[\Omega\left((\log n)^{1+\varepsilon}\right), O(\sqrt{n})\right]$ for arbitrary $\varepsilon>0$ ．Next we exhibit
circuits evaluating logarithms and exponentials with $A T^{2}=O\left(n^{2} \log ^{2} n\right)$ for all computation times in the range $\left[\Omega\left((\log n)^{2+\varepsilon}\right), O(\sqrt{n} \log n)\right]$ for arbitrary $\varepsilon>0$. These are based on the arithmetic-geometric mean iteration [6], and implemented using our square rooting circuits. We also consider implementations of circuits with minimum computation time, and present circuits computing these functions with time $O(\log n)$ and area $O\left(n^{2+\varepsilon}\right)$ for arbitrary $\varepsilon>0$.

Throughout this paper, we are concerned with the evaluation of square root $\sqrt{x}$ $(1 / 4 \leq x<1)$, exponential function $\exp x(0 \leq x<\ln 2)$, logarithmic function $\ln x(1 \leq x<2)$, etc., for an $n$-bit unsigned fixed-point binary number $x$, with absolute error $<2^{-n}$. For any number, the approximate values of these functions can easily be computed from the value for $x$ in the above domains. We adopt the VLSI model developed by Brent and Kung [2] as our computation model, and consider time and area as our cost measures.

## 2. Area-Time Optimal Square Rooting Circuits for $T=\Omega\left((\log n)^{1+\varepsilon}\right)$

In this section, we present an area-time optimal square rooting circuit for $T=\Omega\left((\log n)^{1+\varepsilon}\right)$. Our square rooting algorithm is a modification of MehlhornPreparata's area-time optimal division [4]. In our algorithm, $1 / \sqrt{x}$ is first evaluated by the Newton iteration as a root of the equation $u^{-2}-x$, and $\sqrt{x}$ is calculated by the equation $\sqrt{x}=x \cdot(1 / \sqrt{x})$. The iteration rule is

$$
u_{i+1}=\frac{1}{2} u_{i}\left(3-x u_{i}^{2}\right) .
$$

Since the Newton iteration has a self-correcting property, it is sufficient to compute with $2^{i+1}$-bit precision at the $i$-th stage of the iteration. Using this rule, it is immediately shown that there exists an $A T^{2}$-optimal, i.e., $A T^{2}=O\left(n^{2}\right), n$-bit square rooting circuit for any $T \in\left[\Omega\left((\log n)^{2}\right), O(\sqrt{n})\right]$. (See [5].)

We now describe an area-time optimal square rooting circuit with computation time $T$ in the range $\left[\Omega\left((\log n)^{1+\varepsilon}\right), O\left((\log n)^{2}\right)\right]$ for any $\varepsilon>0$, which is a modification of Mehlhorn-Preparata's area-time optimal divider with computation time in the same range [4]. It is easily shown that almost all techniques for division are also applicable to square rooting, except for the polynomial approximation of inverses; the approximation, however, essentially depends on the peculiarity of the Taylor expansion of $1 / x$.

Instead of the polynomial approximation, we present a new technique for approximation of inverse of square roots, based on the Newton iteration. Consider the algorithm described below:

## Algorithm INSQRT2( $x$ )

Given: an integer $s \in[2, l)$ and an integer $k=\left\lceil\log _{4} s\right\rceil$
Input: an $l$-bit number $x \in[1 / 4,1)$
Output: an $l+3$-bit number $v \in(1,2]$ s.t. $v$ gives the leading $(l+3)$ bits of $1 / \sqrt{x}$
function $f_{x}{ }^{k}(u \in[1,2)$ :real number $)$ : real number; begin
(f1) for $i:=1$ to $k$ do $u:=1 / 2 u\left(3-x \cdot u^{2}\right)$;
(f2) $f_{x}{ }^{k}:=u$;
end;
(1) begin $i:=0 ; u_{0}:=1$;
(2) while $1-x \cdot u_{i}^{2}<2^{-(l+3)}$ do begin
(3) $u_{i+1}:=$ leftmost $2^{i+1}$ bits of $f_{x}{ }^{k}\left(u_{i}\right)$;
(4) $i:=i+1$ end;
(5) $v:=u_{i}$;
end.
In line (3) of Algorithm INSQRT2,

$$
f_{x}{ }^{k}(u)=f_{x}\left(f_{x}\left(\cdots f_{x}(u) \cdots\right)\right)
$$

is evaluated in one step, instead of $k$-time evaluation of $f_{x}$.
The following theorem tells us the area-time preformance of computing $f_{x}{ }^{k}(u)$ in lines (f1)-(f2) [7]. This theorem is proved by using an extension of Beame-CookHoover's method [8].

Proposition 1. [Okabe, Takagi and Yajima, 1987] There exists a circuit which computes the $n$-bit approximate value of a polynomial of degree $k$ in an $n$-bit number, in time $O(\log (n k))$ and with area $O\left((n k)^{2}\right)$ if $k \geq O(\log n)$ or $O\left(n^{2} k \log n\right)$ if $k<O(\log n)$.

Since $f_{x}{ }^{k}(u)$ is a polynomial of degree $4^{k}(\approx s)$ in $u$ and $x, f_{x}{ }^{k}(u)$ can be computed in time $O(\log l)$ and with area $O\left(l^{2} s(s+\log l)\right)$. The number of iterations required in lines (2)-(4) is $\left\lceil\log _{2} l\right\rceil / k$, and thus the total computation time of the square rooting circuit is $O\left(\log ^{2} l / \log s\right)$ and the chip area is $O\left(l^{2} s(s+\log l)\right)$. We now obtain the following lemma:

Lemma 1. For any $s \in[2, l]$, there exists a circuit which computes the $l$-bit inverse of the square root of an $l$-bit number in time $O\left((\log l)^{2} / \log s\right)$ and has area $O\left((l s)^{2}\right)$ if $s \geq O(\log l)$ or $O\left(l^{2} s \log l\right)$ if $s<O(\log l)$.

We are now ready to describe our square rooting algorithm with optimal $A T^{2}$ performance for the computation time $T=O\left(\log ^{1+\varepsilon} n\right)$ for any $\varepsilon>0$.

The successive refinement technique [4] for square rooting can be written as:

```
Algorithm INSQRT3(x)
Given: an integer sequence \(l_{1}<l_{2}<\ldots<l_{J}=l\)
Input: an \(l\)-bit number \(x \in[4 / 1)\)
Output: an \(l\)-bit number \(v \in(1,2]\) s.t. \(v x=1+\varepsilon, \varepsilon<2^{-l}\)
(1) begin \(v:=1\);
(2) for \(i:=1\) to \(J\) do
(3) begin \(t_{i}:=\) leftmost \(\left(l_{i}+1\right)\) bits of \(x\);
                \(z_{i}=v^{2} t_{i} ;\)
                    \(x_{i}:=\) leftmost \(\left(l_{i}+1\right)\) bits of \(z_{i} ;\)
                        \(v_{i}=\left(\right.\) leftmost \(\left(l_{i}+1\right)\)-bit overinverse of square root of \(\left.x_{i}\right) ;\left\{\right.\) i.e., \(\left.\left[2^{l_{i}^{+1}} / \sqrt{x_{i}}\right] \cdot 2^{-\left(l_{i}+1\right)}\right\}\)
                        \(v:=v \cdot v_{i} ;\)
        end;
    end.
```

In line (6), Algorithm INSQRT2 presented in the previous subsection is called as a subroutine. To proceed the same discussion as in the case of division, we need the next lemma:

Lemma 2. If an $l$-bit number $x \in[/ 4,1)$ has $\left(l^{\prime}-1\right)$ zeros immediately to the right of the leading 1, the $l$-bit inverse of the square root of $x$ can be computed in time $\left.T=O\left(\log \left(l / l^{\prime}\right) \cdot \log l / \log s\right)\right)$ and with area $A=O\left((l s)^{2}\right)$, for any $s \in\left[O(\log l), l / l^{\prime}\right]$.

Proof. In Algorithm INSQRT2, the first approximation $u_{0}=1$ has a precision of at least $l^{\prime}$ bits. This implies that $O\left(\log \left(l / l^{\prime}\right)\right)$ iterations are sufficient to compute $1 / \sqrt{x}$ to a precision of $l$ bits.

For $i=2, \ldots, J$, it is obviously verified that $x_{i}$ in Algorithm INSQRT3 satisfies the condition of $x$ in the above lemma for $l=l_{i}$ and $l^{\prime}=l_{i-1}$.

Only a straightforward discussion remains. Choose

$$
l_{i}=\frac{n}{(\log n)^{1+\varepsilon} s_{i}}, \quad s_{i}=\left(\frac{n}{(\log n)^{1+\varepsilon}}\right)^{1 / s s(\log n)^{\varepsilon(i-1)}}(i=1, \ldots, J)
$$

where $J$ is a largest value of $i$ for which $s_{i}>2$. (Indeed $J=\theta(1 / \varepsilon)$.) Then it is proved that the successive refinement modules based on Algorithm INSQRT3 can be implemented as a VLSI circuit with $T=O\left((\log n)^{1+\varepsilon}\right), A=O\left(n^{2} /\left((\log n)^{1+\varepsilon}\right)^{2}\right)$, though we will omit the proof.

The Newton iteration techniques utilized here are completely equal to those of Mehlhorn-Preparata's for their division algorithms [4]. Thus we have:

Theorem 1. For any fixed $\varepsilon>0$, the $n$-bit square root of an $n$-bit number can be evaluated with optimal $A T^{2}$-performance $O\left(n^{2}\right)$ for any $T \in\left[\Omega\left((\log n)^{1+\varepsilon}\right), O(\sqrt{n})\right]$.

Note that similar results can also be derived for the computation of $\sqrt{k} \sqrt{x}$ (the $k$-th root of $x$ ) for any fixed integer $k$.

On the other hand, by choosing $s$ as $s=l^{\varepsilon}(1 \geq \varepsilon>0)$ in Algorithm INSQRT2, the resulting circuit achieves $T=O((1 / \varepsilon) \log l)$ and $A=O\left(l^{2(1+\varepsilon)}\right)$. Thus we get the following result:

Theorem 2. For any $\varepsilon>0$, there exists a circuit which computes the $n$-bit square root of an $n$-bit number in time $O(\log n)$ and with area $A=O\left(n^{2+\varepsilon}\right)$.

## 3. Area-Time Efficient Evaluation of Exponentials and Logarithms

In this section, we will consider the area-time efficient implementation of circuits for exponentials and logarithms.

First we present an algorithm for evaluating $n$-bit logarithms with $A T^{2}=O\left(n^{2} \log ^{2} n\right)$ based on the arithmetic-geometric mean iteration of Gauss [6]. The algorithms is as follows:

Algorithm LOG1 $(x)$
Input: an $n$-bit number $x \in[1,2)$
Output: an $n$-bit number $y \in[0, \ln 2)$ s.t. $|y-\ln x|<2^{-n}$
(1) begin $a_{0}:=1 ; b_{0}:=2^{2-n} \cdot x^{-1} ; i:=0$;
(2) while $a_{i}-b_{i}>2^{-n}$ do begin
(3) $a_{i+1}:=\left(a_{i}+b_{i}\right) / 2$;
(4) $\quad b_{i+1}:=\sqrt{a_{i} \cdot b_{i}}$;
(5) $i:=i+1$; end;
(6) $y:=\pi /\left(2 a_{i}\right)-n \ln 2$ end.

Using the result on area-time optimal square rooting circuits, the following theorem follows immediately.

THEOREM 3. For any fixed $\varepsilon>0$, the $n$-bit logarithm of an $n$-bit number can be calculated with $A T^{2}=O\left(n^{2}(\log n)^{2}\right)$ for any computation time $T \in\left[\Omega\left((\log n)^{2+\varepsilon}\right), O(\sqrt{n} \log n)\right]$.

Once an efficient evaluation of logarithms is established, the exponential function $y=\exp x$ can be evaluated as the root of the equation,

$$
f(y)=\ln y-x=0
$$

for any $x$, by applying the Newton method. The algorithm can be written as:

Algorithm EXP1 $(x)$
Input: an $n$-bit number $x \in[0, \ln 2)$
Output: an $n$-bit number $y \in[1,2)$ s.t. $|y-\exp x|<2^{-n}$
begin
(1) $y_{0}:=(l$-bit approximation of $\exp x) ; i:=0$;
(2) while $|\ln y-x|>2^{-n}$ do
begin
(3) $y_{i+1}:=y_{i}(1+x-\ln y)$;
(4) $i=i+1$;
end;
(5) $y:=y_{i}$
end.
We now consider the efficient implementation of this algorithm for each computation time $T$ in the range $T \in\left[\Omega\left((\log n)^{2+\varepsilon}\right), O(\sqrt{n} \log n)\right]$.

THEOREM 4. For any fixed $\varepsilon>0$, the $n$-bit exponential of an $n$-bit number can be evaluated with $A T^{2}=O\left(n^{2}(\log n)^{2}\right)$ for any computation time $T \in\left[\Omega\left((\log n)^{2+\varepsilon}\right), O(\sqrt{n} \log n)\right]$.

The theorem is easily proved for $T$ in the range $\left[\Omega\left((\log n)^{4}\right), O(\sqrt{n} \log n)\right]$, by choosing $y_{0}:=1$ as an initial approximate value in line (1), utilizing the efficient Newton iteration technique in [5] and evaluating logarithms in line (3) by Algorithm LOG1. Thus we consider the implementation of Algorithm EXP1 for $T \in\left[\Omega\left((\log n)^{2+\varepsilon}\right), O\left((\log n)^{4}\right)\right]$.

First we propose a new algorithm for fast evaluation of a good initial $l$-bit approximate value of $\exp x$ in line(1). Let $x$ be an $l$-bit binary number, and suppose $l=s^{k}$. Let $x_{0}$ be the leftmost bit of $x$, and $x_{i}$ be the $\left(s^{i-1}+1\right)$-th to $s^{i}$-th bits of $x(i=1, \ldots, k)$. Then $x=x_{0}+x_{1}+x_{2}+\cdots+x_{k}$ and therefore

$$
\exp x=\exp \left(x_{0}+x_{1}+x_{2}+\cdots+x_{k}\right)=\left(\exp x_{0}\right)\left(\exp x_{1}\right)\left(\exp x_{2}\right) \cdots\left(\exp x_{k}\right) .
$$

This leads to the algorithm shown below:

## Algorithm $\operatorname{EXP} 2(x)$

Given: integers $s, k$ s.t. $s \leq l$ and $k=\left\lceil\log _{s} l\right\rceil$
Input: an $l$-bit number $x \in[0, \ln 2)$,
Output: an $l$-bit number $y \in[1,2)$, s.t. $|y-\exp x|<2^{-l}$
(1) begin
(2) for $i:=1$ to $k\{$ in parallel $\}$ do begin
(3) $x_{i}:=$ the $\left(s^{i-1}+1\right)$-th to $s^{i}$-th bits of $x$;
(4) $y_{i}=2 l$-bit approximation of $\exp x_{i}$; end;
(5) $y:=\prod_{i=1} y_{i}$
end.
Consider the Taylor expansion of $\exp x_{i}$,

$$
\exp x_{i}=1+x_{i}+(1 / 2!) x_{i}^{2}+\cdots+(1 / n!) x_{i}^{n}+\cdots
$$

where $x_{i}<2^{-s^{i-1}}$.
This means that the sum of the first $2 l / s^{i-1}$ terms is an approximate value of $\exp x_{i}$ with the precision of $2 l$ bits, since $x_{i}^{2 l / s^{i-1}}<2^{-2 l}$. Thus we may assume that $y_{i}$ in line (4) is a polynomial of $x_{i}$ of degree $2 l / s^{i-1}$.

Since $x_{i}$ is an $\left(s^{i}-s^{i-1}\right)$-bit number, Step (4) is carried out in time $O(\log l)$ and with area $O\left(l^{2} s(s+\log l)\right.$ ) from Proposition 1. (Note that $\left(s^{i}-s^{i-1}\right) \cdot\left(2 l / s^{i-1}\right)=O(l s)$.) All $x_{i}$ 's can be calculated in parallel (lines (2)-(4)), and multiplied up pairwise in a binary-tree-form (line (5)). Thus computation time required in line (2)-(4) is $O(\log l)$ and area $O\left(k \cdot l^{2} s(s+\log l)\right)$. Using $(2 k-1)$ multipliers with time $O(\log l)$ and area $O\left(l^{2}\right)$ in line (5), their product can be obtained in time $O((\log k) \cdot(\log l))$ and with area $O\left(k \cdot l^{2}\right)$. Hence the computation time $O((\log k) \cdot(\log l))$ and the chip area $O\left(k \cdot l^{2} s(s+\log l)\right)$ in line (5). Thus we have the following lemma:

Lemma 3. For any $s \in[2, l]$, there exists a circuit which computes the $l$-bit exponential of an $l$-bit number in time $O((\log l) \log (\log l / \log s))$ and has area $O\left((l s)^{2} \log l / \log s\right)$ if $s \geq O(\log l)$ or $O\left(l^{2} s \log ^{2} l / \log s\right)$ if $s<O(\log l)$.

Let us continue the proof of the theorem for $T \in\left[\Omega\left(\log ^{2+\varepsilon} n\right), O\left(\log ^{4} n\right)\right]$. Let $l=n / T$, and consider the Algorithm EXP1 for this $l$. From the above lemma for $s=2$, it follows that there is a circuit, say $F_{E}$, which computes the initial $l$-bit approximation of $\exp x$ in line (1) in time

$$
T_{E}=O((\log l) \log \log l)=O((\log n) \log \log n)
$$

and has area

$$
A_{E}=O\left(l^{2} \log ^{2} l\right)=O\left(n^{2} \log ^{2} n / T^{2}\right)
$$

For this initial approximate value $x_{0}$, the number of required iterations of lines (2)(4) is $m=\left\lceil\log _{2}(n / l)\right\rceil=\left\lceil\log _{2} T\right\rceil$. Suppose $c \geq 0$ be a number which satisfies $T>\log ^{2+c} n$. Let $i_{0}=\left\lceil\log _{2}\left(\log ^{2+c} n\right)\right\rceil$. We realize the first $i_{0}$ iterations on a single circuit $F$ using feedback, and the later $m-i_{0}$ iterations on their own circuits $F_{i_{0}+1}, \ldots, F_{m}$ respectively.

At each iteration step, $F$ computes at most $2^{i_{0}}=\left\lceil n \log ^{3} n / T\right\rceil$-bit logarithm and multiplication. From Theorem 3, there exists a circuit $F$ which computes each step of the former $i_{0}$ iterations in time $T_{F}=O\left(\log ^{2+c} n\right)$ and has area

$$
A_{F}=O\left(\left(l \cdot 2^{i_{0}}\right)^{2} \cdot\left(i_{0}+\log _{2} l\right)^{2} /\left(\log ^{2+c} n\right)^{2}\right)=O\left(n^{2} \log ^{2} n / T^{2}\right)
$$

The circuit $F_{m-j}$ computes the $2^{m-j}$-bit logarithm and $2^{m-j}$-bit multiplication. We choose the computation time of $F_{m-j}$ as

$$
T_{m-j}=T / 2^{j / 2}
$$

for each $j=0, \ldots, m-i_{0}-1$. Note that

$$
T_{i_{0}+1}=T / l^{\left(m-i_{0}+1\right) / 2}=O\left(\left(T \log ^{2+c} n\right)^{1 / 2}\right)>O\left(\log ^{2+c} n\right)
$$

From the result of Theorem 3, there exists a circuit $F_{j}$ which has area

$$
A_{m-j}=O\left(\left(2^{m-j}\right)^{2}(m-j)^{2} / T_{j}^{2}\right)=O\left(\left(\left(2^{m}\right)^{2} m^{2} / T^{2}\right) 2^{-j}\right)=O\left(\left(n^{2} \log ^{2} n / T^{2}\right) 2^{-j}\right)
$$

for any $j=0, \ldots, m-i_{0}-1$.
These imply that our circuit has area

$$
A_{r}=A_{E}+A_{F}+\sum_{j} A_{j}=O\left(n^{2} \log ^{2} n / T^{2}\right)+\sum_{j} O\left(\left(n^{2} \log ^{2} n / T^{2}\right) 2^{-j}\right)=O\left(n^{2} \log ^{2} n / T^{2}\right)
$$

and computes the $n$-bit logarithm in time

$$
T_{r}=T_{E}+T_{F} \cdot i_{0}+\sum_{j} T_{j}=O((\log n) \log \log n)+O\left(\log ^{1+c} n\right) \cdot O(\log n)+\sum_{j} T / 2^{-j / 2}=O(T)
$$

since $T \geq O\left(\log ^{2+c} n\right)$. Thus the theorem follows.
Next, let us consider the time-optimal, i.e. $T=O(\log n)$, circuits for exponentials and logarithms, with suboptimal $A T^{2}$-performance.

Consider Algorithm EXP2 in the previous subsection, and choose $s=l^{\varepsilon / 3}$ for any $1 \geq \varepsilon>0$. The Circuit in Lemma 3 operates in time $T=O(\log l)$ and has area $A=O\left(l^{2+(2 / 3) \varepsilon}\right)$. Thus we have:

Theorem 5. For any $\varepsilon>0$, there exists a circuit which computes the $n$-bit exponential of an $n$-bit number in time $O(\log n)$ and with area $O\left(n^{2+\varepsilon}\right)$.

We will show the corresponding result for logarithms. Let $x$ be an $l$-bit binary number and suppose $l=s^{k}$. It is easily verified that $x$ can be written as $x \simeq \eta_{1} \cdot \eta_{2} \cdots \cdot \eta_{k}$, where $\eta_{i}$ is an $s^{i}$-bit number which has at least $s^{i-1}$ consecutive 0 's immediately to the right of the leading 1 . Since

$$
\ln x=\ln \left(\eta_{1} \cdot \eta_{2} \cdots \eta_{k}\right)=\ln \eta_{1}+\ln \eta_{2}+\cdots+\ln \eta_{k},
$$

$\ln x$ is obtained as:

Algorithm LOG2( $x$ )
Given: integers $s, k$ s.t. $s \leq l$ and $k=\left\lceil\log _{s} l\right\rceil$
Input: an $l$-bit number $x \in[1,2)$
Output: an $l$-bit number $y \in[0, \ln 2)$, s.t. $|y-\ln x|<2^{-l}$
(1) begin $x_{0}:=x ; y:=0$;
(2) for $i:=1$ to $k$ do begin
(3) $\quad \eta_{i}=$ leftmost $s^{i}$ bits of $x_{i-1}$;
(4) $\quad x_{i}:=x / \eta_{i}$;
(5) $\quad y_{i}=l$-bit approximation of $\ln \eta_{i}$;
(6) $y:=y+y_{i}$; end
end.
Consider the Taylor expansion of $\ln \eta_{i}$,
$\ln \eta_{i}=\ln \left(1+h_{i}\right)=h_{i}-(1 / 2) h_{i}{ }^{2}+\cdots+(-1)^{n}(1 / n) h_{i}{ }^{n}+\cdots$
As is obvious in line (3)-(4), $h_{i}<2^{s^{i-1}}$. This means that sum of the first $2 l / s^{i-1}$ terms is an approximate value of $\ln \eta_{i}$ with the precision of $2 l$ bits. Thus we may assume that $y_{i}$ is a polynomial of $h_{i}$ of degree $2 l / s^{i-1}$. Since $h_{i}$ is an $\left(s^{i}-s^{i-1}\right)$-bit number, $l$-bit approximation of $\ln \eta_{i}$ can be computed in time $O\left(\log \left(s^{i} \cdot\left(2 l / s^{i-1}\right)\right)\right)=O(\log (l s))$ and with area $O\left(\left(s^{i} \cdot\left(2 l / s^{i-1}\right)\right)^{2}\right)=O\left((l s)^{2}\right)$ (Proposition 1).

Choosing $s=l^{\varepsilon / 3}$ for any $1 \geq \varepsilon>0$, this satisfies $k=\theta(1 / \varepsilon)$. Step (4) can be carried out in time $O(\log l)$ using a divider with area $O\left(l^{2+(2 / 3) \varepsilon}\right)$ (see [4]). For each $i=1, \ldots, k$, $y_{i}$ can be evaluated (line (5)) and added up (line (6)) in time $O(\log (l s))=O(\log l)$ and with area $O\left((l s)^{2}\right)=O\left(l^{2}\right)$ Thus the total computation time of our circuit is $O(k \log l)=O(\log l)$ and the chip area is $O\left(l^{2+(2 / 3) \varepsilon}\right)$. We get:

TheOrem 6. For any $\varepsilon>0$, there exists a circuit which computes the $n$-bit logarithms of an $n$-bit number in time $O(\log n)$ and with area $O\left(n^{2+\varepsilon}\right)$.

## 4. Considerations

Trigonometric functions such as sines, cosines and arctangents can be evaluated from exponentials and logarithms with complex arguments, since

$$
\begin{aligned}
& \log (v+i w)=\log |v+i w|+i \cdot \arctan (w / v) \\
& \exp i \theta=\cos \theta+i \cdot \sin \theta
\end{aligned}
$$

[6]. It is not difficult to modify our algorithms to accept complex arguments. Thus the same upper bounds as for exponentials and logarithms can be obtained for these functions.

## Acknowledgements

We would like to express sincere appreciation to Dr. H. Hiraishi, Dr. N. Takagi, N. Ishiura, and all the members of the Yajima Laboratory at Kyoto University, for their helpful discussions. We would also like to thank Dr. K. Iwama in Kyoto Sangyo University and Dr. H. Yasuura in Kyoto University, for their comments and criticism on parallel and VLSI algorithms.

## References

[1] C. D. Thompson: "A Complexity Theory for VLSI", Tech. Rep. CMU-CS-80-140, Dept. of Comput. Sci., Carnegie-Mellon Univ., Pittsburgh, Pa (Jan. 1979).
[2] R. P. Brent and H. T. Kung: "The Area-Time Complexity of Binary Multiplication", JA CM, 28-3, 521-534 (1981).
[3] K. Mehlhorn and F. P. Preparata: "Area-Time Optimal VLSI Integer Multipiler with Minimum Computation Time", Inform. and Control, 58, 137-156 (1983).
[4] K. Mehlhorn and F. P. Preparata: "Area-Time Optimal Division for $T=\Omega\left((\log n)^{1+c}\right)$ ", Inform. and Comput. 72, 270-282 (1987).
[5] K. Mehlhorn: " $A T^{2}$-Optimal VLSI Integer Division and Integer Square Rooting", Integration, 2, 163-167 (1984).
[6] R. P. Brent: "Multiple-Precision Zero-Finding Methods and the Complexity of Elementary Function Evaluation", Proc. Symp. on Analytic Computational Complexity, J.F.Traub, Ed., Academic Press, New York, 151-176 (1976).
[7] Y. Okabe, N. Takagi and S. Yajima: "Log Depth Circuits for Elementary Functions Using Residue Number System", 1987 LA Symposium in Summer, (July 1987), in Japanese.
[8] P. W. Beame, S. A. Cook. and H. J. Hoover: "Log Depth Circuits for Division and Related Problems", SIAM J. Comput., 15-4, 994-1003 (1986).

