Computing logarithms is a fairly common operation in digital signal processing. Most often, only convolutions (multiplication with accumulation) and amplitudes with phases have to be considered. As a rule, to calculate the logarithms on an FPGA,
the CORDIC algorithm is used in a hyperbolic version, requiring only tables and simple arithmetic operations. However, this is not always convenient, especially if the project is large, the crystal is small and dances with optimization begin. It was with such a situation that I had to face one day. Both ports of the RAM block (Cyclone IV) were already working tightly, leaving no free windows. I did not want to use another block for hyperbolic CORDIC. But there was a multiplier, for which a decent free window appeared in the time diagram. After thinking a day, I composed the following algorithm, in which tables are not used, but there is multiplication, more precisely squaring. And since circuit squaring is simpler than the general case of multiplication, perhaps this algorithm is of interest to specialized microcircuits, although of course there is no difference for FPGAs. More details under the cut.
Explaining what's what, it's easier for real numbers. Let's start with them. We proceed to the integer implementation later.
Let there be a number
X. Find the number
Y such that
X=2Y .
We also assume that
X lies in the range from 1 to 2. This does not limit the generality too much, since
X can always be transferred to this interval by multiplication or division by a power of two. For
Y, this will mean adding or subtracting an integer, which is easy. So
X lies in the interval from 1 to 2. Then
Y will lie in the interval from 0 to 1. We write
Y as an infinite binary fraction:
Y=b020+b12−1+...+bn2−n+...
Odds
bi there is nothing else in this record than just the bits of the binary representation of the number
Y. Moreover, since
Y is less than 1, it is obvious that
b0 = 0.
Let's square our first equation:
X2=22Y and as before, we write the binary representation of
2Y . It's obvious that
2Y=b120+b22−1+...+bn2−(n−1)+...
Those. bits
bi remained the same, just the powers of two moved. Bat
b0 not present in the view because it is zero.
Two cases are possible:
one)
X2>2 , 2Y> 1,
b1=12)
X2<2 , 2Y <1,
b1=0In the first case, we take as the new value of
X X2/2 in the second -
X2 .
As a result, the task was reduced to the former. The new
X again lies in the range from 1 to 2, the new
Y from 0 to 1. But we learned one bit of the result. Taking the same steps in the future, we can get as many bits of
Y as possible.
Let's see how it works in a C program:
#include <stdio.h> #include <math.h> int main() { double w=1.4; double s=0.0; double a=0.5; double u=w; for(int i=0; i<16; i++) { u=u*u; if(u>2) { u=u/2; s+=a; } a*=0.5; } w=log2(w); double err=100*abs(2*(sw)/(s+w)); printf("res=%f, log=%f, err=%f%c\n",s,w,err,'%'); return 0; }
We calculated the logarithm with an accuracy of 16 bits and compared with what the mathematical library gives. The program brought:
res = 0.485413, log = 0.485427, err = 0.002931%
The result coincided with the library with an accuracy of 0.003%, which shows the efficiency of our algorithm.
Let's move on to an integer implementation.
Let N-bit binary unsigned numbers represent the interval [0, 1]. For convenience, we consider the unit number
2N , but not
2N−1 , and accordingly a deuce number
2 N + 1 . We will write a program in the image and likeness of the previous one, but working with integers:
#include <stdio.h> #include <math.h> #define DIG 18 // #define N_BITS 16 // unsigned ONE=1<<(DIG-1); // unsigned TWO=ONE<<1; // unsigned SCALE=1<<(N_BITS+1); // unsigned myLog(unsigned w) { unsigned s=0; unsigned long long u=w; for(int i=0; i<N_BITS+1; i++) { s<<=1; u=(u*u)>>(DIG-1); // ! if(u&TWO) // { u>>=1; s+=1; } printf("%X\n", (int)u); } return s; } int main() { double w=1.2345678; unsigned iw=(unsigned)(ONE*w); double dlog=log2(w); unsigned ilog=myLog(iw); unsigned test=(unsigned)(SCALE*dlog); int err=abs((int)(ilog-test)); printf("val=0x%X, res=0x%X, log=0x%X, err=%d\n",iw,ilog,test,err); return 0; }
Having played in a program with different bit depths (DIG), calculation accuracy (N_BITS), and logarithm arguments (w), we see that everything is calculated correctly. In particular, with those parameters that are specified in this source, the program produces:
val = 0x27819, res = 0x9BA5, log = 0x9BA6, err = 1
Now everything is ready in order to implement a piece of iron on veril, doing exactly the same thing as the
myLog function in C. The variables
s and
u in our function can be printed in a loop and compared with what the verilog simulator produces. The correspondence of these variables to the iron implementation is very transparent and understandable.
u is a working register that takes new values of
X during iterations.
s is a shift register in which the result is accumulated. The interface of our module will look like this:
module logarithm( input clk, // input wr, // input[17:0] din, // output[nbits-1:0] dout, // output rdy // ); parameter nbits=16; //
The input bus is adopted 18-bit, respectively, the width of the multipliers in Cyclone IV. The numbers on our module should come normalized. Those. with high bit equal to one. In my project, this was done automatically. But in which case to implement the normalizer, I think it’s not difficult for anyone. The accuracy of the calculations is set by the nbits parameter, by default equal to 16. The module counts one bit per cycle, and for 16 cycles it calculates the logarithm with an accuracy of 16 bits. If you need faster with the same accuracy or more precisely with the same speed, I hope it will not be difficult for anyone to divide the module into several devices and convey.
Here is the complete module and test code //--------------------- logarithm.v ------------------------------// module logarithm( input clk, // input wr, // input[17:0] din, // output[nbits-1:0] dout, // output rdy // ); parameter nbits=16; // reg[4:0] cnt; // reg[17:0] acc; // - reg[nbits-1:0] res; // always @(posedge clk) if(wr) cnt<=nbits+1; else if(cnt != 0) cnt<=cnt-1; wire[35:0] square=acc*acc; // wire bit=square[35]; // wire[17:0] next = bit ? square[35:18] : square[34:17]; // always @(posedge clk) if(wr) acc<=din; else if(cnt != 0) begin acc<=next; #10 $display("%X", acc); end always @(posedge clk) if(wr) res<=0; else if(cnt != 0) begin res[nbits-1:1]<=res[nbits-2:0]; res[0]<=bit; end assign dout=res; assign rdy=(cnt==0); endmodule //======================== testbench.v =====================// module testbench(); reg clk; // always #100 clk=~clk; reg wr; // reg[17:0] din; // wire rdy; // wire[15:0] dout; // logarithm log2( .clk (clk), .wr (wr), .din (din), .dout (dout), .rdy (rdy) ); // n task skipClk(integer n); integer i; begin for(i=0; i<n; i=i+1) @(posedge clk); #10 ; end endtask initial begin // $dumpfile("testbench.vcd"); $dumpvars(0, testbench); clk=0; wr=0; din=18'h27819; skipClk(3); wr=1; skipClk(1); wr=0; @(rdy); skipClk(3); $display("value=%X, result=%X", din, dout); $display("Done !"); $finish; end endmodule
Run the test with this script:
Running the test, we see the final output of the simulator - value = 27819, result = 9ba5. Verilog gave the same thing as C. The timing diagram here is quite trivial and is not of particular interest. Therefore, I do not bring it.
Compare the intermediate output of the simulator (acc) and the program in C (s):Verilog C
30c5d 30C5D
252b1 252B1
2b2bc 2B2BC
3a3dc 3A3DC
35002 35002
2be43 2BE43
3c339 3C339
38a0d 38A0D
321b0 321B0
273a3 273A3
30163 30163
24214 24214
28caf 28CAF
34005 34005
2a408 2A408
37c9d 37C9D
30a15 30A15
Make sure they match bit by bit. In total, the implementation on a verilo repeats the model on C up to a bit. This is the result that should be achieved by implementing algorithms in hardware.
That’s probably all. I hope someone will find this my experience useful.