Minimalist Floating Point Type
Introduction
The work presented here consists of a small, portable floating point data type and library ("SFP" for "Small Floating Point"). The primary design goal has been to target very minimal hardware. In particular, devices previously dominated by application-specific fixed-point mathematics are targeted by this work. Furthermore, this work attempts to target devices for which high-level language tools (with their built-in types and operations) do not exist, or are not practical.
The library is presented here in two forms. First, a C version is provided. This is designed to be compiled for target platforms. If necessary, this compilation can be done by hand. Second, this article includes a version written in assembly language, targeting Microchip Technology's "PIC" 8-bit microcontrollers. In particular, the PIC 16F690 and 16F1827 are targeted, and the last two major generations of 8-bit PIC microcontrollers are thus supported by the resultant code base, with only trivial changes.
Portions of the library have been ported to the Freescale Semiconductor 6800 family. These are not presented here, but are a possible topic for a future article.
Here are some highlights of the SFP system presented here:
- Super-wide dynamic range (10<sup>39</sup>)
- 2.4 digit precision
- Fully shift-based multiply and divide
- Table-based logarithm and exponentiation functions
- Correct rounding of results to nearest SFP approximation
- Exhaustively exercised in distributed testing
- Symmetrical design reduces bit arithmetic
- 15-byte peak memory usage (plus code)
- Flat function call model (zero internal calls)
- Optional reentrant implementation
- Three lightweight, non-conformant subsets
- Modular design; each operation is free-standing
Background
Floating point data types are a compact format for numerical data spanning a wide dynamic range. If one must store small numbers and large numbers together in the same type, it is cheaper with respect to storage to do so using a floating point type than using most other forms of representation. This is true because the internal representation of a floating point number includes an exponent. Just as it is quicker for a person doing calculations by hand to write a number in scientific notation (e.g. 1.1E+15 compared to its equivalent 1,100,000,000,000,000), it takes less space on a computer to represent very large (or very small) numbers in floating point. In both cases, it is the presence of the exponent that enables the briefer - and more efficient - representation. Truly general fixed point types, for example, end up being huge, e.g. the 128-bit C# decimal.
Data size notwithstanding, there are some real obstacles to implementing floating point operations on the very same cheap hardware as could otherwise benefit from its compactness. The smallest standardized floating point type until very recently was the IEEE 754 binary floating point "single," a 32-bit type [1]. Non-IEEE efforts tended to adopt a 32-bit minimum size as well. Hardware types invariably used a size of at least 32 bits, too (e.g. [4]).
Beyond their innate size (4 machine words on an 8-bit integer device), such types typically have a 24-plus-bit significand (the fractional portion of the number stored alongside the exponent, sometimes called the "mantissa") that thus spans multiple machine words on 8-bit hardware. Floating point routines end up having to do what amounts to 24-bit fixed point arithmetic - along with the bit level operations necessary to extract the significand.
Smaller floating point data types than the IEEE "single" were, until recently, not standardized and mostly intended for classroom instruction; see, for example, this survey of formats. By 2008, though, 16-bit floating point data types were just beginning to gain more prominence. IEEE standardized a 16-bit "half precision" binary floating point type in 2008, albeit in a form that left the multi-byte significand, and the difficulties associated with it, in place.
This new IEEE 16-bit type also does not offer the range to serve as the basis of a general-purpose mathematics library; the largest representable number is ~65,000, similar to that of an unsigned 16-bit fixed point number. Nevertheless, the decision made by IEEE to essentially shrink their floating point standard is an atypical and interesting one, made as it was in days when programmers invoke Moore's Law as often as they stop to think about optimization. More to the point, this design decision reinforces the potential usefulness of the sub-32-bit floating point type. This new IEEE type is not an academic curiosity; it has become an important part of 3D graphics, e.g. of the architectures of DirectX and OpenGL, and of compatible GPUs.
The rest of this submission consists of an attempt to develop a general-purpose floating point data type that is even more amenable to implementation on 8-bit hardware than the IEEE "half precision" type. The representation developed uses an 8-bit exponent in conjunction with a 7-bit significand and a sign bit. The byte in which the latter two values are stored is termed the "significand byte" in the text below.
The exponent is a two's complement number (for easy addition and subtraction), while the significand byte holds the significand proper in bits 0-6, with the sign bit in bit 7. The significand proper's portion of the significand byte is an unsigned 7-bit number which ends up getting applied in conjunction with an implicit denominator of 128 and an implicit 8th (top) bit of 1. The number with significand byte M and exponent byte E therefore evaluates as:
((M + 128)/128) × 2<sup>E</sup>

when the sign bit is clear, and:

-((M' + 128)/128) × 2<sup>E</sup>

otherwise, where M' is equal to M modulo 128. These two formulas can be unified into one as shown below, where s is the sign bit:

(-1)<sup>s</sup> × (((M mod 128) + 128)/128) × 2<sup>E</sup>

The smallest representable quantity is therefore:

±((0 + 128)/128) × 2<sup>-128</sup>

This evaluates to plus or minus 2<sup>-128</sup>. The largest representable quantity is:

±((127 + 128)/128) × 2<sup>127</sup> = ±(255/128) × 2<sup>127</sup>

This value is approximately (just less than) plus or minus 2<sup>128</sup>.
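For readers who want to sanity-check these formulas, a minimal decoding routine is sketched below. This is illustrative C for a desktop host, not part of the SFP library itself:

#include <math.h>

/* Evaluate the real value of an SFP pair (illustrative only). */
double sfp_value(unsigned char m, signed char e)
{
    double sig = ((m % 128) + 128) / 128.0; /* implicit top bit of 1 */
    if (m & 128)                            /* bit 7 holds the sign  */
        sig = -sig;
    return ldexp(sig, e);                   /* multiply by 2^E       */
}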
Necessarily, tradeoffs have been made which will render the SFP type unsuitable for many purposes. The 8-bit value of the significand (including its implicit top bit) yields:
log<sub>10</sub>(2<sup>8</sup>) = log<sub>10</sub>(256) = 2.4
significant digits, which is comparatively poor.
However, the test applications developed for the new data type ("SFP") also demonstrate some potential usefulness in real world applications. A PID position controller was developed, and exhibited good stability and tuning characteristics compared to an earlier, fixed-point version of the same. Additionally, programs were written to calculate very large factorials on some cheap (less than $2.00, 2-volt operation, 256 bytes of RAM) 8-bit microcontrollers, with surprising accuracy.
Moreover, in all cases, the use of floating point resulted in a more abstract, natural, and universally applicable programming idiom. Fixed point representations implemented on an 8-bit integer device necessarily "require the programmer to keep track of implicit scale factors," [5] which will end up being application-specific. Such details are abstracted away by the use of floating point, allowing the programmer to freely refer to, say, 0.001 and 123456789.0 using the same type and the same operations.
A comparison with the IEEE 16-bit and 32-bit floating point data types is made in the table below:
TYPE    SIZE   EPSILON*  RANGE    PRECISION
IEEE**  16bit  4.88e-04  6.8e+05  3.3 digits
IEEE**  32bit  5.96e-08  4.6e+38  7.2 digits
SFP     16bit  8.84e-02  3.3e+39  2.4 digits
* A measure of peak rounding error
** In "Round Nearest" mode; rounding for logarithms, etc. not specified for IEEE types
The range of the SFP type is superior to that of both IEEE types. SFP can thus represent a wider array of numbers than these types. In particular, the deficiency of the IEEE half precision type with respect to even moderately large or small numbers has been addressed. The reduction in precision, especially compared to the 32-bit type, is an obvious tradeoff.
The IEEE half precision format, while deviated from, was selected as a starting point for the SFP design decisions. Targeted changes were made to the representation used by IEEE, and these are enumerated below this paragraph. Most of these relate to the role anticipated for SFP, i.e. selected, well-understood applications running on low-cost / low-power 8-bit computing devices like the PIC and 6800 families:
- The exponent is expressed in standard two's complement, vs. exponent bias (or "excess") format. This allows single-operation addition and subtraction of exponents, which are important operations, while also making comparison of SFP numbers slightly more difficult.
- More bits are allocated to the exponent. This increases dynamic range to a level suitable for general-purpose use.
- IEEE has special values set aside for codes (infinity, NaN, etc.). In SFP, out-of-domain conditions "bump" intuitively into the extrema of the SFP type instead. Or, if this is not possible, behavior is undefined.
There are other differences, mostly involving the scope of each type and the allowability of subsets. IEEE mandates the existence of printing and conversion functions, and specifies some strict standards for these. IEEE floating point also requires more mandatory library functions, including a more general "power" function. Interestingly, IEEE floating point also does not mandate a rounding strategy for what it terms "library" functions (versus core operations), and this includes the logarithm function. SFP uses the same rounding strategy for logarithms as for the other functions.
Other aspects of IEEE are adopted without major change:
- The normalization strategy is the same; the entire term other than the factor 2<sup>E</sup> lies between 1 and 2 (or -1 and -2). Since the exponent's base is 2, a given real number can only be represented using one possible exponent.
- The representations of the significand and sign bit are identical in format to IEEE.
- The IEEE "round-to-nearest" or "RN" strategy is used. (Round-to-even is not enforced for ties, a fact which has no impact on accuracy or precision.)
- Selection of a radix of 2 is shared with IEEE and all other modern float formats.
Ultimately, it is hoped that what is presented below offers a novel perspective on what might be called "low-budget calculating," although the limitations of the type must admittedly rule out many applications. One final point in favor of SFP is that it has been exhaustively tested using brute force comparison of results, for all possible parameters, to results obtained using GCC floating point. These efforts are documented further below; suffice it to say here that floating point arithmetic represents a problem domain in which high-profile problems have been the rule, and testing itself has been a challenge (e.g. [2] and [3]). The original version of the Intel Pentium is a prominent example; also, the unsuccessful IBM RT PC was "scandalized mid-life with the discovery of a bug in the floating point square root routine." Every effort has been made to avoid such problems in the construction of the SFP library.
The C Implementation
File "sfp.cpp," present in the root folder of the source code archive, contains the C implementation of SFP. This was developed using the GNU C++ compiler (GCC; more specifically, MinGW) and is intended as a demonstration and starting point. Portions of "sfp.cpp" devoted to testing the library use C++ features. The implementation proper uses only features of the C language.
In unedited form, this program is of little use; GCC already contains a full retinue of built-in floating point types. The real utility of SFP lies in its portability to varied hardware. It is the intent of the author that selected subsets of "sfp.cpp" be excised and compiled (by hand, if necessary) for a variety of 8-bit devices. Nevertheless, "sfp.cpp" serves as the central locus for the definition and testing of SFP, and this program is explored in detail before any of the specific implementations.
The build command used during development is shown below this paragraph. This will build file "sfp.exe." Because MinGW was used during development, on a Windows system, the full path to the compiler is specified; on a UNIX system with better access to the GNU compilers, this may not be necessary, i.e. the command can start with "c++," omitting the folder path and extension.
c:/mingw/bin/c++.exe sfp.cpp -osfp.exe
This should be executed from the root folder of the source code download, i.e. from the folder where "sfp.cpp" is resident. A command shell (UNIX) or Command Prompt window (Windows) should be used.
When loaded, "sfp.exe" enters a comprehensive test loop at the top of "main()
." This is designed to test its results for accuracy against the compiler's floating point system. Such a test will typically span several days, and a "/reverse
" command-line option is provided to help distribute the load of testing. Specifically, it is possible to execute "sfp.exe" on one computer, and simultaneously begin the command "sfp.exe /reverse
" on another, and the computers will attack the problem of comprehensive testing from opposite ends of the test sequence. This technique was used during final testing of SFP, to ensure not just 100% coverage of code, but also 100% coverage of all possible parameter values. If the program completes without halting and displaying the verbiage "EXCEPTION!" then the comprehensive test has succeeded.
The rest of "sfp.cpp" other than the test code consists mainly of static
methods implementing the supported operations. In general, these have a signature somewhat like this:
typedef unsigned char byte;
void sfp_sub
(
byte min, char ein,
byte min2, char ein2,
byte *mout, char *eout
);
Parameters "min
" and "ein
" are the input significand and exponent bytes for the left-hand operand, respectively. Similarly, the parameters "min2
" and "ein2
" are the input significand and exponent bytes for the right-hand operand. Pointers "mout
" and "eout
" point to the output significand and exponent bytes, respectively.
The supported operations are addition, subtraction, multiplication, division, exponentiation (2<sup>x</sup>), and logarithm (base 2). The names used for these in the C implementation are, in order, sfp_add, sfp_sub, sfp_mul, sfp_div, sfp_pow, and sfp_log.
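A hypothetical call, with hand-encoded operands (these example values follow the format described above; they are not taken from the test suite), might look like this:

byte m_out; char e_out;

/* 1.5 = (192/128)*2^0 -> min = 64, ein = 0
   2.0 = (128/128)*2^1 -> min = 0,  ein = 1 */
sfp_mul(64, 0, 0, 1, &m_out, &e_out);

/* Expect 3.0 = (192/128)*2^1, i.e. m_out == 64 and e_out == 1. */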
"Primitive C"
The library functions in "sfp.cpp" are written using a subset of C, which has been designated "Primitive C." This special dialect is designed to record the operations performed by the library code at the byte level, assuming only the most basic capabilities (8-bit addition, subtraction, shift, and bitwise operations).
Primitive C lacks control structures, as well as types wider than 8 bits, except for the brief use of wider types in overflow checks. In short, it does not allow anything that might be absent from a hypothetical 8-bit system, or just difficult to hand-compile. The only multiplication and division allowed are multiplication and division by 2, i.e. single-bit shifts; this reflects the capabilities offered by the most rudimentary processors, such as the PICs used here. On systems without high level language capabilities, Primitive C is designed to serve as a sort of generic quasi-assembly language, and to be quickly translatable into real code.
The use of this special subset of C seems likely to provoke criticism. It must be admitted that the result is somewhat ugly. Beyond the issue of aesthetics, the idea of taking a subset of a language and artificially constraining one's development activities to it seems problematic. Such a decision invites one to ask why the entire language was not used.
The author's initial experience with such a language subset was in a class that used the work of Wakerly [5] as a text. Wakerly uses a primitive subset of Pascal to model the Motorola (now Freescale) 68000 processor. Like Wakerly's Pascal, Primitive C attempts to describe algorithms of a certain sort using a nomenclature that will be familiar to programmers.
A distinction must be drawn between writing C that follows strict portability standards, and writing something like Primitive C or Wakerly's special Pascal. Portable C resides at a higher level of abstraction than does Primitive C. Consider, for example, that the work described here does assume an 8-bit signed char type. The possibility of building a type that would fit into a C short, whatever that might be on a given platform, was considered, but rejected. What was deemed necessary here was an unambiguous 8-bit library, targeted at a specific sort of device. The possibility of even further abstraction was thus set aside.
An example of a sequence of Primitive C statements is shown below:
//Handle log(1)
if(*mout) goto nzmdss;
if(util) goto nzmdss;
*mout=128;
*eout=-128;
return;
nzmdss:
A number of salient features of Primitive C can be discerned from this fragment. To summarize, it sets 8-bit parameter *mout to 128, and sets 8-bit parameter *eout to -128, unless either *mout or util is non-zero.
First, it should be noted that the Primitive C code does not spell out how parameters should be passed, how values should be returned, or even how each function's local variables should be allocated. In the fragment shown above, *mout ("mantissa out") is starred because it is a pointer to a location where part of the return value is stored. In contrast, many intermediate values are stored in simple variables which can be accessed without indirection, and their names thus appear in the same context without a star, e.g. util.
This syntactic difference, though, is not necessarily evident at any given location in the assembly language code. In the very simplest implementations, all of these locations tie to hard-coded memory locations, and no equivalent syntactic difference is evident in the object code. Even in an implementation where *mout and *eout do end up getting returned on a call stack, these return values may still need to be manipulated in intermediate locations for most of an operation. In such cases, no difference between the use of identifiers that are starred in the C implementation and those that are not will be evident in the eventual assembly language code.
As hinted by the fragment, variables are 8-bit unsigned values (bytes), as are most parameters. Parameters explicitly dedicated to exponents (e.g. *eout above) use a standard (i.e. signed) char. Basically nothing with a curly brace, other than the function declaration itself, is allowed. Note, for instance, that it would be much easier to express the last fragment as shown below, except that this would not conform to Primitive C as described:
//Handle log(1)
if((!*mout)&&(!util))
{
*mout=128;
*eout=-128;
return;
}
All non-parameter variables have type byte, necessitating typecast operations when dealing with signed data, e.g.:
//Handle overflow condition
if((((char)eout_tmp))!=-128) goto cnc45;
if(!b128) goto cnc46;
//Return +/- max. SFP value.
*eout=127;
*mout=127;
if(neg==neg2) goto caaa0;
*mout|=128;
caaa0:
return;
cnc46:
eout_tmp=127;
b128=true;
cnc45:
Here, eout_tmp is holding signed data, and this treatment is made explicit by the typecast in the first statement. A few other key aspects of Primitive C are in evidence as well. The statement *mout|=128; sets the top bit of an unsigned variable; this evaluates to a single line of assembly language in simple fashion. Finally, the pattern used for program flow control is once again seen above. In general, it is not acceptable in Primitive C to follow an if with anything other than a goto statement.
All of the architectures to which SFP has been ported thus far make use of at least two basic registers: a condition code register (and, in particular, a Carry bit) and an accumulator. In designing Primitive C, decisions had to be made about what sort of abstraction to construct around these commonalities.
In the case of the accumulator, it was decided that no specific Primitive C declaration or variable ought to be created. Rather, it was determined that the availability of the accumulator should be assumed by the Primitive C statements. This liberalizes the definition of Primitive C, and avoids having to deal with an accumulator variable. Consider, for example, the following snippet:
if(ein==ein2) goto eein2outw;
The equivalent PIC assembly code is:
movfw ein1 ;EIN1 in ACCUMULATOR
xorwf ein2,w ;XOR ACCUMULATOR and EIN2
btfsc STATUS,Z ;IF ZERO...
goto eein2outw ;...BRANCH
Note that the use of register w is only implied in the Primitive C code.
This last fragment also exhibits a peculiarity of PIC assembly language: all conditional operations are based on conditional skips, not on conditional branches. The opcode btfsc stands for "bit test [static RAM] file and skip [next instruction] if clear." In the last snippet shown, for example, the intent is to branch if the Zero bit of the status register is set. So, a goto is placed below the btfsc instruction. The goto gets skipped if the Zero bit is clear and (importantly) executed if it is set.
In general, anything that follows the broad if/goto pattern can be a valid Primitive C conditional statement, and it is assumed that the accumulator register will be used as an intermediary to evaluate the necessary conditions. Another example of how the accumulator is used implicitly is shown below:
fullmin=128+min;
The equivalent PIC assembly code is:
movfw min ;MIN IN ACCUMULATOR
addlw .128 ;ADD 128 TO ACCUMULATOR
movwf fullmin ;STORE IN FULLMIN
This simple treatment of the accumulator abstracts over the specifics of the register, without adding any great difficulty to the task of translating Primitive C to assembly. In the case of the Carry flag, however, it did prove useful to provide an actual variable. Consider the following example:
//Shift quotient_lo left, carrying up into min
c1=quotient_lo&128;
quotient_lo*=2;
if(!c1) goto qu5e;
++min;
qu5e:
In summary, quotient_lo is multiplied by 2, and any carry out of the most significant bit is handled by incrementing variable min. This evaluates, in the PIC implementation, into the following assembly language block:
;Shift quotient_lo left, carrying up into min
bcf STATUS,C
rlf quotient_lo,f
btfsc STATUS,C
incf min,f
In the assembly language implementation, it is necessary to clear the Carry bit before executing the left shift (rlf). The contents of this bit are placed into the bottommost bit of the result of any left shift on both the PIC and 6800 platforms. After that, note that the incrementation of min is predicated on the CPU's Carry bit, not upon any actual variable named c1. Correspondingly, in "sfp.cpp," variable c1 is declared in a special "Register Variables" section, and is not counted toward the 15-byte peak memory utilization quoted for SFP. Nevertheless, it is necessary to use something like an automatic variable (c1) to make the purpose of the Primitive C clear.
Many assembly language arithmetic routines rely on the interplay of the Carry bit and shift mnemonics such as rlf. Consider, for example, the following snippet, taken from the PIC multiplication routine:
;Shift hi_byte:make_mout
bcf STATUS,C
rrf hi_byte,f
rrf make_mout,f
In the example above, the bottom bit right-shifted out of hi_byte in the first rrf operation gets placed into the top of make_mout after the second such operation. This happens naturally; rrf constructs the top bit of its result from the Carry bit, and correspondingly shifts the bottommost bit of its operand into the Carry bit. The Primitive C present in "sfp.cpp" spells this behavior out exactly, using register variable c1 as proxy for the Carry bit:
// Shift hi_byte:make_mout
c1=hi_byte&1;
hi_byte/=2;
make_mout/=2;
if(!c1) goto chri8;
make_mout|=128;
chri8:
This is a very typical pattern on 8-bit devices. Such shifting behavior provides a foundation for performing calculations involving more than 8 bits.
Memory Allocation
Variables such as util and make_mout have been taken mostly as givens in the examples above, with no attention devoted to their allocation. In fact, this is an area where some leeway must be granted to the author of any one implementation of SFP.
These issues are handled in the simplest, cheapest way in the supplied C code. That is, it is assumed that no more than one instance of any SFP function will be running at any given time, at least not in the same allocation space. So, all storage ultimately resides in a series of global variables shared by all functions. These have names like loc000, loc001, etc., all the way through loc00E. It is the size of this suite of variables that establishes the "peak storage requirement" quoted for SFP above, and storage requirements are indeed minimized by such an approach.
However, it would hardly be convenient to write an entire floating point library using variables with names like loc00E, especially in the reduced subset of the C language in use here. So, prior to each SFP operation, blocks like the one shown below are present:
/*Define LOG16 mem. map*/
#define arg1 loc000
#define karg1 loc001
#define util loc002
#define exam loc003
#define neg loc004
#define rev loc005
#define moutL loc006
#define moutH loc007
Below each operation, these definitions are undone using #undef.
This default memory allocation setup is undeniably primitive. In many ways (its use of global accessibility, its use of static allocation) this setup exhibits exactly how one should not allocate memory. However, these decisions are only defaults, and will actually make sense for many applications.
One of the target platforms for this work, for example, was the Heathkit ET-3400. This is a system with just 256 bytes of RAM, for both program and data. On such a device, there is no real possibility of allocating local variables from a call stack, or even parameters. Static memory locations are simply selected, and the programmer must count himself fortunate to be able to execute floating point routines at all.
On other systems, more expansive design visions can be implemented. The PIC implementation provided with the article, for instance, does use a true call stack for parameters and return values. It also uses global variables in conjunction with #define for util, make_mout, and the like; on systems requiring reentrancy, util, make_mout, etc. should be allocated as automatic (i.e. stack) variables, so that dueling threads-of-execution do not use shared storage in ways that conflict.
Arithmetic
The arithmetic operators provided in the SFP library are based on the work of Wakerly [5]. These algorithms require 16-bit multiplication and division, and implementing these optimally requires some work. In all cases, such calculations are performed using shift operations, not repeated addition or subtraction.
The shift-based, 16-bit multiplication used by sfp_mul() was taken from a computer engineering syllabus placed online by Rensselaer Polytechnic Institute. The analogous division routine used within sfp_div() is an example of Non-Restoring Division, and is similar to the SRT Division Algorithm implemented in silicon in several later (i.e. post-bug) Pentium processors.
In both cases, the shifting techniques described in the "Primitive C" section above provided the underpinnings for much of what these sources describe at a higher level. Achieving properly rounded results was largely a matter of maintaining lower byte storage areas to hold values shifted out of a register during such operations. The top bit(s) of these were used to guide the rounding process.
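To make the underlying idea concrete, the sketch below shows shift-and-add multiplication of two 8-bit values into a 16-bit product, in ordinary C rather than Primitive C. It is a simplified illustration of the general technique, not the routine used by sfp_mul():

/* Multiply two 8-bit values using only shifts and adds. */
unsigned short shift_mul(unsigned char a, unsigned char b)
{
    unsigned short product = 0;
    unsigned short shifted = a;  /* holds a * 2^i at step i      */
    for (int i = 0; i < 8; ++i) {
        if (b & 1)               /* low bit set: add a * 2^i     */
            product += shifted;
        b >>= 1;                 /* consume the next multiplier bit */
        shifted <<= 1;           /* double the addend            */
    }
    return product;
}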
Logarithms
The algorithm used by SFP to calculate radix 2 logarithms is original with the library. The task of taking an SFP number and finding its logarithm is eased by the fact that the second, exponent byte of the number can serve as an estimate of the result. Consider once more the formula for the value of an SFP number having significand byte M and exponent E:

(-1)<sup>s</sup> × (((M mod 128) + 128)/128) × 2<sup>E</sup>

Because there is no overlap between SFP numbers, the left-hand factor of this expression,

((M mod 128) + 128)/128

must lie between 1 and 2, and the resultant SFP number must lie between plus or minus 2<sup>E</sup> and plus or minus 2<sup>E+1</sup>.
The base 2 logarithm of the number will thus lie between E and E+1, and an SFP approximation of this number can be obtained using the normalization process.
If the parameter's exponent is 3, for example, then the correct logarithm will lie between 3 and 4, and an initial approximation can be constructed as 3, or (3/128) × 2<sup>7</sup>. All normalized SFP numbers consist of a factor 2<sup>X</sup> times a factor between 1 and 2. Because of this, the value 3/128 in this estimate must be shifted left 6 times (i.e. multiplied by 64) and its exponent decremented by 6 in compensation. This yields a representation for this estimate of (192/128) × 2<sup>1</sup>, or M=64, E=1.
Further detail is obtained by using the parameter's significand byte M. In the C implementation provided, this is handled using a lookup table. This table is indexed by the parameter significand, modulo 128, and the result is added to the aforementioned approximation before normalization. The values in the table are approximations of:

log<sub>2</sub>((x + 128)/128)

In the formula above, x is the index value. Because the original SFP parameter was expressed as a product, the logarithm of the number is the sum of the logarithms of its two factors - this is the main point of logarithms - and adding such a value directly into the end result is mathematically justified.
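As an illustration (an assumption about how such a table could be produced offline, not the shipped data, which was hand-tuned as described in the Rounding section below), a generator might look like this:

#include <math.h>
#include <stdio.h>

/* Print candidate sfp_log table entries: log2((x+128)/128),
   expressed in 65,536ths and rounded to nearest. */
int main(void)
{
    for (int x = 0; x < 128; ++x) {
        long v = lround(log2((x + 128.0) / 128.0) * 65536.0);
        printf("%3d: %5ld\n", x, v);
    }
    return 0;
}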
Exponentiation
The overall translation realized by the SFP exponentiation function is to take an SFP number having value V and return 2<sup>V</sup>. For an SFP number having significand byte M and exponent byte E, the strategy used is to return an SFP number with an exponent equal to the value of (M mod 127). This number is then shifted based on E.
Consider some easy examples. If M=0 and E=7, no shifting is required. The factor formed by the significand of 128 (M plus the 128 accounted for by the "hidden" implicit top bit) is the correct exponent of the answer; in other words:

2<sup>(128/128) × 2<sup>7</sup></sup> = 2<sup>128</sup>

(These numbers are outside the domain of the type, so the maximum type value is returned.)

Similarly, if M=0 and E=6, a single right shift (division by 2) applied to the significand yields the correct result:

2<sup>(128/128) × 2<sup>6</sup></sup> = 2<sup>64</sup>

The number of shifts applied to the output exponent is 7-E, where E is the input exponent, e.g.:

2<sup>(128/128) × 2<sup>5</sup></sup> = 2<sup>32</sup> (two shifts)

and

2<sup>(128/128) × 2<sup>4</sup></sup> = 2<sup>16</sup> (three shifts)

and so on.
Additional detail is provided using another lookup table, this one having 256 entries. The index into this table is constructed by taking the bits shifted out of the result's exponent byte in performing the divisions shown above. The component obtained from the lookup table accounts for the fractional portion of the division's result. For instance:

2<sup>x + i/256</sup> = 2<sup>x</sup> × 2<sup>i/256</sup> ≈ ((L + 128)/128) × 2<sup>x</sup>

In the progression shown above, L is the value obtained from the lookup table. In essence, it ends up determining the value of the result's significand, and the formula for each entry in the table is:

(2<sup>i/256</sup> - 1) × 128

where "i" is the index, i.e. the offset from the table start.
In all these manipulations, the question of domain looms large. Passing 127.5 or greater into sfp_pow() returns the maximum possible SFP value, for example, and in the implementations provided, several such situations are checked for near the top of the function.
Rounding
The simplest floating point algorithms handle the issue of rounding using truncation. This was deemed inadequate for SFP; the precision of the type is already low, and simply abandoning meaningful data would only worsen things.
Instead, it was decided that each SFP operation should return the SFP number closest to the theoretical result of the requested operation. This theoretical result must be calculated using:

(-1)<sup>s</sup> × (((M mod 128) + 128)/128) × 2<sup>E</sup>

for each SFP parameter, not using any pre-approximation value. Other than that, this "round-to-nearest" strategy is easy to describe, and easy to test for (given an existing mathematics library of sufficient quality).
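The reference against which an operation can be judged is then the representable value nearest that theoretical result. The brute-force search below sketches this criterion in host-side C (the actual test harness in "sfp.cpp" may be organized differently):

#include <math.h>

/* Find the SFP encoding nearest to an exact (double) result. */
void nearest_sfp(double exact, unsigned char *m, signed char *e)
{
    double best = INFINITY;
    for (int mm = 0; mm < 256; ++mm) {
        for (int ee = -128; ee < 128; ++ee) {
            double sig = ((mm % 128) + 128) / 128.0;
            if (mm & 128) sig = -sig;
            double err = fabs(ldexp(sig, ee) - exact);
            if (err < best) {
                best = err;
                *m = (unsigned char)mm;
                *e = (signed char)ee;
            }
        }
    }
}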
However, rounding results to the nearest SFP number actually presents some difficulties. Each shift operation within each SFP function, for example, presents the possibility of data loss. One approach seen at several places in SFP is to shift bits down into a bit- or byte-sized memory location, in cases where detail would otherwise be abandoned. The exponentiation function's use of this strategy was already discussed, and in several places where floating point values must be normalized this tactic repeats itself. Many processor architectures facilitate such operations by shifting into, and out of, the machine Carry bit. This is true of both the PIC and 6800 families.
Where lookup tables are used, an additional issue is the interplay between the code's rounding and the inherent rounding applied to the table values, if any. If the least significant bit of the table value was rounded up, one might ask, then how should this affect rounding inside the function itself?
The collective implications of all such situations are difficult to fathom, and the problems they create are only made worse by the relative scarcity of prior work involving 16-bit floating point implementations. (32-bit rounding techniques do not necessarily scale down well; lookup tables, at least of the sort described here, are probably impractical in size, for example.)
So, the approaches selected are necessarily ad hoc in style. In all cases, though, the design process was informed by an examination of the tradeoffs involved. Such tradeoffs had to be made between, for example, adding additional lookup data versus retaining more bits during calculation or hard-coding corner cases.
In the logarithm function, for example, a 128-entry lookup table indexed by (M mod 128) is used. Initial attempts to achieve proper rounding using 128 8-bit table entries resulted in many rounding errors. Adding another 128 entry lookup table essentially fixed these errors. (GCC still identifies a single "corner case" where hard-coded handling is used.)
Because these tables are 128 entries in size, 256 words of lookup data (and one conditional) are required. The values in the lookup table used by sfp_log() are expressed in increments of 2<sup>-16</sup>, i.e. in 65,536ths. Initially, the table values were rounded, i.e. the last bit of the number in the table was selected such that the closest possible approximation was yielded. In some cases, later rounding operations inside of sfp_log() interacted with this rounding in undesirable ways, and the table values were tuned by hand to force correct results.
The power operation presents different tradeoffs. The lookup table in this case happens to be 256 entries long, since it is indexed by a full unsigned byte. This means that 8 bits of lookup data per entry by themselves take up 256 words of storage. This number is considerable; this much storage cannot be allocated on the base Heathkit ET-3400, for instance, even before one considers the code itself and the 15 byte peak storage requirement. (On the PICs, lookup tables can be stored in the somewhat larger program storage area.)
So, the efficiency of adding additional lookup data (giving a minimum total of 512 words, on an 8-bit device) was weighed against the possibility of adding corner case logic, and it was found that the latter option was superior. Only 71 such cases need to be handled, and this can be accomplished using fewer than 256 words, even on an 8-bit PIC. In both cases, the design selected seemed to reside at an architectural "sweet spot."
SFP Subsets
Two subsets are defined by "sfp.cpp" using the C preprocessor:
- If FASTMATH is defined, then sfp_pow() does not apply any special handling to corner cases, and errors are therefore possible. These are single bit errors, although it must be remembered that sfp_pow() will invoke sfp_div() in some cases, further decreasing accuracy.
- If BOUNDSCHECK is not defined, exponent overflow is never checked for. This applies to the overall operation, as well as some intermediate calculations that might exceed the range of the system even if the eventual result would be representable. Nevertheless, it is still allowable to use some numbers of very large/small magnitude even without bounds checking. While the absolute limits have not been completely explored as of this time, conservatively it can be said that operations having best results with exponents ranging from 2<sup>-126</sup> to 2<sup>125</sup>, inclusive, should at an absolute minimum return correct results even without bounds checking.
Both of these options reduce the size of the resultant library code considerably.
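For instance, a size-conscious build might configure the subsets like this (a hypothetical configuration; consult "sfp.cpp" for the exact conditionals):

#define FASTMATH  /* sfp_pow(): skip corner-case handling; 1-bit errors possible */
/* BOUNDSCHECK left undefined: exponent overflow is never checked */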
In addition, it should be realized that most of the SFP operations are independent of the other operations. The logarithm function can be taken by itself and implemented in a system where a logarithmic scale needs to be applied, for instance, without any need to implement the other SFP operations. Subtraction does execute a goto into the addition routine, and negative exponentiation uses a goto into the division routine to invert the final result.
Exceptional Conditions
In the default implementation, where BOUNDSCHECK is defined, the domain of all functions encompasses all representable numbers, and correct results will be returned in all cases. This specification is qualified only by the caveat that taking the logarithm of a negative number will result in undefined behavior.
This means that the closest SFP number to the answer will be returned, allowing only for single-bit FASTMATH errors. If the answer is too big to represent, the largest representable number will be returned, with appropriate sign. Similarly, if the number is too small to represent, the smallest representable number will be returned, again with appropriate sign.
One neat benefit of this setup is the fact that division by zero is really no longer an exceptional case at all. Rather, it quite logically returns infinity, as approximated by the closest SFP number to infinity. Similarly, SFP does not have a distinct zero; the smallest possible values (where M is in {0, -128} and E=-128) serve as a proxy for zero where such a proxy makes sense. The logarithm of 1 is calculated as M=0, E=-128, for example. SFP also does not include any special code values like NaN or Infinity, nor does it declare or throw any sort of exception.
In some cases, an operation upon nonsensical parameters cannot effectively be mapped to an SFP extreme. The logarithm of a negative, for example, is mathematically undefined within the real number space, and invoking such an operation in SFP results in undefined behavior. Applications susceptible to such problems will need to check for them using logic external to SFP.
None of these SFP design decisions are meant to imply that detection of exceptional conditions is unnecessary for a given application, and any production application using floating point ought to rest on a thorough understanding of the type used. Developers of SFP applications, in particular, must be cognizant of the way that the limited precision of the library affects addition and subtraction, e.g. adding a small quantity to a large one may produce no effect. Such considerations must be understood and handled using means outside of SFP.
At the same time, though, this is not so different from the burden placed on any programmer using floating point. The issue of adding very small values to very large ones, without effect, is inherent to floating point. More generally, exceptions are a means of communicating situations back to the application programmer, but they do not remove the underlying considerations of domain that accompany binary floating point.
The design of SFP acknowledges the fact that full IEEE 754 exception handling is impractical on many devices. This changes the exact nature of the interface between SFP and its application programs, without setting SFP apart from other floating point libraries in any fundamental sense.
The PIC Implementation
The PIC implementation supplied with this article is present in the "pic" subfolder of the root folder of the source code "ZIP" file. The PIC code provided includes functions for addition, multiplication, and division. These are based on the Primitive C code with BOUNDSCHECK undefined, and are named addf, mulf, and divf, respectively. A demo program is also provided, which calculates 34 factorial (34! - very close to the top of the range of the SFP type) and prints it repeatedly.
The operator functions are stored in files named "sfpaddf.asm," "sfpmulf.asm," and "sfpdivf.asm." These files are independent of each other. It is not necessary to assemble, link, or otherwise include an operator's file if that operator is not necessary for an application.
A print function is provided, along with comparison and conversion functions. Most of these functions are stored in common file "sfp.asm," which is mandatory for all applications. All of the files mentioned thus far, i.e. all of the SFP library files, are stored in a common folder, which is shared by all targeted PIC processors.
Ultimately, whichever functions are necessary for a given application are each assembled individually, along with the application's own files, using "MPAsm.EXE." Then, these must be linked together with "MPLink.EXE." This is demonstrated by "make.bat," the build script for the PIC SFP large factorial demo. Another batch file, "clean.bat," is provided to remove all of the build process' outputs. (It should be noted that all ".bat" files provided are intended only for the Windows "Command Prompt.")
The print function (printf) is rudimentary; rather than deal with issues of formatting, it outputs SFP numbers in a fixed-size format similar to scientific notation, e.g.:
(-192/128)*2^+001 (+128/128)*2^-001
The caret character '^' stands in for exponentiation. The numbers shown above are -3 (top) and 0.5.
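A host-side C equivalent of this format (an illustration; the PIC printf routine itself is assembly) could be written as:

#include <stdio.h>

/* Print an SFP pair in the fixed-size format shown above. */
void sfp_print(unsigned char m, signed char e)
{
    printf("(%c%03d/128)*2^%+04d\n",
           (m & 128) ? '-' : '+',  /* sign from bit 7            */
           (m % 128) + 128,        /* significand + implicit bit */
           e);
}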
The conversion functions are utof, which creates an SFP number from an 8-bit unsigned value, and ftou, which converts an SFP number into an 8-bit unsigned value. These reside in source files "sfputof.asm" and "sfpftou.asm." Debugging function dbgpkf is also provided (in "sfp.asm"). This function prints the float stored atop the parameter stack, but leaves it in place. It is thus designed to be inserted directly into problematic code without disrupting its operation.
The PIC 16F690 and 16F1827 are targeted. These two devices typify the last two generations of PIC 8-bit microcontrollers. For example, Microchip Technology sells a demo board and/or starter kit based around each of these exact devices. In general, members of the last two major generations of 8-bit PIC microcontrollers are supported by the resultant code base, with only trivial changes.
Each device-specific subfolder of the "pic" folder contains its own version of "make.bat," which is used to build "target.asm," along with all of the SFP assembly files, the stack implementations, etc. into a HEX file. Note that "make.bat" assumes that MPASM and MPLINK are installed in their default locations, i.e. under "C:\Program Files\."
The HEX file can be programmed onto a 16F690 using basically any Microchip Technology programmer. If the MPLAB IDE is used, the correct menu selection to load the HEX file is the "Import" item of the "File" menu. Then, "Program" must be selected. After programming, and after any reset or loss of power, the PIC will then begin emitting the value of 34!, which is the largest factorial that can be expressed using SFP. More details about programming, and about setting up serial communication, are available from this article.
The actual text transmitted on the PIC serial port is:
(+225/128)*2^+127
This evaluates to 2.9908e+38. The actual value of 34! is 2.9523e+38. The error is thus 1.3031%, which must be weighed against the fact that 32 multiplications are performed to obtain this result. This yields a per-multiplication error of just 0.040722%.
High-Level Structures
The PIC implementation uses dual software stacks (in addition to the small, hardware-resident return address stack). The first of these (i.e. the stack indexed using FSR on the PIC 16F690, or FSR0 on the PIC 16F1827) is used to pass SFP parameters and return values, and to effect the calculation of factorials made by the supplied demo program. Macros PUSH and POP assist with these operations; push operations take the value to be pushed from the accumulator w, and pop operations are executed into w. SFP values are passed by pushing the significand byte followed by the exponent byte. For binary operators, the first SFP number pushed is analogous to the left-hand operand, e.g. the numerator for division.
The code for these stacks, and for a few other items of infrastructural code, is specific to the generation of PIC device being targeted. The PIC 16F1827 has two index registers, for instance, whereas index register swapping is (somewhat painstakingly) implemented in software on the 16F690. As such, these portions of the code reside in two application-specific subfolders, "pic/pic16f690" and "pic/pic16f1827." The file containing the demonstration program's entry point is also device-specific, and distinct versions of this file ("target.asm") reside in these same device-specific subfolders.
Note that much of the code at the top of "target.asm" is boilerplate UART setup code; more information about this code is available here. For now, let it suffice to say that the PIC implementation emits its output on pin TX, at 57,600 baud, with 8 bits per byte, no parity, and one stop bit.
One major obstacle to the calculation of large factorials on the 8-bit PIC is its small, built-in stack. The PIC call and return facilities rely on a hardware stack of limited depth. In the 16F690 (prior generation 8-bit PIC), this stack stores return addresses for just 8 function calls, with no provision for detecting overflow. This means that one obvious implementation of factorial (exemplified by the Scheme function below this paragraph) will quickly break down, and this is true regardless of the data type used for the numbers involved:
(define (fact n)
(if (= n 0)
1
(* n (fact (- n 1)))))
This Scheme code is not tail recursive, and thus requires that a return address be placed onto the stack for each multiplicand of the factorial (e.g. 20, 19, 18, 17... etc. down to 2 for 20!). This is not feasible on the minimalist hardware targeted by the work presented here.
As an alternative, consider the C++ factorial program shown below. This program is very similar in construction to the supplied PIC demonstration programs.
#include<stack>
#include<vector>
#include<cstdio>
using std::stack;
using std::vector;
stack<float, vector<float> > s;
void longfact();
int main(int argc, char *argv[])
{
puts("");
longfact(); //Endless loop
return 0;
}
void build(); //Build list like "0.0, 5.0, 4.0, 3.0, 2.0"
void mulstr(); //...invoke stack multiply on list until 0.0
void fact()
{
build();
mulstr();
}
void longfact()
{
do{
s.push(0.0); //Sentinel; mulstr() stops at this value
s.push(34.0);
fact();
printf(" %f",s.top());
s.pop(); //Pop the result...
s.pop(); //...and the sentinel
}while(1);
}
void build()
{
do{
if(2.5<=s.top()){s.push(s.top()-1.0);}else break;
}while(1);
}
void mulstr()
{
do{
float f=s.top(); s.pop();
if(s.top()<0.5)
{
s.push(f);
break;
}else{
float g=s.top();s.pop();s.push(f*g);
}
}while(1);
}
This program first builds a list of floating point numbers something like "0.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0" on stack s, and then performs stack-based multiplication until the 0.0 is found. No recursive function calls are used in this code. This is true of the PIC code in each version of "target.asm" as well. If expressed in Scheme, such an implementation would use only tail recursion.
The code in "target.asm" implements the same basic algorithm as this C++ code, and thus operates at a high level of abstraction. This code is reminiscent of the object code of a compiler or the threaded code used to execute some high-level languages. The next few snippets of code from "target.asm," which execute sequentially, typify the level of abstraction at which the code in this file operates:
movlw .32
PUSH
movlw .1
PUSH
The snippet above places the SFP approximation for 2.5 (part of the C++ factorial algorithm just shown, and approximated as M=32, E=1, or (160/128) × 2<sup>1</sup>) onto the stack.
The next sequence takes the first two parameters pushed for the executing function (build - again, this is evident in the C++) and copies them to the top of the stack using function parm. Before calling parm, the byte index of the parameter requested (e.g. 0 for the topmost parameter) must be pushed onto the main stack:
movlw .1
PUSH
FAR_CALL build,parm
movlw .0
PUSH
FAR_CALL build,parm
Next, the gtf function from "sfp.asm" is invoked, to compare the two SFP numbers now present at the stack top:
FAR_CALL build,gtf
The gtf function consumes these two floating point numbers (four bytes) from the stack top, and places a single byte value on the stack in return. This will be non-zero if the first SFP number passed is larger than the second number passed (i.e. than the topmost SFP number), and 0 otherwise. This result value is then used to choose between two conditional branches:
POP ;Comparison result into W
xorlw .0 ;XOR literal 0 with W; Z flag set based on result
btfsc STATUS,Z
goto hlllb51J5
goto hlllb51J6
While somewhat primitive compared to the application code typically written for a contemporary general computer, the code shown above is, it is hoped, legible and intuitive by the standards of PIC 8-bit assembly language. It does track the C++ factorial program just shown very closely. One must also consider the possibility of generating such assembly language sequences automatically, as the object code of a high-level language. What is present in "target.asm" is a thin layer scripting the actions of SFP, the stack library, the "kernel" .asm files, and a few assembler macros built to support these operations.
Throughout the execution of the various instances of build and mulstr necessary for the calculation of 34!, the second stack is used to hold base pointers, which are utilized by the parm function from the "kernel" assembly file. In terms of the C++ factorial implementation shown above, these "base pointers" equate to pointers to the top of s as it stood at the beginning of any given function call.
This results in a slightly different idiom for manipulating the central stack. In the C++, in function mulstr, for example, top, pop, and push are combined to effect comparisons involving the second (bottom) of two SFP values stored atop the stack. In "target.asm," a pointer placed atop the second stack at the start of each call to mulstr is used to obtain this SFP value instead, and kernel function parm effects the necessary dereferencing.
Future
This article leaves open many directions for future work. One obvious avenue for exploration is the application of SFP to real (or at least interesting) problems. The PID controller mentioned earlier will quite likely grow into an article of its own at some point in the future. Many applications will require additional library functions to be developed.
Another possible topic is a more complete PIC implementation (e.g. with bounds checking). The techniques described here are also designed to be ported to other platforms, including, but not limited to, the 6800 and its many descendants.
During development of SFP, parallel development of a similar 32-bit type was under way. This type augments the significand byte with two more bytes of data, leaving the significance of the existing SFP bytes completely unchanged. Part of the appeal of this type is that it is trivially interchangeable with the SFP type presented here. If we represent a 32-bit "extended SFP / ESFP" number as M, N, O and E, where N and O are extra significand bytes, then this number has an SFP approximation of M,E (M can be rounded up based on the top bit of N). Similarly, any existing SFP operation can be trivially turned into a preliminary ESFP implementation, with a known level of inaccuracy.
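A sketch of that approximation step, under the byte layout just described (hypothetical code; no ESFP implementation ships with this article), appears below:

/* Approximate an ESFP number (M, N, O, E) as SFP (M, E), rounding M up
   when the top bit of N is set. Exponent overflow checking is omitted. */
void esfp_to_sfp(unsigned char m, unsigned char n, unsigned char o,
                 signed char e, unsigned char *m_out, signed char *e_out)
{
    (void)o;                     /* O lies entirely below the rounding point */
    if (n & 128) {               /* round the significand up                 */
        if ((m & 127) == 127) {  /* significand proper would overflow...     */
            m &= 128;            /* ...so it wraps to zero...                */
            ++e;                 /* ...and the exponent is bumped            */
        } else {
            ++m;
        }
    }
    *m_out = m;
    *e_out = e;
}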
The level-of-completion for this 32-bit work is lower than for SFP; exhaustive testing is not practical, nor are the lookup tables used for the SFP exponentiation and logarithm operations practical. These obstacles notwithstanding, this 32-bit type may grow into a submission similar to this one. Such an implementation would have to be tested as well as possible using non-exhaustive means, and could rely on the existing 16-bit exponentiation and logarithm routines.
Finally, an expanded treatment of the high-level strategies applied in the PIC large factorial demonstration will eventually be forthcoming. In examining the ways in which others have been able to stand up similar high-level facilities quickly and cheaply, FORTH can be identified (respectfully, and in full recognition of the incompleteness of this author's own treatment) as perhaps the most similar existing effort. This is exemplified by FORTH's stack orientation, its reliance on something resembling threaded code, its lack of formal type checking, and its occasional reliance on a cross-compiler.
A future article will describe how it is possible to take the high-level form seen in "target.asm" and build a cross-compiler capable of generating such object code from a high-level language. Much of the underlying work necessary for such a cross compiler (and runtime) has already been done, and this treatment encompasses a broad array of PIC devices, with the promise of even further portability. Taken in conjunction with the SFP article at hand, and its predecessor, this future article will provide an overview of what might be called "computing from the ground up." More formally, this material provides a high-level overview of a fairly abstract computing system, built up quickly and cheaply from the IC level.
References
[1] IEEE 1985. IEEE standard for binary floating-point arithmetic. ACM SIGPLAN Notices 22, 2 (Feb.), 9–25
[2] Karpinski, R. 1985. "Paranoia: A floating-point benchmark." Byte Magazine 10, 2 (Feb.), 223–235
[3] Monniaux, D. (May 2008). "The pitfalls of verifying floating-point computations". ACM Transactions on Programming Languages and Systems 30 (3): article #12.
[4] Motorola, Inc. Programmer's Reference Manual (68000). Schaumburg, IL: Motorola Press, 1992.
[5] Wakerly, J. Microcomputer Architecture and Programming: The 68000 Series. New York: Prentice-Hall, 1989.
History
This is the second version of this article. Compared to the first version, the formatting of the article has been cleaned up, and some passages have been clarified. The SFP library has not changed since the initial release of the article.