SDS Numeric Data Types, Conversion, and Missing Representation.

SOI TN 94-115

R S Bogart
1994.05.31

1.  Among the various types used by SDS for internal data representation, the
numeric types are designated as follows:

BYTE	i1	fixed-point 8-bit integers in the range [-128,127]
SHORT	i2	16-bit integers in the range [-32768,32767]
INT	i4	32-bit integers in the range [-2147483648,2147483647]
UBYTE	u1	8-bit integers in the range [0,255]
USHORT	u2	16-bit integers in the range [0,65535]
UINT	u4	32-bit integers in the range [0,4294967295]
FLOAT	r4	floating-point 32-bit IEEE real numbers
DOUBLE	r8	floating-point 64-bit IEEE real numbers
COMPLEX	c8	ordered pairs of r4's representing complex numbers
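
As a rough illustration of the table, the type codes could be modelled in C
with the fixed-width types from <stdint.h>.  The enum and function names below
are hypothetical and are not taken from the SDS headers; the sketch also
assumes a platform on which float and double are the 32- and 64-bit IEEE
formats.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type codes mirroring the table; not the SDS header names. */
typedef enum {
    T_I1, T_I2, T_I4, T_U1, T_U2, T_U4, T_R4, T_R8, T_C8
} sds_numtype;

/* Size in bytes of one element of the given numeric type. */
size_t sds_numtype_size(sds_numtype t)
{
    switch (t) {
    case T_I1: case T_U1: return sizeof(int8_t);
    case T_I2: case T_U2: return sizeof(int16_t);
    case T_I4: case T_U4: return sizeof(int32_t);
    case T_R4:            return sizeof(float);
    case T_R8:            return sizeof(double);
    case T_C8:            return 2 * sizeof(float);  /* ordered pair of r4's */
    }
    return 0;  /* unknown code */
}
```

With these types the i4 range is INT32_MIN to INT32_MAX and the u4 range is
0 to UINT32_MAX.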

In the floating-point formats, the following special IEEE representations
for data exist in addition to the representations for numbers:
	NaN		Not a Number
	Infinity	Larger than any real number
	-Infinity	Smaller than any real number
(NaN and +/- Infinity exist in both the r4 and r8 formats.)

In principle the supported data types could be implemented on another
architecture, but the rules governing intertype conversions are based
on the internal representations described here.  These rules must be obeyed
in any implementation, even though they might not make sense in another
architecture.

Conversions between complex numbers and the other numeric types do not
in general make sense, so complex numbers will not be discussed further.

2.  Associated with an SDS data set there are possibly two attributes: SCALE and
OFFSET.  If the SDS is multivariate, there may be unique pairs of these
attributes for each variable.  The SCALE and OFFSET attributes may be set or
not.  If they are set, they are real (double-precision) numbers representing a
linear mapping from integers to real numbers.  If one is set and the other is
unset, the default value for the unset attribute is used.  The default values 
are 1.0 for SCALE and 0.0 for OFFSET.  Setting a value of 0.0 for SCALE or NaN
or +/- Infinity for either attribute causes both to be unset.  The values of 
SCALE and OFFSET can be set, unset, or modified by appropriate functions.
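
The set/unset rules above can be sketched as follows.  This is only an
illustration under stated assumptions: the scaling_t structure and the
function names are hypothetical and are not the SDS library's own interface.

```c
#include <math.h>
#include <stdbool.h>

/* Hypothetical container for the two attributes of one variable. */
typedef struct {
    bool   scale_set, offset_set;
    double scale, offset;
} scaling_t;

/* Unset both attributes and restore the default values. */
void scaling_unset(scaling_t *s)
{
    s->scale_set = s->offset_set = false;
    s->scale  = 1.0;
    s->offset = 0.0;
}

/* SCALE of 0.0, NaN or +/-Infinity causes both attributes to be unset. */
void scaling_set_scale(scaling_t *s, double v)
{
    if (v == 0.0 || !isfinite(v)) { scaling_unset(s); return; }
    s->scale = v;
    s->scale_set = true;
}

/* OFFSET of NaN or +/-Infinity causes both attributes to be unset. */
void scaling_set_offset(scaling_t *s, double v)
{
    if (!isfinite(v)) { scaling_unset(s); return; }
    s->offset = v;
    s->offset_set = true;
}

/* Effective values: an unset attribute yields its default. */
double scaling_scale(const scaling_t *s)  { return s->scale_set  ? s->scale  : 1.0; }
double scaling_offset(const scaling_t *s) { return s->offset_set ? s->offset : 0.0; }

/* Scaling counts as "set" if either attribute is set. */
bool scaling_is_set(const scaling_t *s) { return s->scale_set || s->offset_set; }
```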

The SCALE and OFFSET attributes have different meanings depending on whether
the associated data variable type is fixed-point or floating-point.  If the
data are floating-point, then the SCALE and OFFSET attributes are used to set
corresponding attributes in an external data format (such as the BSCALE and
BZERO keys in FITS) so that the data will be written to the external format as
scaled integers, provided that the format supports such a feature; this
conversion is automatic.  Likewise, on input from an external data format
providing scaling for integers, conversion and scaling of the data is
automatic.  Scaled integers in the external data representation should be
thought of as simply a form of media compression of floating point numbers
below the access level of the SDS I/O functions.

For fixed-point data, the SCALE and OFFSET attributes imply an automatic type
conversion to a floating-point representation on output.  They evidently cannot
be set on input, so they must be set internally.  The purpose in setting them
is for internal type conversion, as there can be no advantage in externally
representing integers as floating-point numbers.

Scaling parameters provided for floating-point data in an external data
representation cannot be meaningful, as the floating-point number set with
+/-Infinity is closed under multiplication and addition, so any floating-point
number can be represented directly.  (This is not quite true: scaling could be
used to represent double-precision numbers, which have a larger range in the
IEEE format, with single-precision numbers, but that case is highly unlikely to
arise in real life.)  Any scaling parameters appearing in such an external
data representation should be ignored.  If the present rules are followed, such
a dataset cannot arise from manipulations with the SDS library.

The net effect of these rules on input from and output to external data formats
can be summarized as follows:

    Extern Data Rep  Extern Scaling	    Intern Data Rep  Intern Scaling

	integer		no		<->	integer		unset
	integer		yes		<->	real		set
	real		no		<->	real		unset
	real		no		<-	integer		set
(	real		yes		 ->	real		unset	)

Note that the system could have been closed by having external scaled reals
correspond to internal scaled integers, but such a correspondence makes no
sense.
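
The table can be read as a pair of small decision functions, one for each
direction of I/O.  The sketch below is illustrative only; the type and
function names are hypothetical.

```c
#include <stdbool.h>

typedef enum { REP_INTEGER, REP_REAL } rep_t;

/* A data representation together with its scaling state. */
typedef struct { rep_t rep; bool scaled; } form_t;

/* Input: external form -> internal form. */
form_t sds_input_form(rep_t ext_rep, bool ext_scaled)
{
    form_t in;
    if (ext_rep == REP_INTEGER && ext_scaled) {
        in.rep = REP_REAL;    in.scaled = true;   /* scaled integers -> reals */
    } else if (ext_rep == REP_INTEGER) {
        in.rep = REP_INTEGER; in.scaled = false;
    } else {
        in.rep = REP_REAL;    in.scaled = false;  /* scaling on reals ignored */
    }
    return in;
}

/* Output: internal form -> external form. */
form_t sds_output_form(rep_t in_rep, bool scaling_set)
{
    form_t ext;
    if (in_rep == REP_REAL && scaling_set) {
        ext.rep = REP_INTEGER; ext.scaled = true;  /* written as scaled integers */
    } else if (in_rep == REP_INTEGER && !scaling_set) {
        ext.rep = REP_INTEGER; ext.scaled = false;
    } else {
        ext.rep = REP_REAL;    ext.scaled = false; /* incl. integers with scaling */
    }
    return ext;
}
```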

3.  The SCALE and OFFSET attributes govern internal data type conversions
between integer and real types according to the following rules:

If the scaling attributes are set, they are used to convert between reals and
integers and are preserved.  The two data representations are equivalent and a
conversion of integers with set attributes to reals and then back to integers
will reproduce the original data exactly, except for limits incurred by range
and precision.

If the scaling attributes are not set, integers are converted to reals using
the default scaling values, but the scaling attributes remain unset.  When
reals with unset scaling attributes are converted to integers, the scale and
offset will be automatically determined to optimally use the integers to
describe the data.  Thus, when the scaling attributes are unset, a conversion
of integers to reals and back again to integers will not in general reproduce
the same set of values.  It is not possible to produce an integer dataset
without scaling attributes by direct conversion of a real dataset.

	Datatype	Scaling	   	Datatype	Scaling

	integer		set	<->	real		set
	integer		unset	 ->	real		unset (default scaling)
	integer		autoset	<-	real		unset
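
A minimal sketch of the first two rows of this table for type i2 follows.  It
assumes the FITS-like convention real = SCALE * integer + OFFSET, which this
note does not spell out explicitly, and the function names are hypothetical.
Passing 1.0 and 0.0 corresponds to the default scaling used when the
attributes are unset.

```c
#include <stdint.h>

/* Assumed convention (FITS-like): real = scale * integer + offset. */
double i2_to_r8(int16_t v, double scale, double offset)
{
    return scale * (double)v + offset;
}

/* Inverse mapping, rounded to the nearest integer.  The caller is
   responsible for ensuring that the result fits in an i2. */
int16_t r8_to_i2(double x, double scale, double offset)
{
    double q = (x - offset) / scale;
    return (int16_t)(q < 0.0 ? q - 0.5 : q + 0.5);
}
```

With set attributes the round trip reproduces the original integer, within the
limits of range and precision noted above.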

4.  When casting on input or output takes place (datatype not declared
SDS_ANY), the rule is that the implied I/O conversion takes place closer to
the I/O than the internal conversion.  On input, for example, if a dataset
containing integers with scaling in the external format is read in cast to
an internal integer datatype, the data will first be read in as reals with
advisory internal scaling set, and then the scaling will be used to convert
internally to integers.  Practically, this ensures preservation of the
external representation's scaling.  If integer data are cast to reals on
output, the actual external data representation will either be reals without
scale parameters or scaled integers, according as the scaling attributes are
unset or set.

5.  Conversions within type (signed integer, unsigned integer, or real) from
a smaller representation to a larger representation (e.g. i1 to i2, u2 to u4,
r4 to r8) involve no changes in either the data values or the scaling
attributes.  Conversions from integer types of 1- or 2-byte length to reals
and from i4 or u4 to r8 also involve no changes in data values or the
scaling attributes.  Likewise, conversions of unsigned integers to signed
integers of a larger size (u1 to i2 or i4, u2 to i4) do not affect values
or scaling attributes.  In all these cases the original set of possible
values is a proper subset of the target set of possible values.

Conversions within type from larger to smaller representations may invoke
automatic scaling, but this should be avoided if possible.  If the data
values are contained in the range of numbers represented in the target
type, there are no changes in values or in scaling attributes.  If some data
values fall outside the target range but the extent of the data values is
smaller than the extent of the target range, then the OFFSET attribute is
suitably modified or set to
center the representations of the data values in the target range as closely
as possible, while the SCALE attribute is unchanged, or set to its default
value if it had been unset.  If the range of data values exceeds the range
of representable numbers in the target type, then both the scaling attributes
are modified (or set) to approximately center the data values in the target
numeric range and to have the data range cover approximately 3/4 of the
target numeric range.  In deciding whether scaling attributes need to be
modified or set, the largest and smallest values in signed integer types
and the largest value in unsigned integer types are to be excluded.  Thus,
a set of i4 data values in the range [-32767,32766] could be converted to
type i2 without setting or modifying the scaling attributes.  A set of
i4 data values in the range [-32767,32767] would require the SCALE
attribute to be set (to about 0.75) for conversion to type i2.  Strictly
speaking, these remarks apply to conversions from type r8 to r4 as well,
but they would hardly ever apply, as the likelihood of the data range
exceeding the range representable within type r4 is very small; still
less of the data range being outside but smaller than the representable
range.  Note that any conversion requiring the setting of scaling attributes
is in fact a conversion to reals, since that is how the data will be treated
thenceforward.
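
The three cases described in this paragraph can be sketched as a
classification of the data range against the usable target range.  The names
are hypothetical, and the caller is assumed to pass target limits from which
the reserved extreme values have already been excluded (for i2, -32767 and
32766).

```c
typedef enum {
    RESCALE_NONE,              /* values fit as they are                   */
    RESCALE_OFFSET_ONLY,       /* extent fits; only OFFSET must change     */
    RESCALE_SCALE_AND_OFFSET   /* extent too large; both attributes change */
} rescale_t;

/* dmin, dmax: smallest and largest data values.
   tmin, tmax: usable limits of the target type. */
rescale_t classify_narrowing(double dmin, double dmax,
                             double tmin, double tmax)
{
    if (dmin >= tmin && dmax <= tmax)
        return RESCALE_NONE;
    if (dmax - dmin <= tmax - tmin)
        return RESCALE_OFFSET_ONLY;
    return RESCALE_SCALE_AND_OFFSET;
}
```

For the two i4 to i2 examples in the text, the range [-32767,32766] is
classified as RESCALE_NONE and the range [-32767,32767] as
RESCALE_SCALE_AND_OFFSET.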

Conversions between signed integers and unsigned integers are governed by
the same rules that apply to conversions within type.  If the conversion
can be effected without setting or changing the scaling attributes, that
is done.  If the conversion can be effected by only changing or setting the
OFFSET attribute, that is done.  Otherwise, both the scaling attributes are
set to leave the data centred within a range 3/4 the range of representable
values in the target type.  For example, a set of i4 data values in the
range [0,65535] would be converted to type u2 without affecting the scaling
attributes.  A set of i4 data values in the range [-32767,32767] could be
converted to type u2 by setting the OFFSET attribute to 32767.0.  A set of
u2 values in the range [0,65535] could be converted to i2 by setting the
OFFSET attribute to -32768.0, but would force the SCALE attribute to be
set to about 3/1024 on conversion to u1.

In conversions of data from types i4 or u4 to type r4 care must be exercised
because the integer types represent numbers with up to 8 bits greater
precision.  Integers with absolute value exceeding 16777216 (2^24) cannot
necessarily be exactly represented in the r4 representation.  The scaling
attributes
cannot be used to shift the data range in this case because the scaling
attributes imply that the floating-point data represent the actual data
values.  It would be useful to have the conversion routine set a warning
value in soi_errno in cases in which precision may have been lost in the
conversion.  Note that this applies regardless of whether the scaling
attributes are preset.  Because the scaling attributes imply that the real
numbers represent the data values, the real number values cannot be changed.
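
One way a conversion routine might detect that precision has been lost is to
compare the value before and after the conversion.  The sketch below is
illustrative only; the function name is hypothetical, and how a warning would
be reported through soi_errno is left open.

```c
#include <stdint.h>

/* Returns 1 if v survives conversion to r4 unchanged, 0 otherwise.
   The comparison is done in double, which holds every i4 exactly. */
int i4_exact_in_r4(int32_t v)
{
    volatile float f = (float)v;   /* volatile forces rounding to r4 */
    return (double)f == (double)v;
}
```

For example, 16777216 is preserved exactly while 16777217 is not, since an r4
carries a 24-bit significand.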

Conversions from reals to integers are governed by the scaling attributes if
they are set.  (If only one is set, the default value of the other is used.)
If the scaling attributes have not been set, autoscaling should always be
applied: the data values should be centred in the representable range and
occupy 3/4 of it.  (If there is only a single data value, the SCALE
attribute, if unset, should be set to its default value of 1.0.)

6.  In general, any data value may be "missing".  It may be missing
because it is unavailable (clouds on that day), because there can be
no data for a given array element (e.g. corresponding to locations off the
solar disk), because the analyst has decided the datum is clearly
invalid (e.g. bad pixels), or for other reasons.  SOI has adopted
the philosophy that there will always be at least one value that must
be tested for prior to using data in a computation if there is any
conceivable possibility that some data might be missing.  For the floating
types, the IEEE NaN will be used to represent such data.  For data
stored as integer types there is no simple solution since any chosen value
must come from the space of otherwise allowable data values.  Therefore,
when doing calculations using integer typed data which could be missing,
the particular "fill_value" applicable to that data must be used.

The SOI vds and sds function libraries check the data for fill_values
on input and output.  The following rules are applied:

External form is an integer type:
    External fill_value is specified:
        Internal type is the same as the external type:
            NO checking for missing is done and the internal fill_value
            is set to be the same as the external fill_value.
        Internal type is different from external type:
            ALL data is checked before conversion.  The internal fill_value
            is set according to the internal type as one of:
                BYTE    -128
                SHORT   -32768
                INT     -2147483648
                UBYTE   255
                USHORT  65535
                UINT    4294967295
                FLOAT   AQuietfNaN()
                DOUBLE  AQuietdNaN()
    External fill_value not specified:
            NO checking for missing is done.  The internal fill_value must
            be somehow left undefined to avoid later use.  Method TBD.

External form is a floating type:
    External fill_value is specified:
        Internal type and fill_value are the same as the external type:
            NO checking is done.
        Internal type or fill_value is different from external type or fill:
            ALL data is checked before conversion.  The internal fill_value
            is set according to the internal type as one of:
                BYTE    -128
                SHORT   -32768
                INT     -2147483648
                UBYTE   255
                USHORT  65535
                UINT    4294967295
                FLOAT   AQuietfNaN()
                DOUBLE  AQuietdNaN()

    External fill_value not specified:
        This case can not happen for external floating types since
        the external fill_value is defined for all our supported
        external protocols.
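
As an illustration of the "ALL data is checked before conversion" case, the
sketch below converts external SHORT data with a specified fill_value to
internal FLOAT data.  The function name is hypothetical, and the C99 NAN macro
stands in for AQuietfNaN(), which is assumed to return a quiet NaN.

```c
#include <math.h>
#include <stddef.h>

/* Convert n shorts to floats, mapping the external fill value to NaN. */
void short_to_float_checked(const short *in, float *out, size_t n,
                            short ext_fill)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (in[i] == ext_fill) ? NAN : (float)in[i];
}
```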

Versions of library functions located in /usr/local/src rather than
in ~soi/CM/src use the historic value MISSING defined as -(8388608.0*1E10).
The keyword "MISSING" is reserved for that value and is not to be used
in SOI code.

The external fill_values for floating values in external protocols are:
	FITS	NaN
	CDF	NaN
	WSO_DS	MISSING

SDS library routines (really macros) are provided to test individual
values for missing.  They are:

int sds_smissing(SDS *sds, short s);
int sds_usmissing(SDS *sds, unsigned short u);
int sds_imissing(SDS *sds, int i);
int sds_uimissing(SDS *sds, unsigned int u);
int sds_fmissing(float f);
int sds_dmissing(double d);

They are implemented as:

#define sds_smissing(sds,x)  ((x) == *(short *)(sds)->fillvalue)
#define sds_usmissing(sds,x) ((x) == *(unsigned short *)(sds)->fillvalue)
#define sds_imissing(sds,x)  ((x) == *(int *)(sds)->fillvalue)
#define sds_uimissing(sds,x) ((x) == *(unsigned int *)(sds)->fillvalue)
#define sds_fmissing Is_fNaN
#define sds_dmissing Is_dNaN
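
A usage sketch for the integer test macros follows.  The SDS structure shown
is a hypothetical minimal stand-in with only the fillvalue member; the real
structure contains more, and the counting function is purely illustrative.

```c
#include <stddef.h>

/* Hypothetical minimal stand-in for the SDS structure. */
typedef struct { void *fillvalue; } SDS;

#define sds_smissing(sds,x) ((x) == *(short *)(sds)->fillvalue)

/* Count the elements of a SHORT array that are flagged as missing. */
size_t count_missing_shorts(SDS *sds, const short *data, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (sds_smissing(sds, data[i]))
            count++;
    return count;
}
```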


7.  Along with Missing attributes, INFINITE attributes should be provided
for integer representations of converted real data with values of Infinity
and -Infinity.  The same remarks apply as to conversion, namely that data
values corresponding to an INFINITE attribute should be mapped through to
the INFINITE attribute in the target representation without undergoing
scaling.  When INFINITE attributes must be set, they should be set to the
largest representable integer in signed integer formats and the 2nd largest
representable integer in unsigned integer formats for +INFINITY (in order
to distinguish from the Missing value).  Negative INFINITY should be set to
the 2nd smallest representable integer for signed integer formats, which does
correspond to the negative of positive INFINITY.  Negative INFINITY does not
make sense in unsigned formats; conversion of -Infinity in a real format to
unsigned integer format will result in replacement with the Missing attribute.
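
For the 2-byte types, the reserved values described above could be applied as
in the following sketch.  The macro and function names are hypothetical; the
Missing values are the SHORT and USHORT fill_values listed in section 6.
Ordinary values are left to the normal scaling path.

```c
#include <math.h>
#include <stdint.h>

#define I2_MISSING  INT16_MIN          /* -32768: smallest i2           */
#define I2_POS_INF  INT16_MAX          /*  32767: largest i2            */
#define I2_NEG_INF  (INT16_MIN + 1)    /* -32767: 2nd smallest i2       */
#define U2_MISSING  UINT16_MAX         /*  65535: largest u2            */
#define U2_POS_INF  (UINT16_MAX - 1)   /*  65534: 2nd largest u2        */

/* If x is NaN or +/-Infinity, store the reserved i2 code and return 1;
   otherwise return 0 so the caller applies normal scaling. */
int r8_reserved_i2(double x, int16_t *code)
{
    if (isnan(x)) { *code = I2_MISSING; return 1; }
    if (isinf(x)) { *code = (x > 0.0) ? I2_POS_INF : I2_NEG_INF; return 1; }
    return 0;
}

/* Same for u2; -Infinity has no unsigned code and becomes Missing. */
int r8_reserved_u2(double x, uint16_t *code)
{
    if (isnan(x)) { *code = U2_MISSING; return 1; }
    if (isinf(x)) { *code = (x > 0.0) ? U2_POS_INF : U2_MISSING; return 1; }
    return 0;
}
```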