SOI-MDI Technical Notes

Internal and File Data Structures
Description, Use, and Tutorial

Alfred Tom
October 29, 1992
SOI-TN-92-102

Stanford-Lockheed Institute for AstroPhysical and Space Research

Summary of IDS Data Structure and Routines
10/29/92
Alfred Tom

Introduction

This document describes the data structures and file systems used for storing data. The data may have been generated internally or imported from an external test or measurement site. There are three main data structures: two file systems and one internal data structure for use in data programs and routines. This document attempts to describe the data structures and how they are used.

The first data structure is the Flexible Image Transport System (FITS). It is a NASA standard and is used primarily for importing and exporting data. However, because FITS is basically a serially oriented file system designed for tapes, it is very inconvenient for data manipulation and inefficient for storage. For this reason, a second file system is adopted for most work. The Common Data Format (CDF) file system was developed by the National Space Science Data Center. The complete package contains command line routines, C programming language library routines, and window-oriented, data-viewing applications.

Because of the volume and complexity of the SOI project's data, it is also necessary to have a standard internal data structure for storing data in memory. The Internal Data Structure (IDS) is basically a structure in memory that keeps data organized when it is read from FITS or CDF files into the program's variables. Appendix C shows how the IDS structure is represented in the C programming language. All data manipulation inside a program is done with IDS.

This document is divided into seven sections. The first describes the background and reviews prior work done in data structures for the SOI project.
The second and third sections briefly introduce the FITS and CDF file systems. The fourth section describes the IDS structure in detail. The fifth describes the use of the IDS-FITS C library routines. The sixth describes the use of the IDS-CDF C library routines. The seventh describes command line routines that manipulate data files. There are also three appendices. Appendix A describes the GDS-FITS routines that were used when GDS was the internal data structure. Appendix B gives an overall description of the CDF functions used in the IDS-CDF library. Appendix C shows the IDS structure as represented in C. For a list of FITS keywords that are used in the SOI project, see SOI Technical Notes #XXX.

1- Background

Prior to CDF and IDS, other data structures were used for file systems and internal data structures. DS was used for data files and GDS was used as the internal data structure. FITS was always used as the import/export filing system.

DS was the original data file format. It also described its own internal data structure. A short primer for the DS data format can be found in the directory /usr/local/doc. DS was replaced by CDF because CDF is a standard format documented by NASA. CDF also matches fairly naturally with the FITS data format.

GDS was a simpler version of the IDS internal data structure that replaced it. It was suited for FITS in that it stored FITS keywords as well as data. However, it was not sophisticated enough for use with the CDF file system. It is important to note that before the implementation of CDF, a library of routines was developed for manipulation of data between GDS and FITS. These can be found under the directory ~/soi/src/libgds.d. This library is described in Appendix A.

2- Flexible Image Transport System (FITS)

The Flexible Image Transport System (FITS) is a file structure standardized by NASA for the import and export of data. Because file transfer was originally intended to be done using magnetic tapes, the FITS structure is a serially oriented system with certain limitations.
Each FITS file consists of an integer number of logical records. Each FITS record contains exactly 23040 bits, or 2880 bytes. A record can be one of two types: header record or data record. A Header is a set of logical header records that describe the data that follows it (in data records). However, one can have a Header with no data records following it. In this case, the Header would be for information only. A Header and Data Unit (HDU) consists of a set of header records followed by zero or more data records.

The Header portion of each HDU is arranged in a matrix of rows and columns. Each row is called a card image and contains 80 printable characters (80 columns). The number of rows is variable. Each card image describes one keyword. A keyword is a descriptor variable that describes the data or file in some way. One example of a keyword is DATE, which describes the time the data was taken.

All card images are arranged in the same way. Columns 1-8 of a row are reserved for the keyword. Keywords are always in upper case and left justified. Column 9 contains = if the keyword has a value and a space if it does not. Column 10 is always a space. Columns 11-80 are reserved for the value of the keyword and comments.

The FITS standard requires that each FITS file contain at least one HDU. This is called the Primary HDU. Subsequent HDUs are optional and are called Extensions.

For the Primary HDU, there are several mandatory keywords. The first keyword is required to be SIMPLE. Its value is either T or F, indicating whether or not the file completely conforms to the FITS standard. The second is BITPIX, which describes the format of the data (integer, float, double, etc.). Its value is an integer which is either 8, 16, 32, -32, or -64. The third keyword is NAXIS, which is the number of axes, or dimensions, of the data. It is 0 if there is no data. If NAXIS is greater than zero, the following keywords (NAXIS1, NAXIS2...)
describe the number of elements on each axis. They are also required if NAXIS is greater than zero. The last keyword for the Primary HDU is required to be END, which has nothing following it and signals the end of the Header. There are several keywords that are used by the SOI project. These are listed in SOI Technical Notes #XXX.

The data records hold the information following a Header. The data begins in the first position of the logical record immediately following the last logical record of the Header. If the data type is char, the data is recorded in 7-bit ASCII characters with the eighth (parity) bit set to zero. For unsigned integers, each data element is one byte. For short signed integers, a data element is 2 bytes in two's complement. For long signed integers, a data element is 4 bytes in two's complement. For floating point numbers, the IEEE-754 standard is used for 32 and 64 bit numbers. The data is recorded in column major format, such that the first dimension varies the fastest, and in big endian order, such that the least significant byte has the largest address.

This description is a primer only and is incomplete. For more information on the FITS standard, see the NSDSSO FITS Draft Standard document.

3- Common Data Format (CDF)

The Common Data Format (CDF) is a flexible data file system developed by NASA for storing experimental data. The CDF standard includes file applications and library functions as well as a description of the file format. All information is described in detail in the National Space Science Data Center CDF User's Guide. The following is just a brief introduction to the format, capabilities, and information-storing variables of CDF.

A CDF is merely a file that stores a sequence of entries (data) under a variable name. A CDF can contain more than one variable or set of entries. For example, two variables for a CDF could be magnitude and distance, and the entries could be floating point values in decibels and meters. This is the basic structure of a CDF.
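As a brief illustration of the FITS header layout described in section 2, the following C sketch splits one 80-character card image into its keyword, its value indicator, and its value/comment field. The CardSketch type, the parse_card function, and the two constants are hypothetical names introduced for this example only; they are not part of the IDS-FITS library.

```c
#include <string.h>

#define FITS_CARD_LEN   80    /* characters per card image (one header row) */
#define FITS_RECORD_LEN 2880  /* bytes per FITS logical record              */

/* Hypothetical holder for the fields of one card image. */
typedef struct {
    char keyword[9];  /* columns 1-8, trailing blanks removed         */
    int  has_value;   /* 1 if column 9 holds '=', otherwise 0         */
    char rest[71];    /* columns 11-80: value and comments, untrimmed */
} CardSketch;

/* Split one card image into its fields.
 * The 80 input characters do not need to be NUL-terminated. */
static void parse_card(const char card[FITS_CARD_LEN], CardSketch *out)
{
    int n = 8;

    /* Columns 1-8: the keyword, left justified and padded with blanks. */
    memcpy(out->keyword, card, 8);
    while (n > 0 && out->keyword[n - 1] == ' ')
        n--;
    out->keyword[n] = '\0';

    /* Column 9: '=' when the keyword has a value. */
    out->has_value = (card[8] == '=');

    /* Column 10 is skipped; columns 11-80 carry the value and comments. */
    memcpy(out->rest, card + 10, 70);
    out->rest[70] = '\0';
}
```

Since a logical record holds 2880 bytes and a card image holds 80 characters, one header record holds 36 card images (FITS_RECORD_LEN / FITS_CARD_LEN).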
In order to arrange actual data values for variables in a way that makes correlations and similarities obvious, each entry is actually an array of values called a variable array. In other words, each variable has a sequence of arrays (of zero or more dimensions) that in turn hold the actual values (such as 3). These arrays are numbered in sequence and are related to arrays in other variables by some parameter, such as time. A sequence of arrays for a certain variable is called a CDF record. As mentioned before, a CDF may have more than one CDF record, one for each variable. The variable arrays, also called variable records, have a set number of dimensions. This number is the same for each variable or CDF record in the CDF. Each variable or CDF record also has an associated name and number. Variables are numbered sequentially in a CDF.

For example, a variable called location could have a sequence of arrays numbered zero to twenty, corresponding to the second at which each measurement was recorded. This is the CDF record. Each array, or variable record, has two dimensions for the x and y coordinates. Each dimension would have two entries for the two different coordinates where measurements were taken.

Because of the nature of the variable record/array method of storing data, there is a lot of potential for storing duplicate information in a CDF record. For example, if one had data measurements for two locations, where measurements at each location were done at the same time, one variable would be time and the other the location's distance from center. Each variable would have two entries in each array. However, the time variable would have the same value for both of the entries because the two locations took measurements at the same time. In order to compress this information and reduce redundancy, the CDF system requires a CDF file to declare whether the values in a variable array vary, both for the variable array as a whole and for each dimension of the array.
So, for our last example, the variable called time would declare the variance in the array's first (and only) dimension as NoVary, since the values in the array do not change across the first dimension. It also would declare the variance for the whole variable array to be NoVary. For the location variable, the variance along the first dimension would be Vary, since the values change along the array. The variable array's variance would be Vary as well. Once this is done, the CDF does not have to physically store the values if the values do not vary along a dimension. The values (now called virtual values) are implied by the variance (Vary/NoVary) of the dimension. This variance parameter is not stored in the data section of the CDF, but in a different part of the CDF called the metadata, which is described below.

Each CDF has two parts: the data and the metadata. These parts can be stored separately or in the same file. In single-file CDFs, the metadata and data are stored in different sections of the same file. In multifile CDFs, the metadata is stored in a separate .cdf file whereas the data is stored in one or more .v files. A CDF is not required to have a data part.

The data section of the CDF is stored as described above in CDF records. The metadata section contains information about the data and the experiment. It is equivalent to, but more extensive than, the header section of a FITS file. Information in the metadata is stored in units called attributes. An attribute is similar to the FITS keyword. Each attribute has a name, such as datatype or variance, and a sequence number. An attribute also has a parameter called scope. Global scope attributes apply to the entire CDF. Variable scope attributes apply to certain variables. If an attribute is variable scoped, it contains several entries, one for each variable.
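The Vary/NoVary declarations described earlier in this section can be pictured with a small C sketch. It assumes a variable with a single dimension and shows how a logical (possibly virtual) value could be looked up when only the varying values are physically stored. The names SketchVar and logical_value, and the storage layout itself, are assumptions made for illustration; the CDF library handles this internally and may well do it differently.

```c
#define SKETCH_NOVARY 0
#define SKETCH_VARY   1

/* Hypothetical description of a variable with one dimension. */
typedef struct {
    int  recVary;          /* do values change from record to record?  */
    int  dimVary;          /* do values change along the dimension?    */
    long dimSize;          /* logical number of entries per record     */
    const double *stored;  /* only the physically stored values        */
} SketchVar;

/* Return the logical value at (rec, idx).
 * An index that is declared NoVary is ignored, so every virtual value
 * along that index resolves to the single stored value. */
static double logical_value(const SketchVar *v, long rec, long idx)
{
    long perRecord = (v->dimVary == SKETCH_VARY) ? v->dimSize : 1;
    long r = (v->recVary == SKETCH_VARY) ? rec : 0;
    long i = (v->dimVary == SKETCH_VARY) ? idx : 0;

    return v->stored[r * perRecord + i];
}
```

In this sketch, a variable whose records vary but whose dimension of size 2 does not needs only one stored number per record, so three stored numbers stand for six logical values.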
Other attribute parameters are data type (how its entry values are stored), number of entries (variable for strings, 1 for non-strings), and sequence number. Some common attribute names are Title, Max entry, Datatype, and Majority. For more detailed information, see the CDF User's Guide.

4- Internal Data Structure (IDS)

The Internal Data Structure (IDS) is a structure used by data file manipulation routines to describe how data read in from a CDF or FITS file is stored in memory. For programs written in a language such as C, it is the data storage structure used for data manipulation.

The C representation of the IDS data structure is a struct called IDS. The code for this structure is found in Appendix C. Each IDS struct contains three sub-structures: GENERAL, VARIABLE, and ATTRIBUTES.

The GENERAL struct contains all-encompassing information about the file from which the data originated. It includes five elements. The first two elements in the GENERAL struct are pointers to the structs HISTORY and COMMENTS. HISTORY is a chained structure of two elements. The first element points to the next HISTORY structure. The second element is a char pointer that contains one line of the history text. History text is therefore stored one line at a time, in chronological order, as null-terminated strings in linked structures. The COMMENTS struct is essentially the same except that it stores lines of comments instead of history. The third element, *filename, is a pointer to a char that contains the name of the CDF or FITS file from which the IDS originated. epoch is a double that stores a value that can be converted to a date and time by the appropriate CDF routine. It usually contains the time the IDS was created or last modified. numDims is a long that stores the number of dimensions in the CDF or FITS file. Note that the meaning of dimension is different for FITS and CDF representations. In CDF, dimensions refers to how the data is organized in the CDF file.
In FITS, dimensions is the number of variables that are represented in the data portion of a FITS file. numDims refers to the CDF meaning of dimensions.

The VARIABLE struct stores the actual data read from the originating file. The data is stored in sets corresponding to variables. Examples of variables are degrees and magnitude. The VARIABLE struct also stores information about each variable and its corresponding set of data. The VARIABLE struct is therefore basically a chain of linked units, where each unit is one struct that contains one variable's information and data. There are eleven elements in the structure. *nextVar points to the next VARIABLE struct in the chain. *recinfo is a pointer to a RECORD struct that gives the record numbers from which the data values were read. *diminfo is a pointer to a DIMENSION struct that describes which elements in the CDF's variable array comprise the data in the IDS. *stats is a pointer to a STATS struct that contains statistics on the data pointed to by *values. *varName is a pointer to a char that contains the name of the variable. *fillvalue is a pointer to the value that is used as a "no entry" placeholder in the data. *values is a pointer to the actual data for the variable. dimVary[] is a pointer to an array of unsigned shorts. This array contains the variance of each dimension of the variable. recVariance is an unsigned short that tells whether the values change from one variable record to the next. datatype is an unsigned short that gives the variable's storage data type (int, long, float, etc.). numBytes is also an unsigned short and gives the number of bytes in each entry (for example, a long entry would have numBytes equal to four).

The RECORD struct has three elements. startrecord is a long that contains the starting variable record from which the data was written. numrecords, also a long, is the number of records in the data.
recinterval is a long that contains the interval between the records that were written to the IDS. If this value is 2, every other variable record was read.

The DIMENSION struct describes the elements of the array that were actually written to the IDS. startind[] is an array of longs of length numDims that tells where on each array dimension the read routine started to read entries into the IDS. numind[] is an array of the same form that tells how many entries in each dimension were written to the IDS. indinter[] is also an array of the same form, but lists the interval between the indices that were written to the IDS. If the value of an element of indinter is 2, every other element in that dimension was read.

The STATS struct contains ten elements that aid in the plotting of the data. minact is a double that contains the actual minimum value of the data. maxact contains the actual maximum value. minval contains the legal minimum value. maxval contains the legal maximum value. mean is a double that contains the mean of the data. stddev contains the standard deviation of the data. skew contains the skew factor of the data. *axislabel is a pointer to a char that contains the desired labeling of the variable's axis. This may or may not be the same as *varName. numvals is a long that contains the number of values in the data. statvalid is an unsigned short that tells whether the statistics in STATS are valid or not.

The last major struct is the ATTRIBUTES struct. This is similar to the VARIABLE struct in that it is a chain of linked units. Each unit describes one attribute (or keyword) of the CDF or FITS file. Each ATTRIBUTES struct contains seven elements. *next is a pointer to the next ATTRIBUTES struct. *attrName is a pointer to a char that contains the name of the attribute. *attrValue is a pointer to the actual value of the attribute. entryNum is a long that corresponds to the number of the variable that the attribute is describing.
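Since each unit in the attribute chain carries an attribute name and the number of the variable it describes, looking up an attribute amounts to walking the chain and matching on both. The following C sketch shows one way this might be done. AttrSketch is a trimmed, hypothetical stand-in for the attribute struct that keeps only the elements described so far, and find_attr is not an existing IDS routine.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, trimmed version of one unit in the attribute chain. */
typedef struct attr_sketch {
    struct attr_sketch *next;  /* next unit in the chain, or NULL       */
    const char *attrName;      /* name of the attribute                 */
    const void *attrValue;     /* pointer to the attribute's value      */
    long entryNum;             /* number of the variable it describes   */
} AttrSketch;

/* Walk the chain and return the first unit whose name and entry number
 * both match, or NULL when there is no such unit. */
static const AttrSketch *find_attr(const AttrSketch *head,
                                   const char *name, long entryNum)
{
    const AttrSketch *a;

    for (a = head; a != NULL; a = a->next) {
        if (a->entryNum == entryNum && strcmp(a->attrName, name) == 0)
            return a;
    }
    return NULL;
}
```

The same walk applies to the VARIABLE chain through *nextVar and *varName, and to the HISTORY and COMMENTS chains, which are linked in the same manner.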
Note that for files with multiple variables, there is going to be more than one ATTRIBUTES struct for each attribute, one for each variable. scope is an unsigned short that contains the scope of the attribute. datatype is an unsigned short that gives the data type of the attribute's value. numBytes is similar to the VARIABLE struct's numBytes but describes the attribute's value instead.

5- IDS-FITS C Library Routines

The IDS-FITS C library routines are functions used in programs to read, write, and manipulate data in FITS files. They are contained in the directory ~soi/src/libids.d. The library includes .h files as well as the .c files that contain the actual functions. The main .h files are ids.h and fits.h. ids.h is also used by the IDS-CDF routines. fits.h contains all the keyword definitions, constant definitions, and other parameters that are used in the IDS-FITS routines. These .h files are found in the directory ~soi/src/include.

The IDS-FITS library routines are in the process of being written.

6- IDS-CDF C Library Routines

The IDS-CDF routines are similar to the IDS-FITS routines. The only difference is that they work with the IDS-CDF interface rather than the IDS-FITS interface. The major .h files are ids.h and cdf.h. There are also minor .h files. These are located in the directory /usr/local/src/cdf/include.

One major difference between the FITS routines and the CDF routines is that the CDF file system provides the programmer with a set of C library routines that are already designed to interface with CDF files. The only thing necessary to interface with IDS is therefore to put the data retrieved by the existing CDF routines into the IDS structure. With FITS, on the other hand, all retrieval routines had to be hand coded from the ground up.

All the IDS-CDF functions are contained in .c files that reside in the directory ~soi/src/lib/libcdf.d.
The CDF functions supplied with the CDF standard are found in the directory /usr/local/src/cdf/cdf21-dist/cdf21-dist/src/lib.

The IDS-CDF library routines are in the process of being coded. A description of the CDF library routines used in the IDS-CDF functions is found in Appendix B.

7- CDF-FITS Command Line Routines

These have yet to be written. For a more detailed description of the CDF-FITS command line routines, see SOI Technical Notes #XXX.

References

National Space Science Data Center, CDF User's Guide for Unix Systems, Version 2.1, January 9, 1992.
NASA Science Data Systems Standards Office, Flexible Image Transport System (FITS), Draft Standard, NSDSSO 100-0.1, December, 1990.
R. S. Bogart and J. Suryanarayanan, SOI Technical Note 92-085, SOI FITS Keyword List, rev. 1992.IX.24.

Appendix A GDS-FITS C Library Routines

Before IDS, GDS was used as the internal data structure. It was designed to interface with FITS, so a set of GDS-FITS routines was written to facilitate data travel over the GDS-FITS interface.

Appendix B CDF Library Functions

Appendix C IDS Structure Representation in C

    /* Forward typedefs so that each struct can refer to the others. */
    typedef struct ids_struct IDS;
    typedef struct general    GENERAL;
    typedef struct history    HISTORY;
    typedef struct comments   COMMENTS;
    typedef struct variable   VARIABLE;
    typedef struct record     RECORD;
    typedef struct dimension  DIMENSION;
    typedef struct stats      STATS;
    typedef struct attributes ATTRIBUTES;

    struct ids_struct
    {
        GENERAL    *Geninfo;
        VARIABLE   *Varinfo;
        ATTRIBUTES *Attrinfo;
    };

    struct general                   /* This structure holds info      */
    {                                /* pertaining to the whole ids.   */
        HISTORY  *history;
        COMMENTS *comments;
        char     *filename;          /* Contains the name of the       */
                                     /* originating file.              */
        double    epoch;             /* Time of last modification.     */
        long      numDims;           /* CDF dimensions of the struct.  */
    };

    struct history
    {
        HISTORY *next;               /* Pointer to next line of hist.  */
        char    *text;               /* Holds one line of hist.        */
    };

    struct comments
    {
        COMMENTS *next;              /* Pointer to next line of comments. */
        char     *text;              /* Holds one line of comments.       */
    };

    struct variable
    {
        VARIABLE  *nextVar;          /* Pointer to next variable struct.  */
        RECORD    *recinfo;          /* Info on record number of data.    */
        DIMENSION *diminfo;          /* Info on dimension layout.         */
        STATS     *stats;            /* Statistics on variable.           */
        char      *varName;          /* Name of variable.                 */
        void      *fillvalue;
        void      *values;           /* Pointer to data of variable.      */
        unsigned short *dimVary;     /* Dimensional variance, one entry   */
                                     /* per dimension.                    */
        unsigned short recVariance;  /* Record variance.                  */
        unsigned short datatype;     /* Type of data.                     */
        unsigned short numBytes;     /* Bytes per data element.           */
    };

    struct record                    /* Stores what records in the CDF */
    {                                /* the data values are from.      */
        long startrecord;
        long numrecords;
        long recinterval;            /* Interval between records.      */
    };

    struct dimension                 /* Stores what indices of what    */
    {                                /* dimensions the data is from.   */
        long *startind;              /* Starting index of each dim.    */
        long *numind;                /* Num indices from each dim.     */
        long *indinter;              /* Interval between indices.      */
    };                               /* Each array has numDims entries. */

    struct stats
    {
        double minact;               /* Actual min of the data.        */
        double maxact;               /* Actual max of the data.        */
        double minval;               /* Legal min of the data.         */
        double maxval;               /* Legal max of the data.         */
        double mean;
        double stddev;
        double skew;
        char  *axislabel;            /* Var's axis name on a graph.    */
        long   numvals;              /* Number of elements of data.    */
        unsigned short statvalid;    /* Are the statistics valid?      */
    };

    struct attributes
    {
        ATTRIBUTES *next;            /* Next attribute struct.            */
        char       *attrName;        /* Name of attribute.                */
        void       *attrValue;       /* Pointer to attribute value.       */
        long        entryNum;        /* Corresponds to var it describes.  */
        unsigned short scope;        /* Global or local.                  */
        unsigned short datatype;     /* Attribute value's type.           */
        unsigned short numBytes;     /* Bytes per type.                   */
    };
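To show how the RECORD struct above might be used, the following C sketch maps the k-th record held in an IDS back to its record number in the originating file. It assumes that numrecords counts the records held in the IDS and that records were read at a constant spacing of recinterval beginning at startrecord. source_record is a hypothetical helper, not an existing library routine, and the RECORD definition is repeated only so that the sketch compiles on its own.

```c
/* Copy of the RECORD struct from the listing above. */
typedef struct record {
    long startrecord;   /* first record that was read          */
    long numrecords;    /* number of records held (assumption) */
    long recinterval;   /* interval between records read       */
} RECORD;

/* Return the record number in the originating file of the k-th record
 * held in the IDS (k counts from zero), or -1 if k is out of range. */
static long source_record(const RECORD *r, long k)
{
    if (k < 0 || k >= r->numrecords)
        return -1;
    return r->startrecord + k * r->recinterval;
}
```

Under the same assumptions, the identical arithmetic would apply per dimension to startind, numind, and indinter in the DIMENSION struct.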
6- NGDS Internal Data Structure

The NGDS internal data structure was designed to compensate for the weaknesses of IDS. IDS is best used as a structure for storing data. Because of the code overhead incurred every time data is accessed from an IDS (from calculating NOVARY and VARY values, for example), the IDS is inefficient for manipulation of data by math or utility functions. NGDS is therefore meant to be a derivative of the IDS structure that is used for manipulating IDS data by math or utility functions.

The most obvious difference between IDS and NGDS is size. An NGDS only holds the raw data and five other values. The top layer of the NGDS structure consists of an integer descriptor, which denotes the structure as an NGDS, a pointer *data to the data array, and a pointer *info to an INFO structure. The INFO structure contains four elements. The first is a pointer *parent back to the IDS from which the NGDS came. The second is the integer rank of the NGDS. The third is the integer type, which is explained below. The fourth is the unsigned short datatype, which describes the data elements in the array.

The most important difference between IDS and NGDS is the simplicity of the data storage method. The NGDS data array is simply a flat array, with no dimensional structure of its own, of data type datatype that can be interpreted as an array of dimension rank. This allows math functions wishing to access the data to do so in an efficient manner.

An NGDS is created whenever a function desires to manipulate data from an IDS. If a pointer to the IDS is present, the function calls a routine to create an NGDS from the IDS, and this NGDS is passed to the function as data. For a detailed description of these and related routines, see the technical notes on IDS/NGDS routines.

As of now, there is only one type of NGDS. However, when the need arises, it could be arranged so that there are different types of NGDS, tailored to the needs of various functions. If this is the case, the type integer in the NGDS *info pointer would hold a value that denotes the type of NGDS.

8- IDS Math/Utility Functions

Besides file I/O between FITS, CDF, and other external storage media, the other use of IDS (and NGDS) is data manipulation by math and utility functions such as dopplergram or plot. These functions generally take NGDS structures as data inputs and perform some operation on them. As mentioned above, the function would call certain utility routines to make sure that it receives NGDS pointers no matter what pointer might be passed to it by the user. These routines are described in a separate technical note, as are the math and utility functions (technical note #xxx).
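As a sketch of how a math or utility function might address the flat NGDS data array as an array of dimension rank, the following C function converts a set of indices into an offset into the data array. It assumes that the first dimension varies fastest (the ordering described for FITS data in section 2) and that the size of each dimension is available in an array. This note does not say where an NGDS keeps those sizes, so the sizes argument and the name flat_offset are hypothetical.

```c
/* Convert indices into an offset into a flat data array.
 *   rank  - number of dimensions
 *   sizes - number of entries along each dimension (hypothetical input)
 *   index - position along each dimension, counting from zero
 * The first dimension is assumed to vary fastest. */
static long flat_offset(int rank, const long sizes[], const long index[])
{
    long offset = 0;
    long stride = 1;   /* elements skipped per step along dimension d */
    int d;

    for (d = 0; d < rank; d++) {
        offset += index[d] * stride;
        stride *= sizes[d];
    }
    return offset;
}
```

For a rank 2 array with 3 entries along the first dimension and 4 along the second, the element at indices (2, 1) lies at offset 2 + 1 * 3 = 5.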