Parallel Processing in Dynamics Pipeline
Peter N. Milford and Sylvain Korzennik
SOI-TN-063
Draft - 7/8/92

Introduction

The goal of this technical note is to outline basic helioseismic processing for the SOI dynamics pipeline from the perspective of parallel processing. It contains an outline of how the algorithms parallelize and a sample parallel architecture.

In general, the algorithms are discussed from the perspective of a coarse-grained (per-image level), loosely coupled parallel computing system. This architecture seems to match the majority of the algorithms well. In addition, many of the algorithms can be efficiently parallelized on a per-pixel basis.

The algorithms discussed are the MDI baseline dynamics processing, including: simple calibration, detrending, remap, spatial FFT, SHT, FFT, l-ν diagram, and peak fitting.

The intent is to categorize the approximate order of operations required on the data and how they scale with image size, number of observations, and number of compute nodes. Although fully categorizing the input and output requirements of the algorithms and computer architectures is outside the scope of this document, some relevant estimates are included.

Notation

NPIX - Number of pixels in a single image, dopplergram, etc. (For MDI observations, normally 1024*1024 ≈ 10^6.)

NOBS - Number of observations for analysis; e.g., for 60 days of dynamics there are 60*24*60 = 86,400 dopplergrams (one dopplergram per minute for 60 days).

O( ) - Order of. An estimate of the number of operations and how it scales with increasing N; e.g., subtracting two images (matrices) requires on the order of NPIX operations: O(subtract) = O(NPIX), so doubling the number of pixels in the array doubles the number of operations. The actual number of operations (scale factor) required per pixel may be small, e.g. ~1 for subtraction, or large, e.g. ~100 for remap. Other algorithms scale differently with the number of pixels; e.g., fast Fourier transforms scale as NPIX log2 NPIX. The scale factors are only crudely estimated, by analogy with the APU Science Primitives estimates [1].

NNODES - Number of computation nodes in a loosely coupled network.

Server - In some architectures there may be just one server machine connected to the disk system. In the extreme there may be only one computer, acting as both compute server and disk server.

NMODES - Number of modes the SHT is computed for. For lMAX of 1000, this is about 5x10^5.

SHT - Spherical Harmonic Transform.

FFT - Fast Fourier Transform.

Summary of Input Data

The main input data for the dynamics analysis consists of a sequence of 2D Doppler images. There are 60 days of observations, each with one observation per minute: 86,400 observations in total. Each observation is 1024*1024 elements, stored initially as 16-bit integers (e.g. short DAT[1024][1024][86400]), for a total of 168.75 GB of input data. The data may have gaps when no data was received from the experiment, and individual images may be truncated by telemetry dropouts.

Summary of Computing Assumptions

Because disk I/O, network I/O, and memory bandwidth are in many cases the limiting factors in parallelizing algorithms, four baseline computer architectures are used to discuss the algorithms. The architectures are not meant to be complete, merely representative.

1. Single large computer system (see Figure 1), which does all the computing and I/O.
2. A server with all the disk storage, networked to compute servers (see Figure 2). The network speed and disk speed can be matched; in this implementation most disk I/O is associated with network I/O of the same data. Consider two cases:
   a. Slow network and disk.
   b. Fast network and disk.
3. Shared memory multiprocessor system (see Figure 3).
4. Distributed I/O system (see Figure 4).
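As a quick check of the figures in the Summary of Input Data, the Python sketch below recomputes NOBS, NPIX, and the raw input volume. It assumes that "GB" there means 2^30 bytes; with that reading the total comes out to exactly 168.75.

```python
# Raw-data volume for the dynamics run: 60 days of 1024 x 1024
# dopplergrams, one per minute, stored as 16-bit (2-byte) integers.
DAYS = 60
IMAGES_PER_DAY = 24 * 60      # one observation per minute
NX = NY = 1024                # pixels per image side
BYTES_PER_PIXEL = 2           # 16-bit integers

NOBS = DAYS * IMAGES_PER_DAY  # number of observations
NPIX = NX * NY                # pixels per image
total_bytes = NOBS * NPIX * BYTES_PER_PIXEL
total_gib = total_bytes / 2**30   # assumption: GB = 2**30 bytes

print(f"NOBS = {NOBS}, NPIX = {NPIX}, input = {total_gib} GB")
```

Holding the same data as 32-bit floats would double the volume, to 337.5 GB.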
The goal of this architecture is to increase I/O in line with compute cycles. It is probably the most realistic system considered here.

Sample discussion of a simple algorithm: Removing instrument velocity offset

It is assumed that the instrument has a (nearly) fixed velocity offset as a function of position on the CCD. Prior to the SHT and FFT, this offset should be removed by subtracting it. This is a simple subtraction requiring O(NPIX) operations.

The computation for each image is:

For i = 0, NROWS - 1 {
    For j = 0, NCOLS - 1 {
        Vcorrected[i,j] = Vin[i,j] - Voffset[i,j]
    }
}

where NROWS * NCOLS = NPIX (1024 * 1024 for MDI), and Voffset is fixed for long time periods (e.g., 1 day or 1440 images).

The simplest implementation would be to read a Doppler image to correct, read the calibration image, compute the difference, and save the result. This requires O(NPIX) calculations, O(2*NPIX) disk reads, and O(NPIX) disk writes per image.

A first trivial speed-up would be to read the calibration image only once for a whole sequence of images, reducing the cost to O(NPIX) calculations, O(NPIX) reads, and O(NPIX) writes per image.
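A minimal Python sketch of the trivial speed-up: the offset image is supplied once and reused for a whole sequence of images. Images are modelled as nested lists and file reading and writing are left out; the function name `remove_offset` is illustrative, not part of any existing pipeline.

```python
from typing import Iterable, Iterator, List

Image = List[List[float]]  # one image stored as a list of rows


def remove_offset(images: Iterable[Image], v_offset: Image) -> Iterator[Image]:
    """Yield Vcorrected = Vin - Voffset for each image in the sequence.

    v_offset is passed in once and reused for every image, so each image
    costs O(NPIX) subtractions plus one read and one write, instead of
    re-reading the calibration image each time. All images are assumed
    to have the same shape as v_offset.
    """
    for v_in in images:
        yield [[v - off for v, off in zip(row, off_row)]
               for row, off_row in zip(v_in, v_offset)]


# Tiny 2 x 2 example in place of real 1024 x 1024 dopplergrams.
offset = [[1.0, 2.0], [3.0, 4.0]]
sequence = [[[1.0, 2.0], [3.0, 4.0]], [[2.0, 2.0], [2.0, 2.0]]]
corrected = list(remove_offset(sequence, offset))
```

Because `remove_offset` is a generator, only the offset image and one dopplergram need to be in memory at a time.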
This algorithm can be coarse-grain parallelized on a per-image basis: each image is independent of all other images, and the order in which the data is processed does not matter. (Each image in turn can have its calculations fine-grain parallelized on a per-pixel basis.)

For example, for the full dynamics dataset with NNODES compute nodes, one could allocate NOBS/NNODES = 86400/NNODES images to each computational node; if each node could read, process, and write results at the same speed as a single node, the speedup would be a factor of NNODES.
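The per-image decomposition can be sketched as a partition of the NOBS image indices over NNODES nodes. This is a hypothetical helper, assuming every image costs the same and nodes never wait on I/O, which is the idealization under which the speedup equals NNODES.

```python
from typing import List


def partition(n_images: int, n_nodes: int) -> List[range]:
    """Split image indices 0..n_images-1 into n_nodes contiguous blocks.

    Block sizes differ by at most one image, so the load is balanced.
    """
    base, extra = divmod(n_images, n_nodes)
    blocks, start = [], 0
    for node in range(n_nodes):
        size = base + (1 if node < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks


def ideal_speedup(n_images: int, n_nodes: int) -> float:
    """Speedup over one node if every image costs the same and I/O never stalls."""
    busiest = max(len(block) for block in partition(n_images, n_nodes))
    return n_images / busiest


blocks = partition(86400, 10)   # 60 days of dopplergrams on 10 nodes
```

Contiguous blocks are chosen here rather than round-robin so that time-adjacent images stay on the same node; this is an assumption of the sketch, but it suits steps such as detrending and gap filling that read neighbouring images.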
Server Example for Removing Instrument Offset

Parallelizing the instrument offset removal on a loosely coupled network of computers with a central disk server requires:

For the server: O(NPIX) disk reads, O(NPIX) disk writes, O(NPIX) network sends, and O(NPIX) network receives per image.

For the computation node: O(NPIX) calculations, O(NPIX) network receives, and O(NPIX) network sends per image.

Assume a 1 MFLOP sustained computation rate (memory to memory), sustained network sends and receives at 1 Mbit/s ≈ 128 KB/s, 32-bit floating-point images, and 10^6 pixels.

The server does 2 NPIX of network I/O and 2 NPIX of disk I/O per image:

t_server = 2*t_disk + 2*t_network

with disk I/O also at 1 Mbit/s ≈ 128 KB/s. For 32-bit floating-point data with a slow network and non-overlapped I/O:

$$t_{\mathrm{server}} = 2 \times \frac{1024^2 \times 4\ \mathrm{bytes}}{128\ \mathrm{KB/s}} + 2 \times \frac{1024^2 \times 4\ \mathrm{bytes}}{128\ \mathrm{KB/s}} = 64 + 64 = 128\ \mathrm{seconds\ per\ image}$$

If disk I/O and network I/O are 100% overlapped, this drops to 64 seconds per image. Assuming a fast disk (x4), a fast network (x4), and 16-bit integer data gives ≈ 4 seconds per image.

The computation node requires t_receive + t_calc + t_send to do a complete calculation:

$$t_{\mathrm{node}} = 2\,t_{\mathrm{netio}} + t_{\mathrm{calc}} \approx 2 \times \frac{1024^2 \times 4\ \mathrm{bytes}}{128\ \mathrm{KB/s}} + \frac{1024^2}{10^6\ \mathrm{ops/s}} \approx 64\ \mathrm{seconds} + 1\ \mathrm{second}$$

If we use integer*2 (16-bit) data and quadruple the network speed to 512 KB/s, then 8 seconds + 1 second is required. Assuming overlapped input and output operations, 4 seconds + 1 second is required.

At these speeds the server can only keep up with one compute node and should probably do the computation as well!

Shared Memory Multiprocessor

A shared memory multiprocessor with large local memories (minimal contention) and a single I/O channel requires O(NPIX) reads, O(NPIX) writes, and O(NPIX) calculations per image. A shared memory system with a single I/O channel is therefore similar to a system with a server alone.

Distributed I/O System Implementation of Removing Instrument Velocity Offset

As raw data is retrieved from tape or the optical disk farm, it is placed into a distributed disk system, e.g. one disk system per computational node, probably filling each disk in turn so that each computer holds 1/NNODES of the data.

With each computer the same speed and efficiency as the server/compute nodes above, processing requires ~5 seconds per image / NNODES. E.g., 1 day of data (1440 images) on 10 nodes, each with local disks, requires ~(1440/NNODES) x 5 seconds = 144 x 5 = 720 seconds ≈ 12 minutes, assuming that filling the local disks is as efficient as filling a server's disks.

Data Types

For the SHT/FFT, the analysis of the IP algorithms shows that the Doppler data can be represented as 16-bit integers. This requires more detailed analysis of the algorithms, but reduces the I/O requirements by a factor of two.

External storage formats

If external data formats different from the internal formats are used, O(NPIX) calculation stages are added to each I/O operation; in some cases these will dominate the calculation steps!
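The timing arithmetic of the server, compute-node, and distributed I/O examples can be captured in a small Python model. It is a sketch under stated assumptions: 1 Mbit/s is taken as 128 KB/s with K = 1024, the calculation costs about one operation per pixel at 1 MFLOP, and "overlapped" I/O is modelled as hiding one transfer of each overlapping pair. All function names are hypothetical.

```python
NPIX = 1024 * 1024
SLOW_RATE = 128 * 1024   # bytes/s: 1 Mbit/s taken as 128 KB/s (K = 1024)
CALC_RATE = 1.0e6        # operations/s: 1 MFLOP, ~1 operation per pixel assumed


def transfer_s(bytes_per_pixel: int, rate: float) -> float:
    """Seconds to move one image at `rate` bytes per second."""
    return NPIX * bytes_per_pixel / rate


def server_s(bytes_per_pixel: int, rate: float, overlapped: bool = False) -> float:
    """t_server = 2*t_disk + 2*t_network, with disk and network at the same rate.

    With 100% overlap the network transfers hide behind the disk transfers,
    leaving one read and one write on the critical path.
    """
    t_disk = t_net = transfer_s(bytes_per_pixel, rate)
    if overlapped:
        return 2 * max(t_disk, t_net)
    return 2 * t_disk + 2 * t_net


def node_s(bytes_per_pixel: int, rate: float, overlapped: bool = False) -> float:
    """t_node = t_receive + t_calc + t_send; overlapping receive and send hides one."""
    t_net = transfer_s(bytes_per_pixel, rate)
    t_calc = NPIX / CALC_RATE
    return (t_net if overlapped else 2 * t_net) + t_calc


def distributed_s(n_images: int, n_nodes: int, s_per_image: float) -> float:
    """Elapsed time with images spread evenly over nodes that have local disks.

    The busiest node holds ceil(n_images / n_nodes) images.
    """
    per_node = (n_images + n_nodes - 1) // n_nodes
    return per_node * s_per_image


slow_server = server_s(4, SLOW_RATE)                   # 32-bit floats, no overlap
fast_node = node_s(2, 4 * SLOW_RATE, overlapped=True)  # 16-bit data, 4x network
one_day = distributed_s(1440, 10, 5.0)                 # 1 day of data on 10 nodes
```

With these assumptions the model gives 128 s and 64 s per image for the server, 64 + 1 s, 8 + 1 s, and 4 + 1 s per image for a compute node, and 720 s (12 minutes) for one day of data on 10 distributed nodes at 5 s per image.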
[Table: speedup with NNODES, by algorithm and architecture (Single CPU, Server, Distributed I/O)]

Assume the computer has a two-cycle multiply-accumulate (MAC) operation (F*, F+); the scale factor then reduces to 2.

Assume that only every 2nd (or possibly even every 3rd) m is required. This reduces the number of masks and operations by a factor of 2.

Assume only l up to 800 is required, a reduction of about 1/3 in the number of masks.

Together these reduce the ≈ 0.5 hour per image to 0.5 hour / (2 x 2) x 2/3 ≈ 0.1 hour, or about 6 minutes per image per slow processor. On 10 1-MFLOP processors, each with (250 MB to 1 GB) x 0.64/2 ≈ 80 MB to 320 MB of local memory, this requires ~(1440/10) x 6 minutes ≈ 15 hours per day of input data. With 5 MFLOP computation nodes this requires ~3.0 hours per day.

N.B. Main computer memory requirements can be reduced by computing in two passes, each with half the masks.

Disk I/O for SHT

Each image is read, and for each image 2 x NMASK POXs are saved to disk. There are then either 10^6 or 1/3 x 10^6 POXs per image to write to disk as 16-bit integers or 32-bit floats. These are saved into many small files, for example one file for each l, m or, instead, one file for each l. The goal is to generate files (or parts of files) small enough that a full 60 days can be read into memory and the temporal FFT can be calculated. (Locality of reference may be important for this . . .)

I/O Requirements

To prepare for the transpose data. . .

Transpose

The temporal FFT requires input data ordered as all times for a particular l, m.
The normal output of the SHT is all l, m for a particular time, e.g.:

desired for FFT: DAT[t][l][m]
normal output of SHT: DAT[l][m][t]

The output from the SHT can be arranged to minimize the work required by a transpose:

Method 1: Keep each l, m in an individual file, and append to each file as the POX is calculated.

[Table: V offset remove. Order, scale, and time (slow, fast) by operation (Disk I/O, Net I/O, Compute, Convert) for the server and compute nodes; recoverable order entries include 2*NPIX and NPIX, and scale factors 1 and 10.]

The design for good telemetry coverage (planned) and poor telemetry coverage (not planned) may be different.

Dynamics Pipeline

The assumed stages are:

1. Read tapes (assumed to hold the complete set of available data, already ordered).
2. Bad pixel correct.
3. Remove instrument velocity offset.
4. Calibrate velocity, e.g. ephemeris correct.
5. Other as yet undefined calibration, e.g. CTE correction, scattered light.
   a. Save dataset.
6. Fill data gaps in images and missing images.
7. Detrend and high pass filter.
   a. Save dataset?
8. Remap and apodize.
9. Spatial FFT.
10. Dot product with associated Legendre function masks.
    a. Save dataset.
11. Transpose - save, append, apodize.
12. Temporal FFT.
    a. Save.
13. Peak find.

The detrend and transpose are the only algorithms that are not inherently parallelizable on a per-image basis.

Calibration

These steps consist of several calculations carried out on individual images. The baseline calibration may include:

- Replacing a set of already known bad pixels in each image with the mean of nearby pixels.
- Subtracting a calculated/observed instrument velocity offset (this could instead be included in the detrend/high pass filter step). This consists of subtracting a slowly changing observed image from each dopplergram; using a fixed image for each day's observing will probably be sufficient. See Section 1.4 for a discussion of this algorithm.

Fill Gaps

Some specific gap-filling algorithm will be used, perhaps either zero filling or simple interpolation over gaps (e.g., linear interpolation).

The gap filling can probably be done the first time data is read from disk in the calibration procedure. Except where gaps span processor disks, it will involve minimal processor-to-processor communication.

For one- or two-minute gaps, probably the most frequent, linear interpolation may be the best and simplest method of interpolating the data. To linearly interpolate data in an image at time t_missing, with the last good image at t-good and the next good image at t+good:

$$\mathrm{Image}(t_{\mathrm{missing}}) = \mathrm{Image}(t^{-}_{\mathrm{good}}) + \frac{t_{\mathrm{missing}} - t^{-}_{\mathrm{good}}}{t^{+}_{\mathrm{good}} - t^{-}_{\mathrm{good}}}\left[\mathrm{Image}(t^{+}_{\mathrm{good}}) - \mathrm{Image}(t^{-}_{\mathrm{good}})\right] \qquad (1)$$

As long as Image(t_missing), Image(t+good), and Image(t-good) are on the same processor disk filesystem, this should parallelize efficiently. If they are on adjacent disk systems, it will not parallelize as efficiently, but for short gaps of only a few images this will not be a problem.
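The linear interpolation described above, applied pixel by pixel, can be sketched in Python as follows. The function name and the nested-list image representation are illustrative assumptions; a real implementation would read the two good images from the local disk system.

```python
from typing import List

Image = List[List[float]]


def interpolate_missing(before: Image, t_before: float,
                        after: Image, t_after: float,
                        t_missing: float) -> Image:
    """Linearly interpolate a missing image between two good images, pixel by pixel."""
    if not t_before < t_missing < t_after:
        raise ValueError("t_missing must lie strictly between the two good images")
    w = (t_missing - t_before) / (t_after - t_before)  # weight of the later image
    return [[a + w * (b - a) for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(before, after)]


# One-minute gap: good images at t = 0 and t = 2 minutes, missing at t = 1.
filled = interpolate_missing([[0.0, 10.0]], 0.0, [[2.0, 20.0]], 2.0, 1.0)
```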
Detrend and High Pass Filter

The data contains low spatial and temporal frequency signals that dominate the oscillations; these include solar rotation, limb shift, and changes in the instrument background signal. These signals can be removed by high-pass spatial and temporal filtering of the data prior to the spatial FFT/SHT/temporal FFT.

One method of doing this involves subtracting a sliding boxcar mean of images from the data prior to transforming. Current practice uses windows of about 21 minutes (Nwin images) for this process. (This type of filter does have unwanted sidelobes.)

On a single processor, the algorithm runs by maintaining a sum of the 21 images centered on the image to be detrended, and subtracting the resulting mean:

$$\mathrm{out}_i = \mathrm{in}_i - \sum_{j = i - N_{\mathrm{win}}/2}^{i + N_{\mathrm{win}}/2} \frac{\mathrm{in}_j}{N_{\mathrm{win}}}$$

with Nwin/2 rounded down, so that the window holds exactly Nwin images.

[Table: Remap. Order, scale, and time (slow, fast) for the server and compute nodes; recoverable entries include order NPIX + NPIXR, per-node terms (NPIX + NPIXR)/NNODES and NPIXR/NNODES, and scale factors 200 and 10.]

To parallelize in a distributed I/O system, this requires access to input data on adjacent nodes. As long as NOBS/NNODES ≫ Nwin, this will be efficient.
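The running-sum version of the boxcar detrend can be sketched in Python. It keeps only Nwin images in memory and updates the window sum incrementally (add the image entering the window, subtract the one leaving), which corresponds to storing the Nwin images of the running mean in memory. Images are flat lists of pixel values, and the first and last Nwin/2 images (rounded down), which lack a full window, are simply not output; both choices are assumptions of this sketch.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, List, Optional


def boxcar_detrend(images: Iterable[List[float]], n_win: int = 21) -> Iterator[List[float]]:
    """Yield in_i - mean(in_{i-h} .. in_{i+h}) for each image i with a full window.

    n_win must be odd, h = n_win // 2. Only n_win images are held in memory.
    """
    if n_win % 2 == 0:
        raise ValueError("n_win must be odd")
    half = n_win // 2
    window: Deque[List[float]] = deque()
    total: Optional[List[float]] = None
    for img in images:
        if total is None:
            total = [0.0] * len(img)
        window.append(img)                                # image enters the window
        total = [s + p for s, p in zip(total, img)]
        if len(window) > n_win:
            old = window.popleft()                        # image leaves the window
            total = [s - p for s, p in zip(total, old)]
        if len(window) == n_win:
            centre = window[half]                         # image being detrended
            yield [c - s / n_win for c, s in zip(centre, total)]


# A single bright frame in a 5-frame sequence of 1-pixel images, 3-frame window.
out = list(boxcar_detrend([[0.0], [0.0], [3.0], [0.0], [0.0]], n_win=3))
```

A linear trend in time is removed exactly by a symmetric boxcar, which is a convenient sanity check.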
The algorithm requires O(3NPIX) arithmetic operations and O(3NPIX) disk reads and writes. By storing in memory the Nwin images used for the running mean calculation, this can be reduced to O(3NPIX) arithmetic, O(NPIX) disk reads, and O(NPIX) disk writes.

The 21-minute boxcar average can be considered an output quantity and stored for, say, every 10th image.

Remap

The purpose of the remap is to convert the coordinate system of each input image into a new coordinate system. The transform is a function of the time the image was observed. The output data differs slightly in size from the input data; typically the input is 1024x1024 = NPIX pixels and the output 900x1024 = NPIXR pixels.

Very crudely, the calculation is:

Remap[i,j] = a V[k,l] + b V[k+1,l+1] + c V[k+1,l] + d V[k,l+1]

computed for each output pixel (i, j) of the velocity image V. The coefficients a, b, c, d and the indices k, l are recomputed for each output pixel (i, j) from the metadata associated with each image and the spacecraft orbit.

Remap is one of the most computationally intensive operations: it requires O(NPIXR) operations with a large scale factor (say 200). It can be coarse-grain parallelized, as each image is independent and the order in which the images are processed does not matter. The individual calculations can be fine-grain parallelized on a per-output-pixel basis, though each output pixel depends on several input pixels that are not known a priori.

Spatial FFT

The aim is to separate the m dependence in each image: for each remapped image, separately compute the FFT of each row.

For each image calculate:

For each row i {
    FFTout[i] = FFT(in[i])
}
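The per-row spatial FFT can be sketched with a textbook recursive radix-2 Cooley-Tukey FFT in plain Python. This assumes each row length is a power of two (for example 1024; other lengths would need zero padding or a mixed-radix FFT), and a production pipeline would call an optimized FFT library instead. The names `fft` and `row_ffts` are illustrative.

```python
import cmath
from typing import List, Sequence


def fft(x: Sequence[complex]) -> List[complex]:
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft(x[0::2])   # FFT of even-indexed samples
    odd = fft(x[1::2])    # FFT of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]   # twiddle factor
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def row_ffts(image: Sequence[Sequence[float]]) -> List[List[complex]]:
    """FFTout[i] = FFT(in[i]) for every row i of one remapped image."""
    return [fft([complex(v) for v in row]) for row in image]


spectra = row_ffts([[1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]])
```

Each row is transformed independently, so rows (fine grain) or whole images (coarse grain) can be spread over processors.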
[Figure 1: Single Computer Implementation (CPU, disk, and memory)]

The FFT of all rows of a remapped image takes O(sqrt(NPIXR) * sqrt(NPIXR) log2 sqrt(NPIXR)) calculations. Nimages images on NNODES processors take O(Nimages/NNODES) times the single-image time.

SHT (Spherical Harmonic Transform)

There are two main steps for the dot (inner) product with associated Legendre function masks:

(i) Computing masks, or loading precomputed masks.
(ii) Dot product of masks and columns of data, requiring ~O(NMASKS * sqrt(NPIXR)) operations.

For each row-FFT'd image, calculate:

$$\mathrm{POX}_i = \sum_{j=1}^{n_{\mathrm{rows}}} \mathrm{FFTout}[j][\mathrm{POXcol}_i]\;\mathrm{MASK}[i][j]$$

The MASK[i][j] are fixed. They can be computed on the fly; or, to improve efficiency, a range of values can be computed and reused on several images; or they can be precomputed once and stored on disk (or in memory).

For all l and m up to 1000, there are 5x10^5 masks, each used twice. Each mask has ~5x10^2 elements (kept as symmetric or antisymmetric masks), giving 5x10^5 x 5x10^2 ≈ 2.5x10^8 mask elements. The mask elements can be stored as 8-bit integers, 16-bit integers, or 32-bit floats, so mask storage requires 250 MB to 1 GB of storage (memory).

For the SHT, each image requires ~5x10^8 operations with a scale factor of, for example, 4 per operation, i.e. ~(4 x 5x10^8)/10^6 ≈ 2x10^3 seconds ≈ 0.5 hours per image on a 1 MFLOP processor. This is too slow by a factor of about 30-60 to be practical on a single slow large-memory computer. Instead, use about ten 5 MFLOP computers, each with about 500 MB of main memory (or two passes with 250 MB of main memory).

Mask Computation

Associated Legendre polynomial masks:

Improving the SHT

As the time to compute each output value in the SHT is large, there may be sufficient time during this operation to optimize the data for the transpose, and there may be ways to speed up the computation.
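The SHT inner product and the operation-count estimate can be sketched together in Python. In this sketch masks are keyed by (l, m) and hold one value per row; that layout, the function names, and leaving out how the associated Legendre masks are computed are all assumptions. The default arguments reproduce the 5x10^5 masks of ~5x10^2 elements, each used twice, with a scale factor of 4 on a 1 MFLOP processor.

```python
from typing import Dict, List, Sequence, Tuple

Mode = Tuple[int, int]  # (l, m)


def sht_dot(fft_rows: Sequence[Sequence[complex]],
            masks: Dict[Mode, List[float]]) -> Dict[Mode, complex]:
    """POX[l, m] = sum over rows j of FFTout[j][m] * MASK_lm[j] for one image.

    fft_rows[j][m] is the m-th Fourier coefficient of row j; each mask holds
    one value per row (computing the masks themselves is not shown).
    """
    return {(l, m): sum(row[m] * w for row, w in zip(fft_rows, mask))
            for (l, m), mask in masks.items()}


def sht_seconds_per_image(n_masks: float = 5e5, mask_len: float = 5e2,
                          uses: int = 2, scale: float = 4,
                          flops: float = 1e6) -> float:
    """Operation-count estimate: mask elements x uses x scale factor / FLOP rate."""
    return n_masks * mask_len * uses * scale / flops


baseline = sht_seconds_per_image()                          # ~0.5 hour
improved = sht_seconds_per_image(n_masks=5e5 / 2 * 2 / 3,   # half the m, l <= 800
                                 scale=2)                   # two-cycle MAC
pox = sht_dot([[1.0, 2.0], [3.0, 4.0]], {(0, 0): [1.0, 1.0], (1, 1): [0.5, 2.0]})
```

`baseline` comes out at 2000 s (≈ 0.5 hour). Applying the reductions considered for improving the SHT (scale factor 2 with a multiply-accumulate, half the m values, and about 2/3 of the masks for l up to 800) gives `improved` ≈ 333 s, roughly 6 minutes per image.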
ab2424rCPU h r=s0  orrGrGNetworkses1Zs05  1d1dCMemoryzs  eds: ns  e xxsks75  ut Memoryma/ms  /w/wda s3  ate Memorycom}qs:@  }{}{CPU}s  }}r ss05  ssMemory*qs:@  *{*{CPU%os  yyps:@  ozzCPUtY . ...:Y . GG...w;'͉ MI w;"Figure 2: Server and Compute NodesHH 4D/ 4  // P/2    P/Pa/2   mo/a/2    s/a)bN5  s:@aM6 {_aL6 ETcV ɹ QlETlCPUUT UT`Disk Storage Requirements Vp UR UTpLocality of Memory References p UP UT`*Other Likely Algorithms on Full Disk Data `hm UN UT`gA -Other Likely Processing on Partial Disk Data i#`ichRings and Trumpets Analysis ;`j,Measure subsurface flows in patches on Sun. , S`kO3D FFT of remapped data NtNlog2(NtN) processing steps. bsek`l..2Parallelized by subdividing images after 2D FFTs. `mCorrelation Tracking  ndAfter filtering out oscillations (for say a 1-day observing run), compute on pairs of images, e.g., @n AB; BC; CD. `oHankel function Decomposition `p)Measure power flow into/out of sunspots. e qg,ZCompute Hankel function decomposition on remapped patches on individual images, then take @qraFFT temporal of coefficients. 1UL UT`f References aliI`en2[1] Milford, APU Science Primitives, SOI-TN-026.  :86 D  @gOtpinSHT al<ܨ;=RiBtsctimes[char[O],id[over[indexes[0,1,char[N],string["images"]],indexes[0,1,char[N],string["nodes"]]]]]O3Dor<?og8aaFtequal[indexes[0,1,char[N],string["masks"]],times[cross[over[num[1,"1"],num[2,"2"]],over[num[2,"2"],num[3,"3"]]],indexes[0,1,char[N],string["masks"]]]]z>!ܨ=osV;;1-jqr?rs 2,<<*OP *=!` ebr=1~POQ e=1~>!iomp UU`d Detrend ibvo|w1~,QPR o|w1~,?!UL`esOrder b+=,RQS ie+=,@!` 2*NPIX b8,,SRT 8,,A!`ts 3*NPIX O],bo87(0,TS ag87(0,B!]]` 2*NPIX 8abF"6g,U^e ],"6g,C!"]," `"]Compute ]]cnO6+=VW ]]O6+=D!V ` rbzs'<0WVX zs'<0E! ` !b,XWY e,F!~ ` !b ίw(0YXZ Deίw(0G!|w ` b EO7,ZY[ EO7,H!ieh b <'<0,[Z\ <'<0,I!` b d9I,\[] d9I,J!hag b 0EO7,]\^ EO7,K!Fh ,bg*^]U *L!]]6`W c=1~_ =1~M!( rsfX co|w1~` o|w1~N!(!fY c+=a +=O!( !wfZ c8,b 8,P!( f[ c87(0c 87(0Q!( f \c"6g:,d "6g:,R! `Slow hO6+=eUf O6+=S! 0 ` ^czs'<0feg zs'<0T! , ` cU,gfh ,U! 
[Table: costs of the Detrend and SHT steps (columns include Order, Compute, Scale, Time, Server)]
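For Correlation Tracking, the pairwise step (AB, BC, CD, ...) can be sketched as an FFT-based cross-correlation of consecutive images, taking the correlation peak as the displacement. This is an illustrative assumption: the sketch returns one integer, whole-image (circular) shift per pair, whereas practical correlation tracking typically works on small windows and refines the peak to subpixel accuracy. The oscillations are assumed to have been filtered out already, and `pair_shift` and `track` are hypothetical names.

```python
from __future__ import annotations

import numpy as np


def pair_shift(a: np.ndarray, b: np.ndarray) -> tuple[int, int]:
    """Integer (dy, dx) that best maps image a onto image b (circular shifts)."""
    a = a - a.mean()
    b = b - b.mean()
    # Circular cross-correlation via the FFT; its peak sits at the shift.
    corr = np.fft.ifft2(np.conj(np.fft.fft2(a)) * np.fft.fft2(b)).real
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    ny, nx = a.shape
    # Indices in the upper half of each axis correspond to negative shifts.
    if dy > ny // 2:
        dy -= ny
    if dx > nx // 2:
        dx -= nx
    return int(dy), int(dx)


def track(images: list[np.ndarray]) -> list[tuple[int, int]]:
    """Shifts for consecutive pairs: (A, B), (B, C), (C, D), ..."""
    return [pair_shift(p, q) for p, q in zip(images, images[1:])]
```

Only consecutive pairs are correlated, so each pair is independent and the pairs can be spread across nodes.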
[Figure 3: Shared Memory Multiprocessor (Shared Memory, I/O, CPU, Local Memory)]

Steps

A missing image is filled in from the nearest good images before and after it:

$$\mathrm{Image}(t_{\mathrm{missing}}) = \frac{t_{\mathrm{missing}}}{t_{\mathrm{good}+}-t_{\mathrm{good}-}}\,\mathrm{Image}(t_{\mathrm{good}+}) + \left(1-\frac{t_{\mathrm{missing}}}{t_{\mathrm{good}+}-t_{\mathrm{good}-}}\right)\mathrm{Image}(t_{\mathrm{good}-})$$

(this is linear interpolation when $t_{\mathrm{missing}}$ is measured from $t_{\mathrm{good}-}$).

Advantage: clear, simple algorithm.
Disadvantage: probably impossible to implement efficiently on either a single or a parallel computer.

Method 2: Batch together a group of (l, m) values so that, after all $N_{\mathrm{obs}}$ images have been analysed, there are at most a few input files for each set of (l, m). For example, group all m for a particular l.

Temporal FFT

The data are stored in a number of files, each holding several (l, m) for all time (or perhaps divided into shorter time periods, e.g., 1-day periods).

Each FFT requires $O(N_t \log_2 N_t)$ operations, where $N_t$ is the number of time samples ($N_t = N_{\mathrm{obs}}$ for the full run), and there are $N_{\mathrm{modes}}$ of them to do altogether. So the total calculation required is $O(N_{\mathrm{modes}} N_{\mathrm{obs}} \log_2 N_{\mathrm{obs}})$.

With each calculation $O(N_{\mathrm{obs}} \log_2 N_{\mathrm{obs}})$, if the data are evenly divided among the processors' local disks, with all t for a range of (l, m) on each processor, then the computing time per node will be $O(N_{\mathrm{modes}} N_{\mathrm{obs}} \log_2 N_{\mathrm{obs}} / N_{\mathrm{nodes}})$, and the I/O for each node will be …

Note that the FFT can have very poor locality of reference in memory and may cause many page-frame misses.

SHT Alternative Implementation

The masks can be recomputed on the fly, starting from seed values, for greater efficiency.
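One common way to recompute associated Legendre masks on the fly from seeds is the standard three-term recurrence in l at fixed m, started from the two seed values $\tilde P_m^m$ and $\tilde P_{m+1}^m$. The sketch assumes orthonormal functions (the integral of $\tilde P_l^m(x)^2$ over $[-1, 1]$ equals 1) with the Condon-Shortley phase, evaluated at points $x = \cos\theta$. The masks in the text may use a different normalization or fold in quadrature weights, and `legendre_masks` is a hypothetical name.

```python
import numpy as np


def legendre_masks(m: int, l_max: int, x) -> np.ndarray:
    """Orthonormal associated Legendre functions P~_l^m(x) for l = m..l_max.

    Row i of the result holds l = m + i, evaluated at every point of x.
    """
    x = np.atleast_1d(np.asarray(x, dtype=float))
    s = np.sqrt(1.0 - x * x)                  # sin(theta) for x = cos(theta)
    out = np.zeros((l_max - m + 1,) + x.shape)

    # Seed 1: P~_m^m, built one factor at a time to avoid factorial overflow.
    p = np.full(x.shape, np.sqrt(0.5))
    for k in range(1, m + 1):
        p = -np.sqrt((2 * k - 1) / (2 * k)) * s * p
    out[0] = np.sqrt(2 * m + 1) * p
    if l_max == m:
        return out

    # Seed 2: P~_{m+1}^m = sqrt(2m + 3) * x * P~_m^m.
    out[1] = np.sqrt(2 * m + 3) * x * out[0]

    # Recurrence: P~_l = a_l * (x * P~_{l-1} - P~_{l-2} / a_{l-1}),
    # with a_l = sqrt((4 l^2 - 1) / (l^2 - m^2)).
    for l in range(m + 2, l_max + 1):
        a_l = np.sqrt((4 * l * l - 1) / (l * l - m * m))
        a_prev = np.sqrt((4 * (l - 1) ** 2 - 1) / ((l - 1) ** 2 - m * m))
        out[l - m] = a_l * (x * out[l - m - 1] - out[l - m - 2] / a_prev)
    return out
```

Because $\tilde P_l^m(-x) = (-1)^{l+m}\tilde P_l^m(x)$, only one hemisphere of each mask needs to be stored, which is presumably why the text keeps symmetric or antisymmetric masks. For m near 1000 the seed $\tilde P_m^m$ underflows near the poles, so a production version would need some form of rescaling there; this sketch does not do that.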
This could be used in conjunction with processing multiple images simultaneously, e.g., processing 10 images at once.

End to End Processing in One Step

The processing up to the transpose can be efficiently serialized into a single program to reduce the disk I/O requirements of the system, because it avoids re-reading intermediate results. In addition, depending on which data products are finally desired, it can avoid writing various intermediate outputs.
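A minimal sketch of the one-step idea, assuming each processing stage up to the transpose can be written as an in-memory function on a stack of images. The stage functions here (`detrend`, `row_fft`) are hypothetical stand-ins for the real remap, detrend, FFT and SHT steps; batching also illustrates processing several images at once, as suggested for the on-the-fly SHT masks.

```python
from __future__ import annotations

from typing import Callable, Iterable, Iterator

import numpy as np

Stage = Callable[[np.ndarray], np.ndarray]


def batches(images: Iterable[np.ndarray], size: int) -> Iterator[np.ndarray]:
    """Group images into stacks of up to `size` images (axis 0 = image)."""
    batch = []
    for img in images:
        batch.append(img)
        if len(batch) == size:
            yield np.stack(batch)
            batch = []
    if batch:
        yield np.stack(batch)


def run_end_to_end(images: Iterable[np.ndarray], stages: list[Stage],
                   write: Callable[[int, np.ndarray], None],
                   batch_size: int = 10) -> None:
    """Run every stage in memory and write only the final product per image."""
    index = 0
    for batch in batches(images, batch_size):
        for stage in stages:        # intermediate results never touch disk
            batch = stage(batch)
        for product in batch:       # one write per input image
            write(index, product)
            index += 1


# Hypothetical example stages on a (batch, rows, cols) stack.
def detrend(b: np.ndarray) -> np.ndarray:
    return b - b.mean(axis=(1, 2), keepdims=True)


def row_fft(b: np.ndarray) -> np.ndarray:
    return np.fft.rfft(b, axis=2)
```

For example, `run_end_to_end(images, [detrend, row_fft], store.__setitem__)` with a dict `store` keeps only the row-FFT'd images; swapping in a writer that saves to disk would still write nothing but the final products.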