End of an(other) Era: EMIS Aggregation Subsystem

For the first time in twenty years, the SSDT has not produced a version of the EMIS system on OpenVMS. In particular, we have not produced a version of the "EMIS Aggs" (formally, EMIS Aggregation Subsystem).  Tomorrow, the SSDT will be holding a wake for the system at NWOCA.  Normally, we'd have a retirement, but we've retired the thing for the last two years, and it kept coming back, so this time it's a wake.  Things don't come back from wakes...

So as not to exclude others who have been affected (or afflicted) by the EMSAGGs, this blog post will summarize some of the history and impact of the aggs on the SSDT and others. I'll also attach a few historical documents, which we intend to ceremonially shred tomorrow (since they won't let us burn them on the premises). Print a copy, if you like, and subject it to whatever treatment you feel is appropriate.

Early History

From the time that SB140 was passed, which created the EMIS system in law, ODE and the SSDT had about two years to go live. This included the time for ODE to work out the specs (EMIS Guides). We had to write a full ETL (extract, transform and load) process, validation routines and data maintenance. But that was, largely, the "easy" part. We also had to write "aggregations" which would summarize the data to ODE's requirements and allow electronic network submission to ODE. Keep in mind this was twenty years ago, when disk space was not cheap and network transmission speeds were not like they are today (your phone today probably has better connectivity). So summarizing the considerable amount of data was important.

To provide the SSDT the specifications for the Aggregations, Paul Dimel and Jim Daubenmire (both with ODE) and I developed a private language between us which became known as the "Agg Pseudo-Code". It looks a bit like pseudo-code you might have seen in school, but it's imprecise, ambiguous and lacking a lot of detail unless you have intimate knowledge of how the AGG software worked. But as it was only a communication document between a small group at ODE and the SSDT, it worked well enough to allow us to (usually) accurately produce the desired validation and output. I've attached a copy (EMIS_AGS.TXT) from FY95. This was from when the aggs were comparatively simple.

Dates and Statistics

The first lines of code for the AGS were committed to CMS (Code Management System) on 23-Jun-1991 (by yours truly) and the first release on VAX was 21-Aug-1991 (release notes attached). At its peak (FY2010), the AGG subsystem consisted of approximately 56K SLOC (source lines of code) and 350K CLOC (compiled lines of code). CLOC is much higher because a large part of the validations and ETL was generated code (using a tool called Netron/CAP) instead of manually written code. CAP is not really a code generator, but provides a tool to write code generators. So we wrote generators in CAP that generated much of the EMIS system.

These numbers cover just the core agg routines. They don't include the dozens of supporting subroutines and re-usable libraries that formed the basis of the EMIS system. The entire EMIS system contained 1500+ source files, 310K SLOC and about 1 million CLOC (not including hundreds of pages of documentation, nor the web application and SOAP service we developed in this century).

Architecture and Design

Using modern terminology, the Aggs are conceptually similar to a MapReduce process (except that there was no parallel processing or distributed clusters, i.e. no "partitioning"). EMSAGG1 performed the data gathering/mapping functions. It consisted of approximately 25 subroutines (one for each aggregated "major" type). Each subroutine was responsible for reading each source record (e.g. StudentDemographics) and determining that record's contribution to the aggregate. For example, a single student record would generate "HeadCount = 1, FTE = 1.00, FundingFTE = 1.00", etc. These values were written to the intermediate file with key values (IRN, BuildingIRN, Gender, GradeLevel, etc.) to match the "keys" in the aggregate spec.

But we quickly realized that, writing the data this way, the intermediate file would soon grow larger than the input "Chapter 5" records and it would take hours, or days, just to sort it on the systems of the day. So instead, EMSAGG1 wrote a single record per major AggType by "Major Key" (DistrictIRN+BuildingIRN) and the rest of the record contained a "map" (key,value pairs) for the minor key (Gender, GradeLevel) and values.
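
For readers who think better in code, here is a rough Python sketch of what such an intermediate record might look like. It is not the original COBOL and not the actual file layout: the function name pack_record, the dictionary shape and the sample IRN values are all made up for illustration, and it assumes one intermediate record is produced per source record per agg type. Only the field names come from the description above.

```python
def pack_record(agg_type, rec):
    """Build one intermediate record (hypothetical layout):
    agg type and major key up front, then a map of minor key -> values."""
    major_key = (rec["DistrictIRN"], rec["BuildingIRN"])
    minor_key = (rec["Gender"], rec["GradeLevel"])
    values = {"HeadCount": 1, "FTE": rec["FTE"], "FundingFTE": rec["FundingFTE"]}
    return {"agg_type": agg_type, "major_key": major_key, "map": {minor_key: values}}


# Sample student record; the IRNs are invented placeholders.
student = {
    "DistrictIRN": "000001",
    "BuildingIRN": "000010",
    "Gender": "F",
    "GradeLevel": "05",
    "FTE": 1.00,
    "FundingFTE": 1.00,
}
packed = pack_record("STUDENT", student)
print(packed)
```

The point of the layout is that the sort only ever has to look at the agg type and major key, while everything else rides along inside the record.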

The intermediate file was then sorted by Agg type and major key.  EMSAGG2 (the "reduce" phase) then processed this file with a similar set of subroutines (one for each AGG1 subroutine).  The subroutines would control break on major key and summarize in memory the minor keys and values.  Today, we'd think nothing of using dozens of megabytes of memory to summarize data like this.  But in the early 90's it seemed ridiculous to a COBOL programmer to "waste" so much memory (most of us came from a PDP or HP3000 background, where such a technique was impossible). But OpenVMS made it possible to use large memory tables for summarizing the data and allowed us to hit our performance goals.
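
The sort and control-break step can be sketched the same way. Again, this is a minimal Python illustration under assumptions, not the EMSAGG2 code: reduce_records, the record shape and the sample data are hypothetical, and Python's sorted and itertools.groupby stand in for the external sort and the control-break logic.

```python
from itertools import groupby
from operator import itemgetter


def reduce_records(records):
    """Sort by (agg type, major key), then total the minor-key values
    for each group in an in-memory table."""
    sort_key = itemgetter("agg_type", "major_key")
    results = {}
    # groupby yields a new group each time the sort key changes,
    # which plays the role of the control break on the major key.
    for group_key, group in groupby(sorted(records, key=sort_key), key=sort_key):
        totals = {}  # in-memory table for this one major key
        for rec in group:
            for minor_key, values in rec["map"].items():
                bucket = totals.setdefault(minor_key, {})
                for name, amount in values.items():
                    bucket[name] = bucket.get(name, 0) + amount
        results[group_key] = totals
    return results


# Invented sample data: two records for one building, one for another.
records = [
    {"agg_type": "STUDENT", "major_key": ("000001", "000010"),
     "map": {("F", "05"): {"HeadCount": 1, "FTE": 1.0}}},
    {"agg_type": "STUDENT", "major_key": ("000001", "000020"),
     "map": {("M", "05"): {"HeadCount": 1, "FTE": 1.0}}},
    {"agg_type": "STUDENT", "major_key": ("000001", "000010"),
     "map": {("F", "05"): {"HeadCount": 1, "FTE": 0.5}}},
]
summary = reduce_records(records)
```

Because the table only has to hold the minor keys for one major key at a time, memory use is bounded by the busiest building rather than by the whole file.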

The design of the aggs, although somewhat forced on us by the performance needs, turned out to be quite elegant and flexible.  It allowed the SSDT to react fairly quickly when the rules changed unexpectedly or when ODE was "late" with the pseudo-code.  The modular nature of it made it possible to use different subroutines for different reporting periods and encapsulate period specific validations/calculations.


So there's a brief history and summary of the EMIS Aggregation Subsystem. There's more I could say and other people I could mention, but it's late and I'm getting bored and you probably stopped reading...

None of the above should be mistaken for nostalgia. Although I'm proud of much of the work we did with this system, I also remember hours in the symbolic debugger trying to find why one particular student test score didn't land in the right bucket, only to find an extra "period" had terminated an IF statement. The tools we use today are far more powerful, more productive and less error prone than anything we had back then. I won't miss those days.

Tomorrow the SSDT will be celebrating the end of the EMIS Aggregation Subsystem. Perhaps I'll post a few pictures of the people involved in the project (and the joy on their faces as they shred bits of history). If you take time to perform a similar ceremony, feel free to send me a photo and I'll add it to ours. You're also welcome to post comments here if you have anything to add or recollections of your own.

-Dave "Yes, I wrote the software, but EMIS wasn't my fault" Smith

emis_ags.txt (65.88 KB)
rel910828.txt (27.62 KB)