The case of the not-so-electronic application

There was an article in the New York Times today about the vexation students experience in using the online version of the Common Application:

So it was frustrating for Max Ladow, 17, a senior at the Riverdale Country School in the Bronx, to discover this fall that he could not get his short essay answers to fit in the allotted 150 words on the electronic version of the application, even when he was certain he was under the limit.

When he would follow the program’s instructions to execute a “print preview” of his answers — which would show him the actual version that an admissions officer would see, as opposed to the raw work-in-progress on his screen — his responses were invariably cut off at the margin, in midsentence or even midword.

I remember encountering this problem when applying to undergraduate programs a number of years ago, and even then I recall being astounded by the approach taken by the Common Application—the online application is actually only a web front-end which takes the applicant’s data and stuffs it into a tagged PDF version of the paper application. It’s like using TurboTax or similar software to prepare your tax return, then printing the forms out and mailing them in; there’s nothing inherently electronic about the process. You could have filled the forms out by hand, or with a typewriter, and gotten nearly the same result. Because the web front-end knows nothing about the PDF-generation process, it only enforces the word-count limits which apply to the paper version of the application. But there’s another, more insidious limit—that of space on the page. As the article explains, it’s easy to be well within the word-count limit, yet have run out of space on the printed page. In a real word processor, you might play with the font, or the font size, or the margins. But the Common Application provides no such features.

What is both stunning and alarming, though, is the response that the New York Times received when they asked the Common Application about the problem:

Asked why the problem had not been fixed, Mr. Killion said, “Believe me, if there’s a way to do it, we’d do it. Maybe there’s a way out there we don’t know about.”

It is inconceivable that this is considered an acceptable response in 2010. The most appropriate response, of course, is to liberate the electronic version of the application from the constraints of the paper form. There is no reason that an applicant’s information cannot be sent to schools in a purely electronic format. There is already an EDI transaction set, defined by ASC X12, for admissions applications—number 189, “Application for Admission to Educational Institutions”. Considering that the 130 and 131 transaction sets are already routinely used by many schools to exchange transcripts electronically, this does not seem so unreasonable.

Given that I do not have a copy of the ASC X12 EDI transaction sets, I can’t say how well the 189 transaction set would do for the data the Common Application collects. Of course, there is also a general movement away from EDI, in favor of XML-based formats, so I suspect that the new XML Admission Application developed by PESC might be a better choice.

In short, there are a multitude of solutions available to the Common Application—from improving the process which generates PDFs so that text is scaled to fit the available space, to moving to a purely electronic, XML-based process which casts off the vestiges of the paper form. Claiming that the problem cannot be solved, while it continues to unfairly penalize students who seek only to have their application information conveyed with fidelity, is not an acceptable response.

A novel method for inserting elements in text with lxml

lxml is a fantastic library for working with XML in Python applications, and I’ve used it in a number of projects. However, I recently ran into an interesting problem: how to replace some text in an XML document with a new XML element. Obviously, you can’t just replace the text with the literal tags; lxml will automatically escape the tags, as would be expected. What you must do, instead, is insert an Element instance in the middle of the string—but it’s not that simple. In a DOM-based implementation, it would be relatively easy to truncate the current text node, then append the new element, and finally append a new text node with the remaining text. But lxml doesn’t use text nodes; instead it uses text and tail properties to hold text content. This complicates the matter substantially; the issue has been raised before, and the simple answer is that there’s no perfectly clean way to do it using only the lxml API.
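To make the text/tail model concrete, here is a minimal sketch (the helper name is mine) of the naive approach, handling only a match inside an element’s leading text:

```python
from lxml import etree

def wrap_in_text(parent, target, tag):
    """Split parent.text around `target` and insert a new element at
    index 0.  This handles only the simplest case: it breaks down when
    the parent already has children, or when the match falls inside a
    child's tail text."""
    before, sep, after = parent.text.partition(target)
    if not sep:
        return  # no match
    el = etree.Element(tag)
    el.text = target
    el.tail = after      # the tail plays the role of a trailing text node
    parent.text = before
    parent.insert(0, el)

p = etree.fromstring('<p>see the manual for details</p>')
wrap_in_text(p, 'manual', 'em')
print(etree.tostring(p, encoding='unicode'))
# <p>see the <em>manual</em> for details</p>
```

The tail attribute is what stands in for the DOM’s trailing text node; once the parent has other children, the index arithmetic and tail bookkeeping get much messier, which is exactly the fragility described below.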

I had devised a solution which worked by shuffling around content from the text and tail attributes, and creating new Elements and appending them to the parent Element, but it was fragile and not terribly clean. If there were any elements already present in the text being processed, then the code failed miserably. However, while poking around in the lxml documentation, I found that lxml offered support for SAX, an event-driven XML API. Because lxml can both generate a stream of SAX events from an lxml tree, and generate an lxml tree from a stream of SAX events, my idea was to couple lxml to itself using SAX, but interpose a filter which would perform the necessary substitutions. Because of the nature of SAX, it would be simple to insert an element in character data—just output the character data before the new element, then emit events to create the new element, then the trailing characters. I found that some additional work was necessary to get elements to have the namespace prefixes I wanted, but other than that, the resulting code is a somewhat more elegant and robust solution, although as noted before, a DOM-based API is the easiest way to do this type of substitution. However, it’s not easy to go back and forth between lxml and a DOM implementation, so within the lxml API, this is probably as good as it gets.
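A sketch of the SAX coupling described above (class and variable names are mine; namespace events are relayed untouched, and the target string is assumed to arrive within a single characters() event):

```python
from xml.sax.handler import ContentHandler

import lxml.sax
from lxml import etree

class SubstitutionFilter(ContentHandler):
    """Relay SAX events to a downstream handler, wrapping occurrences
    of `target` in a new element as they pass through characters()."""

    def __init__(self, downstream, target, tag):
        self.downstream = downstream
        self.target = target
        self.tag = (None, tag)  # (namespace URI, local name)

    def startDocument(self):
        self.downstream.startDocument()

    def endDocument(self):
        self.downstream.endDocument()

    def startPrefixMapping(self, prefix, uri):
        self.downstream.startPrefixMapping(prefix, uri)

    def endPrefixMapping(self, prefix):
        self.downstream.endPrefixMapping(prefix)

    def startElementNS(self, name, qname, attributes=None):
        self.downstream.startElementNS(name, qname, attributes)

    def endElementNS(self, name, qname):
        self.downstream.endElementNS(name, qname)

    def characters(self, data):
        # Emit the text before each match, then events for the new
        # element holding the matched text; no text/tail bookkeeping.
        parts = data.split(self.target)
        self.downstream.characters(parts[0])
        for part in parts[1:]:
            self.downstream.startElementNS(self.tag, self.tag[1], None)
            self.downstream.characters(self.target)
            self.downstream.endElementNS(self.tag, self.tag[1])
            self.downstream.characters(part)

root = etree.fromstring('<p>see the manual, then the manual index</p>')
handler = lxml.sax.ElementTreeContentHandler()
lxml.sax.saxify(root, SubstitutionFilter(handler, 'manual', 'em'))
print(etree.tostring(handler.etree.getroot(), encoding='unicode'))
```

Unlike the text/tail shuffle, this handles multiple matches and nested markup for free, since every piece of content simply flows through the filter as an event.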

XML misconceptions harm interoperability

I have noticed that certain misconceptions in using XML seem to come up over and over again, often with the result that some piece of software claims to support XML, but in reality it has certain idiosyncrasies that harm interoperability. The most egregious of these issues revolve around the use of namespaces and namespace prefixes. It’s not just novice developers, either; recently I was reading the developer documentation for Facebook Chat (to confirm my impression that it had an XMPP interface) when I came across the following: “The XML parser does not yet fully handle XML namespaces. Please stick to the same style as the examples in XMPP RFCs 3920 and 3921 when using XML namespaces.” Similarly, I was warned by the W3C Feed Validation Service when I submitted an Atom feed which used namespace prefixes for Atom and XHTML. So, I’d like to point out a few common XML misconceptions which harm interoperability:

  • XML namespaces do not have to be URLs, and XML namespaces which are URLs do not necessarily have to refer to resources that exist or that are meaningful. XML namespaces are URIs, and that by definition includes URNs.
  • XML namespace prefixes are arbitrary; software must not expect them to have particular values.
  • The default namespace is not as attractive as it seems; it may save typing, but overuse of the default namespace (or defining the default namespace to be different in different parts of a document) will just lead to confusion. Explicit namespace prefixes lead to clarity, particularly where many namespaces are involved.
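The second point is easy to demonstrate: a namespace-aware parser reduces every element name to its (namespace URI, local name) pair, so two documents that differ only in prefix are identical. A quick check with Python’s standard-library ElementTree:

```python
import xml.etree.ElementTree as ET

# The same Atom document, serialized once with an explicit prefix and
# once with a default namespace declaration.
a = ET.fromstring('<atom:feed xmlns:atom="http://www.w3.org/2005/Atom">'
                  '<atom:title>t</atom:title></atom:feed>')
b = ET.fromstring('<feed xmlns="http://www.w3.org/2005/Atom">'
                  '<title>t</title></feed>')

print(a.tag)           # {http://www.w3.org/2005/Atom}feed
print(a.tag == b.tag)  # True: the prefix never mattered
print(a.find('{http://www.w3.org/2005/Atom}title').text)  # t
```

Any consumer that matches on the literal prefix (looking for the string “atom:feed”, say) would treat these two equivalent documents differently, which is precisely the interoperability failure described above.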

A tool for evaluating academic programs

While I was studying at McGill, I wrote a small Python script which read in an XML file which contained a description of university courses and programs; the script verified that each course’s prerequisites and corequisites were satisfied, and that the requirements of the program were satisfied. The script would then produce an HTML file listing the courses taken each semester (highlighting those with unmet prerequisites or corequisites), as well as the met and unmet program requirements. The script also used dot (part of the GraphViz package) to produce a diagram showing the interactions between course prerequisites. It is my understanding that software of this type is actually of commercial value, as evidenced by such things as the CAPP module in SunGard (formerly SCT) Banner, and its Degree Audit feature, which is similar to what has been described here. Other student information systems have similar features, but I have found that they all have certain shortcomings—one of the more common shortcomings is an inability to input course selections for future terms and perform a “what-if” analysis, making it possible to formulate a multi-semester plan, rather than simply evaluating the courses taken thus far against degree requirements.
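The core of the prerequisite check is straightforward; a hypothetical sketch (the course codes and data layout here are illustrative, not the script’s actual XML format):

```python
# Each course is (code, semester taken, prerequisites).
courses = [
    ("MATH 140", 1, []),
    ("MATH 141", 2, ["MATH 140"]),
    ("MATH 222", 3, ["MATH 141"]),
    ("COMP 251", 3, ["MATH 141", "COMP 250"]),  # COMP 250 never taken
]

taken_by = {code: sem for code, sem, _ in courses}

def unmet_prerequisites(code, semester, prereqs):
    """A prerequisite is met only if it was completed in an *earlier*
    semester; a course taken concurrently would be a corequisite."""
    return [p for p in prereqs
            if p not in taken_by or taken_by[p] >= semester]

for code, sem, prereqs in courses:
    missing = unmet_prerequisites(code, sem, prereqs)
    if missing:
        print(f"{code}: unmet prerequisites {missing}")
```

A “what-if” analysis then amounts to appending hypothetical future (code, semester) entries to the same list and re-running the check, which is why supporting future terms costs so little once the evaluation is data-driven.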

I have recently dusted the code off and posted it to GitHub, in the hope that someone may find it useful. I’ve also made some improvements; most notably, dot is now used to generate SVG output which is directly included in the XHTML generated by the script. Previously, dot produced a PNG file and imagemap which were used together. The new solution is cleaner, as it allows the output to be contained in a single file. An example of the output produced can be found here. While I cannot say this with any certainty, I believe this type of graphical output is unique among degree audit tools.

A word of explanation concerning the name is probably in order as well; it refers to a document issued to incoming Engineering students at McGill which laid out, in great detail, and with little room for variation, the courses one would take over the coming eight semesters. This document was thus referred to as one’s ‘life’. Were you to deviate from it, you would find yourself in uncharted, shark-infested waters, and this software was designed to alleviate some of that.

I do not know if I will continue working on the software beyond what I have already done; if I do, then my priorities will likely be to make the necessary changes so that the degree evaluation information which is currently printed to stdout is also included in the HTML report, as well as improving the experimental Tkinter interface (which is also included in the Git repository).

Notes on the design of student information systems

What would a student information system designed around open XML formats rather than proprietary and arcane database tables look like? Consider, for example, using xCal (an XML format for iCalendar data) to store course schedules, so every student could get a personalized feed to subscribe to in Google Calendar or iCal or the PIM tool of their choice, with their schedule of classes automatically kept up-to-date. Similarly, contact information for professors and students can be stored using an XML representation of vCard, making it trivial for students and professors to add each other to their address books. Students could automatically get a vCard file every semester with their professors’ contact information. More than simple conveniences like these, though, there is real transformational power in the use of open formats. Regional university consortia could offer students a “union catalog” of sorts, like libraries have offered for many years. For libraries, the enabling technologies were the MARC bibliographic data standard, and the Z39.50 protocol for exchanging MARC data. An open standard for student information systems could do the same thing. There are EDI transaction sets for exchanging transcript data, but from what I’ve seen they are not universally used. What’s more, transcript data is only a tiny part of the problem.
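As a sketch of the xCal idea, here is a hypothetical builder for a single class meeting, assuming the namespace from the xCal specification that became RFC 6321 (the helper names and course data are mine):

```python
import xml.etree.ElementTree as ET

XCAL = "urn:ietf:params:xml:ns:icalendar-2.0"  # xCal (RFC 6321) namespace
ET.register_namespace("xc", XCAL)

def q(name):
    """Qualify a local name with the xCal namespace, Clark-notation style."""
    return f"{{{XCAL}}}{name}"

def course_event(summary, location, dtstart):
    """Build a minimal xCal vevent for one class meeting."""
    vevent = ET.Element(q("vevent"))
    props = ET.SubElement(vevent, q("properties"))
    for prop, value in (("summary", summary), ("location", location)):
        p = ET.SubElement(props, q(prop))
        ET.SubElement(p, q("text")).text = value
    p = ET.SubElement(props, q("dtstart"))
    ET.SubElement(p, q("date-time")).text = dtstart
    return vevent

ev = course_event("COMP 250 Lecture", "Leacock 132", "2010-09-07T10:35:00")
print(ET.tostring(ev, encoding="unicode"))
```

A student’s personalized feed would just be a vcalendar wrapping one such vevent per meeting, regenerated whenever the registration data changes; any calendar client that speaks iCalendar can consume it after a mechanical xCal-to-ics conversion.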

There is also the related, but substantial issue of data ownership—even though a university unquestionably owns the data in its PeopleSoft or Banner installation, how much can it do with that data without PeopleSoft or Banner installed? These proprietary formats are inextricably linked to the systems that generated them, ensuring that universities are forever tied to those systems. Migration is a difficult and expensive proposition, a waste of money that could otherwise go to teaching and research. There is also the thorny issue of universities with home-grown student information systems, a common practice particularly at large, old institutions. Back in the day when universities had “computer centers”, it was commonplace to develop these kinds of applications in-house. Many universities have moved on from their custom solutions, though, and in doing so they have lost fundamental control over their data—it is now stored in proprietary formats of the vendor’s choosing, rather than formats of their own design. Providing an open data format would make it possible for universities to migrate data stored in their existing custom student information system to an open format that they control—and when the time comes to upgrade, they can choose any vendor’s product, so long as it supports this open format.

I suspect that one of the major stumbling blocks will be that modern student information systems are in fact no longer standalone products; they have been integrated into gigantic enterprise resource planning systems like PeopleSoft. A system that deals solely with students and course catalogs and course registration will appear to be lacking a huge swath of features. Most of these, though, will be things that, at first glance, seem to have little to do with actually teaching courses: human resources, accounting, regulatory compliance, etc. The trend in modern ERP systems is to tightly couple all of these areas together in what essentially becomes a single package—and, in the case of an educational institution, bolt on a student information system. Even those systems not designed for general-purpose ERP (like SCT Banner) still grow to fulfill traditional ERP roles. A university which fully adopts Banner would end up using it not only as a student information system but also for financial management, human resources (not just for professors and academic staff, but all staff), financial aid, fundraising, and more.

For open formats to succeed, the prevailing design philosophy must shift towards the use of discrete modules which exchange information using open formats. Invariably the modules will need to communicate with each other; an accounting system must know how many credits a student is taking, and the rate per credit, before it can generate a bill (until we finally wise up and make education free). This should be simple to accomplish, though, as long as everything uses well-documented formats for data interchange. This is perhaps best described as a return to the Unix philosophy, the notion of writing software to do just one thing, but do it well.

Moreover, there are already many ancillary systems which are almost always external to a student information system, but which depend on its data: e-learning systems like Moodle and Blackboard, integrated library systems like Aleph (and quite a few more), and access control systems (to restrict access to a lab to students enrolled in particular courses, for example). All of these require data feeds from the SIS, and as such most require customization for every SIS out there. An open, neutral interchange format would eliminate that per-platform customization: vendors of products designed to integrate with student information systems could build them to ingest data in just one format. Even seemingly simple tasks, like generating course catalogs for print and for the web, can be made easier with the use of XSLT and XSL-FO; (shockingly) at many institutions course catalogs are still prepared by hand, although the printed course catalog is rapidly dying.
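For instance, a trivial XSLT transform can turn course data into web-ready markup; a sketch with an invented catalog schema, run here with lxml:

```python
from lxml import etree

# A stylesheet mapping a hypothetical <catalog> format to an HTML list.
xslt = etree.XML('''\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/catalog">
    <ul>
      <xsl:apply-templates select="course"/>
    </ul>
  </xsl:template>
  <xsl:template match="course">
    <li><xsl:value-of select="@code"/>: <xsl:value-of select="title"/></li>
  </xsl:template>
</xsl:stylesheet>''')

catalog = etree.XML('<catalog><course code="COMP 250">'
                    '<title>Introduction to Computer Science</title>'
                    '</course></catalog>')

transform = etree.XSLT(xslt)
print(str(transform(catalog)))
```

The same source document, run through an XSL-FO stylesheet instead, yields the print catalog; the data is authored once and every output format is just another transform.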

I would close with this: putting data into an XML-based format leads to gains in transparency and simplicity in infrastructure, two key values which many student information systems (and ERP packages in general) are lacking.