A novel method for inserting elements in text with lxml
lxml is a fantastic library for working with XML in Python applications, and I've used it in a number of projects. However, I recently ran into an interesting problem: how to replace some text in an XML document with a new XML element. Obviously, you can't just replace the text with the literal tags; lxml will automatically escape them, as would be expected. What you must do instead is insert an Element instance in the middle of the string, but it's not that simple. In a DOM-based implementation, it would be relatively easy to truncate the current text node, then append the new element, and finally append a new text node with the remaining text. But lxml doesn't use text nodes; instead, each element has text and tail properties that hold its text content: text is the character data before the element's first child, and tail is the character data that follows the element's closing tag. This complicates the matter substantially; the issue has been raised before, and the simple answer is that there's no perfectly clean way to do it using only the lxml API.
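To make the text/tail model concrete, here is a minimal sketch of the manual approach for the simplest case only. The helper name insert_element_in_text is hypothetical (it is not part of lxml), and it assumes the text to wrap sits in the parent's own text property rather than in the tail of some child:

```python
from lxml import etree


def insert_element_in_text(parent, needle, tag):
    """Wrap the first occurrence of `needle` in parent.text in a new <tag>.

    Hypothetical helper for illustration: it only looks at parent.text,
    so text living in the tail of a child element is not handled.
    """
    text = parent.text or ""
    idx = text.find(needle)
    if idx == -1:
        return None
    new = etree.Element(tag)
    new.text = needle
    # Everything after the match becomes the new element's tail...
    new.tail = text[idx + len(needle):]
    # ...and everything before it stays in the parent's text.
    parent.text = text[:idx]
    # The new element must come before any existing children.
    parent.insert(0, new)
    return new


# Assigning markup as a string does not work: lxml escapes it.
escaped = etree.fromstring("<p/>")
escaped.text = "Hello <b>world</b>"
escaped_xml = etree.tostring(escaped).decode()

# Splitting the text around a real Element does work.
p = etree.fromstring("<p>Hello world, goodbye</p>")
insert_element_in_text(p, "world", "b")
result_xml = etree.tostring(p).decode()
```

Even this small case needs three separate assignments to keep the pieces in order, and covering matches inside tails would require a second code path.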
I had devised a solution which worked by shuffling content between the text and tail attributes, creating new Elements, and appending them to the parent Element, but it was fragile and not terribly clean. If any elements were already present in the text being processed, the code failed miserably. However, while poking around in the lxml documentation, I found that lxml offers support for SAX, an event-driven XML API. Because lxml can both generate a stream of SAX events from an lxml tree and build an lxml tree from a stream of SAX events, my idea was to couple lxml to itself using SAX, but interpose a filter which would perform the necessary substitutions. Because of the nature of SAX, it is simple to insert an element in character data: output the character data before the new element, then emit the events that create the new element, then output the trailing characters. I found that some additional work was necessary to get elements to have the namespace prefixes I wanted, but other than that, the resulting code is a somewhat more elegant and robust solution than the text/tail shuffling. As noted before, a DOM-based API is still the easiest way to do this type of substitution. However, it's not easy to go back and forth between lxml and a DOM implementation, so within the lxml API, this is probably as good as it gets.
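The idea can be sketched roughly as follows. This is not my original code, just a minimal illustration under some assumptions: the class name WrapTextHandler is made up, the new element has no namespace (so the prefix handling mentioned above is left out), and it relies on lxml.sax.saxify delivering each text or tail value in a single characters() call:

```python
from lxml import etree
from lxml.sax import ElementTreeContentHandler, saxify


class WrapTextHandler(ElementTreeContentHandler):
    """Rebuild a tree from SAX events, wrapping each `needle` in <tag>.

    Illustrative sketch: the inserted element carries no namespace
    and no attributes.
    """

    def __init__(self, needle, tag):
        super().__init__()
        if not needle:
            raise ValueError("needle must be a non-empty string")
        self._needle = needle
        self._tag = tag

    def characters(self, data):
        start = 0
        while True:
            idx = data.find(self._needle, start)
            if idx == -1:
                break
            # Character data before the match.
            if idx > start:
                super().characters(data[start:idx])
            # Events that create the new element around the match.
            self.startElementNS((None, self._tag), self._tag, {})
            super().characters(self._needle)
            self.endElementNS((None, self._tag), self._tag)
            start = idx + len(self._needle)
        # Trailing character data after the last match.
        if start < len(data):
            super().characters(data[start:])


source = etree.fromstring("<p>Hello world, <i>big world</i> out there</p>")
handler = WrapTextHandler("world", "b")
saxify(source, handler)  # the source tree drives the filtering handler
result = handler.etree.getroot()
result_xml = etree.tostring(result).decode()
```

Because the handler never has to decide whether a piece of character data belongs in text or in tail, text that already contains child elements is handled by the same code path as plain text.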