Wednesday, May 21, 2008

Merging 2+ XML Documents with Different Encoding Types

Problem:
I have the task of merging two XML documents that differ in their encodings. One is declared as UTF-8 (8-bit UCS Transformation Format), the other as ISO-8859-1 (Latin alphabet No. 1). I also wanted to do it without parsing the XML, since parsing is an expensive operation and can be problematic with large documents. Well, I figured, this is easy! I'll do the following:
  • Create a String Buffer to hold the new Large XML
  • Strip the Root Nodes and any XML header (DOCTYPE, <?xml ...?> declaration, etc.)
  • Write each xml doc to the String Buffer
  • Append the Root Node Back to the Main String
  • Close the String Buffer
  • Have a snack
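The steps above can be sketched roughly as follows. This is only an illustration in Java, not the code I actually ran: the class name NaiveXmlMerge and its methods are made up for this example, and it assumes each document is already in memory as a String, has a simple root element, and has no DOCTYPE with an internal subset.

```java
import java.util.regex.Pattern;

// Hypothetical helper illustrating the string-based merge; not a real library class.
final class NaiveXmlMerge {
    // Matches an XML declaration such as <?xml version="1.0" encoding="UTF-8"?>
    private static final Pattern XML_DECL = Pattern.compile("<\\?xml[^>]*\\?>");
    // Matches a simple DOCTYPE (assumption: no internal subset in [...])
    private static final Pattern DOCTYPE = Pattern.compile("<!DOCTYPE[^>\\[]*>");

    // Removes the XML declaration, DOCTYPE, and the opening/closing root tags.
    static String stripWrapper(String xml, String rootName) {
        String body = XML_DECL.matcher(xml).replaceFirst("");
        body = DOCTYPE.matcher(body).replaceFirst("");
        String q = Pattern.quote(rootName);
        // Opening root tag, with or without attributes
        body = body.replaceFirst("<" + q + "(\\s[^>]*)?>", "");
        // Closing root tag at the very end of the document
        body = body.replaceFirst("</" + q + "\\s*>\\s*$", "");
        return body.trim();
    }

    // Concatenates the stripped bodies and wraps them in a single root element.
    static String merge(String rootName, String... docs) {
        StringBuilder out = new StringBuilder();
        out.append("<").append(rootName).append(">\n");
        for (String doc : docs) {
            out.append(stripWrapper(doc, rootName)).append("\n");
        }
        out.append("</").append(rootName).append(">");
        return out.toString();
    }
}
```

Calling NaiveXmlMerge.merge("jobs", doc1, doc2) on the two documents below would give one <jobs> element containing both <job> entries. Note that this works on already-decoded Strings, so it says nothing yet about how the bytes of each file were decoded.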
XML Doc 1:
<?xml version="1.0" encoding="UTF-8"?>
<jobs>
<job>
<jobtitle>Job 1</jobtitle> ...
</job>
</jobs>
XML Doc 2:
<?xml version="1.0" encoding="iso-8859-1"?>
<jobs>
<job>
<jobtitle>Job 2</jobtitle> ...
</job>
</jobs>


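Turns out, that only works well if the XML documents you're attempting to merge use the same encoding. The same character can be stored as different bytes in each one (for example, é is the single byte 0xE9 in ISO-8859-1 but the two bytes 0xC3 0xA9 in UTF-8), so gluing the raw content together leaves one half of the merged document unreadable under whichever encoding you pick. Any ideas on a workaround?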

Solution:
What I did was specify an encoding type for a FileWriter object and follow the same process, except that I had to write the merged file to disk in a single, unified encoding and then read the file back.
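A minimal sketch of that write-then-read round trip, assuming Java and the java.nio.file API (which is newer than this post). The class name UnifiedEncodingRoundTrip and both method names are hypothetical, and the merged XML is assumed to already be a correctly decoded String.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: writes the merged XML in one encoding and reads it back.
final class UnifiedEncodingRoundTrip {
    // Writes the text to disk, encoding every character with the target charset.
    static void writeAs(Path file, String xml, Charset target) throws IOException {
        try (Writer w = Files.newBufferedWriter(file, target)) {
            w.write(xml);
        }
    }

    // Reads the file back, decoding with the same charset that was used to write it.
    static String readAs(Path file, Charset target) throws IOException {
        return new String(Files.readAllBytes(file), target);
    }
}
```

The important part is that the charset is named explicitly on both the write and the read, rather than relying on the platform default.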

This worked OK. I am still looking at alternatives, such as going to binary and back to a String again, but for now this is my best available option.
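For what it's worth, the "binary and back to String" idea might look something like the sketch below. It is untested against my real files and rests on assumptions: the class DeclaredEncodingDecoder is hypothetical, each document is available as a byte array, the declared encoding is ASCII-compatible (so UTF-16 would need extra handling), and a missing declaration is treated as UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: decodes each document using the encoding it declares.
final class DeclaredEncodingDecoder {
    // Captures the value of encoding="..." inside the XML declaration.
    private static final Pattern ENCODING =
        Pattern.compile("<\\?xml[^>]*encoding\\s*=\\s*[\"']([A-Za-z0-9._-]+)[\"']");

    // Peeks at the first bytes as ISO-8859-1, which maps every byte to a char,
    // so the ASCII-only declaration can be read safely (assumption: ASCII-compatible encoding).
    static Charset declaredCharset(byte[] raw) {
        int n = Math.min(raw.length, 200);
        String head = new String(raw, 0, n, StandardCharsets.ISO_8859_1);
        Matcher m = ENCODING.matcher(head);
        // Assumption: fall back to UTF-8 when no encoding is declared.
        return m.find() ? Charset.forName(m.group(1)) : StandardCharsets.UTF_8;
    }

    // Bytes to String, using the document's own declared encoding.
    static String decode(byte[] raw) {
        return new String(raw, declaredCharset(raw));
    }

    // String back to bytes in one unified encoding for the merged output.
    static byte[] encodeUtf8(String xml) {
        return xml.getBytes(StandardCharsets.UTF_8);
    }
}
```

Once every document has been decoded this way, the Strings can be merged in memory and encoded once at the end, which would avoid the trip to disk.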
