I recently discovered a problem with our WebCopy and Cyotek Sitemap Creator products to do with "corruption" of plain text documents, where non-ANSI characters appeared incorrectly. It didn't take long to realize that these programs were apparently saving text content as ANSI files. I found this curious, as the Crawler library they use detects the response encoding and uses it when saving the files.

Or does it? Consider the code below:

csharp
using System.IO;
using System.Text;

string fileName;
byte[] data;
Encoding encoding;

fileName = Path.GetTempFileName();
data = new byte[0]; // assume you have a populated byte array!
encoding = Encoding.UTF8;

using (FileStream stream = new FileStream(fileName, FileMode.Create))
{
  // the encoding is passed to the writer in the hope that it controls how the file is saved
  using (BinaryWriter writer = new BinaryWriter(stream, encoding))
    writer.Write(data);
}

Looking at this, you might be tempted to assume (as I did) that this code would save the content in the given encoding. When I opened one of the files generated by similar code in Notepad++, however, it was detected as an ANSI file. Switching the encoding to UTF-8 immediately displayed the file correctly, without the "corruption". So the bytes themselves were fine; what was missing was the byte order mark (BOM), which the BinaryWriter doesn't write. It only uses the given encoding when converting strings and characters to bytes, and a byte array passed to Write is copied to the stream unchanged. All this time I assumed files were being saved in a way that identified them as UTF-8 (or whatever the response encoding was) and properly supported Unicode, and all this time I was wrong.
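To see this behaviour in isolation, here is a minimal sketch that writes a byte array through a BinaryWriter into a MemoryStream and then inspects the result. The sample string and the variable names are my own assumptions for illustration; only the BinaryWriter usage mirrors the code above.

```csharp
using System;
using System.IO;
using System.Text;

// "h\u00E9llo" contains one non-ASCII character (e-acute), which takes two bytes in UTF-8
byte[] data = Encoding.UTF8.GetBytes("h\u00E9llo");
byte[] written;

using (MemoryStream stream = new MemoryStream())
using (BinaryWriter writer = new BinaryWriter(stream, Encoding.UTF8))
{
  writer.Write(data); // byte[] overload: the bytes are copied as-is
  writer.Flush();
  written = stream.ToArray();
}

// The output starts with 0x68 ('h'), not with the UTF-8 BOM (EF-BB-BF)
Console.WriteLine(BitConverter.ToString(written)); // 68-C3-A9-6C-6C-6F
Console.WriteLine(written.Length == data.Length);  // True: nothing was added
```

The content is perfectly valid UTF-8, but nothing in the stream says so, which is why an editor that relies on a BOM falls back to ANSI.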

So how do you manually write a BOM into a document? The oddly named GetPreamble method of the Encoding class is what you need. It returns the bytes that comprise the BOM (or an empty array if the encoding doesn't use one), and you can then write these directly to your stream:

csharp
using System.IO;
using System.Text;

string fileName;
byte[] data;
Encoding encoding;

fileName = Path.GetTempFileName();
data = new byte[0]; // assume you have a populated byte array!
encoding = Encoding.UTF8;

using (FileStream stream = new FileStream(fileName, FileMode.Create))
{
  using (BinaryWriter writer = new BinaryWriter(stream, encoding))
  {
    // write the BOM first (nothing is written if the encoding has no preamble)
    writer.Write(encoding.GetPreamble());
    // then the content itself, byte for byte
    writer.Write(data);
  }
}
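Since the encoding may be whatever the response reported rather than always UTF-8, it can help to know what GetPreamble returns in each case. The following is a small sketch that prints the preamble of a few built-in encodings; the choice of encodings is my own, picked for illustration.

```csharp
using System;
using System.Text;

Encoding[] encodings =
{
  Encoding.UTF8,             // preamble: EF-BB-BF
  Encoding.Unicode,          // UTF-16 little endian, preamble: FF-FE
  Encoding.BigEndianUnicode, // UTF-16 big endian, preamble: FE-FF
  new UTF8Encoding(false),   // UTF-8 created with the BOM disabled: empty preamble
  Encoding.ASCII             // no BOM exists for ASCII: empty preamble
};

foreach (Encoding encoding in encodings)
{
  byte[] preamble = encoding.GetPreamble();
  string display = preamble.Length == 0 ? "(none)" : BitConverter.ToString(preamble);

  Console.WriteLine("{0}: {1}", encoding.WebName, display);
}
```

Because an encoding without a BOM returns an empty array, writing the result of GetPreamble unconditionally is harmless for those encodings; zero bytes are written.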

Note that you only need to write a BOM if your document is actually supposed to be a text file - if it is "normal" binary data (such as an image or a gzip stream) then you definitely do not want to write a BOM, or you truly will have a corrupt file.
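One way to express this rule in code is sketched below. Both WriteContent and StartsWith are hypothetical helpers of my own and not part of the Crawler library, and the isText flag is an assumption: how you decide whether a response is text (for example from its content type) is up to you. The sketch also skips the BOM when the data already begins with one, so that it isn't written twice.

```csharp
using System;
using System.IO;
using System.Text;

byte[] text = Encoding.UTF8.GetBytes("caf\u00E9"); // 5 bytes in UTF-8
byte[] binary = { 0x1F, 0x8B, 0x08 };              // e.g. the start of a gzip stream

using (MemoryStream output = new MemoryStream())
{
  WriteContent(output, text, Encoding.UTF8, isText: true);
  Console.WriteLine(output.Length); // 8: 3-byte BOM + 5 bytes of text
}

using (MemoryStream output = new MemoryStream())
{
  WriteContent(output, binary, Encoding.UTF8, isText: false);
  Console.WriteLine(output.Length); // 3: binary data is left untouched
}

// Hypothetical helper: true if data begins with the (non-empty) prefix
static bool StartsWith(byte[] data, byte[] prefix)
{
  if (prefix.Length == 0 || data.Length < prefix.Length)
    return false;

  for (int i = 0; i < prefix.Length; i++)
  {
    if (data[i] != prefix[i])
      return false;
  }

  return true;
}

// Hypothetical helper: writes a BOM only for text that doesn't already have one
static void WriteContent(Stream stream, byte[] data, Encoding encoding, bool isText)
{
  byte[] preamble = encoding.GetPreamble();

  if (isText && !StartsWith(data, preamble))
    stream.Write(preamble, 0, preamble.Length);

  stream.Write(data, 0, data.Length);
}
```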

Now the files produced by WebCopy and Sitemap Creator are encoded correctly and I can be happy with yet another bug squashed, unhappy at yet another reminder of why I need to write a proper set of automated tests for the libraries I use, but happy again that I had another (albeit brief) tip to post on this blog.

Update History

  • 2012-12-11 - First published
  • 2020-11-21 - Updated formatting
