I recently discovered a problem with our WebCopy and Cyotek Sitemap Creator products to do with "corruption" of plain text documents, where non-ANSI characters appeared incorrectly. It didn't take long to realize that these programs were saving text content as ANSI files, which I found curious, as the Crawler library they use detects the response encoding and uses it to save the files.
Or does it? Consider the code below:
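A minimal sketch of the sort of code in question, assuming the downloaded text is held in a string. `ContentSaver`, `SaveContent` and the parameter names are illustrative only and are not the Crawler library's actual API:

```csharp
using System;
using System.IO;
using System.Text;

public static class ContentSaver
{
    // Hypothetical helper: saves downloaded text using the encoding
    // reported by the response.
    public static void SaveContent(string fileName, string content, Encoding encoding)
    {
        using (Stream stream = File.Create(fileName))
        using (BinaryWriter writer = new BinaryWriter(stream, encoding))
        {
            // Write(char[]) encodes the characters with the writer's encoding
            // (unlike Write(string), it does not add a length prefix)
            writer.Write(content.ToCharArray());
        }
    }

    public static void Main()
    {
        string fileName = Path.Combine(Path.GetTempPath(), "sample.txt");
        SaveContent(fileName, "caf\u00e9", Encoding.UTF8);
        Console.WriteLine("Saved " + fileName);
    }
}
```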
Looking at this, you might be tempted to assume (as I did) that this code would save the content in the given encoding. When I opened one of the files generated by similar code in Notepad++, however, it was detected as an ANSI file. Switching the encoding to UTF-8 immediately displayed the file correctly, without the "corruption". So it seems the byte order mark (BOM) isn't actually written by the BinaryWriter - I think it only uses the given encoding for converting strings to a byte array. All this time I assumed files were being saved as UTF-8 (or whatever the response encoding was) and properly supported Unicode, and all this time I was wrong.
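One way to check this without reaching for Notepad++ is to write to a `MemoryStream` and look at the raw bytes. The sketch below assumes UTF-8 and uses hypothetical helper names (`WriteWithBinaryWriter`, `StartsWithPreamble`):

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

public static class BomCheck
{
    // Encodes text through a BinaryWriter and returns the raw bytes produced.
    public static byte[] WriteWithBinaryWriter(string text, Encoding encoding)
    {
        using (MemoryStream stream = new MemoryStream())
        {
            using (BinaryWriter writer = new BinaryWriter(stream, encoding))
            {
                writer.Write(text.ToCharArray());
            }

            // ToArray still works after the writer has closed the stream
            return stream.ToArray();
        }
    }

    // True if the buffer begins with the encoding's preamble (BOM).
    public static bool StartsWithPreamble(byte[] data, Encoding encoding)
    {
        byte[] preamble = encoding.GetPreamble();

        return preamble.Length > 0
            && data.Length >= preamble.Length
            && data.Take(preamble.Length).SequenceEqual(preamble);
    }

    public static void Main()
    {
        byte[] data = WriteWithBinaryWriter("caf\u00e9", Encoding.UTF8);

        Console.WriteLine(BitConverter.ToString(data));            // 63-61-66-C3-A9
        Console.WriteLine(StartsWithPreamble(data, Encoding.UTF8)); // False
    }
}
```

The text is encoded as UTF-8 (the `C3-A9` pair is the "é"), but nothing precedes it, so an editor has to guess the encoding.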
So how do you manually write a BOM into a document? The oddly named GetPreamble function available from the Encoding class is what you need - this returns the bytes that comprise the BOM, and you can then write this directly to your stream:
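A minimal sketch of writing the preamble first, assuming a text payload. `BomWriter` and `GetBytesWithBom` are illustrative names, not part of any library:

```csharp
using System;
using System.IO;
using System.Text;

public static class BomWriter
{
    // Writes the encoding's preamble (BOM) followed by the encoded text
    // and returns the resulting bytes.
    public static byte[] GetBytesWithBom(string content, Encoding encoding)
    {
        using (MemoryStream stream = new MemoryStream())
        {
            using (BinaryWriter writer = new BinaryWriter(stream, encoding))
            {
                // GetPreamble returns an empty array for encodings that
                // have no BOM, in which case this writes nothing
                writer.Write(encoding.GetPreamble());
                writer.Write(content.ToCharArray());
            }

            return stream.ToArray();
        }
    }

    public static void Main()
    {
        byte[] data = GetBytesWithBom("caf\u00e9", Encoding.UTF8);

        Console.WriteLine(BitConverter.ToString(data)); // EF-BB-BF-63-61-66-C3-A9
    }
}
```

The same two `Write` calls apply unchanged when the underlying stream comes from `File.Create` rather than a `MemoryStream`.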
Note that you only need to write a BOM if your document is actually supposed to be a text file - if it is "normal" binary data (such as an image or a gzip stream) then you definitely do not want to write a BOM, or you truly will have a corrupt file.
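That rule might be sketched as follows, assuming the response's content type is available at the point of saving. The `IsText` check is a deliberately naive, hypothetical test that only recognises `text/*` media types (so it would miss text formats such as `application/json`), and `BuildFileBytes` also skips the preamble when the body already starts with one:

```csharp
using System;
using System.IO;
using System.Text;

public static class ConditionalBom
{
    // Hypothetical check: treats any "text/*" media type as text.
    public static bool IsText(string contentType)
    {
        return contentType != null
            && contentType.Trim().StartsWith("text/", StringComparison.OrdinalIgnoreCase);
    }

    // True if the body already begins with the given preamble.
    public static bool HasPreamble(byte[] body, byte[] preamble)
    {
        if (preamble.Length == 0 || body.Length < preamble.Length)
        {
            return false;
        }

        for (int i = 0; i < preamble.Length; i++)
        {
            if (body[i] != preamble[i])
            {
                return false;
            }
        }

        return true;
    }

    // Prepends a BOM only for text content that does not already have one.
    public static byte[] BuildFileBytes(byte[] body, string contentType, Encoding encoding)
    {
        using (MemoryStream stream = new MemoryStream())
        {
            byte[] preamble = encoding.GetPreamble();

            if (IsText(contentType) && !HasPreamble(body, preamble))
            {
                stream.Write(preamble, 0, preamble.Length);
            }

            stream.Write(body, 0, body.Length);

            return stream.ToArray();
        }
    }

    public static void Main()
    {
        byte[] text = Encoding.UTF8.GetBytes("hi");
        byte[] image = { 0x89, 0x50, 0x4E, 0x47 };

        // EF-BB-BF-68-69
        Console.WriteLine(BitConverter.ToString(BuildFileBytes(text, "text/plain; charset=utf-8", Encoding.UTF8)));
        // 89-50-4E-47
        Console.WriteLine(BitConverter.ToString(BuildFileBytes(image, "image/png", Encoding.UTF8)));
    }
}
```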
Now the files produced by WebCopy and Sitemap Creator are encoded correctly and I can be happy with yet another bug squashed, unhappy at yet another reminder of why I need to write a proper set of automated tests for the libraries I use, but happy again that I had another (albeit brief) tip to post on this blog.
- 2012-12-11 - First published
- 2020-11-21 - Updated formatting