Some of the project files created by Cyotek Sitemap Creator and WebCopy are fairly large and the load performance of such files is poor. The files are saved using a XmlWriter class which is nice and fast. When reading the files back however, currently the whole file is loaded into a XmlDocument and then XPath expressions are used to pull out the values. This article describes our effort at converting the load code to use a XmlReader instead.

Sample XML

The following XML snippet can be used as a base for testing the code in this article, if required.

xml
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<cyotek.webcopy.project version="1.0.0.0" generator="Cyotek WebCopy 1.0.0.2 (BETA))" edition="">
  <uri lastCrawled="-8589156546443756722" includeSubDomains="false">http://saturn/cyotekdev/</uri>
  <additionalUri>
    <uri>first url</uri>
    <uri>second url</uri>
  </additionalUri>
  <authentication doNotAskForPasswords="false">
    <credential uri="/" userName="username" password="password" />
  </authentication>
  <saveFolder path="C:\Downloaded Web Sites" emptyBeforeCrawl="true" createFolderForDomain="true" flattenWebsiteDirectories="false" remapExtensions="true" />
  <crawler removeFragments="true" followRedirects="true" disableUriRemapping="false" slashedRootRemapMode="1" sort="false" acceptDeflate="true" acceptGZip="true" bufferSize="0" crawlAboveRoot="false" />
  <defaultDocuments />
  <linkInfo save="true" clearBeforeCrawl="true" />
  <stripQueryString>false</stripQueryString>
  <useHeaderChecking>true</useHeaderChecking>
  <userAgent useDefault="true"></userAgent>
  <rules>
    <rule options="1" enabled="true">trackback\?id=</rule>
    <rule options="1" enabled="false">/downloads/get</rule>
    <rule options="1" enabled="false">/article</rule>
    <rule options="1" enabled="false">/sitemap</rule>
    <rule options="1" enabled="false">image/get/</rule>
    <rule options="1" enabled="false">products</rule>
    <rule options="1" enabled="false">zipviewer</rule>
  </rules>
  <domainAliases>
    <alias>(?:http(?:s?):\/\/)?saturn/cyotekdev/</alias>
  </domainAliases>
  <forms>
    <page name="" uri="login" enabled="true" method="POST">
      <parameters>
        <parameter name="rememberMe">true</parameter>
        <parameter name="username">username</parameter>
        <parameter name="password">password</parameter>
      </parameters>
    </page>
  </forms>
  <linkMap>
    <link id="b1b85626f9984279b5e033c30a0a3f65" uri="" source="1" contentType="text/html" httpStatus="200" lastDownloaded="-8589156550177150260" hash="0333961593BD555C49ABF2355140225A07DA9297" fileName="index.htm">
      <title>Cyotek</title>
      <incomingLinks>
        <link id="b1b85626f9984279b5e033c30a0a3f65" />
      </incomingLinks>
      <outgoingLinks>
        <link id="96a358d21135449eb6561f25399e24de" />
      </outgoingLinks>
      <headers>
        <header key="Content-Encoding" value="gzip" />
        <header key="Vary" value="Accept-Encoding" />
        <header key="X-AspNetMvc-Version" value="1.0" />
        <header key="Content-Length" value="3415" />
        <header key="Cache-Control" value="private" />
        <header key="Content-Type" value="text/html; charset=utf-8" />
        <header key="Date" value="Fri, 01 Oct 2010 16:51:07 GMT" />
        <header key="Expires" value="Fri, 01 Oct 2010 16:51:07 GMT" />
        <header key="ETag" value="" />
        <header key="Server" value="Microsoft-IIS/7.5" />
        <header key="X-Powered-By" value="UrlRewriter.NET 2.0.0" />
      </headers>
    </link>
  </linkMap>
</cyotek.webcopy.project>

Writing XML using a XmlWriter

Before I start discussing how to load the data, here is a quick overview of how it is originally saved. For clarity I'm only showing the bare bones of the method.

csharp
string workFile;

workFile = Path.GetTempFileName();

using (FileStream stream = File.Create(workFile))
{
  XmlWriterSettings settings;

  settings = new XmlWriterSettings { Indent = true, Encoding = Encoding.UTF8 };

  using (XmlWriter writer = XmlWriter.Create(stream, settings))
  {
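    writer.WriteStartDocument(true);

    // root element, as seen in the sample XML
    writer.WriteStartElement("cyotek.webcopy.project");

      writer.WriteStartElement("uri");
      // WriteAttributeString only accepts strings, so convert the values first
      if (this.LastCrawled.HasValue)
        writer.WriteAttributeString("lastCrawled", XmlConvert.ToString(this.LastCrawled.Value.ToBinary()));
      writer.WriteAttributeString("includeSubDomains", XmlConvert.ToString(_includeSubDomains));
      writer.WriteValue(this.Uri);
      writer.WriteEndElement();

    writer.WriteEndElement();

    writer.WriteEndDocument();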
  }
}

File.Copy(workFile, fileName, true);
File.Delete(workFile);

The above code creates a new temporary file and opens it as a FileStream. A XmlWriterSettings object is created to specify some options (by default the writer won't indent, making the output files difficult to read if you open them in a text editor), and then a XmlWriter is created from both the settings and the stream.

Once you have a writer, you can quickly save data in compliant format, with the caveat that you must ensure that your WriteStarts have a corresponding WriteEnd, that you only have a single document element, and so on.

Assuming the writer gets to the end without any errors, the stream is closed, then the temporary file is copied to the final destination before being deleted. (This is a good tip in its own right, as it means you won't destroy the user's existing file if an error occurs, which you would if you wrote directly to the destination file.)

Reading XML using a XmlDocument

As discussed above, currently we use a XmlDocument to load data. The following snippet shows an example of this.

Note that the code below won't work "out of the box" as we use a number of extension methods to handle data type conversion, which makes the code a lot more readable!

csharp
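XmlDocument document;
XmlElement documentElement;

document = new XmlDocument();
document.Load(fileName);
documentElement = document.DocumentElement;

// AsString, AsDate and AsBoolean are our own extension methods
_uri = documentElement.SelectSingleNode("uri").AsString();
_lastCrawled = documentElement.SelectSingleNode("uri/@lastCrawled").AsDate();
_includeSubDomains = documentElement.SelectSingleNode("uri/@includeSubDomains").AsBoolean(false);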

So, as you can see we load a XmlDocument with the contents of our file. We then call SelectSingleNode several times with a different XPath expression.

And in the case of a crawler project, we do this a lot, as there is a large amount of information stored in the file.

I haven't tried to benchmark XPath, but I would assume that we could have optimized this by first getting the appropriate element (uri in this case) and then running additional XPath against it to read text/attributes. But this article would be rather pointless then, as we want to discuss the XmlReader!
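
As a rough illustration of that untested idea, the sketch below selects the uri element once and then reads its text and attributes directly from the XmlElement, rather than running one XPath query per value. The UriElementLoader class and its Load method are hypothetical names made up for this example, it assumes the uri element is always present, and it has not been benchmarked against the original code.

```csharp
using System;
using System.Xml;

public static class UriElementLoader
{
  // Hypothetical helper: one XPath query to find <uri>, then direct reads from it.
  // Assumes the document element always contains a <uri> child.
  public static void Load(string xml, out string uri, out DateTime? lastCrawled, out bool includeSubDomains)
  {
    XmlDocument document = new XmlDocument();
    document.LoadXml(xml);

    XmlElement element = (XmlElement)document.DocumentElement.SelectSingleNode("uri");

    uri = element.InnerText;

    // XmlElement.GetAttribute returns an empty string when the attribute is missing
    string rawDate = element.GetAttribute("lastCrawled");
    lastCrawled = rawDate.Length != 0
      ? DateTime.FromBinary(XmlConvert.ToInt64(rawDate))
      : (DateTime?)null;

    string rawFlag = element.GetAttribute("includeSubDomains");
    includeSubDomains = rawFlag.Length != 0 && XmlConvert.ToBoolean(rawFlag);
  }
}
```

Whether this is meaningfully faster than separate SelectSingleNode calls would need measuring; the document still has to be loaded into memory in full either way.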

As an example, we have a 2MB project file which represents the development version of cyotek.com. Using System.Diagnostics.Stopwatch we timed how long it took to load this project 10 times, and it averaged 25 seconds per load, which is definitely unacceptable.
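
For anyone wanting to repeat this kind of measurement, here is a minimal sketch of averaging a load over several runs with Stopwatch. The LoadTimer class and AverageTime method are hypothetical names for illustration only, and this is not the exact timing code used for the figures above.

```csharp
using System;
using System.Diagnostics;

public static class LoadTimer
{
  // Runs the supplied action the given number of times and returns the mean duration.
  public static TimeSpan AverageTime(Action action, int iterations)
  {
    Stopwatch stopwatch = Stopwatch.StartNew();

    for (int i = 0; i < iterations; i++)
      action();

    stopwatch.Stop();

    return TimeSpan.FromTicks(stopwatch.Elapsed.Ticks / iterations);
  }
}
```

It could be called as LoadTimer.AverageTime(() => project.Load(fileName), 10), where project.Load stands in for whatever load routine is being measured.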

Reading using a XmlReader

Which brings us to the point of this article, doing the job using a XmlReader and hopefully improving the performance dramatically.

Before we continue though, a caveat:

This is the first time I've tried to use the XmlReader class, therefore it is possible this article doesn't take the best approach. I also wrote this article at the same time as getting the reader to work in my application, so I've gone back and forth already correcting errors and misconceptions, which at times (and possibly still) left the article a little disjointed. If you spot any errors in this article, please let us know.

The XmlReader seems to operate on the same principle as the XmlWriter, in that you need to read the data in more or less the same order as it was written. I suppose the most convenient analogy is a forward cursor in SQL Server, where you can only move forward through the records and not back.

Creating the reader

So, first things first - we need to create an object. But the XmlReader (like the XmlWriter) is abstract. Fortunately exactly like the writer, there is a static Create method we can use.

Continuing in the reader-is-just-like-writer vein, there is also a XmlReaderSettings class which you can use to fine tune certain aspects.

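Let's get the document opened then. With XmlDocument I just provided a file name; here I'm opening a FileStream and passing it to XmlReader.Create (which also has overloads that accept a file name or URI directly).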

csharp
using (FileStream fileStream = File.OpenRead(fileName))
{
  XmlReaderSettings settings;

  settings = new XmlReaderSettings();
  settings.ConformanceLevel = ConformanceLevel.Document;

  using (XmlReader reader = XmlReader.Create(fileStream, settings))
  {
  }
}

This sets us up nicely. Continuing my analogy from earlier, if you're familiar with record sets, there's usually a MoveNext or a Read method you call to read the next record in the set. The XmlReader doesn't seem to be different in this respect, as there's a dedicated Read method for iterating through all nodes in the document. In addition, there are a number of other read methods for performing more specific actions.

There is also a NodeType property which lets you know what the current node type is, such as the start of an element, or the end of an element.
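
To see what the reader actually visits, the following sketch lists every node reported by Read along with its NodeType, Name and Depth. NodeLister and Describe are hypothetical names, and the example reads from a string rather than a file purely to keep it self-contained. Whitespace nodes are ignored so that the output only shows the meaningful nodes.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class NodeLister
{
  // Returns one line per node the reader visits, for example "Element uri (depth 1)".
  public static List<string> Describe(string xml)
  {
    List<string> lines = new List<string>();
    XmlReaderSettings settings = new XmlReaderSettings { IgnoreWhitespace = true };

    using (XmlReader reader = XmlReader.Create(new StringReader(xml), settings))
    {
      // Read returns false once the end of the document is reached
      while (reader.Read())
        lines.Add(string.Format("{0} {1} (depth {2})", reader.NodeType, reader.Name, reader.Depth));
    }

    return lines;
  }
}
```

Running this over a small fragment shows that an element with text produces three nodes (Element, Text, EndElement), and that attributes are not reported as nodes by Read at all.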

I'm going to use the IsStartElement method to work out if the current node is the start of an element, then perform processing based on the element name.

Enumerating elements, regardless of their position in the hierarchy

The following snippet will iterate all nodes and check to see if they are the start of an element. Note that this includes top level elements and child elements.

csharp
while (reader.Read())
{
  if (reader.IsStartElement())
  {
  }
}

The Name property will return the name of the active node. So I'm going to compare the name against the names written into the XML and do custom processing for each.

csharp
switch (reader.Name)
{
  case "uri":
    break;
}

Reading attributes on the active element

I mentioned above that there are a number of Read* methods. There are also several Move* methods. The one that caught my eye is MoveToNextAttribute, which I'm going to use for converting attributes to property values.

The Value property will return the value of the current node. If MoveToNextAttribute returns true, then I know I'm in a valid attribute and I can use the aforementioned Name property and the Value property to update property assignments.

The following snippet demonstrates the MoveToNextAttribute method and Value property:

csharp
while (reader.MoveToNextAttribute())
{
  switch (reader.Name)
  {
    case "lastCrawled":
      if (!string.IsNullOrEmpty(reader.Value))
        _lastCrawled = DateTime.FromBinary(Convert.ToInt64(reader.Value));
      break;
    case "includeSubDomains":
      if (!string.IsNullOrEmpty(reader.Value))
        _includeSubDomains = Convert.ToBoolean(reader.Value);
      break;
  }
}

This is actually quite a lot of work. Another alternative is to use the GetAttribute method - this reads an attribute value without moving the reader. I found this very handy when I was loading an object whose identifying property wasn't the first attribute in the XML block. It also takes up a lot less code.

csharp
entry.Headers.Add(reader.GetAttribute("key"), reader.GetAttribute("value"));
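
For comparison with the MoveToNextAttribute loop, here is a sketch of reading the two uri attributes with GetAttribute instead. UriAttributeReader is a hypothetical helper name, and the reader is assumed to be positioned on the uri element when it is called. XmlConvert is used for the conversions here as a culture-independent alternative to Convert.

```csharp
using System;
using System.Xml;

public static class UriAttributeReader
{
  // Reads both attributes by name; the reader stays on the element throughout.
  public static void Read(XmlReader reader, out DateTime? lastCrawled, out bool includeSubDomains)
  {
    // XmlReader.GetAttribute returns null when the attribute is not present
    string rawDate = reader.GetAttribute("lastCrawled");
    string rawFlag = reader.GetAttribute("includeSubDomains");

    lastCrawled = string.IsNullOrEmpty(rawDate)
      ? (DateTime?)null
      : DateTime.FromBinary(XmlConvert.ToInt64(rawDate));

    includeSubDomains = !string.IsNullOrEmpty(rawFlag) && XmlConvert.ToBoolean(rawFlag);
  }
}
```

Because the reader never leaves the element, the element content can be read immediately afterwards without first moving back from an attribute node.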

Reading the content value of an element

I've now got two values out of hundreds in the file loaded and I'm finished with that element. Or am I? Actually I'm not - the original save code demonstrates that in addition to a pair of attributes, we're also saving data directly into the element.

As we have been iterating attributes, the active node is the last attribute, not the original element. Fortunately there's another method we can use - MoveToContent. This time though, we can't use the Value property. Instead, we'll call the ReadString method, giving us the following snippet:

csharp
if (reader.IsStartElement() || reader.MoveToContent() == XmlNodeType.Element)
  _uri = reader.ReadString();

I've included a call to IsStartElement in the above snippet as I found if I called MoveToContent when I was already on a content node (for example if no attributes were present), then it skipped the current node and moved to the next one.

If required, you can call ReadElementContentAsString instead of ReadString.

Some node values aren't strings though - in this case the XmlReader offers a number of strongly typed methods to return and convert the data for you, such as ReadElementContentAsBoolean, ReadElementContentAsDateTime, etc.

csharp
case "useHeaderChecking":
  _useHeaderChecking = reader.ReadElementContentAsBoolean();
  break;
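
As a self-contained illustration of the typed methods, the sketch below finds the first element with a given name and returns its content as a boolean. TypedContentReader and ReadBoolean are hypothetical names; ReadToFollowing is used here simply to keep the example short and is not how the main loading loop in this article works.

```csharp
using System.IO;
using System.Xml;

public static class TypedContentReader
{
  // Returns the boolean content of the first element with the given name,
  // or the fallback value if no such element exists.
  public static bool ReadBoolean(string xml, string elementName, bool fallback)
  {
    using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
    {
      if (reader.ReadToFollowing(elementName))
        return reader.ReadElementContentAsBoolean();
    }

    return fallback;
  }
}
```

One thing to be aware of: the ReadElementContentAs* methods leave the reader on the node after the closing tag, so a following call to Read in a while loop will step over that node. With indented files that node is normally just whitespace, but with unindented XML it could be the next element.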

Processing nodes where the same names are reused for different purposes

In the sample XML document at the start of this article, we have two different types of nodes named uri. The top level one has one purpose, and the children of additionalUri have another.

The problem we now face is that, because a single loop processes all elements, the case statement for uri will be triggered multiple times. We're going to need some way of determining which is which.

There are a few ways we could do this, for example:

  • Continuing to use the main processing loop and adding a means of identifying which type of element is being processed
  • Adding another loop to process the children of the additionalUri element
  • Using the ReadSubtree method to create a brand new XmlReader containing the children and processing that accordingly

As we already have a loop which handles the elements we should probably reuse this - there'll be a lot of duplicate code if we suddenly start adding new loops.

Unfortunately there doesn't seem to be an equivalent of the parent functionality of the XmlDocument class; the closest thing I could see was the Depth property. This returned 1 for the top level uri node, and 2 for the child versions. You need to be careful at what point you read this property, as it also returned 2 when iterating the attributes of the top level uri node.

One workaround would be to use boolean flags to identify the type of node you are loading. This would also mean checking to see if the NodeType was XmlNodeType.EndElement, doing another name comparison, and resetting flags as appropriate. This might be more reliable (or understandable) than simply checking node depths; your mileage may vary.

Another alternative could be to combine depth and element start/end in order to push and pop a stack which would represent the current node hierarchy.
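
The stack idea might look something like the sketch below, which is not what I ended up using. It pushes each element name as the element starts and pops it when the element ends, so the full path to the current element is always available. PathTracker and ElementPaths are hypothetical names for this example.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class PathTracker
{
  // Returns the slash-separated path of every element start,
  // for example "project/additionalUri/uri".
  public static List<string> ElementPaths(string xml)
  {
    List<string> paths = new List<string>();
    Stack<string> ancestors = new Stack<string>();

    using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
    {
      while (reader.Read())
      {
        if (reader.NodeType == XmlNodeType.Element)
        {
          ancestors.Push(reader.Name);

          // A Stack enumerates newest first, so reverse it to get root-first order
          string[] names = ancestors.ToArray();
          Array.Reverse(names);
          paths.Add(string.Join("/", names));

          // Empty elements such as <defaultDocuments /> have no EndElement node,
          // so they are popped straight away
          if (reader.IsEmptyElement)
            ancestors.Pop();
        }
        else if (reader.NodeType == XmlNodeType.EndElement)
        {
          ancestors.Pop();
        }
      }
    }

    return paths;
  }
}
```

With a path available, the two kinds of uri element could be told apart by comparing against the full path rather than just the element name, at the cost of building a string for every element.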

In order to get my converted code running, I went with the boolean flag route. I suspect a future version of the crawler format is going to ensure the nodes have unique names so I don't have to do this hoop jumping again though!

Combined together, the load data code now looks like this:

csharp
bool isLoadingAdditionalUris = false; // true while inside <additionalUri>

while (reader.Read())
{
  if (reader.IsStartElement())
  {
    switch (reader.Name)
    {
      case "uri":
        if (!isLoadingAdditionalUris)
        {
          while (reader.MoveToNextAttribute())
          {
            switch (reader.Name)
            {
              case "lastCrawled":
                if (!string.IsNullOrEmpty(reader.Value))
                  _lastCrawled = DateTime.FromBinary(Convert.ToInt64(reader.Value));
                break;
              case "includeSubDomains":
                if (!string.IsNullOrEmpty(reader.Value))
                  _includeSubDomains = Convert.ToBoolean(reader.Value);
                break;
            }
          }

          if (reader.IsStartElement() || reader.MoveToContent() == XmlNodeType.Element)
            _uri = reader.ReadString();
        }
        else if (reader.IsStartElement() || reader.MoveToContent() == XmlNodeType.Element)
          _additionalRootUris.Add(new Uri(UriHelpers.CombineUri(this.GetBaseUri(), reader.ReadString(), this.SlashedRootRemapMode)));
        break;
      case "additionalUri":
        isLoadingAdditionalUris = true;
        break;
    }
  }
  else if (reader.NodeType == XmlNodeType.EndElement)
  {
    switch (reader.Name)
    {
      case "additionalUri":
        isLoadingAdditionalUris = false;
        break;
    }
  }
}

Which is significantly more code than the original version, and it's only handling a few values.

Using the ReadSubtree Method

The save functionality of crawler projects isn't centralized; child objects such as rules perform their own loading and saving via the following interface:

csharp
public interface IXmlPersistance
{
  void Write(string fileName, XmlWriter writer);
  void Read(string fileName, XmlNode reader);
}

And the current XmlDocument based code will call it like this:

csharp
_rules.Clear();
foreach (XmlNode child in documentElement.SelectNodes("rules/rule"))
{
  Rule rule;
  rule = new Rule();
  ((IXmlPersistance)rule).Read(fileName, child);
  _rules.Add(rule);
}

None of this code will work now with the switch to XmlReader, so it all needs changing. For this, I'll create a new interface:

csharp
public interface IXmlPersistance2
{
  void Write(string fileName, XmlWriter writer);
  void Read(string fileName, XmlReader reader);
}

The only difference is the Read method is now using a XmlReader rather than a XmlNode.

The next issue is that if I pass the original reader to this interface, the implementer will be able to read outside the boundaries of the element it is supposed to be reading, which could prevent the rest of the document from loading successfully.

We can resolve this particular issue by calling the ReadSubtree method, which returns a brand new XmlReader object that only contains the active element and its children. This means our other settings objects can happily (mis)use the passed reader without affecting the underlying load.

Note in the snippet below that we have wrapped the new reader in a using statement. The MSDN documentation states that the result of ReadSubtree should be closed before you continue reading from the original reader.

csharp
case "rule":
  Rule rule;

  rule = new Rule();
  using (XmlReader childReader = reader.ReadSubtree())
    ((IXmlPersistance2)rule).Read(fileName, childReader);
  _rules.Add(rule);
  break;
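
To show the boundary behaviour in isolation, here is a self-contained sketch that reads each rule through its own subtree reader. SubtreeDemo and ReadRules are hypothetical names, and a real implementation would hand the child reader to the object being loaded rather than reading the text inline.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class SubtreeDemo
{
  // Collects the text of each <rule>, giving every rule its own bounded reader.
  public static List<string> ReadRules(string xml)
  {
    List<string> rules = new List<string>();

    using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
    {
      while (reader.Read())
      {
        if (reader.NodeType == XmlNodeType.Element && reader.Name == "rule")
        {
          using (XmlReader childReader = reader.ReadSubtree())
          {
            // A subtree reader starts before its first node, so move onto <rule> first
            childReader.MoveToContent();
            rules.Add(childReader.ReadElementContentAsString());

            // Reading further just hits the end of the subtree; siblings are out of reach
            while (childReader.Read()) { }
          }
          // Once the child is closed, the outer reader sits on </rule>
          // and the next Read() carries on from there
        }
      }
    }

    return rules;
  }
}
```

Even though the child reader is deliberately read to exhaustion here, the outer loop still finds every subsequent rule, which is the protection described above.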

Getting a XmlDocument from a XmlReader

One of the issues I did have was with classes which extend the load behaviour of an existing class. For example, one abstract class has a number of base properties, which I easily converted to use XmlReader. However, this class is inherited by other classes and these load additional properties. Using the loop method outlined above it wasn't possible for these child classes to read their data, as the reader had already been fully read. I didn't want these derived classes to have to load the base properties themselves, and I didn't want to implement any half thought out idea. So, instead these classes continue to use the original XmlDocument based loading. So, given a XmlReader as the source, how do you get a XmlDocument?

Turns out this is also very simple - the Load method of the XmlDocument can accept a reader. The only disadvantage is the constructor of the XmlDocument doesn't support this, which means you have to explicitly declare a document, load it, then pass it on, demonstrated below.

csharp
void IXmlPersistance2.Read(string fileName, XmlReader reader)
{
  XmlDocument document;

  document = new XmlDocument();
  document.Load(reader);

  ((IXmlPersistance)this).Read(fileName, document.DocumentElement);
}

Fortunately these classes aren't used frequently and so they shouldn't adversely affect the performance tuning I'm trying to do.

I could have used the GetAttribute method I discussed earlier as this doesn't move the reader, but firstly I didn't discover that method until after I'd written this section of the article and I thought it had enough value to remain, and secondly I don't think there is an equivalent for elements.

The final verdict

Using the XmlReader is certainly long-winded compared to the original code. The core of the original code is around 100 lines. The core of the new code is more than triple this. I'll probably replace all the "move to next attribute" loops with direct calls to GetAttribute, which will cut down the amount of code a fair bit. I may also try to do a generic approach using reflection, although this will then have its own performance drawback.

However, the XML load performance increase was certainly worth the extra code - the average went from 25 seconds down to 12 seconds. This is still quite slow and I certainly want to improve it further, but at less than half the original load time I'm pleased with the result.

You also need to be careful when writing the document. In Cyotek crawler projects, as we are using XPath to query an entire document, we can load values no matter where they are located. When using a XmlReader, the values are read in the same order as they were written - so if you have saved a critical piece of information near the end of the document, but you require it when loading information at the start, you're going to run into problems.

Update History

  • 2010-11-05 - First published
  • 2020-11-21 - Updated formatting



Comments


# Calabonga

Thanks! The post is very helpful. XmlReader is not my choice!


# Slobodan

It's worth noting the existence of the XPathNavigator class that in some respects behaves like a "reader" but can also be used for direct access to "named" locations using XPath.

With careful planning of the XML layout (to place the elements in a natural order) you can achieve the performance of the "reader" and the conciseness of the XmlDocument.


# Richard Moss

Slobodan,

Thanks for your comments, I shall check into the XPathNavigator class too and see what this has to offer.

Regards; Richard Moss
