Some of the project files created by Cyotek Sitemap Creator and
WebCopy are fairly large and the load performance of such files
is poor. The files are saved using the
XmlWriter class, which is
nice and fast. When reading the files back, however, the whole
file is currently loaded into an
XmlDocument and then XPath
expressions are used to pull out the values. This article
describes our effort at converting the load code to use an
XmlReader instead.
The following XML snippet can be used as a base for testing the code in this article, if required.
Writing XML using an XmlWriter
Before I start discussing how to load the data, here is a quick overview of how it is originally saved. For clarity I'm only showing the bare bones of the method.
The above code creates a new temporary file and opens this into
a stream. An XmlWriterSettings object is created to specify
some options (by default it won't indent, making the output
files difficult to read if you open them in a text editor), and
an XmlWriter is created from both the settings and the stream.
Once you have a writer, you can quickly save data in a compliant format, with the caveat that you must ensure that your WriteStarts have a corresponding WriteEnd, that you only have a single document element, and so on.
Assuming the writer gets to the end without any errors, the stream is closed, then the temporary file is copied to the final destination before being deleted. (This is a good tip in its own right, as it means you won't destroy the user's existing file if an error occurs, which you would if you wrote directly to the destination file.)
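To make the temporary file approach concrete, here is a minimal sketch. It is not the actual Sitemap Creator or WebCopy code: the ProjectWriterSketch class, the Save signature and the project / uri / includeSubDomains names are all assumptions made up for illustration.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class ProjectWriterSketch
{
    // Hypothetical helper: writes a tiny document to a temporary file first,
    // then copies it over the destination only if writing succeeded.
    public static void Save(string fileName, string uri, bool includeSubDomains)
    {
        string workFile = Path.GetTempFileName();

        try
        {
            XmlWriterSettings settings = new XmlWriterSettings
            {
                Indent = true,           // off by default; makes the file readable in a text editor
                Encoding = Encoding.UTF8
            };

            using (FileStream stream = File.Create(workFile))
            using (XmlWriter writer = XmlWriter.Create(stream, settings))
            {
                writer.WriteStartDocument();
                writer.WriteStartElement("project");  // the single document element

                writer.WriteStartElement("uri");
                writer.WriteAttributeString("includeSubDomains", XmlConvert.ToString(includeSubDomains));
                writer.WriteString(uri);
                writer.WriteEndElement();             // closes uri

                writer.WriteEndElement();             // closes project
                writer.WriteEndDocument();
            }

            // The user's existing file is only replaced once the write has completed.
            File.Copy(workFile, fileName, true);
        }
        finally
        {
            File.Delete(workFile);
        }
    }
}
```

If an exception is thrown part way through, the destination file is never touched and the finally block still cleans up the temporary file.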
Reading XML using an XmlDocument
As discussed above, we currently use an
XmlDocument to load
data. The following snippet shows an example of this.
Note that the code below won't work "out of the box" as we use a number of extension methods to handle data type conversion, which makes the code a lot more readable!
So, as you can see, we load an
XmlDocument with the contents of
our file. We then call
SelectSingleNode several times, each time with a
different XPath expression.
And in the case of a crawler project, we do this a lot, as there is a large amount of information stored in the file.
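As a rough illustration of this style of loading, here is a minimal sketch. It uses plain XmlConvert calls rather than the extension methods mentioned above, and the XmlDocumentLoadSketch class, the ProjectInfo type and the XPath expressions are hypothetical, not taken from the real project format.

```csharp
using System;
using System.Xml;

public static class XmlDocumentLoadSketch
{
    // Hypothetical container for the two values this sketch loads.
    public sealed class ProjectInfo
    {
        public string Uri { get; set; }
        public bool IncludeSubDomains { get; set; }
    }

    public static ProjectInfo Load(string fileName)
    {
        XmlDocument document = new XmlDocument();
        document.Load(fileName); // parses the whole file into memory up front

        ProjectInfo result = new ProjectInfo();

        // Each value is fetched with its own XPath query against the document.
        XmlNode uriNode = document.SelectSingleNode("project/uri");
        if (uriNode != null)
        {
            result.Uri = uriNode.InnerText;
        }

        XmlNode flagNode = document.SelectSingleNode("project/uri/@includeSubDomains");
        if (flagNode != null)
        {
            result.IncludeSubDomains = XmlConvert.ToBoolean(flagNode.Value);
        }

        return result;
    }
}
```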
I haven't tried to benchmark XPath, but I would assume that we
could have optimized this by first getting the appropriate
element (uri in this case) and then running additional XPath
expressions against it to read text and attributes. But this
article would be rather pointless then, as we want to discuss the
XmlReader.
As an example, we have a 2 MB project file which represents the
development version of cyotek.com. Using
System.Diagnostics.Stopwatch we timed how long it took to load
this project 10 times, and it averaged 25 seconds per load,
which is definitely unacceptable.
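For reference, a measurement like this can be sketched in a few lines. The AverageDuration helper below is a hypothetical name of my own and not the code used to produce the figures above; it simply runs an action a number of times and divides the total elapsed time by the count.

```csharp
using System;
using System.Diagnostics;

public static class LoadTimingSketch
{
    // Runs the action the given number of times and returns the mean duration.
    public static TimeSpan AverageDuration(Action action, int iterations)
    {
        if (action == null) throw new ArgumentNullException(nameof(action));
        if (iterations <= 0) throw new ArgumentOutOfRangeException(nameof(iterations));

        Stopwatch stopwatch = Stopwatch.StartNew();

        for (int i = 0; i < iterations; i++)
        {
            action();
        }

        stopwatch.Stop();

        return TimeSpan.FromTicks(stopwatch.Elapsed.Ticks / iterations);
    }
}
```

A call such as AverageDuration(() => LoadProject(fileName), 10) would then give the per-load average, where LoadProject stands in for whichever load routine is being measured.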
Reading using an XmlReader
Which brings us to the point of this article: doing the job with an
XmlReader and hopefully improving the performance.
Before we continue though, a caveat:
This is the first time I've tried to use the
XmlReader class, therefore it is possible this article doesn't take the best approach. I also wrote this article at the same time as getting the reader to work in my application, so I've gone back and forth already correcting errors and misconceptions, which at times (and possibly still) left the article a little disjointed. If you spot any errors in this article, please let us know.
The XmlReader seems to operate on the same principle as the
XmlWriter, in that you need to read the data in more or less
the same order as it was written. I suppose the most convenient
analogy is a forward cursor in SQL Server, where you can only
move forward through the records and not back.
Creating the reader
So, first things first - we need to create an object. But the
XmlReader (like the
XmlWriter) is abstract. Fortunately,
exactly like the writer, there is a static
Create method we can use.
Continuing in the reader-is-just-like-writer vein, there is also
an XmlReaderSettings class which you can use to fine tune
how the reader behaves.
Let's get the document opened then. Unlike
XmlDocument, where you can just provide a file name,
here we give the XmlReader a stream.
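A minimal sketch of opening a file this way might look like the following. The ReaderCreationSketch class and its ReadRootName method are hypothetical names, and the two settings shown are just examples of what XmlReaderSettings allows, not necessarily the options the real loader uses.

```csharp
using System;
using System.IO;
using System.Xml;

public static class ReaderCreationSketch
{
    // Opens the file through a stream and returns the name of the first
    // element the reader encounters, or null if there isn't one.
    public static string ReadRootName(string fileName)
    {
        XmlReaderSettings settings = new XmlReaderSettings
        {
            IgnoreComments = true,   // example options only
            IgnoreWhitespace = true
        };

        using (FileStream stream = File.OpenRead(fileName))
        using (XmlReader reader = XmlReader.Create(stream, settings))
        {
            // Read advances to the next node and returns false at the end of the input.
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element)
                {
                    return reader.Name;
                }
            }
        }

        return null;
    }
}
```

XmlReader.Create also has an overload that accepts a file name or URI directly, so passing a stream is a choice rather than a requirement.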
This sets us up nicely. Continuing my analogy from earlier, if
you're familiar with record sets, there's usually a
Read method you call to read the next record in the set.
XmlReader doesn't seem to be different in this respect, as
there's a dedicated
Read method for iterating through all the
nodes in the document. In addition, there are a number of
other read methods for performing more specific actions.
There is also a
NodeType property which lets you know what the
current node type is, such as the start of an element, or the
end of an element.
I'm going to use the
IsStartElement method to work out if the
current node is the start of an element, then perform processing
based on the element name.
Enumerating elements, regardless of their position in the hierarchy
The following snippet will iterate all nodes and check to see if they are the start of an element. Note that this includes top level elements and child elements.
The Name property will return the name of the active node. So
I'm going to compare the name against the names written into the
XML and do custom processing for each.
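Here is a minimal sketch of such a loop, assuming a hypothetical document in which uri is the only element name of interest. The ElementEnumerationSketch class is made up for illustration, and the counters simply stand in for whatever custom processing each element would need.

```csharp
using System;
using System.IO;
using System.Xml;

public static class ElementEnumerationSketch
{
    // Walks every node, and for each start element dispatches on its name.
    public static void Count(string xml, out int uriCount, out int otherCount)
    {
        uriCount = 0;
        otherCount = 0;

        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            while (reader.Read())
            {
                if (!reader.IsStartElement())
                {
                    continue; // text, end elements and so on
                }

                switch (reader.Name)
                {
                    case "uri":
                        uriCount++;   // custom processing for uri elements would go here
                        break;
                    default:
                        otherCount++; // any element this sketch doesn't recognise
                        break;
                }
            }
        }
    }
}
```

Note that the uri case fires for every element with that name, whatever its depth in the document.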
Reading attributes on the active element
I mentioned above that there are a number of Read* methods.
There are also several
Move* methods. The one that caught my eye was
MoveToNextAttribute, which I'm going to use for
converting attributes to property values.
The Value property will return the value of the current node.
If MoveToNextAttribute returns true, then I know I'm on a
valid attribute and I can use the aforementioned
Value property to update property assignments.
The following snippet demonstrates this approach.
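As a minimal sketch of the attribute loop, assuming hypothetical names throughout (AttributeLoopSketch, ReadAttributes), the code below collects every attribute of the first matching element into a dictionary. Real load code would assign to properties instead.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class AttributeLoopSketch
{
    // Returns the attributes of the first element with the given name.
    public static Dictionary<string, string> ReadAttributes(string xml, string elementName)
    {
        Dictionary<string, string> values = new Dictionary<string, string>();

        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            while (reader.Read())
            {
                if (reader.IsStartElement() && reader.Name == elementName)
                {
                    // From the element, the first call moves to the first attribute.
                    // The method returns false once there are no more attributes.
                    while (reader.MoveToNextAttribute())
                    {
                        values[reader.Name] = reader.Value;
                    }

                    break;
                }
            }
        }

        return values;
    }
}
```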
This is actually quite a lot of work. Another alternative is to
use the GetAttribute method - this reads an attribute value
without moving the reader. I found this very handy when I was
loading an object whose identifying property wasn't the first
attribute in the XML block. It also takes up a lot less code.
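A minimal sketch of the GetAttribute alternative, again with hypothetical names (GetAttributeSketch, ReadAttribute), could look like this:

```csharp
using System;
using System.IO;
using System.Xml;

public static class GetAttributeSketch
{
    // Returns the named attribute of the first matching element,
    // or null if the element or the attribute doesn't exist.
    public static string ReadAttribute(string xml, string elementName, string attributeName)
    {
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == elementName)
                {
                    // The reader stays on the element, so attributes can be
                    // fetched in any order regardless of how they were written.
                    return reader.GetAttribute(attributeName);
                }
            }
        }

        return null;
    }
}
```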
Reading the content value of an element
I've now got two values out of hundreds in the file loaded and I'm finished with that element. Or am I? Actually I'm not - the original save code demonstrates that in addition to a pair of attributes, we're also saving data directly into the element.
As we have been iterating attributes, the active node is
the last attribute, not the original element. Fortunately
there's another method we can use -
MoveToContent. This time
though, we can't use the
Value property. Instead, we'll call the
ReadString method, giving us the following snippet:
I've included a call to
IsStartElement in the above snippet as
I found if I called
MoveToContent when I was already on a
content node (for example if no attributes were present), then
it skipped the current node and moved to the next one.
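To tie the attribute loop and the content read together, here is a minimal sketch with hypothetical names (ElementContentSketch, ReadUriText). Rather than calling IsStartElement, it checks the node type explicitly and only calls MoveToContent when the reader is still sitting on an attribute, which is one way to avoid moving when no attributes were present.

```csharp
using System;
using System.IO;
using System.Xml;

public static class ElementContentSketch
{
    // Reads the attributes of the first uri element, then its text content.
    public static string ReadUriText(string xml, out int attributeCount)
    {
        attributeCount = 0;

        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            while (reader.Read())
            {
                if (reader.NodeType != XmlNodeType.Element || reader.Name != "uri")
                {
                    continue;
                }

                while (reader.MoveToNextAttribute())
                {
                    attributeCount++; // real code would read Name and Value here
                }

                // If any attributes were visited the reader is on the last one;
                // from an attribute, MoveToContent returns to the owning element.
                if (reader.NodeType == XmlNodeType.Attribute)
                {
                    reader.MoveToContent();
                }

                return reader.ReadString();
            }
        }

        return null;
    }
}
```

ReadElementContentAsString is another XmlReader method that could be used for the final read in place of ReadString.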
If required, you can call
Some node values aren't strings though - in this case the
XmlReader offers a number of strongly typed methods to
return and convert the data for you, such as
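As an illustration of the typed reads, here is a minimal sketch using ReadElementContentAsInt and ReadElementContentAsBoolean, two of the typed methods XmlReader provides. The TypedReadSketch class and the element names used with it are hypothetical, and ReadToFollowing is used only to keep the example short.

```csharp
using System;
using System.IO;
using System.Xml;

public static class TypedReadSketch
{
    // Finds the first element with the given name and converts its content to an int.
    public static int ReadIntElement(string xml, string elementName)
    {
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            if (!reader.ReadToFollowing(elementName))
            {
                throw new XmlException("Element '" + elementName + "' not found.");
            }

            return reader.ReadElementContentAsInt();
        }
    }

    // Same idea for a boolean value.
    public static bool ReadBooleanElement(string xml, string elementName)
    {
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            if (!reader.ReadToFollowing(elementName))
            {
                throw new XmlException("Element '" + elementName + "' not found.");
            }

            return reader.ReadElementContentAsBoolean();
        }
    }
}
```

One thing to watch for: the ReadElementContentAs* methods move the reader past the end of the element, so inside a while (reader.Read()) loop the node that follows would be skipped by the next Read call.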
Processing nodes where the same names are reused for different purposes
In the sample XML document at the start of this article, we have
two different types of nodes named
uri. The top level one
has one purpose, and the children of
The problem we now face is that, as we have a single loop which
processes all elements, the case statement for
uri will be
triggered multiple times. We're going to need some way of
determining which is which.
There are a few ways we could do this, for example:
- Continuing to use the main processing loop, just adding a means of identifying which type of element is being processed
- Adding another loop to process the children of the
- Using the
ReadSubtree method to create a brand new
XmlReader containing the children, and processing that accordingly.
As we already have a loop which handles the elements we should probably reuse this - there'll be a lot of duplicate code if we suddenly start adding new loops.
Unfortunately there doesn't seem to be an equivalent of the parent
functionality of the
XmlDocument class; the closest thing I
could see was the
Depth property. This returned
1 for the
uri node, and
2 for the child versions. You need
to be careful at what point you read this property, as it also
returns 2 when iterating the attributes of the top level uri node.
One workaround would be to use boolean flags to identify the
type of node you are loading. This would also mean checking to
see if the current node is the end of an element, doing
another name comparison, and resetting flags as appropriate.
This might be more reliable (or understandable) than simply
checking node depths; your mileage may vary.
Another alternative could be to combine depth and element start/end in order to push and pop a stack which would represent the current node hierarchy.
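A minimal sketch of the stack idea follows. It is untested against the real project format, relies on element start and end nodes only (Depth isn't needed once the stack is in place), and the HierarchyStackSketch class and the "parent/name" output format are purely illustrative.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class HierarchyStackSketch
{
    // Returns one "parent/name" entry per element, in document order.
    // The document element has no parent, so its entry starts with "/".
    public static List<string> DescribeElements(string xml)
    {
        List<string> results = new List<string>();
        Stack<string> path = new Stack<string>();

        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element)
                {
                    string parent = path.Count > 0 ? path.Peek() : string.Empty;
                    results.Add(parent + "/" + reader.Name);

                    // An empty element such as <uri /> never produces an end
                    // node, so it must not be pushed.
                    if (!reader.IsEmptyElement)
                    {
                        path.Push(reader.Name);
                    }
                }
                else if (reader.NodeType == XmlNodeType.EndElement)
                {
                    path.Pop();
                }
            }
        }

        return results;
    }
}
```

With the parent name available from the top of the stack, a uri element under the document element can be told apart from one nested inside another element.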
In order to get my converted code running, I went with the boolean flag route. I suspect a future version of the crawler format is going to ensure the nodes have unique names so I don't have to do this hoop jumping again though!
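The flag approach can be sketched as follows. This is a simplified illustration rather than the real load code: the BooleanFlagSketch class, the Result type and the links container element are assumptions, chosen only so that the same uri name appears at two levels.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class BooleanFlagSketch
{
    // Hypothetical container for the values this sketch loads.
    public sealed class Result
    {
        public string ProjectUri { get; set; }
        public List<string> Links { get; } = new List<string>();
    }

    public static Result Load(string xml)
    {
        Result result = new Result();
        bool insideLinks = false;

        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element)
                {
                    switch (reader.Name)
                    {
                        case "links":
                            // An empty <links /> has no end node to reset the flag.
                            insideLinks = !reader.IsEmptyElement;
                            break;
                        case "uri":
                            string value = reader.ReadString();
                            if (insideLinks)
                            {
                                result.Links.Add(value);
                            }
                            else
                            {
                                result.ProjectUri = value;
                            }
                            break;
                    }
                }
                else if (reader.NodeType == XmlNodeType.EndElement && reader.Name == "links")
                {
                    insideLinks = false; // leaving the container, reset the flag
                }
            }
        }

        return result;
    }
}
```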
Combined together, the load data code now looks like this:
Which is significantly more code than the original version, and it's only handling a few values.
Using the ReadSubtree Method
The save functionality of crawler projects isn't centralized; child objects such as rules perform their own loading and saving via the following interface:
And the current
XmlDocument based code will call it like this:
None of this code will work now with the switch to
XmlReader, so it all needs changing. For this, I'll create a new interface.
The only difference is the
Read method is now using a
XmlReader rather than a
The next issue is that if I pass the original reader to this interface, the implementer will be able to read outside the boundaries of the element it is supposed to be reading, which could prevent the rest of the document from loading successfully.
We can resolve this particular issue by calling the
ReadSubtree method, which returns a brand new XmlReader
object that only contains the active element and its children.
This means our other settings objects can happily (mis)use the
passed reader without affecting the underlying load.
Note in the snippet below that we have wrapped the new reader in
a using statement. The MSDN documentation states that the result
of ReadSubtree should be closed before you continue reading
from the original reader.
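Here is a minimal sketch of that pattern. The IReaderLoadable interface, the RuleSketch class and the rule / expression names are hypothetical stand-ins, not the actual crawler types.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical stand-in for the reader based interface.
public interface IReaderLoadable
{
    void Read(XmlReader reader);
}

public sealed class RuleSketch : IReaderLoadable
{
    public string Expression { get; private set; }

    public void Read(XmlReader reader)
    {
        // Reading to the very end is safe here: a subtree reader stops at
        // the end of the element it was created from.
        while (reader.Read())
        {
            if (reader.NodeType == XmlNodeType.Element && reader.Name == "rule")
            {
                Expression = reader.GetAttribute("expression");
            }
        }
    }
}

public static class ReadSubtreeSketch
{
    public static List<RuleSketch> LoadRules(string xml)
    {
        List<RuleSketch> rules = new List<RuleSketch>();

        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "rule")
                {
                    RuleSketch rule = new RuleSketch();

                    // Close the subtree reader before touching the outer reader again.
                    using (XmlReader subtree = reader.ReadSubtree())
                    {
                        rule.Read(subtree);
                    }

                    rules.Add(rule);
                }
            }
        }

        return rules;
    }
}
```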
Getting an XmlDocument from an XmlReader
One of the issues I did have was with classes which extended the load
behaviour of an existing class. For example, one abstract class
has a number of base properties, which I easily converted to use
XmlReader. However, this class is inherited by other classes
and these load additional properties. Using the loop method
outlined above it wasn't possible for these child classes to
read their data, as the reader had already been fully read. I
didn't want these derived classes to have to do the loading
of base properties, and I didn't want to implement any half
thought out idea. So, instead these classes continue to use the
original loading of the
XmlDocument. So, given a source of an
XmlReader, how do you get an XmlDocument?
Turns out this is also very simple - the
Load method of the
XmlDocument can accept a reader. The only disadvantage is that the
constructor of the
XmlDocument doesn't support this, which
means you have to explicitly declare a document, load it, then
pass it on, as demonstrated below.
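A minimal sketch of this, with hypothetical names (DocumentFromReaderSketch, LoadCurrentElement), might be as follows. It combines Load with ReadSubtree so that only the current element ends up in the document; that combination is my assumption about a safe way to do it, not necessarily what the real code does.

```csharp
using System;
using System.IO;
using System.Xml;

public static class DocumentFromReaderSketch
{
    // Builds a standalone XmlDocument from the element the reader is
    // currently positioned on. ReadSubtree throws if the reader isn't
    // on an element.
    public static XmlDocument LoadCurrentElement(XmlReader reader)
    {
        // There is no constructor overload taking a reader, so the document
        // is declared first and then loaded.
        XmlDocument document = new XmlDocument();

        using (XmlReader subtree = reader.ReadSubtree())
        {
            document.Load(subtree);
        }

        return document;
    }
}
```

The returned document can then be handed to the existing XPath based loading code, while the original reader carries on from the end of that element.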
Fortunately these classes aren't used frequently and so they shouldn't adversely affect the performance tuning I'm trying to do.
I could have used the
GetAttribute method I discussed earlier,
as this doesn't move the reader, but firstly I didn't discover
that method until after I'd written this section of the article
and I thought it had enough value to remain, and secondly I
don't think there is an equivalent for elements.
The final verdict
Using the XmlReader is certainly long winded compared to the
original code. The core of the original code is around 100
lines. The core of the new code is more than triple this. I'll
probably replace all the "move to next attribute" loops with
direct calls to
GetAttribute, which will cut down the amount of
code a fair bit. I may also try to do a generic approach using
reflection, although this will then have its own performance
implications.
However, the XML load performance increase was certainly worth the extra code - the average went from 25 seconds down to 12 seconds. This is still quite slow and I certainly want to improve it further, but at less than half the original load time I'm pleased with the result.
You also need to be careful when writing the document. In Cyotek
crawler projects, as we are using XPath to query an entire
document, we can load values no matter where they are located.
When using an
XmlReader, the values are read in the same order
as they were written - so if you have saved a critical piece of
information near the end of the document, but you require it
when loading information at the start, you're going to run into
problems.
- 2010-11-05 - First published
- 2020-11-21 - Updated formatting