Querying the complete plays of Shakespeare using LINQ to XML by ThinqLinq

I was working to come up with some creative uses of LINQ to XML for a new talk I'm giving at the Huntsville, AL Code Camp, and I figured it would be good to include a sample that queries a large XML document. Remembering that the complete works of Shakespeare were available in XML form, I did a quick search and found a version at http://metalab.unc.edu/bosak/xml/eg/shaks200.zip. The archive stores each play in its own XML file. Since I wanted to find out which parts had the most lines across all plays, I needed them in a single XML document. Rather than merging them manually, I whipped up a quick LINQ query that loads each file in the folder into an XElement, giving me a sequence of XElements:

Dim plays = _
    From file In New System.IO.DirectoryInfo("C:\projects\ShakespeareXml").GetFiles() _
    Where file.Extension.Equals(".xml", StringComparison.CurrentCultureIgnoreCase) _
    Let doc = XElement.Load(file.FullName) _
    Select doc

OK, with that out of the way, what I really wanted was a single XML document containing the resulting nodes. That's pretty easy using XML literals. Just wrap the query in a new root element:

Dim plays = _
    <Plays>
        <%= From file In New System.IO.DirectoryInfo("C:\projects\ShakespeareXml").GetFiles() _
            Where file.Extension.Equals(".xml", StringComparison.CurrentCultureIgnoreCase) _
            Let doc = XElement.Load(file.FullName) _
            Select doc %>
    </Plays>

Easy. Now I have a new XML document containing the complete plays of Shakespeare. Now, what can we do with it... Well, we can get a count of the plays in one line:

Console.WriteLine("Plays found: " & plays.<PLAY>.Count.ToString)

We could have done that without putting the plays into a new document, but the count does confirm that all 37 plays are represented, so we know the first query worked. Next, let's count the number of lines (LINE) for each character (SPEAKER). The XML document groups each set of lines under a parent node called SPEECH, which contains the SPEAKER element and a series of LINE elements. For example, here's the beginning of Juliet's famous "Romeo, Romeo" speech:

<SPEECH>
    <SPEAKER>JULIET</SPEAKER>
    <LINE>O Romeo, Romeo! wherefore art thou Romeo?</LINE>
    <LINE>Deny thy father and refuse thy name;</LINE>
    <LINE>Or, if thou wilt not, be but sworn my love,</LINE>
    <LINE>And I'll no longer be a Capulet.</LINE>
</SPEECH>

So to achieve the goal of counting lines by character, we find the descendant SPEECH nodes of the plays element (plays...<SPEECH>) and group them by speaker. Then we project out the name of the speaker and the number of lines they have. We don't care about the bit roles, so we'll sort the results in descending order by number of lines (LineCount) and limit them to the top 50 entries. Here's the resulting query:

Dim mostLines = _
    From speech In plays...<SPEECH> _
    Group By key = speech.<SPEAKER>.Value Into Group _
    Select Speaker = key, _
           LineCount = Group.<LINE>.Count _
    Order By LineCount Descending _
    Take 50
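To try the grouping query without downloading the full archive, here is a minimal self-contained sketch. It assumes a console project targeting .NET 3.5 or later; the module name SpeakerLineCountSketch and the tiny inline Plays document are stand-ins I made up for illustration, not part of the Shakespeare files.

```vb
Option Infer On

Imports System
Imports System.Linq
Imports System.Xml.Linq

Module SpeakerLineCountSketch
    Sub Main()
        ' Hypothetical miniature stand-in for the combined <Plays> document.
        Dim plays = _
            <Plays>
                <PLAY>
                    <SPEECH>
                        <SPEAKER>JULIET</SPEAKER>
                        <LINE>O Romeo, Romeo! wherefore art thou Romeo?</LINE>
                        <LINE>Deny thy father and refuse thy name;</LINE>
                    </SPEECH>
                    <SPEECH>
                        <SPEAKER>ROMEO</SPEAKER>
                        <LINE>Shall I hear more, or shall I speak at this?</LINE>
                    </SPEECH>
                    <SPEECH>
                        <SPEAKER>JULIET</SPEAKER>
                        <LINE>Thou art thyself, though not a Montague.</LINE>
                    </SPEECH>
                </PLAY>
            </Plays>

        ' Same shape as the query above: group every SPEECH by its first
        ' SPEAKER, then count the LINE children across the whole group.
        Dim mostLines = _
            From speech In plays...<SPEECH> _
            Group By key = speech.<SPEAKER>.Value Into Group _
            Select Speaker = key, LineCount = Group.<LINE>.Count() _
            Order By LineCount Descending _
            Take 50

        ' The query runs here, when it is enumerated.
        For Each entry In mostLines
            Console.WriteLine("{0}: {1}", entry.Speaker, entry.LineCount)
        Next
    End Sub
End Module
```

With this sample data, JULIET should come out first with three lines and ROMEO second with one. One thing to keep in mind on the real data: the key is only the speaker's name, so identically named parts in different plays would be counted as a single speaker.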

The amazing thing about this process is that running all three queries, including the one that loads the full XML from the various files, takes less than a second. I haven't had time to do a full performance test, including memory load, but the initial results are quite impressive!
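For anyone who wants to measure this on their own machine, a rough sketch using System.Diagnostics.Stopwatch follows. The module name TimingSketch is made up, and the folder path is simply the one used earlier in this post. Note that a LINQ query is deferred until it is enumerated, so the sketch calls ToList() to force the grouping query to run inside the timed region. The XML literal, by contrast, loads the files as soon as the Plays element is built.

```vb
Option Infer On

Imports System
Imports System.Diagnostics
Imports System.Linq
Imports System.Xml.Linq

Module TimingSketch
    Sub Main()
        Dim timer = Stopwatch.StartNew()

        ' Load every play into one document (same approach as above).
        ' Assumes the archive was extracted to this folder.
        Dim plays = _
            <Plays>
                <%= From file In New System.IO.DirectoryInfo("C:\projects\ShakespeareXml").GetFiles() _
                    Where file.Extension.Equals(".xml", StringComparison.CurrentCultureIgnoreCase) _
                    Select XElement.Load(file.FullName) %>
            </Plays>

        ' ToList() forces the deferred query to execute now, so its cost
        ' is included in the measurement.
        Dim mostLines = ( _
            From speech In plays...<SPEECH> _
            Group By key = speech.<SPEAKER>.Value Into Group _
            Select Speaker = key, LineCount = Group.<LINE>.Count() _
            Order By LineCount Descending _
            Take 50).ToList()

        timer.Stop()
        Console.WriteLine("Speakers returned: {0}", mostLines.Count)
        Console.WriteLine("Elapsed: {0} ms", timer.ElapsedMilliseconds)
    End Sub
End Module
```

This only measures wall-clock time for a single run. It says nothing about memory load, and the first run will include disk and JIT warm-up costs, so treat the number as a ballpark.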

If you have other creative uses of LINQ to XML, let me know. I'd love to include them in future presentations. Also, if you're in the Huntsville, AL area on 2/23/2008, head on over to the Code Camp and see the entire presentation in person.
