LINQ to CSV using DynamicObject
When we wrote LINQ in Action we included a sample of how to simply query against a CSV file using the following LINQ query:
From line In File.ReadAllLines(“books.csv”)
Where Not Line.StartsWith(“#”)
Let parts = line.Split(“,”c)
Select Isbn = parts(0), Title = parts(1), Publisher = parts(3)
While this code does make dealing with CSV easier, it would be nicer if we could refer to our columns as if they were properties where the property name came from the header row in the CSV file, perhaps using syntax like the following:
From line In MyCsvFile
Select line.Isbn, line.Title, line.Publisher
With strongly typed (compile time) structures, it is challenging to do this when dealing with variable data structures like CSV files. One of the big enhancements that is coming with .Net 4.0 is the inclusion of Dynamic language features, including the new DynamicObject data type. In the past, working with dynamic runtime structures, we were limited to using reflection tricks to access properties that didn't actually exist. The addition of dynamic language constructs offers better ways of dispatching the call request over dynamic types. Let's see what we need to do to expose a CSV row using the new dynamic features in Visual Studio 2010.
First, let's create an object that will represent each row that we are reading. This class will inherit from the new System.Dynamic.DynamicObject base class. This will set up the base functionality to handle the dynamic dispatching for us. All we need to do is add implementation to tell the object how to fetch values based on a supplied field name. We'll implement this by taking a string representing the current row. We'll split that based on the separator (a comma). We also supply a dictionary containing the field names and their index. Given these two pieces of information, we can override the TryGetMember and TrySetMember to Get and Set the property based on the field name:
Imports System.Dynamic
Public Class DynamicCsv
Inherits DynamicObject
Private _fieldIndex As Dictionary(Of String, Integer)
Private _RowValues() As String
Friend Sub New(ByVal currentRow As String,
ByVal fieldIndex As Dictionary(Of String, Integer))
_RowValues = currentRow.Split(","c)
_fieldIndex = fieldIndex
End Sub
Public Overrides Function TryGetMember(ByVal binder As GetMemberBinder,
ByRef result As Object) As Boolean
If _fieldIndex.ContainsKey(binder.Name) Then
result = _RowValues(_fieldIndex(binder.Name))
Return True
End If
Return False
End Function
Public Overrides Function TrySetMember(ByVal binder As SetMemberBinder,
ByVal value As Object) As Boolean
If _fieldIndex.ContainsKey(binder.Name) Then
_RowValues(_fieldIndex(binder.Name)) = value.ToString
Return True
End If
Return False
End Function
End Class
With this in place, now we just need to add a class to handle iterating over the individual rows in our CSV file. As we pointed out in our book, using File.ReadAllLines can be a significant performance bottleneck for large files. Instead we will implement a custom Enumerator. In our customer enumerable, we initialize the process with the GetEnumerator method. This method opens the stream based on the supplied filename. It also sets up our dictionary of field names based on the values in the first row. Because we keep the stream open through the lifetime of this class, we implement IDisposable to clean up the stream.
As we iterate over the results calling MoveNext, we will read each subsequent row and create a DynamicCsv instance object. We return this row as an Object (Dynamic in C#) so that we will be able to consume it as a dynamic type in .Net 4.0. Here's the implementation:
Imports System.Collections
Public Class DynamicCsvEnumerator
Implements IEnumerator(Of Object)
Implements IEnumerable(Of Object)
Private _FileStream As IO.TextReader
Private _FieldNames As Dictionary(Of String, Integer)
Private _CurrentRow As DynamicCsv
Private _filename As String
Public Sub New(ByVal fileName As String)
_filename = fileName
End Sub
Public Function GetEnumerator() As IEnumerator(Of Object) _
Implements IEnumerable(Of Object).GetEnumerator
_FileStream = New IO.StreamReader(_filename)
Dim headerRow = _FileStream.ReadLine
Dim fields = headerRow.Split(","c)
_FieldNames = New Dictionary(Of String, Integer)
For i = 0 To fields.Length - 1
_FieldNames.Add(GetSafeFieldName(fields(i)), i)
Next
Return Me
End Function
Function GetSafeFieldName(ByVal input As String) As String
Return input.Replace(" ", "_")
End Function
Public Function GetEnumerator1() As IEnumerator Implements IEnumerable.GetEnumerator
Return GetEnumerator()
End Function
Public ReadOnly Property Current As Object Implements IEnumerator(Of Object).Current
Get
Return _CurrentRow
End Get
End Property
Public ReadOnly Property Current1 As Object Implements IEnumerator.Current
Get
Return Current
End Get
End Property
Public Function MoveNext() As Boolean Implements IEnumerator.MoveNext
Dim line = _FileStream.ReadLine
If line IsNot Nothing AndAlso line.Length > 0 Then
_CurrentRow = New DynamicCsv(line, _FieldNames)
Return True
Else
Return False
End If
End Function
Public Sub Reset() Implements IEnumerator.Reset
_FileStream.Close()
GetEnumerator()
End Sub
#Region "IDisposable Support"
Private disposedValue As Boolean ' To detect redundant calls
' IDisposable
Protected Overridable Sub Dispose(ByVal disposing As Boolean)
If Not Me.disposedValue Then
If disposing Then
_FileStream.Dispose()
End If
_CurrentRow = Nothing
End If
Me.disposedValue = True
End Sub
' This code added by Visual Basic to correctly implement the disposable pattern.
Public Sub Dispose() Implements IDisposable.Dispose
Dispose(True)
GC.SuppressFinalize(Me)
End Sub
#End Region
End Class
Now that we have our custom enumerable, we can consume it using standard dot notation by turning Option Strict Off in Visual Basic or referencing it as a Dynamic type in C#:
VB:
Public Sub OpenCsv()
Dim data = New DynamicCsvEnumerator("C:\temp\Customers.csv")
For Each item In data
TestContext.WriteLine(item.CompanyName & ": " & item.Contact_Name)
Next
End Sub
C#:
[TestMethod]
public void OpenCsvSharp()
{
var data = new DynamicCsvEnumerator(@"C:\temp\customers.csv");
foreach (dynamic item in data)
{
TestContext.WriteLine(item.CompanyName + ": " + item.Contact_Name);
}
}
In addition, since we are exposing this as an IEnumerable, we can use all of the same LINQ operators over our custom class:
VB:
Dim query = From c In data
Where c.City = "London"
Order By c.CompanyName
Select c.Contact_Name, c.CompanyName
For Each item In query
TestContext.WriteLine(item.CompanyName & ": " & item.Contact_Name)
Next
C#:
[TestMethod]
public void LinqCsvSharp()
{
var data = new DynamicCsvEnumerator(@"C:\temp\customers.csv");
var query = from dynamic c in data
where c.City == "London"
orderby c.CompanyName
select new { c.Contact_Name, c.CompanyName };
foreach (var item in query)
{
TestContext.WriteLine(item.CompanyName + ": " + item.Contact_Name);
}
}
Note: This sample makes a couple assumptions about the underlying data and implementation. First, we take an extra step to translate header strings that contain spaces to replace the space with an underscore. While including spaces is legal in the csv header, it isn't legal in VB to say: " MyObject.Some Property With Spaces". Thus we'll manage this by requiring the code to access this property as follows: "MyObject.Some_Property_With_Spaces".
Second, this implementation doesn't handle strings that contain commas. Typically fields in CSV files that contain commas are wrapped by quotes (subsequently quotes are likewise escaped by double quotes). This implementation does not account for either situation. I purposely did not incorporate those details in order to focus on the use of DynamicObject in this sample. I welcome enhancement suggestions to make this more robust.