Dynamic Related Posts

LINQ to CSV using DynamicObject and TextFieldParser

In the first post of this series, we parsed our CSV file by simply splitting each line on commas. While this works for simple files, it breaks down when individual fields also contain commas. Consider the following sample input:

CustomerID,COMPANYNAME,Contact Name,CONTACT_TITLE
ALFKI,Alfreds Futterkiste,Maria Anders,"Sales Representative"
ANATR,Ana Trujillo Emparedados y helados,Ana Trujillo,"Owner, Operator"
ANTON,Antonio Moreno Taqueria,Antonio Moreno,"Owner"

Typically, when a field in a CSV file includes a comma, the field is wrapped in quotes to indicate that the comma is part of the field and not a delimiter. The previous versions of this parser didn't handle these cases. As a result, the following unit test would fail given this sample data:


    <TestMethod()>
    Public Sub TestCommaEscaping()
        Dim data = New DynamicCsvEnumerator("C:\temp\Customers.csv")
        Dim query = From c In data
                    Where c.ContactTitle.Contains(",")
                    Select c.ContactTitle

        Assert.AreEqual(1, query.Count)
        Assert.AreEqual("Owner, Operator", query.First)
    End Sub
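To make the failure concrete, here is a minimal sketch of what the naive split does to the second data row. The module name SplitDemo is only for illustration and is not part of the library:

```vb
Imports System

Module SplitDemo
    Sub Main()
        ' The ANATR row from the sample above, with its quoted last field.
        Dim line = "ANATR,Ana Trujillo Emparedados y helados,Ana Trujillo,""Owner, Operator"""

        ' Splitting on every comma ignores the quotes entirely.
        Dim parts = line.Split(","c)

        Console.WriteLine(parts.Length)   ' 5, although the header has only 4 columns
        Console.WriteLine(parts(3))       ' "Owner  (opening quote kept, rest of the field lost)
    End Sub
End Module
```

The fourth column comes back as a fragment with a stray quote, so the lookup for "Owner, Operator" in the test can never match.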

We could add code to handle the various escaping scenarios here. However, as Jonathan pointed out in his comment on my first post, the .NET Framework already includes classes that can parse CSV. One of the most flexible is the TextFieldParser in the Microsoft.VisualBasic.FileIO namespace. If you code in C# instead of VB, you can simply add a reference to the Microsoft.VisualBasic assembly and access the same functionality from your language of choice.
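Before wiring it into the enumerator, it may help to see the TextFieldParser on its own. The following is a small sketch rather than part of the library: it feeds two lines of in-memory text through a StringReader instead of opening a file, and the module name is made up for the example.

```vb
Imports System
Imports System.IO
Imports Microsoft.VisualBasic.FileIO

Module TextFieldParserDemo
    Sub Main()
        ' In-memory sample data (an assumption for this sketch; the post reads a file).
        Dim csv = "CustomerID,CONTACT_TITLE" & Environment.NewLine &
                  "ANATR,""Owner, Operator""" & Environment.NewLine

        Using parser As New TextFieldParser(New StringReader(csv))
            parser.TextFieldType = FieldType.Delimited
            parser.SetDelimiters(",")
            parser.HasFieldsEnclosedInQuotes = True

            ' EndOfData becomes True once no more data lines remain.
            While Not parser.EndOfData
                Dim fields = parser.ReadFields()
                Console.WriteLine(String.Join(" | ", fields))
            End While
        End Using
        ' Prints:
        ' CustomerID | CONTACT_TITLE
        ' ANATR | Owner, Operator
    End Sub
End Module
```

The quoted field comes back as a single value with the quotes removed, which is exactly the behavior the failing test needs.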

Retrofitting our existing implementation to use the TextFieldParser is fairly simple. We begin by changing the _FileStream field to be a TextFieldParser rather than a TextReader. We keep this as a class-level field in order to stream through our data as we iterate over the rows.

In the GetEnumerator method, we then instantiate our TextFieldParser and set the delimiter information. Once that is configured, we get the array of header field names by calling the ReadFields method.


    Public Function GetEnumerator() As IEnumerator(Of Object) _
        Implements IEnumerable(Of Object).GetEnumerator

        _FileStream = New Microsoft.VisualBasic.FileIO.TextFieldParser(_filename)
        _FileStream.Delimiters = {","}
        _FileStream.HasFieldsEnclosedInQuotes = True
        _FileStream.TextFieldType = FileIO.FieldType.Delimited

        Dim fields = _FileStream.ReadFields
        _FieldNames = New Dictionary(Of String, Integer)
        For i = 0 To fields.Length - 1
            _FieldNames.Add(GetSafeFieldName(fields(i)), i)
        Next
        ' Only the header is consumed here. MoveNext reads the first data row;
        ' calling ReadFields again at this point would silently skip that row.

        Return Me
    End Function

    Public Function MoveNext() As Boolean Implements IEnumerator.MoveNext
        Dim line = _FileStream.ReadFields
        If line IsNot Nothing AndAlso line.Length > 0 Then
            _CurrentRow = New DynamicCsv(line, _FieldNames)
            Return True
        Else
            Return False
        End If
    End Function

While we are at it, we also change our MoveNext method to call ReadFields, which returns the next line already parsed into a string array. When there are no more lines to read, ReadFields returns Nothing and we return False from MoveNext to stop the enumeration. We had to make one other change here: in the old version, we passed the full unparsed line to the constructor of the DynamicCsv type and did the parsing there. Since our TextFieldParser will handle that for us, we'll add an overloaded constructor to our DynamicCsv DynamicObject that accepts the pre-parsed string array:


Public Class DynamicCsv
    Inherits DynamicObject

    Private _fieldIndex As Dictionary(Of String, Integer)
    Private _RowValues() As String

    Friend Sub New(ByVal values As String(),
                   ByVal fieldIndex As Dictionary(Of String, Integer))
        _RowValues = values
        _fieldIndex = fieldIndex
    End Sub

With these changes in place, our original unit test, which looks for the comma in the Contact Title of the second record, now passes.

If you like this solution, feel free to download the completed Dynamic CSV Enumerator library and kick the tires a bit. There is no warranty expressed or implied, but please let me know if you find it helpful and what changes you would recommend.

Posted on 12/1/2009 3:31:00 PM - Comments(0)
Categories: Dynamic LINQ VB VB Dev Center VS 2010

LINQ to CSV using DynamicObject Part 2

In the last post, I showed how to use DynamicObject to make consuming CSV files easier. In that example, we showed how we can access CSV columns using the standard dot (.) notation that we use on other objects. Using DynamicObject, we can refer to item.CompanyName and item.Contact_Name rather than item(0) and item(1).

While I'm happy about the new syntax, I'm not content with replacing spaces with underscores, as that doesn't agree with the coding guidelines of using Pascal casing for properties. Because we have control over how the accessors work, we can modify the convention. Let's reconsider the CSV file that we're working with. Here's the beginning:

CustomerID,COMPANYNAME,Contact Name,CONTACT_TITLE,Address,City,Region,PostalCode,Country,Phone,Fax
ALFKI,Alfreds Futterkiste,Maria Anders,Sales Representative,Obere Str. 57,Berlin,NULL,12209,Germany,030-0074321,030-0076545
ANATR,Ana Trujillo Emparedados y helados,Ana Trujillo,Owner,Avda. de la Constitución 2222,Mexico D.F.,NULL,5021,Mexico,(5) 555-4729,(5) 555-3745
ANTON,Antonio Moreno Taqueria,Antonio Moreno,Owner,Mataderos  2312,Mexico D.F.,NULL,5023,Mexico,(5) 555-3932,NULL

Notice that the header row mixes several conventions: mixed case, all upper case, words separated by spaces, and words separated by underscores. To standardize this, we could parse the header and force an upper case letter at the beginning of each word. That would take a fair amount of parsing code. As a fan of case insensitive programming languages, I figured that if we just strip the spaces and underscores and work against the strings in a case insensitive manner, I'd be happy. In the end, we'll be able to consume the above CSV with the following code:


Dim data = New DynamicCsvEnumerator("C:\temp\Customers.csv")
Dim query = From c In data
            Where c.City = "London"
            Order By c.CompanyName
            Select c.ContactName, c.CompanyName, c.ContactTitle

To make this change, we change how we parse the header row and the binder name when fetching properties. In our DynamicCsvEnumerator, we already isolated the parsing of the header with a GetSafeFieldName method. Previously we simply returned the input value replacing a space with an underscore. Extending this is trivial:


    Function GetSafeFieldName(ByVal input As String) As String
        'Return input.Replace(" ", "_")
        Return input.
            Replace(" ", "").
            Replace("_", "").
            ToUpperInvariant()
    End Function
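As a quick check, here is a sketch that runs the same function over the first four header names from the sample file. The wrapping module is only scaffolding for the example:

```vb
Imports System

Module SafeFieldNameDemo
    ' Same logic as the GetSafeFieldName method shown above.
    Function GetSafeFieldName(ByVal input As String) As String
        Return input.Replace(" ", "").Replace("_", "").ToUpperInvariant()
    End Function

    Sub Main()
        For Each header In "CustomerID,COMPANYNAME,Contact Name,CONTACT_TITLE".Split(","c)
            Console.WriteLine(header & " -> " & GetSafeFieldName(header))
        Next
        ' Prints:
        ' CustomerID -> CUSTOMERID
        ' COMPANYNAME -> COMPANYNAME
        ' Contact Name -> CONTACTNAME
        ' CONTACT_TITLE -> CONTACTTITLE
    End Sub
End Module
```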

That's it for setting up the header parsing changes. We don't need to worry about spaces in the incoming property accessor because spaces are not legal in a member name. I'll also assume that, by convention, the programmer won't use underscores in the member names. Thus, the only change we need to make in our property accessor is to uppercase the incoming field name to handle the case insensitivity feature. Here's the revised TryGetMember implementation.


    Public Overrides Function TryGetMember(ByVal binder As GetMemberBinder,
                                           ByRef result As Object) As Boolean
        Dim fieldName = binder.Name.ToUpperInvariant()
        If _fieldIndex.ContainsKey(fieldName) Then
            result = _RowValues(_fieldIndex(fieldName))
            Return True
        End If
        Return False
    End Function

All we do is force the field name to upper case and then we can look it up in the dictionary of field indexes that we set up last time. Simple yet effective.
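For comparison, a different way to get case insensitive lookups is to give the dictionary a case insensitive comparer, so neither the keys nor the incoming name need to be uppercased. This is an alternative sketch, not what the library above does; spaces and underscores would still have to be stripped from the header names separately.

```vb
Imports System
Imports System.Collections.Generic

Module CaseInsensitiveLookupDemo
    Sub Main()
        ' The comparer makes key matching ignore case.
        Dim fieldIndex As New Dictionary(Of String, Integer)(StringComparer.OrdinalIgnoreCase)
        fieldIndex.Add("ContactName", 2)

        Console.WriteLine(fieldIndex.ContainsKey("CONTACTNAME"))  ' True
        Console.WriteLine(fieldIndex("contactname"))              ' 2
    End Sub
End Module
```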

Posted on 12/1/2009 2:07:00 PM - Comments(0)
Categories: LINQ VB Dev Center VS 2010 Dynamic

LINQ to CSV using DynamicObject

When we wrote LINQ in Action we included a sample of how to simply query against a CSV file using the following LINQ query:


From line In File.ReadAllLines("books.csv")
Where Not line.StartsWith("#")
Let parts = line.Split(","c)
Select Isbn = parts(0), Title = parts(1), Publisher = parts(3)

While this code does make dealing with CSV easier, it would be nicer if we could refer to our columns as if they were properties where the property name came from the header row in the CSV file, perhaps using syntax like the following:


From line In MyCsvFile
Select line.Isbn, line.Title, line.Publisher

With strongly typed (compile time) structures, it is challenging to do this when dealing with variable data structures like CSV files. One of the big enhancements coming with .NET 4.0 is the inclusion of dynamic language features, including the new DynamicObject data type. In the past, when working with dynamic runtime structures, we were limited to using reflection tricks to access properties that didn't actually exist. The addition of dynamic language constructs offers better ways of dispatching the call request over dynamic types. Let's see what we need to do to expose a CSV row using the new dynamic features in Visual Studio 2010.

First, let's create an object that will represent each row that we are reading. This class will inherit from the new System.Dynamic.DynamicObject base class. This will set up the base functionality to handle the dynamic dispatching for us. All we need to do is add implementation to tell the object how to fetch values based on a supplied field name. We'll implement this by taking a string representing the current row. We'll split that based on the separator (a comma). We also supply a dictionary containing the field names and their index. Given these two pieces of information, we can override the TryGetMember and TrySetMember to Get and Set the property based on the field name:


Imports System.Dynamic

Public Class DynamicCsv
    Inherits DynamicObject

    Private _fieldIndex As Dictionary(Of String, Integer)
    Private _RowValues() As String

    Friend Sub New(ByVal currentRow As String,
                   ByVal fieldIndex As Dictionary(Of String, Integer))
        _RowValues = currentRow.Split(","c)
        _fieldIndex = fieldIndex
    End Sub

    Public Overrides Function TryGetMember(ByVal binder As GetMemberBinder,
                                           ByRef result As Object) As Boolean
        If _fieldIndex.ContainsKey(binder.Name) Then
            result = _RowValues(_fieldIndex(binder.Name))
            Return True
        End If
        Return False
    End Function

    Public Overrides Function TrySetMember(ByVal binder As SetMemberBinder,
                                           ByVal value As Object) As Boolean
        If _fieldIndex.ContainsKey(binder.Name) Then
            _RowValues(_fieldIndex(binder.Name)) = value.ToString
            Return True
        End If
        Return False
    End Function
End Class
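To see the dispatch mechanism in isolation, here is a minimal sketch that is independent of the CSV code. The Bag class is hypothetical and exists only for this illustration; it stores whatever members are assigned to it and hands them back on request. Late binding requires Option Strict Off.

```vb
Option Strict Off
Imports System
Imports System.Collections.Generic
Imports System.Dynamic

' Hypothetical class for illustration only; not part of the CSV library.
Public Class Bag
    Inherits DynamicObject

    Private ReadOnly _values As New Dictionary(Of String, Object)

    ' Called when a late-bound caller reads a member, e.g. bag.Title
    Public Overrides Function TryGetMember(ByVal binder As GetMemberBinder,
                                           ByRef result As Object) As Boolean
        Return _values.TryGetValue(binder.Name, result)
    End Function

    ' Called when a late-bound caller assigns a member, e.g. bag.Title = "..."
    Public Overrides Function TrySetMember(ByVal binder As SetMemberBinder,
                                           ByVal value As Object) As Boolean
        _values(binder.Name) = value
        Return True
    End Function
End Class

Module DynamicDispatchDemo
    Sub Main()
        ' Declaring the variable As Object makes the member access late bound.
        Dim bag As Object = New Bag()
        bag.Title = "LINQ in Action"
        Console.WriteLine(bag.Title)
    End Sub
End Module
```

DynamicCsv follows the same pattern, except that the member name is resolved to a column index instead of a dictionary entry.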

With this in place, we just need to add a class to handle iterating over the individual rows in our CSV file. As we pointed out in our book, using File.ReadAllLines can be a significant performance bottleneck for large files. Instead we will implement a custom enumerator. In our custom enumerable, we initialize the process with the GetEnumerator method. This method opens the stream based on the supplied filename. It also sets up our dictionary of field names based on the values in the first row. Because we keep the stream open through the lifetime of this class, we implement IDisposable to clean up the stream.

As we iterate over the results by calling MoveNext, we read each subsequent row and create a DynamicCsv instance. We return this row as an Object (dynamic in C#) so that we will be able to consume it as a dynamic type in .NET 4.0. Here's the implementation:


Imports System.Collections

Public Class DynamicCsvEnumerator
    Implements IEnumerator(Of Object)
    Implements IEnumerable(Of Object)

    Private _FileStream As IO.TextReader
    Private _FieldNames As Dictionary(Of String, Integer)
    Private _CurrentRow As DynamicCsv
    Private _filename As String

    Public Sub New(ByVal fileName As String)
        _filename = fileName
    End Sub

    Public Function GetEnumerator() As IEnumerator(Of Object) _
        Implements IEnumerable(Of Object).GetEnumerator

        _FileStream = New IO.StreamReader(_filename)
        Dim headerRow = _FileStream.ReadLine
        Dim fields = headerRow.Split(","c)
        _FieldNames = New Dictionary(Of String, Integer)
        For i = 0 To fields.Length - 1
            _FieldNames.Add(GetSafeFieldName(fields(i)), i)
        Next
       
        Return Me
    End Function

    Function GetSafeFieldName(ByVal input As String) As String
        Return input.Replace(" ", "_")
    End Function

    Public Function GetEnumerator1() As IEnumerator Implements IEnumerable.GetEnumerator
        Return GetEnumerator()
    End Function

    Public ReadOnly Property Current As Object Implements IEnumerator(Of Object).Current
        Get
            Return _CurrentRow
        End Get
    End Property

    Public ReadOnly Property Current1 As Object Implements IEnumerator.Current
        Get
            Return Current
        End Get
    End Property

    Public Function MoveNext() As Boolean Implements IEnumerator.MoveNext
        Dim line = _FileStream.ReadLine
        If line IsNot Nothing AndAlso line.Length > 0 Then
            _CurrentRow = New DynamicCsv(line, _FieldNames)
            Return True
        Else
            Return False
        End If
    End Function

    Public Sub Reset() Implements IEnumerator.Reset
        _FileStream.Close()
        GetEnumerator()
    End Sub

#Region "IDisposable Support"
    Private disposedValue As Boolean ' To detect redundant calls

    ' IDisposable
    Protected Overridable Sub Dispose(ByVal disposing As Boolean)
        If Not Me.disposedValue Then
            If disposing Then
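                ' Guard against Dispose being called before GetEnumerator opened the stream.
                If _FileStream IsNot Nothing Then _FileStream.Dispose()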
            End If
            _CurrentRow = Nothing
        End If
        Me.disposedValue = True
    End Sub

    ' This code added by Visual Basic to correctly implement the disposable pattern.
    Public Sub Dispose() Implements IDisposable.Dispose
        Dispose(True)
        GC.SuppressFinalize(Me)
    End Sub
#End Region

End Class

Now that we have our custom enumerable, we can consume it using standard dot notation by turning Option Strict Off in Visual Basic or by referencing it as a dynamic type in C#:

VB:



<TestMethod()>
Public Sub OpenCsv()
    Dim data = New DynamicCsvEnumerator("C:\temp\Customers.csv")
    For Each item In data
        TestContext.WriteLine(item.CompanyName & ": " & item.Contact_Name)
    Next
End Sub

C#:


[TestMethod]
public void OpenCsvSharp()
{
    var data = new DynamicCsvEnumerator(@"C:\temp\customers.csv");
    foreach (dynamic item in data)
    {
        TestContext.WriteLine(item.CompanyName + ": " + item.Contact_Name);
    }
}

In addition, since we are exposing this as an IEnumerable, we can use all of the same LINQ operators over our custom class:

VB:


Dim query = From c In data
            Where c.City = "London"
            Order By c.CompanyName
            Select c.Contact_Name, c.CompanyName

For Each item In query
    TestContext.WriteLine(item.CompanyName & ": " & item.Contact_Name)
Next

C#:


[TestMethod]
public void LinqCsvSharp()
{
    var data = new DynamicCsvEnumerator(@"C:\temp\customers.csv");
    var query = from dynamic c in data 
                where c.City == "London"
                orderby c.CompanyName
                select new { c.Contact_Name, c.CompanyName };

    foreach (var item in query)
    {
        TestContext.WriteLine(item.CompanyName + ": " + item.Contact_Name);
    }
}

Note: This sample makes a couple of assumptions about the underlying data and implementation. First, we take an extra step to translate header strings that contain spaces, replacing each space with an underscore. While including spaces is legal in the CSV header, it isn't legal in VB to say: "MyObject.Some Property With Spaces". Thus we'll manage this by requiring the code to access this property as follows: "MyObject.Some_Property_With_Spaces".

Second, this implementation doesn't handle strings that contain commas. Typically, fields in CSV files that contain commas are wrapped in quotes (and quotes inside such a field are in turn escaped by doubling them). This implementation does not account for either situation. I purposely did not incorporate those details in order to focus on the use of DynamicObject in this sample. I welcome enhancement suggestions to make this more robust.

Posted on 11/22/2009 1:40:00 PM - Comments(6)
Categories: LINQ VB Dev Center VB C# Dynamic