LINQ to CSV using DynamicObject and TextFieldParser by ThinqLinq

LINQ to CSV using DynamicObject and TextFieldParser

In the first post of this series, we parsed our CSV file by simply splitting each line on a comma. While this works for simple files, it becomes problematic when consuming CSV files where individual fields also contains commas. Consider the following sample input:

CustomerID,COMPANYNAME,Contact Name,CONTACT_TITLE
ALFKI,Alfreds Futterkiste,Maria Anders,"Sales Representative"
ANATR,Ana Trujillo Emparedados y helados,Ana Trujillo,"Owner, Operator"
ANTON,Antonio Moreno Taqueria,Antonio Moreno,"Owner"

Typically when a field in a CSV file includes a comma, the field is quote escaped to designate that the comma is part of the field and not a delimiter. In the previous versions of this parser, we didn’t handle these cases. As a result the following unit test would fail given this sample data:


    <TestMethod()>
    Public Sub TestCommaEscaping()
        Dim data = New DynamicCsvEnumerator("C:\temp\Customers.csv")
        Dim query = From c In data
                    Where c.ContactTitle.Contains(",")
                    Select c.ContactTitle

        Assert.AreEqual(1, query.Count)
        Assert.AreEqual("Owner, Operator", query.First)
    End Sub

We could add code to handle the various escaping scenarios here. However, as Jonathan pointed out in his comment to my first post there are already methods that can do CSV parsing in the .Net framework. One of the most flexible ones is the TextFieldParser in the Microsoft.VisualBasic.FileIO namespace. If you code in C# instead of VB, you can simply add a reference to this namespace and access the power from your language of choice.

Retrofiting our existing implementation to use the TextFieldParser is fairly simple. We begin by changing the _FileStream object to being a TextFieldParser rather than a FileStream. We keep this as a class level field in order to stream through our data as we iterate over the rows.

In the GetEnumerator we then instantiate our TextFieldParser and set the delimiter information. Once that is configured, we get the array of header field names by calling the ReadFields method.


    Public Function GetEnumerator() As IEnumerator(Of Object) _
        Implements IEnumerable(Of Object).GetEnumerator

        _FileStream = New Microsoft.VisualBasic.FileIO.TextFieldParser(_filename)
        _FileStream.Delimiters = {","}
        _FileStream.HasFieldsEnclosedInQuotes = True
        _FileStream.TextFieldType = FileIO.FieldType.Delimited

        Dim fields = _FileStream.ReadFields
        _FieldNames = New Dictionary(Of String, Integer)
        For i = 0 To fields.Length - 1
            _FieldNames.Add(GetSafeFieldName(fields(i)), i)
        Next
        _CurrentRow = New DynamicCsv(_FileStream.ReadFields, _FieldNames)

        Return Me
    End Function

    Public Function MoveNext() As Boolean Implements IEnumerator.MoveNext
        Dim line = _FileStream.ReadFields
        If line IsNot Nothing AndAlso line.Length > 0 Then
            _CurrentRow = New DynamicCsv(line, _FieldNames)
            Return True
        Else
            Return False
        End If
    End Function

While we are at it, we also change our MoveNext method to call ReadFields to get the parsed string array of the parsed values in the next line. If this is the last line, the array is empty and we return false in the MoveNext to stop the enumeration. We had to make one other change here because in the old version, we passed the full unparsed line in the constructor of the DynamicCsv type and did the parsing there. Since our TextFieldParser will handle that for use, we’ll add an overloaded constructor to our DynamicCsv DynamicObject accepting the pre parsed string array:


Public Class DynamicCsv
    Inherits DynamicObject

    Private _fieldIndex As Dictionary(Of String, Integer)
    Private _RowValues() As String

    Friend Sub New(ByVal values As String(),
                   ByVal fieldIndex As Dictionary(Of String, Integer))
        _RowValues = values
        _fieldIndex = fieldIndex
    End Sub

With these changes, now we can run our starting unit test including the comma in the Contact Title of the second record and it now passes.

If you like this solution, feel free to download the completed Dynamic CSV Enumerator library and kick the tires a bit. There is no warrantee expressed or implied, but please let me know if you find it helpful and any changes you would recommend.

Posted on - Comment
Categories: LINQ - VB - VB Dev Center - Visual Studio - Dynamic -
comments powered by Disqus