LINQ to CSV using DynamicObject and TextFieldParser
In the first post of this series, we parsed our CSV file by simply splitting each line on a comma. While this works for simple files, it becomes problematic when consuming CSV files where individual fields also contains commas. Consider the following sample input:
CustomerID,COMPANYNAME,Contact Name,CONTACT_TITLE ALFKI,Alfreds Futterkiste,Maria Anders,"Sales Representative" ANATR,Ana Trujillo Emparedados y helados,Ana Trujillo,"Owner, Operator" ANTON,Antonio Moreno Taqueria,Antonio Moreno,"Owner"
Typically when a field in a CSV file includes a comma, the field is quote escaped to designate that the comma is part of the field and not a delimiter. In the previous versions of this parser, we didn’t handle these cases. As a result the following unit test would fail given this sample data:
<TestMethod()>
Public Sub TestCommaEscaping()
Dim data = New DynamicCsvEnumerator("C:\temp\Customers.csv")
Dim query = From c In data
Where c.ContactTitle.Contains(",")
Select c.ContactTitle
Assert.AreEqual(1, query.Count)
Assert.AreEqual("Owner, Operator", query.First)
End Sub
We could add code to handle the various escaping scenarios here. However, as Jonathan pointed out in his comment to my first post there are already methods that can do CSV parsing in the .Net framework. One of the most flexible ones is the TextFieldParser in the Microsoft.VisualBasic.FileIO namespace. If you code in C# instead of VB, you can simply add a reference to this namespace and access the power from your language of choice.
Retrofiting our existing implementation to use the TextFieldParser is fairly simple. We begin by changing the _FileStream object to being a TextFieldParser rather than a FileStream. We keep this as a class level field in order to stream through our data as we iterate over the rows.
In the GetEnumerator we then instantiate our TextFieldParser and set the delimiter information. Once that is configured, we get the array of header field names by calling the ReadFields method.
Public Function GetEnumerator() As IEnumerator(Of Object) _
Implements IEnumerable(Of Object).GetEnumerator
_FileStream = New Microsoft.VisualBasic.FileIO.TextFieldParser(_filename)
_FileStream.Delimiters = {","}
_FileStream.HasFieldsEnclosedInQuotes = True
_FileStream.TextFieldType = FileIO.FieldType.Delimited
Dim fields = _FileStream.ReadFields
_FieldNames = New Dictionary(Of String, Integer)
For i = 0 To fields.Length - 1
_FieldNames.Add(GetSafeFieldName(fields(i)), i)
Next
_CurrentRow = New DynamicCsv(_FileStream.ReadFields, _FieldNames)
Return Me
End Function
Public Function MoveNext() As Boolean Implements IEnumerator.MoveNext
Dim line = _FileStream.ReadFields
If line IsNot Nothing AndAlso line.Length > 0 Then
_CurrentRow = New DynamicCsv(line, _FieldNames)
Return True
Else
Return False
End If
End Function
While we are at it, we also change our MoveNext method to call ReadFields to get the parsed string array of the parsed values in the next line. If this is the last line, the array is empty and we return false in the MoveNext to stop the enumeration. We had to make one other change here because in the old version, we passed the full unparsed line in the constructor of the DynamicCsv type and did the parsing there. Since our TextFieldParser will handle that for use, we’ll add an overloaded constructor to our DynamicCsv DynamicObject accepting the pre parsed string array:
Public Class DynamicCsv
Inherits DynamicObject
Private _fieldIndex As Dictionary(Of String, Integer)
Private _RowValues() As String
Friend Sub New(ByVal values As String(),
ByVal fieldIndex As Dictionary(Of String, Integer))
_RowValues = values
_fieldIndex = fieldIndex
End Sub
With these changes, now we can run our starting unit test including the comma in the Contact Title of the second record and it now passes.
If you like this solution, feel free to download the completed Dynamic CSV Enumerator library and kick the tires a bit. There is no warrantee expressed or implied, but please let me know if you find it helpful and any changes you would recommend.