Double Quotes in CSV File
I have a lot of data coming in from CSV files. I have many CSV files (~20), with varying amounts of data- but some are quite large (largest file is ~230mb)
I'm trying to import it into a SQL database via SSIS, but the data is a little bit frustrating.
I have data that looks like this:
"Text from vendor ""Vendor Name, Inc."" blah blah", "Next string", "", 1234
Many things to notice here - as well you can imagine some of the difficulties here.
After parsing, this data should have 4 columns-
Column1 Column2 Column3 Column4
Text From Vendor "Vendor Name, Inc." blah blah Next string <blank> 1234
The biggest problems stem from the double quotes mixed in with the comma because it is a comma delimited file with quotes as the text qualifier. The other problem is the double quotes with blank text ... which prevents me from doing a replace on the double quotes ... I think ...
What would be your suggestions to help me parse this out? (don't forget, some of these are large files with about 260,000 records- )
Thanks in advance,
Rob
[1145 byte] By [
recruz] at [2008-1-7]
In addition, if I were to do my own parsing, what would the algorithm be to do that?
It would be something like, Find 1st quote - after first quote, if exists double quote, replace with aribtrary character (#) - then keep parsing through until next single quote is found?
Then I can go around and replace the arbitrary character with the double quotes later on-
Does that sound about right?
When you set up your Flat File Source, on the General Tab you will see a text box called "Text qualifier"
Place a " in the box. It will parse correctly. 
It seems like it is bad form on two levels and I would recommend that you not accept that format from a vendor unless tweaks are made.
- The quotes are used to seperate data in the CSV and allow the meta character, comma, to be allowed in data such as "$1,110.00". By including quotes within the quoted data that breaks form. Request that single quotes be used within double quotes if needed or require an escape of the quote within the data area such as "Ficus said, \"to be or not\". Was his reply", .
- The number in your example should be quoted to, so the form is not broken. Its like any computer grammar, one needs things to be consistent.
Just my opinion...I could be wrong. <g>
trust me, I have that box checked but it's not parsing correctly
Sorry, I missed that extra comma in the double-quoted text. 
Is requesting a change to the file format out of the question? That's a really broken file. If they could give it to you Tab delimited you would be set. I don't like using printable characters for delimiters for this very reason.
You could run a RegEx on the files, but with those large files I don't know if any of the software products available can handle them.
Check out PowerGrep. http://www.powergrep.com/ I would email them first to see if it will choke on those file sizes. You can point it at a directory, and it will loop through all of the files.
If you're good with .NET you could write your own utility using the FileStream class to stream the file in, and do the RegEx replace. File size wouldn't matter in this case.
We have ActiveX script like this to solve this problem:
Function Main()
Set fso = CreateObject("Scripting.FileSystemObject")
Set source = fso.OpenTextFile("_data.txt")
Set destination = fso.CreateTextFile("data_good.txt")
While Not source.AtEndOfStream
destination.writeline fixQuotes(source.Readline)
Wend
Set source = nothing
Set destination = nothing
Set fso = nothing
Main = DTSTaskExecResult_Success
End Function
Function fixQuotes(str)
qiStr = chr(34) & "," & chr(34)
str = Replace(str,qiStr,"|&|")
str = Replace(str,chr(34),"")
fixQuotes = chr(34) & Replace(str,"|&|",qiStr) & chr(34)
End Function
I've just encoutered this issue same with yours,and have solved like that below
Using Script component in DF
and try TextFieldParser class which is new in .net framework 2.0
Good Luck
Awesome - that sounds like what I am looking for - I will need to figure out how to make it work, but I found a reference so I'll just need to implement it.
Thanks for the suggestion! I'll post my solution here if I can work it out
VB 2005 Script version:
Sub ProcessFile(ByVal filename As String)
Dim sReader As New System.IO.StreamReader(DIR_Downloads & "\" & filename)
Dim sWriter As New System.IO.StreamWriter(DIR_RawFiles & "\" & filename)
Do
sWriter.WriteLine(fixQuotes(sReader.ReadLine()))
Loop Until sReader.Peek = -1
sWriter.Close()
sReader.Close()
End Sub
Function fixQuotes(ByVal str As String) As String
Dim qiStr As String = Chr(34) & "," & Chr(34) ' ","
str = str.Substring(1, str.Length - 2)
str = Replace(str, qiStr, "|&|")
str = Replace(str, Chr(34), """)
fixQuotes = Chr(34) & Replace(str, "|&|", qiStr) & Chr(34)
End Function
hi all,
I am also facing the same error with flat file, which consists of ""Investment Accounts""", where in the text qualifier is as ". But the data flow task is throwing an error as "the column delimiter could not found ". Can you please help to remove or replace the double quotes i.e. "".
Please sugesst me to go step by step.
With Regards,
anand
The script described above isn't working for you?
I ran into this same issue and when I did a search ended up in this forum. The script advise from ViewMaster to parse out the double quotes in the file using ScriptTask works but it assumes that all the fields are qualified with a double quote. In my case I have fields that are not enclosed in double quotes and some fields that are enclosed in double quotes. Then I spoted the comment from Andew Knight above to use a TextFieldParser class and I tested that out and lo and behold the TextFieldParser class parses the file pefectly fine. Not sure why SSIS flat file connection manager cannot parse it correctly. So here is my solution to this issue. I created a ScriptTask and used the TaskFieldParser class to read the csv file and convert it to tab delmited file and then modified my subsequent tasks to read the tab file. Just for these reasons I prefer to work with Tab delimited file but in this case the input file is beyond my control, so I converted it to tab and went my merry way. Sucks to do extra processing but what else a poor programmer to do.
To recap the issue
if you have a record like this
"AU ","MISSOULA ""LAMBROS"", MT","",100328 you would expect it to be parsed as col1 col2 col3 col4 AU MISSOULA "LAMBROS", MT Null 100328 But SSIS parses it as col1 col2 col3 col4 col5 AU MISSOULA ""LAMBROS" MT" Null 100328 Here is the code for the ScriptTask if anyone is interested. You will need a reference to Microsoft.VisualBasic add two File Connection Managers one for the csv file and for the tab file.
Code Snippet
Imports
System.IO Imports
System.Text Imports
Microsoft.VisualBasic.FileIO Public Sub Main()
' 'This routine uses the VB TextFieldParser class to properly parse the csv file and convert to a 'tab delimited file and subsequent tasks in the package can read from the tab file. Apparently the 'TaskFieldParser class does not have this limitation and it works as we expected. Besides it also 'provides an added feature where by you can trim extra spaces from text fields.
Dim csvFileFullPath As String Dim tabFileFullPath As String csvFileFullPath = Dts.Connections(
"CSV File Connection").ConnectionString tabFileFullPath = Dts.Connections(
"TAB File Connection").ConnectionString
Using tabStreamWriter As New StreamWriter(tabFileFullPath)
Using csvFileReader As New TextFieldParser(csvFileFullPath)
csvFileReader.TextFieldType = FieldType.Delimited
csvFileReader.Delimiters = New String() {","}
csvFileReader.HasFieldsEnclosedInQuotes = True
csvFileReader.TrimWhiteSpace =
True
Dim currentRow As String()
While Not csvFileReader.EndOfData
Try
Dim i As Int32 = 1 Dim outputRow As New Text.StringBuilder() currentRow = csvFileReader.ReadFields()
For Each currentField As String In currentRow
'currentField = currentField.Replace(Chr(34), Chr(39)) 'replace double quote with single quote if needed outputRow.Append(currentField)
If i < currentRow.Length Then outputRow.Append(Chr(9))
'add a tab for each field except last one
End If i = i + 1
Next
tabStreamWriter.WriteLine(outputRow.ToString())
Catch ex As Microsoft.VisualBasic.FileIO.MalformedLineException Dts.TaskResult = Dts.Results.Failure
End Try End While End Using End Using Dts.TaskResult = Dts.Results.Success
End Sub