Double Quotes in CSV File

I have a lot of data coming in from CSV files. I have many CSV files (~20), with varying amounts of data- but some are quite large (largest file is ~230mb)

I'm trying to import it into a SQL database via SSIS, but the data is a little bit frustrating.

I have data that looks like this:

"Text from vendor ""Vendor Name, Inc."" blah blah", "Next string", "", 1234

Many things to notice here - as well you can imagine some of the difficulties here.

After parsing, this data should have 4 columns-

Column1 Column2 Column3 Column4

Text From Vendor "Vendor Name, Inc." blah blah Next string <blank> 1234

The biggest problems stem from the double quotes mixed in with the comma because it is a comma delimited file with quotes as the text qualifier. The other problem is the double quotes with blank text ... which prevents me from doing a replace on the double quotes ... I think ...

What would be your suggestions to help me parse this out? (don't forget, some of these are large files with about 260,000 records- )

Thanks in advance,

Rob

[1145 byte] By [recruz] at [2008-1-7]
# 1

In addition, if I were to do my own parsing, what would the algorithm be to do that?

It would be something like, Find 1st quote - after first quote, if exists double quote, replace with aribtrary character (#) - then keep parsing through until next single quote is found?

Then I can go around and replace the arbitrary character with the double quotes later on-

Does that sound about right?

recruz at 2007-10-2 > top of Msdn Tech,SQL Server,SQL Server Integration Services...
# 2

When you set up your Flat File Source, on the General Tab you will see a text box called "Text qualifier"

Place a " in the box. It will parse correctly.

SeanSchade at 2007-10-2 > top of Msdn Tech,SQL Server,SQL Server Integration Services...
# 3

It seems like it is bad form on two levels and I would recommend that you not accept that format from a vendor unless tweaks are made.

  1. The quotes are used to seperate data in the CSV and allow the meta character, comma, to be allowed in data such as "$1,110.00". By including quotes within the quoted data that breaks form. Request that single quotes be used within double quotes if needed or require an escape of the quote within the data area such as "Ficus said, \"to be or not\". Was his reply", .
  2. The number in your example should be quoted to, so the form is not broken. Its like any computer grammar, one needs things to be consistent.

Just my opinion...I could be wrong. <g>

OmegaMan at 2007-10-2 > top of Msdn Tech,SQL Server,SQL Server Integration Services...
# 4
trust me, I have that box checked but it's not parsing correctly
recruz at 2007-10-2 > top of Msdn Tech,SQL Server,SQL Server Integration Services...
# 5

Sorry, I missed that extra comma in the double-quoted text.

Is requesting a change to the file format out of the question? That's a really broken file. If they could give it to you Tab delimited you would be set. I don't like using printable characters for delimiters for this very reason.

You could run a RegEx on the files, but with those large files I don't know if any of the software products available can handle them.

Check out PowerGrep. http://www.powergrep.com/ I would email them first to see if it will choke on those file sizes. You can point it at a directory, and it will loop through all of the files.

If you're good with .NET you could write your own utility using the FileStream class to stream the file in, and do the RegEx replace. File size wouldn't matter in this case.

SeanSchade at 2007-10-2 > top of Msdn Tech,SQL Server,SQL Server Integration Services...
# 6
We have ActiveX script like this to solve this problem:

Function Main()

Set fso = CreateObject("Scripting.FileSystemObject")

Set source = fso.OpenTextFile("_data.txt")
Set destination = fso.CreateTextFile("data_good.txt")
While Not source.AtEndOfStream
destination.writeline fixQuotes(source.Readline)
Wend
Set source = nothing
Set destination = nothing
Set fso = nothing

Main = DTSTaskExecResult_Success

End Function

Function fixQuotes(str)
qiStr = chr(34) & "," & chr(34)
str = Replace(str,qiStr,"|&|")
str = Replace(str,chr(34),"")
fixQuotes = chr(34) & Replace(str,"|&|",qiStr) & chr(34)
End Function

TheViewMaster at 2007-10-2 > top of Msdn Tech,SQL Server,SQL Server Integration Services...
# 7

I've just encoutered this issue same with yours,and have solved like that below

Using Script component in DF

and try TextFieldParser class which is new in .net framework 2.0

Good Luck

andrewknight at 2007-10-2 > top of Msdn Tech,SQL Server,SQL Server Integration Services...
# 8

Awesome - that sounds like what I am looking for - I will need to figure out how to make it work, but I found a reference so I'll just need to implement it.

Thanks for the suggestion! I'll post my solution here if I can work it out

recruz at 2007-10-2 > top of Msdn Tech,SQL Server,SQL Server Integration Services...
# 9
VB 2005 Script version:

Sub ProcessFile(ByVal filename As String)
Dim sReader As New System.IO.StreamReader(DIR_Downloads & "\" & filename)
Dim sWriter As New System.IO.StreamWriter(DIR_RawFiles & "\" & filename)
Do
sWriter.WriteLine(fixQuotes(sReader.ReadLine()))
Loop Until sReader.Peek = -1
sWriter.Close()
sReader.Close()
End Sub

Function fixQuotes(ByVal str As String) As String
Dim qiStr As String = Chr(34) & "," & Chr(34) ' ","
str = str.Substring(1, str.Length - 2)
str = Replace(str, qiStr, "|&|")
str = Replace(str, Chr(34), "&#34;")
fixQuotes = Chr(34) & Replace(str, "|&|", qiStr) & Chr(34)
End Function

TheViewMaster at 2007-10-2 > top of Msdn Tech,SQL Server,SQL Server Integration Services...
# 10

hi all,

I am also facing the same error with flat file, which consists of ""Investment Accounts""", where in the text qualifier is as ". But the data flow task is throwing an error as "the column delimiter could not found ". Can you please help to remove or replace the double quotes i.e. "".

Please sugesst me to go step by step.

With Regards,

anand

kumaranand at 2007-10-2 > top of Msdn Tech,SQL Server,SQL Server Integration Services...
# 11
The script described above isn't working for you?

jwelch at 2007-10-2 > top of Msdn Tech,SQL Server,SQL Server Integration Services...
# 12

I ran into this same issue and when I did a search ended up in this forum. The script advise from ViewMaster to parse out the double quotes in the file using ScriptTask works but it assumes that all the fields are qualified with a double quote. In my case I have fields that are not enclosed in double quotes and some fields that are enclosed in double quotes. Then I spoted the comment from Andew Knight above to use a TextFieldParser class and I tested that out and lo and behold the TextFieldParser class parses the file pefectly fine. Not sure why SSIS flat file connection manager cannot parse it correctly. So here is my solution to this issue. I created a ScriptTask and used the TaskFieldParser class to read the csv file and convert it to tab delmited file and then modified my subsequent tasks to read the tab file. Just for these reasons I prefer to work with Tab delimited file but in this case the input file is beyond my control, so I converted it to tab and went my merry way. Sucks to do extra processing but what else a poor programmer to do.

To recap the issue

if you have a record like this

"AU ","MISSOULA ""LAMBROS"", MT","",100328

you would expect it to be parsed as

col1 col2 col3 col4

AU MISSOULA "LAMBROS", MT Null 100328

But SSIS parses it as

col1 col2 col3 col4 col5

AU MISSOULA ""LAMBROS" MT" Null 100328

Here is the code for the ScriptTask if anyone is interested. You will need a reference to Microsoft.VisualBasic add two File Connection Managers one for the csv file and for the tab file.

Code Snippet

Imports System.IO

Imports System.Text

Imports Microsoft.VisualBasic.FileIO

Public Sub Main()

'

'This routine uses the VB TextFieldParser class to properly parse the csv file and convert to a

'tab delimited file and subsequent tasks in the package can read from the tab file. Apparently the

'TaskFieldParser class does not have this limitation and it works as we expected. Besides it also

'provides an added feature where by you can trim extra spaces from text fields.

Dim csvFileFullPath As String

Dim tabFileFullPath As String

csvFileFullPath = Dts.Connections("CSV File Connection").ConnectionString

tabFileFullPath = Dts.Connections("TAB File Connection").ConnectionString

Using tabStreamWriter As New StreamWriter(tabFileFullPath)

Using csvFileReader As New TextFieldParser(csvFileFullPath)

csvFileReader.TextFieldType = FieldType.Delimited

csvFileReader.Delimiters = New String() {","}

csvFileReader.HasFieldsEnclosedInQuotes = True

csvFileReader.TrimWhiteSpace = True

Dim currentRow As String()

While Not csvFileReader.EndOfData

Try

Dim i As Int32 = 1

Dim outputRow As New Text.StringBuilder()

currentRow = csvFileReader.ReadFields()

For Each currentField As String In currentRow

'currentField = currentField.Replace(Chr(34), Chr(39)) 'replace double quote with single quote if needed

outputRow.Append(currentField)

If i < currentRow.Length Then

outputRow.Append(Chr(9)) 'add a tab for each field except last one

End If

i = i + 1

Next

tabStreamWriter.WriteLine(outputRow.ToString())

Catch ex As Microsoft.VisualBasic.FileIO.MalformedLineException

Dts.TaskResult = Dts.Results.Failure

End Try

End While

End Using

End Using

Dts.TaskResult = Dts.Results.Success

End Sub

brittoa at 2007-10-2 > top of Msdn Tech,SQL Server,SQL Server Integration Services...

SQL Server

Site Classified