Parsing a Document

vb.net 2003

I am working with large documents that I wish to parse in vb code. The document is organized as follows:

Paragraph Number Title

Text

3.1.7.2 Some Paragraph Title

a;lkjs;lkdfj;skljf;asfj;aslfj;asljf;asljdf;lasf

a;ldfj;asljf;asfj;laskjf;lasjf;lasjf;lasjf;asljf;askl

as;dfjas;ldfjas;lfjas;lfj;asljf;asljf;asljf;lasjf;aslj

3.1.7.3 Next Title

a;kldjf;alsjdf;asjf;asjf

ad;flkjas;dlkfjas;lkfj

;ajdf;lasjf;lksjdfkl;j

I wish to parse the document into a database with fields prepared to capture the paragraph number, title, text, etc. What I don't understand is how, as I loop line by line through my document, to identify a line that starts with a number. Also, when using the Split function, how does one split a string upon encountering a number?

Thanks,

Maria

[877 byte] By [LePhareRouge] at [2008-1-9]
# 1

You can use the .Chars property to get the first character of the line and test whether it is a digit.

To separate the paragraph number from the title, you don't need to split on a number. In your example, the first space in the line already separates the number from the title.

So split the line at the first space to get two strings. Remove the dots from the first part and check whether what remains is numeric; if it is, the line is a heading and the second part is the title.
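The split-at-the-first-space idea could be sketched roughly as follows. This is only an illustration: the module name HeadingParser and the helpers IsAllDigits and TryParseHeading are hypothetical names, not part of the framework, and the sketch assumes a heading always has a space between the number and a title. It sticks to calls that are available in VB.NET 2003.

```vbnet
Imports System

Module HeadingParser

    ' Hypothetical helper: returns True when every character of s is a digit.
    Function IsAllDigits(ByVal s As String) As Boolean
        If s.Length = 0 Then Return False
        For Each c As Char In s
            If Not Char.IsDigit(c) Then Return False
        Next
        Return True
    End Function

    ' Hypothetical helper: splits "3.1.7.2 Some Paragraph Title" at the first space.
    ' Returns True and fills number/title only when the first part is a dotted number.
    ' A line with no space at all is treated as ordinary text (an assumption).
    Function TryParseHeading(ByVal line As String, ByRef number As String, ByRef title As String) As Boolean
        number = ""
        title = ""
        Dim pos As Integer = line.IndexOf(" "c)
        If pos <= 0 Then Return False

        Dim firstPart As String = line.Substring(0, pos)
        ' Remove the dots, then check that only digits remain.
        If Not IsAllDigits(firstPart.Replace(".", "")) Then Return False

        number = firstPart
        title = line.Substring(pos + 1).Trim()
        Return True
    End Function

    Sub Main()
        Dim number As String = ""
        Dim title As String = ""
        If TryParseHeading("3.1.7.2 Some Paragraph Title", number, title) Then
            Console.WriteLine(number & " | " & title)
        End If
    End Sub

End Module
```

Checking the whole first word, rather than only the first character, avoids treating a text line such as "2nd attempt failed" as a heading.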

In pseudocode, the loop looks like this:

clear information (title, number, text)
loop
    if the line starts with a number
        save current information (if any)
        clear information
        get number
        get title
    else
        if line is not empty
            append line to text
        end if
    end if
until end of file
save current information
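The loop above might translate into VB.NET along these lines. This is a sketch under stated assumptions: ParagraphRecord and ParseDocument are hypothetical names, "saving" is modelled as adding to an ArrayList instead of writing to a database, and a heading is assumed to be any line that begins with a dotted number followed by whitespace and a title.

```vbnet
Imports System
Imports System.Collections
Imports System.IO
Imports System.Text
Imports System.Text.RegularExpressions

Module DocumentParser

    ' Hypothetical record type standing in for one database row.
    Class ParagraphRecord
        Public Number As String = ""
        Public Title As String = ""
        Public Text As String = ""
    End Class

    ' Assumed heading shape: dotted number, whitespace, then the title.
    Private ReadOnly rxHeading As New Regex("^(\d+(?:\.\d+)*)\s+(.+)$")

    ' Reads lines from any TextReader and returns an ArrayList of ParagraphRecord.
    Function ParseDocument(ByVal reader As TextReader) As ArrayList
        Dim records As New ArrayList
        Dim current As ParagraphRecord = Nothing
        Dim body As New StringBuilder

        Dim line As String = reader.ReadLine()
        Do While Not line Is Nothing
            Dim m As Match = rxHeading.Match(line)
            If m.Success Then
                ' A new heading: save the record collected so far, then start over.
                If Not current Is Nothing Then
                    current.Text = body.ToString().Trim()
                    records.Add(current)
                End If
                current = New ParagraphRecord
                current.Number = m.Groups(1).Value
                current.Title = m.Groups(2).Value.Trim()
                body = New StringBuilder
            ElseIf line.Trim().Length > 0 AndAlso Not current Is Nothing Then
                ' A non-empty text line: append it to the current paragraph.
                body.Append(line.Trim()).Append(Environment.NewLine)
            End If
            line = reader.ReadLine()
        Loop

        ' Save the last record once the end of the input is reached.
        If Not current Is Nothing Then
            current.Text = body.ToString().Trim()
            records.Add(current)
        End If
        Return records
    End Function

    Sub Main()
        Dim nl As String = Environment.NewLine
        Dim sample As String = "3.1.7.2 Some Paragraph Title" & nl & "first line" & nl & nl & "3.1.7.3 Next Title" & nl & "second body"
        For Each rec As ParagraphRecord In ParseDocument(New StringReader(sample))
            Console.WriteLine(rec.Number & " | " & rec.Title & " | " & rec.Text)
        Next
    End Sub

End Module
```

For a real file, a StreamReader opened on the document can be passed in place of the StringReader. Note that a text line which itself begins with a number would be mistaken for a heading under this assumed pattern.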

ThE_lOtUs at 2007-10-3 > top of Msdn Tech,Visual Basic,Visual Basic General...
# 2

To find out whether a character is a number, use the IsNumber() function of the Char structure. Here is an example:

Code Snippet

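' MyTextLine is a String holding one line from your text.
' Dim MyTextLine As String

' Check the length first: indexing an empty line would throw an exception.
If MyTextLine.Length > 0 AndAlso Char.IsNumber(MyTextLine.Chars(0)) Then
    ' Begin a new record.
End If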

Regards,

# 3

Le Phare Rouge,

Based on your post, my understanding is that you need to extract information from the text. I recommend using a regular expression to separate the paragraphs. You can then extract the paragraph number, the title and the text body. Here is a code snippet that separates the paragraphs.

Code Snippet

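Imports System.IO
Imports System.Text.RegularExpressions

Public Class Form1

    ' Matches a heading line such as "3.1.7.2 Some Paragraph Title".
    ' Group 1 is the dotted paragraph number, group 2 is the title.
    ' Multi-digit parts and any number of levels are accepted, and digits in the body text do not cut a paragraph short.
    Dim rxHeading As Regex = New Regex("^(\d+(?:\.\d+)+)[ \t]+([^\r\n]+)", RegexOptions.Multiline)

    Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
        ' StreamReader is used because the My namespace is not available in VB.NET 2003.
        Dim reader As New StreamReader("c:\billy123.txt")
        Dim str As String = reader.ReadToEnd()
        reader.Close()

        Dim rxMatch As MatchCollection = rxHeading.Matches(str)
        For i As Integer = 0 To rxMatch.Count - 1
            ' The body runs from the end of this heading to the start of the next one.
            Dim bodyStart As Integer = rxMatch(i).Index + rxMatch(i).Length
            Dim bodyEnd As Integer = str.Length
            If i < rxMatch.Count - 1 Then bodyEnd = rxMatch(i + 1).Index

            Debug.WriteLine("index is :" & rxMatch(i).Index.ToString())
            Debug.WriteLine(rxMatch(i).Groups(1).Value & " | " & rxMatch(i).Groups(2).Value)
            Debug.WriteLine(str.Substring(bodyStart, bodyEnd - bodyStart).Trim())
        Next
    End Sub

End Class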
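As for the original question about splitting a string when a number is encountered, one possible approach is Regex.Split with a lookahead, which cuts the text in front of each heading without consuming any characters. The following is a sketch only: SplitOnHeadings and SplitIntoParagraphs are hypothetical names, and the pattern assumes headings are dotted numbers (at least one dot) at the start of a line.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module SplitOnHeadings

    ' Splits the whole document in front of every line that starts with a dotted number.
    ' The lookahead (?=...) matches a position, so no characters are removed by the split.
    Function SplitIntoParagraphs(ByVal text As String) As String()
        Dim rx As New Regex("^(?=\d+(?:\.\d+)+[ \t])", RegexOptions.Multiline)
        Return rx.Split(text)
    End Function

    Sub Main()
        Dim nl As String = Environment.NewLine
        Dim text As String = "3.1.7.2 Some Paragraph Title" & nl & "body one" & nl & "3.1.7.3 Next Title" & nl & "body two"
        For Each piece As String In SplitIntoParagraphs(text)
            ' Split may return an empty piece when the text begins with a heading, so skip blanks.
            If piece.Trim().Length > 0 Then
                Console.WriteLine("[" & piece.Trim() & "]")
            End If
        Next
    End Sub

End Module
```

Each non-empty piece then holds one heading line followed by its body, which can be taken apart at the first line break and the first space.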

A good article on the subject is "The 30 Minute Regex Tutorial" by Jim Hollenhorst on The Code Project. It is a useful introduction to regular expressions.

Best regards.

RiquelDong–MSFT at 2007-10-3
# 4

Thanks for the great reference. It is valuable to have a reference which explains how to use this capability.

LePhareRouge at 2007-10-3