Parsing a Document
vb.net 2003
I am working with large documents that I wish to parse in vb code. The document is organized as follows:
Paragraph Number Title
Text
3.1.7.2 Some Paragraph Title
a;lkjs;lkdfj;skljf;asfj;aslfj;asljf;asljdf;lasf
a;ldfj;asljf;asfj;laskjf;lasjf;lasjf;lasjf;asljf;askl
as;dfjas;ldfjas;lfjas;lfj;asljf;asljf;asljf;lasjf;aslj
3.1.7.3 Next Title
a;kldjf;alsjdf;asjf;asjf
ad;flkjas;dlkfjas;lkfj
;ajdf;lasjf;lksjdfkl;j
I wish to parse the document into a database with fields prepared to capture the paragraph number, title, text, etc. What I don't understand is how, as I loop line by line through my document, to identify a line that starts with a number. Also, when using the Split function, how does one split a string upon encountering a number?
Thanks,
Maria
You can use the .Chars property of the string to get the first character of the line, then test whether it is a digit (for example with Char.IsDigit).
To separate the paragraph number from the title, you don't need to split on a number. In your example, the first space in the line already separates the number from the title.
So you could split at the first space, which gives you two strings. Remove the dots from the first part and check whether what remains is a number; if it is, the line is a title line.
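The check described above could be sketched roughly as follows. This is only an illustration: the names HeadingCheck and TrySplitHeading are made up for the example, and it assumes a heading is always "digits and dots, one space, title" as in the sample lines.

```vbnet
Imports System

Module HeadingCheck
    ' Hypothetical helper: returns True when the line looks like
    ' "3.1.7.2 Some Title" and hands back the two parts ByRef.
    Function TrySplitHeading(ByVal line As String, ByRef number As String, ByRef title As String) As Boolean
        number = ""
        title = ""
        ' The first character must be a digit.
        If line Is Nothing OrElse line.Length = 0 OrElse Not Char.IsDigit(line.Chars(0)) Then
            Return False
        End If

        ' Everything before the first space is the candidate number.
        Dim spaceAt As Integer = line.IndexOf(" "c)
        Dim candidate As String
        If spaceAt < 0 Then
            candidate = line
        Else
            candidate = line.Substring(0, spaceAt)
        End If

        ' Remove the dots; only digits should remain.
        Dim digitsOnly As String = candidate.Replace(".", "")
        For Each c As Char In digitsOnly
            If Not Char.IsDigit(c) Then Return False
        Next

        number = candidate
        If spaceAt >= 0 Then title = line.Substring(spaceAt + 1).Trim()
        Return True
    End Function

    Sub Main()
        Dim number As String = ""
        Dim title As String = ""
        If TrySplitHeading("3.1.7.2 Some Paragraph Title", number, title) Then
            Console.WriteLine(number & " | " & title)
        End If
    End Sub
End Module
```

Note that a body line which happens to begin with a number followed by a space (for example "12 items remain") would also pass this test, so you may need a stricter rule if your documents contain such lines.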
-
clear information (number, title, text)
loop
    read next line
    if it starts with a number
        save current information
        clear information
        get number
        get title
    else
        if line is not empty
            append line to text
        end if
    end if
while not end of file
save current information
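One possible translation of that loop into VB.NET is sketched below. ParagraphRecord, IsHeading and ParseParagraphs are names invented for the example, and the results are collected in an ArrayList rather than written to a database, since the database layout is not given here. It reads from a TextReader, so a StreamReader over your file should work in place of the StringReader used for the demo.

```vbnet
Imports System
Imports System.Collections
Imports System.IO

Module ParagraphLoop
    ' Hypothetical container for one parsed paragraph.
    Class ParagraphRecord
        Public Number As String = ""
        Public Title As String = ""
        Public Text As String = ""
    End Class

    ' Assumption: a heading is digits and dots followed by a space.
    Function IsHeading(ByVal line As String) As Boolean
        Dim spaceAt As Integer = line.IndexOf(" "c)
        If spaceAt <= 0 Then Return False
        For Each c As Char In line.Substring(0, spaceAt)
            If Not Char.IsDigit(c) AndAlso c <> "."c Then Return False
        Next
        Return Char.IsDigit(line.Chars(0))
    End Function

    Function ParseParagraphs(ByVal reader As TextReader) As ArrayList
        Dim results As New ArrayList()
        Dim current As ParagraphRecord = Nothing
        Dim line As String = reader.ReadLine()
        Do While Not line Is Nothing
            If IsHeading(line) Then
                ' Save the previous paragraph, then start a new one.
                If Not current Is Nothing Then results.Add(current)
                current = New ParagraphRecord()
                Dim spaceAt As Integer = line.IndexOf(" "c)
                current.Number = line.Substring(0, spaceAt)
                current.Title = line.Substring(spaceAt + 1).Trim()
            ElseIf line.Trim().Length > 0 AndAlso Not current Is Nothing Then
                ' Body line: append it to the current paragraph's text.
                If current.Text.Length > 0 Then current.Text &= Environment.NewLine
                current.Text &= line
            End If
            line = reader.ReadLine()
        Loop
        ' Save the last paragraph.
        If Not current Is Nothing Then results.Add(current)
        Return results
    End Function

    Sub Main()
        Dim sample As String = "3.1.7.2 Some Paragraph Title" & Environment.NewLine & _
                               "first body line" & Environment.NewLine & _
                               "second body line" & Environment.NewLine & _
                               "3.1.7.3 Next Title" & Environment.NewLine & _
                               "another body line"
        For Each p As ParagraphRecord In ParseParagraphs(New StringReader(sample))
            Console.WriteLine(p.Number & " | " & p.Title)
            Console.WriteLine(p.Text)
        Next
    End Sub
End Module
```

Text that appears before the first heading is ignored in this sketch; whether that is right depends on how your documents begin.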
Le Phare Rouge,
Based on your post, my understanding is that you need to extract information from the text. I recommend using a regular expression to separate the paragraphs. You can then extract the paragraph number, the title and the text body from each one. Here is a code snippet that separates the paragraphs.
Code Snippet
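Imports System.Text.RegularExpressions

Public Class Form1
    ' One match per paragraph: a heading line that starts with a dotted number
    ' (for example 3.1.7.2), followed by everything up to the next heading.
    ' A lookahead is used because [^(...)] is only a character class.
    Dim rxImageCode As Regex = New Regex("^\d+(\.\d+)+[ \t]+.*?(?=^\d+(\.\d+)+[ \t]|\z)", RegexOptions.Multiline Or RegexOptions.Singleline)

    Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
        ' My.Computer and Debug.Print need VB 2005 or later.
        Dim str As String = My.Computer.FileSystem.ReadAllText("c:\billy123.txt")
        Dim rxMatch As MatchCollection = rxImageCode.Matches(str)
        For i As Integer = 0 To rxMatch.Count - 1
            If (rxMatch(i).Success) Then
                Debug.Print("index is :" + rxMatch(i).Groups(0).Index.ToString)
                Debug.Print(rxMatch(i).Groups(0).Value)
            End If
        Next
    End Sub
End Class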
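To pull the number, title and body out of each paragraph separately, named groups are one option. The sketch below is a variation on the idea above, not a tested solution for your files: the pattern, the group names (number, title, body) and the module name RegexFields are assumptions made for the example, and it expects headings with at least one dot, such as 3.1.7.2.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module RegexFields
    ' Assumed heading shape: dotted number at the start of a line, then the title.
    ' The body runs lazily up to the next heading or the end of the input.
    Private ReadOnly ParagraphPattern As New Regex( _
        "^(?<number>\d+(?:\.\d+)+)[ \t]+(?<title>[^\r\n]*)\r?\n?" & _
        "(?<body>.*?)(?=^\d+(?:\.\d+)+[ \t]|\z)", _
        RegexOptions.Multiline Or RegexOptions.Singleline)

    Sub Main()
        Dim text As String = "3.1.7.2 Some Paragraph Title" & Environment.NewLine & _
                             "first body line" & Environment.NewLine & _
                             "second body line" & Environment.NewLine & _
                             "3.1.7.3 Next Title" & Environment.NewLine & _
                             "another body line"
        For Each m As Match In ParagraphPattern.Matches(text)
            Console.WriteLine("Number: " & m.Groups("number").Value)
            Console.WriteLine("Title:  " & m.Groups("title").Value)
            Console.WriteLine("Body:   " & m.Groups("body").Value.Trim())
        Next
    End Sub
End Module
```

Each of the three group values could then be written to the matching database field. For very large documents, reading the whole file into one string may be costly, in which case a line-by-line loop is probably the safer choice.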
A good article on the subject is "The 30 Minute Regex Tutorial" by Jim Hollenhorst on The Code Project. It is a useful introduction to regular expressions.
Best regards.