StreamReader, Regex and memory
I am using StreamReader's ReadToEnd() method to read an entire file into a String. I then use a Regex to parse some content from the file. I do this twice for each file to get all the data I want from it. The first regular expression looks like this:
String^ regexData1 = "<tag>(?<Occurrence>.*?)</tag>";
Regex^ regOccurrence = gcnew Regex(regexData1, System::Text::RegularExpressions::RegexOptions::IgnoreCase | System::Text::RegularExpressions::RegexOptions::Singleline);
This is working just fine for now, with smaller files. But I would like to learn how to parse larger files. I suppose that I can't use ReadToEnd() with really large files. I could use ReadLine() instead to parse one line at a time, but if you take a look at the regular expression, you can see that I want all sorts of data that's between the tags.
So if we suppose that there was a newline character between the tags and I was using ReadLine(), then the regular expression would miss that match, because the content would be cut into two parts on different lines. One could use the Read() function to decide how large a part to read into a Char array, but that would not guarantee that the content isn't cut up either...
How am I supposed to solve this problem?
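To make the ReadLine() problem concrete, here is a minimal sketch in standard C++ (not C++/CLI, so that it compiles without .NET). The helper names `tagPattern`, `countMatchesWhole` and `countMatchesPerLine` are made up for this illustration. Since std::regex has no Singleline option, `[\s\S]` is used where the original pattern relies on `.` matching newlines, and `icase` stands in for IgnoreCase.

```cpp
#include <cstddef>
#include <iterator>
#include <regex>
#include <sstream>
#include <string>

// Hypothetical helper: the thread's pattern, adapted to std::regex.
static const std::regex& tagPattern()
{
    static const std::regex pattern("<tag>([\\s\\S]*?)</tag>",
                                    std::regex::ECMAScript | std::regex::icase);
    return pattern;
}

// Counts the non-overlapping matches of the pattern in one string.
static std::size_t countIn(const std::string& s)
{
    return static_cast<std::size_t>(std::distance(
        std::sregex_iterator(s.begin(), s.end(), tagPattern()),
        std::sregex_iterator()));
}

// Searches the whole text at once, as ReadToEnd() followed by a Regex would.
std::size_t countMatchesWhole(const std::string& text)
{
    return countIn(text);
}

// Searches one line at a time, as a ReadLine() loop would.
std::size_t countMatchesPerLine(const std::string& text)
{
    std::istringstream in(text);
    std::string line;
    std::size_t count = 0;
    while (std::getline(in, line)) {
        count += countIn(line); // a segment spanning two lines is never seen whole
    }
    return count;
}
```

For a text such as "<tag>a</tag>", a newline, then "<tag>b", a newline, then "c</tag>", the whole-text search sees both segments, while the per-line search only sees the first one.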
(Moderator: Thread moved to the Regular Expression Forum)
Large file handling can be a complex matter, so there's quite a bit of literature available out there. Have a go with your favorite search engine to uncover some of it.
In this case, you've got two fairly simple alternatives:
- Store the entire thing in memory
- Read and process overlapping blocks
Approach #1 is what you've been doing thus far, and it obviously becomes impractical for large files whose size approaches that of your application's memory space. The actual size you can fit there in one contiguous block depends on fragmentation caused by other data, DLLs and so forth. Generally, the larger the file, the more swapping will occur and the slower your application (or entire system) will seem.
The second alternative would require you to know certain details about your search, such as the "worst case" span of a matched segment. An overlapping read would have to be large enough to cover all possible matches on the "border" areas of the previous block. If your regex is extensively using e.g. back and forward references, this may become a complicated matter, but for most uses it should be a feasible solution.
I do not really know how this is done in practice. And when I think about it, the task seems to be almost impossible. First of all, you cannot use ReadLine(), because you don't know how many newline characters your segment contains. And therefore you don't know how many lines to join before looking for a match.
I don't know how to do this, but another way is to read in large blocks of Chars of a known size and hope that your matching segments are smaller than those blocks. Then a matching segment can span at most two blocks at a time. The problem is knowing how large the blocks can be and how much memory is available at the moment. Anyway, this method could cut up the segment almost anywhere, and where is very hard to predict: in the middle of the content between the tags, somewhere in the start tag, or somewhere in the end tag.
Can anyone give me a practical example of how to do this "overlapping read"?
I came up with one other solution, and that is to read a known number of Chars into a Char array, convert that Char array into a String, and run the regex search on that.
If you got matches:
1 Parse them out.
2 Get index and length of last match.
3 Remove the characters up to and including index + (length - 1) of the last match.
4 Copy rest of String to beginning of the Char-array.
5 Fill the rest of the array by reading from the file.
6 Convert the Char-array into a String.
7 Do the regex on that.
If you got no matches:
1 Move the content to the left by one character.
2 Fill the gap of one character to the right by reading one Char into it.
3 Convert the Char-array into a String.
4 Do the regex on that.
Now there are two possible outcomes again: either you got matches or you got no matches.
Whichever you got, you have to repeat the corresponding procedure above, and the loop goes on until you reach the end of the file.
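The loop described above can be sketched roughly as follows. This is only an illustration in standard C++ (std::istream and std::regex) rather than C++/CLI; in .NET, StreamReader's Read(buffer, index, count) and the Index and Length of the last Match would play the same roles. The names `extractTagContents` and `chunkSize` are made up for this sketch. It also deviates from the steps above in one respect: when nothing matches, it does not shift by a single character, but appends the next block and throws away any text that can no longer be the start of a match. `[\s\S]` replaces `.` because std::regex has no Singleline option, and the capture is group 1 because std::regex has no named groups.

```cpp
#include <cstddef>
#include <istream>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: reads `in` in blocks of `chunkSize` characters and
// returns the text between every <tag>...</tag> pair, in file order.
std::vector<std::string> extractTagContents(std::istream& in, std::size_t chunkSize)
{
    static const std::regex pattern("<tag>([\\s\\S]*?)</tag>",
                                    std::regex::ECMAScript | std::regex::icase);
    static const std::regex openTag("<tag>",
                                    std::regex::ECMAScript | std::regex::icase);
    const std::size_t openLen = 5;      // length of "<tag>"

    if (chunkSize == 0) chunkSize = 1;  // a zero-sized read would never advance

    std::vector<std::string> results;
    std::string buffer;                 // unmatched text carried over between reads
    std::vector<char> chunk(chunkSize);

    while (in.read(chunk.data(), static_cast<std::streamsize>(chunk.size()))
           || in.gcount() > 0) {
        buffer.append(chunk.data(), static_cast<std::size_t>(in.gcount()));

        // Parse out all complete matches and note where the last one ends.
        std::size_t consumed = 0;
        for (std::sregex_iterator it(buffer.begin(), buffer.end(), pattern), end;
             it != end; ++it) {
            results.push_back((*it)[1].str());
            consumed = static_cast<std::size_t>(it->position(0) + it->length(0));
        }
        buffer.erase(0, consumed);      // drop everything up to the last match's end

        // Keep only text that could still begin a match after the next read.
        std::smatch m;
        if (std::regex_search(buffer, m, openTag)) {
            // An unclosed start tag: keep from there on.
            buffer.erase(0, static_cast<std::size_t>(m.position(0)));
        } else if (buffer.size() > openLen - 1) {
            // No start tag: at most a start tag split by the block boundary remains.
            buffer.erase(0, buffer.size() - (openLen - 1));
        }
    }
    return results;
}
```

With this arrangement the carried-over text is bounded by the longest segment between a start tag and its end tag plus one block, not by the file size, so the block size can be a fixed, modest value. One caveat: some std::regex implementations can recurse deeply on very long matches, so extremely long segments may need a different matching strategy.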
The big question is how big the Char array that you read into (the buffer) should be. To answer that question, you have to know how many Chars you are able to read into memory. I don't know how to do this at all...
Can anyone tell me a way to find out the number of Chars that could be read into memory?
What do you think of the method I present above?
Is there an easier way?