MSHTML-based scraper
Hi,
I'm writing a web scraper toolkit in C++ which I'll be using to write several web scrapers. I decided to save myself some time by using MSHTML to parse pages, but I've run into a number of problems with it. I'm not sure if this is exactly the right forum for this issue, but it's the closest I could find.
Getting a single page in and parsing it is no problem. I can walk the DOM nicely and pick off whatever info I want. Pretty easy. (At least, after writing some wrappers for the otherwise laborious COM interfaces.) The problem is getting to the next page.
If I do anything that would normally cause navigation, such as inducing a click on a link or submitting a form, it launches IE to browse the page interactively. According to MSDN, this is what MSHTML will do if the application doesn't implement IHlinkFrame. So, I implemented IHlinkFrame.
To my dismay, the same thing happened! Upon further investigation, I found that MSHTML does not even query my application object for the IHlinkFrame interface. After much investigation, I could not find any explanation of this or what I was supposed to do to be able to handle navigation programmatically.
I finally gave up and decided to not induce MSHTML to navigate pages. For simple links, this is just a matter of loading the next page directly using its URL. For forms, this means actually walking the DOM to build up the submission data for the form. I begrudgingly wrote code to do this, even though I shouldn't have had to, since it's obviously something IE does routinely.
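For illustration, the encoding half of that form-walking step might look roughly like the sketch below. It assumes the form uses the default application/x-www-form-urlencoded encoding and that the name/value pairs have already been collected from the DOM. FormUrlEncode and BuildFormBody are hypothetical names of my own, not part of MSHTML.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: encode one field name or value for an
// application/x-www-form-urlencoded body. ASCII letters, digits and
// "-", "_", ".", "*" pass through, a space becomes '+', and every
// other byte becomes %XX.
std::string FormUrlEncode(const std::string& text)
{
    static const char kHex[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : text) {
        const bool safe =
            (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
            (c >= '0' && c <= '9') ||
            c == '-' || c == '_' || c == '.' || c == '*';
        if (safe) {
            out += static_cast<char>(c);
        } else if (c == ' ') {
            out += '+';
        } else {
            out += '%';
            out += kHex[c >> 4];
            out += kHex[c & 0x0F];
        }
    }
    return out;
}

// Hypothetical helper: join name/value pairs (assumed to have been
// gathered from the form's controls already) into one POST body.
std::string BuildFormBody(
    const std::vector<std::pair<std::string, std::string>>& fields)
{
    std::string body;
    for (const auto& field : fields) {
        if (!body.empty()) {
            body += '&';
        }
        body += FormUrlEncode(field.first);
        body += '=';
        body += FormUrlEncode(field.second);
    }
    return body;
}
```

With this sketch, BuildFormBody({{"q", "a b&c"}, {"name", "Kevin"}}) produces q=a+b%26c&name=Kevin. Deciding which controls are eligible for submission (unchecked boxes, disabled fields, the clicked button) is a separate step that is not shown here.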
I was then assaulted by even more MSHTML drama. Once I have the POST data, for the life of me, I can't figure out how to get MSHTML to use it.
Per the MSDN documentation, you would ordinarily provide POST data in the BINDINFO structure passed to IBindStatusCallback::GetBindInfo. Apparently, though, IBindStatusCallback does not work with MSHTML, and you're supposed to use IPropertyNotifySink instead. Yet I can find no property of MSHTML that would provide POST data.
After much frustration, the only way I can think of dealing with this is to not only build the POST data myself, but also obtain the page myself directly using WININET.DLL, then save the resulting page to a local disk file, which can then be given to MSHTML.
I find it ridiculous that I should have to employ such a roundabout approach. The whole point behind using MSHTML was to avoid duplication of effort.
Any info will be greatly appreciated.
Regards,
Kevin
Hi Dave,
Thank you for the reply.
Actually, I had tried hosting the web browser control, since MSDN mentioned that it handles navigation. Unfortunately, I didn't even get as far with it as I did with MSHTML. I could at least get MSHTML to parse a single page, while I couldn't get the web browser control to load at all.
I suspect that it has something to do with the web browser control expecting that it is being hosted from a full-fledged graphical application. Meanwhile, my app is not intended to be interactive, so it doesn't even create a single window. (It's actually a console process.) It seemed as though I would have to implement all the GUI functionality of an ActiveX container in order to use the control, and this is rather excessive given the nature of the application.
Let me know if either the web browser control should be able to work without a GUI, or if there's anything I'm missing in terms of how to programmatically respond to navigation occurring in MSHTML.
Many thanks,
Kevin
There is a really old MSDN sample called WalkAll that shows you how to embed a non-UI MSHTML parser combined with a navigational event sink through the HTMLWindowEvents2 connection point. It's also pretty easy to add a sink for HTMLDocumentEvents2 or any of the other event interfaces available. It was written a long time ago, but it still ranks as one of the best examples of how to accomplish what you're trying to do.
WalkAll Source:
http://msdn.microsoft.com/archive/default.asp?url=/archive/en-us/samples/internet/browser/walkall/default.asp
Dispatch Interfaces Reference:
http://msdn.microsoft.com/workshop/browser/mshtml/reference/events/events.asp
WalkAll is a command shell app. It's based on 'old school' COM, where you implement all the interfaces yourself in C++ instead of using something like ATL, so it's self-contained and has no dependencies. This might be a good place for you to start. I used it myself when I first started working with MSHTML years ago, and it's quite good.
Regards,
Jim
Hi Jim,
Thank you for the reply.
Actually, I did already check out the WalkAll sample, which is why I got as far as I did pretty easily. Unfortunately, the sample doesn't actually follow any links, so I wasn't able to glean the mechanisms involved.
I could set up an event sink to get scripting events like onbeforeunload, onunload, or onload, but there doesn't seem to be any obvious way to get from having a scripting event to having a moniker to pass to IPersistMoniker::Load on MSHTML in order to get the next page to actually load under programmatic control.
The only thing I can think of is that maybe I could wait for an unload event and then call IPersistMoniker::GetCurMoniker, but I have a funny feeling that I'm just going to get a moniker to the same resource that was initially loaded. (This is because whenever I do anything that would load another resource, it loads in a new process, IEXPLORE.EXE, so I assume the MSHTML I created is technically still on the first page.)
The MSDN reference is very terse on this subject, and what it does mention (IHlinkFrame) doesn't even get queried for. Beyond this, I have no clue, apart from very roundabout workarounds.
Regards,
Kevin