One service I'm using doesn't have an API, but allows scraping, so I'm curious what the best way in iOS/Objective-C would be to do the following:
A valid approach would be to perform the scraping inside a UIWebView.
The strategy is pretty straightforward: use the stringByEvaluatingJavaScriptFromString: method of UIWebView to control the web page.
Assuming you already have the user's login details, you can enter them with a JavaScript snippet.
For instance, assuming that webView is the UIWebView instance and username is the id of the username input field:
NSString * usernameScript = @"document.getElementById('username').value='Gabriele';";
[self.webView stringByEvaluatingJavaScriptFromString:usernameScript];
The above code inserts Gabriele into the username field.
In the same way, you can go on to interact with the web page automatically through further JavaScript injections.
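As a rough sketch of what such a follow-up injection might look like, the function below fills in both credentials and submits the form. Only the username id comes from the example above; the password and loginForm ids and the function name fillAndSubmitLogin are assumptions made for illustration and would have to match the actual page. It is written in plain ES5 and takes the document as a parameter so it can also be exercised outside a web view.

```javascript
// Sketch only: 'password' and 'loginForm' are hypothetical ids,
// 'username' is the id used in the example above.
function fillAndSubmitLogin(doc, username, password) {
  var userField = doc.getElementById('username');
  var passField = doc.getElementById('password');  // assumed id
  var form = doc.getElementById('loginForm');      // assumed id

  // Report a missing element instead of throwing, so the native side
  // can tell that the page layout is not what was expected.
  if (!userField || !passField || !form) {
    return 'missing-element';
  }

  userField.value = username;
  passField.value = password;
  form.submit();
  return 'submitted';
}
```

Inside the web view you would inject the function definition followed by a call such as fillAndSubmitLogin(document, 'Gabriele', '...'). The string returned by the call is what stringByEvaluatingJavaScriptFromString: hands back. If the credentials are spliced into the script string, any quotes in them need escaping.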
Once you are logged in, you can monitor the current URL until the redirection takes you to the desired page. To do this, implement the webViewDidFinishLoad: method of UIWebViewDelegate, which is called each time the web view finishes loading a page:
- (void)webViewDidFinishLoad:(UIWebView *)webView {
    // desiredURLAddress is an NSString holding the URL to wait for
    NSURL *currentURL = webView.request.mainDocumentURL;
    if ([currentURL.absoluteString isEqualToString:desiredURLAddress]) {
        [self performScraping];
    }
}
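Since the delegate method may fire more than once for a single page (for example when the page contains frames), the same URL check can also be done from inside the page, together with a look at document.readyState. The following is a minimal sketch of that idea, not part of the original approach; the function name pageState and its return values are made up for illustration.

```javascript
// Sketch only: returns a short status string that the native side
// can compare against after injecting the call.
function pageState(doc, loc, expectedURL) {
  if (loc.href !== expectedURL) {
    return 'other-page';   // still on a login or redirect page
  }
  // readyState becomes 'complete' once the document and its
  // sub-resources have finished loading.
  return doc.readyState === 'complete' ? 'ready' : 'loading';
}
```

In the web view the call would be pageState(document, window.location, '...') with the desired address as the last argument, and scraping would start only when the returned string is ready.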
At this point you can perform the actual scraping. Say you want to get the content of a div tag whose id is foo. That's as simple as:
- (void)performScraping {
    NSString *fooContentScript = @"document.getElementById('foo').innerHTML;";
    // Evaluate the scraping script (not the username script used earlier)
    NSString *fooContent = [self.webView stringByEvaluatingJavaScriptFromString:fooContentScript];
}
This stores the innerHTML content of div#foo in the fooContent variable.
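When more than one element is needed, one option is to collect everything in a single injection and return it as a JSON string, since stringByEvaluatingJavaScriptFromString: only ever returns a string. The sketch below assumes that design; the function name scrapeById is hypothetical, and foo is the id from the example above.

```javascript
// Sketch only: gathers the innerHTML of several elements by id and
// returns them as one JSON string; missing elements map to null.
function scrapeById(doc, ids) {
  var result = {};
  for (var i = 0; i < ids.length; i++) {
    var el = doc.getElementById(ids[i]);
    result[ids[i]] = el ? el.innerHTML : null;
  }
  return JSON.stringify(result);
}
```

Injected as scrapeById(document, ['foo']), this would return a string that can be decoded on the Objective-C side, for instance with NSJSONSerialization.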
Bottom line: by injecting JavaScript into a UIWebView, you can control and scrape any web page.
For extra joy, you can perform all of this off screen. To do so, allocate a new UIWindow and add the UIWebView as its subview. If you never make the UIWindow visible, everything described above happens off screen.
Note that this approach is very effective, but it can be resource-intensive, since you load the whole content of each web page. However, this is often a necessary compromise: approaches based on XML parsers are likely to be inadequate, because HTML pages are often malformed and most XML parsers are simply too strict to parse them.