Monday, June 14, 2010

.NET Screen Scraping

1. Introduction
2. Purpose of Screen Scraping
3. Simple Scraping
4. Forms
5. Posted Forms
6. Passing Headers
7. Scraping & Passing Cookies
8. Parse the response object by using regular expressions
9. Parse the response object by using string methods

1. Introduction

Before sharing what I know about screen scraping, I would like to thank Mr. Charan, who helped a lot with the screen-scraping techniques described here.

Go with screen scraping only when you have no option to retrieve the data through an API. I don't like the screen-scraping technique myself, but I still need to work with it. Be warned that it is fragile: if control IDs change on the server side, the client-side scraping code will be affected.


Screen scraping means reading the contents of a web page.

Or

Acquiring data displayed on screen, either by capturing the text manually with the copy command or via software. Web pages are constantly being screen scraped in order to save meaningful data for later use. To perform scraping automatically, you need software written to recognize specific data.

There have been articles online about data scraping; today we will be looking at the different techniques. The WebRequest class is provided for accessing data via the web, and in practice we will be looking at two classes: WebClient, a simpler high-level helper, and HttpWebRequest, which derives from WebRequest and returns its reply as an HttpWebResponse.

Both classes are able to do anything you wish to do; it is more a case of which to use for which job.

Here we will cover everything you would want to do with the two classes and see which comes out best.

2. Purpose of Screen Scraping

You may never have heard of screen scraping, web fetching, or web-data extraction, but if you've ever surfed the internet, you have quite likely been a beneficiary of information acquired using the methods these terms describe. They refer to the increasingly popular practice of methodically retrieving information from the web with specialized tools. Numerous programs, written in many languages, exist for mining data. Such software often helps users intercept HTTP requests and responses by incorporating proxy servers, then displays the pages' source code (HTML, JavaScript, etc.) so users can extract the desired information. In addition, such software can iterate through pages (sometimes thousands of them), gleaning valuable data in various forms.
The goal of scraping websites is to access information, but the uses of that information can vary. Users may wish to store the information in their own databases or manipulate the data.

3. Simple Scraping

Here we are looking at scraping a simple page, where you want to do nothing but get back the page and do not have to pass up any data.

Below I have written a utility for screen scraping using C#.NET.


Simple Scraping:



public string GetResponse()
{
    // requires: using System.IO; using System.Net;
    string url = "http://redbus.in";
    int timeout = 80000;

    HttpWebRequest webRequest = WebRequest.Create(url) as HttpWebRequest;
    if (timeout != 0)
    {
        webRequest.Timeout = timeout; // milliseconds
    }

    // read the whole response body into a string
    StreamReader responseReader = new StreamReader(webRequest.GetResponse().GetResponseStream());
    string strResponse = responseReader.ReadToEnd();
    responseReader.Close();

    return strResponse;
}
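
For comparison, the WebClient class mentioned earlier does the same job in far fewer lines. A minimal sketch, fetching the same URL as above:

using System.Net;

public string GetResponseViaWebClient()
{
    using (WebClient client = new WebClient())
    {
        // DownloadString issues the GET and returns the response body directly
        return client.DownloadString("http://redbus.in");
    }
}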

4. Forms

You have seen how simple it is to scrape any page using either WebClient or HttpWebRequest with the utilities above; today we will be looking at how to pass form data to the page you wish to scrape.

Here we are looking at passing the form data in the query string.


public string GetResponse()
{
    // the form data is appended to the URL as a query string
    string url = "http://redbus.in/a.aspx?a=123";
    int timeout = 80000;

    HttpWebRequest webRequest = WebRequest.Create(url) as HttpWebRequest;
    if (timeout != 0)
    {
        webRequest.Timeout = timeout; // milliseconds
    }

    StreamReader responseReader = new StreamReader(webRequest.GetResponse().GetResponseStream());
    string strResponse = responseReader.ReadToEnd();
    responseReader.Close();

    return strResponse;
}
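
If the query values come from user input, URL-encode them before appending; otherwise spaces and ampersands will break the query string. A small sketch (the parameter name and value here are hypothetical):

// Uri.EscapeDataString handles spaces, ampersands and other reserved characters
string source = "Pune Station"; // hypothetical form value
string url = "http://redbus.in/a.aspx?from=" + Uri.EscapeDataString(source);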


5. Posted Forms

By now you should have scraped a page and scraped the result of a passed form.

Here we are looking at posting the form data.

Download the Fiddler Web Debugger. Use Fiddler to find out which parameters need to be posted, what the referrer URL is, and so on.

public string GetPostResponse()
{
    string postData = "login=a&password=b&commit=Login";
    string url = "http://abcd.com";
    string referer = "http://abcd.com";
    int timeout = 80000;

    HttpWebRequest webRequest = WebRequest.Create(url) as HttpWebRequest;
    webRequest.KeepAlive = true;
    if (referer != null && referer != "")
    {
        webRequest.Referer = referer;
    }
    if (timeout != 0)
    {
        webRequest.Timeout = timeout; // milliseconds
    }

    webRequest.Method = "POST";
    if (postData != null && postData.Length > 0)
    {
        webRequest.ContentType = "application/x-www-form-urlencoded";
        webRequest.ContentLength = postData.Length;

        // write the form fields into the request body
        StreamWriter requestWriter = new StreamWriter(webRequest.GetRequestStream());
        requestWriter.Write(postData);
        requestWriter.Close();
    }

    // and read the response
    StreamReader responseReader = new StreamReader(webRequest.GetResponse().GetResponseStream());
    string responseData = responseReader.ReadToEnd();
    responseReader.Close();

    return responseData;
}
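
One caveat with the code above: ContentLength is a byte count, while postData.Length counts characters, and the two agree only for plain ASCII data. A safer variant, assuming the same postData and webRequest as above (and using System.IO and System.Text), writes the body as bytes:

// encode the form data first, so ContentLength matches the real byte count
byte[] bodyBytes = Encoding.UTF8.GetBytes(postData);
webRequest.ContentLength = bodyBytes.Length;

using (Stream requestStream = webRequest.GetRequestStream())
{
    requestStream.Write(bodyBytes, 0, bodyBytes.Length);
}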


6. Passing Headers

Here we are looking at how to pass values in the headers. Header values go unnoticed by users but can carry important information, such as the browser type.


public HttpWebRequest SetHeaders()
{
    HttpWebRequest webRequest = WebRequest.Create("http://abcd.com") as HttpWebRequest;

    // set the standard header information
    webRequest.Accept = @"image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/vnd.ms-excel, application/vnd.ms-powerpoint, application/msword, */*";
    webRequest.ContentType = @"application/x-www-form-urlencoded";
    webRequest.UserAgent = @"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322; Media Center PC 4.0; .NET CLR 2.0.50727; InfoPath.1)";
    webRequest.KeepAlive = true;

    // return the configured request so the caller can complete it
    return webRequest;
}
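
A minimal sketch of putting the configured request to use. Headers that have no dedicated property can be added through the Headers collection; the "X-Custom-Header" below is a hypothetical example:

HttpWebRequest webRequest = SetHeaders();

// headers without a dedicated property go through the Headers collection
webRequest.Headers.Add("X-Custom-Header", "some-value"); // hypothetical header

StreamReader responseReader = new StreamReader(webRequest.GetResponse().GetResponseStream());
string strResponse = responseReader.ReadToEnd();
responseReader.Close();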

7. Scraping & Passing Cookies

Finally, we are looking at how to pass values in cookies. One of the most important things cookies are used for, and one that can cause trouble when scraping, is session variables.


Once again you will see that WebClient is much simpler than HttpWebRequest: it sets the cookie value directly in the header, whereas HttpWebRequest uses a CookieContainer, which may make things a little clearer but also has more powerful implications.

The best implication of using the CookieContainer is that if you are going to scrape multiple sites you can keep all your cookies in the same container; the request then only passes up the cookies with the corresponding domain.

How to retrieve the Cookie

public CookieContainer GetCookies()
{
    HttpWebRequest webRequest = WebRequest.Create("http://abcd.com") as HttpWebRequest;

    // attach an empty container; the server's Set-Cookie headers fill it
    CookieContainer cookies = new CookieContainer();
    webRequest.CookieContainer = cookies;

    webRequest.Timeout = 80000;

    StreamReader responseReader = new StreamReader(webRequest.GetResponse().GetResponseStream());
    string strResponse = responseReader.ReadToEnd();
    responseReader.Close();

    return cookies;
}
How to Set the Cookie:

public HttpWebRequest SetCookie(CookieContainer cookies)
{
    HttpWebRequest webRequest = WebRequest.Create("http://abcd.com") as HttpWebRequest;

    // reuse the cookies collected earlier; only those whose domain
    // matches the request URL are actually sent
    webRequest.CookieContainer = cookies;

    return webRequest;
}
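
A minimal sketch of the round trip using the two methods above: collect the session cookies once, then reuse the container on later requests so the session survives:

CookieContainer cookies = GetCookies();          // first request stores the session cookies
HttpWebRequest webRequest = SetCookie(cookies);  // later request sends them back

StreamReader responseReader = new StreamReader(webRequest.GetResponse().GetResponseStream());
string strResponse = responseReader.ReadToEnd();
responseReader.Close();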

8. Parse the response object by using regular expressions

I am not teaching regular expressions themselves here. Before using this technique, read up on regex online.

If you want to scrape, you'll have to view the HTML source of the site. Let's take a quick look at the source ...


Here we can clearly see where my 'A' section begins and ends. This is important, because if you want to capture the content on a site, you'll have to find a beginning and an ending marker. Look hard for a unique demarcation: somewhere there is a clear beginning to the content and a clear ending, or you'll end up with a lot of garbage that you don't want.

Once you've become familiar with the HTML source, you're ready to craft a regular expression.

Firing up RegEx

So, with that in mind, we'll fire up the regular expression object, Regex, and parse out the 'A' section quite painlessly.

If you're not a fan of regular expressions, you soon will be. If you've been a Java or C# programmer, you've been spoiled by how nice regular expressions are. If you were a Visual Basic programmer, you were stuck with some crappy OCX or DLL library, or with regular expressions in VBScript that didn't quite work right. Now that .NET is on the scene, have no fear: you'll be using Regex plenty.

Let's take a peek at the regular expression we use to get out the content we want from abcd.com. The <!-- BEGIN --> and <!-- END --> comments below are placeholder markers; substitute whatever unique text brackets the content on your target page:

Regex regex = new Regex("<!-- BEGIN -->((.|\n)*?)<!-- END -->",
    RegexOptions.IgnoreCase);

Look confusing? Naw. It's simple.

We want to get out whatever is between the <!-- BEGIN --> and <!-- END --> markers. The ((.|\n)*?) part of the expression, as foreign and weird as it looks, actually isn't that bad.

The period matches any character except a newline, so the (.|\n) alternation matches any character at all, newlines included. The asterisk and question mark together (*?) tell the RegEx engine to match zero or more occurrences lazily, stopping at the first end marker rather than running on to the last.

It's beyond the scope of this article to delve too deep into regular expressions, but there are plenty of resources out there if you'd like to learn more.
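
To see why the lazy *? matters, here is a small self-contained sketch (the BEGIN/END comment markers are the same placeholders used above). A greedy * would run on to the last marker and swallow everything in between:

using System;
using System.Text.RegularExpressions;

class LazyVsGreedy
{
    static void Main()
    {
        string html = "<!-- BEGIN -->first<!-- END --> junk <!-- BEGIN -->second<!-- END -->";

        // lazy *? stops at the first <!-- END -->
        Match lazy = Regex.Match(html, "<!-- BEGIN -->((.|\n)*?)<!-- END -->");
        Console.WriteLine(lazy.Groups[1].Value);   // first

        // greedy * runs on to the last <!-- END -->
        Match greedy = Regex.Match(html, "<!-- BEGIN -->((.|\n)*)<!-- END -->");
        Console.WriteLine(greedy.Groups[1].Value); // first<!-- END --> junk <!-- BEGIN -->second
    }
}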

Coding our Screen Scraper:

private string getA() {

    StreamReader oSR = null;

    //Here's the work horse of what we're doing, the WebRequest object
    //fetches the URL
    WebRequest objRequest = WebRequest.Create("http://abcd.com");

    //The WebResponse object gets the Request's response (the HTML)
    WebResponse objResponse = objRequest.GetResponse();

    //Now dump the contents of our HTML in the Response object to a
    //StreamReader
    oSR = new StreamReader(objResponse.GetResponseStream());

    //And dump the StreamReader into a string...
    string strContent = oSR.ReadToEnd();
    oSR.Close();

    //Here we set up our regular expression to snatch what's between the
    //BEGIN and END markers (placeholders -- use your page's real delimiters)
    Regex regex = new Regex("<!-- BEGIN -->((.|\n)*?)<!-- END -->",
        RegexOptions.IgnoreCase);

    //Here we apply our regular expression to our string using the
    //Match object.
    Match oM = regex.Match(strContent);

    //Bam! We return the value from our Match, and we're in business.
    //(oM.Value includes the markers; oM.Groups[1].Value would return
    //just the captured content between them.)
    return oM.Value;
}


9. Parse the response object by using string methods

Output HTML:
>PUNE STATION -Maldhakka Chowk [14:05+0]

Here we want to parse the above value by using string methods.


private string ParseTheValue() {

    StreamReader oSR = null;

    //Here's the work horse of what we're doing, the WebRequest object
    //fetches the URL
    WebRequest objRequest = WebRequest.Create("http://abcd.com");

    //The WebResponse object gets the Request's response (the HTML)
    WebResponse objResponse = objRequest.GetResponse();

    //Now dump the contents of our HTML in the Response object to a
    //StreamReader
    oSR = new StreamReader(objResponse.GetResponseStream());

    //And dump the StreamReader into a string...
    string strContent = oSR.ReadToEnd();
    oSR.Close();

    int iPt1 = 0;
    int iPt2 = 0;

    //Find the "[" that marks the end of the value we want (e.g. "[14:05+0]")
    iPt1 = strContent.IndexOf("[", 0);
    if (iPt1 < 0) return string.Empty;

    //Back up to the ">" that precedes the value, then step past it
    iPt1 = strContent.LastIndexOf(">", iPt1);
    iPt1 += ">".Length;

    //Find the "[" again, now searching forward from the start of the value
    iPt2 = strContent.IndexOf("[", iPt1);

    //Everything between ">" and "[" is the text we want; strip the
    //whitespace padding the page puts around it
    return strContent.Substring(iPt1, iPt2 - iPt1).Replace("\r\n\t\t\t\t\t\t\t", "");
}
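
Since the URL above is a stand-in, the quickest way to check the index arithmetic is to run it over an in-memory sample. The markup below is hypothetical but mimics the output HTML shown earlier:

string strContent = "<td>\r\n\t\t\t\t\t\t\tPUNE STATION -Maldhakka Chowk [14:05+0]</td>";

int iPt1 = strContent.IndexOf("[", 0);      // locate the "[" of "[14:05+0]"
iPt1 = strContent.LastIndexOf(">", iPt1);   // back up to the preceding ">"
iPt1 += ">".Length;
int iPt2 = strContent.IndexOf("[", iPt1);   // find the "[" again from the start of the value

string value = strContent.Substring(iPt1, iPt2 - iPt1).Replace("\r\n\t\t\t\t\t\t\t", "");
Console.WriteLine(value); // "PUNE STATION -Maldhakka Chowk " (note the trailing space)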
