Jun 19
Screen Scraping Tutorial using C# .NET
There is some data on some web pages.
My client has asked me to get the data into a database, as quickly as possible (I bill hourly).
Who knows what he's going to use it for: maybe automated content generation, corporate research, or massaging the data for his clients in ways that the data provider can't accomplish. Whatever!
The pages aren't under my control and the data isn't available to me directly but I need it to be.
Screen scrape, web scraping, page scraping, HTML scraping… this is how to do it in .NET.
It's easy. Really, it's a two-step problem-solving process:
- Analysis: How is the data organized, what pieces do you need? Will you use DOM or Regular Expressions to get at it?
- Action: Coding up the DOM walker or Regex.
So for this task I have been given a few Excel sheets with URLs in them somewhere. I will now go extract those URLs.
Getting my list of URLs
You can get a list of target URLs from anywhere. Start crawling the web, find a database, whatever.
OK, I just looked, and these aren't actually Excel files; they are MHTs. Not a problem. I want to get these URLs out quickly, so I'm not going to do anything fancy; this is a one-off utility.
This gives me an opportunity to show you some more scrapey-looking code. Since this is a pre-scrape, I am still going to follow my two-step process here.
- The URLs are on a page in MHT files (MIME-esque), so I will export to HTML. All I need are the URLs, so I will use Regex, since I think that's easiest (I use Rad's tool, it's really quite nice).
- To make the regex I could google [url regular expression], but I don't even need a generic URL pattern. I'll just whip up a custom one for this task using Rad designer.
- Made my pattern (the only variable in the URL is a SKU query string) and fired up grep for Windows:
The first command counted the matches for each file.
The second command makes a file with each line being a URL:
C:\Temp\urls>grep --file=pattern.txt *.htm -E -h -o > urls.txt
It finds all the matches for the pattern I supplied in pattern.txt (I didn't want to include it on the command line) in the .htm files in the current folder. -E enables extended regular expressions, -h hides the file name from the output, and -o shows only the part of the line that matches the pattern. Then I send the output into another text file, urls.txt, which I can use elsewhere.
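If grep isn't handy, the same extraction can be done in a few lines of C#. This is only a rough sketch of what the grep command above does, not the code I actually ran: the class and method names are made up for illustration, and the pattern you pass in is whatever you built for your own URLs.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

// Hypothetical helper: a C# stand-in for "grep -E -h -o".
public static class UrlExtractor
{
    // Returns only the matched text (like grep -o), in the order found.
    public static List<string> ExtractMatches(string text, string pattern)
    {
        var results = new List<string>();
        foreach (Match m in Regex.Matches(text, pattern))
        {
            results.Add(m.Value);
        }
        return results;
    }

    // Scans every *.htm file in a folder and writes one match per line
    // to outputPath. Returns the number of matches written.
    public static int ExtractToFile(string folder, string pattern, string outputPath)
    {
        var all = new List<string>();
        foreach (string file in Directory.GetFiles(folder, "*.htm"))
        {
            all.AddRange(ExtractMatches(File.ReadAllText(file), pattern));
        }
        File.WriteAllLines(outputPath, all);
        return all.Count;
    }
}
```

A call would look something like UrlExtractor.ExtractToFile(@"C:\Temp\urls", pattern, @"C:\Temp\urls\urls.txt"), where pattern is an assumption on my part here, for example @"http://example\.com/course\.aspx\?sku=\w+" for a made-up site whose only variable is a SKU query string.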
OK, now I have my URLs in a text file.
Deciding how to scrape
Once you have your URLs, the first thing is to check out the source of the pages in question. I'll be looking at the DOM as well. (For me the URLs are HTML-encoded, so I'll need to change &amp; to & programmatically when I get to that part.)
For a few minutes I look at the source and the DOM using the IE web developer toolbar. I really love that toolbar. As you navigate through the DOM it flashes a box around the element you've selected. Of course, it also shows all the attributes and styles.
The source is marked up OKAY. It's a table layout, without much CSS. Sort of a bummer: it's not very descriptive, so I'm going to have to look at the text inside each <p> to decide exactly what it is.
Instead of
<h1 id="title">Oracle Financial Analyzer 11i for End-Users</h1>
we have
<p><b> Oracle Financial Analyzer 11i for End-Users</b></p>
Instead of
<div id="description"> <h2>Description</h2> <p>This course introduces learners.. etc</p> </div>
we have
<p><b>Description</b></p> <p>This course introduces learners to the interface and functionality of Oracle Financial Analyzer etc.</p>
If I could rely on an ID or a Class attribute I could process the DOM much more easily. Because each page is laid out exactly the same (I guessed as much before I even looked at the source, because the only difference between the URLs was one query string value), regular expressions will be easier for me to develop. Especially since I'll have easy access to that text.
Hanselman likes to say, "You have a problem so you decide to solve it with regular expressions. Now you have two problems." But I'm into them. Especially for tasks like this.
If I were to walk the DOM, I would use the Html Agility Pack, and I would gain XPath querying but also a bit of work. My goal is to do this task quickly, not necessarily in the most robust way. I think my pages are similar enough that too many errors or failed matches are unlikely. Plus, I don't want to write a switch on the text inside the paragraph elements.
I'm going to start building up the standard .NET scraping code, starting with a new console application project in Visual Studio.
.NET Screen scraping code C#
Note the UserAgent: we have to cast to an HttpWebRequest to set this header and others, like Referer. This code will loop through all the URLs in my text file and get the response from the server.
// Needs: using System; using System.Configuration; using System.IO; using System.Net;
// (plus references to System.Configuration and System.Web)
string urlpath = ConfigurationManager.AppSettings["urlFileName"];
foreach (string urlEncoded in File.ReadAllLines(urlpath))
{
    // the URLs in the file are HTML-encoded, so decode them first
    string url = System.Web.HttpUtility.HtmlDecode(urlEncoded);
    Console.WriteLine(url);
    HttpWebRequest request = (HttpWebRequest) WebRequest.Create(url);
    request.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.4) Gecko/20070515 Firefox/2.0.0.4";
    using (WebResponse response = request.GetResponse())
    {
    }
}
Since I'm using regular expressions I'll be parsing text, so I'll go ahead and read the HTML source out of the response as a string.
string html;
using (HttpWebResponse response = (HttpWebResponse) request.GetResponse())
{
    if ((request.HaveResponse) && (response.StatusCode == HttpStatusCode.OK))
    {
        using (StreamReader sr = new StreamReader(response.GetResponseStream()))
        {
            html = sr.ReadToEnd();
            // find matches
            // store into database
        }
    }
}
Next I get to inspect the source of the response closely and create patterns for the specific entries I want. I also have to decide on the specific entries I want and prepare my database.
Writing the regular expressions
I'll just tell you, this data is course information, like training courses. On the pages I will be scraping we have a course title, a description, objectives, an audience, an outline, a duration, and sometimes a certification. And of course the URL.
Now I'll make the regular expression patterns and find them in my response.
html = sr.ReadToEnd();

// title: the first bold paragraph on the page
string title;
try
{
    Regex titleRegex = new Regex("<p><b>(?<title>.+?)</b></p>");
    title = titleRegex.Matches(html)[0].Groups["title"].Value.Trim();
}
catch (ArgumentOutOfRangeException)
{
    // no match at all on this page
    title = "";
}

// description: the paragraph right after the bold "Description" label
Regex descRegex = new Regex("<p><b>Description</b></p>\n<p>(?<description>.+?)</p>", RegexOptions.Singleline);
string description = descRegex.Match(html).Groups["description"].Value.Trim();
This isn't a Regex tutorial but if you have any questions please leave a comment. I made several other patterns for the rest of the fields.
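Since every field on these pages follows the same "bold label, then a paragraph" shape, the remaining patterns could plausibly be folded into one helper. The sketch below is an illustration under that assumption, not the patterns I actually wrote: CourseParser, ExtractTitle and ExtractSection are hypothetical names, and I've swapped the literal \n for \s* so a \r\n line ending between the two paragraphs doesn't break the match.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for pages laid out as <p><b>Label</b></p> <p>text</p>.
public static class CourseParser
{
    // The first bold paragraph on the page is assumed to be the course title.
    public static string ExtractTitle(string html)
    {
        Match m = Regex.Match(html, @"<p><b>(?<title>.+?)</b></p>");
        return m.Success ? m.Groups["title"].Value.Trim() : "";
    }

    // Returns the paragraph that follows a bold label such as "Description",
    // or an empty string when the label is not on the page.
    public static string ExtractSection(string html, string label)
    {
        string pattern = @"<p><b>" + Regex.Escape(label) + @"</b></p>\s*<p>(?<body>.+?)</p>";
        // Singleline lets "." span line breaks inside a long paragraph.
        Match m = Regex.Match(html, pattern, RegexOptions.Singleline);
        return m.Success ? m.Groups["body"].Value.Trim() : "";
    }
}
```

With this, CourseParser.ExtractSection(html, "Audience") and friends replace one hand-written Regex per field. Checking Match.Success also avoids the try/catch around Matches(html)[0], at the cost of treating "label missing" and "paragraph empty" the same way.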
Storing the results in a database
So now I'll design a table on my local install of SQL Server 2005. Then I'll use SubSonic to generate my data access layer quickly.
I can generate an entire DAL with the push of a button with SubSonic. It's quite hot. It took me about 5 minutes to make this table and the DAL, including configuration.
So now that I have all my fields, the last step is to add them to the database. Here's the SubSonic-powered code:
System.Threading.Thread.CurrentPrincipal = new WindowsPrincipal(WindowsIdentity.GetCurrent());
Scraped.Course.Insert(title, description, objectives, audience, outline, duration, certification, url, "", DateTime.MinValue, null, null, false);
Then I checked my real URLs list, removed a couple duplicates and let this guy fly!
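I removed the duplicates by hand, but the cleanup could also be done in code before the loop starts. A minimal sketch, assuming the list is one HTML-encoded URL per line; UrlList.Clean is a made-up name, and it uses LINQ plus System.Net.WebUtility, which are newer than the framework version I was on.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;

// Hypothetical helper: decode, trim and de-duplicate a list of scraped URLs.
public static class UrlList
{
    public static List<string> Clean(IEnumerable<string> lines)
    {
        return lines
            .Select(line => WebUtility.HtmlDecode(line.Trim())) // &amp; becomes &
            .Where(url => url.Length > 0)                       // skip blank lines
            .Distinct()                                         // exact duplicates only
            .ToList();
    }
}
```

Decoding before calling Distinct matters: an encoded and an unencoded copy of the same URL only look identical after decoding. The comparison is case-sensitive on purpose, since I can't assume the SKU values ignore case.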
select count(*) from courses
309
My client has the data and I have the satisfaction of a job well done.
The possibilities for this are endless.
(if there's any interest at all, I'll give a DOM example in a later post)
June 20th, 2007 at 1:15 pm
You can use the System.Net.WebClient class to make a web request, it will take less lines of code to retrieve the response.
June 20th, 2007 at 5:14 pm
"there is some data on some web pages?" understatement?
July 26th, 2007 at 1:27 pm
I had a similar project to this. I'm curious to see how you would have handled using DOM. Have you posted a DOM example? If not, I would like to see one. Thanks!