mhinze.com archive

this is an archive of the old blog, ended 6/16/08




    Archive for the '.NET' Category

    HtmlAgilityPack User Agent / HttpWebRequest object

    Tuesday, November 6th, 2007
    using System;
    using HtmlAgilityPack;
    using System.Net;
    
    class LbHtmlWeb : HtmlWeb
    {
        private readonly string USER_AGENT =
            "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)";
    
        public LbHtmlWeb()
        {
            this.PreRequest += delegate(HttpWebRequest request)
            {
                request.UserAgent = USER_AGENT;
                return true;
            };
        }
    
        //…
    }

     

    Now that was easy!  Pretty spiffy that HtmlWeb exposes that event.

    Please let me at the HttpWebRequest.

    Don't do this (RSS Toolkit, I'm talking to you):

    XmlDocument doc = new XmlDocument();
    doc.Load(url);

     

    =(
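    What I'd rather see from a library is something closer to the sketch below.  FeedLoader, CreateRequest and the configure callback are hypothetical names made up for illustration; they are not part of RSS Toolkit or any other library.  The only point is that the caller gets to touch the HttpWebRequest before it goes out.

```csharp
using System;
using System.IO;
using System.Net;
using System.Xml;

static class FeedLoader
{
    // Hypothetical helper: builds the request and hands it to the caller
    // before anything is sent, so headers like UserAgent can be set.
    public static HttpWebRequest CreateRequest(string url, Action<HttpWebRequest> configure)
    {
        HttpWebRequest request = (HttpWebRequest) WebRequest.Create(url);
        if (configure != null)
        {
            configure(request);
        }
        return request;
    }

    // Sends the request and parses the response body as XML.
    public static XmlDocument Load(HttpWebRequest request)
    {
        using (WebResponse response = request.GetResponse())
        using (Stream stream = response.GetResponseStream())
        {
            XmlDocument doc = new XmlDocument();
            doc.Load(stream);
            return doc;
        }
    }
}
```

    Usage would then be along the lines of FeedLoader.Load(FeedLoader.CreateRequest(url, delegate(HttpWebRequest r) { r.UserAgent = "my agent"; })).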

    asp.net 2.0 anthology book review

    Thursday, November 1st, 2007

    Jeff Atwood offered free copies of his book in exchange for a review, and I emailed him and got one.  In the midst of getting married, honeymooning, and catching up after returning, I've spent a few weeks going over it  - I have read every word and grokked every example.

    In short my review is positive.  A good book.  Worth reading if you are an intermediate level web developer on the asp.net platform.  Possibly a great book, if your skills are at a certain sweet spot: maybe you are moving towards a more complete and nuanced understanding of software development but haven't quite matured, or you are a self-taught developer about to start your first "real" project. 

    ASP.NET Anthology would make a great gift for the Morts you work with.

    Before I get too into it, here are some links:

    the book

    Written by four prolific and popular bloggers, who differ widely, in my opinion, not only in swarmth but also in skill: Allen, Atwood, Galloway, Haack

    It was fun guessing which author was writing each section of the book - based on the content of their weblogs over the past year, but also based on style, examples and screenshots.

    The chapter on performance and optimization is available online.  It models how the other chapters are organized: problem, solution, discussion, which turns out not to be a bad way to organize a book like this.  I don't know if this is a SitePoint model or if the authors came up with it, but it works great.  It was fine reading straight through, and I'm sure it will be fine as I later reference different sections.

    You can also download a pdf with chapters 4 (GridView hacking) and 9 (web standards).

    if i learn one thing, it was worth it

    That's what these technical books are all about anyway, right?  Trudging through muddy text and inapplicable examples hoping to glean three or four juicy bits of knowledge.

    I had several of these flashes while reading this book.  Like most geek books, the bulk of Anthology is mediocre, but the peaks are not rare and quite pleasing! Just a few of the most notable, for me:

    • web deployment projects
    • conditional string formatting
    • per-request state
    • the discussion on reacting to hotlinked images
    • SimpleMailWebEventHandler and handling errors via HttpModule 

    some negative feedback

    The timing of this book's release is unfortunate: so much that is exciting is happening right now in the [asp.net] web development space (MVC, jQuery, etc.), and the book is incomplete in reflecting that excitement.

    Don't get me wrong.  This isn't necessarily a Mort handbook - we talk Subversion and SubSonic, etc. - but it really isn't alt.net either.  For example, the data access chapter is Mort-ish through and through (and easily bested by the data access tutorials), but there's fair to good SubSonic stuff in the advanced topics chapter.

    I detect a musty whiff when the answer to "how do I create a data access layer" is the DAAB.

    The biggest loser however, is the membership chapter with its unreadable, heartbreaking prose.  This particular snippet (p. 231) from the introduction to a "how can my users recover their password"-section shattered my immersion so indelicately that I had to put the book down and resume reading later:

    It happens to everyone at some point, and it's embarrassing.  No, I'm not talking about that!  I'm talking about forgetting your password.

    A web site that doesn't provide an [sic] means by which users can retrieve forgotten passwords automatically is just asking to be inundated with tech support calls.  Fortunately, the ASP.NET team has you covered!

    ouch.

    There is really good prose though. The introduction to chapter 9, ASP.NET and Web Standards, is something I might reference when explaining web standards to clients.

    So the writing is sometimes awful and the information is a little out of date and available for free on the internet

    I was tempted, this being a linkblog and all, to simply go through the table of contents and link out to web pages where each topic is addressed more comprehensively.  It's possible, certainly, but it would have been a low blow.

    Another small gripe: some source code is formatted in a way I don't like.  p. 255 has an example where the SendEmail event is defined below a method that invokes it.  I usually want to have my events defined above their usage (at least in a book!), because I read source top-down (am I crazy?).  This is a minor quibble, but it bothered me throughout the book.

    recommended

    But it's nice to see guys who are interested in showing you how to do something in a cool or correct way, rather than just relating a concept or procedure. 

    This book is not mind-numbing - in fact it motivated me to implement some new things in my projects, or showed me some new ways to implement things I thought would be more difficult.

    I suppose the best compliment I have for this book is that at several points while reading I noted that writing up what I had just learned would make a great blog post - in other words, I thought you would enjoy what I had just read.

    other reviews:

    Travis
    Brad
    idunno.org

    If you want me to review your book, leave a comment (I won't publish the comment, but I'll get an email) and I'll send you my mailing address.

    Simple Scraping with SubSonic.Sugar

    Friday, June 29th, 2007

    SubSonic.Sugar has some handy utilities.  Highly recommended.  The source is worth browsing too.

    static void Main()
    {
        foreach (string link in SubSonic.Sugar.Web.ScrapeLinks("http://mhinze.com/", false))
        {
            Console.WriteLine(link);
        }
    }

     


    Search Query Term Highlighter - HttpModule for ASP.NET

    Thursday, June 21st, 2007

    Sometimes I find it useful when, as I'm searching, result documents are presented with my search terms highlighted.  Like they do in Google's cache.

    So I decided to write an HttpModule to handle the highlighting.  I also thought it would be useful to log these terms.  This module can serve that purpose quite nicely with a little modification.

    I found an open source solution so I adapted it.  I ported most of the regular expressions and some of the logic from SEHL, the search engine query terms highlighter for PHP.

    Download: Search Term Highlighter HttpModule for ASP.NET

    To use this HttpModule in any ASP.NET web site, place the class file in your App_Code folder and add this to web.config in the system.web section:

    <httpModules>
        <add type="SearchTermHighlighterModule" name="SearchTermHighlighterModule"/>
    </httpModules>

     

    The HTML source it produces looks like this:

    <span class="highlight-search-query">Ishmael</span>

    So you'll need to include the style definition in your stylesheet:

    .highlight-search-query
    {
        background-color:Lime;
    }

     

    to get this (searching for [ishmael "some years ago" the]):

     
    image 
     

    One item left on my TODO list is to use multiple CSS classes, so that each term can have its own color.
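    For the curious, the heart of the highlighting idea can be sketched in a few lines.  This is not the code from the download: TermHighlighter and Highlight are hypothetical names, and the sketch assumes it is handed plain, already-encoded text rather than raw HTML (run naively over markup, it would also match inside tags and attributes).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class TermHighlighter
{
    // Wraps every case-insensitive occurrence of any term in the
    // highlight span. Assumes 'text' contains no HTML tags.
    public static string Highlight(string text, IEnumerable<string> terms)
    {
        List<string> escaped = new List<string>();
        foreach (string term in terms)
        {
            if (!string.IsNullOrEmpty(term))
            {
                // Escape so that terms like "c++" are matched literally.
                escaped.Add(Regex.Escape(term));
            }
        }

        if (escaped.Count == 0)
        {
            return text; // nothing to highlight
        }

        string pattern = string.Join("|", escaped.ToArray());
        return Regex.Replace(
            text,
            pattern,
            "<span class=\"highlight-search-query\">$0</span>", // $0 = the whole match
            RegexOptions.IgnoreCase);
    }
}
```

    Giving each term its own color would mean replacing the single replacement string with a MatchEvaluator that picks a class per term.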

     

     


    Screen Scraping Tutorial using C# .NET

    Tuesday, June 19th, 2007

    There is some data on some web pages. 

    My client has asked me to get the data into a database.  As quickly as possible (I bill hourly)

    Who knows what he's going to use it for... maybe automated content generation, or corporate research, or to massage the data for his clients in ways that the data provider can't accomplish.  Whatever!

    The pages aren't under my control and the data isn't available to me directly but I need it to be.

    Screen scrape, web scraping, page scraping, HTML scraping… this is how to do it in .NET.

    It's easy.  Really, it's a two-step problem-solving process.

    1. Analysis: How is the data organized, what pieces do you need?  Will you use DOM or Regular Expressions to get at it?
    2. Action: Coding up the DOM walker or Regex.

    So for this task I have been given a few Excel sheets with URLs in them somewhere.  I will now go extract those URLs.

    Getting my list of URLs

    You can get a list of target URLs from anywhere.  Start crawling the web, find a database, whatever.

    OK, I just looked, and these aren't actually Excel files; they are MHTs.  Not a problem.  I want to get these URLs out quickly, so I'm not going to do anything fancy: this is a one-off utility.

    This will give me an opportunity to show you some more scrapey-looking code.  Since this is a pre-scrape, I am still going to follow my two-step process here.

    1. The URLs are on a page in MHT files (MIME-esque), so I will export to HTML.  All I need are the URLs, so I will use Regex since that's easiest, I think (I use Rad's tool, it's really quite nice).
    2. To make the regex I could google [url regular expression], but I don't even need a generic URL pattern.  I'll just whip up a custom one for this task using Rad's designer.
      I made my pattern (the only variable in the URL is a SKU query string) and fired up grep for Windows:

    grep output

    The first command counted the matches for each file.

    C:\Temp\urls>grep --file=pattern.txt *.htm -E -h -o > urls.txt

    The second command makes a file with each line being a URL. 

    It found all the matches for the pattern I supplied in pattern.txt (I didn't want to include it on the command line) in the .htm files in the current folder.  -E is extended regexp, -h hides the file name from the output, and -o shows only the part of the line that matches the pattern.  Then I sent the output into another text file, urls.txt, which I can use elsewhere.
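    If grep isn't handy, roughly the same pre-scrape can be sketched in C#.  UrlGrep and FindMatches are hypothetical names, and the example.com pattern in the usage note is a stand-in, not the pattern actually used for this job.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class UrlGrep
{
    // Returns only the matching part of each input text,
    // similar in spirit to grep -o -h.
    public static List<string> FindMatches(IEnumerable<string> texts, string pattern)
    {
        Regex regex = new Regex(pattern);
        List<string> results = new List<string>();
        foreach (string text in texts)
        {
            foreach (Match match in regex.Matches(text))
            {
                results.Add(match.Value);
            }
        }
        return results;
    }
}
```

    To use it, read each exported file with File.ReadAllText, pass the contents in along with a pattern such as @"http://example\.com/course\.aspx\?sku=\w+", and write the results out with File.WriteAllLines.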

    Ok, now I have my URLs in a text file:

    image

     

    Deciding how to scrape

    Once you have your URLs, the first thing is to check out the source of the pages in question.  I'll be looking at the DOM too.  (For me the URLs are HTML-encoded, so I'll need to change &amp; to & programmatically when I get to that part.)

    For a few minutes I look at the source and the DOM using the IE web developer toolbar.  I really love that toolbar.  As you navigate through the DOM it flashes a box around the element you've selected, and of course it shows all the attributes and styles.

    The source is marked-up OKAY.  It's a table layout.  Not much CSS.  Sort of a bummer, it's not very descriptive, I'm going to have to look at the text inside each <p> to decide exactly what it is.

    Instead of

    <h1 id="title">Oracle Financial Analyzer 11i for End-Users</h1>

    we have

    <p><b> Oracle Financial Analyzer 11i for End-Users</b></p>

    Instead of

    <div id="description">
    <h2>Description</h2>
    <p>This course introduces learners.. etc</p>
    </div>

    we have

    <p><b>Description</b></p>
    <p>This course introduces learners to the interface and functionality of Oracle Financial Analyzer etc.</p>

     

    If I could rely on an ID or a Class attribute I could process the DOM much more easily.  Because each page is laid out exactly the same (I guessed as much before I even looked at the source, because the only difference between the URLs was one query string value), regular expressions will be easier for me to develop.  Especially since I'll have easy access to that text.

    Hanselman likes to say, "You have a problem so you decide to solve it with regular expressions.  Now you have two problems."  But I'm into them.  Especially for tasks like this.

    If I were to walk the DOM, I would use the Html Agility Pack, and I would gain XPath querying but also a bit of work.  My goal is to do this task quickly, not necessarily in the most robust way.  I think my pages are similar enough that too many errors or failed matches are unlikely.  Plus, I don't want to write a switch on the text inside the paragraph elements.

    I'm going to start building up the .NET scraping standard code.  Starting a new console application project in Visual Studio.

    .NET Screen scraping code C#

    Note the UserAgent: we have to cast to an HttpWebRequest to set this header and others, like Referer.  This code will loop through all the URLs in my text file and get the response from the server.

    string urlpath = ConfigurationManager.AppSettings["urlFileName"];
    foreach (string urlEncoded in File.ReadAllLines(urlpath))
    {
        string url = System.Web.HttpUtility.HtmlDecode(urlEncoded);
        Console.WriteLine(url);
        HttpWebRequest request = (HttpWebRequest) WebRequest.Create(url);
        request.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.4) Gecko/20070515 Firefox/2.0.0.4";
        using (WebResponse response = request.GetResponse())
        {
    
        }
    }
     

    Since I'm using Regular expressions I'll be parsing text.  I'll go ahead and get the text HTML source out of the response.

        string html;
        using (HttpWebResponse response = (HttpWebResponse) request.GetResponse())
        {
            if ((request.HaveResponse) && (response.StatusCode == HttpStatusCode.OK))
            {
                using (StreamReader sr = new StreamReader(response.GetResponseStream()))
                {
                    html = sr.ReadToEnd();
    
                    // find matches
                    // store into database
                }
            }
        }

     

    Next I get to inspect the source of the response closely and create patterns for the specific entries I want.  I also have to decide on the specific entries I want and prepare my database.

    Writing the regular expressions

    I'll just tell you, this data is course information, like training courses.  On the pages I will be scraping we have a course title, a description, objectives, an audience, an outline, a duration, and sometimes a certification.  And of course the URL.

    Now I'll make the regular expression patterns and find them in my response. 

    html = sr.ReadToEnd();
    
    
    // title
    string title;
    try
    {
        Regex titleRegex = new Regex("<p><b>(?<title>.+?)</b></p>");
        title = titleRegex.Matches(html)[0].Groups["title"].Value.Trim();
    }
    catch (ArgumentOutOfRangeException)
    {
        title = "";
    }
    
    // description
    Regex descRegex = new Regex("<p><b>Description</b></p>\n<p>(?<description>.+?)</p>", RegexOptions.Singleline);
    string description = descRegex.Match(html).Groups["description"].Value.Trim();
    

    
    

    This isn't a Regex tutorial but if you have any questions please leave a comment.  I made several other patterns for the rest of the fields.
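    Since the remaining fields follow the same label-then-paragraph layout as the description, one way to avoid a near-identical Regex per field is a small helper like the sketch below.  CourseFields and ExtractSection are hypothetical names, and the sketch assumes every field really does sit in the <p> right after a bold label.

```csharp
using System;
using System.Text.RegularExpressions;

static class CourseFields
{
    // Returns the text of the paragraph that follows a bold label such as
    // <p><b>Audience</b></p>, or "" when the label is not on the page.
    public static string ExtractSection(string html, string label)
    {
        string pattern =
            "<p><b>\\s*" + Regex.Escape(label) + "\\s*</b></p>" + // the label paragraph
            "\\s*<p>(?<value>.+?)</p>";                           // the paragraph after it
        Match match = Regex.Match(html, pattern, RegexOptions.Singleline);
        return match.Success ? match.Groups["value"].Value.Trim() : "";
    }
}
```

    With that in place, each field becomes a one-liner such as CourseFields.ExtractSection(html, "Audience").  Using \s* between the two paragraphs also tolerates \r\n line endings, which a literal \n would not.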

    Storing the results in a database

    So now I'll design a table on my local install of SQL Server 2005.  Then I'll use SubSonic to generate my data access layer quickly.

    Courses table

    I can generate an entire DAL with the push of a button with SubSonic.  It's quite hot.  It took me about 5 minutes to make this table and the DAL, including configuration.

    So now that I have all my fields the last step is to add them to the database.  Here's the subsonic powered code:

    System.Threading.Thread.CurrentPrincipal =
        new WindowsPrincipal(WindowsIdentity.GetCurrent());
    Scraped.Course.Insert(title, description,
        objectives, audience, outline,
        duration, certification, url,
        "", DateTime.MinValue, null,
        null, false);

     

    Then I checked my real URLs list, removed a couple duplicates and let this guy fly!

    select count(*) from courses

    309

    My client has the data and I have the satisfaction of a job well done.

    The possibilities for this are endless. 

     

    (if there's any interest at all, I'll give a DOM example in a later post)

     

    Where is the Row in RowCommand? DisplayIndex

    Saturday, June 9th, 2007

    So I sat down to do a little "data-driven web development with ASP.NET 2.0" today. 

    GridView, check.  DataSource, DataBind, check. 

    Next I wanted to add a button to each row to perform an action on the item the row represented. 

    So I added my ButtonField and ran into this problem: GridViewCommandEventArgs doesn't expose the Row. (!)

    Fair enough, I'll just add a Template Column and set the CommandArgument like Rick Strahl explains.

    But that didn't work either - actually there are a couple of problems.  First, I needed two fields, or the entire row, not just the PK.  Second, I want it to eventually handle paging, and it has some concurrency issues because RowIndex no worky-worky: it looks at the data source and not at the grid on the page.

    Also, I have multiple grids firing this event handler, so I didn't ever want to reference a specific instance of the GridView.

    Well at first I tried hard coding both fields into the CommandArgument and then splitting them up in the code-behind:

    <asp:gridview id="GridView1" runat="server" onrowcommand="Grid_Command">
        <Columns>
            <asp:TemplateField ShowHeader="False">
                <ItemTemplate>
                   <%-- in the ghetto, below --%>
                   <asp:LinkButton ID="Link1" runat="server"
                        CausesValidation="false"
                        CommandName="Cmd"
                        CommandArgument='<%#Eval("ID") + ":" + Eval("Name") %>'
                        Text="DoStuff">
                        </asp:LinkButton>
                </ItemTemplate>
            </asp:TemplateField>
        </Columns>
    </asp:gridview>

     

    protected void Grid_Command(object sender, GridViewCommandEventArgs e)
    {
        if (e.CommandName == "Cmd")
        {
            // breaks as soon as a Name contains ':'
            string[] arg = e.CommandArgument.ToString().Split(':');
            int id = int.Parse(arg[0]);
            string name = arg[1];
            DoStuff(id, name);
        }
    }
     

    The above works great, and I should be commended for my hackery.  But it's WRONG!

    So the solution to this unfortunate hubbub lies with a little known property called DisplayIndex, which gives you the index of the selected row on the page.

    Also, make sure you set the DataKeyNames property of the GridView.

    <asp:GridView ID="GridView1" runat="server" OnRowCommand="Grid_Command" DataKeyNames="ID,Name">
        <Columns>
             <asp:TemplateField ShowHeader="False">
                <ItemTemplate>
                    <asp:LinkButton ID="Link1" runat="server"
                        CausesValidation="false"
                        CommandName="Cmd"
                        CommandArgument='<%# Container.DisplayIndex %>'
                        Text="DoStuff">
                    </asp:LinkButton>
                </ItemTemplate>
            </asp:TemplateField>
         </Columns>
    </asp:GridView>
     
     

    protected void Grid_Command(object sender, GridViewCommandEventArgs e)
    {
        if (e.CommandName == "Cmd")
        {
            GridView gv = (GridView) sender;
            int index = int.Parse(e.CommandArgument.ToString());
            int id = (int) gv.DataKeys[index]["ID"];
            string name = (string) gv.DataKeys[index]["Name"];
            DoStuff(id, name);
        }
    }

     

    Not new, but the hotness!

     

     


    .NET Tip: Active Directory's pwdLastSet to DateTime in C#

    Thursday, April 19th, 2007

    to get a DateTime representing the last time a user's password was changed in active directory:

     

    IADsLargeInteger plsVal = (IADsLargeInteger) de.Properties["pwdLastSet"].Value;
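    // LowPart is a signed int in the interop type; treat it as unsigned so a set high bit doesn't subtract
    long filetime = ((long) plsVal.HighPart << 32) + (uint) plsVal.LowPart;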
    DateTime PasswordLastSet = DateTime.FromFileTime(filetime);
     

    converted from just one of the (vb.net) tidbits i found today in a truly most excellent wrapper for active directory…  just what i was looking for.  That is ActiveDs.IADsLargeInteger btw.
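    One thing to watch when combining the two halves: in the ActiveDs interop type both HighPart and LowPart are signed 32-bit ints, so a LowPart with its high bit set gets added as a negative number unless it is reinterpreted as unsigned.  A minimal sketch of the arithmetic, with LargeInteger and ToInt64 as hypothetical helper names:

```csharp
using System;

static class LargeInteger
{
    // Combines the two signed 32-bit halves into one 64-bit value.
    // The low half is reinterpreted as unsigned so that a set high bit
    // is not treated as a minus sign.
    public static long ToInt64(int highPart, int lowPart)
    {
        return ((long) highPart << 32) + (uint) lowPart;
    }
}
```

    The result can be passed straight to DateTime.FromFileTime, e.g. DateTime.FromFileTime(LargeInteger.ToInt64(plsVal.HighPart, plsVal.LowPart)).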

     

     


    sharepoint 2003: portal listings in code

    Thursday, April 12th, 2007

    seems simple enough huh?  well i am learning as i go on this 2003 object model stuff and couldn't find anything on this easily. so maybe this will help someone.

    code to get a portal listing, portal listings

    using Microsoft.SharePoint.Portal;
    using Microsoft.SharePoint.Portal.SiteData;
     

    Area area = AreaManager.GetArea(PortalContext.Current, new Guid(areaGuid));
    foreach (AreaListing listing in area.Listings)
    {
        // do stuff
    }
     
     

    sharepoint 2003 OM code to sort Link List items

    Friday, April 6th, 2007

    weekending in east texas

    and ive had just about enough of the internets and asp.net and sharepoint and seo and all that for a few days.

     

    so thank you for keeping up with me here and have a very happy Easter.  see you next week.

     

    by the way, i wrote a little class yesterday.  might be useful if you want to sort or order the SPListItems in a "Links" List in sharepoint 2003. 

    yes that's a float - a float not an int!  =)

     

        #region SPListItemOrderer : IComparer
        public class SPListItemOrderer : IComparer
        {
            public int Compare(object x, object y)
            {
                SPListItem sx = x as SPListItem;
                SPListItem sy = y as SPListItem;

                if (sx == null || sy == null)
                    throw new ArgumentException("Object is not of type SPListItem");

                // both items need an Order value; checking with || would let a null slip through
                if (sx["Order"] == null || sy["Order"] == null)
                    throw new ArgumentException("SPListItem does not have an order field");

                return float.Parse(sx["Order"].ToString()).CompareTo(float.Parse(sy["Order"].ToString()));
            }
        }
        #endregion

     

     

    gah, oh ok, ok.. here's the usage:

     

     
    // get an array from the collection so we can sort it
    ArrayList items = new ArrayList();
    foreach (SPListItem item in theList.Items)
    {
        items.Add(item);
    }
    SPListItem[] sortableItems = (SPListItem[]) items.ToArray(typeof(SPListItem));
    try
    {
        Array.Sort(sortableItems, new SPListItemOrderer());
    } catch { /* do not sort */ }

     

    just in case you want to do something like this (by the way, that enumerating-the-fields code was a pain!):

     
    foreach (SPListItem si in sortableItems)
    {
        try
        {
            // debugging stuff here
            //foreach (SPField f in si.Fields)
            //{
            //    try
            //    {
            //        strOut += string.Format("<!-- {0} : {1} -->\n", f.Title, si[f.Title].ToString());
            //    }
            //    catch { }
            //}
    
            // the URL item has two fields, comma separated: "url, description"
            // split on the first comma only; TrimStart(char[]) strips a character set, not a prefix
            string[] parts = si["URL"].ToString().Split(new char[] { ',' }, 2);
            string itemurl = System.Web.HttpUtility.UrlPathEncode(parts[0].Trim());
            string itemname = parts.Length > 1 ? parts[1].Trim() : "";
            strOut += string.Format("<tr><td><a href=\"{0}\">{1}</a></td></tr>\n", itemurl, itemname);
        }
        catch { /* do not create markup for this link */ }
    }

     

     


    clear or remove query string on postback… nifty but ghetto

    Wednesday, March 28th, 2007

     
     
    if (!Page.IsPostBack)
    {
       // set defaults, hide yer panels, etc here
    
       if (Request.QueryString["q"] != null) // if the query string exists
       {
           // aspnetForm - default masterpage form name, YourPage.aspx - whatever your page name is
           ClientScript.RegisterStartupScript(this.GetType(), "qr", 
                "document.all(\"aspnetForm\").action = \"YourPage.aspx\";", true);

           // do stuff with query string
           RedrawPage(Request.QueryString["q"]);
       }
    }

     

    via msdn

     

    so if you have a page that accepts a query string, you definitely want to set that view

    but this is asp.net, so your page also posts back

    you don't want the query string to persist across postbacks - that's just ugly, breaking bookmarks, etc.  you want to clear the query string but maybe you are trying Clear() or Remove() and you are realizing that Request.QueryString is Read Only..

    one option is to set the form's action in client script (Googlebot's not posting back anyway)

    is there a better way to do this?  it seems so ghetto

    by the way, comments on this blog are NOT nofollowed

     


    © 2007 mhinze.com