Crawling The Web: A Software How-To

How To Make A Spiderbot

The heavy lifting in this program is done by a third-party component, Spider, from Chilkat Software. Did I mention it is a free component? That leaves you, the programmer, free to concentrate on the fun stuff, like putting Spider into action.

Quake USA Internet Spider

There are quite a few options: the spider can run free from a starting URL, or it can target specific search words and phrases. Note the four boolean values acting as flags for the different options, and a string array of domain locations used as an exclusion list; these denote localized web sites to exclude from the search. Next I define a custom object to serve as the TreeViewItem. The treeview holds a record of the URLs that satisfy the search criteria.

The program branches one of two ways, into either public void LimitCrawl() or public void GoCrawl(), the unlimited crawler. LimitCrawl(), as the name implies, handles the crawling whenever any of the option switches are used.

Initialization is straightforward: Chilkat.Spider spider = new Chilkat.Spider(); Chilkat uses its own string array to handle the domains processed and the seed URLs (see below). Spider also has the ability to avoid certain outbound link patterns, such as "*.comcast.*". These are added directly to the Spider control, and Spider will automatically ignore any links that match a pattern.
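Chilkat handles the avoid-pattern matching internally, but the idea is easy to illustrate. Here is a minimal Python sketch (the project itself is C#) using wildcard matching to decide whether an outbound link should be ignored; only the "*.comcast.*" pattern comes from the article, the second pattern is made up for the demo.

```python
from fnmatch import fnmatch

def is_avoided(url, avoid_patterns):
    """Return True when the URL matches any wildcard avoid pattern."""
    return any(fnmatch(url, pattern) for pattern in avoid_patterns)

# "*.comcast.*" is from the article; "*.doubleclick.*" is a made-up example.
avoid = ["*.comcast.*", "*.doubleclick.*"]
print(is_avoided("http://www.comcast.net/index.html", avoid))  # True
print(is_avoided("http://www.reuters.com/news", avoid))        # False
```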

The SQL database, CRAWLER.mdf, is used to track the unique URLs the Spider comes across while crawling. It harvests any links from the page it is currently crawling and saves them to the database as unique URLs known as seed URLs. As the crawler comes to the end of each link it is crawling, it retrieves a seed URL to begin crawling on. The cycle continues until all seed URLs are used up.
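The seed-URL cycle described above amounts to a breadth-first crawl with a uniqueness check. Here is a small Python sketch of that loop; the real program is C# and persists URLs to CRAWLER.mdf, so a set and a queue stand in for the database here, and the tiny link graph is invented for the demo.

```python
from collections import deque

def crawl(start_url, get_links):
    """Breadth-first crawl: harvest links from each page, queue unseen ones
    as seed URLs, and continue until every seed URL has been used up."""
    seen = {start_url}          # stands in for the unique-URL table in CRAWLER.mdf
    seeds = deque([start_url])  # pending seed URLs
    order = []
    while seeds:
        url = seeds.popleft()
        order.append(url)
        for link in get_links(url):   # links harvested from the current page
            if link not in seen:      # only unique URLs become new seeds
                seen.add(link)
                seeds.append(link)
    return order

# toy link graph standing in for the live web
graph = {
    "a.com": ["b.com", "c.com"],
    "b.com": ["c.com"],
    "c.com": ["a.com"],
}
print(crawl("a.com", lambda u: graph.get(u, [])))  # ['a.com', 'b.com', 'c.com']
```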

Where the crawler goes depends on the options selected on the main form. If you search on "Goats" and "Faint", the crawler will only save (that is, add to the treeview collection) sites that meet that search criteria. It searches the content, keywords, and title for the search terms; if they are found, the link is added to the treeview collection using the custom TreeViewItem, along with the title, description, keywords, and whatever specifically triggered the entry to be saved.
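The article doesn't spell out the exact matching rule, but a simplified version of the title/keywords/description check might look like this Python sketch. The rule that every search term must appear in a single field is my assumption, not the article's, and the function names are hypothetical.

```python
def match_page(search_terms, title, description, keywords):
    """Return the name of the first field containing every search term,
    or None when nothing matches. (One-field rule is an assumption.)"""
    fields = (("title", title), ("keywords", keywords), ("description", description))
    for name, text in fields:
        lowered = text.lower()
        if all(term.lower() in lowered for term in search_terms):
            return name  # records what triggered the entry being saved
    return None

print(match_page(["Goats", "Faint"],
                 "Fainting goats of Tennessee", "", ""))  # title
```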

It is important to note that the starting URL is completely editable. I only started with "reuters.com" as a test to determine that the search engine was functioning properly. It makes sense to use a starting URL that belongs to a large domain that will not only have the information you are searching for, but will also provide links to other pertinent sites.

Support routines like private long FindSite(string SiteURL, bool IsBaseDomain) and private bool FindSitePage(long SiteID, string URL_To_Find) determine whether the site and/or page already exists in the database, to prevent duplication. private long AddSite(string URL, string Title, string Description) and private bool AddSitePage(long ParentID, string URL, string Title, string Description) add the site and associated pages to the database if they don't exist. The keyword support routines serve the same purpose: to manage and associate keywords with their respective sites.
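The find-then-add pattern behind FindSite/AddSite can be sketched in Python with a dictionary standing in for the database table. The names mirror the C# routines, but the id scheme is invented, and the sketch only tracks URLs (not the title and description columns).

```python
sites = {}       # URL -> site id; stands in for the Sites table in CRAWLER.mdf
_next_id = 1

def find_site(url):
    """Analogue of FindSite: return the stored id, or 0 when the site is unknown."""
    return sites.get(url, 0)

def add_site(url, title, description):
    """Analogue of AddSite: insert the site only if it doesn't already exist."""
    global _next_id
    site_id = find_site(url)
    if site_id:
        return site_id            # already stored; no duplicate row
    site_id = _next_id
    _next_id += 1
    sites[url] = site_id
    return site_id

print(add_site("http://reuters.com", "Reuters", "News"))  # 1
print(add_site("http://reuters.com", "Reuters", "News"))  # 1 again, no duplicate
```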

bool bool_CheckWebSiteOrigin(string URL) determines whether the given URL ends with any of the domain extensions to skip. Finally, there is a quick message handler used throughout the program for Error, Warning, and General messages.
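Here is a minimal Python sketch of that origin check, assuming it returns True when the URL's host ends with one of the extensions to skip; the skip list below is hypothetical, standing in for the exclusion array of localized sites mentioned earlier.

```python
from urllib.parse import urlparse

def check_website_origin(url, skip_extensions):
    """Return True when the URL's host ends with a domain extension to skip."""
    host = urlparse(url).netloc.lower()
    return any(host.endswith(ext) for ext in skip_extensions)

skip = [".de", ".fr", ".co.uk"]   # hypothetical localized-site exclusions
print(check_website_origin("http://example.de/page", skip))   # True
print(check_website_origin("http://example.com/page", skip))  # False
```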

So enjoy the code, and be thankful for companies like Chilkat Software that give away a powerful and easy-to-use control.
You can find the code here: Spider

End Of Line Man!