I had a few questions about this recently, so I thought I would write a quick post on it. It’s actually pretty simple to set up, but I know people like to see instructions with pictures. It’s really not much different from setting up any other content source, such as a file share. There are a few obvious things you have to take care of though. First, you need to make sure that the index server has access to the site it’s crawling. That means if you are behind a firewall, or you need to access your public facing web site using a different URL, you need to take that into consideration. We’ll talk about that more in a minute.
To crawl your web site, you first go to your content sources page to create a new content source. Give your content source a name, select web site, and type in the URL of the web site that you want to crawl. When you choose this option, the crawler acts as a simple spider, following each link it can find and adding the page to the index. In this example, I am going to crawl DotNetMafia.com.
You also have the capability to set how many links the spider will follow when crawling and whether or not server hops are allowed.
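To make it concrete, here is a minimal sketch of what that spidering behavior looks like conceptually: a breadth-first crawl with a depth limit and an option to allow or disallow server hops. This is not SharePoint’s actual crawler; the page graph and URLs below are hypothetical stand-ins for real HTTP fetches.

```python
from collections import deque
from urllib.parse import urlparse

# Toy in-memory "web site": URL -> list of links found on that page.
# (Hypothetical pages, standing in for real HTTP requests.)
PAGES = {
    "http://dotnetmafia.com/": ["http://dotnetmafia.com/blogs/",
                                "http://external.example.com/"],
    "http://dotnetmafia.com/blogs/": ["http://dotnetmafia.com/blogs/post1",
                                      "http://dotnetmafia.com/"],
    "http://dotnetmafia.com/blogs/post1": [],
    "http://external.example.com/": ["http://external.example.com/about"],
    "http://external.example.com/about": [],
}

def crawl(start, max_depth=3, allow_server_hops=False):
    """Breadth-first spider: follow links up to max_depth,
    optionally staying on the starting server."""
    start_host = urlparse(start).netloc
    seen = {start}
    queue = deque([(start, 0)])
    index = []
    while queue:
        url, depth = queue.popleft()
        index.append(url)                  # "index" the page
        if depth == max_depth:
            continue                       # depth limit reached
        for link in PAGES.get(url, []):
            if link in seen:
                continue
            if not allow_server_hops and urlparse(link).netloc != start_host:
                continue                   # skip links to other servers
            seen.add(link)
            queue.append((link, depth + 1))
    return index
```

With server hops disallowed, the crawl stays on dotnetmafia.com; allowing hops pulls in the external pages as well. The real settings in SharePoint work on the same two knobs: how deep to follow links and whether to leave the starting server.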
After you configure your content source, you can start a crawl. When it completes, view the crawl log to see if there were any issues crawling your site. This can also help you find broken links.
If you want to crawl a site that requires authentication, you can do that as well by creating a crawl rule. A crawl rule lets you specify credentials from a variety of sources, such as a certificate, a cookie, or even forms-based authentication (FBA). I don’t have an example handy today, though, so I’ll cover it in a future post.
As I mentioned earlier, if you have to specify a different name for a server internally than externally, that can be handled with a server name mapping. A server name mapping allows you to map a URL that was crawled and replace it with a different URL (i.e. the external URL of the site). Here is what that would look like.
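Conceptually, a server name mapping is just a prefix rewrite applied to crawled URLs before they are shown in search results. Here is a minimal sketch of that idea; the internal and external host names are hypothetical examples, and this is an illustration of the concept, not SharePoint’s implementation.

```python
# Map the URL prefix that was crawled (internal) to the URL prefix
# that users should see in results (external). Hypothetical hosts.
MAPPINGS = {
    "http://internal-web01": "http://www.example.com",
}

def apply_server_name_mapping(crawled_url):
    """Replace a crawled URL's internal prefix with its mapped public prefix."""
    for internal, external in MAPPINGS.items():
        if crawled_url.startswith(internal):
            return external + crawled_url[len(internal):]
    return crawled_url  # no mapping matched; leave the URL as-is
```

So a result crawled as `http://internal-web01/pages/contact.aspx` would be displayed as `http://www.example.com/pages/contact.aspx`.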
The last thing I will point out is that there isn’t a way to exclude a portion of the page from being included in the index (at least as far as I know). What this means is that if you have common navigation on every page, the words in that navigation will match on every page. For example, if you have a link called Contact Us, every page containing the Contact Us link is going to show up as a hit when someone searches for those words. Here’s an example of what I mean. There are way too many results, which doesn’t help the user at all in this case.
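A toy inverted index makes the problem easy to see: because the “Contact Us” link text appears on every page, a search for those words matches every page. The page content below is made up purely for illustration.

```python
# Hypothetical pages that all share the same "Contact Us" navigation text.
PAGES = {
    "/": "Welcome to our site Contact Us",
    "/products": "Product catalog Contact Us",
    "/contact": "Contact Us Send us a message",
}

def build_index(pages):
    """Build a simple inverted index: word -> set of pages containing it."""
    index = {}
    for url, text in pages.items():
        for word in set(text.lower().split()):
            index.setdefault(word, set()).add(url)
    return index

def search(index, query):
    """Return the pages that contain every word of the query."""
    words = query.lower().split()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results
```

Searching for “contact us” returns all three pages, while a word that appears only in real content, like “catalog”, returns just the one relevant page. Since the crawler can’t tell navigation apart from content, every shared navigation phrase behaves like the first case.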
As you can see, crawling web sites with Enterprise Search is pretty easy to set up. You may have to deal with some issues like the one above, but it’s still not a bad solution. This is a great way to index your public facing corporate site and bring those results into SharePoint.