Douglas Thrift's Computers Website

Search Engine | Historical

The historical background of my Search Engine CGI program.

Table of Contents

Prerelease and Version 1.0

Overview

Note: The rest of this overview section is a carryover from the prerelease stage of this project. I have kept it because I like it and I don't want to change it.

I am working on a search engine. It is a CGI program built using Perl, C++, and Java. I decided to build it after using CGISRCH from AnalogX on my home network site; I won't go into details, but suffice it to say it didn't do what I wanted it to do. So, following the old adage, "if you want it done right, do it yourself," I set out to build my very own search engine. Of course, building a search engine has already been done right by Google, and they even sell a Google Appliance, but I really don't need to buy one of those and I love programming (also, this will look good on an application to grad school or on my résumé).

Since Google did build a search engine the right way, I decided to use it as my primary example. My search engine will not have all the features of Google, but hopefully it will work similarly on a much smaller scale. Unlike CGISRCH, my search engine uses an index of pages to be more efficient. My search engine is also more customizable in its output style. My design for the search engine consists of four files that make up the program:

  1. search.cgi: A Perl script that is called by the web server, gets the query and gives it to the main program, and prints the main program's output back to the web server.
  2. Search.exe: A C++ program that handles the search and outputs its results; with different options it also acts as the indexer that builds the index of web pages.
  3. HttpConnector.jar: A Java library that the main program calls while indexing to connect to web servers over HTTP and download web pages.
  4. HttpConnectorHelper.dll: A C++ library that the Java library calls while it is connected to a web server to determine whether the response is usable for indexing.
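
To illustrate what "uses an index of pages" can mean in practice, here is a minimal sketch of an inverted index in C++. It is only an illustration under my own assumptions: the `PageIndex` class, its member names, and the whitespace-based tokenizing are hypothetical and are not taken from Search.exe.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <iterator>
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical sketch: maps each lowercase keyword to the set of page
// addresses that contain it. Not the actual Search.exe data structure.
class PageIndex {
public:
    // Split the text on whitespace, lowercase each word, and record the page.
    void add_page(const std::string& address, const std::string& text) {
        for (const std::string& word : tokenize(text)) {
            postings_[word].insert(address);
        }
    }

    // Return the addresses of pages that contain every keyword in the query.
    std::vector<std::string> search(const std::string& query) const {
        const std::vector<std::string> keywords = tokenize(query);
        if (keywords.empty()) {
            return std::vector<std::string>();
        }
        std::set<std::string> matches = lookup(keywords.front());
        for (std::size_t i = 1; i < keywords.size(); ++i) {
            const std::set<std::string> next = lookup(keywords[i]);
            std::set<std::string> both;
            std::set_intersection(matches.begin(), matches.end(),
                                  next.begin(), next.end(),
                                  std::inserter(both, both.begin()));
            matches.swap(both);
        }
        return std::vector<std::string>(matches.begin(), matches.end());
    }

private:
    static std::vector<std::string> tokenize(const std::string& text) {
        std::vector<std::string> words;
        std::istringstream stream(text);
        std::string word;
        while (stream >> word) {
            std::transform(word.begin(), word.end(), word.begin(),
                           [](unsigned char c) {
                               return static_cast<char>(std::tolower(c));
                           });
            words.push_back(word);
        }
        return words;
    }

    // Look up one keyword; an unknown keyword yields an empty set.
    std::set<std::string> lookup(const std::string& keyword) const {
        const auto found = postings_.find(keyword);
        if (found == postings_.end()) {
            return std::set<std::string>();
        }
        return found->second;
    }

    std::map<std::string, std::set<std::string>> postings_;
};
```

With an index like this, a query only touches the entries for its keywords instead of rereading every page, which is the efficiency gain over scanning files on each search.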

Those files have Windows file extensions because I am developing the search engine on Windows, but it will be ported to Linux and FreeBSD, where the only difference in the file names will be some extensions (HttpConnectorHelper.dll will be HttpConnectorHelper.so). The search engine will also consist of five HTML files that make up the customizable templates for the output:

  1. header.html: The top portion of the output, which can display the search query, the number of results, the pages of results, and how long the search took.
  2. body.html: The results portion of the output, which is displayed for each result and can contain the web page address, a relevant sample of the text, the web page description if any, and the web page title if any (the web page address will be displayed as the title otherwise).
  3. footer.html: The bottom portion of the output, which can display the same information as the header.
  4. notfound.html: The middle portion of the output, which is displayed if there are no results; it can display the query, and can have different output if there is more than one keyword.
  5. pages.html: The links to the different pages of results that can appear in the header and footer.
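
As a rough idea of how values such as the query or the number of results could be filled into one of these templates, here is a small C++ sketch. The `{name}` placeholder syntax, the `render_template` function, and the placeholder names are all assumptions made for this example; the real templates may mark their fields differently.

```cpp
#include <map>
#include <string>

// Replace every "{name}" placeholder whose name appears in `values`.
// The "{name}" syntax is an assumption made for this sketch; unknown
// placeholders and unmatched braces are copied through unchanged.
std::string render_template(const std::string& text,
                            const std::map<std::string, std::string>& values) {
    std::string output;
    std::string::size_type position = 0;
    while (position < text.size()) {
        const std::string::size_type open = text.find('{', position);
        if (open == std::string::npos) {
            output.append(text, position, std::string::npos);
            break;
        }
        const std::string::size_type close = text.find('}', open + 1);
        if (close == std::string::npos) {
            output.append(text, position, std::string::npos);
            break;
        }
        // Copy the literal text that precedes the placeholder.
        output.append(text, position, open - position);
        const std::string name = text.substr(open + 1, close - open - 1);
        const auto found = values.find(name);
        if (found != values.end()) {
            output += found->second;
        } else {
            output.append(text, open, close - open + 1);
        }
        position = close + 1;
    }
    return output;
}
```

For example, rendering `<p>{count} results for {query}</p>` with `count` set to `3` and `query` set to `perl` would give `<p>3 results for perl</p>`.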

The index files that the main program produces when it is indexing will be XML files. The search engine will include a Document Type Definition (DTD) file called index.dtd, which will let you analyze the indices and check them for errors with Internet Explorer or an XML validator.
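
To give a feel for what writing one page into an XML index might involve, here is a short C++ sketch. The `page`, `title`, and `text` element names and the `address` attribute are hypothetical; the real names would be whatever index.dtd declares. The one part that applies to any XML output is escaping special characters so the file stays well-formed.

```cpp
#include <string>

// Escape the five characters that have special meaning in XML.
std::string escape_xml(const std::string& text) {
    std::string escaped;
    for (const char c : text) {
        switch (c) {
            case '&':  escaped += "&amp;";  break;
            case '<':  escaped += "&lt;";   break;
            case '>':  escaped += "&gt;";   break;
            case '"':  escaped += "&quot;"; break;
            case '\'': escaped += "&apos;"; break;
            default:   escaped += c;        break;
        }
    }
    return escaped;
}

// Build one <page> element. The element and attribute names are
// hypothetical; the real ones would be whatever index.dtd declares.
std::string page_element(const std::string& address,
                         const std::string& title,
                         const std::string& text) {
    return "<page address=\"" + escape_xml(address) + "\">"
           "<title>" + escape_xml(title) + "</title>"
           "<text>" + escape_xml(text) + "</text>"
           "</page>";
}
```

Without the escaping step, a page address containing `&` or a title containing `<` would produce an index file that a validator rejects.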

Download

Note: As with the overview section, the rest of this download section is old.

Since I have not completed this project, it is not yet available to be downloaded. When it is ready, it will be available as binaries for Windows, Linux, and FreeBSD, all on Intel x86 processors. I will also release the source code under an Apache/BSD-style license.

However, if you would like to be ready to run it, here is a list of what you need:

  1. A web server. I am currently using the latest version of Apache, but you can use any web server that supports CGI, such as IIS from Microsoft or Simple Server:WWW from AnalogX.
  2. A Perl interpreter. This is pretty much standard on most Unices, including Linux and FreeBSD; for Windows, you can download Perl from ActiveState.
  3. A Java runtime environment, which you can download from java.sun.com for both Windows and Linux. I am not certain how to get Java for FreeBSD, but I will be when I release FreeBSD binaries.

Also, if you would like to build the search engine from the source code, either to port it to your platform (maybe Mac OS X, IBM OS/2, or Solaris) or just to say that you can compile it, here is a list of what you will need:

  1. A C++ compiler. I use Microsoft Visual C++ on Windows, and I will be using G++ on Linux and FreeBSD. Like Perl, a C++ compiler usually comes standard with most Unices, and I believe even Mac OS X may have one because of this.
  2. A Java development kit, which you can also download from java.sun.com for Windows, Linux, and Solaris. I believe Mac OS X comes with Apple's JDK, and as above I will figure out FreeBSD when I get there.
  3. And, of course, you will need the web server and Perl interpreter listed above (Java development kits include Java runtime environments, so you don't need to download both).

Copyright © 2002-2008, Douglas Thrift. All Rights Reserved.
