Search Engine for Newham Online
Issuer: Danny Budzak
Programme: Gateway & Extranet Project
Document Author: Danny Budzak
Document Version: 1.0
Document Status: Discussion Draft
Document Control
Amendment History
| Date issued | Version no. | Reason for change |
| 4 - 03 - 99 | 1.0 | Initial discussion draft for review, amendment and signing |
Distribution List
| Name | Org/ role |
| Danny Budzak | LBN |
| Jill Davies | LBN Libraries |
| Lorraine Leeson | Art of Change |
| Rik Lewis | LBN consultant |
| John Lock | UeL |
| Michael Mulquin | Chair NO |
| Toni Rice | Advice Arcade |
| Gavin Sealey | LBN Youth service |
| Richard Steel | LBN CCS |
| Richard Stubbs | Newham Online |
Related documentation
| Reference no | Title: | Author: | Version & date |
| | Search Engine: Project Specification | Richard Stubbs | |
Contents
1. Management summary
2. Resources
3. Marketing and revenue
4. Search engine definitions
5. User requirements
6. Search engine specification
7. Application support and integration
8. Directories, keywords and indexing
9. Server and network implications
10. Building a search engine
11. Excite for Web Servers
12. ht:/dig
13. Ultraseek
14. Actions
Appendix One: Comparison of EWS, ht:/dig and Ultraseek
1. Management summary
1.1 For Newham Online to function effectively as a Portal / gateway, it is important to implement a search engine.
1.2 For a search engine to be a useful tool, the needs of the various users of the portal must be the starting point. The search engine is a bridge between the user and the information.
1.3 This document is an attempt to bring together some of the key issues involved in implementing a successful search engine strategy.
1.4 This document will look at the user needs, the requirements of the search engine and the network and server implications.
1.5 The document will consider four alternatives: building a search engine; using Excite for Web Servers (EWS) or ht:/dig, both of which are free; or buying Ultraseek, a commercial search engine.
2. Resources
2.1 Resources need to be identified for the progress of this project.
2.2 Resources which have been identified include:
London Borough of Newham - finance plus free Microsoft consultancy
University of East London - finance and three person development team
East London Lea Valley Teleregion - finance
Go To Find
Internet Gateway Ltd
3. Marketing and revenue
3.1 The search engine is the logical place to sell advertising on Newham Online. We should investigate whether advertising revenue could be raised in this way to pay for the search engine development.
3.2 If a search engine was created specifically for the Newham Gateway it may be a product which could be sold commercially, particularly to projects similar to Newham Online.
4. Search engine definitions
4.1 A search engine is a tool which users of the internet, intranets and extranets use to search for information. It can answer specific questions or provide a solution to a problem.
The basic components of a search engine are
4.2 The search engine is required so that users can easily find answers to their queries. Net use is growing exponentially, and it is important that we lay a strong foundation now to cope with growth in both the content on Newham Online and the use of Newham Online.
4.3 It must be able to retrieve information in a variety of formats: html, Word documents, ftp, jpg, avi etc. We want it to search across all of the pages which make up Newham Online, to index these pages and to produce search results for users.
4.4 The search engine will be server based and be located on the Newham Online Web server.
4.5 A working version of the search engine (release 1.0) should be available for June 1999. A fully functioning version should be available from September 1999.
5. User requirements of the search engine
5.1 The default scope of the search engine needs to be Newham, but the user must also be able to search the UK and the world. It needs to search any defined domain within Newham. A domain is defined by three parameters: geography, time and topic.
5.2 The main ways in which information retrieval can be categorised are as follows:
5.2.1 Known item - the user knows what they are looking for and there is a single correct answer - eg where is Stratford Railway Station
5.2.2 Existence searching - the user knows what they want but they don't know how to describe it - eg 'I want to know where my grandfather lived' - this is family history but technically would be described as genealogy.
5.2.3 Exploratory searching - the user has a query but they don't know what they want to find - eg 'I want to change my job, but I'm not sure what I want to do, what are the options?'
5.2.4 Comprehensive searching - the user wants everything on a subject - eg 'I want all the references to the redevelopment of the Royal Docks'
5.3 The system will have the following types of users
5.4 The main issues which must be met from the user's point of view are:
5.5 It is important the search engine is popular with the users.
5.6 The search engine should tell the user how it works and how it arrives at the results of a search - ie - what components of a file are being searched, is it free text search etc.
5.7 Users should be able to display the results in compact, standard or detailed format.
5.8 Users should have option of simple or advanced search.
5.9 Users should have access to help files and FAQs about the search engine
5.10 Users should have access to human contact if they fail to find what they are looking for. This could, for example, be an email link to the library service or another information provider. Netcall should be available for registered users.
5.11 Users should be told the number of documents retrieved and where they are in the retrieved list - eg documents 12 - 21 of 67
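The position feedback described in 5.11 can be sketched in a few lines. This is an illustration only: the function name, the zero-based offset and the page size of 10 are assumptions, not part of any of the search engines considered in this document.

```python
def result_range_label(offset: int, page_size: int, total: int) -> str:
    """Describe where the user is in the retrieved list.

    offset is the zero-based position of the first document shown
    (a hypothetical convention chosen for this sketch).
    """
    if page_size < 1:
        raise ValueError("page_size must be at least 1")
    if total == 0:
        # Feedback is still given when nothing matches.
        return "no documents found"
    if not 0 <= offset < total:
        raise ValueError("offset out of range")
    first = offset + 1
    last = min(offset + page_size, total)
    return f"documents {first} - {last} of {total}"


print(result_range_label(11, 10, 67))  # documents 12 - 21 of 67
```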
5.12 The user should be able to manually submit a URL.
5.13 The user should be able to submit details of URLs which are no longer valid (although the search engine should have an auto-delete mechanism).
5.14 The user should always be given feedback, even if this is only to say that no matches have been found.
5.15 The user should be able to change the sort order of the retrieved results - eg sort by date, alphabetically, by file size, by relevance.
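As a rough illustration of 5.15, the sketch below re-sorts a retrieved list by each of the four orders mentioned. The Result record, its field names and the choice of newest-first and highest-relevance-first are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Result:
    # Hypothetical record for one retrieved document.
    title: str
    modified: date
    size_bytes: int
    relevance: float


# Each sort order maps to (key function, descending?).
SORT_KEYS = {
    "date": (lambda r: r.modified, True),             # newest first
    "alphabetical": (lambda r: r.title.lower(), False),
    "size": (lambda r: r.size_bytes, False),          # smallest first
    "relevance": (lambda r: r.relevance, True),       # best match first
}


def sort_results(results, order="relevance"):
    """Return a new list sorted by the order the user selected."""
    key, descending = SORT_KEYS[order]
    return sorted(results, key=key, reverse=descending)
```

Keeping the sort orders in one table means a new order can be added without changing the sorting code.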
5.16 The user should be able to see their original search query while they use the search engine (ie the search query stays in the input box while the search is carried out and completed).
5.17 The user should be able to save searches in stored sets.
5.18 The user should be able to search on full text (natural language), URL, the title and keywords.
5.19 The user should be able to restrict the type of content they want to receive
6. Search engine specification
6.1 There must be a simple and advanced search capability
6.2 The simple search facility will work on the basis of individual words typed into a box.
6.3 The advanced search will support
6.4 The search engine should be able to search html files, jpg/gif etc files, WAV/AVI etc, email addresses, proper names, ftp sites.
6.5 The search engine must have help files and a list of FAQs which are easily accessible and understandable.
6.6 The search engine should group results by categories such as 'loose', 'fair', 'good', 'close' and 'strong', and should apply a minimum relevance score.
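The grouping in 6.6 could work along the following lines. The numeric thresholds, the 0 to 1 score scale and the minimum score of 0.1 are all assumptions for illustration; each candidate search engine will have its own scoring.

```python
# Hypothetical thresholds on a 0-1 relevance scale, highest band first.
BANDS = [
    (0.90, "strong"),
    (0.75, "close"),
    (0.50, "good"),
    (0.25, "fair"),
    (0.00, "loose"),
]
MIN_RELEVANCE = 0.10  # assumed cut-off: anything below is not shown


def band_for(score: float):
    """Return the band name for a score, or None if below the minimum."""
    if score < MIN_RELEVANCE:
        return None
    for threshold, name in BANDS:
        if score >= threshold:
            return name
    return None


def group_by_band(scored_docs):
    """Group (document, score) pairs into bands, dropping weak matches."""
    groups = {name: [] for _, name in BANDS}
    for doc, score in scored_docs:
        band = band_for(score)
        if band is not None:
            groups[band].append(doc)
    return groups
```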
6.7 The search syntax should be case insensitive
6.8 The search must have a speedy response. It must be able to deal with simultaneous multiple searches.
6.9 The search engine must include a feedback channel for help and technical questions on the use of the search engine.
6.10 The search engine must include a feedback channel for a user who fails to find what they are looking for.
6.11 The search engine must be able to automatically index all of the contents of all of the relevant sites which comprise Newham Online.
6.12 The same interface for the main search engine should be used where possible as the interface to search specific directories and databases within Newham Online.
6.13 The search engine must be Y2K compliant.
6.14 The search engine should include an online tutorial.
6.15 The search engine should support PICS ratings so that it does not return certain items to someone with a particular type of registration.
6.16 The search engine should be able to carry out concept based searches
6.17 The index should store answers to common questions (ie multiple indices).
6.18 The spider / robot should contribute to site management and produce reports - ie which links are broken, which sites are live, which are dying.
6.19 Feedback should be provided for web editors on failed searches to help provide the content that users want.
6.20 The search engine should support words being treated as words, substrings, stems. The * character should be used for stems.
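One possible reading of 6.20, together with the case insensitivity in 6.7, is sketched below. It treats a trailing * as a prefix match, which is a simplification of true stemming; the function name and the mode parameter are assumptions for the example.

```python
import re


def matches(term: str, text: str, mode: str = "word") -> bool:
    """Case-insensitive match of a single search term against text.

    - a term ending in * is treated as a stem (prefix of a word)
    - mode "substring" matches the term anywhere, even inside a word
    - mode "word" (the default) matches whole words only
    """
    if term.endswith("*"):
        pattern = r"\b" + re.escape(term[:-1]) + r"\w*"
    elif mode == "substring":
        pattern = re.escape(term)
    else:
        pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, text, re.IGNORECASE) is not None


text = "Redevelopment of the Royal Docks"
print(matches("dock", text))               # False: 'Docks' is not the word 'dock'
print(matches("dock*", text))              # True: stem match
print(matches("ock", text, "substring"))   # True: substring match
```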
6.21 The search engine should support community languages
6.22 The user interface will be a combination of multiple input windows, drop down menus and hyperlinks. The user interface must load quickly onto the user's machine and be backward compatible with earlier versions of Netscape and Explorer.
6.23 The search engine must support searching on full text, the URL, the title, keywords and description.
6.24 The search engine must be able to index dynamically generated pages.
6.25 The search engine must be able to index pages which are held in separate databases.
6.26 Search results must be actual working links.
6.27 Duplicates should be eliminated
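A minimal sketch of duplicate elimination is given below, assuming duplicates are the same page reached through differently written URLs. The normalisation rules are illustrative assumptions, www.example.org is a placeholder, and detecting identical content held at different addresses would need a separate check.

```python
from urllib.parse import urlsplit, urlunsplit


def normalise(url: str) -> str:
    """Reduce a URL to a comparison key: lower-case the scheme and host,
    drop any #fragment and ignore a trailing slash on the path."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def dedupe(urls):
    """Keep the first occurrence of each page, preserving result order."""
    seen = set()
    unique = []
    for url in urls:
        key = normalise(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique
```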
6.28 A good test for a search engine is whether it can find 'Vitamin A', 'to be or not to be' and 'The The'.
7. Application support and integration
7.1 Ability to use the search engine as a back end to other applications such as map based searching - eg click on a map and bring up a list of nearest chemists
7.2 Able to contribute to site and content management - eg producing reports on the 'age' of information on the network.
7.3 Ability to support PICS rating so as not to provide 'adult' information to underage users.
7.4 Ability to support extranet identification system so as to allow authorised users to search site information that is unavailable to non-authorised users.
7.5 Ability for others to use same engine for searches restricted to their sites.
8. Directories, keywords, indexing
8.1 Users find information through a combination of searching and browsing. The search engine needs to accommodate this. It is estimated that around 30% of internet users do not or cannot use search engines and therefore alternative search facilities need to be incorporated for this group of people.
8.2 The most effective way to complement a search engine is with a directory. Ideally, the search engine and directory should be part of the same page. Yahoo! is a good example of how this works (although Yahoo! isn't a search engine in the way that Alta Vista is). Yahoo! has 14 main categories and users can go through these categories to find the information they want.
8.3 Creating a directory is labour intensive but it is raised here because it is a project which another group, such as the library service, may be willing to co-operate on.
8.4 The use of keywords in html files will help users find information using the search engine. The standard meta tag should be used for this. Keywords also help the search engine to deal with synonyms and to use the language of the user. For example, the term 'handicapped' would not be used by the local authority, but may be used by the user. By including this as a keyword in the meta tag, the user will find information about 'disabled people'.
8.5 The search engine should be able to create an index of the site which can be used by the user.
8.6 There needs to be a policy to prevent index spamming, whereby content providers artificially push their ranking up in the search results.
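To illustrate 8.4, the sketch below shows a page carrying the standard keywords meta tag and how an indexer might read it. The sample page and the helper names are invented for the example; real search engines will parse meta tags in their own way.

```python
from html.parser import HTMLParser

# Hypothetical page: the visible text says 'disabled people', while the
# keywords meta tag also lists the term a user might type instead.
PAGE = """<html><head>
<title>Services for disabled people</title>
<meta name="keywords" content="disabled people, disability, handicapped">
</head><body>...</body></html>"""


class KeywordParser(HTMLParser):
    """Collect the comma-separated terms from <meta name="keywords">."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "keywords":
            content = attributes.get("content") or ""
            self.keywords += [
                k.strip().lower() for k in content.split(",") if k.strip()
            ]


def page_keywords(html: str):
    parser = KeywordParser()
    parser.feed(html)
    return parser.keywords


print(page_keywords(PAGE))
```

A search for 'handicapped' can then be matched against the keyword list even though the word never appears in the page text.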
9. Server and network implications
9.1 The following need to be determined as they will differ between search engines
10. Building a search engine
10.1 The partners within Newham Online could agree to build a search engine. This could either be done using the existing resources of the partners, or the partners could create a tender document.
10.2 A clear project brief would have to be produced with detailed costings and time scales.
10.3 A project leader would need to be appointed from within Newham Online.
10.4 The project would ultimately report to Newham Online steer or a sub group of this.
10.5 User testing would be necessary before the search engine was finally completed.
10.6 Ownership of the search engine would need to be established, in the event that this might be sold as a commercial product.
10.7 Copyright of the search engine would have to be established.
10.8 Advantages
- we get exactly what we want
- it could add considerable value to the gateway project
- it could be a commercial product and be a means of generating income
10.9 Disadvantages
- it could be more costly than buying a commercial search engine
11. Excite for Web Servers
11.1 Excite for Web Servers (EWS) is not, strictly speaking, a search engine. It crawls the local file system rather than following links. It is written in PERL and full source code is provided. Disassembly is prohibited, but the source code can be changed.
11.2 Advantages
- free
- one of the few search engines to support concept matching
- distributed as source code so some customisation is possible, but there are clauses limiting how much can be changed
- indexing is very fast
- no network load
- supports logging user queries
- EWS has a commercial relationship with Excite so it is likely to support new html features as they appear
11.3 Disadvantages
- little control over how a page is indexed; we have to hope it understands the gist of the page
- have to tell it which files it can and cannot read
- not clear how EWS would work on the gateway site, given that it crawls a local file index rather than following links
- only supports http
- hidden pages will show up unless EWS is specifically told not to index them
- can be problems if individual server configurations differ
- doesn't support the robots.txt standard for exclusion
- EWS have distributed faulty products in the past
- doesn't get MIME types from the server so can be confused about file types
- can have problems with bmp images
- cannot support multiple physical servers
- proprietary searching mechanism and not as customisable as ht:/dig
- easy to set up but more indexing and maintenance needed
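For readers unfamiliar with the robots.txt exclusion standard mentioned in the list above, the sketch below shows how a spider that does honour it would decide whether to fetch a page. It uses Python's standard urllib.robotparser module; the rules, the agent name 'newham-spider' and www.example.org are invented for illustration and say nothing about how EWS itself behaves.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt published by a content provider who wants
# the /private/ area kept out of every search index.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(url: str, agent: str = "newham-spider") -> bool:
    """Return True if the exclusion rules permit this agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)


print(allowed("http://www.example.org/index.html"))         # True
print(allowed("http://www.example.org/private/draft.html"))  # False
```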
11.4 Users of EWS
Sun Microsystems www.sun.com
Adobe www.adobe.com/misc/search.html
Javasoft www.javasoft.com/share/search.html
Sega www.sega.com/site_search/
Forbes Magazine www.forbes.com/Architext/Forbesquery.htm
Chevron www.chevron.com/Architext/AT.chevronquery
12. ht:/dig
12.1 ht:/dig is distributed as source code and can be turned into a commercial product under the GNU licence agreement. It is still in development and is likely to improve.
12.2 Advantages
- no restrictions on what can be done with the source code
- free
- supports configurable fuzzy searching
- determines MIME therefore no confusion over file types
- can be configured to crawl password protected sites
- can be used for expiry notification
- supports meta tags for keywords and description
- can index very quickly and index at specified times, eg 3am on Sunday
- will find broken links
- can crawl frames
- easy to configure the search results
- easy to set up
- could add features such as query log
12.3 Disadvantages
- no inherent support for languages, but it is claimed that it could work with other languages
- compiling and development time would be needed. This could cost as much as buying a commercial search engine.
12.4 Users of ht:/dig
Red Hat Linux www.redhat.com/search/
NASA's Kennedy Space Centre www.ksc.gov/search/htdig/
Austrian press www.austria.org
Original World Famous Pez Page www.io.com/~paults/search.html
University of Kent at Canterbury www.uk.ac.uk/webmaster/search
13. Ultraseek
13.1 Ultraseek is the technology which powers Infoseek, a well known and popular search engine.
13.2 Advantages
- supports natural language query processing
- supports phrase searching
- automatic name searching
- supports + / - operators
- supports query refining
- very easy to administer
- allows users to submit pages for inclusion
- it is commercial and therefore professional support is available
- logs user queries and these logs are easily accessible
- allows link query (so can show which pages link to a particular page)
- supports searching on title, URL, host name, images, meta tags
- can index sites through a proxy server
- can index URLs that contain query strings (ie generated from an external database)
- both spider and query server can use http keep alives (TCP protocol 'learns' how to transfer data)
- tracks status of individual URLs in the indexing process - ie doc summary, date last visited
- supports multiple indices
- collections from one server can be mirrored on another
- can crawl frames and returns individual panels
- crawls client side image maps
- can crawl password specified pages
- can specify the number of spiders active at any time
- can present cookies to the server
- machine shut down will not affect the index
13.3 Disadvantages
- it could be costly
- Boolean not supported
- can't set the spider to run at specific times
13.4 Sites using Ultraseek
CNN www.cnn.com/search/index.html
NASA Spacelinks spacelink.nasa.gov/index.html
Sun Microsystems www.sun.com:8065
Sunsite Denmark sunsite.auc.dk
Channel A www.channela.com/
George Mason University www.gmu.edu/tools/
14. Actions
14.1 Commitment of resources needs to be identified
14.2 Project brief needs to be produced
14.3 Draft proposal needs to be sent to all of the partners of Newham Online
14.4 Project leader needs to be identified
14.5 Usability group needs to be identified
14.6 Need to investigate the possibility of European Funding for language support
14.7 User evaluation is needed on EWS, ht:/dig and Ultraseek. There are 17 web sites identified above which are using versions of these three search engines. A project needs to be established to evaluate these sites.
14.8 Newham Online needs to contact the suppliers of EWS, ht:/dig and Ultraseek with our specification and ask them how their search engines could meet the requirements.
Web Developer.com Guide to Search Engines
Wes Sonnenreich & Tim Macinta
Wiley Computer Publishers
New York 1998
Information Architecture for the World Wide Web
Louis Rosenfeld & Peter Morville
O'Reilly 1998
Web Navigation: Designing the User Experience
Jennifer Fleming
O'Reilly 1998
Internet Users Guide to Network Resource Tools
TERENA & Margaret Isaacs
Addison Wesley 1998
The Design of Everyday Things
Donald A Norman
MIT Press London 1998
Three articles on search engines from .net magazine. Start page:
http://www.futurenet.com/net/features/43findit/default.asp
Alta Vista help pages: www.altavista.com/av/content/help.htm
SearchUK help pages www.searchuk.com/results.html
Excite for Web Servers www.excite.com/navigate/
ht:/dig http://www.htdig.org/
ultraseek www.software.infoseek.com
general www.serverwatch.com
Academic information
Understanding and comparing web search tools - www.hamline.edu/library/bush/handouts/comparisons.html
Literature about search engines www.ub2.lu.se//desire/radar/lit-about-search-services.html#retr
Technical www.searchenginewatch.com
Appendix One: comparison of EWS, ht:/dig and Ultraseek
| | EWS | ht:/dig | Ultraseek |
| Operating system | NT | - | NT |
| RAM | 32 | 32 - 64 | 64 |
| Memory | 5 | 10 | |
| Index as % of content | 40% of content | 300 - 500% of content size | 10 -50% of content size |
| Ability to crawl other protocols | | | |
| Http | Yes | Yes | Yes |
| ftp | X | X | X |
| Gopher | X | X | X |
| Nntp | X | X | X |
| Handle frames | Yes | Yes | Yes |
| Dynamically generated pages | X | Yes | Yes |
| Meta tags | X | Yes | Yes |
| Full text | X | X | Yes |
| User profile | | | |
| No. of simultaneous users | Not an issue | Not known | Not an issue |
| Skill of users | Easy to use | Easy to use | Easy to use |
| Community language support | Will need extra work | Will need additional work | Yes |
| Cost | Free source code | Free source code | Estimates of costs would be needed from Ultraseek |
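The 'Index as % of content' row in the table above has a direct bearing on disk space. The sketch below turns those percentages into an estimated index size for a given amount of content. The 200 MB content figure is purely an assumed example, not a measurement of Newham Online.

```python
# Index size as a percentage of content size (low, high), taken from
# the comparison table above.
INDEX_PERCENT = {
    "EWS": (40, 40),
    "ht:/dig": (300, 500),
    "Ultraseek": (10, 50),
}


def index_size_mb(engine: str, content_mb: float):
    """Return the (lowest, highest) expected index size in MB."""
    low, high = INDEX_PERCENT[engine]
    return (content_mb * low / 100, content_mb * high / 100)


CONTENT_MB = 200  # assumed size of the indexed content, for illustration
for engine in INDEX_PERCENT:
    low, high = index_size_mb(engine, CONTENT_MB)
    print(f"{engine}: {low:.0f} - {high:.0f} MB")
```

On these figures ht:/dig would need several times more disk space for its index than the content itself, which should be taken into account under section 9.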