
How to... use search engines effectively

By Margaret Adolphus

The perfect search engine does not exist. Not only is information increasing exponentially, but search behaviour is becoming ever more demanding. So, at the point when theoretical perfection is achieved, another layer of information becomes available, and people find new ways to search.

This is good news for the developers of search engines, especially for the behemoth Google, which controls 78 per cent of the market.

But for the rest of us, it's as difficult to keep abreast of developments in search engines as it is of those in Web 2.0 applications. This article is an attempt to summarize some recent trends.

What do the experts think?

Given its importance in the market, it seems appropriate to start with Google. Speaking in December 2009, Google engineer Matt Cutts predicted a number of trends (Skipease, 2009):

  • Segmentation of search – Google would try to categorize information more, for example Google Book Search, (US) government search, blog search, etc.
  • Semantic Web – Google's search engine is becoming more sophisticated, taking account of synonyms, page structure and user intent.
  • Searching the cloud – as people become more confident about storing information on "cloud" hard drives, there will be a need to search these.
  • Real-time – searching what people are writing at this moment, to catch the latest buzz and get really up-to-the-minute information.
  • Mobile search – as we increasingly use mobiles to find information, we will need tools to search them, so mobile websites will need to be formatted for searchability.

Writing from the perspective of the third quarter of 2010, these trends appear spot on. However, they fail to mention a key concern picked up by two commentators from the information profession: the need to organize information.

Information consultant Ellyssa Kroski wants search engines to reduce information overload:

"To access the vast content stores of the read/write Web, these search tools make use of structured and linked data, real-time search, personalization, and more focused filtering techniques. If you're a fan of buzzwords, you might say we've entered Web 3.0, a new era that is motivated by the need to more effectively organize, filter, and access information online" (Kroski, 2009).

And Phil Bradley, listening to someone else's vision of a perfect search engine, muses that his own vision is of a tool that would filter, sieve and collate information rather than just present it (Bradley, 2010a).

So, what are the main trends, and do they make it easier to find information?


Real-time and social search

By now, social search engines – which search across the social Web – are well established. People search engines are a particularly interesting development, especially for potential recruiters or those involved in relationship management.

Two useful people search sites are 123people and kgbpeople. Both pull together a large amount of information, including:

  • social network sites,
  • web pages,
  • documents,
  • blog mentions,
  • photos.

123people organizes things so that all links appear on one page, whereas kgbpeople has a tabbed structure with tabs at the top of the page linking to social networks, search engines (where results are shown against the individual search engine), photos/video/audio, and personal.

No one doing a serious search can avoid the blogosphere, and there are a number of ways of searching blogs. Google provides the option of limiting search results by media type, including blogs (see the menu in the top left-hand corner).

Boardreader searches forums, and Icerocket searches over the Web, the blogosphere, Twitter, MySpace, news, images and BigBuzz, with blogs the default option. It received the thumbs up from Phil Bradley (Bradley, 2009), who commended it for its value in providing a quick overview of social media, pulling everything into one place.

However, the most exciting development with search engines is the ability to search in "real time", i.e. the present moment, so that you can find out what people are talking about now.

What distinguishes a real-time search engine is that it continues to search after results are revealed, so that items continue to drop into your results page. Examples include Twazzup, Scoopler (now defunct), and Collecta.

Twitter is a particularly good way of searching for up-to-the-minute content, and there are a number of Twitter search engines. Twitter Search works by instant indexing: whereas traditional search engines search archived content, Twitter enters updates into its database as soon as they are tweeted. It also has some useful advanced search options: you can search within a date range, for tweets to or from a particular person, and specifically for tweets containing links.
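
By way of illustration, those options map onto Twitter Search's query operators roughly as follows (the operators were documented by Twitter at the time of writing; the accounts, dates and topics here are invented):

    from:BBCBreaking                                  tweets sent by a particular account
    to:stephenfry                                     tweets sent to a particular account
    "world cup" since:2010-07-01 until:2010-07-12     tweets matching a phrase within a date range
    election filter:links                             only tweets that contain links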

The only drawback to Twitter Search is that it only covers Twitter's recent history. Snapbird, however, enables you to search beyond Twitter's ten-day history, or within particular friends' accounts. There are many other Twitter search engines: see the article "50+ ways to search Twitter" (Peters, 2010).

Semantic and computational search

Semantic search means that the software does not simply match the query term against an unstructured index of crawled web pages, but instead queries it against its own structured data. In other words, there is some intelligence in the search: information has been organized by humans into a structured form that the software can query.

Such an approach is exemplified by Wolfram|Alpha (WA), launched in May 2009. WA describes itself as a "computational knowledge engine" and works differently from standard search engines: it checks every query against a database of facts compiled by its team, and computes answers using algorithms.

The long-term aim is to make all knowledge computable and accessible to everyone. According to its website,

"Our goal is to build on the achievements of science and other systematizations of knowledge to provide a single source that can be relied on by everyone for definitive answers to factual queries" (Wolfram|Alpha, 2010).

The idea is to save users time in two main ways:

  1. It displays the resulting information cleanly within the interface on the page, so there is no need to click in and out of results pages. Thus, while searching "France" in Google would bring up references to Wikipedia, French hotels, etc., a WA search brings up a whole range of facts about the country, including maps, statistics, economic indicators, etc.
  2. It provides answers, not sources of answers. If you want, for example, to convert $30,000 into UK sterling, it will display the answer rather than directing you to currency-converting sites, and helpfully also provides a graph showing the exchange-rate history.

WA's main drawback is the size of its database: at 10 terabytes last October (Higgins, 2009), it is far smaller than Google's index. According to its website (Wolfram|Alpha, 2010), it holds 10+ trillion pieces of data and 50,000+ types of algorithms and models. There are still significant gaps, however, and WA would be the first to admit that the site has a long way to go.

From a reference librarian's point of view, it is a good place to search for basic facts, for example about a country. It is also particularly strong on scientific and mathematical data.

Figure 1. The Wolfram|Alpha search engine, showing the results of a query for "silver, gold", which provides comparative information for the two elements (© Wolfram|Alpha).

While WA undoubtedly leads the way in computational search, it is not alone, particularly with regard to use of underlying factual databases.

Microsoft's Bing was also launched in May 2009 and, like WA, claims to be able to provide direct answers to questions. Bing finds these answers from two underlying databases that Microsoft took over: one relating to travel and shopping, and the other the semantic search engine Powerset, which indexes Wikipedia.

Bing describes itself as a "decision engine", helping users make key decisions and providing instant answers. For example, a search "London to Johannesburg" brings up a list of sites providing flight information.

The search engine Ask (Ask Jeeves in Britain) has long relied on a database of questions and answers, and was recently relaunched as a natural language search engine, which can generate results both automatically and from a human-edited database of responses.

And Google, claims Matt Cutts (Skipease, 2009), is becoming increasingly sophisticated and semantically empowered: it can factor in synonyms, phrase structure and user intent.

Not all, however, favour these new database search methods. Pandia Search Engine News points out a flaw in the assumption that sites like Wolfram|Alpha and Ask save time by answering the user's questions directly: not all questions have one answer, and information may best be gleaned from a number of sources (Pandia, 2010). This is particularly so for very recent information, or where narrative information (as in a news story) or subjective information (as in reviews of a hotel) is required.

Organizing information

The existence of more and more search engines, as well as more information, makes searching more time-consuming. That is why the efforts of search engines to help researchers organize information are welcome.

Displaying results

One of the most irritating things about searching is the time spent going back and forth between the list of results and the actual pages, particularly when they don't have what you are looking for. A number of sites help surmount this problem, either by providing more information or by presenting it in a structured way.

Bing, for example, allows you to hover your mouse over the edge of an entry, revealing "more on the page" from that particular site.

Figure 2. Bing's interface, showing how it displays results.

Bing also gives suggested related searches in a column to the left (not shown above). Another useful feature is the ability to view video thumbnails within the search results in the video option.

Google Squared (GS) provides search results in a grid structure, which is good for searching categories of items. Search for dog breeds, for example, and GS comes up with a list of breeds against images and data on life expectancy, size and country of origin; it is also possible to add items. If you search for a single but complex item, such as a country, it helps you structure the results by suggesting categories.

Another intriguing Google application, still in Google Labs, is Google News Timeline, which enables you to view the news chronologically, by day and time, in a grid structure so you can see how a particular story is developing.

Figure 3. Results for the Australian general election 2010, as seen in Google News Timeline.

The news site www.newsmap.jp offers another visual view of the news, colour-coding stories by category: world, national, business, technology, sport, entertainment and health.

Metasearch engines

Metasearch engines, or MSEs, which save time by allowing you to search over several sites at once, are hardly new. However, the ability to search several together – especially if you can choose which engines are included – becomes more valuable as the number of search engines grows.

Useful research into which MSE to use is provided by Sadeghi (2009), who compares the effectiveness of a number of metasearch engines: Cluster, Dogpile, Excite, Mamma, MetaCrawler, Search.com, Webcrawler and Webfetch. Sadeghi evaluated these tools by measuring how closely each MSE's ranked results matched those of the underlying search engines across a number of queries, and then compared the MSEs with one another to see which gave the best results. The findings revealed that Dogpile performed best.
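
Sadeghi's precise measure is not reproduced in the article, but the idea of scoring how closely an MSE's ranking tracks an underlying engine can be illustrated with a simple, invented closeness score: the average displacement of results that appear in both ranked lists. A minimal Python sketch:

    def rank_closeness(mse_results, engine_results):
        """Average displacement, in ranks, of URLs appearing in both lists:
        0.0 means shared results keep identical positions; larger values
        mean the metasearch engine reorders them more."""
        engine_rank = {url: pos for pos, url in enumerate(engine_results)}
        displacements = [abs(pos - engine_rank[url])
                         for pos, url in enumerate(mse_results)
                         if url in engine_rank]
        if not displacements:
            return None  # no overlap at all: closeness is undefined
        return sum(displacements) / len(displacements)

    # Invented ranked result lists, purely for illustration
    mse = ["a.example", "b.example", "c.example", "d.example"]
    engine = ["b.example", "a.example", "d.example", "x.example"]
    print(rank_closeness(mse, engine))  # 1.0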


Searching beyond the surface – the deep Web and federated search

Traditional search engines skim only a very small portion of the Web. The Web is a bit like an iceberg: the part you see is small, but a far larger amount lies below the surface. It has been estimated that the portion invisible to ordinary search engines is as high as 90 per cent.

This is because traditional search engine technology relies on web crawlers (also referred to as "spiders" or "robots"), which explore the Web by following links. However, some pages, particularly the front pages of databases, are virtually dead ends for a crawler, because their content can only be reached by entering a search term.

The result is that any search requiring academic knowledge or serious research will be difficult, because such knowledge tends to be stored in PDF documents within databases. Search engines have tried to address this problem by persuading database owners to accommodate their requirements, with varying results.

However, the most significant technological advance in the search of databases is federated search. Federated search allows the user to search multiple databases simultaneously, rather in the same way that MSEs do for search engines. The information architecture, however, is totally different.

When the user inputs a search query, it is fanned out by means of a number of software connectors, which re-execute the search in each of the remote databases. The results are sent back to the federated search engine's server and are then presented to the user, possibly relevance-ranked.
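
In outline, the fan-out works something like the Python sketch below. Everything here is invented for illustration: a real connector would have to speak each database's own search interface, and real relevance ranking is far harder than sorting on a reported score.

    import concurrent.futures

    # Stand-ins for software connectors: each would re-execute the query
    # against one remote database and normalize the results into a common
    # format. The sources and relevance scores are invented.
    SCORES = {"Database A": 0.9, "Database B": 0.4, "Database C": 0.7}

    def search_database(source, query):
        return [{"source": source, "title": f"'{query}' hit", "score": SCORES[source]}]

    def federated_search(query):
        # Fan the query out to all connectors in parallel ...
        with concurrent.futures.ThreadPoolExecutor() as pool:
            futures = [pool.submit(search_database, s, query) for s in SCORES]
            result_lists = [f.result() for f in futures]
        # ... then merge the returned lists and relevance-rank them
        # (naively, by the score each connector reports).
        merged = [hit for hits in result_lists for hit in hits]
        return sorted(merged, key=lambda hit: hit["score"], reverse=True)

    for hit in federated_search("superconductivity"):
        print(hit["source"], hit["score"])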

Federated search is not, however, without its problems, chief of which is the cost of the software "connectors", although as publishers adopt common standards this will become less of an issue. Another problem is information overload, with too many results, and the need to invent new technology for relevance ranking, as Google's approach does not work in this setting (Warnick, 2010).

What is potentially more serious, however, is recent research indicating that people are bypassing "carefully-crafted discovery systems" (CIBER report, 2009, quoted in Joint, 2010) in favour of simpler search solutions.

According to University College London's Centre for Information Behaviour and the Evaluation of Research (CIBER), a mere four months after ScienceDirect content was opened up to Google, a third of the traffic to ScienceDirect's physics journals came by that route.

It is difficult to avoid the conclusion that people want a one-stop-shop solution, and if Google can accommodate this, what more can federated search offer? Certainly the latter's future will depend on careful user behaviour research, and on the consequent development of features that give users the experience they are looking for.

One resource which has become popular for its ability to simplify search is Summon, launched in 2009 by Serials Solutions. Summon attempts to replicate the simplicity of a Google web search while providing access to library and other high quality resources.

Summon's technology architecture is powered not by federated search but by a massive single index that pre-harvests content from 94,000 journals and 6,800 content providers. It can deliver relevance-ranked, media-neutral results in less than a second.
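
The difference from the connector-based fan-out sketched earlier can be seen in a toy example: harvesting and indexing happen ahead of time, so answering a query is a purely local lookup with no remote round trips. (The records and tokenization below are, again, invented for illustration.)

    from collections import defaultdict

    # Pre-harvested content: in Summon's case this comes from thousands
    # of journals and providers; here it is three invented records.
    documents = {
        1: "superconductivity in thin films",
        2: "library discovery services",
        3: "thin film deposition methods",
    }

    # Harvesting time: build one big inverted index, once.
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.split():
            index[word].add(doc_id)

    # Query time: a purely local lookup, hence sub-second responses.
    def search(query):
        matches = [index[word] for word in query.split()]
        ids = set.intersection(*matches) if matches else set()
        return [documents[i] for i in sorted(ids)]

    print(search("thin films"))  # ['superconductivity in thin films']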

One university (Grand Valley State University, Michigan) found that students' use of library resources increased considerably after implementation of Summon. Another (the University of Michigan) analysed "personas" (personality profiles of typical users) and then surveyed the community, deciding that the Summon service best fitted their needs.

Figure 4. Screenshots showing Grand Valley State University's version of Summon (© Grand Valley State University).

Many academic and research search tools still use federated technology, however. For example, WorldWideScience.org is a huge global science gateway searching databases that together hold some 400 million pages of scientific information across 65 different countries. It was launched in 2007 by the British Library and the US Department of Energy's Office of Scientific and Technical Information (OSTI).

WorldWideScience.org gets over the "too much information" problem by providing a left-hand panel which clusters the results by topic, author, publication and date.

Figure 5. WorldWideScience.org's search interface.

Another of OSTI's products, the E-print Network, relies on a combination of index crawling and federated search. The two technologies run in parallel, with filtering out of sites that do not meet the required quality, which makes the E-print Network a very high-quality tool. It is an approach which Warnick (2010) suggests may be the future of quality search products.

One of the problems with searching databases is that most quality search tools are found only in academic libraries. One solution for the independent researcher is offered by DeepDyve, which provides access to database articles that can be rented for a short period, thus avoiding the higher purchase cost.

Not employing federated search, but depending in Web 2.0 fashion on the goodwill of users, is the Deep Web Wiki. Here volunteers contribute and describe useful databases and other sites that may not be popular enough to be indexed by search engines.

One of the remaining limitations in federated search is language: tools may be limited to searching databases with English titles and abstracts. However, in June 2010, Multilingual WorldWideScience was launched at the International Council for Scientific and Technical Information (ICSTI) annual conference in Helsinki.

The software uses real-time translation to offer multilingual search. A query can be typed in one language and then translated into the language of the database; similarly, results can be translated back into the language of the searcher. Real-time searching and translating is now possible between English, Chinese, Japanese, Korean, French, German, Spanish, Portuguese and Russian.
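
In outline, the round trip works as in the sketch below; the translate() function is a stand-in for whatever real-time machine translation service the software actually uses, and the toy database is invented for illustration:

    def translate(text, source, target):
        # Stand-in for a real-time machine translation service.
        return f"[{source}->{target}] {text}"

    def multilingual_search(query, user_lang, db_lang, search_db):
        # 1. Translate the query into the database's language ...
        translated_query = translate(query, user_lang, db_lang)
        # 2. ... run the search in that language ...
        results = search_db(translated_query)
        # 3. ... and translate the hits back into the searcher's language.
        return [translate(hit, db_lang, user_lang) for hit in results]

    # Toy database whose "search" simply echoes the query it was given
    echo_db = lambda q: [f"record matching {q}"]
    print(multilingual_search("solar energy", "en", "ja", echo_db))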

Other search trends

There are a number of other trends in search engine technology, notably segmentation, personalization and custom search.

Search segmentation

Many search engines confine themselves to a particular type of communication or medium. We have already seen examples of search engines that specialize in people, blogs, "real-time" content and Twitter; other search engines search PDFs and e-books, audio and music, and video and movies.

When reporting on Google Images' revamp, Phil Bradley listed a number of other sites devoted to images (Bradley, 2010b):

  • Nachfoto for real-time image search.
  • Coloralo for cartoons and colouring pages.
  • Panoramio for geo-based images.
  • Seeklogo for logos.
  • Tag Galaxy, which allows you to search images by tag, and which has a delightful interface showing a number of planets circling a sun.
  • flickrCC for Creative Commons images at Flickr.

For a list of search engines for different media, see "100+ Alternative Search Engines You Should Know" which, as its name suggests, provides names of lots of different search engines, organized by category.

And Bing, points out library guru Mary Ellen Bates, has among its advanced search options a facility for finding pages that link to a particular file format. Use the syntax "contains:filetype" and you can find pages on the subject of your search that link to files in that format (Bates, 2010).
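
For instance (the queries are invented, and coverage depends on Bing's index):

    solar energy contains:pdf    pages about solar energy that link to PDF files
    mozart contains:mp3          pages about Mozart that link to MP3 audio files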

Personalized and custom search

Personalized search – the ability of a search engine to respond to queries on the basis of a user's search behaviour and, if they have one, their profile – has been around since about 2007 (Koch and Koch, 2007).

Custom search also offers the possibility, via human intervention at the user end, of creating a search engine with pre-selected resources. Librarians at Western Oregon University have been using Google Custom Search Engine to create research guides around particular subjects. They have found this particularly useful, as many of their students are non-traditional returners to education who are easily baffled by the vast amount of information on the Web (Monge and Forbes, 2009).

Summary

In this article, we have looked at social, real-time, semantic, meta-, multilingual, federated, segmented and personalized search, all of which are trends in the current search scene.

One strong overarching trend is the perceived need for human intervention in search. Web crawlers on their own are not sufficient; there is a need for some sort of organization of knowledge, whether in databases of research papers or in databases of facts that can be queried by an algorithm. It may be that the future lies in a combination of database-type searches with more general crawling of the Web.

Another is the realization that no one search engine can accommodate all search requirements. While Google is unlikely to lose market share, serious searchers will still be selective and look to different search engines for different requirements.

Personally, I will continue to use 123people for people searches, Collecta for blog searches, Wolfram|Alpha for facts about a country, DeepDyve or the Deep Web Wiki for more serious academic searches, and Google or Dogpile for more general searches. Now as never before, it is important to be familiar with the different search engines and their capabilities.

References

Bates, M.E. (2010), Bates Info Tip "Bing gets smart", e-mail sent 7 July 2010.

Bradley, P. (2009), "Have fun filling in the blanks", Internet Q&A, Library and Information Update, March 2009.

Bradley, P. (2010a), "Google Open House report", http://philbradley.typepad.com/phil_bradleys_weblog/2010/07/google-open…, accessed 11 July 2010.

Bradley, P. (2010b), "Google Images: just like Bing", http://philbradley.typepad.com/phil_bradleys_weblog/2010/07/google-imag…, accessed 11 July 2010.

Joint, N. (2010), "The one-stop shop search engine: a transformational library technology?", Library Review, Vol. 58 No. 4.

Koch, P. and Koch, S. (2007), "The search engine scene in 2015", Pandia Search Engine News, available at http://www.pandia.com/sew/353-search-2015.html, accessed 8 August 2010.

Kroski, E. (2009), "Search engine wars redux | Stacking the tech", Library Journal, http://www.libraryjournal.com/article/CA6669698.html?&rid=1105906703&so…, accessed 11 August 2010.

Monge, R. and Forbes, C. (2009), "Google custom search engine and library instruction", presentation to Internet Librarian International, 15-16 October 2009, London, UK, available at http://conferences.infotoday.com/documents/82/B202_Forbes.pdf, accessed 11 August 2010.

Pandia (2010), "A soft spot for the Ask search engine", Pandia Search Engine News, available at http://www.pandia.com/sew/3059-a-soft-spot-for-the-ask-search-engine.ht…, accessed 19 August 2010.

Peters, J. (2010), "50+ ways to search Twitter", Social Media Today, 15 April 2010, available at http://socialmediatoday.com/SMC/189327, accessed 11 August 2010.

Sadeghi, H. (2009), "Assessing metasearch engine performance", Online Information Review, Vol. 33 No. 6.

Skipease (2009), "Google's Matt Cutts discusses search engine trends for 2010", available at http://www.skipease.com/blog/google-news-tips/google-search-engine-tren…, accessed 8 August 2010.

Warnick, W. (2010), "Federated search as a transformational technology enabling knowledge discovery: the role of WorldWideScience.org", Interlending & Document Supply, Vol. 38 No. 2.

Wolfram|Alpha (2010), "About Wolfram|Alpha", available at http://www.wolframalpha.com/about.html.