This article attempts to convey the joys and frustrations of skimming the Internet trying to find relevant
information concerning an academic’s work as a scientist, a student or an instructor. A brief overview of
the Internet and the “do’s and don’ts” for the neophyte as well as for the more seasoned “navigator” are given.
Some guidelines of “what works and what does not” and “what is out there” are provided for the scientist
with specific emphasis for biologists, as well as for all others having an interest in science but with little
interest in spending countless hours “surfing the net”. An extensive but not exhaustive list of related
websites is provided.


In the past few years the Internet has expanded to every aspect of human endeavor, especially since the
appearance of user-friendly browsers such as Netscape, Microsoft Internet Explorer and others. Browsers
allow easy access from anywhere in the world to the World Wide Web (WWW), which is a collection of
electronic files that are the fastest growing segment of the Internet. Correspondingly, we are drowning in
a sea of information while starving for knowledge. Can we distill this wealth of information into digestible
knowledge? Yes, with help and perseverance. However, given the magnitude and rate at which the Internet
changes, this article cannot provide a comprehensive guide to available resources; rather, it serves primarily
as a starting-point in the individual quest for knowledge.


The Internet is a worldwide computer network started by the US government primarily to support education
and research. Many books and reviews exist that detail the Internet in almost every aspect. Among these,
“The World Wide Web–Beneath the Surf” by Handley and Crowcroft (1) gives basic information and
history. A succinct overview in a tutorial format has been set up by the University of California at Berkeley
Library (2). It provides a quick start to finding information through the Internet. Information about teaching
and learning through the “Web” can also be found in study modules set up by Widener University’s
Wolfgram Memorial Library (3). For the science aficionado, concise information containing a primer to the Internet for the biotechnologist can be found in a recent review by Lee et al., 1998 (4).

For more in-depth knowledge, two books of interest to the biologist are Swindell et al., 1996 (5) and Peruski and Peruski,
1997 (6). However, given the scope and the rate of growth of the Internet, estimated at 40 million servers
and predicted to reach over 100 million servers by the year 2000 (7), any review can become obsolete within
months of publication. (Table 1 illustrates growth estimates of the Internet).


What are URLs?

URL stands for Uniform (originally Universal) Resource Locator and is analogous to the address format used
in sending and receiving regular mail. The first portion usually refers to the protocol type, for example:

• HTTP (hypertext transfer protocol) allows users to access the information in hypertext format, namely
clickable sites and multimedia (sound, graphics, video).
• FTP (file transfer protocol) permits transfer of files, whether these are text files, image files or
software programs.
• GOPHER is an obsolete text transfer protocol without multimedia access that preceded HTTP.
The next portion of the URL is a set of letters or numbers that indicates the website address and file location. For a more
detailed explanation see “Understanding and decoding URLs” by Kirk, 1997 (8).
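To make this anatomy concrete, a URL can be taken apart programmatically. The sketch below uses Python's standard urllib.parse module on a hypothetical address (example.edu is a placeholder, not one of the sites cited in this article):

```python
from urllib.parse import urlparse

# A hypothetical address, decomposed into the parts described above.
url = "http://www.example.edu/library/guide.html"
parts = urlparse(url)

print(parts.scheme)  # protocol type: "http"
print(parts.netloc)  # website address: "www.example.edu"
print(parts.path)    # file on the server: "/library/guide.html"
```

The scheme corresponds to the protocol portion (HTTP, FTP, or GOPHER in the list above), while the remaining fields carry the address and file information.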


Due to the size of the Internet, one needs to rely on software tools, called search engines, to find
appropriate information. A common start-up site that can provide quick subject catalogs by topic area is
Yahoo (11). Many single or multiple database search engines perform broad searches on a topic by keyword. Links to these can be found through the Internet Public Library (IPL) (12). The most popular engines include: Lycos (13), Excite (14), Infoseek (15), Dogpile (16), and Metacrawler (17). A recent addition that allows for one-step searching of web-pages and full-text journals is Northern Light (18). This engine is recommended for scientists, but access to its full text articles requires payment. A comparison of various search engines’ performance with overall tips for Internet searching can be found at the Okanagan University College Library (19).

Other sites containing links to sites of scientific relevance include SciCentral (20), SciWeb (21), BioMed
Net (22) and Science Channel (23), among others. A comprehensive list cataloguing selected sites for
biomedical sciences can be found at Biosites (24) and at the IPL Biological Sciences Reference (25).
Timely topics in science are provided by Scientific American (26). Abstracts of scientific articles catalogued
by the National Library of Medicine can be searched for free using Medline (27), and those catalogued by
the National Agricultural Library, using Agricola (28). Some sites allow for free perusal of full text but few
such journals exist. A good site for development, cell science and experimental biology can be found at the
Company of Biologists (29). Some free online magazines that may be of interest include: In Scight (30)
produced by Academic Press in partnership with Science Magazine, ScienceNow (31) sponsored by the
American Association for the Advancement of Science, UniSci (32), HMSBeagle (33) from BioMedNet,
as well as Network Science (34).

Despite the abundance of websites, effective and efficient searching can be frustrating when a query results
in over 100,000 hits. Successful search strategies are typically developed through experience and discipline,
although following the guidelines indicated in (2, 3) and the comprehensive basic guide for general
researching and writing from the IPL (35) can be most helpful. Nonetheless, although searching the Internet
has become common and convenient, each WWW site should be approached with caution. Some guidelines
are given below.


The Internet changes daily as resources are added, changed, moved or deleted. Millions of people, young
and old, as individuals or within organizations create resources ranging from basic information about
themselves, their interests or their products, to complex lists of funding resources, multimedia textbooks,
full-text journals, clinical information systems, epidemiological and statistical databases, and the like. One
of the most pressing needs is to evaluate these resources for accuracy and completeness. All information
should be received with skepticism, unless an evaluation of a site can be performed.

Relevant questions in evaluating a site include the following: Is the site affiliated with a reputable institution
or organization such as a University, government or research institution? URLs may reveal this information:
“edu” includes most educational institutions, “gov” indicates government affiliated sites, and “com” refers
to commercial enterprises, while “org” suffixes are used by many non-profit organizations. The two-letter
suffix on non-USA sites indicates the country of origin (8). Is there a tilde (~) in the site address? Usually
personal webpages are indicated with a tilde, and although not necessarily bad, one should be particularly
careful when evaluating such sites. Other questions to keep in mind: Is there a particular bias? Who is the
author? What are their credentials? How current is this site? Many sites have been abandoned and sit as
“junkyards” of old information. How stable is the site? Is the general style of the site reliable? Consider
grammar and spelling.
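The suffix and tilde checks above lend themselves to a quick first-pass filter. The helper below is a hypothetical sketch of those guidelines only; it flags the likely domain type and personal pages but is no substitute for reading a site critically:

```python
from urllib.parse import urlparse

# Rough first-pass screening based on the guidelines above.
DOMAIN_TYPES = {
    "edu": "educational institution",
    "gov": "government-affiliated site",
    "com": "commercial enterprise",
    "org": "non-profit organization",
}

def quick_screen(url):
    parsed = urlparse(url)
    # Last label of the host: "edu", "gov", "com", "org", or a country code.
    suffix = parsed.netloc.rsplit(".", 1)[-1]
    kind = DOMAIN_TYPES.get(suffix, "country code or other suffix")
    # A tilde in the path usually marks a personal webpage.
    is_personal = "~" in parsed.path
    return kind, is_personal

print(quick_screen("http://www.example.edu/~jsmith/results.html"))
# → ('educational institution', True)
```

A "personal page" verdict is not necessarily bad, as noted above, but it signals that the authorship and currency questions deserve extra scrutiny.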

Critical evaluation of websites

Many websites provide strategies for the critical evaluation of webpages. The University of Florida offers
a list of short tips (36), Purdue University provides a step-by-step checklist (37), and Widener University
has page-specific checklists (38). Another list of evaluation resources compiled by librarians can be
found through the University of Washington Libraries site (39).

The following are some points to consider when visiting sites:
1. Content: is real information provided? Is content original or does it contain just links? Is the
information unique or is it a review? How accurate is it? What is the depth of content?
2. Authority: who or what is the source of the information? What are the qualifications?
3. Organization: how is the site organized? Can you move easily through the site? Is the information
presented logically? Is the coverage adequate? Can you explore the links easily? Is there a search
engine for the site?
4. Accessibility: can you access the server dependably? Does the site require registration? If so, is it
billed? Can it be accessed through a variety of connections and browsers? Is it friendly for text
viewers? How current is it? Is it updated regularly?
5. Ratings: is the site rated? By whom? Using what criteria? How objective is it? If the site is a rating
service itself, does it state its criteria?


Information from any source should be properly referenced whenever possible as intellectual property and
copyright laws usually apply. Electronically stored information presents new challenges since no method
exists to easily monitor this vast “global library”. However, scholarly activity should maintain a high
standard of conduct by following appropriate citation protocols.

Several citation formats exist for referencing webpages. Two common citing conventions are the MLA style
from the Modern Language Association of America (40), and the APA style from the American
Psychological Association (41). The latter acknowledges a guide by Li and Crane, 1996 (42) to its style
for citing electronic documents. Slight variations exist, depending on whether the citation is from individual
works, parts of works, electronic journal articles, magazine articles, or discussion list messages. Detailed
information for these can be found in Crane’s webpages (43), for APA style and for the MLA style (44).
A proposed Web extension to the APA style has recently been reported by Land (45). Consider, however,
that there are many citation style guides for electronic sources. Some of these sites are listed at the
University of Alberta Libraries (46).

All references should generally contain the same information that would be provided for a printed source
(or as much of that information as possible). If the author of the site is given, the last name and initials are
placed first, followed by the date the file was created or modified (full date in day/month/year format, or
year and month where feasible) and the title of the site in quotation marks. If an affiliation to an organization
is known, it should be indicated. The date the resource was accessed is placed next, and finally the complete
URL within angle brackets. Care should be taken not to give authorship to webmasters, who are responsible
for posting or maintaining information on webpages but are not the originators of the content; they can,
however, be referenced as editors with the generic Ed. abbreviation. Finally, some Internet resources are
also published in hard copy; in those cases, the appropriate print citation format should be followed and
the URL should also be indicated.
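The field ordering just described can be sketched as a small formatting helper. The layout below is an illustrative approximation of that general pattern, not an official MLA or APA template, and all example values are hypothetical:

```python
def cite_webpage(author, created, title, accessed, url, affiliation=None):
    # Assemble the fields in the general order described above: author,
    # creation/modification date, quoted title, optional affiliation,
    # access date, and the complete URL in angle brackets.
    citation = f'{author} ({created}). "{title}".'
    if affiliation:
        citation += f" {affiliation}."
    citation += f" Accessed {accessed}. <{url}>."
    return citation

print(cite_webpage("Smith, J.", "12 March 1998", "A Guide to the Internet",
                   "5 June 1999", "http://www.example.edu/guide.html"))
```

For an actual bibliography, the authoritative punctuation and ordering should be taken from the MLA or APA guides cited above.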

Organization of Bibliography

Bibliographic format varies according to the preference of the publisher, institution, or journal. In general, list authors in alphabetical order or in numerical order of appearance. Some prefer separate bibliographies for paper-based “hardcopy” references and for “softcopy” electronic sources; others permit intermixing (as in the present article). If the author is unknown, site names are listed in the appropriate order. Should some information be missing, it is acceptable to omit it and still cite the reference; for example, some sites may not show authors or dates or give any indication of affiliation. However, the URL should always be indicated.


The Internet holds vast and exciting possibilities for the scientific community and for society as a whole.
The power of the individual can be multiplied by the “click of a mouse” as new capabilities are provided
by linking various computing systems to the global village. Nevertheless, the Internet as seen through the
WWW can be addictive: one clicks effortlessly from one site to another in a seemingly endless and aimless
loop. Enjoy or despair, at your own risk!

Written By: E. Misser




Each day more and more government agencies, foundations, corporations and other grantmakers are adding information to the Internet. In addition, information on individuals' wealth is also readily available. Most of this information can be found by going directly to Web pages or to specially designed databases (which often charge a fee). Resources in both of these areas can help fundraisers search for appropriate funding or determine a prospect's net worth. Increasingly, most research can be done from your computer. This article first focuses on researching grantors online, followed by researching individuals' wealth using the Web.


On the Internet there are several approaches to information on foundations. One is visiting the homepage created by the grantmaker. This contains descriptive material on the mission, perhaps a brief listing of the most recent grants, and guidelines on eligibility and how to apply. Sometimes you can note you've been there by signing their guest book, requesting further information and asking to be put on their mailing list. You can always see how the foundation presents itself visually: does it favor lots of bells, whistles, and glitz, or is the website rather plain vanilla in its look? This offers you a lot of information on how to shape your approach.

The following three websites offer free lists or connections to foundations:

  • The Council on Foundations is the trade association of grantmakers. Its site www.cof.org connects you to a growing list of the websites of its more than 1,800 members.
  • Philanthropy News Network Online offers current news about philanthropic organizations, allows you, through its LINKS button, to see various lists of grantmakers (private foundations, corporations, and community foundations) and offers a Meta-index of Nonprofit Sites.
  • The Foundation Center's homepage (www.fdncenter.org) connects you through its GRANTMAKERS button to many brief foundation descriptions and links to their websites. You can also contact a professional librarian who is available to answer questions submitted via e-mail about foundations, nonprofit resources, corporate giving, and the best utilization of the Center's wide range of information services and resources. Along with this Online Librarian, the Electronic Reference Desk has two other components: responses to frequently asked questions (FAQs) and a directory of links to other nonprofit sites of interest. Another service provided by The Foundation Center is their Philanthropy News Digest, a free emailed overview of grantmaker news.

Some resources, like the following databases, charge fees for access, which can get expensive quickly. The Dialog Corporation provides definitive online information on grantmakers. You access DIALOG (after subscribing) through the Internet or directly through your modem. For details, dial them up at 800-334-2564 or look at their website www.dialog.com. This vendor provides over 600 databases, some of them of particular interest to those doing funding searches.

While DIALOG is not easy for a novice to search, these valuable databases offer excellent, current, and in-depth information. If you want to start your foundation grants research by using the three DIALOG databases, here is a hint to save money, time and frustration: Find a nearby techie type of corporation that searches DIALOG for its own daily information needs. Any research and development organization in fields such as pharmaceuticals, computers, or biological science will probably have a librarian on staff who does this type of database searching. Request a gift-in-kind from the corporation of an hour or so of searching with the firm's most experienced DIALOG searcher. This individual will know how to speak to DIALOG efficiently. With your thorough knowledge of the project you want funded, you will be sitting right there to help refine a too huge or too tiny search. Easy? No. But searching online saves weeks of print research.

Also, The Foundation Center, a clearinghouse of grantmakers' information, has two huge, beautifully crafted, elegantly indexed databases -- one covering 900,000 grants and the other information on 53,000 foundations. The Foundation Center databases are also available for purchase on CD-ROM (discussed below). OryxPress, publisher of many directories, has one database that indexes thousands of grants offered by federal, state, and local governments, commercial organizations, associations, and private foundations.

OryxPress also offers this database on the Internet, calling it GrantSelect (www.grantselect.com) with 10,000 funding opportunities in both the United States and Canada. Oryx also offers an E-mail Alert Service to update users.

The Chronicle Guide to Grants (www.philanthropy.com/grants) is another online database that presents all corporate and foundation grants listed in The Chronicle of Philanthropy since 1995. Access is over the Internet, by subscription. This is the archive of the grants you can review on their Website every other week.

Specialized Lists

Amazingly enough, some of the large colleges and universities are willing to share the lists of foundations they have researched. One example is Foundations Relations Digest. This publication comes from a department within the Office of University Development and Alumni Relations at Columbia University. Their Digest gives lots of interesting details on private foundations.

Search Engines

Search engines have already done the hard work of searching the Internet and entered millions of sites into databases for you. Any search engine will help you find "grantmakers." Even though some curiosities will appear, since searching the Internet is always an adventure, there are often very fruitful finds. Try this large search engine to start: Ixquick Metasearch (www.ixquick.com) searches 14 of the major web search engines simultaneously and has a star rating system to help you find the sites that include the highest number of appearances of the term for which you're searching.


The Foundation Center's FC Search CD-ROM gives access to over 53,000 foundations, over 210,000 current grants and over 200,000 trustees/officers/directors. It is easy to use and, since this is a purchase, not a subscription, you won't feel the pressure of the time clock to research quickly.

In Taft's Prospector's Choice you'll find detailed funder profiles covering nearly 10,000 foundations and corporate giving programs providing information on up to 50 grants per profile, as well as total giving figures and helpful directions for making contacts and completing applications.


Finding corporation and business funding on the Internet is sometimes as easy as remembering their name or abbreviation; for example, the Internet address for IBM is www.ibm.com. Some corporations such as IBM include corporate giving information on their website. On the "about IBM" page, there is a "philanthropy" link with details of their giving. In addition, addresses, stock filings and holdings, and annual reports of public companies are often available.

For corporate giving information, the current best Internet source is The Foundation Center's "Corporate Grantmakers on the Internet" (http://fdncenter.org/grantmaker/gws_corp/corp.html). A site search engine offers quick access by subject and geographic terms. Brief listings contain the company's interests and giving areas but often no mention of the amounts of money that have been given. Each site links you with the corporation's own website, providing many more details on the company (but not necessarily on their corporate giving).

There are any number of good sites for general corporate and business information on the Internet:

The Insider Trading Monitor is a commercial online service that is available through DIALOG or directly from Thomson Financial Wealth Identification (formerly CDA/Investnet), the leading provider of wealth identification data, insider trading information, and prospecting solutions. Since 1983, Wealth ID has been helping America's largest nonprofit organizations, banks, insurance companies, and brokerage firms identify and track high net worth prospects, donors, and clients (www.wealthid.com). The Insider Trading Monitor compiles all SEC information on 10,000 public company insiders (covering over 200,000 executives, directors, and major shareholders). You can search for your prospect and see what his or her stock holdings are or whether he or she has purchased or sold stock. This helps in estimating giving capacity and liquidity, and can be used to determine what type of gift, cash or planned, makes sense, particularly if capital gains are an issue. This information is only available on publicly traded companies, and only on stock that is held by a company insider. Private portfolio information is not included.

Hoovers (www.hoovers.com). Contains, in its COMPANIES & INDUSTRIES section, 12,500 profiles of corporations. Snapshots are free; in-depth profiles entail a nominal monthly fee.

Business Journals' Book of Lists (www.bizjournals.com). At the end of the year, each of the 41 business journals throughout the U.S. compiles the lists of the top 25-or-so businesses they have highlighted throughout the year and makes them available on paper and on CD. Here are your neighboring businesses with addresses, CEO/decision makers, amount of sales, and number of employees.

Securities and Exchange Commission (SEC) (www.sec.gov). Salaries and stock holdings of top executives, as well as their corporate board memberships, are listed on proxy statements filed with the SEC. The information is available online through the SEC's EDGAR database.


For free information, you can't beat what the U.S. Government gives away. Government funding research begins with The Catalog of Federal Domestic Assistance (CFDA) (www.cfda.gov). Every federal funding program is listed in full detail. Click on "Search The Catalog (FAPRS)" to get to the very straightforward searching screen. CFDA contains over 1500 financial and non-financial assistance programs worth about $300 billion a year. Among them are: grants, loans, use of property, facilities, equipment, technical assistance, direct payments, insurance, advisory services and counseling, and training. The CFDA search tool is user-friendly. Use a keyword, subject, or phrase to search. A list appears of the programs that have that word anywhere in the entry. Click the program that interests you, and the entry in its entirety is shown.

Each entry contains: restrictions, eligibility, application & award process (including deadlines), assistance considerations (formula & matching requirements, timing), post assistance requirements, financial information, program accomplishments, regulations (guidelines), contacts, related programs, examples of funded projects, and criteria for selection. The Catalog is also available free in print at most public libraries.

More Government Access Points

Direct access to most of the federal agencies and departments is also available online. These sites give you more of a flavor of the departments and agencies and often give you detailed information on their grantmaking. If you know that you want to see the National Endowment for the Humanities, the site is www.neh.fed.us. Look at the "Overview of Programs" or "Grants & Applications" to get their details. A further sampling:
National Institutes of Health (www.nih.gov) Check out their "Grants & Contracts."
National Science Foundation (www.nsf.gov) Look at their "Grants & Funding Opportunities."
Department of Veterans Affairs (www.va.gov) Use their SEARCH key to look for "grants."
Department of Education (www.ed.gov). Using their SEARCH key with the term "grants" turns up a large listing. The related site of the Office of the Chief Financial Officer of the DOE (gcs.ed.gov) offers an even more compact listing, "Grants & Contracts Information."
Department of Housing and Urban Development (www.hud.gov). Look at the "Funding Announcements".
Some states, counties and even cities are beginning to list similar kinds of information on their home pages. For example, in Maine (www.state.me.us), there are over 3,000 matches to the search term "grants." In California, www.ca.gov brings up about 8000 pages under a quick search for "grants." Wyoming brings up over 4,000 grant links. In San Francisco (www.ci.sf.ca.us), not only do you access The City site but many departments with funding resources, too. You might check out the San Francisco Commission on the Status of Women (www.ci.sf.ca.us./cosw), where among the items listed are several of their grants programs. Another San Francisco government site is Grants for the Arts (www.sfgfta.org). Call your city hall clerk and ask what local funding sources are available online. Your state/county/city home page is a rapidly expanding area so what you find next week may be much more extensive than what is available today.

Individual Donors

One of the best sources of basic information is your local newspaper. If the person is active in the community, there will usually be at least one article that gives you leads on his or her business or occupation, family connections, volunteer activities, social circles, professional organizations, etc. Many newspapers have their own websites with search engines so you can search for current or archived information (e.g. www.nytimes.com); some are free and some charge a fee for searching. Many newspaper archives still come on microfiche, which can be found at the library. Business journals often run feature articles on local business people, and they often have their own websites as well. Check out BizJournals (www.bizjournals.com) for information on the business journal in your area.

Some websites that allow you to do your own searching are:

Internet Prospector People (www.internet-prospector.org/bio.html): a collection of people locators, capacity tools, and specialized directories, including among many others:

1. African Americans in Biography (www.internet-prospector.org/bio-afri.html) - access to the 100 wealthiest African-Americans (specifically not sports or entertainment people), lists of Black Greeks (fraternities and sororities), African-Americans in science, prominent African-Americans, and more.

2. Women in Biography (www.internet-prospector.org/bio-women.html) links to women in mathematics, engineering, physics, architecture, politics, air & space, National Hall of Fame, technology international, and other sites.

There are also commercial search services that allow you to search newspapers electronically from around the country for articles on donors/prospects. Three of the most popular commercial services are The Dialog Corporation (www.dialog.com), Lexis-Nexis (www.lexis-nexis.com/lncc), and Dow Jones News Retrieval (dowjones.wsj.com/p/main.html). Another recent arrival on the scene is WealthKnowledge.com (www.WealthKnowledge.com), which specializes in identifying the wealthy and studying their behaviors and attitudes.

The Biography and Genealogy Master Index from the Gale Group (www.gale.com/servlet/BrowsePageServlet) has millions of entries on individuals from hundreds of biographical and business resources. If your prospect appears in any biographical or business reference book such as Who's Who or Standard & Poor's, the index will list the prospect and all the reference sources in which he or she appears. Birth dates and middle initials are included to help you verify that it is the correct person. It comes in book form, in microfiche versions (known as BioBase), and now online through Gale's online reference service.

The Complete Marquis Who's Who ONLINE combines 20 of its publications of professional and biographical data. It is accessible through The Dialog Corporation's File #234 and includes vital statistics (name, address, age, birthplace, marital status), as well as education, family background, religious and political history, creative works, civic and political activities, profession, and club memberships. A CD-ROM product, The Complete Marquis Who's Who on CD-ROM (www.marquiswhoswho.com/product.html), is also available and makes searching much easier.

Sites useful in nailing down addresses, telephone numbers, email addresses:

Four11 (www.Four11.com), "The Internet Whitepages," contains millions of listings allowing you to search for high school or college colleagues, former or current neighbors, co-workers, researchers in your field, members who enjoy the same chat groups or Usenet groups, and a dozen other definers.
Looking for celebrities? Try (www.celebrity-addresses.com). You can search, and be listed here, for a small fee, or try (www.celebfanmail.com) which had over 14,000 listings and offers nonprofits free access to their database.
555-1212 (www.555-1212.com) is a telephone directory for the US and Canada.
Yahoo! People Search (www.yahoo.com/search/people) allows you to find the elusive someone's telephone number or email address, and is linked to (www.USSearch.com), where for a fee you can find out details such as assets, home ownership, and building ownership and value.

Individual Business/Occupation Information

If you know your prospect is a practicing professional, there are many directories available to learn more about someone and his or her business. On the Web you can search Martindale-Hubbell Law Directory (www.martindale.com) or West Legal Directory (www.lawoffice.com) for attorneys. The American Medical Association has a physician select search for members of the AMA (www.ama-assn.org/aps/amahg.htm). Hard copy books such as the Official ABMS Directory of Board Certified Medical Specialists and Standard & Poor's Register of Directors and Executives are also good resources to confirm businesses and titles.


If your prospect works for a public company traded on a stock exchange, try the Edgar website (www.sec.gov/cgi-bin/srch-edgar). At this site, you search by company name or ticker symbol and have direct access to the proxy filed by the company in which you are interested. The proxy lists all the board members and top executives plus their stockholdings and salaries. Proxies often include brief bios telling you more about your prospect. Along with Insider Trading Monitor, the Edgar Online People website (edgar-online.com/people) lets you search (for a fee) SEC filings by a person's name to determine all the companies on whose boards he or she sits. You can always call the company and ask for a proxy statement and the annual report.

Many businesses, small and large, have their own web pages that contain profiles or bios of principal owners and managers. Commercial services such as The Dialog Corporation and Lexis-Nexis also have databases of company information and industry analysis. Hoovers (www.hoovers.com) contains over 12,500 profiles of corporations: snapshots are free; in-depth profiles entail a nominal monthly fee.

Depending on the size of the business, there are several other places to search. Use a local business directory or a local Book of Lists (www.bizjournals.com), or call the chamber of commerce to see if they have any information on your prospect's business. Standard business print references include Dun's Million Dollar Directory, American Business Disc, Dun's Market Identifiers, Standard & Poor's, and Disclosure, all available online and on CD-ROM. All provide information on a company's size, assets, sales and top officers. You can usually find an individual if he or she is one of the top five to ten officers of a public company. As a general rule, it tends to be more difficult to find information on a private company and its owner.

Other Wealth Indicators

The American Almanac of Jobs and Salaries is a good general reference for what people earn in a wide range of professions. The Insider Trading Monitor is a commercial online service available through DIALOG or directly from Thomson Financial Wealth Identification (www.wealthid.com). Thomson Financial Wealth Identification, formerly CDA/Investnet, is a leading provider of wealth identification data, insider trading information, and prospecting solutions; since 1983 it has helped America's largest nonprofit organizations, banks, insurance companies, and brokerage firms identify and track high net worth prospects, donors, and clients.

Other useful wealth indicators include local country club membership lists, other nonprofit organizations' annual reports, and the membership directories of civic and volunteer groups.

Copyright 2007 Zimmerman Lehman.


Categorized in Research Methods

Jurors who do their own research on cases before them should face criminal prosecution, the Law Commission has said.

The call comes after a number of jurors have been jailed for using the internet to supplement the evidence presented to them in court. It has been welcomed by the Attorney General, who said the Government will discuss the idea more formally.

Currently judges at criminal trials direct jurors as to what they can and can't do. This would include not using the internet or any other source to find out more about the case. Jurors face action for contempt of court if they disobey those instructions.

But the commission, an independent statutory body which keeps the law under review, says this leads to inconsistency and can generate confusion.

A Law Commission spokesman said: "The Commission's recommendations will also help to clarify for jurors what they can expect if they do search for information on the trial.

"Jurors accused of this form of contempt are currently tried in an unusual procedure in the Divisional Court.

"Under the Commission's recommendations, jurors who search for information in this way would be committing a criminal offence and be tried in the Crown Court in the usual way."

The Law Commission also recommends extending the defences available to jurors who disclose their concerns after a trial that there had been a miscarriage of justice.

At present, jurors can alert the court if they are worried about the way in which a jury's deliberations are taking place.

But, the Commission's spokesman said, there would be some cases in which jurors needed to report concerns after a case was over.

The Commission is recommending that if proceedings had concluded, a juror who feared that there had been a miscarriage of justice should be able to take his or her concerns not only to the court but also to the Criminal Cases Review Commission and the police.

Attorney General Dominic Grieve QC said: "Juror contempt is a serious risk to justice but people are often not aware of the consequences.

"The Law Commission's proposal to make it an offence for jurors to search for information about their case on the internet or by other means would make the position absolutely clear and would, I hope, reduce the need for future prosecutions.

"I will now need to discuss the recommendations carefully with my Government colleagues before we respond formally."

The Law Commission recommendations come after a series of cases in which jurors have faced proceedings for contempt after researching cases on the internet.

In July, Joseph Beard, 29, was jailed for two months after being found to have committed contempt while sitting as a juror at a trial at Kingston Crown Court by doing internet research about the case.

In January last year, university lecturer Dr Theodora Dallas was jailed for six months - she was ordered to serve half of the term, with the rest spent on licence - after telling fellow jurors that she had discovered on the internet that the defendant in the trial in which they were sitting had once been accused of rape.

Source: http://www.independent.co.uk

Categorized in Online Research

Today, internet use is greater than ever before. Over the last decade there has been a considerable increase in the number of internet users across different age groups, genders, cultures, and interests, and consequently they use the internet for vastly different purposes. Easy accessibility has increased the average individual’s reliance on the internet for all matters, large and small.

However, to get the most out of the web, some degree of specialization is necessary. This may be done through specialized search engines, databases, or browsers. Most people use general-purpose search engines and standard browsers, which work well for many kinds of tasks; but to improve your searching experience and increase efficiency, these specialized search tools are worth learning.

The browser market is typically dominated by Chrome, Internet Explorer, Firefox, and Opera. Safari comes into the picture too when we look at Apple products. Yet other browsers also exist in the market. They differ not only in name, but in offering something more than the traditional web browsers do. Strata by Kirix, for example, turns a web table or a page full of numbers into a dynamic spreadsheet; if you work regularly with numbers, this browser can be far faster than the alternatives. Or if you are passionate about music and would like to listen to your own collection while you surf the internet, there is no need to switch between applications anymore: Songbird allows you to do both simultaneously in a single browser. Songbird is, in effect, an amalgamation of Firefox and iTunes.

For research tasks, numerous specialized search options are available as well. The basic choice is between a search engine and a database, but within each, numerous specialized options exist depending on the field you are researching. Music enthusiasts can choose between MixTurtle.com and SongBoxx.com. For images, PicSearch.com can search for any photo available to the public online, and if you prefer to browse or search a trusted database of images, vectors and videos, shutterstock.com can be a good choice.

As for scholarly research, since the emphasis is more on value and authenticity, using a database is a good option. One can make use of the specialized subject databases available for various disciplines: for research on philosophy, medicine, sports, or economics, searching a subject-specific database is not only faster but easier as well. However, some researchers might prefer open web content over database resources, and for them a specialized search engine is the best alternative. There are search engines for disciplines like history, math, and science, although you have to be careful when using these materials. Since reliability of information is important in academic research, search engines like iSeek.com come to the rescue by displaying only reviewed content, saving you the trouble of verifying a resource’s authenticity.

Considering the above, it is undoubtedly useful to employ specialized tools such as specialized browsers, databases, or search engines whenever you need to search for specific information on a subject.


Categorized in Online Research

With blogging comes great responsibility. You define the content of your weblog, and you carry full responsibility for every word you publish online. More than that, you are responsible for the comments on your posts. To make sure you fulfill your legal obligations, it is important to know what you, as a blogger, may or should do, and how to achieve it. After all, ignorance of the law does not exempt one from compliance with it.

From a legal point of view, copyright on the Web is often considered a grey area; as such, it is often misunderstood and violated, mostly because bloggers simply don’t know which laws they have to abide by and which issues they have to consider. In fact, copyright myths are common, as are the numerous copyright debates on the Web.

It’s time to set the facts straight. In this post we’ve collected the most important facts, articles and resources related to copyright issues, law and blogging. We’ve also put together the most useful tools and references you can use when dealing with plagiarism.

You don’t have to read the whole article: a brief overview of the key points appears at the beginning of the post. Let’s take a look.

Copyright in the Web: An Overview

  1. Copyright applies to the Web.
  2. Your work is protected under copyright as soon as it’s created and protected for your lifetime, plus 70 years.
  3. Copyright expires. When copyright expires, the work becomes public domain.
  4. Ideas can’t be copyrighted; only the tangible expression of an idea can. (updated)
  5. You may use logos and trademarks in your works.
  6. You may use copyrighted material under the “fair use” doctrine.
  7. You may quote only limited portions of work. You may publish excerpts, not whole articles.
  8. You have to ask author’s permission to translate his/her article.
  9. The removal of the copyrighted material doesn’t remove the copyright infringement.
  10. If something looks copyrighted, you should assume it is. (updated)
  11. Advertising protected material without an agreement is illegal.
  12. You may not always delete or modify your visitors’ comments.
  13. User generated content is the property of the users.
  14. Copyright is violated by using information, not by charging for it.
  15. Getting explicit permission can save you a lot of trouble.

What is Copyright?

  • Copyright is a set of exclusive rights regulating the use of a particular expression of an idea or information. At its most general, it is literally “the right to copy” an original creation. In most cases, these rights are of limited duration. The symbol for copyright is ©, and in some jurisdictions may alternatively be written as either (c) or (C).” [Wikipedia: Copyright]
  • Copyrightable works include literary works such as articles, stories, journals, or computer programs, pictures and graphics as well as recordings.
  • “Copyright has two main purposes, namely the protection of the author’s right to obtain commercial benefit from valuable work, and more recently the protection of the author’s general right to control how a work is used.” [10 Big Myths about copyright explained]
  • “Copyright may subsist in creative and artistic works (e.g. books, movies, music, paintings, photographs, and software) and give a copyright holder the exclusive right to control reproduction or adaptation of such works for a certain period of time (historically a period of between 10 and 30 years depending on jurisdiction, more recently the life of the author plus several decades).” [Wikipedia: Intellectual Property]

Copyright in the Web

  • Copyright applies to the Web. Copyright laws apply to all materials in the Web. All web documents, images, source code etc. are copyrighted.
  • Copyright protects the rights of owners. “Owners have exclusive rights to make copies, create derivative works, distribute, display and perform works publicly. Certain artists have rights of integrity and attribution (moral rights) in original works of art or limited edition prints (200 or fewer).” [Copyright in Cyberspace]
  • Everything created privately after April 1, 1989 is copyrighted “automatically” and protected for your lifetime, plus 70 years. “In U.S. almost everything created privately and originally after April 1, 1989 is copyrighted and protected “automatically”. Explicit copyright is not necessary. The default you should assume for other people’s works is that they are copyrighted and may not be copied unless you know otherwise. There are some old works that lost protection without notice, but frankly you should not risk it unless you know for sure.” [10 Big Myths about copyright explained; Copyright Office Basics]
  • Your work is protected under copyright as soon as it’s created. No record or registration with the U.S. Copyright office is required for this protection. [12 Important U.S. Laws]
  • You don’t have to register the copyright, but you probably should. “The reason for this, under the US Copyright Act, is that registration of the copyright within ninety (90) days of publication (or before infringement takes place) is necessary to enable the copyright owner to receive what are referred to as “statutory damages.” [Copyright: Know The Facts]
  • Copyright expires. According to the Berne Convention, the copyright period lasts at least the life of the author plus 50 years after his/her death. For photography, the Berne Convention sets a minimum term of 25 years from the year the photograph was created, and for cinematography the minimum is 50 years after first showing, or 50 years after creation if it hasn’t been shown within 50 years after the creation. This applies to any country that has signed the Berne Convention, and these are just the minimum periods of protection. [What is Copyright?; Wikipedia]
  • When copyright expires, the work becomes public domain. “Basically, any writing that is no longer protected by copyright is in the public domain.” [Copyright Essentials]
  • Copyright hasn’t expired if the copyright date isn’t correct. “If a copyright statement reads, “© Copyright 1998, 1999 John Smith.”, it does not mean that John Smith’s copyright expired in 2000. The dates in the copyright statement refer to the dates the material was created and/or modified, but not to the dates the owner’s material will expire and become public domain.” [What is Copyright?]
  • Ideas can’t be copyrighted. “You must first write the story, because it is your own, original expression of that idea that is protected under law. If you have a brilliant idea for a story, you’d best keep it to yourself until you do.” [Copyright Essentials For Writers]
  • “The correct form for a notice is “Copyright [dates] by [author/owner]”. You can use a C in a circle (©) instead of “Copyright”, but “(C)” has never been given legal force. The phrase “All Rights Reserved” is not required.” [10 Big Myths about copyright explained]

You May…

  • You may use copyrighted material. “Fair use is a doctrine in United States copyright law that allows limited use of copyrighted material without requiring permission from the rights holders, such as use for scholarship or review. It provides for the legal, non-licensed citation or incorporation of copyrighted material in another author’s work under a four-factor balancing test. It is based on free speech rights provided by the First Amendment to the US Constitution.” [Wikipedia: Fair Use]
  • You may use materials that are not subject to copyright. “Apart from facts and ideas there are many other classes of materials that can not be protected under the Copyright Law. Those materials include names, familiar symbols, listings of ingredients or contents, short phrases, titles, slogans and procedures (notice that some of those materials might be protected by trademark, though).” [Copyright Law: 10 Do’s and Don’ts]
  • You may use logos and trademarks in your works. Commenting on some facts or reporting about a company, you can use its logo under a “nominative fair use”. [Copyright Law: 12 Do’s and Don’ts]
  • You may publish excerpts, not whole articles. “If you want to share someone else’s content with your own audience, just quote a brief excerpt, and provide proper attribution with a link to the source, but don’t republish the entire article without permission. It will save you a lot of trouble down the road. This is a fairly standard practice on popular blogs.” [Copyright and Intellectual Property]
  • You may comment upon and report about copyrighted material. “The “fair use” exemption to (U.S.) copyright law allows commentary, parody, news reporting, research and education about copyrighted works without the permission of the author.” [10 Big Myths about copyright explained]
  • You may not always quote copyrighted content. Depending on the copyrighted statement, the owners of the material may forbid the copying and distribution of articles or its parts. [What is Copyright?]
  • You may quote only limited portions of work. “Under the fair use doctrine of the U.S. copyright statute, it is permissible to use limited portions of a work including quotes, for purposes such as commentary, criticism, news reporting, and scholarly reports.” [Copyright Explained]
  • You may use the de minimis principle. “Copyright isn’t concerned with very little things. It does not protect so-called de minimis works, the classic examples of which are titles (such as The Da Vinci Code) and newspaper headlines (such as Small earthquake in Chile, not many killed); nor does copyright prevent “insubstantial copying” from a work which is protected by copyright. Unfortunately it is often difficult to decide whether a work is really de minimis, or an example of copying insubstantial.” [10 Things About Copyright]

You Should…
Bloggers’ Rights and Duties

  • Ignorance of the law does not exempt one from compliance with it. You carry full responsibility for everything you publish in your weblog. When using protected work, make sure you fulfill your legal obligations.
  • Be aware of your responsibility. Check your facts, consider the implications, control the comments, give credit where credit is due, disclose professional relationships, disclose sponsored posts, avoid “blackhat” methods. [10 Rules for Responsible Blogging]
  • Make it easy to distinguish paid and editorial content. “Never claim that you are an objective, unbiased source if you are being paid to provide information. Always make it easy for your readers to distinguish between advertising and editorial content.” [12 Important U.S. Laws Every Blogger Needs To Know]
  • You should ask author’s permission to translate his/her article. According to the Berne Convention, “Authors of literary and artistic works protected by this Convention shall enjoy the exclusive right of making and of authorizing the translation of their works throughout the term of protection of their rights in the original works.” Therefore you need a permission to translate an article into another language. [What is Copyright?]
  • You should not present stolen content. “The law does not provide protection for federal crimes or intellectual property violations, meaning that you can potentially be found contributorily liable if this type of behavior takes place on your site.” [12 Important U.S. Laws]
  • You should use copyrighted material only if you have explicit permission from the author to do so (or if you make fair use of it). Copyright infringement is possible even if credit to the author is given. [Copyright Law: 12 Do’s and Dont’s]

Things To Be Aware Of

  • Freeware doesn’t belong to you. “Graphic images and fonts provided for “free” are not public domain. The ownership of this material remains within the creator of the material. You may use them if you comply with the owner’s terms and conditions.” [What is Copyright?]
  • Getting explicit permission can save you a lot of trouble. If you are sued for copyright violation, you must admit to the infringement, and then hope that the judge or jury agrees with your arguments. It’s faster and safer to just ask permission. [Copyright on the Web]
  • Copyright is violated by using information, not by charging for it. “Whether you charge can affect the damages awarded in court, but that’s main difference under the law. It’s still a violation if you give it away — and there can still be serious damages if you hurt the commercial value of the property.” [10 Big Myths about copyright explained]
  • User generated content is the property of the users. “The fact that you do not own the user-driven content on your site can create a number of headaches for bloggers, such as an obligation to remove a comment whenever the author requests.” [12 Important U.S. Laws]
  • If you use protected material, the new work doesn’t belong to you. “Work derived from copyrighted works is a copyright violation. Making of what are called “derivative works” — works based or derived from another copyrighted work — is the exclusive province of the owner of the original work.” [10 Big Myths about copyright explained]
  • The removal of the copyrighted material doesn’t remove the copyright infringement. Once the copyright is violated, the case is created – it doesn’t matter whether the protected material is currently on the Web or not. [Copyright Law: 12 Do’s and Dont’s]
  • If something looks copyrighted, you should assume it is. “It is true that a notice strengthens the protection, by warning people, and by allowing one to get more and different damages, but it is not necessary. ” [10 Big Myths about copyright explained]
  • Advertising protected material without an agreement is illegal. “It’s up to the owner to decide if they want the free ads or not. If they want them, they will be sure to contact you. Don’t rationalize whether it hurts the owner or not, ask them.” [10 Big Myths about copyright explained]

Grey Area

  • Nobody really knows if linking is always allowed. “When linking to illegal or infringing copyrighted content the law of linking liability is currently considered a grey area. But if you have an ordinary web site, and linking is not going to bypass some security, or payment system such as advertising, and there’s no information anywhere about the site not wanting you to link in and no reason to believe they don’t want it, linking should be very safe.” [Links And Law; Linking Rights; Wiki: Link]
  • It is reasonable to provide terms of service for comments. “Posters should be informed that they are responsible for their own postings. The newsroom should consider advising readers that the newsroom does not control or monitor what third parties post, and that readers occasionally may find comments on the site to be offensive or possibly inaccurate. Readers should be informed that responsibility for the posting lies with the poster himself/herself and not with the newsroom or its affiliated sites.” [Dialogue or Diatribe?]
  • You may not always delete or modify your visitors’ comments. “You should never treat comments as though you own them by manipulating them or deleting them without having included a terms of service which gives you permission to do so. Consider that if you are allowing anonymous posts you will have no way of verifying the true owner of a comment when someone emails you asking for you to take a comment down. Consequently, you should make sure to at least collect basic identifying information before allowing someone to comment or post on your site.” [12 Important U.S. Laws]

How to React to Plagiarism?

Tools and Services

  • Free Legal Forms for Graphic Design
  • Copyright Flow Chart
    Flowchart for determining when U.S. Copyrights in fixed works expire.

Open Content Licenses

Related Articles

  • Call for a Blogger’s Code of Conduct
    “In a discussion the other night at O’Reilly’s ETech conference, we came up with a few ideas about what such a code of conduct might entail. These thoughts are just a work in progress, and hopefully a spur for further discussion.”
  • Copyright: A Seemingly Shifting Target, Q&A
    Frequently Asked Questions Guide to copyright issues as they apply in the world of digital video production, post production and distribution. By Douglas Spotted Eagle.
  • What is Copyright Protection?
    This page covers the basic definitions regarding copyrights. It has been written using the Berne Union for the Protection of Literary and Artistic Property (Berne Convention) as the main bibliographical source.
  • 10 Things You Need to Know Before You Blog
    A list of important aspects to keep in mind if you are a blogger plus dozens of references to related articles.
  • 10 Big Myths about copyright explained
    An attempt to answer common myths about copyright seen on the net and cover issues related to copyright and USENET/Internet publication.
  • 12 Important U.S. Laws Every Blogger Needs to Know
    This article highlights twelve of the most important US laws when it comes to blogging and provide some simple and straightforward tips for safely navigating them.
  • Copyright and Intellectual Property
    While some people take issue with the concept of intellectual property and believe that all content should be free, I don’t count myself among them. In fact, for the most part I consider the anti-copyright fanatics rather juvenile and intellectually immature. By Steve Pavlina.
  • Poynter Online – Copyright Issues and Answers
    An extensive list of resources about copyright. Subject headings include: Definition of Copyright, U.S. Copyright Office, Copyright Websites, Fair Use and more.
  • Podcasting Legal Guide
    The purpose of this guide is to provide you with a general roadmap of some of the legal issues specific to podcasting.
  • Copyright Law: 12 Do’s and Dont’s
    12 Do’s and Dont’s that will clarify what you can and what you can not do as an online publisher.
  • 10 Things Webmasters Should Know About … Copyright
    This short article explains the key points of copyright law – those which should be familiar to every webmaster.
  • Copyright Essentials for Writers – Res Ipsa Loquitor
    This article deals primarily with copyright law; however, in a broader sense it deals with ethical issues every writer should carefully consider. By Holly Jahangiri
  • Copyright: get to know the facts
    If something’s valuable to you then you need to protect it. Copyright exists for that purpose, but don’t assume you have it, says Stephen Nipper.
  • A brief intro to copyright
    This document is here because many people read my original article on copyright myths without knowing very much about what copyright is to begin with. This article is not about to teach you all about copyright, though there are some decent sites out there with lots of details.
  • The Developer’s Guide To Copyright Law
    In this post the basics of copyright law are discussed. This sets the stage for further parts in which software licensing is discussed.

Related Resources

  • EFF: Bloggers
    EFF’s (Electronic Frontier Foundation) goal is to give you a basic roadmap to the legal issues you may confront as a blogger, to let you know you have rights, and to encourage you to blog freely with the knowledge that your legitimate speech is protected.
  • A Fair(y) Use Tale
    Stanford Center for Internet and Society published a documentary film, in which copyright issues are explained.
  • Copyright Committee Website
    Articles related to using copyrighted materials, fair use, getting permission, plagiarism, copyright law, what copyright protects, Public Domain and Copyright Duration.
  • Good Copy Bad Copy
    A documentary about the current state of copyright and culture.
  • Crash Course in Copyright
    An overview of Copyright basics.
  • U.S. Copyright Office
    Website with extensive information about copyright, including copyright basics, current legislation and the copyright law itself.
  • Stanford Copyright & Fair Use
    Copyright and Fair Use in an extensive overview.
  • Using Copyright Notices
    Information fact sheet – exploring website copyright, design ideas, registration advice and specific considerations that apply to website designers.
  • EFF: Legal Guide for Bloggers
    This guide compiles a number of FAQs designed to help you understand your rights and, if necessary, defend your freedom.
  • QuestionCopyright.org
    Promoting public understanding of the history and effects of copyright, and encouraging the development of alternatives to information monopolies. 


Categorized in Internet Privacy

Atheist666 is 38, single, and based in Seattle, Washington. He is going through a wilderness phase in his life, which has made him a drug vendor on hidden online marketplaces. "Right now I'm doing this as I try to steer my life out of its current stasis," says Atheist666 (a Net pseudonym), who studied economics, comparative religion and literature at university. "Maybe someday I'll figure out what I want to do when I grow up." Till then he intends to be part of a growing community of underground entrepreneurs based out of the Deep Web.

Imagine a space where everything is available to you. No authority can dictate what you can or cannot purchase or what information can be shared. A place where there is unlimited freedom. Technology allows such a place to exist. It's called the Deep Web, or Darknet, or Hidden Web. Running beneath the World Wide Web of Facebook and YouTube, the Deep Web is like a vast, dark ocean. The web, in comparison, is like a pond. Every time you sink into the Deep Web, this world fades, and is replaced by one far more terrible and strange. One can deal in drugs, weapons, contract a killer, hire a hacker or meet jihadists here, all completely untraced.

These entrepreneurs claim to be heirs of Austrian economists such as Ludwig von Mises, in the sense that they hold libertarian economic beliefs and deep scepticism of government intervention, especially in the monetary system. For example, most of Deep Web's dealings are in the virtual currency, bitcoin. Unlike conventional currencies, bitcoin's integrity is maintained by the computing power of thousands of users, and not by any bank or government.

But the thrust of the Austrian school of economics lies in the market forces of creative destruction. Although the Deep Web has seen destruction, it is not always due to market forces. Illegal drugs, goods and services are offered by a marketplace platform, which, unlike an inventory-based forum, brings vendors together rather than stocking products. Silk Road, the largest and possibly first of its kind, was one such. The Federal Bureau of Investigation shut down the website, seized its assets, including 26,000 bitcoins, and arrested the alleged owner, 29-year-old Ross Ulbricht, in San Francisco on October 1. Silk Road 2.0 rose in its place, only to announce in February that all bitcoins belonging to users and staff had been stolen in a hack. Another called Project Black Flag closed after its owner fled with the customers' bitcoins. Users of Sheep Marketplace, too, had their funds stolen, in an incident that has not been proven to be an inside job or otherwise. Atlantis Market, a competitor to Silk Road, shut for "security reasons". People say the owners fled with the deposits. In the light of these shake-ups, many are now gravitating towards a new order - fresh marketplaces that they feel are more predictable: websites such as Andromeda, Agora, Outlaw and Hydra.

"The only law here is our law," says the owner of Andromeda (also a pseudonym). He speaks to me using US National Security Agency-proof Bitmessage. Surfing along the site's supposedly safe corridors gives you a strange out-of-government-reach sensation. The products have photos, descriptions and vendor names. "The only restriction is on the sale of child pornography, all else is allowed." The sellers are located all over the world, a large portion of them in the US, Canada and Europe.

But even Andromeda has its limits. There are no plutonium-grade weapons up for sale. Contrary to the belief that Deep Web markets are dark places with no restrictions, these forums seem capable of making moral decisions. HeadOfHydra, the chief of the marketplace Hydra, says, "Everything except child porn and assassinations or any other service that constitutes doing harm to another is allowed on the site." This effectively means a vendor cannot sell just anything.

Despite restrictions, sellers seem a happy lot. Atheist666 is registered as a vendor on Andromeda. He says, "If you think of it as a public safety issue, the benefits are clear: no turf battles, bloodshed, meeting people in dark alleys et cetera." Atheist666 seems a seller of some repute. Andromeda has a reputation-based trading system similar to Amazon's. Atheist666 is especially trusted. Andromeda and Hydra attentively address user concerns. HeadOfHydra talks of dispute resolution, "If you do not receive the package, you can contact the support team. We will resend/refund based on your vendor's policy and your stats, like total sum spent on the market and refund rate."

What makes all this possible is anonymity technology called Tor. Moritz Bartl, founder of Torservers.net, wrote in from a Tor developer meeting in Paris: "Tor is free software developed with the help of researchers from universities. By redirecting your traffic through a network of computers that share their Internet connection with all Tor users, it hides who you are communicating with, and when. Since you are not connecting directly with the destination, it is also useful to circumvent local network restrictions and reach destinations that would otherwise be blocked. It also offers ways to transform outgoing data into something that looks innocent to get around more sophisticated filtering."

According to Bartl, Tor consists of two parts. One is the software (Tor browser) that you can download for free. The other is the Tor network (part of the Deep Web) that can be accessed only through the Tor browser. All sites on the network have the .onion suffix. Since normal search engines cannot operate in this space, Grams, a special engine, lets you find sites selling drugs, guns, stolen credit card numbers, counterfeit cash and fake IDs - sites that previously could only be found by users who knew the exact address.

Bartl goes on to explain his outfit's role. "Torservers.net is a network of non-profit organisations formed by experts who run Tor relays. This means that people who want to help make the Tor network faster and better, but don't have the skills or interest in doing it, can donate, and their donations will be turned into Tor bandwidth available for all users." Torservers.net, he adds, operates a large number of Tor bridges - entry points into the Tor network that are hard to enumerate and are only handed out in small numbers. For all that his organisation does, Bartl knows every action is closely monitored. "The slides on Tor provided by fugitive intelligence contractor Edward Snowden mention an analysis of Torservers.net."

Bartl is not alone in this. Giant targets have been painted on many backs. After the arrest of Silk Road's alleged chief, Ulbricht - known on the site as Dread Pirate Roberts, or DPR - two alleged site moderators were taken into custody last December. DPR's successor, known as DPR 2.0, ran Silk Road 2.0 but imagined the FBI coming for him. Paranoid, he reportedly smashed his computer and went on the run. He resurfaced but resigned after a short while. Defcon has been heading Silk Road 2.0 since then. In February, under him, the site saw all of its bitcoins stolen in a hack. "I am sweating as I write this," Defcon wrote on the site's forum. "I must utter words all too familiar to this scarred community: we have been hacked."

Many site owners would have given up at this point and attempted to join another site, or start a new one. Why bother to pay back millions of dollars when you could just disappear? But Silk Road 2.0 appears to be rebuilding and repaying users' bitcoins. This is a significant development for the Deep Web, which until now had been viewed as a place for those without morals. Defcon wrote on May 27 that 82.09 per cent of all victims of the hack had been fully repaid. "Very nice surprise when I logged in!" a user called uglypapersbox wrote on the site's forum. "Despite having to wait one week short of two months, I got paid back in full. Bitcoins are in my account."




Should you use the Internet for quantitative survey research?

To paraphrase a long-distance carrier's commercials: if you haven't done Internet survey research - you will, because there are some very powerful reasons to consider using the Internet for quantitative survey research.

First, there is the speed with which a questionnaire can be created, distributed to respondents, and the data returned. Since printing, mailing, and data keying delays are eliminated, you can have data in hand within hours of writing a questionnaire. Data are obtained in electronic form, so statistical analysis programs can be programmed to process standard questionnaires and return statistical summaries and charts automatically. 

A second reason to consider Internet surveys is cost. Printing, mailing, keying, and interviewer costs are eliminated, and the incremental costs of each respondent are typically low, so studies with large numbers of respondents can be done at substantial savings compared to mail or telephone surveys. Of course, there are some offsetting costs of preparing and distributing an Internet questionnaire. These costs range widely, according to the type of Internet interviewing used. Figure 1 shows some typical comparative costs of mail, telephone, and Internet (Web) survey research. The cost curves are based on a 5-page questionnaire, with a 35% return rate for mail and a 7-minute duration for telephone interviewing. As the figure shows, the Internet survey is always cheaper by a substantial margin than a telephone survey, is only slightly more expensive than a mail survey for surveys with fewer than about 500 respondents, and becomes increasingly less expensive than mail for more than 500 respondents.
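The economics behind those crossing cost curves can be sketched as a simple fixed-plus-variable cost model. The numbers below are hypothetical placeholders chosen only to mimic the curve shapes the text describes; they are not the figures behind Figure 1.

```python
# Hypothetical fixed and per-respondent costs for each survey mode,
# chosen only to reproduce the qualitative pattern described in the text:
# Web surveys are setup-heavy but cheap per completed response.
MODES = {
    "mail":      {"fixed": 500,  "per_respondent": 3.00},   # printing, postage, keying
    "telephone": {"fixed": 2000, "per_respondent": 12.00},  # interviewer time dominates
    "web":       {"fixed": 1875, "per_respondent": 0.25},   # questionnaire setup dominates
}

def total_cost(n, fixed, per_respondent):
    """Total cost of fielding a survey with n completed responses."""
    return fixed + per_respondent * n

def cheapest_mode(n):
    """The lowest-cost mode at a given sample size."""
    return min(MODES, key=lambda mode: total_cost(n, **MODES[mode]))
```

With these placeholder numbers, mail is cheapest at 100 respondents, the Web is cheapest at 1,000, the crossover falls near 500, and the Web undercuts the telephone at every sample size - the same qualitative pattern the paragraph above attributes to Figure 1.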

Figure 1. Comparative costs of mail, telephone, and Internet (Web) survey research.

An often overlooked benefit of Internet survey research is the ease with which an Internet survey can be quickly modified. For example, early data returns may suggest additional questions that should be asked. Changing or adding questions on-the-fly would be nearly impossible with a mail questionnaire and difficult with a telephone questionnaire, but can be achieved in a matter of minutes with some Internet survey systems. 

Internet questionnaires delivered with the World Wide Web (WWW) have some unique advantages. They can be made visually pleasing with attractive fonts and graphics. The graphical and hypertext features of the WWW can be used to present products for reaction, or to explain service offerings. For respondents with current versions of Netscape or Internet Explorer, the two most popular web browsers, audio and video can be added to the questionnaire. This multimedia ability of Web-delivered questionnaires is unique. 

Appropriate Populations for Internet Survey Research 

Not all populations are candidates for Internet survey research. The general consumer population is often a poor fit, because fewer than 10% of the U.S. households regularly use Internet services (although more are connected, many are infrequent users). There is also a potential problem in the general population with reluctance to use computers, as well as some fear of the intentions of those who use the Internet to ask questions. This fear has been fanned by sensational media accounts of "cyberstalkers" and con artists who prey on Internet users.

However, there are some exceptions to this broad statement. For example, computer products purchasers and users of Internet services are both ideal populations. Both populations are likely to have very high connectivity (100% in the case of Internet services), and neither are likely to have high levels of cyberphobia. Consumers who have purchased products or services using the Internet are not likely to be fearful of Internet surveys. Web-delivered questionnaires can be made part of the purchase transaction (for customer satisfaction studies, for example), with attendant high levels of motivation and participation from the respondents.

Business and professional users of Internet services are also an excellent population to reach with Internet surveys. Over 80% of businesses are currently estimated to have Internet connections, with the number expected to reach 90% by next year. Business users are likely to have experience with the Internet and to recognize its convenience in replying to questionnaires. In business-to-business research, product and service demonstrations are often crucial. Web-delivered questionnaires, with their ability to weave text and audio-visual demonstrations into the questionnaire, are an excellent way to reach a business population.

Internet questionnaires can frequently be used to supplement traditional methods of collecting questionnaire data. The portion of the target population that uses the Internet can be reached cheaply and quickly with Internet questionnaires, while those not connected can be reached by mail or telephone. Supplementing traditional survey methods provides some immediate cost savings, as well as a migration path toward fuller Internet interviewing in the future as the connectivity of the general population increases. 

Internet Samples

Internet samples fall into three categories: unrestricted, screened, and recruited.

In an unrestricted sample, anyone on the Internet who desires may complete the questionnaire. These samples may have poor representativeness due to self-selection of the respondents. The rate of participation (completion rate in traditional survey terms) is generally low. Unrestricted samples do have utility in applications like point-of-sale surveys for Web commerce, web site user profiles, "bingo card"-like customer interest surveys, or recruitment of potential focus group members.

Screened samples adjust for the unrepresentativeness of the self-selected respondents by imposing quotas based on some desired sample characteristics. These are often demographic characteristics such as gender, income, and geographic region, or product-related criteria such as past purchase behavior, job responsibilities, or current product use. The applications for screened samples are generally similar to those for unrestricted samples.

Screened sample questionnaires typically use a branching or skip pattern for asking screening questions to determine whether or not the full questionnaire should be presented to a respondent. Some Web survey systems can make immediate market segment calculations that assign a respondent to a particular segment based on screening questions, then select the appropriate questionnaire to match the respondent’s segment.
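The screening and segment-assignment logic just described can be sketched in a few lines. The question names and segments below are hypothetical, not drawn from any particular survey system:

```python
# Hypothetical screening logic: decide whether a respondent sees the full
# questionnaire at all, and if so which segment-specific version.
def route_respondent(answers):
    """Return the questionnaire to present, or None to screen the respondent out."""
    if not answers.get("uses_product", False):
        return None                              # fails the screen: interview ends here
    if answers.get("purchase_role") == "decision_maker":
        return "decision_maker_questionnaire"    # segment-specific form
    return "end_user_questionnaire"
```

For example, a respondent who answers that they use the product and make purchase decisions would be routed to the decision-maker questionnaire, while a non-user would be screened out immediately.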

Alternatively, some Internet research providers maintain a "panel house" that recruits respondents who fill out a preliminary classification questionnaire. This information is used to classify respondents into demographic segments. Clients specify the desired segments, and the respondents who match the desired demographics are permitted to fill out the questionnaires of all clients who specify that segment. This approach is somewhat less flexible than using tailored screening questions that are unique to the survey being conducted, and also raises questions about the representativeness of respondents who are willing to spend the time to fill out many different questionnaires for different clients.

Recruited samples are used for targeted populations in surveys that require more control over the make-up of the sample. Respondents are recruited by telephone, mail, e-mail, or in person. After qualification, they are sent the questionnaire by e-mail, or are directed to a web site that contains a link to the questionnaire. At web sites, passwords are normally used to restrict access to the questionnaire to the recruited sample members. Since the makeup of the sample is known, completions can be monitored, and follow-up messages can be sent to those who do not complete the questionnaire, in order to improve the participation rate.

Recruited samples are ideal in applications that already have a database from which to recruit the sample. For example, a good application would be a survey that used a customer database to recruit respondents for a purchaser satisfaction study. Another application might be the construction of a consumer panel for tracking research. The convenience of filling out a short Internet survey as compared to a paper diary that must be mailed back should increase the participation rate and the accuracy of the answers. 

Different Methods of Conducting Internet Surveys

E-mail Questionnaires. The questionnaire is prepared like a simple e-mail message, and is sent to a list of known e-mail addresses. The respondent fills in the answers, and e-mails the form plus replies back to the research organization. A computer program is typically used to prepare the questionnaire, the e-mail address list, and to extract the data from the replies.

E-mail questionnaires are simple to construct and fast to distribute. By showing up in the respondent’s e-mailbox, they demand immediate attention.

However, they are generally limited to plain text, although graphics can be sent as e-mail attachments that are decoded separately from the questionnaire text. Many standard questionnaire lay-out techniques, such as creating grids of questions and scale responses, cannot be done in a visually attractive way in e-mail. There is no check for validity of data until the whole questionnaire is returned, so there is virtually no opportunity to request that the respondent reenter bad data. The respondent may damage the questionnaire text in the process of responding, making automatic data extraction impossible and requiring hand coding of damaged responses. In addition, all question skips are carried out by the respondent, who is given a set of instructions embedded in the text ("If you replied ‘yes’ to this question, skip to Question 23"). This can result in illegal skip patterns, which may require more hand recoding, or result in missing data or rejected questionnaires.
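The data-extraction fragility is easy to see in a sketch. Assuming a simple "Qn: answer" reply layout (hypothetical; real e-mail survey software uses its own formats), a parser can only flag damaged or missing answers after the whole form has already come back:

```python
import re

def extract_answers(body, expected):
    """Pull 'Qn: answer' lines out of a returned e-mail questionnaire.

    Returns the answers found plus the list of expected questions that are
    missing or damaged - which, as noted above, can only be detected after
    the respondent has already sent the form back, too late to ask for reentry.
    """
    answers = {}
    for line in body.splitlines():
        m = re.match(r"\s*(Q\d+)\s*:\s*(\S.*)", line)
        if m:
            answers[m.group(1)] = m.group(2).strip()
    missing = [q for q in expected if q not in answers]
    return answers, missing
```

A reply in which the respondent accidentally deleted the colon on one line would come back with that question flagged as missing, requiring hand coding.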

Converted CATI systems. A software translator program takes questionnaires programmed in the CATI vendor’s questionnaire construction language and translates them for distribution over the Web. The web server may be located in the research supplier’s facility, or time may be rented from a service bureau that has the CATI system installed. The web server is linked to a database that receives the respondents’ replies and stores them.

Converted CATI systems have the good sample and quota management typical of CATI programs. They also inherit the ability to set up complex skip patterns for screening and to adapt to respondents’ replies. They can do data verification at the time of entry, and request reentry of illegal data immediately. Converted CATI systems provide quick migration to Internet interviewing for current users of a particular CATI system and permit reuse of existing programmed questionnaires. In some systems, progress of the Internet survey can be monitored while data is being collected, with some intermediate data extracts available for a fee (daily summaries, for example).

On the negative side, the CATI systems on which these Internet survey products are based were designed for a telephone interviewer working from a computer screen. Respondent screen formatting is somewhat limited as a result. In addition, the CATI languages frequently do not take advantage of the Web’s ability to present graphics and audio-visual material. The researcher is locked into a single CATI system provider’s technology, which is only a small disadvantage if the researcher is already using that CATI system, but a larger one if the researcher is not. Finally, the converted CATI systems are expensive to purchase and use.

Converted Disk-By-Mail Systems. These are similar to converted CATI systems. Disk-by-mail systems provide a questionnaire construction tool that creates a program file on a floppy disk that the respondent subsequently runs on a personal computer. The program presents the questions on the computer screen and records the answers on the program floppy disk, which is then mailed back to the research organization. The converted disk-by-mail system adapts the questionnaire for presentation via the Web, and provides a data management program to record the answers provided by the respondents.

Converted disk-by-mail systems have the same skip pattern management and data verification advantages of converted CATI systems, with the addition of more flexible questionnaire construction tools that include graphical and audio/visual material. However, they inherit the limitations on quota management of the disk-by-mail approach, which is designed to present a single questionnaire to a single respondent. They typically require that the user manage his/her own web site and install and maintain the software on that site.

Web CGI programs. In this approach to Internet survey research, each questionnaire is programmed directly in HTML (the presentation language used by the WWW) using a computer script language such as PERL or a programming language such as Visual Basic. The programmed questionnaire is placed on a Web server at the client’s location or on a server located in a service bureau. The program uses the Common Gateway Interface (CGI) of the WWW to place respondents’ replies into a data base. Data base queries can be programmed to give periodic reports of the data to-date, including statistical analyses.

The CGI programming approach is the most flexible of all. Complex question skips and data verification and reentry can be achieved, and programming languages can use the full capability of Web. Since all questionnaires are custom programmed, Web CGI programs are not tied to a proprietary CATI language, or a single technology vendor. Database operations and queries can be programmed to adapt to virtually any special reporting need of the researcher.
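A minimal sketch of what such a CGI handler does is shown below. The field names are hypothetical, and the in-memory list stands in for the database; a production script would read the form data from the CGI environment and write to real storage:

```python
from urllib.parse import parse_qs

# In-memory stand-in for the database a real CGI survey script would use.
responses = []

def handle_submission(query_string):
    """Process one form submission, validating before storing.

    Returns the HTML fragment to send back: either a reentry request
    (immediate data verification, as described above) or a confirmation.
    """
    fields = {k: v[0] for k, v in parse_qs(query_string).items()}
    if fields.get("satisfaction") not in {"1", "2", "3", "4", "5"}:
        return "<p>Please choose a rating from 1 to 5 and resubmit.</p>"
    responses.append(fields)
    return "<p>Thank you - your answers have been recorded.</p>"
```

An out-of-range rating is rejected immediately and the respondent is asked to resubmit, rather than being discovered later during data cleaning - exactly the verification-at-entry advantage the text attributes to programmed Web questionnaires.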

This flexibility comes with a cost, however. Since questionnaires and data base operations are essentially custom computer programs that must be created and debugged by highly-trained programmers, they are expensive. The computer languages contain no special tools for tasks like screening, quota management and question skip pattern management, so programming these features in each questionnaire further increases the cost.

The CGI program must be placed on a web server system to distribute the questionnaires and collect the data. This can be the research client’s web server, or a server provided by the research supplier. If the survey is placed on the client’s web server, time for programming and debugging can be difficult to schedule. Large corporate sites often require several administrative approvals before any modifications of the site can be made, and technical staff are frequently leery of allowing an outside programmer to place a program on their site.

Web Survey Systems. These are software systems specifically designed for Web questionnaire construction and delivery. In essence, they combine the survey administration tools of a CATI system with the flexibility of CGI programming. They consist of an integrated questionnaire designer, web server, data base, and data delivery program, designed for use by non-programmers.

In a typical use, the questionnaire is constructed with an easy-to-use questionnaire editor using a visual interface, then automatically transmitted to the Web server system. The Web server distributes the questionnaire and files responses in a database. The user can query the server at any time via the Web for completion statistics, descriptive statistics on responses, and graphical displays of data. Data can be downloaded from the server at any time for analysis at the researcher’s location. The questionnaire construction and data display programs reside on the user’s computer system, while the Web server is located in a survey technology provider’s office.

Web survey systems include tools that allow non-programmers to create complex questionnaires that are visually appealing. The complexity of skip patterns and data verification that can be achieved approaches that of the CGI programming approach. Users do not have to maintain a Web site or data base, so there is less disruption of clients’ web sites and computing facilities. Sample quota control is as good as that provided by converted CATI systems. In addition, tools to personalize questionnaires with data base information (like inserting the respondent’s name in a questionnaire delivered to a restricted sample respondent) and to add graphics and sound without programming are often included.

Web survey systems typically have a lower cost per completed interview than converted CATI, converted disk-by-mail, or CGI programs, although they are more expensive than e-mail surveys for small surveys (under 500 respondents). The lower cost results from the efficiencies of using software tools designed specifically for Web use, and from the cost-sharing of Internet access costs and hardware costs that a central server system provides.

Like converted CATI, converted disk-by-mail, and CGI programming, Web survey systems use the more passive Web retrieval for questionnaires. E-mail, although it has many limitations, is more immediately attention-demanding. Also, for current users of CATI systems, migration of existing questionnaires to Web survey systems is more difficult than migration to a converted CATI system. Questionnaires must be manually cut and pasted into the Web survey questionnaire constructor. 


Internet survey research is not appropriate for all populations and all projects, but for many applications it provides definite advantages. For populations already using the Internet, or for "early adopter" populations, quantitative survey research on the Internet can give faster results at a lower cost than traditional methods. Internet questionnaires can be used to supplement traditional quantitative data collection methods as a way of reducing the overall cost of a project or as the beginning of a migration to all-Internet surveys in the future.

The kind of Internet survey technology to use for a project depends on the circumstances of the survey. The following grid (Table 1) summarizes the strengths and weaknesses of each.

Table 1

                                          E-Mail       Converted    Converted     Web CGI         Web Survey
                                                       CATI         Disk-By-Mail  Programs        Systems
Ease of creation / modification           Excellent    Fair         Good          Poor            Excellent
Ease of access to preliminary data        Poor         Fair         Good          Excellent       Excellent
Sample quota control                      Poor         Excellent    Fair          Excellent       Excellent
Data validity checks                      Poor         Good         Good          Excellent       Excellent
Demand of respondent's attention          Excellent    Good         Good          Good            Good
Personalization of questionnaires         Fair         Fair         Poor          Excellent       Excellent
Conversion of existing questionnaires     Fair         Excellent    Good          Good            Good
Expertise required by creator             Low          High         Moderate      Very High       Moderate
Cost per completion                       Inexpensive  Expensive    Expensive     Very Expensive  Moderate to Inexpensive




Jan. 1 marks the 30th anniversary of a little-noticed but important milestone in the history of the Internet. It was on this date in 1983 that ARPANET (the Advanced Research Projects Agency Network, the world's first operational packet-switching network and the progenitor of the global Internet) officially switched to the Transmission Control Protocol and Internet Protocol (TCP/IP). While this may not be the best-known advancement in the development of the Internet, it is arguably one of the most significant, since it was this change in protocol that set the course of the Internet that is inexorably interwoven throughout our business and personal lives today.

The Internet has impacted all industries in ways we could not have imagined three decades ago. But nowhere has that impact been felt more than in science research and academic publishing, especially during the last 15 years of transition from hard copy to electronic files and the more recent emergence of networked science.

Since the very early days of the printing press, science has been dependent upon the publishing industry to advance knowledge. When Galileo's Discorsi e dimostrazioni matematiche, intorno a due nuove scienze (The Discourses and Mathematical Demonstrations Relating to Two New Sciences) was published by the House of Elsevier in 1638, it challenged the widely held beliefs of that time about the origins of the universe. Such thinking was held to be bordering on heresy by the religious institutions of the day, but the availability of the written word that could be easily transported and studied by others propelled enlightenment and knowledge.

Collaboration between researchers in different countries, of the kind we take for granted today, would have been unheard of even as late as WWII. But with the end of the Cold War, laboratory walls melted away: a global economy, the rise of the multinational corporation, and intensifying competition created the need to access the best scientific talent in order to build modern economies and address problems that are now global in nature. More than 35 percent of all research papers published today document active international collaboration, a 40 percent increase from 15 years ago and double the share in 1990. China dominates in cross-border collaborations; Japan and the E.U. are second and third.

In the first decade of the nascent Internet, it had little impact outside the (then) narrow computing community, but in 1992 the first digital versions of research papers became available to the science community via The University Licensing Project (TULIP), a cooperative effort between Elsevier and eight U.S. universities (Carnegie Mellon, Cornell, Georgia Tech, University of California, University of Michigan, MIT, Virginia Tech, University of Washington). Now the publishing process no longer required a lengthy typesetting and production timeline to create a journal or paper -- content could be created in bytes and pixels and made available virtually.

Six years later (April 1998) the journal Computer Networks and ISDN Systems published a research paper by two computer scientists, Sergey Brin and Lawrence Page, titled: "The anatomy of a large-scale hypertextual Web search engine." Their resulting Google search engine launched in September of that year and revolutionized the knowledge transfer process.

By 2000, digital versions of more than 11 million research articles and the first e-books had become available, and by the end of the first decade of the new century, international sales growth for digital academic content surpassed hard copy. More than 1.5 million research papers are currently generated by over 200 countries, and e-marketing of such content through social networks is now the norm.

A more significant advancement in the past five years has been the emergence of "networked science" -- the concept that scientific content cannot, and should not, exist in a vacuum. Articles by different authors are now linked to banks of data sets, reference books, videos, presentations and audio tracks. Scientists and engineers representing a wide variety of cross-disciplines can debate research findings in online forums, and society will ultimately benefit from the resulting scientific discourse that will open up limitless new avenues for search and discovery.

Today, it is estimated that we create 2.5 quintillion bytes of data every day, much of which (90 percent) has been created in the last two years alone, according to IBM. The data comes from everywhere: satellites and sensors, social media, digital pictures and videos, transaction records, and cellphone GPS signals, to name a few. This massive volume of information has given rise to the term Big Data and forms the basis of the New Research Economy: global spend on R&D reached $1 trillion in 2012, an increase of 45 percent since 2002.

As with any advancement, the assets provided by the Internet come with their own set of liabilities, and they are legion. Most notable are increases in plagiarism and piracy of intellectual property, the debate over Open Access, and the question of how we manage and vet Big Data. Internet search engines can provide researchers with inexhaustible sources of information, but they cannot determine whether the content can be trusted. The peer-review process, which is at the very core of scientific publishing, still works, and it may never be more crucial than it is right now.

The emerging economies in China, India and Brazil, intensifying global competition as well as the need for the very best and most trusted scientific research to address the cross-border problems the world now faces, will continue to fuel the new research economy. The resulting mass of Big Data will grow exponentially. Science and the publishing industry will need each other even more so to help manage it. 

Written By:  Olivier Dumon




As an entrepreneur, you already know the Internet offers a wealth of information to assist you with business and strategic planning. But do you ever have that "needle-in-a-haystack" feeling when trying to locate a critical piece of information?

Despite powerful search engines like Google and Yahoo, it can be difficult to sort through the wealth of information available for the golden nugget you need. Plus, much of the good business information is hidden in "the invisible Web"—the 80 percent of the Internet not accessible to popular search engines. Often the really great information is under "cyber lock and key" and available only to large companies with budgets to pay for subscription databases.

The good news is there are free and low-cost ways to access business information online - if you know where and how to look. Following are some valuable Web sites to visit the next time you need information for your business or strategic planning.


Demographic Information

CensusScope - Knowing the attributes of your buyers and their community can be critical during planning. Unfortunately, the official U.S. Census Bureau's site can be overwhelming. CensusScope takes census data and makes it easy. Click on the Maps tab and select a state. In the lower-left corner, choose a county and then the type of information you want. A chart will appear; now you can right-click and copy/paste directly into your own document or plan. Underlying data can also be copied.

Company Information

Manta - Locating information on private companies can be challenging. You can look at a company's Web site, but remember you're only going to see what the company wants you to see. Manta leverages the Dun and Bradstreet database to feature information on more than 45 million companies. Free registration is required. Type in the name of a company and learn things like revenue and employee figures, industry data, and contact information. You can also search for companies by geography or industry.

Industry Information

Alacra Wiki - The Alacra Wiki features a Spotlight section where site users contribute information and resources specific to a particular industry. To visit the Spotlight, click the Alacra Spotlights link on the left-side navigation, then choose an industry. Each Spotlight gives a description of the industry with direct links to information resources where you can learn everything from industry financials to trends and issues.

Inside Information

Technorati - A blog is a Web site or online diary written by an individual (the blogger) about a topic of interest. Some blogs are filled with industry market data, and you can access them by using Technorati's Blog Directory. Type a broad search term in the search box (e.g., pharmaceutical), and in the pull-down menu next to the Search button, choose "in blog directory." Your results will contain blogs that feature industry news, commentary, and links to other industry resource sites.

Industry Market Research

MarketResearch.com - MarketResearch.com features thousands of market research reports. Use the search engine to locate a relevant report, and purchase it if it is critical to your business. If money is tight, however, write down the report's name and the publisher's name. Then go to Google and type the name of the report surrounded by quotation marks (so it is treated as a phrase), followed by the publisher's name. In the search results, you'll most likely find other sites trying to sell you the same report. Sometimes, however, you'll find an article that contains the "meat" of an expensive research report, and often key statistics and data are all you need for planning.

One Stop Biz Info Shop

BizToolkit - BizToolkit is a free program of the non-profit James J. Hill Reference Library, the nation's premier practical business information organization. BizToolkit features direct links to the best Web sites as they relate to planning, marketing, managing, and growing a business. The Marketing area in particular features excellent links to helpful market research sites, including the Special Issues Index under "Research and Industry," where you can order reports from the Hill Library's expansive trade journal collection. Also note the Biz Site Recommender in the left-side navigation, featuring direct links to the best of the "Invisible Business Web." For $7.95, you can upgrade to a BizToolkit Premium membership, which features additional resources including expert, live help (just click a button and let a Hill Expert search for you) and the BizRewards program.

Power Research

HillSearch - HillSearch is considered the most powerful business research engine available to individuals. Use the OneSearch tool to instantly search the "open Web" plus virtually every company in North America, and every key newspaper, magazine, and industry trade journal. Or use the Custom Search area for specific databases to meet your research needs, including industry metrics and market research reports. HillSearch is just $59 per month for access to the same types of expert research tools that big companies have. You can get a HillSearch trial. HillResearch is the Hill Library's professional research service; for $100 per hour, the experts do all of the work for you. You can get a complimentary reference interview and discuss your project with an expert by e-mail.

You can access expert market research information, without having to be a market research expert. You just have to know where to look, and what free and low-cost resources are available.

In today's globally competitive environment, those with the right information win. If you make use of resources like the ones highlighted in this article, you will not only save a tremendous amount of time versus searching with just popular search engines, you'll also access relevant and credible data that can make a big difference in ensuring that your plans succeed.

© 2007 Sam Richter. All rights reserved.



Introduction to How Internet Search Engines Work

The good news about the Internet and its most visible component, the World Wide Web, is that there are hundreds of millions of pages available, waiting to present information on an amazing variety of topics. The bad news about the Internet is that there are hundreds of millions of pages available, most of them titled according to the whim of their author, almost all of them sitting on servers with cryptic names. When you need to know about a particular subject, how do you know which pages to read? If you're like most people, you visit an Internet search engine.

Internet search engines are special sites on the Web that are designed to help people find information stored on other sites. There are differences in the ways various search engines work, but they all perform three basic tasks:
They search the Internet -- or select pieces of the Internet -- based on important words.
They keep an index of the words they find, and where they find them.
They allow users to look for words or combinations of words found in that index.
Early search engines held an index of a few hundred thousand pages and documents, and received maybe one or two thousand inquiries each day. Today, a top search engine will index hundreds of millions of pages, and respond to tens of millions of queries per day. In this article, we'll tell you how these major tasks are performed, and how Internet search engines put the pieces together in order to let you find the information you need on the Web.

Web Crawling

When most people talk about Internet search engines, they really mean World Wide Web search engines. Before the Web became the most visible part of the Internet, there were already search engines in place to help people find information on the Net. Programs with names like "gopher" and "Archie" kept indexes of files stored on servers connected to the Internet, and dramatically reduced the amount of time required to find programs and documents. In the early 1990s, getting serious value from the Internet meant knowing how to use gopher, Archie, Veronica and the rest.

Today, most Internet users limit their searches to the Web, so we'll limit this article to search engines that focus on the contents of Web pages.

Before a search engine can tell you where a file or document is, it must be found. To find information on the hundreds of millions of Web pages that exist, a search engine employs special software robots, called spiders, to build lists of the words found on Web sites. When a spider is building its lists, the process is called Web crawling. (There are some disadvantages to calling part of the Internet the World Wide Web -- a large set of arachnid-centric names for tools is one of them.) In order to build and maintain a useful list of words, a search engine's spiders have to look at a lot of pages.

How does any spider start its travels over the Web? The usual starting points are lists of heavily used servers and very popular pages. The spider will begin with a popular site, indexing the words on its pages and following every link found within the site. In this way, the spidering system quickly begins to travel, spreading out across the most widely used portions of the Web.
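The traversal described above is essentially a breadth-first walk of the link graph. As a rough sketch (not any real engine's implementation), here is a toy crawler over an in-memory "web"; every page name and word list below is invented for illustration:

```python
from collections import deque

# A toy "web": each page maps to (words on the page, outgoing links).
# All page names and contents are made up for illustration.
PAGES = {
    "popular-site.example/home": (["welcome", "news", "search"],
                                  ["popular-site.example/about", "other.example/index"]),
    "popular-site.example/about": (["about", "history"], []),
    "other.example/index": (["news", "archive"], ["popular-site.example/home"]),
}

def crawl(seed):
    """Breadth-first crawl: index each page's words, then follow its links."""
    index = {}                       # word -> set of pages containing it
    seen, queue = {seed}, deque([seed])
    while queue:
        page = queue.popleft()
        words, links = PAGES.get(page, ([], []))
        for word in words:
            index.setdefault(word, set()).add(page)
        for link in links:           # spread out across linked pages
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index
```

Starting from the "popular" seed page, the crawler reaches every linked page exactly once and records which pages each word appears on.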

Google began as an academic search engine. In the paper that describes how the system was built, Sergey Brin and Lawrence Page give an example of how quickly their spiders can work. They built their initial system to use multiple spiders, usually three at one time. Each spider could keep about 300 connections to Web pages open at a time. At its peak performance, using four spiders, their system could crawl over 100 pages per second, generating around 600 kilobytes of data each second.

Keeping everything running quickly meant building a system to feed necessary information to the spiders. The early Google system had a server dedicated to providing URLs to the spiders. Rather than depending on an Internet service provider for the domain name server (DNS) that translates a server's name into an address, Google had its own DNS, in order to keep delays to a minimum.

When the Google spider looked at an HTML page, it took note of two things:
The words within the page
Where the words were found

Words occurring in the title, subtitles, meta tags and other positions of relative importance were noted for special consideration during a subsequent user search. The Google spider was built to index every significant word on a page, leaving out the articles "a," "an" and "the." Other spiders take different approaches.
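The idea of indexing every significant word while noting where it was found can be sketched in a few lines. This is only an illustration of the stop-word approach described above, not Google's actual code:

```python
STOP_WORDS = {"a", "an", "the"}   # the articles the Google spider left out

def significant_words(text):
    """Return (word, position) pairs, skipping the articles 'a', 'an', 'the'.

    Recording each word's position lets a later ranking step give extra
    weight to words that appear early in the page.
    """
    words = text.lower().split()
    return [(w, i) for i, w in enumerate(words) if w not in STOP_WORDS]
```

Other spiders would simply use a larger (or empty) stop-word set, which is the main difference between the Lycos- and AltaVista-style approaches described next.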

These different approaches usually attempt to make the spider operate faster, allow users to search more efficiently, or both. For example, some spiders will keep track of the words in the title, sub-headings and links, along with the 100 most frequently used words on the page and each word in the first 20 lines of text. Lycos is said to use this approach to spidering the Web.

Other systems, such as AltaVista, go in the other direction, indexing every single word on a page, including "a," "an," "the" and other "insignificant" words. The push to completeness in this approach is matched by other systems in the attention given to the unseen portion of the Web page, the meta tags. Learn more about meta tags on the next page.

Meta Tags

Meta tags allow the owner of a page to specify key words and concepts under which the page will be indexed. This can be helpful, especially in cases in which the words on the page might have double or triple meanings -- the meta tags can guide the search engine in choosing which of the several possible meanings for these words is correct. There is, however, a danger in over-reliance on meta tags, because a careless or unscrupulous page owner might add meta tags that fit very popular topics but have nothing to do with the actual contents of the page. To protect against this, spiders will correlate meta tags with page content, rejecting the meta tags that don't match the words on the page.

All of this assumes that the owner of a page actually wants it to be included in the results of a search engine's activities. Many times, the page's owner doesn't want it showing up on a major search engine, or doesn't want the activity of a spider accessing the page. Consider, for example, a game that builds new, active pages each time sections of the page are displayed or new links are followed. If a Web spider accesses one of these pages, and begins following all of the links for new pages, the game could mistake the activity for a high-speed human player and spin out of control. To avoid situations like this, the robot exclusion protocol was developed. This protocol, implemented in the meta-tag section at the beginning of a Web page, tells a spider to leave the page alone -- to neither index the words on the page nor try to follow its links.
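A spider honoring the meta-tag form of the robots exclusion protocol just has to look for a `robots` meta tag before indexing. Here is a minimal check using Python's standard `html.parser`; the function names are my own, and real crawlers also consult the site-wide robots.txt file:

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Detect <meta name="robots" content="noindex,nofollow"> directives."""

    def __init__(self):
        super().__init__()
        self.noindex = False
        self.nofollow = False

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name", "").lower() == "robots":
            directives = {d.strip() for d in a.get("content", "").lower().split(",")}
            self.noindex = self.noindex or "noindex" in directives
            self.nofollow = self.nofollow or "nofollow" in directives

def spider_may_index(html):
    """True unless the page's meta tags ask spiders not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return not parser.noindex
```

A well-behaved spider would check `spider_may_index` before adding the page's words to its lists, and the `nofollow` flag before queueing the page's links.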

Building the Index

Once the spiders have completed the task of finding information on Web pages (and we should note that this is a task that is never actually completed -- the constantly changing nature of the Web means that the spiders are always crawling), the search engine must store the information in a way that makes it useful. There are two key components involved in making the gathered data accessible to users:

The information stored with the data

The method by which the information is indexed

In the simplest case, a search engine could just store the word and the URL where it was found. In reality, this would make for an engine of limited use, since there would be no way of telling whether the word was used in an important or a trivial way on the page, whether the word was used once or many times or whether the page contained links to other pages containing the word. In other words, there would be no way of building the ranking list that tries to present the most useful pages at the top of the list of search results.

To make for more useful results, most search engines store more than just the word and URL. An engine might store the number of times that the word appears on a page. The engine might assign a weight to each entry, with increasing values assigned to words as they appear near the top of the document, in sub-headings, in links, in the meta tags or in the title of the page. Each commercial search engine has a different formula for assigning weight to the words in its index. This is one of the reasons that a search for the same word on different search engines will produce different lists, with the pages presented in different orders.
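A weighted inverted index of this kind can be sketched as follows. The weight values are invented for illustration; as the text notes, each commercial engine keeps its own formula:

```python
# Hypothetical weights for where a word appears; real engines' formulas are secret.
POSITION_WEIGHTS = {"title": 5, "heading": 3, "link": 2, "body": 1}

def index_page(index, url, sections):
    """Add one page to an inverted index, weighting words by where they occur.

    `sections` maps a section name ('title', 'body', ...) to its text.
    `index` maps word -> {url: accumulated weight}, so a word used many
    times, or in important positions, accumulates a higher score.
    """
    for section, text in sections.items():
        weight = POSITION_WEIGHTS.get(section, 1)
        for word in text.lower().split():
            index.setdefault(word, {}).setdefault(url, 0)
            index[word][url] += weight
```

Two engines with different `POSITION_WEIGHTS` tables would rank the same pages differently, which is exactly why the same query returns different orderings on different engines.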

Regardless of the precise combination of additional pieces of information stored by a search engine, the data will be encoded to save storage space. For example, the original Google paper describes using 2 bytes, of 8 bits each, to store information on weighting -- whether the word was capitalized, its font size, position, and other information to help in ranking the hit. Each factor might take up 2 or 3 bits within the 2-byte (16-bit) grouping. As a result, a great deal of information can be stored in a very compact form. After the information is compacted, it's ready for indexing.

An index has a single purpose: It allows information to be found as quickly as possible. There are quite a few ways for an index to be built, but one of the most effective ways is to build a hash table. In hashing, a formula is applied to attach a numerical value to each word. The formula is designed to evenly distribute the entries across a predetermined number of divisions. This numerical distribution is different from the distribution of words across the alphabet, and that is the key to a hash table's effectiveness.

In English, there are some letters that begin many words, while others begin fewer. You'll find, for example, that the "M" section of the dictionary is much thicker than the "X" section. This inequity means that finding a word beginning with a very "popular" letter could take much longer than finding a word that begins with a less popular one. Hashing evens out the difference, and reduces the average time it takes to find an entry. It also separates the index from the actual entry. The hash table contains the hashed number along with a pointer to the actual data, which can be sorted in whichever way allows it to be stored most efficiently. The combination of efficient indexing and effective storage makes it possible to get results quickly, even when the user creates a complicated search.
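The evening-out effect can be seen with even a toy hash function. This polynomial hash is a common textbook choice, shown only for illustration; production hash tables use stronger functions:

```python
def bucket(word, n_buckets=8):
    """Map a word to one of n_buckets via a simple polynomial hash.

    Unlike alphabetical filing, the bucket depends on every character,
    so words starting with 'm' spread across buckets instead of piling
    into one thick 'M' section.
    """
    h = 0
    for ch in word:
        h = (h * 31 + ord(ch)) % 2**32
    return h % n_buckets
```

The bucket number is the index entry; alongside it the table stores a pointer to the actual data, which can live wherever storage is most efficient.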

Building a Search

Searching through an index involves a user building a query and submitting it through the search engine. The query can be quite simple, a single word at minimum. Building a more complex query requires the use of Boolean operators that allow you to refine and extend the terms of the search.
The Boolean operators most often seen are:
AND - All the terms joined by "AND" must appear in the pages or documents. Some search engines substitute the operator "+" for the word AND.
OR - At least one of the terms joined by "OR" must appear in the pages or documents.
NOT - The term or terms following "NOT" must not appear in the pages or documents. Some search engines substitute the operator "-" for the word NOT.
FOLLOWED BY - One of the terms must be directly followed by the other.
NEAR - One of the terms must be within a specified number of words of the other.
Quotation Marks - The words between the quotation marks are treated as a phrase, and that phrase must be found within the document or file.
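Against an inverted index, the AND, OR and NOT operators above map directly onto set intersection, union and difference. A minimal sketch, with a made-up three-document index:

```python
# A tiny inverted index: word -> set of documents containing it (invented data).
INDEX = {
    "fish":   {"doc1", "doc3"},
    "bed":    {"doc2", "doc3"},
    "flower": {"doc2"},
}

def search(term, *, and_=None, or_=None, not_=None):
    """Evaluate a one-operator Boolean query against the inverted index."""
    results = set(INDEX.get(term, set()))
    if and_:
        results &= INDEX.get(and_, set())   # AND: intersection
    if or_:
        results |= INDEX.get(or_, set())    # OR: union
    if not_:
        results -= INDEX.get(not_, set())   # NOT: difference
    return results
```

FOLLOWED BY, NEAR and phrase search cannot be answered from document sets alone; they need the word-position information the index stored earlier.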

Future Search

The searches defined by Boolean operators are literal searches -- the engine looks for the words or phrases exactly as they are entered. This can be a problem when the entered words have multiple meanings. "Bed," for example, can be a place to sleep, a place where flowers are planted, the storage space of a truck or a place where fish lay their eggs. If you're interested in only one of these meanings, you might not want to see pages featuring all of the others. You can build a literal search that tries to eliminate unwanted meanings, but it's nice if the search engine itself can help out.

One of the areas of search engine research is concept-based searching. Some of this research involves using statistical analysis on pages containing the words or phrases you search for, in order to find other pages you might be interested in. Obviously, the information stored about each page is greater for a concept-based search engine, and far more processing is required for each search. Still, many groups are working to improve both results and performance of this type of search engine. Others have moved on to another area of research, called natural-language queries.

The idea behind natural-language queries is that you can type a question in the same way you would ask it to a human sitting beside you -- no need to keep track of Boolean operators or complex query structures. The most popular natural language query site today is AskJeeves.com, which parses the query for keywords that it then applies to the index of sites it has built. It only works with simple queries; but competition is heavy to develop a natural-language query engine that can accept a query of great complexity.

Written By: Curt Franklin



