Shadows on the Net
Luís Ángel Fernández Hermana - @luisangelfh
20 February, 2018
Original publication date: 20 July, 1999
Nobody sees what you know, but everyone can see what you do
Habits and usage in the connected world are changing at great speed. Almost as fast, our difficulty grows in understanding what is going on on the Internet and in learning to move around it with some kind of personal purpose in mind. These are some of the conclusions of a study conducted by engineers at the NEC Research Institute in Princeton, USA, the second of its kind published in the last two years. Authors Steve Lawrence and C. Lee Giles focussed their attention on the big directories to find out “how much web” there was in them and what we get out of our searches. The results are rather disappointing. No search engine indexes more than 16% of the pages on the WWW. And there are quite a lot of them: about 800 million, to be precise. And as this figure increases, coverage decreases. Last year the best search engine covered just a third of the Web. Now, it seems, the more portals there are, the fewer places they take us to. This study, called “Accessibility of Information on the Web”, took as its point of reference the 16 main search engines on the Internet, from Northern Light and Alta Vista to Yahoo! and Excite, as well as HotBot, Infoseek, Lycos, Google and Snap.
The first report by Lawrence and Giles was published in the US scientific journal Science on 3 April 1998, the second in the British scientific journal Nature on 8 July 1999. In that time, the Web had grown by 300 million pages distributed across 3 million servers (servers, not sites, as some media erroneously reported). In December 1997, these researchers calculated that the Web stored 320 million indexable pages (excluding those that require passwords or form-filling for access, or that use the robots exclusion standard), out of an estimated total of 500 million. Indexable pages have now reached 800 million. Nevertheless, the six most popular search engines put together –Alta Vista, Excite, HotBot, Infoseek, Lycos and Northern Light– only just cover 60% of the Web.
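Estimating the size of something nobody can see whole is itself an interesting problem. Studies of this kind commonly infer the total from the overlap between pairs of engines: if two engines sample the Web roughly independently, the fraction of one's results also found by the other tells you how big the whole must be (a capture-recapture calculation). A minimal sketch, with purely illustrative numbers rather than the study's actual data:

```python
def lincoln_petersen(n1, n2, overlap):
    """Estimate a total population from two independent 'captures'.

    n1, n2: pages returned by engine 1 and engine 2 for a set of queries;
    overlap: pages found by both. Assumes the two engines sample
    the Web independently and uniformly (a strong simplification).
    """
    if overlap == 0:
        raise ValueError("no overlap: the estimate is unbounded")
    return n1 * n2 / overlap

# Illustrative (made-up) figures for two engines' query results.
estimated_total = lincoln_petersen(n1=128_000, n2=100_000, overlap=16_000)
print(f"Estimated indexable pages: {estimated_total:,.0f}")  # 800,000
```

The independence assumption is exactly what the engines' shared biases undermine: if both favour heavily-linked US sites, the overlap is inflated and the Web looks smaller than it is.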
The authors of the report think that the Web is growing too quickly and that these directories don’t have time to incorporate the flood of new pages. As a result, they are forced to take drastic decisions which, from their point of view, are “logical”. In the first place, they favour the places that receive the most traffic and, secondly, among these they mostly choose those based in the US. The rest of cyberspace lies in the penumbra. The contradictions of this situation are further highlighted as the profile of the activities predominating on the Net emerges. Social use of the Net is increasing all the time. People use directories to find areas of interest and, when the need arises, to decide what to buy, plan holidays, find the best medical treatment or even decide which way to vote. Even scientists dive into the Web to define the content and scope of their research.
Nevertheless, the big search engines offer only 16% of all the possibilities and opportunities available on the Net. Lawrence maintains that it is very unlikely that the Internet will continue to rush on ahead of the machines trying to take a census of it. More powerful search engines, along with automatic agents guided by artificial intelligence and other systems of this kind, will bring indexed content closer and closer to real content. However, by 2001, the present frontier set by the digital prophets, the Internet population is expected to have grown to 700 million. At a conservative estimate of two pages per person (not that each will necessarily create the pages, but that their mere presence will stimulate the productive capacity of the Net, as is already the case now), that means another 1,400 million pages on top of what we have now. The technicians are going to have their work cut out for them.
The difficulty of indexing all of them, or even a significant proportion, only accentuates the search engines’ “bias”. In the first place, they concentrate on following links to find new pages; in the second, they depend on pages registered by users. This creates a pernicious Darwinian kind of environment driven not by mutation and the survival of the fittest, but by traffic and the number of links exchanged. Added to this is the “popularity” factor used to rank relevant pages. It increases the visibility of already-popular pages, apparently condemning the others (millions of them) to a kind of limbo since, whatever their quality, they don’t fulfil the directories’ fixed parameters.
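The rich-get-richer mechanism described here is easy to make concrete. A minimal sketch, with an entirely hypothetical link graph, of the kind of popularity ranking at issue: pages are ordered by how many inbound links they receive, so a page nobody links to is invisible regardless of its content.

```python
from collections import Counter

# A toy link graph: page -> pages it links to (all names invented).
links = {
    "portal.example":  ["shop.example", "news.example"],
    "news.example":    ["portal.example", "shop.example"],
    "shop.example":    ["portal.example"],
    "obscure.example": ["portal.example"],  # links out, but nobody links in
}

# "Popularity" ranking: count inbound links, the bias described above.
inbound = Counter(target for targets in links.values() for target in targets)
ranking = sorted(links, key=lambda page: inbound[page], reverse=True)
print(ranking)  # obscure.example sinks to the bottom despite its content
```

A crawler that discovers pages by following links compounds the effect: `obscure.example` not only ranks last, it may never be found at all.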
The problems posed by these “search prejudices” are predictable. At present, 83% of servers contain information of a commercial nature, such as company web pages, local administrations, etc. How will we know they exist? How will we get to them? Lagging far behind these, but no less interesting, comes scientific and educational content (6% of servers), much of which can only be found on the Web and is not even available in traditional databases. Then come pages on health, personal pages, citizen or community networks and pornography (the latter making up less than 1.8% of the total). If the ranking based on popularity prevails, the accessibility of the Web might become just a good slogan, one far from reflecting the truth. “This might retard or even impede mass visibility of new high quality information”, the researchers conclude.
But, as the indomitable pirate Blackbeard used to say before his ship went down, surrounded by enemy vessels, “All is not lost!” The NEC study refers only to what the 16 best-known search engines index. What is needed now is a study of how internauts actually get their information, which might not correspond to what those search engines offer. A superficial glance around the Web shows that more and more systems include their own search engines, as is the case, for example, with en.red.ando. These robots index their own information, content which, for the above-mentioned reasons, doesn’t normally appear when the best-known search engines are used. And despite all this, the Black Corsair, I mean the internaut, manages somehow to get there. In my opinion, it is this part of the Net which we need to perfect and enrich. Specialised search engines, capable of giving content-rich answers about the material they index rather than just lists of addresses, and possibly linked to one another through areas of interest, offer a viable and rational way out of the present morass. After all, one doesn’t need access to all the information on the Net but, basically, to what we are looking for or –and this is always a bit more complicated– to what would interest us although we didn’t know it beforehand. This is a whole new area open to investigation of which we have, as yet, heard very little.
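The site-local search the paragraph describes rests on a simple structure, the inverted index: a map from each word to the set of pages containing it. A minimal sketch, assuming invented page names and text (not en.red.ando's actual content or system):

```python
from collections import defaultdict

# A site's own pages (hypothetical titles and text).
pages = {
    "editorial-99": "search engines index a shrinking fraction of the web",
    "editorial-87": "community networks weave relationships on the net",
    "editorial-74": "portals obscure the dynamic of the net",
}

# Build an inverted index: word -> set of pages containing it.
index = defaultdict(set)
for name, text in pages.items():
    for word in text.lower().split():
        index[word].add(name)

def search(*words):
    """Return the pages containing every query word (an AND query)."""
    results = [index.get(w.lower(), set()) for w in words]
    return set.intersection(*results) if results else set()

print(sorted(search("the", "net")))  # ['editorial-74', 'editorial-87']
```

Because the site controls its own index, every one of its pages is findable, whatever the big directories' popularity parameters ignore; federating such indexes by area of interest is the "linked to one another" idea the article floats.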
From these pages we have often insisted that, when one talks about the Internet, one has to define as precisely as possible exactly which Internet one is referring to: the Internet reflected by the media, or that of the quick killings on the stock market, or that of the millions of pages where relationships of all kinds are interwoven and which are, perhaps, never indexed by the big search engines. The question, as always, is who loses out. The truth is that the growing tendency to describe the Internet in fashionable terms such as portals, megaportals or mini-portals, “traffic distributing webs”, “pass-through” or “destination webs”, “impact increasing/amplifying webs”, and so on, only obscures the dynamic of the Net and the complexity of this world, built upon the activities of millions of internauts and not just by a handful of companies trapped in the quicksands of the Stock Exchange.
Translation: Bridget King