The Problem with Repositories on the Internet

There are sites on the Internet where a large number of resources have been collected in one place. Sometimes these are plain HTML pages; sometimes they are items of a certain type, such as music files or software packages. The files in these repositories are multiplying as more bandwidth becomes available and, as an immediate consequence, we become more wired. Every piece of information around us is being transferred to a digital medium, and these bytes are often plugged into cyberspace. Such repositories, and the files they contain, already exist in abundance.

So what is the problem with this; isn't an abundance of information a desirable thing? It is, of course: that is what has made the WWW such an area of interest. The HTTP protocol and the HTML language (and its successors) gave users a good way to author hypertext documents, and thus a navigable and searchable body of worldwide information emerged. However, as many researchers have reported, the browsing paradigm makes some tasks more difficult, among them searching with detailed queries such as "Who is the daughter of Jay Random Student's advisor?" One can of course devise more meaningful queries. Another thing that is very difficult to find is a comprehensive index on a specific topic. These tasks are hard because most of the content is in HTML or plain text, which are by design intended to be interpreted by people, not computers. As a result, computers cannot construct comprehensive indices, because they do not understand what they read, and the search methods they can provide are rather limited. Neither can people: there is only a limited number of people interested in maintaining indices, and collective index maintenance evidently has its limits.

With our current approach to these problems, every search engine and resource index is destined to be inaccurate, limited, and out of date.

The problem with a single repository site is more specific. Such repositories are like private collections; they are not distributed arbitrarily over the Net, so they are easier to handle. Your home directory on a UNIX machine is such a repository. You probably have a directory called "doc" or "documents" containing text, HTML, PDF, and other documents. Unfortunately, even a private collection can be arduous to manage. The knowledge needed to structure and classify your files can become a burden in itself. In fact, if you have a lot of files, organizing them into directories and symbolic links can be so difficult that it becomes infeasible to manage them at all. Their structure becomes chaotic, and finding a specific document grows harder. It also becomes impossible to make overall sense of what the collection contains: it has no structure. Entropy has taken over.

There are two solutions to such a repository collapse: either keep the repository small enough to manage, or find a more efficient and sensible method of structuring it. In this work we explore the second solution.

One obvious solution would be to make computers as smart as humans and tell them to organize the web continuously and efficiently. We cannot do that now, but we can crudely approximate it in a number of ways. First, we can let computers find similarities in meaning and build hierarchies from these similarities. This can be termed automatic categorization: the computer breaks down data by clustering similarity measurements, which is the subject of Data Mining and Machine Learning. Systems that work this way already exist; for instance, the CORA CS Research Paper Search Engine automatically categorizes papers according to their field. Another method is to describe the subject of a resource in a computer language. In a simple sense, this amounts to tagging documents: you might say that this text document is of category "project-description-page", for instance. That alone would be insufficient, but the greater problem is just what a claimed category is, and how it stands in relation to other categories. How could we possibly communicate the content of an idea or object to a computer? That requires Knowledge Representation, a subfield of AI; researchers of Ontology in AI, Cognitive Science, and Linguistics are interested in the representation and use of categories. This project aims to convey how categories can help us organize information in Net repositories.
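To make the tagging idea concrete, the following is a minimal sketch, not taken from any existing system, of how making category relations explicit lets a program answer questions that flat tags cannot. The taxonomy contents and the is-a relation below are illustrative assumptions.

```python
# Each category maps to its direct parent (a simple is-a relation).
# The categories here are hypothetical examples.
IS_A = {
    "project-description-page": "technical-document",
    "technical-document": "document",
    "book-review": "document",
    "document": "resource",
}

def ancestors(category):
    """Walk up the is-a chain, yielding every broader category."""
    while category in IS_A:
        category = IS_A[category]
        yield category

def subsumes(broad, narrow):
    """True if `broad` is `narrow` itself or one of its broader categories."""
    return broad == narrow or broad in ancestors(narrow)

# A page tagged "project-description-page" is thereby also a "document":
print(subsumes("document", "project-description-page"))   # True
print(subsumes("book-review", "project-description-page"))  # False
```

With only a flat tag, a query for "documents" would miss the project description page; the explicit relation between categories is what makes the broader query answerable.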

A software repository like freshmeat can contain many more items than your bookmarks file, each with a description. A collection of book reviews from around the globe could have many thousands of entries. With a comprehensive organization, users can browse the repository in a meaningful and efficient manner. If instead the repository relies on hand-edited indices, or has no indices at all, users have to scan through each entry or fall back on a search engine, with the limits previously mentioned. To solve this, we apply the tools of an area of research to a particular problem on the Internet: an ontology language allows the site maintainers (or, with proper software, the users as well) to specify the conceptual structure in a concise and abstract manner, from which intuitive user interfaces can be generated automatically.
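The following sketch, under assumed data, illustrates how such a conceptual structure can drive an automatically generated browsing interface: each entry is filed under its own category and under every broader one, so a user can start from a general category and narrow down. The entry names and the taxonomy are hypothetical, not drawn from freshmeat.

```python
from collections import defaultdict

# Hypothetical is-a relations for a small software taxonomy.
IS_A = {
    "audio-player": "multimedia",
    "video-editor": "multimedia",
    "multimedia": "software",
    "compiler": "development",
    "development": "software",
}

# Hypothetical repository entries: (name, most specific category).
ENTRIES = [
    ("xmms", "audio-player"),
    ("cinelerra", "video-editor"),
    ("gcc", "compiler"),
]

def lineage(cat):
    """The category itself plus all broader categories, most specific first."""
    chain = [cat]
    while cat in IS_A:
        cat = IS_A[cat]
        chain.append(cat)
    return chain

# Build a browsable index: every entry appears under its whole lineage.
index = defaultdict(list)
for name, cat in ENTRIES:
    for c in lineage(cat):
        index[c].append(name)

print(sorted(index["software"]))    # every entry in the repository
print(sorted(index["multimedia"]))  # only the multimedia entries
```

The point of the sketch is that the maintainer only states each entry's most specific category and the relations between categories; the index, and hence the navigation interface, falls out mechanically.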