Technical Reports Published in 2009




BU-CE-0901: PDF

Voting Features based Classifier with Feature Construction and its Application to Predicting Financial Distress

H. Altay Güvenir and Murat Çakır

Voting Features based Classifiers, VFC for short, have been shown to perform well on most real-world data sets. They are robust to irrelevant features and missing feature values. In this paper, we introduce an extension to VFC, called Voting Features based Classifier with feature Construction, VFCC for short, and show its application to the problem of predicting whether a bank will encounter financial distress by analyzing its current financial statements. The previously developed VFC learns a set of rules whose antecedents contain a single condition based on a single feature. The VFCC algorithm proposed in this work, on the other hand, constructs rules whose antecedents may contain conjuncts based on several features. Experimental results on recent financial ratios of banks in Turkey show that the VFCC algorithm achieves better accuracy than other well-known rule-learning classification algorithms.
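As a rough illustration of the feature-voting idea (not the authors' exact algorithm), each feature can cast a vote vector over the classes through its own single-condition rules, and the votes are summed. The rule contents and financial-ratio names below are hypothetical:

```python
def vfc_predict(instance, feature_rules):
    """Toy voting-features prediction: each feature casts a vote vector
    over classes from its own single-condition rules, the votes are
    summed, and the top class wins. VFCC additionally constructs
    conjunctive rules over several features; this sketch shows only
    the single-feature voting baseline."""
    votes = {}
    for feature, value in instance.items():
        for condition, class_votes in feature_rules.get(feature, []):
            if condition(value):
                for cls, v in class_votes.items():
                    votes[cls] = votes.get(cls, 0.0) + v
    return max(votes, key=votes.get)

# Hypothetical financial-distress rules on two made-up ratios.
rules = {"equity_ratio": [(lambda x: x < 0.1, {"distress": 0.9, "healthy": 0.1})],
         "liquidity":    [(lambda x: x > 1.0, {"distress": 0.2, "healthy": 0.8})]}
print(vfc_predict({"equity_ratio": 0.05, "liquidity": 1.5}, rules))  # distress
```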

Keywords: Classification, voting, feature construction, financial distress, feature projections

BU-CE-0902: PDF

Modeling Interestingness of Streaming Association Rules as a Benefit Maximizing Classification Problem (Ph. D. Thesis)

Tolga Aydın

In a typical application of association rule learning from market basket data, a set of transactions for a fixed period of time is used as input to rule learning algorithms. For example, the well-known Apriori algorithm can be applied to learn a set of association rules from such a transaction set. However, learning association rules from a set of transactions is not a one-time-only process. For example, a market manager may perform the association rule learning process once every month over the set of transactions collected during the previous month. For this reason, we consider the problem where transaction sets are input to the system as a stream of packages. The sets of transactions may come in varying sizes and at varying periods. Once a set of transactions arrives, the association rule learning algorithm is run on the last set of transactions, resulting in a new set of association rules. Therefore, the set of association rules learned accumulates and increases in number over time, making the mining of interesting ones out of this enlarging set of association rules impractical for human experts. We refer to this sequence of rules as an "association rule set stream" or "streaming association rules", and the main motivation behind this research is to develop a technique to overcome this interesting-rule selection problem. A successful association rule mining system should select and present only the interesting rules to the domain experts. However, the definition of interestingness of association rules on a given domain usually differs from one expert to another, and also over time for a given expert. In this thesis, we propose a post-processing method to learn a subjective model for the interestingness concept description of streaming association rules.
The uniqueness of the proposed method is its ability to formulate the interestingness issue of association rules as a benefit-maximizing classification problem and to obtain a different interestingness model for each user. In this new classification scheme, the determining features are the selective objective interestingness factors, including the rule's content itself, related to the interestingness of the association rules; and the target feature is the interestingness label of those rules. The proposed method works incrementally and employs user interactivity at a certain level. It is evaluated on a real market dataset. The results show that the model can successfully select the interesting rules.
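The benefit-maximizing decision the thesis builds on can be illustrated in isolation: given a classifier's probability that a rule is interesting and a benefit matrix over (predicted, actual) label pairs, the label with the highest expected benefit is chosen. The matrix values below are illustrative, not taken from the thesis:

```python
def benefit_maximizing_label(p_interesting, benefit):
    """Pick the label that maximizes expected benefit, given the
    classifier's probability that a rule is interesting and a 2x2
    benefit matrix benefit[predicted][actual]."""
    expected = {}
    for predicted in ("interesting", "uninteresting"):
        expected[predicted] = (
            p_interesting * benefit[predicted]["interesting"]
            + (1 - p_interesting) * benefit[predicted]["uninteresting"])
    return max(expected, key=expected.get)

# Illustrative benefits: showing a truly interesting rule pays off,
# burying one is costly, and spamming the expert has a small penalty.
benefit = {"interesting":   {"interesting": 5, "uninteresting": -1},
           "uninteresting": {"interesting": -3, "uninteresting": 1}}
print(benefit_maximizing_label(0.4, benefit))  # interesting
```

Note how an asymmetric benefit matrix can label a rule "interesting" even when the estimated probability is below 0.5, which plain accuracy-maximizing classification would not do.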

Keywords: Interestingness learning, incremental learning, classification learning, association rules, data mining.

BU-CE-0903: PDF

Storage and Access Schemes for Aggregate Query Processing on Road Networks (Ph. D. Thesis)

Engin Demir

A well-known example of spatial networks is the road network, which forms an integral part of many GIS applications such as location-based services, intelligent traveling systems, vehicle telematics, and location-aware advertising. In practice, road network data is too large to fit into volatile memory, and a considerable portion of the data must be stored on secondary storage, since several spatial and non-spatial attributes as well as points-of-interest data are associated with junctions and links. In network query processing, the spatial coherency that exists in accessing data leads to a temporal coherency; that is, connected junctions are accessed almost concurrently. Taking this fact into consideration, it seems reasonable to place the data associated with connected junctions in the same disk pages so that the data can be fetched to memory with fewer disk accesses. We show that the state-of-the-art clustering graph model for the allocation of data to disk pages is not able to correctly capture the disk access cost of successor retrieval operations. We propose clustering models based on hypergraph partitioning, which correctly encapsulate the spatial and temporal coherency in query processing via the utilization of query logs, in order to minimize the number of disk accesses during aggregate query processing. We introduce the link-based storage scheme for road networks as an alternative to the widely used junction-based storage scheme. We present Get-Undone-Successors (GUS) as a new successor retrieval operation for network queries, where the candidate successors to be retrieved are pruned during the processing of a query. We investigate two different instances of the GUS operation: the Get-Unprocessed-Successors operation, which typically arises in Dijkstra's single source shortest path algorithm, and the Get-Unvisited-Successors operation, which typically arises in the incremental network expansion framework.
The simulation results show that our storage and access schemes utilizing the proposed clustering hypergraph models are quite effective in reducing the number of disk accesses during aggregate query processing.
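As one concrete instance, the Get-Unprocessed-Successors operation inside Dijkstra's algorithm can be sketched as follows; pruning already-finalized successors before retrieval stands in for GUS, while the thesis's actual contribution is how the retrieved data is laid out on disk pages (function and data names are ours):

```python
import heapq

def dijkstra_gus(adj, source):
    """Dijkstra's single-source shortest paths where successor retrieval
    returns only *unprocessed* successors, i.e. the Get-Unprocessed-
    Successors instance of GUS described above. `adj` maps a junction
    to a list of (successor, link_length) pairs."""
    dist = {source: 0.0}
    done = set()                      # junctions with finalized distances
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        for v, w in adj.get(u, []):
            if v in done:
                continue              # GUS pruning: skip finalized successors
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

# Toy road network: junctions a-d with link lengths.
adj = {"a": [("b", 1.0), ("c", 4.0)],
       "b": [("c", 2.0), ("d", 6.0)],
       "c": [("d", 3.0)]}
print(dijkstra_gus(adj, "a"))  # {'a': 0.0, 'b': 1.0, 'c': 3.0, 'd': 6.0}
```

In the thesis's setting, each successor retrieval may trigger a disk access, so clustering connected junctions into the same page directly reduces the cost of the inner loop above.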

Keywords: Spatial network databases, road networks, storage management, query processing, clustering, hypergraph partitioning.

BU-CE-0904: PDF

Hypergraph Partitioning-Based Fill-Reducing Ordering

Ümit Çatalyürek, Cevdet Aykanat and Enver Kayaaslan

A typical first step of a direct solver for the linear system Mx=b is reordering the symmetric matrix M to improve the execution time and space requirements of the solution process. In this work, we propose a novel nested-dissection-based ordering approach that utilizes hypergraph partitioning. Our approach is based on the formulation of the graph partitioning by vertex separator (GPVS) problem as a hypergraph partitioning problem. This new formulation is immune to the deficiency of GPVS in a multilevel framework and hence enables better orderings. In matrix terms, our method relies on the existence of a structural factorization of the input matrix M in the form M=AA^T (or M=AD^2A^T). We show that partitioning the row-net hypergraph representation of the rectangular matrix A induces a GPVS of the standard graph representation of matrix M. In the absence of such a factorization, we also propose simple yet effective structural factorization techniques that are based on finding an edge clique cover of the standard graph representation of matrix M, and hence are applicable to any arbitrary symmetric matrix M. Our experimental evaluation has shown that the proposed method achieves better orderings in comparison to state-of-the-art graph-based ordering tools, even for symmetric matrices where a structural M=AA^T factorization is not provided as an input. For matrices coming from linear programming problems, our method enables even faster and better orderings.
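The row-net model mentioned above is simple to state: the columns of A become vertices, and each row becomes a net (hyperedge) connecting the columns in which it has a nonzero. A minimal sketch on a dense 0/1 toy matrix (the report works with large sparse matrices and a real hypergraph partitioner):

```python
def row_net_hypergraph(A):
    """Row-net hypergraph of a rectangular matrix A, given here as a
    dense 0/1 list-of-lists for illustration: columns are vertices and
    each row is a net connecting the columns where it has a nonzero.
    Partitioning this hypergraph is the model used to induce a vertex
    separator on the standard graph of M = A A^T."""
    nets = []
    for row in A:
        net = frozenset(j for j, a in enumerate(row) if a != 0)
        nets.append(net)
    return nets

A = [[1, 1, 0, 0],
     [0, 1, 1, 0],
     [0, 0, 1, 1]]
print(row_net_hypergraph(A))
```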

Keywords: Fill-reducing ordering, hypergraph partitioning, combinatorial scientific computing.

BU-CE-0905: PDF

An MPEG-7 Compatible Video Retrieval System with Integrated Support for Complex Multimodal Queries

Muhammet Baştan, Hayati Çam, Uğur Güdükbay, Özgür Ulusoy

We present BilVideo-7, an MPEG-7 compatible, distributed video database management system that supports complex multimodal queries in an integrated way. An MPEG-7 profile is developed to represent the videos by decomposing them into Shots, Keyframes, Still Regions and Moving Regions. The MPEG-7 compatible XML representations of videos according to this profile are obtained by the MPEG-7 compatible video feature extraction and annotation tool of BilVideo-7, and stored in a native XML database. Users can formulate text-based semantic, color, texture, shape, location, motion and spatiotemporal queries on an intuitive, easy-to-use Visual Query Interface, whose Composite Query Interface can be used to specify very complex queries containing any type and number of video segments with their descriptors. The multi-threaded Query Processing Server parses queries into subqueries and executes each subquery in a separate thread. Then, it fuses subquery results in a bottom-up manner to obtain the final query result. The whole system is unique in that it provides very powerful querying capabilities with a wide range of descriptors and multimodal query processing in an MPEG-7 compatible interoperable environment. We present sample queries to demonstrate the capabilities of the system.

Keywords: MPEG-7, video retrieval, feature extraction, annotation, multimodal query processing, client-server architecture.

BU-CE-0906: PDF

Asymptotically Optimal Assignments in Ordinal Evaluations of Proposals

A. Yavuz Oruç, Abdullah Atmaca

In ordinal evaluations of proposals in peer review systems, a set of proposals is assigned to a fixed set of referees so as to maximize the number of pairwise comparisons of proposals under certain referee capacity and proposal subject constraints. In this paper, the following two related problems are considered: (1) Assuming that each referee has a capacity to review k out of n proposals, 2 ≤ k ≤ n, determine the minimum number of referees needed to ensure that each pair of proposals is reviewed by at least one referee; (2) Find an assignment that meets the lower bound determined in (1). It is easy to see that one referee is both necessary and sufficient when k = n, and n(n-1)/2 referees are both necessary and sufficient when k = 2. We show that 6 referees are both necessary and sufficient when k = n/2. We further show that 11 referees are necessary and 12 are sufficient when k = n/3, and 18 referees are necessary and 20 referees are sufficient when k = n/4. A more general lower bound of n(n-1)/(k(k-1)) referees is also given for any k, 2 < k < n, and an assignment asymptotically matching this lower bound within a factor of 2 is presented. These results are not only theoretically interesting but also of practical use in assigning proposals to review panels.
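The counting argument behind the general lower bound can be checked numerically: each referee covers at most C(k,2) of the C(n,2) proposal pairs, so at least ⌈C(n,2)/C(k,2)⌉ referees are needed. A short sketch (the function name is ours):

```python
from math import ceil

def referee_lower_bound(n, k):
    """Counting lower bound on referees: each referee covers at most
    C(k,2) pairs, and all C(n,2) proposal pairs must be covered at
    least once."""
    total_pairs = n * (n - 1) // 2        # C(n, 2)
    pairs_per_referee = k * (k - 1) // 2  # C(k, 2)
    return ceil(total_pairs / pairs_per_referee)

# Boundary cases from the text: k = n needs 1 referee, k = 2 needs C(n,2).
print(referee_lower_bound(12, 12))  # 1
print(referee_lower_bound(12, 2))   # 66  (= 12*11/2)
print(referee_lower_bound(12, 6))   # 5
```

Note that this simple counting bound is not always tight: for k = n/2 it gives 5 here, while the paper shows 6 referees are in fact necessary.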

Keywords: asymptotically optimal assignment, panel assignment problem, peer review, proposal evaluation.

BU-CE-0907: PDF

Content-Based Video Copy Detection Using Multimodal Analysis (M. Sc. Thesis)

Onur Küçüktunç

The huge and increasing amount of video broadcast through networks has raised the need for automatic video copy detection for copyright protection. Recent developments in multimedia technology have introduced content-based copy detection (CBCD) as a new research field, an alternative to the watermarking approach, for the identification of video sequences.

This thesis presents a multimodal framework for matching video sequences using a three-step approach: First, a high-level face detector identifies facial frames/shots in a video clip. Matching faces with extended body regions gives the flexibility to discriminate the same person (e.g., an anchorman or a political leader) in different events or scenes. In the second step, a spatiotemporal sequence matching technique is employed to match video clips/segments that are similar in terms of activity. Finally, the non-facial shots are matched using low-level visual features. In addition, we utilize a fuzzy-logic approach for extracting color histograms to detect shot boundaries of heavily manipulated video clips. Methods for detecting noise, frame droppings, and picture-in-picture transformation windows, and for extracting masks for still regions, are also proposed and evaluated.
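A plain histogram-difference detector conveys the underlying shot-boundary idea; the thesis's actual detector uses a fuzzy-logic treatment of the color histogram comparison, for which the fixed-threshold rule below is merely a stand-in:

```python
def shot_boundaries(histograms, threshold=0.5):
    """Toy shot-boundary detector: flags a boundary wherever the L1
    distance between consecutive normalized frame color histograms
    exceeds `threshold`. Real detectors must also cope with gradual
    transitions and the heavy manipulations mentioned above."""
    boundaries = []
    for i in range(1, len(histograms)):
        d = sum(abs(a - b) for a, b in zip(histograms[i - 1], histograms[i]))
        if d > threshold:
            boundaries.append(i)
    return boundaries

# Four frames with 3-bin histograms; the content shifts abruptly at frame 2.
frames = [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.1, 0.1, 0.8], [0.1, 0.2, 0.7]]
print(shot_boundaries(frames))  # [2]
```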

The proposed method was tested on the query and reference dataset of the CBCD task of TRECVID 2008. Our results were compared with those of the top-8 most successful techniques submitted to this task. Experimental results show that the proposed method performs better than most of the state-of-the-art techniques, in terms of both effectiveness and efficiency.

Keywords: copy detection, video processing, shot-boundary detection, video segmentation, subsequence matching, face detection.

BU-CE-0908: PDF

Improving The Efficiency of Search Engines: Strategies for Focused Crawling, Searching, and Index Pruning (Ph. D. Thesis)

İsmail Sengör Altıngövde

Search engines are the primary means of retrieval for the text data that is abundantly available on the Web. A standard search engine should carry out three fundamental tasks, namely crawling the Web, indexing the crawled content, and finally processing the queries using the index. Devising efficient methods for these tasks is an important research topic. In this thesis, we introduce efficient strategies related to all three tasks involved in a search engine. Most of the proposed strategies are essentially applicable when a grouping of documents in its broadest sense (i.e., in terms of automatically obtained classes/clusters, or manually edited categories) is readily available or can be constructed in a feasible manner. We also introduce static index pruning strategies that are based on query views.

For the crawling task, we propose a rule-based focused crawling strategy that exploits interclass rules among the document classes in a topic taxonomy. These rules capture the probability of having hyperlinks between two classes. The rule-based crawler can tunnel toward the on-topic pages by following a path of off-topic pages, and thus yields a higher harvest rate for crawling on-topic pages.
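The tunneling intuition can be sketched as scoring a frontier link by the probability of reaching the target class within a few hops using the interclass rule probabilities; the class names and probabilities below are made up, and the real crawler's scoring is more involved:

```python
def frontier_score(source_class, target_class, link_prob, max_hops=2):
    """Score a hyperlink found on a page of class `source_class` by the
    best probability of reaching the on-topic `target_class` within
    `max_hops` hops, using interclass rule probabilities
    link_prob[c1][c2]. A nonzero score via an off-topic intermediate
    class is what lets the crawler tunnel through off-topic pages."""
    best = link_prob.get(source_class, {}).get(target_class, 0.0)
    if max_hops >= 2:
        for mid, p1 in link_prob.get(source_class, {}).items():
            p2 = link_prob.get(mid, {}).get(target_class, 0.0)
            best = max(best, p1 * p2)
    return best

link_prob = {"news": {"sports": 0.1, "finance": 0.3},
             "finance": {"banking": 0.6},
             "sports": {}}
# No direct news -> banking rule, but tunneling via finance scores 0.3 * 0.6.
print(round(frontier_score("news", "banking", link_prob), 3))  # 0.18
```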

In the context of the indexing and query processing tasks, we concentrate on conducting efficient search, again using document groups, i.e., clusters or categories. In typical cluster-based retrieval (CBR), first, the clusters that are most similar to a given free-text query are determined, and then documents from these clusters are selected to form the final ranked output. For efficient CBR, we first identify and evaluate some alternative query processing strategies. Next, we introduce a new index organization, the so-called cluster-skipping inverted index structure (CS-IIS). It is shown that typical CBR with CS-IIS outperforms previous CBR strategies (with an ordinary index) for a number of datasets and under varying search parameters. In this thesis, an enhanced version of CS-IIS is further proposed, in which all the information needed to compute query-cluster similarities during query evaluation is stored. We introduce an incremental-CBR strategy that operates on top of this latter index structure, and demonstrate its search efficiency for different scenarios.
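The cluster-skipping idea can be caricatured with postings grouped per cluster, so that query processing bypasses whole groups belonging to unselected clusters; the real CS-IIS stores explicit skip pointers (and, in its enhanced version, in-index information for query-cluster similarities), which this dictionary-based sketch only imitates:

```python
from collections import defaultdict

def build_cs_index(docs, doc_cluster):
    """Sketch of a cluster-skipping posting organization: for each term,
    postings are grouped by cluster id, so query processing can skip
    entire groups for clusters not selected as best-matching."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for term in set(text.split()):
            index[term][doc_cluster[doc_id]].append(doc_id)
    return index

def cbr_search(index, query_terms, selected_clusters):
    """Score only documents in selected clusters (simple term counts)."""
    scores = defaultdict(int)
    for term in query_terms:
        for cluster, postings in index.get(term, {}).items():
            if cluster not in selected_clusters:
                continue  # the "skip": whole per-cluster group bypassed
            for doc_id in postings:
                scores[doc_id] += 1
    return dict(scores)

docs = {1: "road network query", 2: "network index", 3: "query pruning"}
doc_cluster = {1: "A", 2: "A", 3: "B"}
index = build_cs_index(docs, doc_cluster)
print(cbr_search(index, ["query", "network"], {"A"}))  # {1: 2, 2: 1}
```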

Finally, we exploit query views that are obtained from the search engine query logs to tailor more effective static pruning techniques. This is also related to the indexing task involved in a search engine. In particular, the query view approach is incorporated into a set of existing pruning strategies, as well as into some new variants that we propose. We show that query-view-based strategies significantly outperform the existing approaches in terms of query output quality, for both disjunctive and conjunctive evaluation of queries.
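A toy version of query-view-guided static pruning: keep the top-k postings per term by score, but never prune a (term, document) pair when the term appears in the document's query view (the terms of past queries that retrieved it). The scoring and data structures are simplified illustrations, not the thesis's algorithms:

```python
def query_view_prune(index, query_views, k):
    """Static pruning guided by query views: for each term, keep the
    top-k postings (doc_id, score), plus any posting whose term occurs
    in that document's query view, even if its score falls below the
    top-k cutoff."""
    pruned = {}
    for term, postings in index.items():
        keep = sorted(postings, key=lambda p: -p[1])[:k]
        protected = [p for p in postings
                     if term in query_views.get(p[0], set()) and p not in keep]
        pruned[term] = keep + protected
    return pruned

index = {"bank": [(1, 0.9), (2, 0.5), (3, 0.4)], "loan": [(2, 0.8), (3, 0.2)]}
query_views = {3: {"bank"}}  # past queries containing "bank" retrieved doc 3
print(query_view_prune(index, query_views, k=1))
# {'bank': [(1, 0.9), (3, 0.4)], 'loan': [(2, 0.8)]}
```

The protected posting (3, 0.4) survives although it is outside the top-1, which is exactly the effect of incorporating query views into a score-based pruning strategy.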

Keywords: Search engine, focused crawling, cluster-based retrieval, static index pruning.

BU-CE-0909: PDF

Data Dissemination Strategies for Mobile Peer-To-Peer Information Systems with Applications to Healthcare

Fatih Melih Özbekoğlu

The peer-to-peer (P2P) architecture is becoming increasingly popular for various applications, replacing the classical Client-Server architecture. With the enhanced capabilities of mobile devices (PDAs, mobile phones, etc.), wireless networks have started to take advantage of the P2P paradigm along with its properties, such as infrastructure-free operation, scalability, and a balanced and distributed workload. Mobile peer-to-peer (MP2P) networks refer to the application of the P2P architecture over wireless networks. Problems concerning the dissemination of data in both P2P and MP2P networks have been widely studied, and there are many proposed solutions.

Healthcare information systems have been helping clinicians store information about patients and diseases, and communicate with each other, since the early 1950s. Today, they are widely used in hospitals and are typically built on the classical Client-Server network architecture. Wireless technologies have also been applied to the medical domain, especially for monitoring purposes.

In this thesis, we present and evaluate various data dissemination strategies designed to work on a mobile peer-to-peer (MP2P) network for a medical healthcare environment. First, the designed network system is presented along with the network topology. Then, the proposed data dissemination strategies are described. Finally, these strategies are evaluated according to the properties and needs of a medical system.

Keywords: Peer-to-peer architecture, Mobile Peer-to-peer networks, Data dissemination, Healthcare information systems.

BU-CE-0910: PDF

A Scenario-based Query Processing Framework for Video Surveillance (Ph. D. Thesis)

Ediz Şaykol

Video surveillance has become one of the most interesting and challenging application areas in the video processing domain. Automated access to the semantic content of surveillance videos to detect anomalies is among the basic tasks; however, due to the high variability of the visual features and the large size of the video input, it still remains a challenging issue. A considerable amount of research dealing with automated access to video surveillance has appeared in the literature; however, significant semantic gaps in event models and content-based access remain. In this thesis, we propose a scenario-based query processing framework for video surveillance archives, especially for indoor environments. A scenario is specified as a sequence of event predicates that can be enriched with object-based low-level features and directional predicates. We also propose a keyframe labeling technique, which assigns labels to keyframes extracted by a keyframe detection algorithm, and hence transforms the video input into an event-sequence-based representation. The keyframe detection scheme relies on an inverted tracking scheme, which is a view-based representation of the actual content by an inverted index. We also devise mechanisms based on finite state automata that use this event sequence representation to detect a typical set of anomalous events in the scene, which are also used for meta-data extraction. Our query processing framework also supports inverse querying and view-based querying for after-the-fact activity analysis, since the inverted tracking scheme effectively tracks the moving objects and enables view-based addressing of the scene. We propose a specific surveillance query language to express the supported query types in a scenario-based manner. We also present a visual query specification interface devised to enhance the query-specification process.
It has been shown through performance experiments that the keyframe labeling algorithm significantly reduces the storage space and the input size, and yields reasonable anomaly detection performance. We have also conducted performance experiments showing that our query processing technique has effective expressive power and satisfactory retrieval accuracy in video surveillance.
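The automaton-based detection can be illustrated on a hypothetical "abandoned object" scenario over a keyframe-label sequence; the states, labels, and transitions below are ours, not taken from the thesis:

```python
def detect_anomaly(events, transitions, start, accepting):
    """Run a sequence of keyframe labels (events) through a finite state
    automaton; reaching an accepting state flags an anomalous scenario.
    An event with no defined transition resets the automaton to the
    start state."""
    state = start
    for e in events:
        state = transitions.get((state, e), start)
        if state in accepting:
            return True
    return False

# Hypothetical "abandoned object" scenario: enter -> drop -> exit.
transitions = {("idle", "enter"): "present",
               ("present", "drop"): "dropped",
               ("dropped", "exit"): "abandoned"}
events = ["enter", "drop", "exit"]
print(detect_anomaly(events, transitions, "idle", {"abandoned"}))  # True
```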

Keywords: video surveillance, scenario-based querying, keyframe labeling, inverse querying, view-based querying, anomaly detection, after-the-fact analysis.

BU-CE-0911: PDF

Threshold Cryptography with Chinese Remainder Theorem (Ph. D. Thesis)

Kamer Kaya

Information security has become much more important since electronic communication started to be used in our daily lives. The content of the term information security varies according to the type and the requirements of the area. However, no matter which algorithms are used, security depends on the secrecy of a key which is supposed to be known only by the agents in the first place.

The requirement that the key be secret brings several problems. Storing a secret key with only one person, server, or database reduces the security of the system to the security and credibility of that agent. Besides, not having a backup of the key introduces the problem of losing the key if a software/hardware failure occurs. On the other hand, if the key is held by more than one agent, an adversary with a desire for the key has more flexibility in choosing the target. Hence the security is reduced to the security of the least secure or least credible of these agents.

Secret sharing schemes were introduced to solve the problems above. The main idea of these schemes is to share the secret among the agents such that only predefined coalitions can come together and reveal the secret, while no other coalition can obtain any information about the secret. Thus, the keys used in areas requiring vital secrecy, like large-scale finance applications and command-control mechanisms of nuclear systems, can be stored by using secret sharing schemes.

Threshold cryptography deals with a particular type of secret sharing scheme. In threshold-cryptography-related secret sharing schemes, if the size of a coalition exceeds a bound t, it can reveal the key, while smaller coalitions can reveal no information about the key. Actually, the first secret sharing scheme in the literature is the threshold scheme of Shamir, where he considered the secret as the constant term of a polynomial of degree t - 1 and distributed points on the polynomial to the group of users. Thus, a coalition of size t can recover the polynomial and reveal the key, but a smaller coalition cannot. This scheme is widely accepted by researchers and used in several applications. Shamir's secret sharing scheme is not the only one in the literature. For example, almost concurrently, Blakley proposed another secret sharing scheme based on planar geometry, and Asmuth and Bloom proposed a scheme based on the Chinese Remainder Theorem. Although these schemes satisfy the necessary and sufficient conditions for security, they have not been considered for the applications requiring a secret sharing scheme.
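Shamir's scheme as described above is short enough to sketch; the prime modulus and parameters below are toys, not cryptographic choices:

```python
import random

P = 2**31 - 1  # a small Mersenne prime as a toy field modulus

def shamir_share(secret, t, n):
    """Shamir (t, n) sharing: the secret is the constant term of a
    random polynomial of degree t - 1 over GF(P); the shares are the
    points (i, f(i)) for i = 1..n."""
    coeffs = [secret] + [random.randrange(P) for _ in range(t - 1)]
    def f(x):
        return sum(c * pow(x, j, P) for j, c in enumerate(coeffs)) % P
    return [(i, f(i)) for i in range(1, n + 1)]

def shamir_recover(shares):
    """Lagrange interpolation at x = 0 over GF(P) from any t shares."""
    secret = 0
    for xi, yi in shares:
        num, den = 1, 1
        for xj, _ in shares:
            if xj != xi:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        secret = (secret + yi * num * pow(den, P - 2, P)) % P
    return secret

shares = shamir_share(12345, t=3, n=5)
print(shamir_recover(shares[:3]))  # 12345 (any 3 of the 5 shares suffice)
```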

Secret sharing schemes constitute a building block in several applications other than the ones mentioned above. These applications typically contain a standard problem in the literature, the function sharing problem. In a function sharing scheme, each user has its own secret as an input to a function, and the scheme computes the outcome of the function without revealing the secrets. In the literature, the encryption or signature functions of public key algorithms like RSA, ElGamal, and Paillier can be given as examples of functions shared by using a secret sharing scheme. Even new-generation applications like electronic voting require a function sharing scheme. As mentioned before, Shamir's secret sharing scheme has attracted much of the attention in the literature, and other schemes have not been considered much. However, as this thesis shows, secret sharing schemes based on the Chinese Remainder Theorem can be practically used in these applications. Since each application has different needs, Shamir's secret sharing scheme is used in applications with several extensions. Basically, this thesis investigates how to adapt Chinese Remainder Theorem based secret sharing schemes to the applications in the literature. We first propose some modifications on the Asmuth-Bloom secret sharing scheme, and then, by using this modified scheme, we design provably secure function sharing schemes and security extensions.
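The Asmuth-Bloom scheme the thesis builds on can be sketched with toy moduli: the secret is lifted to y = secret + A·m0 below the product of the t smallest share moduli, the shares are the residues of y, and any t shares recover y by the Chinese Remainder Theorem; the secret is then y mod m0 (the additional condition on the moduli required for security is omitted in this sketch):

```python
import random
from math import prod

def crt(residues, moduli):
    """Chinese Remainder Theorem: the unique x modulo prod(moduli)
    with x = r (mod m) for each pair, assuming pairwise coprime moduli."""
    M = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        x = (x + r * Mi * pow(Mi, -1, m)) % M
    return x

def asmuth_bloom_share(secret, m0, ms, t):
    """Asmuth-Bloom (t, n) sharing sketch: y = secret + A*m0 is kept
    below the product of the t smallest moduli, and share i is y mod
    ms[i]. The moduli are tiny pairwise-coprime toys, not secure
    parameters; secret must be smaller than m0."""
    bound = prod(sorted(ms)[:t])
    A = random.randrange(bound // m0)  # keeps y = secret + A*m0 < bound
    y = secret + A * m0
    return [(m, y % m) for m in ms]

m0, ms, t = 7, [11, 13, 17, 19, 23], 3
shares = asmuth_bloom_share(5, m0, ms, t)
# Any t shares reconstruct y by CRT (their moduli product exceeds y);
# the secret is then y mod m0.
chosen = shares[:t]
y = crt([r for _, r in chosen], [m for m, _ in chosen])
print(y % m0)  # 5
```

Fewer than t shares leave y underdetermined modulo the missing moduli, which is why small coalitions learn nothing useful about the secret.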

Keywords: Threshold cryptography, secret sharing, function sharing, Asmuth-Bloom, Chinese Remainder Theorem, provable security.