Bilkent University
Department of Computer Engineering
Ph.D Dissertation


Efficient Social Network Analysis on Big Data Platform


Hidayet Aksu
PhD Student
Computer Engineering Department
Bilkent University

In recent years, the rise of very large, rich content networks re-ignited interests to complex/social network analysis at the big data scale that makes it possible to understand social interactions at large scale while it poses computation challenges to early works with algorithm complexity greater than O(n). With the popularization of mobile phone usage, telecommunication networks have turned into a socially binding medium. Considering the traces of human communication held inside these networks, telecommunication networks are now able to provide a proxy for human social networks. Researchers who realized the value of call graphs to analyze social networks began using call detail records, which are stored by telecommunication operators, to determine social network characteristics. To study degree characteristics and structural properties in large-scale social networks, we gathered a tera-scale dataset of call detail records. We first empirically evaluate some statistical models against the degree distribution of the country's call graph and determine that a Pareto log-normal distribution provides the best fit, despite claims in the literature that power-law distribution is the best model. We then question how network operator, size, density and location affect degree distribution to understand the parameters governing it in social networks. Our empirical analysis indicates that changes in density, operator and location do not show a particular correlation with degree distribution; however, the average degree of social networks is proportional to the logarithm of network size. We also report on the structural properties of the communication network.

Besides structural property analysis, community identification is of great interest in practice to learn high cohesive subnetworks about different subjects in a network. In graph theory, $k$-core is a key metric used to identify subgraphs of high cohesion, also known as the 'dense' regions of a graph. As the real world graphs such as social network graphs grow in size, the contents get richer and the topologies change dynamically, we are challenged not only to materialize $k$-core subgraphs for one time but also to maintain them in order to keep up with continuous updates. Adding to the challenge is that real world datasets are outgrowing the capacity of a single server and its main memory. These challenges inspired us to propose a new set of distributed algorithms for $k$-core view construction and maintenance on a horizontally scaling storage and computing platform. Our algorithms execute against the partitioned graph data in parallel and take advantage of $k$-core properties to aggressively prune unnecessary computation. Experimental evaluation results demonstrated orders of magnitude speedup and advantages of maintaining $k$-core incrementally and in batch windows over complete reconstruction. %Our algorithms thus enable practitioners to create and maintain many $k$-core views on different topics in rich social network content simultaneously.

Moreover, the intensity of community engagement can be distinguished at multiple levels, resulting in a multiresolution community representation that has to be maintained over time. We first formalize this problem using the $k$-core metric projected at multiple $k$-values, so that multiple community resolutions are represented with multiple $k$-core graphs. Recognizing that large graphs and their even larger attributed content cannot be stored and managed by a single server, we then propose distributed algorithms to construct and maintain a multi-$k$-core graph, implemented on the scalable big-data platform Apache HBase. Our experimental evaluation results demonstrate orders of magnitude speedup by maintaining multi-$k$-core incrementally over complete reconstruction. Our algorithms thus enable practitioners to create and maintain communities at multiple resolutions on multiple subjects in rich network content simultaneously.


DATE: 10 July, 2014, Thursday @ 10:00