The Link Prediction Problem for Social Networks
∗
David Liben-Nowell†
Laboratory for Computer Science
Massachusetts Institute of Technology
Cambridge, MA 02139 USA
dln@theory.lcs.mit.edu
Jon Kleinberg‡
Department of Computer Science
Cornell University
Ithaca, NY 14853 USA
kleinber@cs.cornell.edu
January 8, 2004
Abstract
Given a snapshot of a social network, can we infer which new interactions among its members
are likely to occur in the near future? We formalize this question as the link prediction problem,
and develop approaches to link prediction based on measures for analyzing the “proximity” of
nodes in a network. Experiments on large co-authorship networks suggest that information
about future interactions can be extracted from network topology alone, and that fairly subtle
measures for detecting node proximity can outperform more direct measures.
1 Introduction
As part of the recent surge of research on large, complex networks and their properties, a consid-
erable amount of attention has been devoted to the computational analysis of social networks—
structures whose nodes represent people or other entities embedded in a social context, and whose
edges represent interaction, collaboration, or influence between entities. Natural examples of social
networks include the set of all scientists in a particular discipline, with edges joining pairs who
have co-authored papers; the set of all employees in a large company, with edges joining pairs
working on a common project; or a collection of business leaders, with edges joining pairs who
have served together on a corporate board of directors. The availability of large, detailed datasets
encoding such networks has stimulated extensive study of their basic properties, and the identi-
fication of recurring structural features. (See, for example, the work of Watts and Strogatz [28],
Watts [27], Grossman [9], Newman [19], and Adamic and Adar [1], or, for a thorough recent survey,
Newman [20].)
Social networks are highly dynamic objects; they grow and change quickly over time through
the addition of new edges, signifying the appearance of new interactions in the underlying social
structure. Understanding the mechanisms by which they evolve is a fundamental question that
is still not well understood, and it forms the motivation for our work here. We define and study
a basic computational problem underlying social network evolution, the link prediction problem:
∗An abbreviated version of this paper appears in Proceedings of the Twelfth Annual ACM International Conference
on Information and Knowledge Management (CIKM’03), November 2003, pp. 556–559.
†Supported in part by an NSF Graduate Research Fellowship.
‡Supported in part by a David and Lucile Packard Foundation Fellowship and NSF ITR Grant IIS-0081334.
1
Given a snapshot of a social network at time t, we seek to accurately predict the edges that will be
added to the network during the interval from time t to a given future time t′.
In effect, the link prediction problem asks: to what extent can the evolution of a social network
be modeled using features intrinsic to the network itself? Consider a co-authorship network among
scientists, for example. There are many reasons, exogenous to the network, why two scientists who
have never written a paper together will do so in the next few years: for example, they may happen
to become geographically close when one of them changes institutions. Such collaborations can be
hard to predict. But one also senses that a large number of new collaborations are hinted at by
the topology of the network: two scientists who are “close” in the network will have colleagues in
common, and will travel in similar circles; this suggests that they themselves are more likely to
collaborate in the near future. Our goal is to make this intuitive notion precise, and to understand
which measures of “proximity” in a network lead to the most accurate link predictions. We find
that a number of proximity measures lead to predictions that outperform chance by factors of 40 to
50, indicating that the network topology does indeed contain latent information from which to infer
future interactions. Moreover, certain fairly subtle measures—involving infinite sums over paths in
the network—often outperform more direct measures, such as shortest-path distances and numbers
of shared neighbors.
We believe that a primary contribution of the present paper is in the area of network evolution
models. While there has been a proliferation of such models in recent years—see, for example, the
work of Jin et al. [11], Barabasi et al. [2], and Davidsen et al. [5] for recent work on collaboration
networks, or the survey of Newman [20]—they have generally been evaluated only by asking whether
they reproduce certain global structural features observed in real networks. As a result, it has been
difficult to evaluate and compare different approaches on a principled footing. Link prediction, on
the other hand, offers a very natural basis for such evaluations: a network model is useful to the
extent that it can support meaningful inferences from observed network data. One sees a related
approach in recent work of Newman [17], who considers the correlation between certain network
growth models and data on the appearance of edges of co-authorship networks.
In addition to its role as a basic question in social network evolution, the link prediction problem
could be relevant to a number of interesting current applications of social networks. Increasingly, for
example, researchers in artificial intelligence and data mining have argued that a large organization,
such as a company, can benefit from the interactions within the informal social network among its
members; these serve to supplement the official hierarchy imposed by the organization itself [13, 23].
Effective methods for link prediction could be used to analyze such a social network, and suggest
promising interactions or collaborations that have not yet been utilized within the organization.
In a different vein, research in security has recently begun to emphasize the role of social network
analysis, largely motivated by the problem of monitoring terrorist networks; link prediction in this
context allows one to conjecture that particular individuals are working together even though their
interaction has not been directly observed [14].
The link prediction problem is also related to the problem of inferring missing links from an
observed network: in a number of domains, one constructs a network of interactions based on
observable data and then tries to infer additional links that, while not directly visible, are likely
to exist [8, 22, 26]. This line of work differs from our problem formulation in that it works with
a static snapshot of a network, rather than considering network evolution; it also tends to take
into account specific attributes of the nodes in the network, rather than evaluating the power of
prediction methods based purely on the graph structure.
We now turn to a description of our experimental setup, in Section 2. Our primary focus is on
2
training period Core
authors papers edges authors |Eold | |Enew |
astro-ph 5343 5816 41852 1561 6178 5751
cond-mat 5469 6700 19881 1253 1899 1150
gr-qc 2122 3287 5724 486 519 400
hep-ph 5414 10254 17806 1790 6654 3294
hep-th 5241 9498 15842 1438 2311 1576
Figure 1: The five sections of the arXiv from which co-authorship networks were constructed:
astro-ph (astrophysics), cond-mat (condensed matter), gr-qc (general relativity and quantum
cosmology), hep-ph (high energy physics—phenomenology), and hep-th (high energy physics—
theory). The set Core is the subset of the authors who have written at least κtraining = 3 papers
during the training period and κtest = 3 papers during the test period. The sets Eold and Enew
denote edges between Core authors which first appear during the training and test periods, respec-
tively.
understanding the relative effectiveness of network proximity measures adapted from techniques
in graph theory, computer science, and the social sciences, and we review a large number of such
techniques in Section 3. Finally, we discuss the results of our experiments in Section 4.
2 Data and Experimental Setup
Suppose we have a social network G = 〈V, E〉 in which each edge e = 〈u, v〉 ∈ E represents
an interaction between u and v that took place at a particular time t(e). We record multiple
interactions between u and v as parallel edges, with potentially different time-stamps. For two
times t < t′, let G[t, t′] denote the subgraph of G consisting of all edges with a time-stamp between
t and t′. Here, then, is a concrete formulation of the link prediction problem. We choose four times
t0 < t
′
0 < t1 < t
′
1, and give an algorithm access to the network G[t0, t
′
0]; it must then output a list
of edges, not present in G[t0, t
′
0], that are predicted to appear in the network G[t1, t
′
1]. We refer to
[t0, t
′
0] as the training interval and [t1, t
′
1] as the test interval.
Of course, social networks grow through the addition of nodes as well as edges, and it is not
sensible to seek predictions for edges whose endpoints are not present in the training interval. Thus,
in evaluating link prediction methods, we will generally use two parameters κtraining and κtest (each
set to 3 below), and define the set Core to be all nodes incident to at least κtraining edges in G[t0, t
′
0]
and at least κtest edges in G[t1, t
′
1]. We will then evaluate how accurately the new edges between
elements of Core can be predicted.
We now describe our experimental setup more specifically. We work with five co-authorship
networks G, obtained from the author lists of papers at five sections of the physics e-Print arXiv,
www.arxiv.org. (See Figure 1 for statistics on the size of each of these five networks.) Some
heuristics were used to deal with occasional syntactic anomalies; and authors were identified by
first initial and last name, a process that introduces a small amount of noise due to multiple authors
with the same identifier [18]. The errors introduced by this process appear to be minor.
Now consider any one of these five graphs. We define the training interval to be the three years
[1994, 1996], and the test interval to be [1997, 1999]. We denote the subgraph G[1994, 1996] on the
training interval by Gcollab := 〈A, Eold〉, and use Enew to denote the set of edges 〈u, v〉 such that
u, v ∈ A, and u, v co-author a paper during the test interval but not the training interval—these
are the new interactions we are seeking to predict.
3
graph distance (negated) length of shortest path between x and y
common neighbors |Γ(x)∩ Γ(y)|
Jaccard’s coefficient
|Γ(x)∩Γ(y)|
|Γ(x)∪Γ(y)|
Adamic/Adar
∑
z∈Γ(x)∩Γ(y)
1
log |Γ(z)|
preferential attachment |Γ(x)| · |Γ(y)|
Katzβ
∑∞
`=1 β
` · |paths
〈`〉
x,y |
where paths
〈`〉
x,y := {paths of length exactly ` from x to y}
weighted: paths
〈1〉
x,y := number of collaborations between x, y.
unweighted: paths
〈1〉
x,y := 1 iff x and y collaborate.
hitting time −Hx,y
stationary-normed −Hx,y · piy
commute time −(Hx,y + Hy,x)
stationary-normed −(Hx,y · piy + Hy,x · pix)
where Hx,y := expected time for random walk from x to reach y
piy := stationary distribution weight of y
(proportion of time the random walk is at node y)
rooted PageRankα stationary distribution weight of y under the following random walk:
with probability α, jump to x.
with probability 1 − α, go to random neighbor of current node.
SimRankγ
{
1 if x = y
γ ·
�
a∈Γ(x)
�
b∈Γ(y) score(a,b)
|Γ(x)|·|Γ(y)| otherwise
Figure 2: Values for score(x, y) under various predictors; each predicts pairs 〈x, y〉 in descending
order of score(x, y). The set Γ(x) consists of the neighbors of the node x in Gcollab .
Evaluating a link predictor. Each link predictor p that we consider outputs a ranked list Lp of
pairs in A×A−Eold ; these are predicted new collaborations, in decreasing order of confidence. For
our evaluation, we focus on the set Core, so we define E∗new := Enew ∩(Core×Core) and n := |E
∗
new |.
Our performance measure for predictor p is then determined as follows: from the ranked list Lp, we
take the first n pairs in Core × Core, and determine the size of the intersection of this set of pairs
with the set E∗new .
3 Methods for Link Prediction
In this section, we survey an array of methods for link prediction. All the methods assign a
connection weight score(x, y) to pairs of nodes 〈x, y〉, based on the input graph Gcollab , and then
produce a ranked list in decreasing order of score(x, y). Thus, they can be viewed as computing a
measure of proximity or “similarity” between nodes x and y, relative to the network topology. In
general, the methods are adapted from techniques used in graph theory and social network analysis;
in a number of cases, these techniques were not designed to measure node-to-node similarity, and
hence need to be modified for this purpose. Figure 2 summarizes most of these measures; below we
4
discuss them in more detail. We note that some of these measures are designed only for connected
graphs; since each graph Gcollab that we consider has a giant component—a single component
containing most of the nodes—it is natural to restrict the predictions for these measures to this
component.
Perhaps the most basic approach is to rank pairs 〈x, y〉 by the length of their shortest path in
Gcollab . Such a measure follows the notion that collaboration networks are “small worlds,” in which
individuals are related through short chains [18]. (In keeping with the notion that we rank pairs
in decreasing order of score(x, y), we define score(x, y) here to be the negative of the shortest path
length.) Pairs with shortest-path distance equal to 1 are joined by an edge in Gcollab , and hence
they belong to the training edge set Eold . For all of our graphs Gcollab , there are well more than n
pairs at shortest-path distance two, so our shortest-path predictor simply selects a random subset
of these distance-two pairs.
Methods based on node neighborhoods. For a node x, let Γ(x) denote the set of neighbors
of x in Gcollab . A number of approaches are based on the idea that two nodes x and y are more
likely to form a link in the future if their sets of neighbors Γ(x) and Γ(y) have large overlap; this
follows the natural intuition that such nodes x and y represent authors with many colleagues in
common, and hence are more likely to come into contact themselves. Jin et al. [11] and Davidsen
et al. [5] have defined abstract models for network growth using this principle, in which an edge
〈x, y〉 is more likely to form if edges 〈x, z〉 and 〈z, y〉 are already present for some z.
• Common neighbors. The most direct implementation of this idea for link prediction is to define
score(x, y) := |Γ(x) ∩ Γ(y)|, the number of neighbors that x and y have in common. Newman [17]
has computed this quantity in the context of collaboration networks, verifying a correlation between
the number of common neighbors of x and y at time t, and the probability that they will collaborate
in the future.
• Jaccard’s coefficient and Adamic/Adar. The Jaccard coefficient—a commonly used similarity
metric in information retrieval [24]—measures the probability that both x and y have a feature f ,
for a randomly selected feature f that either x or y has. If we take “features” here to be neighbors
in Gcollab , this leads to the measure score(x, y) := |Γ(x)∩Γ(y)|/|Γ(x)∪Γ(y)|. Adamic and Adar [1]
consider a related measure, in the context of deciding when two personal home pages are strongly
“related.” To do this, they compute features of the pages, and define the similarity between two
pages to be ∑
z : feature shared by x, y
1
log(frequency(z))
.
This refines the simple counting of common features by weighting rarer features more heavily. This
suggests the measure score(x, y) :=
∑
z∈Γ(x)∩Γ(y)
1
log |Γ(z)| .
• Preferential attachment has received considerable attention as a model of the growth of net-
works [16]. The basic premise is that the probability that a new edge involves node x is proportional
to |Γ(x)|, the current number of neighbors of x. Newman [17] and Barabasi et al. [2] have further
proposed, on the basis of empirical evidence, that the probability of co-authorship of x and y is
correlated with the product of the number of collaborators of x and y. This corresponds to the
measure score(x, y) := |Γ(x)| · |Γ(y)|.
Methods based on the ensemble of all paths. A number of methods refine the notion of
shortest-path distance by implicitly considering the ensemble of all paths between two nodes.
5
• Katz [12] defines a measure that directly sums over this collection of paths, exponentially
damped by length to count short paths more heavily. This leads to the measure
score(x, y) :=
∞∑
`=1
β` · |paths〈`〉x,y|,
where paths
〈`〉
x,y is the set of all length-` paths from x to y. (A very small β yields predictions much
like common neighbors, since paths of length three or more contribute very little to the summation.)
One can verify that the matrix of scores is given by (I − βM)−1 − I , where M is the adjacency
matrix of the graph. We consider two variants of this Katz measure: (1) unweighted, in which
paths
〈1〉
x,y = 1 if x and y have collaborated and 0 otherwise, and (2) weighted, in which paths
〈1〉
x,y is
the number of times that x and y have collaborated.
• Hitting time, PageRank, and variants. A random walk on Gcollab starts at a node x, and
iteratively moves to a neighbor of x chosen uniformly at random. The hitting time Hx,y from x to y
is the expected number of steps required for a random walk starting at x to reach y. Since the hitting
time is not in general symmetric, it is also natural to consider the commute time Cx,y := Hx,y+Hy,x.
Both of these measures serve as natural proximity measures, and hence (negated) can be used as
score(x, y).
One difficulty with hitting time as a measure of proximity is that Hx,y is quite small whenever y
is a node with a large stationary probability piy , regardless of the identity of x. To counterbalance this
phenomenon, we also consider normalized versions of the hitting and commute times, by defining
score(x, y) := −Hx,y · piy or score(x, y) := −(Hx,y · piy + Hy,x · pix).
Another difficulty with these measures is their sensitive dependence to parts of the graph far
away from x and y, even when x and y are connected by very short paths. A way of counteracting
this is to allow the random walk from x to y to periodically “reset,” returning to x with a fixed
probability α at each step; in this way, distant parts of the graph will almost never be explored.
Random resets form the basis of the PageRank measure for Web pages [3], and we can adapt it
for link prediction as follows: Define score(x, y) under the rooted PageRank measure to be the
stationary probability of y in a random walk that returns to x with probability α each step, moving
to a random neighbor with probability 1 − α.
• SimRank [10] is a fixed point of the following recursive definition: two nodes are similar to
the extent that they are joined to similar neighbors. Numerically, this is specified by defining
similarity(x, x) := 1 and
similarity(x, y) := γ ·
∑
a∈Γ(x)
∑
b∈Γ(y) similarity(a, b)
|Γ(x)| · |Γ(y)|
for some γ ∈ [0, 1]. We then define score(x, y) := similarity(x, y). SimRank can also be interpreted
in terms of a random walk on the collaboration graph: it is the expected value of γ`, where ` is a
random variable giving the time at which random walks started from x and y first meet.
Higher-level approaches. We now discuss three “meta-approaches” that can be used in con-
junction with any of the methods discussed above.
• Low-rank approximation. Since the adjacency matrix M can be used to represent the graph
Gcollab , all o