MA/CS-109: Graph Random Walks
Leo Reyzin
Slides with the help of Azer Bestavros
Wandering around in a graph
n We saw that given a map (graph), one can determine paths that satisfy a specific objective
¨ Minimize cost for going from A to B (Shortest Path)
¨ Visit every node once (Traveling Salesman)
¨ Cross every edge once returning to starting point (Euler Tour)
4/25/13 MA/CS-109 (Azer Bestavros) 2
Wandering around in a graph (continued)
n What if we have no objective?
¨ A tourist roaming around the city
¨ A web surfer aimlessly clicking from one web page to another
[Figure: Wandering around in DC]
Random walks on a graph
1. Start in some random node
2. At random, select one of the outgoing edges of the current node
3. Walk over that edge to a new node
4. Repeat from step 2
n What properties would emerge as a result of doing the above for a very very very long time?
A specific question
n What are the relative frequencies (= probabilities) of visiting the various nodes in the graph?
¨ What is the likelihood of finding the wandering tourist at a given subway station?
¨ What is the likelihood that a wandering tourist will stumble upon the Museum of Science?
¨ What is the likelihood that a web surfer will end up visiting Google? How about the BU web site?
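The four-step walk and the "relative frequency" question can be tried out directly. The sketch below is only an illustration: the four-station map `subway`, the function name `random_walk_frequencies`, the step count, and the fixed seed are all assumptions made for this example, not something taken from the slides.

```python
import random
from collections import Counter

def random_walk_frequencies(graph, steps, seed=0):
    """Simulate one walker and return the fraction of steps spent at each node."""
    rng = random.Random(seed)                 # fixed seed so the run is repeatable
    node = rng.choice(sorted(graph))          # step 1: start in some random node
    visits = Counter()
    for _ in range(steps):                    # step 4: repeat
        node = rng.choice(graph[node])        # steps 2-3: pick an outgoing edge and walk it
        visits[node] += 1
    return {v: visits[v] / steps for v in graph}

# Hypothetical four-station map; each list holds the outgoing edges of a node.
subway = {"A": ["B", "C"], "B": ["A", "C", "D"], "C": ["A", "B"], "D": ["B"]}

frequencies = random_walk_frequencies(subway, steps=200_000)
for station, share in sorted(frequencies.items()):
    print(f"{station}: {share:.3f}")
```

In this made-up map, station B has the most connections and the walker ends up there most often, while the dead-end station D is visited least.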
Shuttling between BU campuses
[Figure: BU Medical Campus to BU Main Campus on Always; BU Main Campus to BU Medical Campus on Always]
n 50% chance to be on either campus
Let’s make it more interesting
[Figure: BU Medical Campus to BU Main Campus on Always; BU Main Campus stays put on Tails and goes to BU Medical Campus on Heads]
n I take one step per hour
n If I start on the main campus, what’s the probability that I am on the main campus about 100 hours from now?
n If I start on the medical campus, what’s the probability that I am on the main campus about 100 hours from now?
n Does it matter, 100 hours from now, where I started?
n A better question: what fraction of the time am I on the main campus in the long term?
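One way to explore the 100-hour questions for the coin-flip version of the campus walk is to track the probability of being on each campus hour by hour. This is a minimal sketch, assuming a fair coin (Heads and Tails each with probability 0.5); the names `STEP` and `distribution_after` are made up for the example.

```python
# Transition probabilities read off the coin-flip diagram (fair coin assumed):
# from Medical always go to Main; from Main stay on Tails, leave on Heads.
STEP = {
    "Medical": {"Main": 1.0},
    "Main": {"Main": 0.5, "Medical": 0.5},
}

def distribution_after(start, hours):
    """Return the probability of being on each campus after the given number of hourly steps."""
    dist = {campus: 0.0 for campus in STEP}
    dist[start] = 1.0                          # we know where we start
    for _ in range(hours):
        nxt = {campus: 0.0 for campus in STEP}
        for here, p_here in dist.items():
            for there, p_move in STEP[here].items():
                nxt[there] += p_here * p_move  # probability flowing along one edge
        dist = nxt
    return dist

from_main = distribution_after("Main", 100)
from_medical = distribution_after("Medical", 100)
print(from_main["Main"], from_medical["Main"])
```

Both printed values come out essentially identical (very close to 2/3), which suggests that after 100 hours the starting campus no longer matters.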
An equivalent formulation
[Figure: BU Medical Campus to BU Main Campus on Always; BU Main Campus stays put on Tails and goes to BU Medical Campus on Heads]
n Instead of thinking of a single person randomly walking, think of a very large population (ants!)
n Instead of asking “in the long term, what fraction of the time is the person at each campus,” ask “in the long term, what is the percentage of the population we should expect on each campus?”
n The two questions are equivalent (this can be proven)
A simple example (continued)
[Figure: the same two campuses, with Prob(Always)=1, Prob(Tails)=0.5, Prob(Heads)=0.5]
n Fact (no proof here, because not enough time): for most graphs, we will eventually reach a “steady state”: even though each individual ant moves around, the fraction of ants at each location remains the same. In other words, for each node, the number of incoming ants will equal the number of outgoing ants.
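The steady-state claim can be checked on the two-campus example by moving the expected number of ants along each edge and comparing inflow with outflow. This is a sketch under assumptions: the colony size, the number of rounds, and the names `MOVE` and `step` are invented for the illustration, and it tracks expected counts rather than individual ants.

```python
# Two-campus example with a fair coin: Medical always goes to Main;
# Main keeps half of its ants (Tails) and sends half to Medical (Heads).
MOVE = {
    "Medical": {"Main": 1.0},
    "Main": {"Main": 0.5, "Medical": 0.5},
}

def step(population):
    """Move the expected number of ants along each edge for one time step."""
    after = {node: 0.0 for node in MOVE}
    for node, ants in population.items():
        for target, prob in MOVE[node].items():
            after[target] += ants * prob
    return after

# Hypothetical colony: three million ants, all starting on the medical campus.
population = {"Medical": 3_000_000.0, "Main": 0.0}
for _ in range(60):
    population = step(population)

# Ants that will cross between the two campuses in the next step.
leaving_medical = population["Medical"] * MOVE["Medical"]["Main"]
entering_medical = population["Main"] * MOVE["Main"]["Medical"]
print(population, leaving_medical, entering_medical)
```

After enough rounds the two flows match, so the head count on each campus stops changing even though individual ants keep moving.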
A simple example (continued)
[Figure: BU Medical Campus and BU Main Campus, with Prob(Always)=1, Prob(Tails)=0.5, Prob(Heads)=0.5]
n So, let Pr[v] be the fraction of the population at vertex v after the steady state is reached (equivalently, the probability that the person is at vertex v).
A simple example (continued)
[Figure: nodes A and B; A to B with probability 1, B to B with probability 0.5, B to A with probability 0.5]
Pr[A] = 0.5 Pr[B]
(Pr[A]: fraction of the total at A over the long term; Pr[B]: fraction of the total at B over the long term)
n Intuitively: because to get to A you have to come from B and half the people from B do that
n This clearly holds when we think of Pr[B] at a given moment in time and Pr[A] at the next moment in time, after one step. But after the steady state is reached, moments in time don’t matter, because Pr[A] and Pr[B] remain the same over time!
A simple example (continued)
[Figure: nodes A and B; A to B with probability 1, B to B with probability 0.5, B to A with probability 0.5]
Pr[B] = Pr[A] + 0.5 Pr[B]
n Intuitively: because to get to B you have to come from A (which everyone at A does) or from B (which half the people do)
A simple example (continued)
Pr[A] + Pr[B] = 1
n Always true about probabilities: you are either at A or at B
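A simple example (solution)
[Figure: nodes A and B; A to B with probability 1, B to B with probability 0.5, B to A with probability 0.5]
Pr[A] = 0.5 Pr[B]
Pr[B] = Pr[A] + 0.5 Pr[B]
Pr[A] + Pr[B] = 1
Pr[A] = 1/3, Pr[B] = 2/3
How about this one?
[Figure: nodes A and B; A to B with probability 1, B to B with probability 0.9, B to A with probability 0.1]
Pr[A] = 0.1 Pr[B]
Pr[B] = Pr[A] + 0.9 Pr[B]
Pr[A] + Pr[B] = 1
Pr[A] ≈ 9.1%, Pr[B] ≈ 90.9%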
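The step from the equations to the answers is a one-line substitution in each case. In both systems the equation for Pr[B] says the same thing as the equation for Pr[A] once rearranged, so only the first equation and the normalization are needed. Written out as a worked derivation:

```latex
\begin{align*}
\text{First graph:}\quad
  & \Pr[A] = 0.5\,\Pr[B], \qquad \Pr[A] + \Pr[B] = 1 \\
  & 0.5\,\Pr[B] + \Pr[B] = 1
    \;\Longrightarrow\; \Pr[B] = \tfrac{1}{1.5} = \tfrac{2}{3},
    \quad \Pr[A] = \tfrac{1}{3} \\[6pt]
\text{Second graph:}\quad
  & \Pr[A] = 0.1\,\Pr[B], \qquad \Pr[A] + \Pr[B] = 1 \\
  & 0.1\,\Pr[B] + \Pr[B] = 1
    \;\Longrightarrow\; \Pr[B] = \tfrac{1}{1.1} \approx 0.909,
    \quad \Pr[A] \approx 0.091
\end{align*}
```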
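How about this one?
[Figure: nodes A, B, C with transition probabilities A to A 0.3, A to B 0.2, A to C 0.5, B to A 0.4, B to C 0.6, C to A 0.3, C to B 0.3, C to C 0.4]
Pr[A] = 0.3 Pr[A] + 0.4 Pr[B] + 0.3 Pr[C]
Pr[B] = 0.2 Pr[A] + 0.3 Pr[C]
Pr[C] = 0.5 Pr[A] + 0.6 Pr[B] + 0.4 Pr[C]
Pr[A] + Pr[B] + Pr[C] = 1
Pr[A] ≈ 32%
Pr[B] ≈ 21%
Pr[C] ≈ 47%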
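With three unknowns the substitution is more tedious by hand, so here is one possible way to solve the same equations mechanically. The sketch uses plain Gauss-Jordan elimination with exact fractions; this is a generic textbook technique chosen for the illustration, not a method named on the slides, and the variable names are made up.

```python
from fractions import Fraction as F

# P[target][source] = probability of stepping from source to target,
# copied from the three balance equations for A, B and C.
P = {
    "A": {"A": F(3, 10), "B": F(4, 10), "C": F(3, 10)},
    "B": {"A": F(2, 10), "B": F(0),     "C": F(3, 10)},
    "C": {"A": F(5, 10), "B": F(6, 10), "C": F(4, 10)},
}
nodes = ["A", "B", "C"]
n = len(nodes)

# Augmented matrix: balance equations for A and B, then "probabilities sum to 1".
# The balance equation for C is redundant, so the normalization replaces it.
rows = [[P[t][s] - (1 if s == t else 0) for s in nodes] + [F(0)] for t in nodes[:-1]]
rows.append([F(1)] * n + [F(1)])

# Gauss-Jordan elimination with exact arithmetic.
for col in range(n):
    pivot = next(r for r in range(col, n) if rows[r][col] != 0)
    rows[col], rows[pivot] = rows[pivot], rows[col]
    rows[col] = [x / rows[col][col] for x in rows[col]]
    for r in range(n):
        if r != col:
            factor = rows[r][col]
            rows[r] = [x - factor * y for x, y in zip(rows[r], rows[col])]

steady = {node: rows[i][n] for i, node in enumerate(nodes)}
for node, p in steady.items():
    print(node, p, f"{float(p):.1%}")
```

The exact answers are 42/131, 27/131 and 62/131, which agree with the rounded 32%, 21% and 47%. Using `Fraction` avoids any rounding questions for such a small system.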
Why is it interesting to randomly walk web graphs?
Random walks on web graphs
n Web pages are nodes of the graph; a link from one page to another is an edge
n If web users surf randomly, then they are effectively doing a random walk on the web graph
n Computing the relative frequency with which a page is visited would make for a good measure of the page’s popularity
Is surfing really a random walk?
n Of course not…
¨ People are not “robots” clicking randomly on links
¨ People are not synchronized in their clicks
¨ People often just type the URL they want
¨ Links on a page may change over time
¨ Many web pages are dynamic
¨ …
n But despite all that…
¨ Where a random walk is more likely to take us tells us something about which pages are more important!
¨ That’s what models are for!
Random walks on web graphs
n But, the web graph has ~ 100 billion pages! We are not about to write 100 billion equations and solve them?!
n We can simulate the random walk and measure the frequency of visits to the various pages.
n This process is at the heart of Google’s “PageRank” algorithm (and ~ $260 billion market value).
Random Walks as Measures of Importance
n It’s hard to know how to measure “importance” of nodes in graphs
n The probability that a random walk takes me to a particular node seems like a good measure
n Interesting implication of this measure: if I am connected to someone “important”, I am more likely to be “important”, too
¨ Because I am more likely to get random walkers, because there are many of them at my important neighbor
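The "important neighbor" effect can be seen on a toy example. The sketch below is a plain random walk on a hypothetical five-page web, not Google's actual PageRank; the page names, the link structure, and the function `visit_shares` are all assumptions made up for the illustration. Instead of following one surfer, it repeatedly pushes each page's share of a large crowd of walkers evenly along its out-links.

```python
# Hypothetical five-page web; each page lists the pages it links to.
# "friend" and "loner" each have exactly one incoming link:
# "friend" is linked from the popular "hub", "loner" from the ordinary "p2".
links = {
    "hub":    ["friend", "p1"],
    "friend": ["hub", "p2"],
    "p1":     ["hub", "p2"],
    "p2":     ["hub", "loner"],
    "loner":  ["hub", "p1"],
}

def visit_shares(links, rounds=200):
    """Repeatedly push each page's share of walkers evenly along its out-links."""
    share = {page: 1.0 / len(links) for page in links}   # start spread out evenly
    for _ in range(rounds):
        nxt = {page: 0.0 for page in links}
        for page, targets in links.items():
            for target in targets:
                nxt[target] += share[page] / len(targets)
        share = nxt
    return share

share = visit_shares(links)
for page in sorted(share, key=share.get, reverse=True):
    print(f"{page}: {share[page]:.3f}")
```

Although "friend" and "loner" receive the same number of links, "friend" ends up with a noticeably larger share of the walkers, because its single link comes from the page where most walkers gather.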
(Larry) PageRank
PageRank relies on the uniquely democratic nature of the web by using its vast link structure as an indicator of an individual page's value. In essence, Google interprets a link from page A to page B as a vote, by page A, for page B. But, Google looks at more than the sheer volume of votes, or links a page receives; it also analyzes the page that casts the vote. Votes cast by pages that are themselves "important" weigh more heavily and help to make other pages "important".
Source: Google
Other Applications
n Planning capacity of food courts / bathrooms in a building or park
n Pricing of billboard advertising in a mall
n Trying to measure importance of nodes in other graphs
¨ E.g., identifying opinion leaders on a graph of human interactions
¨ E.g., identifying influential interest groups in politics
Google Bombs
n If we know how an algorithm works, we can
manipulate its output.
n How can we take advantage of PageRank to
make a web page show up on the first line of the
first page of a search?
n A big business and even a political instrument! Try
this search: “miserable failure”
[Figure: Example Bomb from 2005-2007]
Food for Thought
n PageRank depends on the web’s link structure to
do a good job. But the web’s link structure is
influenced by how PageRank works!
n In other words, the web is changing because
people modify links in order to rank higher with
Google. Does that mean that this measure of
importance is no longer valid? Is Google
undermining the assumption that a link = a vote?!
n Remember, “all models are wrong, but some are
useful” (G.E.P. Box). Google (and others) are
tweaking the model to make it less wrong and
more useful!