Many data-mining problems can be expressed as finding “similar” sets.
Three Essential Techniques for Similar Documents
- Shingling : convert documents, emails, etc., to sets.
- Minhashing : convert large sets to short signatures, while preserving similarity.
- Locality-sensitive hashing : focus on pairs of signatures likely to be similar.
Jaccard Similarity \( Sim(C_1, C_2) = |C_1 ∩ C_2| / |C_1 ∪ C_2| \)
- Rows = elements of the universal set.
- e.g.: the set of all k-shingles.
- Columns = sets
- 1 in row e and column S iff e is a member of S
- Column similarity is the Jaccard similarity of the corresponding sets (the sets of rows in which the columns have 1).
- Typical matrix is sparse.
- Imagine the rows permuted randomly
- Define minhash function h(C) = the number of the first row (in permuted order) in which column C has 1.
- Use several (e.g., 100) independent hash functions to create a signature for each column.
- The signatures can be displayed in another matrix – the signature matrix – whose columns represent the sets and whose rows represent the minhash values, in order, for each column.
Classify rows by their values in columns C_1 and C_2:
- Type a: 1, 1
- Type b: 0, 1
- Type c: 1, 0
- Type d: 0, 0
- The probability (over all permutations of the rows) that h(C_1) = h(C_2) is the same as Sim(C_1, C_2)
- Both are a/(a+b+c)!
- The similarity of signatures is the fraction of the minhash functions in which they agree
- Thinking of signatures as columns of integers, the similarity of signatures is the fraction of rows in which they agree
- Thus, the expected similarity of two signatures equals the Jaccard similarity of the columns (or sets) that the signatures represent
Implementation: suppose 1 billion rows; permuting them explicitly is infeasible.
- A good approximation to permuting rows: pick, say, 100 hash functions.
- For each column c and each hash function h_i, keep a “slot” M(i,c)
- Intent: M(i,c) will become the smallest value of h_i(r) for which column c has 1 in row r.
- i.e., h_i(r) gives order of rows for ith permutation.
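A minimal Python sketch of this slot update, assuming each column is represented as the set of row numbers where it has a 1, and using random linear functions h_i(r) = (a·r + b) mod p in place of true permutations (names and parameters are illustrative):

```python
import random

def minhash_signatures(columns, num_hashes=100, seed=0):
    """Compute a minhash signature matrix M.

    columns: list of sets; each set holds the row numbers where that column has a 1.
    M[i][c] = smallest value of h_i(r) over rows r in which column c has 1.
    """
    rng = random.Random(seed)
    n_rows = max((r for col in columns for r in col), default=0) + 1
    p = 2**61 - 1  # large prime, so h_i(r) = (a*r + b) mod p behaves like a random permutation
    hashes = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(num_hashes)]

    M = [[float("inf")] * len(columns) for _ in range(num_hashes)]
    for r in range(n_rows):                      # one pass over the rows
        hr = [(a * r + b) % p for a, b in hashes]
        for c, col in enumerate(columns):
            if r in col:                         # column c has 1 in row r
                for i in range(num_hashes):
                    if hr[i] < M[i][c]:          # keep the smallest h_i(r) seen so far
                        M[i][c] = hr[i]
    return M

def signature_similarity(M, c1, c2):
    """Fraction of minhash functions on which two columns agree."""
    return sum(1 for row in M if row[c1] == row[c2]) / len(M)

# Example: two sets with Jaccard similarity 2/4 = 0.5
S1, S2 = {0, 1, 2}, {1, 2, 3}
M = minhash_signatures([S1, S2], num_hashes=200)
print(signature_similarity(M, 0, 1))  # should be close to 0.5
```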
- Big idea: hash columns of signature matrix M several times.
- Arrange that (only) similar columns are likely to hash to the same bucket.
- Candidate pairs are those that hash at least once to the same bucket.
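A small sketch of the banding idea, assuming a signature matrix M like the one computed above (rows = minhash values, columns = sets); the band's tuple of values is used directly as the bucket key:

```python
from collections import defaultdict

def lsh_candidate_pairs(M, bands, rows_per_band):
    """Hash bands of the signature matrix M.

    Columns that agree on all rows of some band land in the same bucket
    for that band and become candidate pairs.
    """
    assert bands * rows_per_band == len(M)
    candidates = set()
    for b in range(bands):
        buckets = defaultdict(list)
        for c in range(len(M[0])):
            # the band's slice of column c, used as the bucket key
            key = tuple(M[b * rows_per_band + i][c] for i in range(rows_per_band))
            buckets[key].append(c)
        for cols in buckets.values():
            for i in range(len(cols)):
                for j in range(i + 1, len(cols)):
                    candidates.add((cols[i], cols[j]))
    return candidates
```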
Multistage Algorithm (frequent itemsets): two hash tables, 2 bitmaps, 3 passes. 1st pass: count items and hash pairs to the first table. 2nd pass: rehash, into a second table, only pairs whose items are frequent and whose first bucket is frequent.
Important points: the hash functions have to be independent, and we need to check both hashes on the third pass. Otherwise we would count pairs (freq item, freq item) that hash to an infrequent bucket in one table but a frequent bucket in the other.
Key idea: use multiple independent hash tables in 1st pass.
Take a random sample of the market baskets.
The simple sampling algorithm only works if the counts of the candidate itemsets fit in main memory. Key "monotonicity" idea for subsets: an itemset cannot be frequent in the entire set of baskets unless it is frequent in at least one subset (chunk).
Toivonen's algorithm: no false negatives; in some cases it may not give an answer, so you need to rerun it (no guarantee of finishing). Start as in the simple sampling algorithm, but lower the support threshold slightly for the sample. Goal is to avoid missing any itemset that is frequent in the full set of baskets.
Negative border: an itemset such as {A, B, C, D} is in the negative border if it is not frequent in the sample, but all of its immediate subsets are.
Communities, C; memberships, M; nodes, V. AGM (Affiliation Graph Model) → graph: \( P(u, v) = 1 - ∏\limits_{c∈ M_u ∩ M_v}(1-p_c) \)
can express a variety of community structures: non-overlapping, overlapping, nested
\( F_{uA} \): the membership strength of node u in community A
\( P(u,v) = 1 - ∏\limits_c (1-p_c(u,v)) \)
Then the probability of at least one common community:
\( P(u,v) = 1 - \exp(-F_u ⋅ F_v^T) \)
Given a Network G(V,E), estimate F
\( \arg\max_F ∏\limits_{(u,v)∈E} p(u,v) ∏\limits_{(u,v)∉E} (1-p(u,v)) \)
take the log likelihood
Gradient descent is slow because computing \( \nabla l(F_u) \) naively takes time linear in the number of nodes
\( ∑\limits_{v∉ N(u)} F_v = ∑\limits_v F_v - F_u - ∑\limits_{v∈ N(u)} F_v \)
We cache \( ∑\limits_v F_v \)
now it takes linear time in the degree
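A sketch of one such update, assuming the per-node log-likelihood \( l(F_u) = ∑_{v∈N(u)} \log(1-\exp(-F_u⋅F_v)) - ∑_{v∉N(u)} F_u⋅F_v \) that follows from the edge probability above; the data structures (F, neighbors, sum_F) are illustrative:

```python
import numpy as np

def gradient_Fu(u, F, neighbors, sum_F):
    """Gradient of the log-likelihood above with respect to row F[u].

    F: (num_nodes x num_communities) nonnegative membership-strength matrix.
    neighbors[u]: set of neighbors of u.
    sum_F: cached column sums of F, i.e. sum_v F[v] (kept up to date as F changes).
    """
    neigh = list(neighbors[u])
    Fn = F[neigh]                                   # rows F_v for v in N(u)
    dots = Fn @ F[u]                                # F_u . F_v for neighbors
    # attractive term from edges (epsilon guards division when F_u . F_v is ~0)
    weights = np.exp(-dots) / (1.0 - np.exp(-dots) + 1e-12)
    grad_edges = (Fn * weights[:, None]).sum(axis=0)
    # repulsive term from non-edges, via the cached sum:
    # sum_{v not in N(u)} F_v = sum_v F_v - F_u - sum_{v in N(u)} F_v
    grad_nonedges = sum_F - F[u] - Fn.sum(axis=0)
    return grad_edges - grad_nonedges

def update_Fu(u, F, neighbors, sum_F, lr=0.01):
    """One gradient step on node u, keeping F nonnegative and the cache consistent."""
    new_row = np.maximum(0.0, F[u] + lr * gradient_Fu(u, F, neighbors, sum_F))
    sum_F += new_row - F[u]
    F[u] = new_row
```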
Goal: find densely linked clusters
Discovering social circles, circles of trust
Graph Partitioning Criteria: Conductance
Adjacency matrix
Graph Partitioning
Task: Partition the graph into two pieces such that the resulting pieces have low conductance.
Problem: Computing optimal cut is NP-hard.
Three basic stages:
- Pre-processing
Construct a matrix representation of the graph
- Decomposition
Compute eigenvalues and eigenvectors of the matrix. Map each point to a lower-dimensional representation based on one or more eigenvectors
- Grouping
Assign points to two or more clusters, based on the new representation
K-way Spectral Clustering: recursive bi-partitioning ('92), disadvantages: inefficient, unstable; cluster multiple eigenvectors ('00), which is preferable
Searching for small communities in the Web graph: frequent itemsets = complete bipartite graphs! View each node i as the set S_i of nodes that i points to. A K_{s,t} = a set Y of size t that occurs in s of the sets S_i; s … minimum support (|S| = s), t … itemset size (|Y| = t)
Data Management vs. Stream Management: in a DBMS (e.g., SQL), inserts are controlled internally. Stream management is important when the input rate is controlled externally, e.g., Google queries.
Ad-hoc queries: asked once about the current state of the stream. Standing queries: active all the time, e.g., report each new maximum value ever seen in stream S. Streams may be archived, but working storage is limited.
Mining query streams Mining click streams IP packets can be monitored at a switch
queries are about a window of length N
what if N > main memory
E.G. Averages
Stream of integers
Standing query: what is the average of the integers in the window(size N)?
For the first N inputs, simply sum them and count them to compute the average
Afterward, when a new input i arrives, change the average by adding (i − j)/N, where j is the oldest element in the window (which then drops out)
Bloom Filters (can have false positives, but no false negatives)
- A Bloom filter is an array of bits, together with a number of hash functions
- The argument of each hash function is a stream element; it returns a position in the array
- Insert: for element x, set the bit at position h(x) to 1 for each hash function h
- E.g.: use N = 11 bits for our filter; stream elements = integers; two hash functions: h_1(x) uses the odd-numbered bits of x, h_2(x) the same but with the even-numbered bits
- Lookup: compute h(y) for each hash function h; if all the resulting bits are 1, say y has appeared in the stream
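A minimal sketch of the insert/lookup logic, using generic salted hash functions rather than the odd/even-bit example above:

```python
class BloomFilter:
    """A minimal Bloom filter: N bits plus a few hash functions.

    False positives are possible, false negatives are not.
    """
    def __init__(self, n_bits=11, n_hashes=2):
        self.n = n_bits
        self.bits = [0] * n_bits
        self.n_hashes = n_hashes

    def _positions(self, x):
        # derive several positions from Python's built-in hash (illustrative choice)
        for i in range(self.n_hashes):
            yield hash((i, x)) % self.n

    def add(self, x):
        for pos in self._positions(x):
            self.bits[pos] = 1

    def __contains__(self, y):
        # lookup: y "may be present" only if every hashed position is 1
        return all(self.bits[pos] for pos in self._positions(y))

bf = BloomFilter()
for elem in [3, 7, 42]:
    bf.add(elem)
print(3 in bf, 100 in bf)  # True, and usually False (could be a false positive)
```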
Counting Bits
Problem: given a stream of 0's and 1's, be prepared to answer queries of the form "how many 1's are in the last k bits?" for any k ≤ N (the window size).
Solution uses O(log² N) bits per stream.
Each bit in the stream has a timestamp, starting 0, 1, … Record timestamps modulo N (the window size), so we can represent any relevant timestamp in O(log₂ N) bits.
A bucket is a segment of the window; it is represented by a record consisting of:
- The timestamp of its end [O(log N) bits]
- The number of 1's between its beginning and end [O(log log N) bits].
Either one or two buckets of each size. Buckets do not overlap. Buckets are sorted by size. Buckets disappear when their end-time is more than N time units in the past.
When a new bit comes in, drop the oldest bucket if its end-time is prior to N time units before the current time. If the current bit is 0, no other changes. If the current bit is 1:
- Create a new bucket of size 1, for just this bit. End timestamp = current time.
- If there are now three buckets of size 1, combine the oldest two into a bucket of size 2
- If there are now three buckets of size 2, combine the oldest two into a bucket of size 4
- And so on …
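A sketch of this bucket-maintenance rule; the bucket representation (end-timestamp, size), kept newest first, is an illustrative choice:

```python
def dgim_update(buckets, bit, t, N):
    """One step of the bucket maintenance described above.

    buckets: list of (end_timestamp, size), newest first; sizes are powers of 2.
    Keeps at most two buckets of each size by merging the oldest two into one
    of twice the size.  Timestamps are kept unreduced for clarity (the notes
    store them modulo N to save bits).
    """
    # drop the oldest bucket if it has fallen out of the window
    if buckets and buckets[-1][0] <= t - N:
        buckets.pop()
    if bit == 1:
        buckets.insert(0, (t, 1))                # new bucket of size 1 for this bit
        size = 1
        while True:
            same = [i for i, (_, s) in enumerate(buckets) if s == size]
            if len(same) < 3:
                break
            i, j = same[-2], same[-1]            # the two oldest buckets of this size
            buckets[i] = (buckets[i][0], 2 * size)   # keep the more recent end-timestamp
            del buckets[j]
            size *= 2
    return buckets
```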
E.g., Google: find the number of unique queries over the past month
E.g.: fixed sample size; sampling key-value pairs
Problem: a data stream consists of elements chosen from a set of size n. Maintain a count of the number of distinct elements seen so far. Applications: How many different words are found among the web pages being crawled at a site? How many unique users visited Facebook in the past month?
Flajolet-Martin approach: pick a hash function h that maps each of the n elements to at least log₂ n bits. For each stream element a, let r(a) be the number of trailing 0's in h(a). Record R = the maximum r(a) seen. Estimate = 2^R.
The probability that a given h(a) ends in at least i 0's is 2^{-i}. If there are m different elements, the probability that R ≥ i is 1-(1-2^{-i})^m. Since 2^{-i} is small, 1-(1-2^{-i})^m ≈ 1 - e^{-m 2^{-i}}.
Fix: partition your estimates into small groups (log n groups, where n = size of the universal set, suffices). Take the average within each group, then take the median of the averages.
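A sketch combining the trailing-zero estimate with the average-then-median fix, using salted SHA-1 as an illustrative source of independent hash functions:

```python
import hashlib

def trailing_zeros(x):
    if x == 0:
        return 32            # convention when the hash value is 0
    r = 0
    while x & 1 == 0:
        x >>= 1
        r += 1
    return r

def fm_estimate(stream, num_groups=8, hashes_per_group=4):
    """Flajolet-Martin style estimate of the number of distinct elements."""
    k = num_groups * hashes_per_group
    R = [0] * k
    for a in stream:
        for i in range(k):
            h = int(hashlib.sha1(f"{i}:{a}".encode()).hexdigest(), 16) & 0xFFFFFFFF
            R[i] = max(R[i], trailing_zeros(h))
    estimates = [2 ** r for r in R]
    averages = [sum(estimates[g*hashes_per_group:(g+1)*hashes_per_group]) / hashes_per_group
                for g in range(num_groups)]
    averages.sort()
    return averages[num_groups // 2]   # median of the group averages

print(fm_estimate(str(i % 500) for i in range(5000)))  # roughly 500, up to FM noise
```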
AMS: expected value of X. The 2nd moment is \( ∑_a (m_a)^2 \). E(X) = (1/n) ∑_{all times t} n · (twice the number of times the stream element at time t appears from that time on, minus 1) = ∑_a (1+3+5+\ldots+(2m_a-1)) = ∑_a m_a². Problem: streams never end, so n is unknown; fixups are needed.
Worked example (trailing zeros for the Flajolet-Martin estimate), with h(x) = 3x + 7 mod 11:
h(1) = 10 (binary 1010), r = 1; h(2) = 13 mod 11 = 2 (0010), r = 1; h(3) = 16 mod 11 = 5 (0101), r = 0; h(4) = 19 mod 11 = 8 (1000), r = 3; h(5) = 22 mod 11 = 0 (0000), r = 4; h(6) = 25 mod 11 = 3 (0011), r = 0; h(7) = 28 mod 11 = 6 (0110), r = 1; h(8) = 31 mod 11 = 9 (1001), r = 0; h(9) = 34 mod 11 = 1 (0001), r = 0; h(10) = 37 mod 11 = 4 (0100), r = 2
long tail
Types of recommendations: editorial and hand-curated; simple aggregates (top 10, most popular, recent uploads); tailored to individual users
C = set of Customers S = set of Items Utility function u: C× S → R R = set of ratings R is a totally ordered set
Implicit ratings: e.g., a purchase implies a high rating
Key problem: the utility matrix U is sparse. Cold start: new items have no ratings, new users have no history. Three approaches to recommender systems: content-based, collaborative, latent factor based
Main idea: recommend to customer x items similar to previous items rated highly by x. Items the user likes → item profiles → user profile
For each item, create an item profile Profile is a set of features (a vector) Text features Profile = set of “important” words in item (document) How to pick important words? TF-IDF
User has rated items with profiles i_1, \ldots, i_n Simple: (weighted) average of rated item profiles Variant: Normalize weights using average rating of user
Items are movies, the only feature is "actor". Suppose user x has watched 5 movies: 2 movies featuring actor A, 3 movies featuring actor B. User profile: weight for A = 2/5 = 0.4, weight for B = 3/5 = 0.6
Same example with 1-5 star ratings: actor A's movies rated 3, 5; actor B's movies rated 1, 2, 4. Useful step: normalize ratings by subtracting the user's mean rating (3), so the profile weight for A = (0+2)/2 = 1 and for B = (−2−1+1)/3 = −2/3
User profile x, item profile i. Estimate u(x, i) = cos(θ) = (x ⋅ i) / (|x||i|), the cosine similarity
No need for data on other users Able to recommend to users with unique tastes Able to recommend new & unpopular items No first-rater problem Explanations for recommended items Content features that caused an item to be recommended
Finding the appropriate features is hard Overspecialization Never recommends items outside user’s content profile People might have multiple interests Unable to exploit quality judgments of other users Cold-start problem for new users How to build a user profile?
Consider user x Find set N of other users whose ratings are “similar” to x’s ratings Estimate x’s ratings based on ratings of users in N
Option 1: Jaccard similarity, sim(A,B) = |r_A ∩ r_B| / |r_A ∪ r_B|. Problem: ignores rating values. Option 2: cosine similarity, sim(A,B) = cos(r_A, r_B). Problem: treats missing data as negative. Option 3: centered cosine: normalize ratings by subtracting the row mean; also known as Pearson correlation
Let r_x be the vector of user x's ratings. Let N be the set of k users most similar to x who have also rated item i. Prediction for user x and item i: Option 1: \( r_{xi} = \frac{1}{k} ∑_{y∈N} r_{yi} \). Option 2: \( r_{xi} = \frac{∑_{y∈N} s_{xy} r_{yi}}{∑_{y∈N} s_{xy}} \), where s_{xy} is the similarity between x and y
Estimate the rating for item i based on ratings for similar items: \( r_{xi} = \cfrac{∑_{j∈N(i;x)} s_{ij}\, r_{xj}}{∑_{j∈N(i;x)} s_{ij}} \)
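A sketch of this item-item prediction with centered-cosine similarities; the rating matrix layout (items × users, NaN for missing) is an assumption for illustration:

```python
import numpy as np

def centered_cosine(a, b):
    """Pearson/centered-cosine similarity between two rating vectors (NaN = missing)."""
    a_c = np.where(np.isnan(a), 0.0, a - np.nanmean(a))
    b_c = np.where(np.isnan(b), 0.0, b - np.nanmean(b))
    denom = np.linalg.norm(a_c) * np.linalg.norm(b_c)
    return float(a_c @ b_c) / denom if denom > 0 else 0.0

def predict_item_item(R, x, i, k=2):
    """Estimate r_xi from the k most similar items that user x has rated.

    R: items x users rating matrix with np.nan for missing ratings.
    """
    rated = [j for j in range(R.shape[0]) if j != i and not np.isnan(R[j, x])]
    sims = sorted(((centered_cosine(R[i], R[j]), j) for j in rated), reverse=True)[:k]
    num = sum(s * R[j, x] for s, j in sims if s > 0)
    den = sum(s for s, _ in sims if s > 0)
    return num / den if den > 0 else np.nanmean(R[i])

# toy example: 3 items x 4 users
R = np.array([[5, 4, np.nan, 1],
              [4, 5, 2, np.nan],
              [1, np.nan, 5, 4]], dtype=float)
print(predict_item_item(R, x=2, i=0))  # predicted rating of user 2 for item 0
```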
In theory, user-user and item-item are dual approaches. In practice, item-item outperforms user-user in many use cases. Items are "simpler" than users: items belong to a small set of "genres", while users have varied tastes, so item similarity is more meaningful than user similarity
Root-mean-square error (RMSE). Problems: a narrow focus on accuracy sometimes misses the point: prediction diversity, prediction context, order of predictions. In practice, we care only about predicting high ratings. Alternative: precision at top k
Multi-scale modeling of the data Global overall deviation Factorization addressing “regional” effects Collaborative filtering extract local patterns
Global: Baseline estimation Local neighborhood (CF/NN): Final estimate
In practice we get better estimates if we model deviations: \( b_{xi} = μ + b_x + b_i \), \( r_{xi} = b_{xi} + \frac{∑_{j∈N(i;x)} s_{ij}(r_{xj}-b_{xj})}{∑_{j∈N(i;x)} s_{ij}} \), where μ = overall mean rating, b_x = rating deviation of user x, b_i = (avg. rating of movie i) − μ. Problems/issues: similarity measures are "arbitrary"; pairwise similarities neglect interdependencies among users; taking a weighted average can be restricting
Recommendations via Optimization
“SVD” on Netflix data: R ≈ Q⋅ P^T SVD gives minimum reconstruction error (Sum of squared errors)
Pros & cons (of CUR-style column/row sampling): + easy interpretation, + sparse basis; − duplicate columns and rows (columns with large norms will be sampled many times)
- Q: Find users that like ‘Matrix’
- A: Map query into a ‘concept space’ – how?
Project into concept space: inner product with each ‘concept’ vector v_i Compactly, we have: q_concept = qV
- Observation: User d that rated (‘Alien’) will be similar to user q that rated (‘Matrix’), although d and q have zero ratings in common!
BFR is a variant of k-means for very large (disk-resident) data sets Assumes each cluster is normally distributed around a centroid in Euclidean space
Points are read from disk one main-memory-full at a time. Most points from previous memory loads are summarized by simple statistics. To begin, from the initial load we select the initial k centroids
Discard set (DS): Points close enough to a centroid to be summarized Compression set (CS): Groups of points that are close together but not close to any existing centroid These points are summarized, but not assigned to a cluster Retained set (RS): isolated points waiting to be assigned to a compression set
For each cluster, DS is summarized by: The number of points, N The vector SUM, whose ith component = sum of the coordinates of the points in ith dimension The vector SUMSQ: ith component = sum of squares of coordinates in ith dimension centroid and variance can be calculated by N, SUM and SUMSQ
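A sketch of such a summary; the class and method names are illustrative, and the Mahalanobis helper anticipates the "close enough" test discussed next:

```python
import numpy as np

class DiscardSetStats:
    """Summary of one BFR cluster: N, SUM, SUMSQ (per dimension)."""
    def __init__(self, dim):
        self.N = 0
        self.SUM = np.zeros(dim)
        self.SUMSQ = np.zeros(dim)

    def add(self, point):
        self.N += 1
        self.SUM += point
        self.SUMSQ += point ** 2

    def centroid(self):
        return self.SUM / self.N

    def variance(self):
        # per-dimension variance: E[x^2] - (E[x])^2
        return self.SUMSQ / self.N - (self.SUM / self.N) ** 2

    def mahalanobis(self, point):
        # normalized distance used by the "close enough" test below
        std = np.sqrt(self.variance()) + 1e-12
        return float(np.sqrt(np.sum(((point - self.centroid()) / std) ** 2)))

ds = DiscardSetStats(dim=2)
for p in [np.array([1.0, 2.0]), np.array([2.0, 3.0]), np.array([1.5, 2.5])]:
    ds.add(p)
print(ds.centroid(), ds.variance(), ds.mahalanobis(np.array([1.6, 2.4])))
```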
Consider merging compressed sets in the CS If this is the last round, merge all compressed sets in the CS and all RS points into their nearest cluster
Q1) How do we decide if a point is "close enough" to a cluster (so that we add it to the DS and discard it)? BFR approach: the Mahalanobis distance to the cluster centroid is less than a threshold, e.g., 3σ, meaning a high likelihood that the point belongs to the currently nearest centroid. Q2) Should 2 CS subclusters be combined? Combine if the combined variance is below some threshold. Many alternatives: treat dimensions differently, consider density
Limitations of the BFR algorithm: it makes strong assumptions (normally distributed, axis-aligned clusters) and does not work when clusters are not linearly separable
CURE: assumes a Euclidean distance; allows clusters to assume any shape; uses a collection of representative points to represent each cluster
Pass 1 of 2: Pick a random sample of points that fit in main memory Cluster sample points hierarchically to create the initial clusters Pick representative points: For each cluster, pick k representative points, as dispersed as possible Move each representative point a fixed fraction (e.g., 20%) toward the centroid of the cluster Pass 2 of 2: Now, rescan the whole dataset and visit each point p in the data set Place it in the “closest cluster”
Matching Algorithm. Problem: find a maximum matching for a given bipartite graph (a perfect one if it exists). There is a polynomial-time offline algorithm based on augmenting paths. Online graph matching problem (girls → boys): in each round, one girl's choices are revealed; at that time, we have to decide either to pair her with one of those boys or not. Example application: assigning tasks to servers.
Greedy: just pick any boy eligible for the new girl. Competitive ratio = min over all possible inputs I of (|M_greedy| / |M_opt|): the worst-case performance of the greedy algorithm over all possible inputs
\( |M_{opt}| \leq 2\,|M_{greedy}| \)
Performance-based advertising works! Multi-billion-dollar industry What ads to show for a given query?
A stream of queries arrives at the search engine: q_1, q_2, … Several advertisers bid on each query. When query q_i arrives, the search engine must pick a subset of advertisers whose ads are shown. Goal: maximize the search engine's revenue. Clearly we need an online algorithm!
CTR (click through rate) * Bid
Given: a set of bids by advertisers for search queries; a click-through rate for each advertiser-query pair; a budget for each advertiser (say for 1 day, month, …); a limit on the number of ads to be displayed with each search query. Respond to each search query with a set of advertisers such that: the size of the set is no larger than the limit; each advertiser has bid on the search query; each advertiser has enough budget left to pay
CTR of an ad is unknown Advertisers have limited budgets and bid on multiple ads (BALANCE algorithm)
CTR for a query-ad pair is measured historically, averaged over a time period. Some complications we won't cover in this lecture: CTR is position dependent; Explore vs. Exploit: keep showing ads we already know the CTR of, or show new ads to estimate their CTR?
Simplest algorithm is greedy.
Two advertisers A and B. A bids on query x, B bids on x and y. Both have budgets of $4. Query stream: x x x x y y y y. Worst-case greedy choice: B B B B _ _ _ _. Optimal: A A A A B B B B. This is the worst case! And greedy is deterministic: it always gives the same answer in the same situation.
BALANCE: for each query, pick the advertiser with the largest unspent budget. Break ties arbitrarily (but in a deterministic way)
BALANCE must exhaust at least one advertiser’s budget: if not, we can allocate more queries Assume BALANCE exhausts A_2’s budget
In the general case, the worst-case competitive ratio of BALANCE is 1 − 1/e ≈ 0.63. Interestingly, no online algorithm has a better competitive ratio
N advertisers: A_1, A_2, …, A_N. Queries: N·B queries appear in N rounds of B queries each. Bidding: round 1 queries: bidders A_1, A_2, …, A_N; round 2 queries: bidders A_2, …, A_N; round i queries: bidders A_i, …, A_N
After k rounds, the allocation to advertiser k is: \( S_k = ∑_{1\leq i \leq k} B/(N-i+1) \)
Fact: \( H_n = ∑_{i=1}^{n} 1/i ≈ \ln n \) for large n (result due to Euler). Setting S_k = B gives \( \ln N - 1 = \ln(N-k) \), i.e., \( k = N(1 - 1/e) \). So after the first k = N(1−1/e) rounds, we cannot allocate a query to any advertiser. Revenue = B·N(1−1/e), so the competitive ratio = 1 − 1/e
So far: all bids = 1, all budgets equal (=B) In a general setting BALANCE can be terrible
Consider query q, bidder i. Bid = x_i; budget = b_i; amount spent so far = m_i; fraction of budget left over: f_i = 1 − m_i/b_i. Define \( φ_i(q) = x_i(1-e^{-f_i}) \). Allocate query q to the bidder i with the largest value of φ_i(q). Same competitive ratio (1 − 1/e)
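A sketch of this generalized BALANCE rule, reusing the two-advertiser example from earlier; the dictionaries used for bids, budgets, and spend are illustrative:

```python
import math

def generalized_balance(query, bidders, bids, budgets, spent):
    """Pick the bidder for one query under the generalized BALANCE rule.

    bids[(i, query)]: bid x_i of bidder i on this query (absent = no bid).
    f_i = fraction of budget left; allocate to the bidder maximizing x_i * (1 - e^{-f_i}).
    Returns the chosen bidder (or None) and charges its bid to `spent`.
    """
    best, best_score = None, 0.0
    for i in bidders:
        x = bids.get((i, query), 0.0)
        if x <= 0 or spent[i] + x > budgets[i]:
            continue                       # didn't bid, or not enough budget left
        f = 1.0 - spent[i] / budgets[i]
        score = x * (1.0 - math.exp(-f))
        if score > best_score:
            best, best_score = i, score
    if best is not None:
        spent[best] += bids[(best, query)]
    return best

# toy run: A bids only on x, B bids on x and y, both with budget 4
bids = {("A", "x"): 1.0, ("B", "x"): 1.0, ("B", "y"): 1.0}
budgets, spent = {"A": 4.0, "B": 4.0}, {"A": 0.0, "B": 0.0}
for q in "xxxxyyyy":
    print(q, generalized_balance(q, ["A", "B"], bids, budgets, spent))
```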
Hinge Loss
There is a subtlety about what a "hash function" really is in the context of an LSH family. A hash function h really takes two elements x and y, and returns a decision whether x and y are candidates for comparison. E.g.: the family of minhash functions computes minhash values and says "yes" iff they are the same. Shorthand: "h(x) = h(y)" means h says "yes" for the pair of elements x and y
Suppose we have a space S of points with a distance measure d. A family H of hash functions is said to be (d_1, d_2, p_1, p_2)-sensitive if for any x and y in S:
- If \( d(x,y) \leq d_1 \), then the probability over all h in H, that h(x) = h(y) is at least p_1.
- If \( d(x,y) \geq d_2 \), then the probability over all h in H, that h(x) = h(y) is at most p_2.
Let S = sets, d = Jaccard distance, H formed from the minhash functions for all permutations. Then Prob[h(x)=h(y)] = 1 - d(x,y). This restates the theorem about Jaccard similarity and minhashing in terms of Jaccard distance. Claim: H is a (1/3, 2/3, 2/3, 1/3)-sensitive family for S and d.
The “bands” technique we learned for signature matrices carries over to this more general setting. Goal: the “S-curve” effect seen here. AND construction like “rows in a band.” OR construction like “many bands.”
Given family H, construct family H' whose members each consist of r functions from H. For \( h = \{h_1, \ldots, h_r\} \) in H', h(x) = h(y) iff h_i(x) = h_i(y) for all i. Theorem: if H is (d_1, d_2, p_1, p_2)-sensitive, then H' is (d_1, d_2, (p_1)^r, (p_2)^r)-sensitive. Proof: use the fact that the h_i's are independent.
Given family H, construct family H' whose members each consist of b functions from H. For \( h = \{h_1, \ldots, h_b\} \) in H', h(x) = h(y) iff h_i(x) = h_i(y) for some i. Theorem: if H is (d_1, d_2, p_1, p_2)-sensitive, then H' is (d_1, d_2, 1-(1-p_1)^b, 1-(1-p_2)^b)-sensitive.
AND makes all probabilities shrink, but by choosing r correctly, we can make the lower probability approach 0 while the higher does not. OR makes all probabilities grow, but by choosing b correctly, we can make the upper probability approach 1 while the lower does not.
As for the signature matrix, we can use the AND construction followed by the OR construction. Or vice-versa. Or any sequence of AND’s and OR’s alternating.
Each of the two probabilities p is transformed into 1-(1-p^r)^b. The “S-curve” studied before. E.g.: Take H and construct H’ by the AND construction with r=4. Then, from H’, construct H” by the OR construction with b=4. (1-(1-p^4)^4)
Each of the two probabilities p is transformed into (1-(1-p)^b)^r. The same S-curve, mirrored horizontally and vertically.
E.g.: apply the (4,4) OR-AND construction followed by the (4,4) AND-OR construction. This transforms a (.2,.8,.8,.2)-sensitive family into a (.2,.8,.9999996,.0008715)-sensitive family.
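A quick numeric check of the quoted figures:

```python
def and_or(p, r, b):
    """r-way AND inside, b-way OR outside: 1 - (1 - p^r)^b."""
    return 1 - (1 - p**r)**b

def or_and(p, b, r):
    """b-way OR inside, r-way AND outside: (1 - (1-p)^b)^r."""
    return (1 - (1 - p)**b)**r

for p in (0.8, 0.2):
    print(p, and_or(or_and(p, 4, 4), 4, 4))
# 0.8 -> ~0.9999996 and 0.2 -> ~0.0008715, matching the figures above
```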
For each S-curve 1-(1-p^r)^b, there is a threshold t for which 1-(1-t^r)^b = t. Above t, high probabilities are increased; below t, they are decreased. You improve the sensitivity as long as the low probability is less than t and the high probability is greater than t. Iterate as you like.
For cosine distance, there is a technique analogous to minhashing for generating a (d_1, d_2, (1-d_1/180), (1-d_2/180))-sensitive family for any d_1 and d_2. It is called the random hyperplane technique.
Each vector v determines a hash function h_v with two buckets: h_v(x) = +1 if \( v ⋅ x > 0 \); −1 if \( v ⋅ x < 0 \). LS-family H = set of all functions derived from any vector. Claim: Prob[h(x)=h(y)] = 1 - (angle between x and y divided by 180)
Pick some number of vectors, and hash your data for each vector. The result is a signature (sketch) of +1's and −1's that can be used for LSH like the minhash signatures for Jaccard distance. But you don't have to think this way: the existence of the LSH family is sufficient for amplification by AND/OR.
We need not pick from among all possible vectors v to form a component of a sketch. It suffices to consider only vectors v consisting of +1 and −1 components.
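A sketch of sketching with ±1 component vectors; the estimate of the angle from the fraction of agreeing components follows the claim above:

```python
import numpy as np

def random_hyperplane_sketch(X, k, seed=0):
    """Sketch each row of X with k random +/-1 hyperplanes.

    Returns a (num_points x k) matrix of +/-1; the fraction of agreeing
    positions estimates 1 - angle(x, y)/180.
    """
    rng = np.random.default_rng(seed)
    V = rng.choice([-1.0, 1.0], size=(k, X.shape[1]))   # +/-1 component vectors
    return np.sign(X @ V.T)

def estimated_angle(s1, s2):
    agreement = np.mean(s1 == s2)
    return 180.0 * (1.0 - agreement)

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 1.0, 3.0])
S = random_hyperplane_sketch(np.vstack([x, y]), k=2000)
true_angle = np.degrees(np.arccos(x @ y / (np.linalg.norm(x) * np.linalg.norm(y))))
print(true_angle, estimated_angle(S[0], S[1]))  # the two should be close
```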
Simple idea: hash functions correspond to lines. Partition the line into buckets of size a. Hash each point to the bucket containing its projection onto the line. Nearby points are always close; distant points are rarely in the same bucket.
If points are distance \( \geq 2a \) apart then \( 60 \leq θ \leq 90 \) for there to be a chance that the points go in the same bucket. I.e., at most 1/3 probability If points are distance \( \leq a/2 \), then there is at least 1/2 chance they share a bucket. Yields a (a/2, 2a, 1/2, 1/3)-sensitive family of hash functions.
For previous distance measures, we could start with a (d, e, p, q)-sensitive family for any d < e, and drive p and q to 1 and 0 by AND/OR constructions. Here, we seem to need \( e \geq 4d \). But as long as d < e, the probability of points at distance d falling in the same bucket is greater than the probability of points at distance e doing so. Thus, the hash family formed by projecting onto lines is a (d, e, p, q)-sensitive family for some p > q.
Instead of generic popularity, can we measure popularity within a topic?
- Goal: evaluate Web pages not just according to their popularity, but by how close they are to a particular topic, e.g., "sports" or "history". Allow search queries to be answered based on the interests of the user.
- E.g.: the query "Trojan" wants different pages depending on whether you are interested in sports, history, or computer security.
- The random walker has a small probability of teleporting at any step. Teleport can go to:
- Standard PageRank: any page with equal probability (to avoid dead-end and spider-trap problems)
- Topic-Specific PageRank: a topic-specific set of "relevant" pages (the teleport set)
- Idea: bias the random walk. When the walker teleports, she picks a page from a set S. S contains only pages that are relevant to the topic, e.g., Open Directory (DMOZ) pages for a given topic/query. For each teleport set S, we get a different vector r_S.
To make this work, all we need is to update the teleporting part of the PageRank formulation:
\begin{equation}
A_{ij} = \begin{cases}
β M_{ij}+(1-β)/|S| &\mbox{if}\ i∈ S \\
β M_{ij} & \mbox{otherwise}
\end{cases}
\end{equation}
A is stochastic!
We weighted all pages in the teleport set S equally
Could also assign different weights to pages!
Random walk with Restart: S is a single element
Compute as for regular PageRank:
Multiply by M, then add a vector
Maintains sparseness
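A sketch of this computation as power iteration, multiplying by M and then adding the teleport vector (so A is never materialized); the toy graph is illustrative:

```python
import numpy as np

def topic_specific_pagerank(M, teleport_set, beta=0.85, iters=100):
    """Power iteration for topic-specific PageRank.

    M: column-stochastic matrix (M[i, j] = prob. of going from page j to page i).
    teleport_set: indices of the topic's "relevant" pages S.
    Computes r = beta * M r + (1 - beta) * e_S / |S| without materializing A.
    """
    n = M.shape[0]
    e_S = np.zeros(n)
    e_S[list(teleport_set)] = 1.0 / len(teleport_set)
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = beta * (M @ r) + (1 - beta) * e_S
    return r

# toy 3-page cycle: 0 -> 1 -> 2 -> 0, teleporting only to page 0
M = np.array([[0, 0, 1],
              [1, 0, 0],
              [0, 1, 0]], dtype=float)
print(topic_specific_pagerank(M, teleport_set={0}))
```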
- Create different PageRanks for different topics
- The 16 DMOZ top-level categories: arts, business, sports, \ldots
- Which topic ranking to use?
- User can pick from a menu
- Classify query into a topic
- Can use the context of the query
- E.g., query is launched from a web page talking about a known topic
- History of queries e.g., “basketball” followed by “Jordan”
- User context, e.g., user's bookmarks, \ldots
a.k.a: Similarity, Relevance
- Shortest path is not good
- No effect of degree-1 nodes (E,F,G)!
- Multi-faceted relationships
- Network flow is not good
- Does not punish long paths
- Multiple Connections
- Quality of connection
- Direct & In-direct connections
- Length, Degree, Weight \ldots
- SimRank: Random walks from a fixed node on k-partite graphs
- Setting: k-partite graph with k types of nodes
- e.g.: picture nodes and tag nodes
- Do a Random Walk with Restarts from node u
- i.e., teleport set S = {u}
- The resulting scores measure similarity to node u
- Problem:
- Must be done once for each node u
- Suitable for sub-Web-scale applications
- Spamming:
- Any deliberate action to boost a web page’s position in search engine results, incommensurate with page’s real value
- Spam:
- Web pages that are the result of spamming
- This is a very broad definition
- SEO industry might disagree!
- SEO = search engine optimization
- Approximately 10-15% of web pages are spam
- Early search engines:
- Crawl the Web
- Index pages by the words they contained
- Respond to search queries (lists of words) with the pages containing those words
- Early page ranking:
- Attempt to order pages matching a search query by “importance”
- First search engines considered:
- Number of times query words appeared
- Prominence of word position, e.g. title, header
- As people began to use search engines to find things on the Web, those with commercial interests tried to exploit search engines to bring people to their own sites, whether they wanted to be there or not
- E.g.:
- Shirt-seller might pretend to be about “movies”
- Techniques for achieving high relevance/importance for a web page
- How do you make your page appear to be about movies?
- (1)Add the word “movie” 1,000 times to your page
- Set text color to the background color, so only search engines would see it
- (2)Or, run the query “movie” on your target search engine
- See what page came first in the listings
- Copy it into your page, make it “invisible”
- These and similar techniques are term spam
- Believe what people say about you, rather than what you say about yourself
- Use words in the anchor text (words that appear underlined to represent the link) and its surrounding text
- PageRank as a tool to measure the “importance” of Web pages
- Our hypothetical shirt-seller loses
- Saying he is about movies doesn’t help, because others don’t say he is about movies
- His page isn’t very important, so it won’t be ranked high for shirts or movies
- E.g.:
- Shirt-seller creates 1,000 pages, each links to his with “movie” in the anchor text
- These pages have no links in, so they get little PageRank
- So the shirt-seller can’t beat truly important movie pages, like IMDB
Google vs. Spammers
- Spam farms were developed to concentrate PageRank on a single page
- Link spam:
- Creating link structures that boost PageRank of a particular page
- Three kinds of web pages from a spammer’s point of view
- Inaccessible pages
- Accessible pages
- e.g., blog comments pages
- spammer can post links to his pages
- Own pages
- Completely controlled by spammer
- May span multiple domain names
- Spammer’s goal:
- Max PageRank of a target page t
- Technique:
- Get as many links from accessible pages as possible to target page t
- Construct “link farm” to get PageRank multiplier effect
N = # of pages on the web; M = # of pages the spammer owns
- x: PageRank contributed by accessible pages
- y: PageRank of target page t
- Rank of each “farm” page \( = \cfrac{β y}{M} + \cfrac{1-β}{N} \)
- \( y = x + β M\left(\cfrac{β y}{M} + \cfrac{1-β}{N}\right) + \cfrac{1-β}{N} = x + β^2y + \cfrac{β(1-β)M}{N}+\xcancel{\cfrac{1-β}{N}} \) (the last term is negligible and is ignored)
- Solving for y: $y = \cfrac{x}{1-β^2} + c\,\cfrac{M}{N}$ where $c = \cfrac{β}{1+β}$
- For β = 0.85, 1/(1-β^2) = 3.6
- Multiplier effect for acquired PageRank
- By making M large, we can make y as large as we want
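A quick numeric check of the formula, with illustrative values for x, M, and N:

```python
def target_pagerank(x, M, N, beta=0.85):
    """y = x/(1 - beta^2) + c*M/N with c = beta/(1 + beta), from the derivation above."""
    c = beta / (1 + beta)
    return x / (1 - beta**2) + c * M / N

# a tiny accessible-PageRank contribution amplified by a hypothetical farm of 10,000 pages
print(1 / (1 - 0.85**2))                                   # multiplier ~ 3.6
print(target_pagerank(x=1e-6, M=10_000, N=1_000_000_000))
```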
- Combating term spam
- Analyze text using statistical methods
- Similar to email spam filtering
- Also useful: Detecting approximate duplicate pages
- Combating link spam
- Detection and blacklisting of structures that look like spam farms
- leads to another war – hiding and detecting spam farms
- TrustRank = topic-specific PageRank with a teleport set of trusted pages
- E.g.: .edu domains, similar domains for non-US schools
- Basic principle: Approximate isolation
- It is rare for a “good” page to point to a “bad” (spam) page
- Sample a set of seed pages from the web
- Have an oracle (human) to identify the good pages and the spam pages in the seed set
- Expensive task, so we must make seed as small as possible
- Call the subset of seed pages that are identified as good the trusted pages
- Perform a topic-sensitive PageRank with teleport set = trusted pages
- Propagate trust through links:
- Each page gets a trust value between 0 and 1
- Solution 1: Use a threshold value and mark all pages below the trust threshold as spam
- Trust attenuation
- The degree of trust conferred by a trusted page decreases with the distance in the graph
- Trust splitting:
- The larger the number of out-links from a page, the less scrutiny the page author gives each out-link
- Trust is split across out-links
- Two conflicting considerations:
- A human has to inspect each seed page, so the seed set must be as small as possible
- Must ensure every good page gets adequate trust rank, so we need to make all good pages reachable from the seed set by short paths
- Suppose we want to pick a seed set of k pages
- How to do that?
- (1)PageRank:
- Pick the top k pages by PageRank
- The idea/hope is that you can’t get a bad page’s rank really really high
- (2) Use trusted domains whose membership is controlled, like .edu, .mil, .gov
- In the TrustRank model, we start with good pages and propagate trust
- Complementary view: What fraction of a page's PageRank comes from spam pages?
- In practice, we don’t know all the spam pages, so we need to estimate
Solution 2:
- r_p = PageRank of page p
- r_p^+ = PageRank of p with teleport into trusted pages only
- Then: What fraction of a page’s PageRank comes from spam pages? r_p^- = r_p - r_p^+
- Spam mass of p = \( \cfrac{r_p^-}{r_p} \)
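A tiny sketch of the spam-mass computation; the example PageRank values are hypothetical:

```python
def spam_mass(r_p, r_p_plus):
    """Spam mass of page p from its PageRank and its trusted-teleport PageRank.

    r_p: ordinary PageRank of p; r_p_plus: PageRank of p with teleport
    restricted to trusted pages.  r_p_minus = r_p - r_p_plus is the part
    estimated to come from spam pages.
    """
    r_p_minus = r_p - r_p_plus
    return r_p_minus / r_p

# a page whose trusted-teleport rank is only a tenth of its ordinary rank
print(spam_mass(r_p=0.002, r_p_plus=0.0002))  # 0.9 -> most of its rank looks like spam
```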