
Week 2

Finding Similar Sets

Many data-mining problems can be expressed as finding “similar” sets.

Similar Documents

Three Essential Techniques for Similar Documents

  1. Shingling : convert documents, emails, etc., to sets.
  2. Minhashing : convert large sets to short signatures, while preserving similarity.
  3. Locality-sensitive hashing : focus on pairs of signatures likely to be similar.

Min-hashing

Jaccard Similarity \( Sim(C_1, C_2) = |C_1 \cap C_2| / |C_1 \cup C_2| \)

From Sets to Boolean Matrices

  • Rows = elements of the universal set.
    • e.g.: the set of all k-shingles.
  • Columns = sets
  • 1 in row e and column S iff e is a member of S
  • Column similarity is the Jaccard similarity of the corresponding sets, i.e., of the sets of rows in which the columns have 1.
  • Typical matrix is sparse.

Minhashing

  • Imagine the rows permuted randomly
  • Define the minhash function h(C) = the number of the first row (in permuted order) in which column C has 1
  • Use several (e.g., 100) independent hash functions to create a signature for each column.
  • The signatures can be displayed in another matrix – the signature matrix – whose columns represent the sets and the rows represent the minhash values, in order for that column.

Surprising Property

Classify the rows by their values in columns C_1 and C_2, and let a, b, c, d denote the number of rows of each type:

  • type a: 1 1
  • type b: 0 1
  • type c: 1 0
  • type d: 0 0

  • The probability (over all permutations of the rows) that h(C_1) = h(C_2) is the same as Sim(C_1, C_2)
  • Both are a/(a+b+c)!

Similarity for Signatures

  • The similarity of signatures is the fraction of the minhash functions in which they agree
    • Thinking of signatures as columns of integers, the similarity of signatures is the fraction of rows in which they agree
  • Thus, the expected similarity of two signatures equals the Jaccard similarity of the columns or sets that the signatures represent

Implementation

Suppose there are 1 billion rows.

  • A good approximation to permuting rows: pick, say, 100 hash functions.
  • For each column c and each hash function h_i, keep a “slot” M(i,c)
  • Intent: M(i,c) will become the smallest value of h_i(r) for which column c has 1 in row r.
    • i.e., h_i(r) gives order of rows for ith permutation.
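A minimal Python sketch of this slot-based implementation, assuming hash functions of the form (a·r + b) mod p as stand-ins for row permutations (the helper names and toy data are illustrative, not from the notes):

```python
import random

def minhash_signatures(columns, num_rows, num_hashes=100, prime=(1 << 61) - 1, seed=0):
    """Slot-based minhash: M[i][c] ends up as the smallest h_i(r) over rows r
    in which column c has a 1. `columns` is a list of sets of row indices."""
    rng = random.Random(seed)
    hashes = [(rng.randrange(1, prime), rng.randrange(prime)) for _ in range(num_hashes)]
    INF = float("inf")
    M = [[INF] * len(columns) for _ in range(num_hashes)]
    for r in range(num_rows):
        hr = [(a * r + b) % prime for (a, b) in hashes]
        for c, rows_with_1 in enumerate(columns):
            if r in rows_with_1:
                for i in range(num_hashes):
                    if hr[i] < M[i][c]:
                        M[i][c] = hr[i]
    return M

# Two toy sets over a universe of 5 rows; signature agreement approximates
# their Jaccard similarity.
sets = [{0, 2, 3}, {0, 3, 4}]
sig = minhash_signatures(sets, num_rows=5)
agree = sum(sig[i][0] == sig[i][1] for i in range(len(sig))) / len(sig)
print("estimated similarity:", agree)
```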

LSH

  • Big idea: hash columns of signature matrix M several times.
  • Arrange that (only) similar columns are likely to hash to the same bucket.
  • Candidate pairs are those that hash at least once to the same bucket.
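A sketch of the banding technique that realizes this idea on a signature matrix (function and variable names are illustrative; buckets are keyed by each band's tuple of minhash values):

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(signatures, bands, rows_per_band):
    """Hash each band of each column's signature to a bucket; any two columns
    that share a bucket in at least one band become a candidate pair.

    `signatures` is a list of columns, each a list of bands*rows_per_band values.
    """
    candidates = set()
    for b in range(bands):
        buckets = defaultdict(list)
        for col, sig in enumerate(signatures):
            band = tuple(sig[b * rows_per_band:(b + 1) * rows_per_band])
            buckets[band].append(col)
        for cols in buckets.values():
            candidates.update(combinations(sorted(cols), 2))
    return candidates
```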

Improvements to A-Priori

PCY Algorithm (Park-Chen-Yu Algorithm)

1st pass: count individual items and, using the otherwise idle memory, hash each pair to a bucket and keep a count per bucket. Between passes, summarize the buckets as a bitmap (1 = frequent bucket). 2nd pass: count only those pairs whose items are both frequent and that hash to a frequent bucket.

Multistage

Two hash tables (with different hash functions), 2 bitmaps, 3 passes. Important points: the hash functions have to be independent, and we need to check both bitmaps on the third pass. Otherwise we would count pairs of frequent items that hash to an infrequent bucket under the first function but to a frequent bucket under the second.

Multihash

Key idea: use multiple independent hash tables in 1st pass.

Single-Pass Approximate Algorithms

All (Or Most) Frequent Itemsets In \leq 2 Passes

Simple Algorithm

Take a random sample of the market baskets.

Savasere-Omiecinski-Navathe (SON) Algorithm

Works only if the candidate itemsets can be counted in main memory. Process the baskets in subsets (chunks) and take every itemset that is frequent in at least one chunk as a candidate. Key "monotonicity" idea: an itemset cannot be frequent in the whole file unless it is frequent in at least one chunk.

Toivonen’s Algorithm

No false negatives; in some cases it may not give an answer, so you need to rerun it (there is no guarantee of finishing). Start as in the simple algorithm, but lower the threshold slightly. The goal is to avoid missing any itemset that is frequent in the whole data set.

Negative border: an itemset is in the negative border if it is not frequent in the sample, but all of its immediate subsets are. E.g., {A, B, C, D} is in the negative border iff it is not frequent in the sample, while {A, B, C}, {A, B, D}, {A, C, D}, and {B, C, D} all are.

Week 3

The Affiliation Graph Model

Community-Affiliation Graph

Communities C, memberships M, nodes V. AGM → graph: \( P(u, v) = 1 - \prod_{c \in M_u \cap M_v} (1 - p_c) \)

AGM: Flexibility

Can express a variety of community structures: non-overlapping, overlapping, nested.

From AGM to BigCLAM

\( F_{uA} \): the membership strength of node u in community A. Each community A links nodes independently: \( P_A(u, v) = 1 - \exp(-F_{uA} \cdot F_{vA}) \)

Community membership strengths form the factor matrix F.

\( P(u,v) = 1 - \prod_C (1 - P_C(u,v)) \). Then the probability that at least one common community C links them:
\begin{align*} P(u,v) &= 1 - \exp\Big(-\sum_C F_{uC} \cdot F_{vC}\Big) \\ &= 1 - \exp(-F_u \cdot F_v^T) \end{align*}

Solving the BigCLAM

How to find F

Given a network G(V, E), estimate F: \( \arg\max_F \prod_{(u,v) \in E} p(u,v) \prod_{(u,v) \notin E} (1 - p(u,v)) \). Take the log-likelihood \( l(F_u) \): \begin{equation*} l(F_u) = \sum_{v \in N(u)} \log\big(1 - \exp(-F_u F_v^T)\big) - \sum_{v \notin N(u)} F_u F_v^T \end{equation*}

BigCLAM: V1.0

Naive gradient ascent is slow because computing \( \bigtriangledown l(F_u) \) takes time linear in the number of nodes.

BigCLAM: V2.0

\( \sum_{v \notin N(u)} F_v = \sum_v F_v - F_u - \sum_{v \in N(u)} F_v \). We cache \( \sum_v F_v \); each gradient step then takes time linear in the degree |N(u)| of u.
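A sketch of one such gradient step in Python, assuming a nonnegative factor matrix F (numpy array) and an adjacency list; the function names and the projected-ascent details are illustrative rather than taken from the notes:

```python
import numpy as np

def bigclam_gradient(F, adj, u, sumF):
    """Gradient of l(F_u) for one node u, using the cached total `sumF` = sum_v F_v.

    F: (num_nodes, num_communities) nonnegative membership strengths.
    adj: dict mapping node -> set of neighbor nodes.
    """
    grad = np.zeros(F.shape[1])
    sum_neigh = np.zeros(F.shape[1])
    for v in adj[u]:
        dot = max(float(F[u] @ F[v]), 1e-9)   # avoid division by zero
        grad += F[v] * (np.exp(-dot) / (1.0 - np.exp(-dot)))
        sum_neigh += F[v]
    # sum over non-neighbors of F_v, computed from the cached total
    grad -= (sumF - F[u] - sum_neigh)
    return grad

def bigclam_step(F, adj, u, sumF, lr=0.01):
    """One projected gradient-ascent step on F_u (memberships stay nonnegative)."""
    new_Fu = np.maximum(0.0, F[u] + lr * bigclam_gradient(F, adj, u, sumF))
    sumF += new_Fu - F[u]   # keep the cached total consistent
    F[u] = new_Fu
    return F, sumF
```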

Detecting Clusters

Goal: find densely linked clusters, e.g., discovering social circles or circles of trust. Graph partitioning criterion: conductance \( \phi(A) = \cfrac{cut(A)}{vol(A)} \), where vol(A) is the total weight of the edges with at least one endpoint in A: \( vol(A) = \sum_{i \in A} d_i \). Why use this criterion? It produces more balanced partitions.

The Graph Laplacian Matrix

Graph partitioning task: partition the graph into two pieces such that the resulting pieces have low conductance. Problem: computing the optimal cut is NP-hard. Let A be the adjacency matrix of the undirected graph G: \( A_{ij} = 1 \) if (i, j) is an edge, else 0. Let x be a vector with one label/value per node of G, and consider the product \( A \cdot x \). The graph Laplacian is \( L = D - A \), where D is the diagonal degree matrix. The second-smallest eigenvalue is \( \lambda_2 = \min_{x \perp \mathbf{1}} \cfrac{x^T L x}{x^T x} \); minimizing \( f(y) = \sum_{(i,j) \in E} (y_i - y_j)^2 = y^T L y \) gives the corresponding eigenvector, the Fiedler vector.

Spectral Clustering Algorithms

Three basic stages:

  1. Pre-processing

Construct a matrix representation of the graph

  2. Decomposition

Compute eigenvalues and eigenvectors of the matrix. Map each point to a lower-dimensional representation based on one or more eigenvectors.

  3. Grouping

Assign points to two or more clusters, based on the new representation

K-way spectral clustering: either recursive bi-partitioning ('92), whose disadvantages are inefficiency and instability, or clustering multiple eigenvectors ('00), which is preferable.
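A compact numpy sketch of the three stages for bi-partitioning (splitting on the sign of the Fiedler vector is one simple grouping choice; the toy adjacency matrix is illustrative):

```python
import numpy as np

def spectral_bipartition(A):
    """Spectral bi-partitioning following the three stages above.

    A: symmetric 0/1 adjacency matrix (numpy array). Returns a boolean
    cluster assignment per node.
    """
    # Pre-processing: Laplacian L = D - A
    D = np.diag(A.sum(axis=1))
    L = D - A
    # Decomposition: eigenvectors of the symmetric matrix L
    vals, vecs = np.linalg.eigh(L)
    fiedler = vecs[:, 1]          # eigenvector of the 2nd-smallest eigenvalue
    # Grouping: split by the sign of the Fiedler-vector entries
    return fiedler >= 0

# Toy graph: two triangles joined by a single edge.
A = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
])
print(spectral_bipartition(A))
```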

Trawling

Searching for small communities in the Web graph. Frequent itemsets = complete bipartite graphs! View each node i as the set S_i of nodes that i points to. A \( K_{s,t} \) is then a set Y of size t that occurs in s of the sets S_i: s is the minimum support (|S| = s) and t is the itemset size (|Y| = t).

Mining Data Streams

The Stream Model

Data management vs. stream management: in a DBMS (e.g., SQL), the insert rate is controlled internally. Stream management is important when the input rate is controlled externally, e.g., Google search queries.

Two Forms of Query

Ad-hoc queries: asked once about the current state of the stream. Standing queries: active all the time, e.g., report each new maximum value ever seen in stream S. Streams can be archived, but working storage is limited.

Applications

Mining query streams Mining click streams IP packets can be monitored at a switch

Sliding Windows

Queries are about a window of the last N elements; what if N does not fit in main memory? E.g., averages: given a stream of integers, the standing query is "what is the average of the integers in the window (size N)?" For the first N inputs, simply sum and count to get the average. Afterward, when a new input i arrives, change the average by adding (i - j)/N, where j is the oldest element, which falls out of the window.
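A small sketch of that incremental update (it keeps the whole window in memory via a deque, so it only illustrates the arithmetic, not the limited-memory setting discussed next):

```python
from collections import deque

class WindowAverage:
    """Maintain the average of the last N integers using the update rule above."""
    def __init__(self, n):
        self.n = n
        self.window = deque()
        self.avg = 0.0

    def add(self, i):
        if len(self.window) < self.n:
            self.window.append(i)
            self.avg = sum(self.window) / len(self.window)
        else:
            j = self.window.popleft()       # oldest element leaves the window
            self.window.append(i)
            self.avg += (i - j) / self.n    # incremental update
        return self.avg
```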

Counting 1’s

Bloom Filters

Bloom filters can have false positives (but no false negatives). A Bloom filter is an array of bits, together with a number of hash functions. The argument of each hash function is a stream element, and it returns a position in the array. E.g.: use N = 11 bits for our filter, stream elements = integers, and two hash functions: h_1(x) = the number formed by the odd-numbered bits of x, h_2(x) = the same, but using the even-numbered bits. Bloom filter lookup of y: compute h(y) for each hash function; if all the resulting positions hold 1, say y is (probably) in the set.
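A minimal Bloom-filter sketch in Python; it uses simple (a·x + b) mod n hash functions for integer elements rather than the odd/even-bit functions of the example above, so the parameters are illustrative only:

```python
class BloomFilter:
    """A bit array plus a few hash functions; membership answers may be
    false positives but never false negatives."""
    def __init__(self, n_bits=11, hash_params=((3, 7), (5, 1))):
        self.n = n_bits
        self.bits = [0] * n_bits
        self.hash_params = hash_params

    def _positions(self, x):
        return [(a * x + b) % self.n for (a, b) in self.hash_params]

    def add(self, x):
        for pos in self._positions(x):
            self.bits[pos] = 1

    def __contains__(self, y):
        # "Probably in the set" if every hashed position is 1.
        return all(self.bits[pos] == 1 for pos in self._positions(y))

bf = BloomFilter()
for x in (1, 9, 8):
    bf.add(x)
print(8 in bf, 5 in bf)   # True (8 was added); False (5 misses a set bit)
```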

Counting 1’s

Counting Bits Problem: given a stream of 0’s and 1’s, be prepared to answer queries of the form “how many 1’s in the last k bits?” where k $\leq$ N

DGIM Algorithm

Uses \( O(\log^2 N) \) bits per stream.

Timestamps

Each bit in the stream has a timestamp, starting 0, 1, … Record timestamps modulo N (the window size), so we can represent any relevant timestamp in \( O(\log_2 N) \) bits.

Buckets

A bucket is a segment of the window; it is represented by a record consisting of:

  1. The timestamp of its end [O(log N) bits]
  2. The number of 1’s between its beginning and its end; this count is a power of 2 [O(log log N) bits].

Representing a Stream by Buckets

Either one or two buckets of each power-of-2 size; buckets do not overlap; buckets are sorted by size (older buckets are at least as large as newer ones); buckets disappear when their end-time is more than N time units in the past.

Updating Buckets

When a new bit comes in, drop the oldest bucket if its end-time is prior to N time units before the current time. If the current bit is 0, no other changes are needed. If the current bit is 1:

  1. Create a new bucket of size 1, for just this bit. End timestamp = current time.
  2. If there are now three buckets of size 1, combine the oldest two into a bucket of size 2
  3. If there are now three buckets of size 2, combine the oldest two into a bucket of size 4, and so on.
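A sketch of this bucket maintenance in Python (timestamps are kept exactly here for clarity, rather than modulo N as the notes suggest; the estimate counts half of the oldest qualifying bucket, as in the standard DGIM answer):

```python
class DGIM:
    """Buckets are [end_timestamp, size] pairs with size a power of 2, newest first."""
    def __init__(self, window_size):
        self.N = window_size
        self.t = 0
        self.buckets = []   # newest first

    def add(self, bit):
        self.t += 1
        # Drop the oldest bucket if it has fallen out of the window.
        if self.buckets and self.buckets[-1][0] <= self.t - self.N:
            self.buckets.pop()
        if bit == 1:
            self.buckets.insert(0, [self.t, 1])
            # Merge while any size occurs three times: combine the two oldest.
            size = 1
            while True:
                idx = [i for i, b in enumerate(self.buckets) if b[1] == size]
                if len(idx) < 3:
                    break
                newer, older = idx[-2], idx[-1]        # two oldest of this size
                self.buckets[newer][1] = 2 * size      # merged bucket keeps the newer end-time
                del self.buckets[older]
                size *= 2

    def count_ones(self, k):
        """Estimate the number of 1's in the last k bits."""
        total, oldest_size = 0, 0
        for end, size in self.buckets:
            if end > self.t - k:
                total += size
                oldest_size = size
        return total - oldest_size // 2 if oldest_size else 0
```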

Sampling Streams

When Sampling Doesn’t Work

E.g.: Google wants to know what fraction of the queries in the past month were unique; a uniform sample of the query stream gives a biased answer, which motivates sampling based on a hash of the key instead.

Sampling Based on Hash Value

E.G.: Fixed Sample Size Sampling Key-Value Pairs

Counting Distinct Items

Computing Moments

Counting Distinct Elements

Problem: a data stream consists of elements chosen from a set of size n. Maintain a count of the number of distinct elements seen so far. Applications: how many different words are found among the web pages being crawled at a site? How many unique users visited Facebook in the past month?

Estimating Counts

Flajolet-Martin approach: pick a hash function h that maps each of the n elements to at least \( \log_2 n \) bits. For each stream element a, let r(a) be the number of trailing 0's in h(a). Record R = the maximum r(a) seen. Estimate = \( 2^R \).

Why it works

The probability that a given h(a) ends in at least i 0's is \( 2^{-i} \). If there are m different elements, the probability that \( R \geq i \) is \( 1 - (1 - 2^{-i})^m \). Since \( 2^{-i} \) is small, \( 1 - (1 - 2^{-i})^m \approx 1 - e^{-m 2^{-i}} \).

Solution

Partition your samples into small groups; \( \log n \) groups, where n = size of the universal set, suffices. Take the average of each group, then take the median of the averages.
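A small sketch of Flajolet-Martin with the group-average / median-of-averages fix (the hash family and group sizes are illustrative choices, not prescribed by the notes):

```python
import random

def flajolet_martin(stream, num_hashes=60, groups=6, seed=0):
    """Estimate the number of distinct integers in `stream`."""
    rng = random.Random(seed)
    prime = (1 << 31) - 1
    hashes = [(rng.randrange(1, prime), rng.randrange(prime)) for _ in range(num_hashes)]

    def trailing_zeros(x):
        return (x & -x).bit_length() - 1 if x else 32

    R = [0] * num_hashes
    for a in stream:
        for i, (c, d) in enumerate(hashes):
            h = (c * a + d) % prime
            R[i] = max(R[i], trailing_zeros(h))

    estimates = [2 ** r for r in R]
    per_group = num_hashes // groups
    avgs = sorted(sum(estimates[g * per_group:(g + 1) * per_group]) / per_group
                  for g in range(groups))
    return avgs[len(avgs) // 2]    # median of the group averages

print(flajolet_martin(random.choices(range(1000), k=10000)))
```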

Generalization: Moments

AMS approach to the 2nd moment \( \sum_a (m_a)^2 \), where m_a is the number of occurrences of element a: pick a random time t, let a be the element arriving at that time, and let X = n(2c - 1), where c is the number of times a appears from time t onward. Then \( E(X) = \frac{1}{n} \sum_{\text{times } t} n\,(2c_t - 1) = \sum_a (1 + 3 + 5 + \ldots + (2m_a - 1)) = \sum_a (m_a)^2 \). Problem: streams never end, so n keeps growing; fixups are needed.

Worked example of the trailing-zeros hash with \( h(x) = (3x + 7) \bmod 11 \):

  • h(1) = 10 = 1010, trailing 0's = 1
  • h(9) = 34 mod 11 = 1 = 0001, trailing 0's = 0
  • h(8) = 31 mod 11 = 9 = 1001, trailing 0's = 0
  • h(5) = 22 mod 11 = 0 = 0000, trailing 0's = 4
  • h(2) = 13 mod 11 = 2 = 0010, trailing 0's = 1
  • h(6) = 25 mod 11 = 3 = 0011, trailing 0's = 0
  • h(3) = 16 mod 11 = 5 = 0101, trailing 0's = 0
  • h(4) = 19 mod 11 = 8 = 1000, trailing 0's = 3
  • h(7) = 28 mod 11 = 6 = 0110, trailing 0's = 1
  • h(10) = 37 mod 11 = 4 = 0100, trailing 0's = 2

Week 4 Recommender System

overview

long tail

Types of Recommendations

  • Editorial and hand-curated
  • Simple aggregates: top 10, most popular, recent uploads
  • Tailored to individual users

Formal Model

C = set of customers, S = set of items. Utility function \( u: C \times S \to R \), where R = set of ratings; R is a totally ordered set.

Key Problems

Gathering Known Ratings

Implicit ratings: e.g., a purchase implies a high rating.

Extrapolating Utilities

Key problem: the utility matrix U is sparse; most people have not rated most items. Cold start: new items have no ratings, and new users have no history. Three approaches to recommender systems: content-based, collaborative, and latent factor based.

Content-based

Main idea: recommend to customer x items similar to previous items rated highly by x. Items the user likes → item profiles → user profile.

Item Profiles

For each item, create an item profile. The profile is a set of features (a vector). Text features: profile = set of "important" words in the item (document). How to pick the important words? TF-IDF.

User Profiles

The user has rated items with profiles \( i_1, \ldots, i_n \). Simple: take the (weighted) average of the rated item profiles. Variant: normalize weights using the user's average rating.

Example 1: Boolean Utility Matrix

Items are movies, and the only feature is "actor". Suppose user x has watched 5 movies: 2 movies featuring actor A and 3 featuring actor B. User profile: weight for A = 2/5 = 0.4, weight for B = 3/5 = 0.6.

Example 2: Star Ratings

Same example, but with 1-5 star ratings. Actor A's movies are rated 3 and 5; actor B's movies are rated 1, 2, and 4. Useful step: normalize ratings by subtracting the user's mean rating (3). The profile weights become A = (0 + 2)/2 = 1 and B = (-2 - 1 + 1)/3 = -2/3.

Making predictions

User profile x, item profile i. Estimate \( u(x, i) = \cos(\theta) = \frac{x \cdot i}{|x||i|} \), the cosine similarity (equivalently, rank by small cosine distance).

Pros: Content-based Approach

  • No need for data on other users
  • Able to recommend to users with unique tastes
  • Able to recommend new & unpopular items: no first-rater problem
  • Explanations for recommended items: the content features that caused an item to be recommended

Cons: Content-based Approach

  • Finding the appropriate features is hard
  • Overspecialization: never recommends items outside the user's content profile, yet people might have multiple interests
  • Unable to exploit quality judgments of other users
  • Cold-start problem for new users: how do we build a user profile?

Collaborative Filtering

Consider user x Find set N of other users whose ratings are “similar” to x’s ratings Estimate x’s ratings based on ratings of users in N

  • Option 1: Jaccard similarity, \( sim(A,B) = |r_A \cap r_B| / |r_A \cup r_B| \). Problem: ignores the rating values.
  • Option 2: cosine similarity, \( sim(A,B) = \cos(r_A, r_B) \). Problem: treats missing data as negative.
  • Option 3: centered cosine: normalize ratings by subtracting the row mean. Also known as Pearson correlation.

Rating Predictions

Let \( r_x \) be the vector of user x's ratings, and let N be the set of the k users most similar to x who have also rated item i. Prediction for user x and item i:

  • Option 1: \( r_{xi} = \frac{1}{k} \sum_{y \in N} r_{yi} \)
  • Option 2: \( r_{xi} = \frac{\sum_{y \in N} s_{xy} r_{yi}}{\sum_{y \in N} s_{xy}} \), where \( s_{xy} \) is the similarity between x and y

Item-Item Collaborative Filtering

Estimate the rating for item i based on ratings for similar items: \( r_{xi} = \cfrac{\sum_{j \in N(i;x)} s_{ij}\, r_{xj}}{\sum_{j \in N(i;x)} s_{ij}} \), where N(i;x) is the set of items rated by x that are most similar to i and \( s_{ij} \) is the similarity of items i and j.
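A compact sketch of this item-item prediction, assuming a users-by-items numpy matrix with NaN for missing ratings and centered-cosine item similarities (no baseline modeling; the toy data are illustrative):

```python
import numpy as np

def predict_item_item(R, x, i, k=2):
    """Predict r_xi from the k items most similar to i that user x has rated."""
    def centered_cosine(a, b):
        mask = ~np.isnan(a) & ~np.isnan(b)
        if mask.sum() == 0:
            return 0.0
        a_c = a[mask] - np.nanmean(a)
        b_c = b[mask] - np.nanmean(b)
        denom = np.linalg.norm(a_c) * np.linalg.norm(b_c)
        return float(a_c @ b_c / denom) if denom else 0.0

    rated = [j for j in range(R.shape[1]) if j != i and not np.isnan(R[x, j])]
    top = sorted(((centered_cosine(R[:, i], R[:, j]), j) for j in rated), reverse=True)[:k]
    num = sum(s * R[x, j] for s, j in top)
    den = sum(s for s, _ in top)
    return num / den if den else np.nan

R = np.array([[4, np.nan, 5, 1],
              [5, 5, 4, np.nan],
              [np.nan, 3, np.nan, 2],
              [1, np.nan, 1, 5]])
print(predict_item_item(R, x=0, i=1))
```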

Item-Item v. User-User

In theory, user-user and item-item are dual approaches. In practice, item-item outperforms user-user in many use cases: items are "simpler" than users; items belong to a small set of "genres", while users have varied tastes; item similarity is more meaningful than user similarity.

Evaluating Recommender System

Root-mean-square error (RMSE). Problems: a narrow focus on accuracy sometimes misses the point: prediction diversity, prediction context, the order of predictions. In practice, we care only about predicting high ratings. Alternative: precision at top k.

Latent Factor Models

A Modern Recommender System

Multi-scale modeling of the data:
  • Global: overall deviations
  • Factorization: addresses "regional" effects
  • Collaborative filtering: extracts local patterns

Modeling Local & Global Effects

Global: Baseline estimation Local neighborhood (CF/NN): Final estimate

Collaborative filtering (Item-Item)

In practice we get better estimates if we model deviations: \( b_{xi} = \mu + b_x + b_i \) and \( r_{xi} = b_{xi} + \cfrac{\sum_{j \in N(i;x)} s_{ij}(r_{xj} - b_{xj})}{\sum_{j \in N(i;x)} s_{ij}} \), where μ = overall mean rating, b_x = rating deviation of user x, and b_i = (avg. rating of movie i) - μ. Problems/issues: similarity measures are "arbitrary"; pairwise similarities neglect interdependencies among users; taking a weighted average can be restricting.

Latent Factor Recommender System

Recommendations via Optimization

Latent Factor Models

“SVD” on Netflix data: R ≈ Q⋅ P^T SVD gives minimum reconstruction error (Sum of squared errors)

Finding the Latent Factors

CUR

Pros & Cons +Easy interpretation +Sparse basis -Duplicate columns and rows columns of large norms will be sampled many times

SVD Example and Conclusion

  • Q: Find users that like ‘Matrix’
  • A: Map query into a ‘concept space’ – how?

Project into the concept space: take the inner product with each "concept" vector \( v_i \). Compactly: \( q_{concept} = qV \).

  • Observation: User d that rated (‘Alien’) will be similar to user q that rated (‘Matrix’), although d and q have zero ratings in common!

Week 5 Clustering

Bradley-Fayyad-Reina (BFR) Algorithm

BFR is a variant of k-means for very large (disk-resident) data sets Assumes each cluster is normally distributed around a centroid in Euclidean space

BFR Algorithm

Points are read from disk one main-memory-full at a time. Most points from previous memory loads are summarized by simple statistics. To begin, from the initial load we select the initial k centroids.

Three Classes of Points

  • Discard set (DS): points close enough to a centroid to be summarized
  • Compression set (CS): groups of points that are close together but not close to any existing centroid; these points are summarized, but not assigned to a cluster
  • Retained set (RS): isolated points waiting to be assigned to a compression set

Summarizing Sets of Points

For each cluster, the DS is summarized by:

  • N, the number of points
  • the vector SUM, whose ith component is the sum of the point coordinates in the ith dimension
  • the vector SUMSQ, whose ith component is the sum of the squares of the coordinates in the ith dimension

The centroid and variance can be calculated from N, SUM, and SUMSQ: centroid = SUM/N, and the variance in dimension i is \( SUMSQ_i/N - (SUM_i/N)^2 \).
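A small sketch of these summaries in Python, plus the normalized (per-dimension Mahalanobis-style) distance used for the "close enough" test discussed in the next section; the names are illustrative:

```python
import numpy as np

def summarize(points):
    """BFR-style cluster summary: (N, SUM, SUMSQ) for a set of d-dimensional points."""
    P = np.asarray(points, dtype=float)
    return len(P), P.sum(axis=0), (P ** 2).sum(axis=0)

def centroid_and_variance(N, SUM, SUMSQ):
    """Recover the centroid and per-dimension variance from the summary alone."""
    centroid = SUM / N
    variance = SUMSQ / N - centroid ** 2
    return centroid, variance

def mahalanobis(point, N, SUM, SUMSQ):
    """Distance normalized by per-dimension standard deviations (axis-aligned clusters)."""
    c, var = centroid_and_variance(N, SUM, SUMSQ)
    return float(np.sqrt((((np.asarray(point) - c) ** 2) / var).sum()))

N, SUM, SUMSQ = summarize([[1, 2], [2, 3], [3, 5]])
print(centroid_and_variance(N, SUM, SUMSQ))
print(mahalanobis([2, 4], N, SUM, SUMSQ))
```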

Processing a chunk of points

Consider merging compressed sets in the CS If this is the last round, merge all compressed sets in the CS and all RS points into their nearest cluster

A Few Details…

Q1) How do we decide if a point is "close enough" to a cluster (and put it in the discard set)? BFR approach: the Mahalanobis distance to the cluster centroid is less than a threshold, e.g., 3σ, which means a high likelihood that the point belongs to the current cluster. Q2) Should 2 CS subclusters be combined? Combine them if the combined variance is below some threshold. Many alternatives: treat dimensions differently, consider density.

CURE Algorithm

Limitations of the BFR algorithm: it makes strong assumptions (normally distributed, axis-aligned clusters) and does not work when the clusters are not linearly separable.

Clustering Using REpresentatives:

Assumes a Euclidean distance Allows clusters to assume any shape Uses a collection of representative points to represent cluster

Starting CURE

Pass 1 of 2: pick a random sample of points that fit in main memory; cluster the sample points hierarchically to create the initial clusters; pick representative points: for each cluster, pick k representative points, as dispersed as possible, and move each representative point a fixed fraction (e.g., 20%) toward the centroid of the cluster. Pass 2 of 2: rescan the whole dataset, visit each point p, and place it in the "closest cluster" (the cluster with the nearest representative point).

Performance-based Advertising

Matching problem: find a maximum matching for a given bipartite graph (a perfect matching if one exists). There is a polynomial-time offline algorithm based on augmenting paths. Online graph matching problem (girls → boys): in each round, one girl's choices are revealed, and at that time we must decide whether to pair her with one of the eligible boys or leave her unmatched. Example application: assigning tasks to servers.

Greedy Algorithm

Just pick any eligible boy for each new girl. Competitive ratio = \( \min_{\text{all possible inputs } I} (|M_{greedy}| / |M_{opt}|) \), the worst-case performance of the greedy algorithm over all possible inputs.

Analyzing the Greedy Algorithm

Mopt<= 2Mgreedy

Algorithmic Challenges

Performance-based advertising works! Multi-billion-dollar industry What ads to show for a given query?

AdWords Problems

A stream of queries arrives at the search engine: q_1, q_2, … Several advertisers bid on each query. When query q_i arrives, the search engine must pick a subset of advertisers whose ads are shown. Goal: maximize the search engine's revenue. Clearly we need an online algorithm!

Expected Revenue

Expected revenue = CTR (click-through rate) × bid

Adwords Problem

Given: a set of bids by advertisers for search queries, a click-through rate for each advertiser-query pair, a budget for each advertiser (say, for 1 day, one month, …), and a limit on the number of ads to be displayed with each search query. Respond to each search query with a set of advertisers such that: the size of the set is no larger than the limit, each advertiser in the set has bid on the search query, and each advertiser has enough budget left to pay.

Limitations of Simple Algorithm

CTR of an ad is unknown Advertisers have limited budgets and bid on multiple ads (BALANCE algorithm)

Estimating CTR

CTR for a query-ad pair is measured historically, averaged over a time period. Some complications we won't cover in this lecture: CTR is position dependent; explore vs. exploit: keep showing ads we already know the CTR of, or show new ads to estimate their CTR?

The BALANCE Algorithms

Dealing with Limited Budgets

Simplest algorithm is greedy.

Bad Scenario for Greedy

Two advertisers A and B. A bids on query x; B bids on x and y. Both have budgets of $4. Query stream: x x x x y y y y. Worst-case greedy choice: B B B B _ _ _ _. Optimal: A A A A B B B B. This is the worst case! Note that greedy is deterministic: it always gives the same answer in the same situation.

BALANCE Algorithm [MSVV]

For each query, pick the advertiser with the largest unspent budget. Break ties arbitrarily (but in a deterministic way).

Analyzing 2-advertiser BALANCE

BALANCE must exhaust at least one advertiser’s budget: if not, we can allocate more queries Assume BALANCE exhausts A_2’s budget

BALANCE: General Result

In the general case, the worst competitive ratio of BALANCE is 1 - 1/e ≈ 0.63. Interestingly, no online algorithm has a better competitive ratio.

Worst case for BALANCE

N advertisers: A_1, A_2, …, A_N. Queries: \( N \cdot B \) queries appear in N rounds of B queries each. Bidding: round 1 queries: bidders A_1, A_2, …, A_N; round 2 queries: bidders A_2, …, A_N; round i queries: bidders A_i, …, A_N.

BALANCE Allocation

After k rounds, the allocation to advertiser k is: \( S_k = \sum_{1 \leq i \leq k} \cfrac{B}{N - i + 1} \)

BALANCE: Analysis

Setting \( S_k = B \) (budgets exhausted) and using the fact, due to Euler, that \( \sum_{i=1}^{n} 1/i \approx \ln n \) for large n gives \( \ln(N) - 1 = \ln(N - k) \), i.e., \( k = N(1 - 1/e) \). So after the first \( k = N(1 - 1/e) \) rounds, we cannot allocate a query to any advertiser. Revenue = \( B \cdot N(1 - 1/e) \). Competitive ratio = 1 - 1/e.

General Version of the Problem

So far: all bids = 1, all budgets equal (=B) In a general setting BALANCE can be terrible

Generalized BALANCE

Consider query q and bidder i: bid = x_i, budget = b_i, amount spent so far = m_i, fraction of budget left over \( f_i = 1 - m_i/b_i \). Define \( φ_i(q) = x_i(1 - e^{-f_i}) \). Allocate query q to the bidder i with the largest value of \( φ_i(q) \). Same competitive ratio (1 - 1/e).
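A sketch of this rule in Python (one ad per query, and the winner pays its bid; the data-structure choices and the toy scenario are illustrative):

```python
import math

def generalized_balance(queries, bids, budgets):
    """Allocate each query to the bidder maximizing phi_i(q) = x_i * (1 - exp(-f_i)),
    where f_i is the bidder's fraction of unspent budget.

    bids: dict query -> {bidder: bid x_i}; budgets: dict bidder -> b_i.
    """
    spent = {i: 0.0 for i in budgets}
    allocation = []
    for q in queries:
        best, best_phi = None, 0.0
        for i, x in bids.get(q, {}).items():
            if spent[i] + x > budgets[i]:
                continue                       # not enough budget left to pay
            f = 1.0 - spent[i] / budgets[i]    # fraction of budget unspent
            phi = x * (1.0 - math.exp(-f))
            if phi > best_phi:
                best, best_phi = i, phi
        if best is not None:
            spent[best] += bids[q][best]
        allocation.append(best)
    return allocation, spent

# The bad-for-greedy scenario above: A bids on x only, B bids on x and y, budgets $4.
alloc, spent = generalized_balance(
    "xxxxyyyy",
    bids={"x": {"A": 1, "B": 1}, "y": {"B": 1}},
    budgets={"A": 4, "B": 4},
)
print(alloc, spent)
```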

Week 6

Soft-Margin SVMs

Hinge Loss

Week 7

LSH Families of Hash Functions

Hash Functions Decide Equality

There is a subtlety about what a "hash function" really is in the context of an LSH family. A hash function h really takes two elements x and y, and returns a decision whether x and y are candidates for comparison. E.g.: the family of minhash functions computes minhash values and says "yes" iff they are the same. Shorthand: "h(x) = h(y)" means h says "yes" for the pair of elements x and y.

LSH Families Defined

Suppose we have a space S of points with a distance measure d. A family H of hash functions is said to be (d_1, d_2, p_1, p_2)-sensitive if for any x and y in S:

  1. If \( d(x,y) \leq d_1 \), then the probability over all h in H, that h(x) = h(y) is at least p_1.
  2. If \( d(x,y) \geq d_2 \), then the probability over all h in H, that h(x) = h(y) is at most p_2.

E.g.: LS Family

Let S = sets, d = Jaccard distance, and let H be formed from the minhash functions for all permutations. Then Prob[h(x)=h(y)] = 1 - d(x,y). This restates the theorem about Jaccard similarity and minhashing in terms of Jaccard distance. Claim: H is a (1/3, 2/3, 2/3, 1/3)-sensitive family for S and d.

Amplifying a LSH-Family

The “bands” technique we learned for signature matrices carries over to this more general setting. Goal: the “S-curve” effect seen here. AND construction like “rows in a band.” OR construction like “many bands.”

AND of Hash Functions

Given family H, construct family H’ whose members each consist of r functions from H. For \( h = {h_1, \ldots, h_r} \) in H’, h(x) = h(y) iff h_i(x) = h_i(y) for all i. Theorem: If H is (d_1, d_2, p_1, p_2)-sensitive, then H’ is (d_1, d_2, (p_1)^r, (p_2)^r)-sensitive. Proof: Use fact that h_i’s are independent.

OR of Hash Functions

Given family H, construct family H’ whose members each consist of b functions from H. For \( h = {h_1, \ldots, h_b} \) in H’, h(x) = h(y) iff h_i(x) = h_i(y) for some i. Theorem: If H is (d_1, d_2, p_1, p_2)-sensitive, then H’ is (d_1, d_2, 1-(1-p_1)^b, 1-(1-p_2)^b)-sensitive.

Effect of AND and OR Constructions

AND makes all probabilities shrink, but by choosing r correctly, we can make the lower probability approach 0 while the higher does not. OR makes all probabilities grow, but by choosing b correctly, we can make the upper probability approach 1 while the lower does not.

Composing Constructions

As for the signature matrix, we can use the AND construction followed by the OR construction. Or vice-versa. Or any sequence of AND’s and OR’s alternating.

AND-OR Composition

Each of the two probabilities p is transformed into 1-(1-p^r)^b. The “S-curve” studied before. E.g.: Take H and construct H’ by the AND construction with r=4. Then, from H’, construct H” by the OR construction with b=4. (1-(1-p^4)^4)
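A tiny sketch that just evaluates these amplification formulas (the printed numbers are the formulas applied to example probabilities, not figures from the notes):

```python
def and_or_sensitivity(p, r, b):
    """Collision probability after an r-way AND followed by a b-way OR: 1 - (1 - p^r)^b."""
    return 1 - (1 - p ** r) ** b

def or_and_sensitivity(p, r, b):
    """Collision probability after a b-way OR followed by an r-way AND: (1 - (1 - p)^b)^r."""
    return (1 - (1 - p) ** b) ** r

# The (4,4) AND-OR construction applied to two base probabilities:
for p in (0.2, 0.8):
    print(p, round(and_or_sensitivity(p, 4, 4), 6))
```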

OR-AND Composition

Each of the two probabilities p is transformed into \( (1-(1-p)^b)^r \). The same S-curve, mirrored horizontally and vertically.

Cascading Constructions

E.g.: apply the (4,4) OR-AND construction followed by the (4,4) AND-OR construction. This transforms a (.2,.8,.8,.2)-sensitive family into a (.2,.8,.9999996,.0008715)-sensitive family.

General Use of S-Curves

For each S-curve \( 1-(1-p^r)^b \), there is a threshold t for which \( 1-(1-t^r)^b = t \). Above t, high probabilities are increased; below t, they are decreased. You improve the sensitivity as long as the low probability is less than t and the high probability is greater than t. Iterate as you like.

More LSH Families

For cosine distance, there is a technique analogous to minhashing for generating a (d_1, d_2, (1-d_1/180), (1-d_2/180))-sensitive family for any d_1 and d_2, called random hyperplanes.

Random Hyperplanes

Each vector v determines a hash function h_v with two buckets: h_v(x) = +1 if \( v \cdot x > 0 \); h_v(x) = -1 if \( v \cdot x < 0 \). LS-family H = set of all functions derived from any vector. Claim: Prob[h(x)=h(y)] = 1 - (angle between x and y divided by 180).

Signatures for Cosine Distance

Pick some number of vectors, and hash your data for each vector. The result is a signature (sketch) of +1's and -1's that can be used for LSH like the minhash signatures for Jaccard distance. But you don't have to think of it this way: the existence of the LSH family is sufficient for amplification by AND/OR.

Simplification

We need not pick from among all possible vectors v to form a component of a sketch. It suffices to consider only vectors v consisting of +1 and -1 components.
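A short numpy sketch of random-hyperplane signatures using ±1 vectors, plus the angle estimate implied by the claim above (the data and the number of vectors are illustrative):

```python
import numpy as np

def random_hyperplane_sketch(X, num_vectors=64, seed=0):
    """Each sketch component is the sign of the dot product with a random +1/-1 vector.

    X: (n_points, d) array. Returns an (n_points, num_vectors) array of +/-1 signs.
    """
    rng = np.random.default_rng(seed)
    V = rng.choice([-1.0, 1.0], size=(num_vectors, X.shape[1]))
    return np.sign(X @ V.T)

def estimated_angle(sketch_a, sketch_b):
    """Estimate the angle (degrees) between two points: fraction of disagreements * 180."""
    return 180.0 * np.mean(sketch_a != sketch_b)

X = np.array([[1.0, 2.0, 0.5], [0.8, 2.2, 0.4], [-1.0, 0.1, 3.0]])
S = random_hyperplane_sketch(X, num_vectors=256)
print(estimated_angle(S[0], S[1]), estimated_angle(S[0], S[2]))
```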

LSH for Euclidean Distance

Simple idea: hash functions correspond to lines. Partition the line into buckets of size a. Hash each point to the bucket containing its projection onto the line. Nearby points are always close; distant points are rarely in the same bucket.

If points are at distance \( \geq 2a \) apart, then the angle θ between the line joining them and the hash line must satisfy \( 60 \leq θ \leq 90 \) for there to be a chance that the points go in the same bucket, i.e., at most 1/3 probability. If points are at distance \( \leq a/2 \), then there is at least a 1/2 chance they share a bucket. This yields an (a/2, 2a, 1/2, 1/3)-sensitive family of hash functions.

Fixup: Euclidean Distance

For previous distance measures, we could start with a (d, e, p, q)-sensitive family for any d < e, and drive p and q to 1 and 0 by AND/OR constructions. Here, we seem to need \( e \geq 4d \). But as long as d < e, the probability of points at distance d falling in the same bucket is greater than the probability of points at distance e doing so. Thus, the hash family formed by projecting onto lines is a (d, e, p, q)-sensitive family for some p > q.

Topic Specific (aka Personalized) PageRank

Instead of generic popularity, can we measure popularity within a topic? Goal: evaluate Web pages not just according to their popularity, but by how close they are to a particular topic, e.g., "sports" or "history". This allows search queries to be answered based on the interests of the user. E.g.: the query "Trojan" wants different pages depending on whether you are interested in sports, history, or computer security. The random walker has a small probability of teleporting at any step. The teleport can go to:

  • Standard PageRank: any page with equal probability (to avoid dead-end and spider-trap problems)
  • Topic-specific PageRank: a topic-specific set of "relevant" pages (the teleport set)

Idea: bias the random walk. When the walker teleports, she picks a page from a set S. S contains only pages that are relevant to the topic, e.g., Open Directory (DMOZ) pages for a given topic/query. For each teleport set S, we get a different vector r_S.

Matrix Formulation

To make this work, all we need is to update the teleportation part of the PageRank formulation: \begin{equation} A_{ij} = \begin{cases} β M_{ij} + (1-β)/|S| & \mbox{if } i \in S \\ β M_{ij} & \mbox{otherwise} \end{cases} \end{equation} A is stochastic! We weighted all pages in the teleport set S equally; we could also assign different weights to pages. Random Walk with Restart: S is a single element. Compute as for regular PageRank: multiply by M, then add a vector; this maintains sparseness.
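A short power-iteration sketch of topic-specific PageRank following the "multiply by M, then add a vector" recipe (the toy matrix and parameters are illustrative; dead ends are assumed to have been handled in M):

```python
import numpy as np

def topic_specific_pagerank(M, teleport_set, beta=0.85, iters=100):
    """M: column-stochastic transition matrix (M[j, i] = prob of moving i -> j).
    teleport_set: node indices the walker may teleport to (weighted equally)."""
    n = M.shape[0]
    ts = list(teleport_set)
    teleport = np.zeros(n)
    teleport[ts] = 1.0 / len(ts)
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = beta * (M @ r) + (1 - beta) * teleport
    return r

# Tiny 3-node example with teleport set S = {0}.
M = np.array([[0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])
print(topic_specific_pagerank(M, teleport_set=[0]))
```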

Discovering the Topic Vector S

  • Create different PageRanks for different topics
    • The 16 DMOZ top-level categories: arts, business, sports, \ldots
  • Which topic ranking to use?
    • User can pick from a menu
    • Classify query into a topic
    • Can use the context of the query
      • E.g., query is launched from a web page talking about a known topic
      • History of queries e.g., “basketball” followed by “Jordan”
  • User context, e.g., the user's bookmarks, \ldots

Application to Measuring Proximity in Graphs

a.k.a: Similarity, Relevance

Good proximity measure?

  • Shortest path is not good
    • No effect of degree-1 nodes (E,F,G)!
    • Multi-faceted relationships
  • Network flow is not good
    • Does not punish long paths

What is good notion of proximity?

  • Multiple Connections
  • Quality of connection
    • Direct & In-direct connections
    • Length, Degree, Weight \ldots

SimRank: Idea

  • SimRank: Random walks from a fixed node on k-partite graphs
  • Setting: k-partite graph with k types of nodes
    • e.g.: picture nodes and tag nodes
  • Do a Random Walk with Restarts from node u
    • i.e., teleport set S = {u}
  • The resulting scores measure similarity to node u
  • Problem:
    • Must be done once for each node u
    • Suitable for sub-Web-scale applications

Web Spam

What is Web Spam?

  • Spamming:
    • Any deliberate action to boost a web page’s position in search engine results, incommensurate with page’s real value
  • Spam:
    • Web pages that are the result of spamming
  • This is a very broad definition
    • The SEO industry might disagree!
    • SEO = search engine optimization
  • Approximately 10-15% of web pages are spam

Web Search

  • Early search engines:
    • Crawl the Web
    • Index pages by the words they contained
    • Respond to search queries (lists of words) with the pages containing those words
  • Early page ranking:
    • Attempt to order pages matching a search query by “importance”
  • First search engines considered:
    1. Number of times query words appeared
    2. Prominence of word position, e.g. title, header

First Spammers

  • As people began to use search engines to find things on the Web, those with commercial interests tried to exploit search engines to bring people to their own site – whether they wanted to be there or not
  • E.g.:
    • Shirt-seller might pretend to be about “movies”
  • Techniques for achieving high relevance/importance for a web page

First Spammers: Term Spam

  • How do you make your page appear to be about movies?
    • (1)Add the word “movie” 1,000 times to your page
    • Set text color to the background color, so only search engines would see it
    • (2)Or, run the query “movie” on your target search engine
    • See what page came first in the listings
    • Copy it into your page, make it “invisible”
  • These and similar techniques are term spam

Google’s Solution to Term Spam

  • Believe what people say about you, rather than what you say about yourself
    • Use words in the anchor text (words that appear underlined to represent the link) and its surrounding text
  • PageRank as a tool to measure the “importance” of Web pages

Why It Works?

  • Our hypothetical shirt-seller loses
    • Saying he is about movies doesn’t help, because others don’t say he is about movies
    • His page isn’t very important, so it won’t be ranked high for shirts or movies
  • E.g.:
    • Shirt-seller creates 1,000 pages, each links to his with “movie” in the anchor text
    • These pages have no links in, so they get little PageRank
    • So the shirt-seller can’t beat truly important movie pages, like IMDB

Spam Farming

Google vs. Spammers

  • Spam farms were developed to concentrate PageRank on a single page
  • Link spam:
    • Creating link structures that boost PageRank of a particular page

Link Spamming

  • Three kinds of web pages from a spammer’s point of view
    • Inaccessible pages
    • Accessible pages
      • e.g., blog comment pages
      • the spammer can post links to his pages
    • Own pages
      • Completely controlled by the spammer
      • May span multiple domain names

Link Farms

  • Spammer’s goal:
    • Max PageRank of a target page t
  • Technique:
    • Get as many links from accessible pages as possible to target page t
    • Construct “link farm” to get PageRank multiplier effect

./771.png

Analysis

N – # of pages on the Web; M – # of pages the spammer owns

  • x: PageRank contributed by accessible pages
  • y: PageRank of target page t
  • Rank of each “farm” page \( = \cfrac{β y}{M} + \cfrac{1-β}{N} \)

$\require{cancel}$

\begin{align*} y &= x + β M \left[\cfrac{β y}{M} + \cfrac{1-β}{N}\right] + \cfrac{1-β}{N} \\ &= x + β^2 y + \cfrac{β(1-β)M}{N} + \xcancel{\cfrac{1-β}{N}} \end{align*}

(the crossed-out last term is negligible and is dropped)

  • $y = \cfrac{x}{1-β^2} + c\cfrac{M}{N}$, where $c = \cfrac{β}{1+β}$
  • For β = 0.85, 1/(1-β^2) = 3.6
  • Multiplier effect for acquired PageRank
  • By making M large, we can make y as large as we want
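A one-function sketch of this formula (the numbers passed in are made up for illustration):

```python
def target_pagerank(x, M, N, beta=0.85):
    """PageRank y of the spam-farm target page per the formula above:
    y = x / (1 - beta^2) + c * M / N, with c = beta / (1 + beta)."""
    c = beta / (1 + beta)
    return x / (1 - beta ** 2) + c * M / N

# With beta = 0.85 the external contribution x is amplified by about 3.6x,
# and each additional farm page adds roughly c/N to y.
print(target_pagerank(x=0.0005, M=10_000, N=1_000_000))
```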

TrustRank

Combating Spam

  • Combating term spam
    • Analyze text using statistical methods
    • Similar to email spam filtering
    • Also useful: Detecting approximate duplicate pages
  • Combating link spam
    • Detection and blacklisting of structures that look like spam farms
      • leads to another war – hiding and detecting spam farms
    • TrustRank = topic-specific PageRank with a teleport set of trusted pages
      • E.g.: .edu domains, similar domains for non-US schools

TrustRank: Idea

  • Basic principle: Approximate isolation
    • It is rare for a “good” page to point to a “bad” (spam) page
  • Sample a set of seed pages from the web
  • Have an oracle (human) to identify the good pages and the spam pages in the seed set
    • Expensive task, so we must make seed as small as possible

Trust Propagation

  • Call the subset of seed pages that are identified as good the trusted pages
  • Perform a topic-sensitive PageRank with teleport set = trusted pages
    • Propagate trust through links:
      • Each page gets a trust value between 0 and 1
  • Solution 1: use a threshold value and mark all pages below the trust threshold as spam

Why is it a good idea?

  • Trust attenuation
    • The degree of trust conferred by a trusted page decreases with the distance in the graph
  • Trust splitting:
    • The larger the number of out-links from a page, the less scrutiny the page author gives each out-link
    • Trust is split across out-links

Picking the Seed Set

  • Two conflicting considerations:
    • A human has to inspect each seed page, so the seed set must be as small as possible
    • Must ensure every good page gets an adequate trust rank, so we need to make all good pages reachable from the seed set by short paths

Approaches to Picking Seed Set

  • Suppose we want to pick a seed set of k pages
  • How to do that?
  • (1)PageRank:
    • Pick the top k pages by PageRank
    • The idea/hope is that you can’t get a bad page’s rank really really high
  • (2) Use trusted domains whose membership is controlled, like .edu, .mil, .gov

Spam Mass

  • In the TrustRank model, we start with good pages and propagate trust

  • Complementary view: what fraction of a page’s PageRank comes from spam pages?

  • In practice, we don’t know all the spam pages, so we need to estimate

Solution 2:

  • r_p = PageRank of page p
  • r_p^+ = PageRank of p with teleport into trusted pages only
  • Then \( r_p^- = r_p - r_p^+ \) estimates the part of p’s PageRank that comes from spam pages
  • Spam mass of p = \( \cfrac{r_p^-}{r_p} \)
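A minimal sketch of the spam-mass computation, assuming the two PageRank vectors are already available as dictionaries (the scores below are made-up illustrative numbers):

```python
def spam_mass(r, r_trusted):
    """Spam mass per page: (r_p - r_p^+) / r_p, where r is ordinary PageRank and
    r_trusted is PageRank computed with teleports into trusted pages only.
    Pages with spam mass close to 1 are likely spam; the cutoff is a policy choice."""
    return {p: (r[p] - r_trusted[p]) / r[p] for p in r}

r = {"a": 0.05, "b": 0.02, "c": 0.01}
r_plus = {"a": 0.045, "b": 0.004, "c": 0.009}
print(spam_mass(r, r_plus))
```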