Major classes of recommender systems:
- Content based
- Collaborative filtering
- Latent factor based
Formal model:
C - set of Customers
S - set of Items
Utility function $u: C \times S \to R$ (set of Ratings)
R - totally ordered
Key problems:
- Gathering "known" ratings for matrix
- Explicit (ask people; doesn't scale)
- Implicit (learn from actions, hard to learn low ratings)
- Extrapolate unknown ratings from the known ones (mostly interested in high unknown ratings)
- Sparsity
- Cold start: new items have no ratings, new users have no history
- Evaluating extrapolation methods
Content-based recommendations
Main idea: recommend to customer x items similar to previous items rated highly by x.
Item profiles - set of "important" features
Text features - set of "important" words
Extract using heuristics like TF-IDF (Term frequency * Inverse Doc Frequency)
$f_{ij}$ - frequency of term (feature) i in doc (item) j
$TF_{ij} = \frac{f_{ij}}{\max_k f_{kj}}$
$n_i$ - number of docs that mention term i
$N$ - total number of docs
$IDF_i = \log(N / n_i)$. TF-IDF score: $w_{ij} = TF_{ij} \times IDF_i$
Doc profile - set of words with highest TF-IDF scores (together with scores)
I.e. a word that is frequent in only a (small) set of documents is important.
Build the user profile as an aggregate of the profiles of the items the user has rated (aggregation can be the mean or a normalized/weighted mean of item features), then estimate $u(x, i) = \cos(\theta) = \frac{x \cdot i}{\|x\|\,\|i\|}$ where x is the user profile and i is the item profile.
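To make the pipeline concrete, here is a minimal numpy sketch of TF-IDF item profiles, a mean-aggregated user profile and cosine scoring; the toy count matrix, the set of liked items and the plain-mean aggregation are illustrative assumptions, not taken from the notes.

```python
import numpy as np

def tf_idf(term_doc_counts):
    """term_doc_counts: (n_terms, n_docs) raw frequencies f_ij.
    Returns w_ij = TF_ij * IDF_i as in the formulas above."""
    f = np.asarray(term_doc_counts, dtype=float)
    tf = f / f.max(axis=0, keepdims=True)          # TF_ij = f_ij / max_k f_kj
    n_i = (f > 0).sum(axis=1)                      # number of docs mentioning term i
    idf = np.log(f.shape[1] / n_i)                 # IDF_i = log(N / n_i)
    return tf * idf[:, None]

def cosine(x, i):
    return float(x @ i / (np.linalg.norm(x) * np.linalg.norm(i)))

# toy example: 4 terms, 3 items (documents)
counts = np.array([[3, 0, 1],
                   [0, 2, 0],
                   [1, 1, 1],
                   [0, 0, 4]])
item_profiles = tf_idf(counts)                     # columns are item profiles

# user profile: plain mean of the profiles of items the user liked (items 0 and 2)
user_profile = item_profiles[:, [0, 2]].mean(axis=1)

# score an unseen item by cosine similarity with the user profile
print(cosine(user_profile, item_profiles[:, 1]))
```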
Pros:
- No need for data on other users
- Able to recommend to users with unique tastes
- Able to recommend new & unpopular items
- Explanation for recommended items
Cons:
- Hard to find appropriate features
- Overspecialization - never recommend items outside user's profile, unable to exploit quality judgments of other users
- Cold start problem - how to build user profile
Collaborative filtering
Main idea: find a set N of other users whose ratings are similar to x's ratings, and estimate x's ratings based on the ratings of the users in N.
Similarity between users:
- Jaccard similarity $sim(A,B) = \frac{|r_A \cap r_B|}{|r_A \cup r_B|}$ doesn't work well (it ignores the rating values).
- Cosine similarity $sim(A,B) = \cos(r_A, r_B)$
- Centered cosine is better: center each user's ratings around their average value (make the user's average rating 0), then take $sim(A,B) = \cos(r_A, r_B)$ - see the sketch below.
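A minimal sketch of centered cosine similarity between two users' rating vectors; the convention that 0 marks a missing rating is an assumption made for the example.

```python
import numpy as np

def centered_cosine(ra, rb):
    """Centered cosine (Pearson-style) similarity between two rating vectors.
    Zeros are treated as missing ratings and stay zero after centering."""
    ra, rb = np.array(ra, float), np.array(rb, float)
    for r in (ra, rb):
        rated = r != 0
        r[rated] -= r[rated].mean()        # make the user's average rating 0
    denom = np.linalg.norm(ra) * np.linalg.norm(rb)
    return float(ra @ rb / denom) if denom else 0.0

# two users rated items on a 1..5 scale; 0 = not rated
print(centered_cosine([4, 0, 0, 5, 1], [5, 5, 4, 0, 0]))
```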
Let $r_x$ be the vector of user x's ratings
Let N be the set of k users most similar to x who also rated item i
Prediction for user x and item i:
Option 1: $r_{xi} = \frac{1}{k}\sum_{y \in N} r_{yi}$ - doesn't take into account the "degree of similarity" of users
Option 2: $r_{xi} = \frac{\sum_{y \in N} s_{xy} r_{yi}}{\sum_{y \in N} s_{xy}}$ where $s_{xy} = sim(x,y)$
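A sketch of the Option 2 prediction on a toy ratings matrix (again 0 = unrated); keeping only positively similar neighbours is a practical tweak added here, not part of the formula above.

```python
import numpy as np

def centered_cosine(ra, rb):
    # as in the previous sketch: subtract each user's mean over rated items
    ra, rb = np.array(ra, float), np.array(rb, float)
    for r in (ra, rb):
        r[r != 0] -= r[r != 0].mean()
    denom = np.linalg.norm(ra) * np.linalg.norm(rb)
    return float(ra @ rb / denom) if denom else 0.0

def predict_user_user(R, x, i, k=2):
    """Option 2: similarity-weighted average of the ratings of the k users
    most similar to x who rated item i (only positively similar users kept)."""
    sims = {y: centered_cosine(R[x], R[y]) for y in range(len(R)) if y != x}
    candidates = [y for y, s in sims.items() if R[y, i] != 0 and s > 0]
    top = sorted(candidates, key=sims.get, reverse=True)[:k]
    s = np.array([sims[y] for y in top])
    return float(s @ R[top, i] / s.sum()) if len(top) else 0.0

R = np.array([[4, 0, 5, 0],
              [5, 1, 4, 2],
              [3, 0, 4, 5],
              [4, 2, 0, 3]], dtype=float)
print(predict_user_user(R, x=0, i=3))   # estimate user 0's rating of item 3
```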
Another version is Item-Item filtering (User-User described right above)
- For item i find other similar items
- Estimate rating for item i based on ratings for similar items
- Can use same similarity metrics and prediction functions as in user-user model
In practice item-item outperforms user-user in many use cases: items are "simpler" than users, so item similarity is more meaningful than user similarity.
Common practice when implementing collaborative filtering: combine the neighborhood estimate with a baseline (bias) estimate
$r_{xi} = b_{xi} + \frac{\sum_{j \in N(i;x)} s_{ij}(r_{xj} - b_{xj})}{\sum_{j \in N(i;x)} s_{ij}}$ where $b_{xi} = \mu + b_x + b_i$, $\mu$ - overall mean movie rating (movie database example), $b_x$ - rating deviation of user x = (avg rating of user x) - $\mu$, $b_i$ - rating deviation of movie i
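A sketch of the baseline part $b_{xi} = \mu + b_x + b_i$ on a toy matrix with 0 = unrated; the neighborhood correction would then be added on top exactly as in the formula above.

```python
import numpy as np

def baseline(R):
    """Return mu, per-user deviations b_x and per-item deviations b_i
    for a ratings matrix R where 0 means 'unrated'."""
    rated = R != 0
    mu = R[rated].mean()                                   # overall mean rating
    b_x = np.array([R[u, rated[u]].mean() - mu if rated[u].any() else 0.0
                    for u in range(R.shape[0])])           # user deviations
    b_i = np.array([R[rated[:, j], j].mean() - mu if rated[:, j].any() else 0.0
                    for j in range(R.shape[1])])           # item deviations
    return mu, b_x, b_i

R = np.array([[4, 0, 5, 0],
              [5, 1, 4, 2],
              [3, 0, 4, 5]], dtype=float)
mu, b_x, b_i = baseline(R)
print(mu + b_x[0] + b_i[3])   # baseline estimate b_xi for user 0, item 3
```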
Evaluating recommender systems
Predict ratings on test set and evaluate RMSE
$RMSE = \sqrt{\frac{\sum_{(x,i) \in T} (r_{xi} - r^*_{xi})^2}{N}}$
$N = |T|$, $r_{xi}$ the predicted rating, $r^*_{xi}$ the actual rating.
Problems with RMSE - prediction diversity (all predictions may be too similar to each other), prediction context (the user might operate in a different context => wants different results), order of predictions.
We mostly care about predicting high ratings - RMSE might penalize a method that does well on high ratings and badly on the others. Alternative: precision at top k, i.e. the percentage of predictions that fall in the user's top k ratings.
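A small sketch of both metrics; the test pairs are made up, and `precision_at_k` implements one common reading of "precision at top k" (overlap of the top-k predicted and top-k actual items).

```python
import numpy as np

def rmse(pred, actual):
    pred, actual = np.asarray(pred, float), np.asarray(actual, float)
    return float(np.sqrt(((pred - actual) ** 2).mean()))

def precision_at_k(pred, actual, k=3):
    """Fraction of the k highest-predicted items that are also among
    the user's k highest actually-rated items."""
    top_pred = set(np.argsort(pred)[-k:])
    top_true = set(np.argsort(actual)[-k:])
    return len(top_pred & top_true) / k

predicted = [4.2, 3.1, 4.8, 2.0, 3.9]
actual    = [5.0, 3.0, 4.0, 1.0, 4.0]
print(rmse(predicted, actual), precision_at_k(predicted, actual))
```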
Latent Factor Models
Global effects => like Baseline Estimation
Local neighborhood (CF/NN) => Final estimate
Collaborative filtering - derive the unknown rating from the k nearest neighbors:
$\hat r_{xi} = \frac{\sum_{j \in N(i;x)} s_{ij} \cdot r_{xj}}{\sum_{j \in N(i;x)} s_{ij}}$
$s_{ij}$ - similarity of items i and j
$r_{xj}$ - rating of user x on item j
$N(i;x)$ - set of items similar to item i that were rated by x
Combine with global effects
$\hat r_{xi} = b_{xi} + \frac{\sum_{j \in N(i;x)} s_{ij} \cdot (r_{xj} - b_{xj})}{\sum_{j \in N(i;x)} s_{ij}}$
$b_{xi} = \mu + b_x + b_i$
$\mu$ - overall mean rating
$b_x$ - rating deviation of user x
$b_i$ - rating deviation of item i
Problems:
1) Similarity measures are arbitrary
2) Pairwise similarities neglect interdependencies among users
3) Taking weighted average can be restricting
SVD
A - input data matrix
U - left singular vectors
V - right singular vectors
$\Sigma$ - singular values
$A \approx U \Sigma V^T$; for recommendations, write the "SVD" as $R \approx Q \cdot P^T$ with $A = R$, $Q = U$, $P^T = \Sigma V^T$
R - utility matrix approximated by Q and P, matrices of lower dimensions
SVD minimizes the reconstruction error
$\min_{U,\Sigma,V} SSE = \min_{U,\Sigma,V} \sum_{(i,j) \in A} (A_{ij} - [U \Sigma V^T]_{ij})^2$
SSE - Sum of Squared Errors
SSE and RMSE are monotonically related: $RMSE = \frac{1}{c}\sqrt{SSE}$, so minimizing SSE also minimizes RMSE
But R has missing entries!
So modify SVD to find P, Q such that
$\min_{P,Q} \sum_{(i,x) \in R} (r_{xi} - q_i \cdot p_x^T)^2$ (sum only over elements with defined ratings)
also
- we don't require P,Q to be orthogonal/unit length
- P,Q map users/movies to latent space
Add regularization: $\min_{P,Q} \sum_{(i,x) \in R} (r_{xi} - q_i \cdot p_x^T)^2 + \lambda \left[ \sum_x \|p_x\|^2 + \sum_i \|q_i\|^2 \right]$
and use gradient descent
Initialize P and Q (using SVD - pretend the missing ratings are 0)
Do gradient descent
- $P \leftarrow P - \eta \cdot \nabla_P$
- $Q \leftarrow Q - \eta \cdot \nabla_Q$
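A compact numpy sketch of the regularized objective optimized with full-batch gradient descent over the known entries; the random initialization (instead of an SVD warm start), rank, learning rate and $\lambda$ are arbitrary illustrative choices.

```python
import numpy as np

def factorize(R, rank=2, eta=0.01, lam=0.1, steps=500):
    """Minimize sum over known r_xi of (r_xi - q_i . p_x)^2
    + lam * (sum ||p_x||^2 + sum ||q_i||^2) by gradient descent."""
    rng = np.random.default_rng(0)
    n_users, n_items = R.shape
    P = rng.normal(scale=0.1, size=(n_users, rank))   # user factors p_x
    Q = rng.normal(scale=0.1, size=(n_items, rank))   # item factors q_i
    known = R != 0                                     # 0 marks a missing rating
    for _ in range(steps):
        E = known * (R - P @ Q.T)                      # errors on known entries only
        grad_P = -2 * E @ Q + 2 * lam * P
        grad_Q = -2 * E.T @ P + 2 * lam * Q
        P -= eta * grad_P                              # P <- P - eta * grad_P
        Q -= eta * grad_Q                              # Q <- Q - eta * grad_Q
    return P, Q

R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)
P, Q = factorize(R)
print(np.round(P @ Q.T, 2))   # predicted ratings for all (user, item) pairs
```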
Modeling biases and interactions
$\hat r_{xi} = \mu + b_x + b_i + q_i \cdot p_x^T$
where
$\mu$ - overall mean rating
$b_x$ - bias for user x
$b_i$ - bias for movie i
$q_i \cdot p_x^T$ - user-movie interaction
and we can fit a new model
$\min_{P,Q,b_x,b_i} \sum_{(i,x) \in R} (r_{xi} - (\mu + b_x + b_i + q_i \cdot p_x^T))^2 + \lambda \left[ \sum_x \|p_x\|^2 + \sum_i \|q_i\|^2 + \sum_x \|b_x\|^2 + \sum_i \|b_i\|^2 \right]$
where $b_x$ and $b_i$ become additional parameters of the estimation
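A short sketch of the biased prediction and the corresponding regularized objective; the random P, Q and zero-initialized biases are placeholders just to show the bookkeeping.

```python
import numpy as np

def predict(mu, b_x, b_i, P, Q, x, i):
    """Biased latent-factor prediction: mu + b_x + b_i + q_i . p_x."""
    return mu + b_x[x] + b_i[i] + Q[i] @ P[x]

def objective(R, mu, b_x, b_i, P, Q, lam=0.1):
    """Regularized squared error over the known ratings (0 = unrated)."""
    err = sum((R[x, i] - predict(mu, b_x, b_i, P, Q, x, i)) ** 2
              for x, i in zip(*np.nonzero(R)))
    reg = lam * ((P ** 2).sum() + (Q ** 2).sum()
                 + (b_x ** 2).sum() + (b_i ** 2).sum())
    return err + reg

rng = np.random.default_rng(0)
R = np.array([[5, 3, 0], [4, 0, 1]], dtype=float)
P, Q = rng.normal(size=(2, 2)), rng.normal(size=(3, 2))    # placeholder factors
b_x, b_i, mu = np.zeros(2), np.zeros(3), R[R != 0].mean()  # start biases at 0
print(objective(R, mu, b_x, b_i, P, Q))
```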
Dimensionality Reduction
Matrix rank - the number of linearly independent columns/rows. The number of coordinates needed to represent the data can be reduced to the rank.
Why reduce?
- Discover hidden correlation/topics
- Remove redundant and noisy features
- Interpretation and visualization
- Easier storage and processing of data
SVD definition
$A_{m \times n} = U_{m \times r}\, \Sigma_{r \times r}\, (V_{n \times r})^T$
A: input matrix (say, m documents, n terms)
U: left singular vectors, $m \times r$ matrix (say, m documents, r concepts)
$\Sigma$: singular values, $r \times r$ diagonal matrix, values on the diagonal sorted in decreasing order ("strength" of each concept)
V: right singular vectors, $n \times r$ matrix (n terms, r concepts)
$A \approx U \Sigma V^T = \sum_i \sigma_i u_i v_i^T$
Decomposition is always possible, the matrices are unique, U and V columns are orthonormal ($U^T U = I$, $V^T V = I$, columns are orthogonal unit vectors), $\Sigma$ is diagonal, singular values are positive and sorted in decreasing order
We can drop the smallest singular values (remove the tail $\sigma_i$ together with the corresponding $u_i$, $v_i$).
The resulting error in the Frobenius norm $\|M\|_F = \sqrt{\sum_{ij} M_{ij}^2}$ is minimal.
SVD gives best low rank approximation
Rule of thumb - keep 80-90% of the 'energy' ($= \sum_i \sigma_i^2$)
SVD complexity is $O(nm^2)$ or $O(n^2 m)$, whichever is less; less work is needed if we only need the singular values, or the first k singular vectors, or if the matrix is sparse
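A numpy sketch of truncating the SVD so that roughly 90% of the energy is kept; the input matrix is a random low-rank placeholder.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 8)) @ rng.normal(size=(8, 12))   # random rank-8 20x12 matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)

energy = np.cumsum(s ** 2) / np.sum(s ** 2)
k = int(np.searchsorted(energy, 0.90)) + 1                # smallest k with >= 90% energy

A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]                  # best rank-k approximation
print(k, np.linalg.norm(A - A_k))                         # Frobenius-norm error
```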
Query SVD
$q_{concept} = q\,V$ - transform query q into "concept" space; this gives a vector of "strengths" of each concept
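A tiny sketch of mapping a query into concept space; the toy term-document counts and the query vector are made up.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.integers(0, 3, size=(6, 5)).astype(float)   # toy counts: 6 docs x 5 terms
U, s, Vt = np.linalg.svd(A, full_matrices=False)
V = Vt.T                                             # terms x concepts

q = np.array([1.0, 0.0, 2.0, 0.0, 0.0])             # a query over the 5 terms
q_concept = q @ V                                    # "strength" of each concept
print(np.round(q_concept, 2))
```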
Relation to Eigen-decomposition
$A = X \Lambda X^T$ (eigen-decomposition; the SVD relates via $A A^T = U \Sigma^2 U^T$ and $A^T A = V \Sigma^2 V^T$)
- A is symmetric
- U, V, X are orthonormal ($U^T U = I$)
- $\Lambda$, $\Sigma$ are diagonal
So
+ Optimal low rank approximation
- Interpretability problem
- Lack of sparsity (singular vectors are dense)
CUR decomposition
Goal - express A as a product of C, U, R matrices and keep $\|A - C\,U\,R\|_F$ small
"Constraints": C contains a set of actual columns of A, and similarly R contains a set of actual rows of A (U is built from their intersection)
In practice we can choose about 4k rows/columns (for a rank-k approximation) and do almost as well as SVD.
See book for details.
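A rough sketch of one simple CUR variant: sample columns and rows with probability proportional to their squared norms and take U as the pseudoinverse of the intersection block W. The book's construction differs in details (e.g. scaling of the sampled rows/columns), so treat this as illustrative only.

```python
import numpy as np

def cur(A, c, r, seed=0):
    """Sample c columns and r rows with probability proportional to their
    squared norms (duplicates possible), and set U to the pseudoinverse of
    the intersection block W."""
    rng = np.random.default_rng(seed)
    sq = A ** 2
    cols = rng.choice(A.shape[1], size=c, p=sq.sum(axis=0) / sq.sum())
    rows = rng.choice(A.shape[0], size=r, p=sq.sum(axis=1) / sq.sum())
    C, R_ = A[:, cols], A[rows, :]
    U = np.linalg.pinv(A[np.ix_(rows, cols)])        # pseudoinverse of W
    return C, U, R_

rng = np.random.default_rng(0)
A = rng.normal(size=(30, 4)) @ rng.normal(size=(4, 20))   # rank-4 30x20 matrix
C, U, R_ = cur(A, c=8, r=8)
print(np.linalg.norm(A - C @ U @ R_))                     # small when A is low rank
```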
Pros:
Easy interpretation
Sparsity
Cons:
Rows and columns with large norms may be chosen multiple times (the construction is probabilistic), producing duplicate rows/columns