Major classes of recommender systems:
- Content based
- Collaborative filtering
- Latent factor based
Formal model:
C - set of Customers
S - set of Items
Utility function $u: C \times S \to R$ (set of Ratings)
R - totally ordered
Key problems:
- Gathering "known" ratings for matrix
- Explicit (ask people; doesn't scale)
- Implicit (learn from actions, hard to learn low ratings)
- Extrapolate unknown ratings from the known ones (mostly interested in high unknown ratings)
- Sparsity
- Cold start: new items have no ratings, new users have no history
- Evaluating extrapolation methods
Content-based recommendations
Main idea: recommend to customer x items similar to previous items rated highly by x.
Item profiles - set of "important" features
Text features - set of "important" words
Extract using heuristics like TF-IDF (Term frequency * Inverse Doc Frequency)
$f_{ij}$ - frequency of term (feature) i in doc (item) j
$TF_{ij} = \frac{f_{ij}}{\max_k f_{kj}}$
$n_i$ - number of docs that mention term i
$N$ - total number of docs
$IDF_i = \log(N / n_i)$. TF-IDF score: $w_{ij} = TF_{ij} \times IDF_i$
Doc profile - set of words with highest TF-IDF scores (together with scores)
I.e. a word that is frequent in only a (small) set of documents is important.
Build the user profile as an aggregate of the profiles of the items the user has rated (aggregation can be the mean or a normalized/weighted mean of item features), then estimate $u(x, i) = \cos(\theta) = \frac{x \cdot i}{\|x\|\,\|i\|}$ where x is the user profile and i is the item profile.
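To make the pipeline concrete, here is a minimal numpy sketch of TF-IDF item profiles, a mean-aggregated user profile and cosine scoring; the toy count matrix, the set of liked items and the plain-mean aggregation are illustrative assumptions, not taken from the notes.

```python
import numpy as np

def tf_idf(term_doc_counts):
    """term_doc_counts: (n_terms, n_docs) raw frequencies f_ij.
    Returns w_ij = TF_ij * IDF_i as in the formulas above."""
    f = np.asarray(term_doc_counts, dtype=float)
    tf = f / f.max(axis=0, keepdims=True)          # TF_ij = f_ij / max_k f_kj
    n_i = (f > 0).sum(axis=1)                      # number of docs mentioning term i
    idf = np.log(f.shape[1] / n_i)                 # IDF_i = log(N / n_i)
    return tf * idf[:, None]

def cosine(x, i):
    return float(x @ i / (np.linalg.norm(x) * np.linalg.norm(i)))

# toy example: 4 terms, 3 items (documents)
counts = np.array([[3, 0, 1],
                   [0, 2, 0],
                   [1, 1, 1],
                   [0, 0, 4]])
item_profiles = tf_idf(counts)                     # columns are item profiles

# user profile: plain mean of the profiles of items the user liked (items 0 and 2)
user_profile = item_profiles[:, [0, 2]].mean(axis=1)

# score an unseen item by cosine similarity with the user profile
print(cosine(user_profile, item_profiles[:, 1]))
```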
Pros:
- No need for data on other users
- Able to recommend to users with unique tastes
- Able to recommend new & unpopular items
- Explanation for recommended items
Cons:
- Hard to find appropriate features
- Overspecialization - never recommend items outside user's profile, unable to exploit quality judgments of other users
- Cold start problem - how to build user profile
Collaborative filtering
Main idea: find a set N of other users whose ratings are similar to x's ratings, and estimate x's ratings based on the ratings of the users in N.
Similarity between users:
- Jaccard similarity $sim(A,B) = \frac{|r_A \cap r_B|}{|r_A \cup r_B|}$ doesn't work well (it ignores the rating values).
- Cosine similarity $sim(A,B) = \cos(r_A, r_B)$
- Centered cosine is better: center each user's ratings around their average value (make the user's average rating 0), then take $sim(A,B) = \cos(r_A, r_B)$ - see the sketch below.
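A minimal sketch of centered cosine similarity between two users' rating vectors; the convention that 0 marks a missing rating is an assumption made for the example.

```python
import numpy as np

def centered_cosine(ra, rb):
    """Centered cosine (Pearson-style) similarity between two rating vectors.
    Zeros are treated as missing ratings and stay zero after centering."""
    ra, rb = np.array(ra, float), np.array(rb, float)
    for r in (ra, rb):
        rated = r != 0
        r[rated] -= r[rated].mean()        # make the user's average rating 0
    denom = np.linalg.norm(ra) * np.linalg.norm(rb)
    return float(ra @ rb / denom) if denom else 0.0

# two users rated items on a 1..5 scale; 0 = not rated
print(centered_cosine([4, 0, 0, 5, 1], [5, 5, 4, 0, 0]))
```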
Let $r_x$ be the vector of user x's ratings
Let N be the set of k users most similar to x who also rated item i
Prediction for user x and item i:
Option 1: $r_{xi} = \frac{1}{k}\sum_{y \in N} r_{yi}$ - doesn't take into account the "degree of similarity" of users
Option 2: $r_{xi} = \frac{\sum_{y \in N} s_{xy} r_{yi}}{\sum_{y \in N} s_{xy}}$ where $s_{xy} = sim(x,y)$
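A sketch of the Option 2 prediction on a toy ratings matrix (again 0 = unrated); keeping only positively similar neighbours is a practical tweak added here, not part of the formula above.

```python
import numpy as np

def centered_cosine(ra, rb):
    # as in the previous sketch: subtract each user's mean over rated items
    ra, rb = np.array(ra, float), np.array(rb, float)
    for r in (ra, rb):
        r[r != 0] -= r[r != 0].mean()
    denom = np.linalg.norm(ra) * np.linalg.norm(rb)
    return float(ra @ rb / denom) if denom else 0.0

def predict_user_user(R, x, i, k=2):
    """Option 2: similarity-weighted average of the ratings of the k users
    most similar to x who rated item i (only positively similar users kept)."""
    sims = {y: centered_cosine(R[x], R[y]) for y in range(len(R)) if y != x}
    candidates = [y for y, s in sims.items() if R[y, i] != 0 and s > 0]
    top = sorted(candidates, key=sims.get, reverse=True)[:k]
    s = np.array([sims[y] for y in top])
    return float(s @ R[top, i] / s.sum()) if len(top) else 0.0

R = np.array([[4, 0, 5, 0],
              [5, 1, 4, 2],
              [3, 0, 4, 5],
              [4, 2, 0, 3]], dtype=float)
print(predict_user_user(R, x=0, i=3))   # estimate user 0's rating of item 3
```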
Another version is Item-Item filtering (User-User described right above)
- For item i find other similar items
- Estimate rating for item i based on ratings for similar items
- Can use same similarity metrics and prediction functions as in user-user model
In practice item-item outperforms user-user in many use cases: items are "simpler" than users, so item similarity is more meaningful than user similarity.
Common practice when implementing collaborative filtering: combine the neighborhood estimate with a baseline (bias) estimate
$r_{xi} = b_{xi} + \frac{\sum_{j \in N(i;x)} s_{ij}(r_{xj} - b_{xj})}{\sum_{j \in N(i;x)} s_{ij}}$ where $b_{xi} = \mu + b_x + b_i$, $\mu$ - overall mean movie rating (movie database example), $b_x$ - rating deviation of user x = (avg rating of user x) - $\mu$, $b_i$ - rating deviation of movie i
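A sketch of the baseline part $b_{xi} = \mu + b_x + b_i$ on a toy matrix with 0 = unrated; the neighborhood correction would then be added on top exactly as in the formula above.

```python
import numpy as np

def baseline(R):
    """Return mu, per-user deviations b_x and per-item deviations b_i
    for a ratings matrix R where 0 means 'unrated'."""
    rated = R != 0
    mu = R[rated].mean()                                   # overall mean rating
    b_x = np.array([R[u, rated[u]].mean() - mu if rated[u].any() else 0.0
                    for u in range(R.shape[0])])           # user deviations
    b_i = np.array([R[rated[:, j], j].mean() - mu if rated[:, j].any() else 0.0
                    for j in range(R.shape[1])])           # item deviations
    return mu, b_x, b_i

R = np.array([[4, 0, 5, 0],
              [5, 1, 4, 2],
              [3, 0, 4, 5]], dtype=float)
mu, b_x, b_i = baseline(R)
print(mu + b_x[0] + b_i[3])   # baseline estimate b_xi for user 0, item 3
```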
Evaluating recommender systems
Predict ratings on test set and evaluate RMSE
$RMSE = \sqrt{\frac{\sum_{(x,i) \in T} (r_{xi} - r^*_{xi})^2}{N}}$
$N = |T|$, $r_{xi}$ the predicted rating, $r^*_{xi}$ the actual rating.
Problems with RMSE - prediction diversity (all predictions may be too similar to each other), prediction context (the user might operate in a different context => wants different results), order of predictions.
We mostly care about predicting high ratings - RMSE might penalize a method that does well on high ratings and badly on the others. Alternative: precision at top k, i.e. the percentage of predictions that fall in the user's top k ratings.
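A small sketch of both metrics; the test pairs are made up, and `precision_at_k` implements one common reading of "precision at top k" (overlap of the top-k predicted and top-k actual items).

```python
import numpy as np

def rmse(pred, actual):
    pred, actual = np.asarray(pred, float), np.asarray(actual, float)
    return float(np.sqrt(((pred - actual) ** 2).mean()))

def precision_at_k(pred, actual, k=3):
    """Fraction of the k highest-predicted items that are also among
    the user's k highest actually-rated items."""
    top_pred = set(np.argsort(pred)[-k:])
    top_true = set(np.argsort(actual)[-k:])
    return len(top_pred & top_true) / k

predicted = [4.2, 3.1, 4.8, 2.0, 3.9]
actual    = [5.0, 3.0, 4.0, 1.0, 4.0]
print(rmse(predicted, actual), precision_at_k(predicted, actual))
```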
Latent Factor Models
Global effects => like Baseline Estimation
Local neighborhood (CF/NN) => Final estimate
Collaborative filtering - derive the unknown rating from the k nearest neighbors:
$\hat r_{xi} = \frac{\sum_{j \in N(i;x)} s_{ij} \cdot r_{xj}}{\sum_{j \in N(i;x)} s_{ij}}$
$s_{ij}$ - similarity of items i and j
$r_{xj}$ - rating of user x on item j
$N(i;x)$ - set of items similar to item i that were rated by x
Combine with global effects
$\hat r_{xi} = b_{xi} + \frac{\sum_{j \in N(i;x)} s_{ij} \cdot (r_{xj} - b_{xj})}{\sum_{j \in N(i;x)} s_{ij}}$
$b_{xi} = \mu + b_x + b_i$
$\mu$ - overall mean rating
$b_x$ - rating deviation of user x
$b_i$ - rating deviation of item i
Problems:
1) Similarity measures are arbitrary
2) Pairwise similarities neglect interdependencies among users
3) Taking weighted average can be restricting
SVD
A - input data matrix
U - left singular vectors
V - right singular vectors
$\Sigma$ - singular values
$A \approx U \Sigma V^T$; for recommendations, write the "SVD" as $R \approx Q \cdot P^T$ with $A = R$, $Q = U$, $P^T = \Sigma V^T$
R - utility matrix approximated by Q and P, matrices of lower dimensions
SVD minimizes the reconstruction error
$\min_{U,\Sigma,V} SSE = \min_{U,\Sigma,V} \sum_{(i,j) \in A} (A_{ij} - [U \Sigma V^T]_{ij})^2$
SSE - Sum of Squared Errors
SSE and RMSE are monotonically related: $RMSE = \frac{1}{c}\sqrt{SSE}$, so minimizing SSE also minimizes RMSE
But R has missing entries!
So modify SVD to find P, Q such that
$\min_{P,Q} \sum_{(i,x) \in R} (r_{xi} - q_i \cdot p_x^T)^2$ (sum only over elements with defined ratings)
also
- we don't require P,Q to be orthogonal/unit length
- P,Q map users/movies to latent space
Add regularization: $\min_{P,Q} \sum_{(i,x) \in R} (r_{xi} - q_i \cdot p_x^T)^2 + \lambda \left[ \sum_x \|p_x\|^2 + \sum_i \|q_i\|^2 \right]$
and use gradient descent
Initialize P and Q (using SVD - pretend the missing ratings are 0)
Do gradient descent
- $P \leftarrow P - \eta \cdot \nabla_P$
- $Q \leftarrow Q - \eta \cdot \nabla_Q$
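A compact numpy sketch of the regularized objective optimized with full-batch gradient descent over the known entries; the random initialization (instead of an SVD warm start), rank, learning rate and $\lambda$ are arbitrary illustrative choices.

```python
import numpy as np

def factorize(R, rank=2, eta=0.01, lam=0.1, steps=500):
    """Minimize sum over known r_xi of (r_xi - q_i . p_x)^2
    + lam * (sum ||p_x||^2 + sum ||q_i||^2) by gradient descent."""
    rng = np.random.default_rng(0)
    n_users, n_items = R.shape
    P = rng.normal(scale=0.1, size=(n_users, rank))   # user factors p_x
    Q = rng.normal(scale=0.1, size=(n_items, rank))   # item factors q_i
    known = R != 0                                     # 0 marks a missing rating
    for _ in range(steps):
        E = known * (R - P @ Q.T)                      # errors on known entries only
        grad_P = -2 * E @ Q + 2 * lam * P
        grad_Q = -2 * E.T @ P + 2 * lam * Q
        P -= eta * grad_P                              # P <- P - eta * grad_P
        Q -= eta * grad_Q                              # Q <- Q - eta * grad_Q
    return P, Q

R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)
P, Q = factorize(R)
print(np.round(P @ Q.T, 2))   # predicted ratings for all (user, item) pairs
```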
Modeling biases and interactions
$\hat r_{xi} = \mu + b_x + b_i + q_i \cdot p_x^T$
where
$\mu$ - overall mean rating
$b_x$ - bias for user x
$b_i$ - bias for movie i
$q_i \cdot p_x^T$ - user-movie interaction
and we can fit a new model
$\min_{P,Q,b_x,b_i} \sum_{(i,x) \in R} (r_{xi} - (\mu + b_x + b_i + q_i \cdot p_x^T))^2 + \lambda \left[ \sum_x \|p_x\|^2 + \sum_i \|q_i\|^2 + \sum_x \|b_x\|^2 + \sum_i \|b_i\|^2 \right]$
where $b_x$ and $b_i$ become additional parameters of the estimation
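A short sketch of the biased prediction and the corresponding regularized objective; the random P, Q and zero-initialized biases are placeholders just to show the bookkeeping.

```python
import numpy as np

def predict(mu, b_x, b_i, P, Q, x, i):
    """Biased latent-factor prediction: mu + b_x + b_i + q_i . p_x."""
    return mu + b_x[x] + b_i[i] + Q[i] @ P[x]

def objective(R, mu, b_x, b_i, P, Q, lam=0.1):
    """Regularized squared error over the known ratings (0 = unrated)."""
    err = sum((R[x, i] - predict(mu, b_x, b_i, P, Q, x, i)) ** 2
              for x, i in zip(*np.nonzero(R)))
    reg = lam * ((P ** 2).sum() + (Q ** 2).sum()
                 + (b_x ** 2).sum() + (b_i ** 2).sum())
    return err + reg

rng = np.random.default_rng(0)
R = np.array([[5, 3, 0], [4, 0, 1]], dtype=float)
P, Q = rng.normal(size=(2, 2)), rng.normal(size=(3, 2))    # placeholder factors
b_x, b_i, mu = np.zeros(2), np.zeros(3), R[R != 0].mean()  # start biases at 0
print(objective(R, mu, b_x, b_i, P, Q))
```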
Dimensionality Reduction
Matrix rank - the number of linearly independent columns/rows. The number of coordinates needed to represent the data can be reduced to the rank.
Why reduce?
- Discover hidden correlation/topics
- Remove redundant and noisy features
- Interpretation and visualization
- Easier storage and processing of data
SVD definition
$A_{m \times n} = U_{m \times r}\, \Sigma_{r \times r}\, (V_{n \times r})^T$
A: input matrix (say, m documents, n terms)
U: left singular vectors, $m \times r$ matrix (say, m documents, r concepts)
$\Sigma$: singular values, $r \times r$ diagonal matrix, values on the diagonal sorted in decreasing order ("strength" of each concept)
V: right singular vectors, $n \times r$ matrix (n terms, r concepts)
$A \approx U \Sigma V^T = \sum_i \sigma_i u_i v_i^T$
Decomposition is always possible, the matrices are unique, U and V columns are orthonormal ($U^T U = I$, $V^T V = I$, columns are orthogonal unit vectors), $\Sigma$ is diagonal, singular values are positive and sorted in decreasing order
We can drop the smallest singular values (remove the tail $\sigma_i$ together with the corresponding $u_i$, $v_i$).
The resulting error in the Frobenius norm $\|M\|_F = \sqrt{\sum_{ij} M_{ij}^2}$ is minimal.
SVD gives best low rank approximation
Rule of thumb - keep 80-90% of the 'energy' ($= \sum_i \sigma_i^2$)
SVD complexity is $O(nm^2)$ or $O(n^2 m)$, whichever is less; less work is needed if we only need the singular values, or the first k singular vectors, or if the matrix is sparse
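A numpy sketch of truncating the SVD so that roughly 90% of the energy is kept; the input matrix is a random low-rank placeholder.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 8)) @ rng.normal(size=(8, 12))   # random rank-8 20x12 matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)

energy = np.cumsum(s ** 2) / np.sum(s ** 2)
k = int(np.searchsorted(energy, 0.90)) + 1                # smallest k with >= 90% energy

A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]                  # best rank-k approximation
print(k, np.linalg.norm(A - A_k))                         # Frobenius-norm error
```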
Query SVD
$q_{concept} = q\,V$ - transform query q into "concept" space; this gives a vector of "strengths" of each concept
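A tiny sketch of mapping a query into concept space; the toy term-document counts and the query vector are made up.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.integers(0, 3, size=(6, 5)).astype(float)   # toy counts: 6 docs x 5 terms
U, s, Vt = np.linalg.svd(A, full_matrices=False)
V = Vt.T                                             # terms x concepts

q = np.array([1.0, 0.0, 2.0, 0.0, 0.0])             # a query over the 5 terms
q_concept = q @ V                                    # "strength" of each concept
print(np.round(q_concept, 2))
```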
Relation to Eigen-decomposition
$A = X \Lambda X^T$ (eigen-decomposition; the SVD relates via $A A^T = U \Sigma^2 U^T$ and $A^T A = V \Sigma^2 V^T$)
- A is symmetric
- U, V, X are orthonormal ($U^T U = I$)
- $\Lambda$, $\Sigma$ are diagonal
So
+ Optimal low rank approximation
- Interpretability problem
- Lack of sparsity (singular vectors are dense)
CUR decomposition
Goal - express A as a product of C, U, R matrices and keep $\|A - C\,U\,R\|_F$ small
"Constraints": C contains a set of actual columns of A, and similarly R contains a set of actual rows of A (U is built from their intersection)
In practice we can choose about 4k rows/columns (for a rank-k approximation) and do almost as well as SVD.
See book for details.
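A rough sketch of one simple CUR variant: sample columns and rows with probability proportional to their squared norms and take U as the pseudoinverse of the intersection block W. The book's construction differs in details (e.g. scaling of the sampled rows/columns), so treat this as illustrative only.

```python
import numpy as np

def cur(A, c, r, seed=0):
    """Sample c columns and r rows with probability proportional to their
    squared norms (duplicates possible), and set U to the pseudoinverse of
    the intersection block W."""
    rng = np.random.default_rng(seed)
    sq = A ** 2
    cols = rng.choice(A.shape[1], size=c, p=sq.sum(axis=0) / sq.sum())
    rows = rng.choice(A.shape[0], size=r, p=sq.sum(axis=1) / sq.sum())
    C, R_ = A[:, cols], A[rows, :]
    U = np.linalg.pinv(A[np.ix_(rows, cols)])        # pseudoinverse of W
    return C, U, R_

rng = np.random.default_rng(0)
A = rng.normal(size=(30, 4)) @ rng.normal(size=(4, 20))   # rank-4 30x20 matrix
C, U, R_ = cur(A, c=8, r=8)
print(np.linalg.norm(A - C @ U @ R_))                     # small when A is low rank
```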
Pros:
Easy interpretation
Sparsity
Cons:
Rows and columns with large norms may be chosen multiple times (the construction is probabilistic), producing duplicate rows/columns