Week 6 = 4 days for dimensionality reduction and recommender systems. It was also sandwiched between 4th of July (holiday) and our week-long break. I think most of us were not in the strongest mindset or focus for matrix algebra, so it's good for me to do some recapping.

Dimensionality Reduction

In machine learning, our data is represented in a matrix where each row is a set of observations of some features. So a matrix of size m x p would have m observations and p features. When m and p are extremely large, we need dimensionality reduction to help alleviate some problems.

High-dimensional matrices suffer from "the curse of dimensionality". Unsupervised learning methods such as k-nearest neighbors and k-means/hierarchical clustering use distance metrics to group together data that is similar, meaning there is small distance between data points. So, a large number of features means that data points will be extremely far from each other (sparse). You would need a LOT more data to learn and describe the high-dimensional space. Or, you can perform dimensionality reduction.

A large feature space may exhibit multicollinearity, redundancy, and/or noise, which we can be solved by tuning (reducing) the feature space- this reminds me of using regularization for tuning model complexity in regression models.

Dimensionality reduction can also bring out latent features! These are the most relevant features of your data that is not explicitly measured in the data (think hidden topics like book/movie genres).

Also, computations are nicer with a smaller matrix.

We went over 3 methods for dimensionality reduction: Principal Component Analysis (PCA), Singular Value Decomposition (SVD), and Non-negative Matrix Factorization (NMF).

PCA takes your data that is described by possibly correlated variables and tries to describe it with uncorrelated (orthogonal) variables. This set of uncorrelated variables are the principal components part of PCA, and the number of principal components are less than or equal to the number of features you originally started with. The principal components are axes that your data has the most variance along- I liked this image they showed us in lecture- there is no reduction in number of features, but it has transformed it to a more useful set of axes: To get the components, you must solve an eigenvalue problem of your data's covariance matrix. Here, the eigenvalue corresponds to the variance captured along the direction of its respectful eigenvector. So, we can choose a reduced feature space by choosing a subset of the eigenvectors (eliminate eigenvectors with smallest eigenvalues first). You can choose an appropriate number of principal components by targeting some ratio of $\frac{principal \ component \ variance}{original \ variance}$ or by using the "elbow" method of a scree plot (y-axis of eigenvalues in decreasing order, x-axis of corresponding component).

On top of computing the covariance matrix AND solving an eigenvalue problem to perform PCA (which isn't the most computationally tractable especially for data already requiring dimensionality reduction) these results aren't as interpretable, which brings us to SVD.

SVD breaks down your matrix X (m x p) into a product of 3 matrices: U (m x k), D (k x k), V (k x p) in that order. The derivation is a bunch of matrix algebra that I'll skip. The important part is that D is a diagonal matrix of singular values corresponding to k latent features. This means that U is some matrix describing a relationship between observation and latent feature, and V is describing a relationship between latent feature and original feature. Performing SVD on a matrix of users (rows) by their rankings of specific movies (columns) can reveal topics (latent features).

These matrices may contain negative elements, which might not be applicable or interpretable in some contexts (like term frequency or inverse document frequency in NLP). In that case, we can use NMF.

NMF factors your matrix X (m x p) into W (m x k) and H (k x p). This is done using alternating least squares where you hold W (or H) constant, solve for H (or W) using simple regression and only keeping the non-negative values, then repeat by switching W & H. It's important to remember that since this is an iterative process, your resulting factorization is only approximate.

Recommenders

Recommenders work by suggesting an item to a user that is similar to some preference he/she has exhibited. "Similar" can be defined as similarity between users or similarity between items. The number of users is usually larger than the number of items, so item-item similarity may be nicer to build a recommender with.

Similarity is measured on a scale of 0 (totally dissimilar) to 1 (identical). Some of the ways similarity can be computed include:

• euclidean distance (pretty intuitive)
• cosine similarity (similarity of vector directions)
• Pearson correlation (normalized covariance)
• since this metric is normalized, it will capture similarity between users who rate items consistently high/low
• Ex: if one person rates some items as 5, 3, 4, and another person rates the same items as 3, 1, 2, they will have a similarity of 1 since they relatively rate the same way
• Jaccard similarity (for sets constructed from booleans)
• this is used when we have a boolean measure like bought/didn't bought instead of ratings
• the similarity is the ratio of the number of people who bought item 1 & item 2 to the number of people who bought item 1 or item 2

Then, we can construct a similarity matrix where rows and columns correspond to items and are populated with the similarity between two items (this matrix will be symmetric).

Given some utility matrix that describes a user's rating of every possible item where rows are users and columns are items, we can expect that this matrix will not be fully populated. But, we can use the similarity matrix to predict a user's ratings for unrated items, and return items with the highest predicted rating as recommendations:

$$rating(u, i) = \frac{\sum_{j \in I_u} similarity (i, j) \times r_{u, j}}{\sum_{j \in I_u} similarity (i, j)}$$

where $$I_u = set \ of \ items \ rated \ by \ user \ u$$ $$r_{u, j} = user \ u's \ rating \ of \ item \ j$$

The predicted ratings are computed like a weighted average of all items a user has rated- items that are similar to an item j that user i has rated will be weighted more highly with the user's rating of item j.

You can also speed up the prediction by only looking at "neighborhoods" of items. In the above calculation, you would only use similarity matrix values for elements that are most similar to the unrated item you are trying to predict.

Issues with Recommenders:

• Utility matrix can be sparse! This is when dimensionality reduction is helpful for us.
• Hard to validate- to measure that your recommender is effective you could perform hypothesis testing to see if conversions have changed.
• Requires a user to rate a number of items before recommendations can be given

Then a long awaited break to celebrate the halfway point of our program!

Things I thought I'd accomplish on my break:

• figuring out what to work on for my capstone project
• revamping my structural engineering resume to a data science resume
• brushing up my LinkedIn (adding data sciencey skills to get endorsed, figuring out what description to put for my experience at Galvanize)
• updating the blog
• exploring topics on my own:
• word2vec
• neural nets
• d3js
• and more things... the list never ends
• finding a new place to live after my summer sublet is up

Things I got done on my break:

• revamping my structural engineering resume to a data science resume
• brushing up my LinkedIn (adding data sciencey skills to get endorsed, figuring out what description to put for my experience at Galvanize)
• kinda updating the blog
• Fussing with Craigslist Missed Connections
• as I mentioned in the last post ... there's always an improvement or update to be made