SVD, Variance, and Sparsity

There is a steady trickle of visitors to my post on Why Does SVD Improve Similarity Measurement?, so I gave this question a bit more thought. In that post, I offered three hypotheses about why SVD helps — high-order co-occurrence, latent meaning, and noise reduction — and I said that I didn’t know which hypothesis [...]

SVD and Tucker Decomposition with Low RAM Requirements

Recently I’ve been experimenting with algorithms for the Singular Value Decomposition and the Tucker Decomposition, with the goal of processing large matrices (more than 10^5 rows and columns) and large tensors (more than 10^4 rows, columns, and tubes) that are relatively sparse (about 10% density). The problem with matrices and tensors of this size is [...]

Tensors for Data and Text Analysis

For the last several months, I’ve been playing with tensors as an approach to data and text analysis. Here are some pointers to get started on tensors.
Tensors are a generalization of matrices to higher dimensions:

order 0 tensor = scalar
order 1 tensor = vector
order 2 tensor = matrix
order n > 2 tensor = higher order tensor
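
To make the list above concrete, here is a minimal sketch in plain Python that treats nested lists as tensors and reports their order by counting the nesting depth. The function name `tensor_order` and the example values are hypothetical, chosen only for illustration; a real project would more likely use an array library.

```python
def tensor_order(x):
    """Return the order of a tensor stored as nested lists.

    A bare number has order 0, a flat list has order 1, and so on.
    Assumes the nesting is regular (every element at a level has the same shape).
    """
    order = 0
    while isinstance(x, list):
        order += 1
        if not x:  # an empty list has no first element to descend into
            break
        x = x[0]
    return order


scalar = 3.0                                   # order 0
vector = [1, 2, 3]                             # order 1
matrix = [[1, 2], [3, 4]]                      # order 2
cube = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]    # order 3: rows, columns, tubes

for name, t in [("scalar", scalar), ("vector", vector),
                ("matrix", matrix), ("cube", cube)]:
    print(name, "has order", tensor_order(t))
```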

PARAFAC [...]

Why Does SVD Improve Similarity Measurement?

In response to my earlier post on Effects of High-Order Co-occurrences on Word Semantic Similarity, Tom Landauer sent me the following note:
You have given me an idea. Because I have just been asked again to review papers that say that the way LSA works is by indirect associations, it seems that few have seen my [...]

Effects of High-Order Co-occurrences on Word Semantic Similarity

Comments on:
Benoît Lemaire and Guy Denhière
Effects of High-Order Co-occurrences on Word Semantic Similarity
PMI-IR estimates the semantic similarity between a pair of words by how frequently they co-occur within a certain window of text. This simple measure of similarity is surprisingly good at recognizing synonyms: it seems that synonyms often appear close together in text. [...]
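
As a rough illustration of scoring word pairs by how often they co-occur within a window, here is a minimal sketch of pointwise mutual information computed over a toy token list. The function name `pmi_scores`, the `window` parameter, and the tiny corpus are assumptions made for this example. PMI-IR itself takes its counts from search-engine queries rather than an in-memory corpus, so this is a stand-in and not the method from the paper.

```python
import math
from collections import Counter


def pmi_scores(tokens, window=2):
    """Return PMI (base 2) for each unordered pair of distinct words
    that co-occur within `window` positions of each other."""
    word_counts = Counter(tokens)
    pair_counts = Counter()
    for i, w in enumerate(tokens):
        # Look ahead at most `window` tokens from position i.
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            if tokens[j] != w:
                pair_counts[frozenset((w, tokens[j]))] += 1

    n_words = len(tokens)
    n_pairs = sum(pair_counts.values())
    scores = {}
    for pair, count in pair_counts.items():
        a, b = tuple(pair)
        p_ab = count / n_pairs            # estimated joint probability
        p_a = word_counts[a] / n_words    # estimated marginal probabilities
        p_b = word_counts[b] / n_words
        scores[pair] = math.log2(p_ab / (p_a * p_b))
    return scores


# Toy corpus (made up): "car" and "automobile" always appear side by side.
tokens = ["car", "automobile", "the", "car", "automobile", "the",
          "dog", "the", "cat", "the"]
scores = pmi_scores(tokens, window=1)
synonym_score = scores[frozenset(("car", "automobile"))]
function_word_score = scores[frozenset(("the", "car"))]
print(synonym_score > function_word_score)
```

In this toy data the pair that always appears together scores higher than a pair involving a frequent function word, which is the behaviour the excerpt describes for synonyms.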

Unified Latent Analysis

In Latent Semantic Analysis, we use a large collection of text to build a matrix, in which the rows represent words and the columns represent chunks of text. A chunk can be a sentence, a paragraph, a document, or any sequence of words. The value in a cell in the matrix is based on the [...]
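
The matrix layout described above (rows for words, columns for chunks) can be sketched in a few lines of plain Python. The excerpt is cut off before it says what the cell value is based on, so this sketch fills each cell with a raw occurrence count purely as a placeholder assumption. The function name `build_word_chunk_matrix` and the example chunks are likewise hypothetical.

```python
from collections import Counter


def build_word_chunk_matrix(chunks):
    """Build a word-by-chunk matrix as a list of rows.

    Rows follow the sorted vocabulary; columns follow the order of `chunks`.
    Cell values are raw counts (an assumption made for this sketch).
    """
    tokenized = [chunk.lower().split() for chunk in chunks]
    vocab = sorted({w for words in tokenized for w in words})
    row_of = {w: i for i, w in enumerate(vocab)}
    matrix = [[0] * len(chunks) for _ in vocab]
    for col, words in enumerate(tokenized):
        for w, count in Counter(words).items():
            matrix[row_of[w]][col] = count
    return vocab, matrix


# Each chunk here is a short sentence; it could equally be a paragraph or document.
chunks = ["the cat sat", "the dog sat down", "the cat"]
vocab, matrix = build_word_chunk_matrix(chunks)
for word, row in zip(vocab, matrix):
    print(f"{word:5s} {row}")
```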