Tensors for Data and Text Analysis

For the last several months, I’ve been playing with tensors as an approach to data and text analysis. Here are some pointers to get started on tensors.

Tensors are a generalization of matrices to higher dimensions:

  • order 0 tensor = scalar
  • order 1 tensor = vector
  • order 2 tensor = matrix
  • order n > 2 tensor = higher order tensor

PARAFAC and Tucker are generalizations of SVD (Singular Value Decomposition) to higher order tensors. When n = 2, PARAFAC = Tucker = SVD. When n > 2, PARAFAC and Tucker are two different ways to generalize SVD to higher order tensors.

Two good PowerPoint introductions to tensors for data and text analysis are here:

A good survey is Unsupervised Multiway Data Analysis: A Literature Survey. I’m using the Tensor Toolbox, written in Matlab, for my experiments.

For more information on tensors, Tucker, and PARAFAC, here are some good places to look:

Tamara Kolda:

Rasmus Bro:

Richard Harshman:

Evrim Acar:

Here are some papers on tensors for various data and text analysis applications:

http://csmr.ca.sandia.gov/~tgkolda/pubs/index.html

  • Cross-language information retrieval using PARAFAC2
  • Pattern analysis of directed graphs using DEDICOM: An application to enron email
  • Temporal analysis of social networks using three-way DEDICOM
  • Multilinear algebra for analyzing data with multiple linkages
  • The TOPHITS model for higher-order web link analysis
  • Higher-order web link analysis using multilinear algebra
  • Data sciences technology for homeland security information management and knowledge discovery

http://research.microsoft.com/~jtsun/

  • CubeSVD: A Novel Approach to Personalized Web Search

http://www.cs.cmu.edu/~christos/

  • Beyond Streams and Graphs: Dynamic Tensor Analysis

http://www.cs.rpi.edu/~acare/publications.html

  • Collective Sampling and Analysis of High-order Tensors for Chatroom Communications
  • Modeling and Multiway Analysis of Chatroom Tensors

http://publish.uwo.ca/~harshman/

  • Improvements to three-way DEDICOM for applications in social network analysis

Update: My algorithm and source code for big tensors.

4 Responses to “Tensors for Data and Text Analysis”

  1. It seems an immediate application of this would be in the Netflix Prize competition for $1M, where SVD is already being widely used. the NFP is basically a missing value imputation problem involving a very large and very sparse matrix, so latent variable computation is very important. Unlike the Hutter Prize for Lossless Compression of Human Knowledge — which is text processing — the Netflix Prize does not place any limits on the computation resources allowed. Until some fairly powerful optimizations for higher dimensional SVD analogues can be found, the Hutter Prize’s computation limits will be prohibitive (as well as offering lower rewards compared to the NFP). However, I’d hope that the algorithms coming out of the NFP will be made available to text researchers.

  2. Yes, I also thought of the connection to Netflix. I’m not working on the Netflix problem, but efficient Tucker and PARAFAC variants are on my agenda.

  3. Thanks for this article, it’s a nice starting point for people like me who have more than two modalities of information and need a method like SVD to aggregate them.

    I’m currently using SVD for latent semantic analysis, replacing the traditional term-document matrix with a sparse matrix of relevance feedback to aid image annotation. So effectively I have three modalities: low-level image content, image captions, and relevance feedback which relates certain images (documents) to others via a query concept.

    Donn

  4. You’re welcome!

Leave a Reply