Записки программиста, обо всем и ни о чем. Но, наверное, больше профессионального.

2015-12-30

BerkeleyX: CS190.1x Scalable Machine Learning

Мои конспекты по курсу
BerkeleyX: CS190.1x Scalable Machine Learning

Интро

Виртуалка для лабораторных работ

Неделя 1: обзор Apache Spark; основные концепты, линейная алгебра big O, ...

Неделя 2: Apache Spark, RDD, transformations, actions, ...

Неделя 2, лабораторка: RDD, lambda, transformations, actions, ...

Неделя 3: linear regression, gradient descent, grid search, …

Неделя 3, лабораторка: linear regression pipeline

Неделя 4: логистическая регресия, one-hot-encoding, feature hashing, …

Неделя 4, лабораторка: логистическая регресия, one-hot-encoding, feature hashing, …

Неделя 5: Principal Component Analysis

Неделя 5, лабораторка: PCA






original post http://vasnake.blogspot.com/2015/12/berkeleyx-cs1901x-scalable-machine.html

Week 5, Lab 5

Курс Scalable Machine Learning. Hadoop, Apache Spark, Python, ML -- вот это всё.

Продолжаю конспектировать пройденный курс. Неделя 5.
В прошлый раз было про теорию PCA (Principal Component Analysis).
Пришло время потрогать пройденные темы на практике. Лабораторка.

LAB5 OVERVIEW

Principal Component Analysis Lab

This lab delves into exploratory analysis of neuroscience data, specifically using principal component analysis (PCA) and feature-based aggregation. We will use a dataset of light-sheet imaging recorded by the Ahrens Lab at Janelia Research Campus, and hosted on the CodeNeuro data repository.
Our dataset is generated by studying the movement of a larval zebrafish, an animal that is especially useful in neuroscience because it is transparent, making it possible to record activity over its entire brain using a technique called light-sheet microscopy. Specifically, we'll work with time-varying images containing patterns of the zebrafish's neural activity as it is presented with a moving visual pattern. Different stimuli induce different patterns across the brain, and we can use exploratory analyses to identify these patterns. Read "Mapping brain activity at scale with cluster computing" for more information about these kinds of analyses.

During this lab you will learn about PCA, and then compare and contrast different exploratory analyses of the same data set to identify which neural patterns they best highlight.

Понятно вроде. Я, пожалуй, не буду дальше выделять курсивом цитаты из материалов курса. И так понятно, всё, что по аглицки – умело скопипащено.

ОК, забираем нотебук

Запускаем виртуалку
valik@snafu:~$  pushd ~/sparkvagrant/
valik@snafu:~/sparkvagrant$  vagrant up
И вперед: http://localhost:8001/tree

На Гитхабе этот нотебук в отрендеренном виде

Программа действий:
  • Part 1: Work through the steps of PCA on a sample dataset
  • Part 2: Write a PCA function and evaluate PCA on sample datasets
  • Part 3: Parse, inspect, and preprocess neuroscience data then perform PCA
  • Part 4: Perform feature-based aggregation followed by PCA
Видно, что первые две части посвящены разбору алгоритма PCA (который big n, small d) и только потом собственно практика:
применение PCA в реальной задаче.

Для справки

Part 1: Work through the steps of PCA on a sample dataset


Visualization 1: Two-dimensional Gaussians

Principal Component Analysis, or PCA, is a strategy for dimensionality reduction.
To better understand PCA, we'll work with synthetic data generated by sampling from the two-dimensional Gaussian distribution.
This distribution takes as input the mean and variance of each dimension, as well as the covariance between the two dimensions.
In our visualizations below, we will specify the mean of each dimension to be 50 and the variance along each dimension to be 1.
We will explore two different values for the covariance: 0 and 0.9.
When the covariance is zero, the two dimensions are uncorrelated, and hence the data looks spherical.
In contrast, when the covariance is 0.9, the two dimensions are strongly (positively) correlated and thus the data is non-spherical.
As we'll see in Parts 1 and 2, the non-spherical data is amenable to dimensionality reduction via PCA, while the spherical data is not.

Для разминки нам подготовили два датасета:
def create2DGaussian(mn, sigma, cov, n):
    """Randomly sample points from a two-dimensional Gaussian distribution"""
    np.random.seed(142)
    return np.random.multivariate_normal(np.array([mn, mn]), np.array([[sigma, cov], [cov, sigma]]), n)
dataRandom = create2DGaussian(mn=50, sigma=1, cov=0, n=100)
dataCorrelated = create2DGaussian(mn=50, sigma=1, cov=.9, n=100)


dataRandom



dataCorrelated

1a, отцентруем датасет, чтобы среднее значение = 0

In the first step of PCA, we must first center our data. Working with our correlated dataset, first compute the mean of each feature (column) in the dataset. Then for each observation, modify the features by subtracting their corresponding mean, to create a zero mean dataset.
Note that correlatedData is an RDD of NumPy arrays. This allows us to perform certain operations more succinctly. For example, we can sum the columns of our dataset using correlatedData.sum()
correlatedData = sc.parallelize(dataCorrelated)
meanCorrelated = correlatedData.сумма
meanCorrelated = [meanCorrelated[0] / correlatedData.количество, meanCorrelated[1] / correlatedData.количество]
correlatedDataZeroMean = correlatedData.map(lambda x: (x[0] минус meanCorrelated[0], x[1] минус meanCorrelated[1]))
print meanCorrelated
print correlatedData.take(1)
print correlatedDataZeroMean.take(1)

[49.957390367753533, 49.971804769900864]
[array([ 49.6717712 ,  50.07531969])]
[(-0.28561916832964585, 0.10351492197556666)]

1b, посчитаем матрицу ковариантности, распределенно

Если вспомнить лекцию, то понятно, что каждый воркер получит запись датасета и сделает из нее outer product.

To compute this matrix, compute the outer product of each data point,
add together these outer products,
and divide by the number of data points.
The data are two dimensional, so the resulting covariance matrix should be a 2x2 matrix.



correlatedCov = (correlatedDataZeroMean
    .map(lambda x: аутерпродакт(x, x))
    .сумма
) / correlatedDataZeroMean.количество
print correlatedCov

[[ 0.99558386  0.90148989]
 [ 0.90148989  1.08607497]]

1c, размялись, теперь оформим расчет ковариантности как функцию

use the expressions above to write a function to compute the sample covariance matrix for an arbitrary data RDD

def estimateCovariance(data):
    """Compute the covariance matrix for a given rdd.

    Note:
        The multi-dimensional covariance array should be calculated using outer products.  Don't
        forget to normalize the data by first subtracting the mean.

    Args:
        data (RDD of np.ndarray):  An `RDD` consisting of NumPy arrays.

    Returns:
        np.ndarray: A multi-dimensional array where the number of rows and columns both equal the
            length of the arrays in the input `RDD`.
    """
    count = data.количество
    mean = data.сумма
    mean = map(lambda x: x делить count, mean)
    centered = data.map(lambda x: x минус mean)
    res = (centered
           .map(lambda x: аутерпродакт(x, x))
           .сумма()
    ) / count
    return res

correlatedCovAuto = estimateCovariance(correlatedData)
print correlatedCovAuto

[[ 0.99558386  0.90148989]
 [ 0.90148989  1.08607497]]

1d, имея матрицу ковариантности, можно найти эйгенвекторы со товарищи (принцип. компоненты)
Считать будем по простому, ибо «small d».

The d eigenvectors of the covariance matrix give us the directions of maximal variance, and are often called the "principal components."
The associated eigenvalues are the variances in these directions. In particular, the eigenvector corresponding to the largest eigenvalue is the direction of maximal variance (this is sometimes called the "top" eigenvector).

Eigendecomposition of a d×d covariance matrix has a (roughly) cubic runtime complexity with respect to d. Whenever d is relatively small (e.g., less than a few thousand) we can quickly perform this eigendecomposition locally.

Use a function from numpy.linalg called eigh to perform the eigendecomposition.
Next, sort the eigenvectors based on their corresponding eigenvalues (from high to low), yielding a matrix where the columns are the eigenvectors (and the first column is the top eigenvector).
Note that np.argsort can be used to obtain the indices of the eigenvalues that correspond to the ascending order of eigenvalues.
Finally, set the topComponent variable equal to the top eigenvector or prinicipal component, which is a 2-dimensional vector (array with two values).
Note that the eigenvectors returned by eigh appear in the columns and not the rows.
For example, the first eigenvector of eigVecs would be found in the first column and could be accessed using eigVecs[:,0].

from numpy.linalg import eigh
# Calculate the eigenvalues and eigenvectors from correlatedCovAuto
eigVals, eigVecs = eigh(correlatedCovAuto)
print 'eigenvalues: {0}'.format(eigVals)
print '\neigenvectors: \n{0}'.format(eigVecs)
# Use np.argsort to find the top eigenvector based on the largest eigenvalue
inds = np.argsort(eigVals)[::-1]
topComponent = eigVecs[:, inds[0]]
print '\ntop principal component: {0}'.format(topComponent)

eigenvalues: [ 0.13820481  1.94345403]

eigenvectors: 
[[-0.72461254  0.68915649]
 [ 0.68915649  0.72461254]]

top principal component: [ 0.68915649  0.72461254]

1e, сожмем данные через PCA
В смысле, имея принцип. компоненты, в виде матрицы, мы можем снизить размерность наших данных.
Перемножив матрицу компонент на матрицу данных.

We just computed the top principal component for a 2-dimensional non-spherical dataset. Now let's use this principal component to derive a one-dimensional representation for the original data. To compute these compact representations, which are sometimes called PCA "scores", calculate the dot product between each data point in the raw data and the top principal component.

# Use the topComponent and the data from correlatedData to generate PCA scores
correlatedDataScores = correlatedData.map(lambda x: x.умножить(topComponent))
print 'one-dimensional data (first three):\n{0}'.format(np.asarray(correlatedDataScores.take(3)))

one-dimensional data (first three):
[ 70.51682806  69.30622356  71.13588168]

Part 2: Write a PCA function and evaluate PCA on sample datasets


2a, теперь у нас есть всё необходимое, чтобы нарисовать функцию «pca»

def pca(data, k=2):
    """Computes the top `k` principal components, corresponding scores, and all eigenvalues.

    Note:
        All eigenvalues should be returned in sorted order (largest to smallest). `eigh` returns
        each eigenvectors as a column.  This function should also return eigenvectors as columns.

    Args:
        data (RDD of np.ndarray): An `RDD` consisting of NumPy arrays.
        k (int): The number of principal components to return.

    Returns:
        tuple of (np.ndarray, RDD of np.ndarray, np.ndarray): A tuple of (eigenvectors, `RDD` of
            scores, eigenvalues).  Eigenvectors is a multi-dimensional array where the number of
            rows equals the length of the arrays in the input `RDD` and the number of columns equals
            `k`.  The `RDD` of scores has the same number of rows as `data` and consists of arrays
            of length `k`.  Eigenvalues is an array of length d (the number of features).
    """
    covMatrix = estimateCovariance(data)
    eigValues, eigVecs = eigh(covMatrix)
    inds = нп.аргсорт(eigValues)
    revInds = inds[::минус раз]
    d = len(eigValues)
    topComponents = np.нули((d, k))
    for idx in range(k): # insert columns
        topComponents[:, idx] = eigVecs[:, revInds[idx]]
    dataScores = data.map(lambda x: x.умножить(topComponents))
    # Return the `k` principal components, `k` scores, and all eigenvalues
    return (topComponents, dataScores, eigValues[revInds])

# Run pca on correlatedData with k = 2
topComponentsCorrelated, correlatedDataScoresAuto, eigenvaluesCorrelated = pca(correlatedData, k=2)

2b, применим функцию pca

randomData = sc.parallelize(dataRandom)
# Use pca on randomData
topComponentsRandom, randomDataScoresAuto, eigenvaluesRandom = pca(randomData, k=2)
print 'topComponentsRandom: \n{0}'.format(topComponentsRandom)
print ('\nrandomDataScoresAuto (first three): \n{0}'
       .format('\n'.join(map(str, randomDataScoresAuto.take(3)))))
print '\neigenvaluesRandom: \n{0}'.format(eigenvaluesRandom)

topComponentsRandom: 
[[-0.2522559  -0.96766056]
 [ 0.96766056 -0.2522559 ]]

randomDataScoresAuto (first three): 
[ 36.61068572 -61.3489929 ]
[ 35.97314295 -62.08813671]
[ 35.59836628 -60.61390415]

eigenvaluesRandom: 
[ 1.4204546   0.99521397]

Посмотрим, как наши данные выглядят на графиках





Теперь добавим третье измерение, данные 3D



from mpl_toolkits.mplot3d import Axes3D
m = 100
mu = np.array([50, 50, 50])
r1_2 = 0.9
r1_3 = 0.7
r2_3 = 0.1
sigma1 = 5
sigma2 = 20
sigma3 = 20
c = np.array([[sigma1 ** 2, r1_2 * sigma1 * sigma2, r1_3 * sigma1 * sigma3],
             [r1_2 * sigma1 * sigma2, sigma2 ** 2, r2_3 * sigma2 * sigma3],
             [r1_3 * sigma1 * sigma3, r2_3 * sigma2 * sigma3, sigma3 ** 2]])
np.random.seed(142)
dataThreeD = np.random.multivariate_normal(mu, c, m)

2c, и сожмем их через PCA до 2D

threeDData = sc.parallelize(dataThreeD)
componentsThreeD, threeDScores, eigenvaluesThreeD = pca(threeDData, k=2)

print 'componentsThreeD: \n{0}'.format(componentsThreeD)
print ('\nthreeDScores (first three): \n{0}'
       .format('\n'.join(map(str, threeDScores.take(3)))))
print '\neigenvaluesThreeD: \n{0}'.format(eigenvaluesThreeD)

componentsThreeD: 
[[ 0.23952078  0.045635  ]
 [ 0.61699931  0.76409466]
 [ 0.74962768 -0.64348799]]

threeDScores (first three): 
[ 85.25798606  -8.29694407]
[ 89.66337911  15.73381517]
[ 75.92616872 -20.5015709 ]

eigenvaluesThreeD: 
[ 614.46863537  349.47737219    5.85043581]



See the 2D version of the data that captures most of its original structure.
Note that darker blues correspond to points with higher values for the original data's third dimension

2d, отладка, как много вариативности мы захватываем

let's quantify how much of the variance is being captured by PCA in each of the three synthetic datasets we've analyzed. To do this, we'll compute the fraction of retained variance by the top principal components.
Recall that the eigenvalue corresponding to each principal component captures the variance along this direction.

def varianceExplained(data, k=1):
    """Calculate the fraction of variance explained by the top `k` eigenvectors.

    Args:
        data (RDD of np.ndarray): An RDD that contains NumPy arrays which store the
            features for an observation.
        k: The number of principal components to consider.

    Returns:
        float: A number between 0 and 1 representing the percentage of variance explained
            by the top `k` eigenvectors.
    """
    components, scores, eigenvalues = pca(data, k)
    topEV = эйгенвалс[:первые к]
    res = sum(топ) / sum(эйгенвалс)
    return res

varianceRandom1 = varianceExplained(randomData, 1)
varianceCorrelated1 = varianceExplained(correlatedData, 1)
varianceRandom2 = varianceExplained(randomData, 2)
varianceCorrelated2 = varianceExplained(correlatedData, 2)
varianceThreeD2 = varianceExplained(threeDData, 2)

print ('Percentage of variance explained by the first component of randomData: {0:.1f}%'
       .format(varianceRandom1 * 100))
print ('Percentage of variance explained by both components of randomData: {0:.1f}%'
       .format(varianceRandom2 * 100))
print ('\nPercentage of variance explained by the first component of correlatedData: {0:.1f}%'.
       format(varianceCorrelated1 * 100))
print ('Percentage of variance explained by both components of correlatedData: {0:.1f}%'
       .format(varianceCorrelated2 * 100))
print ('\nPercentage of variance explained by the first two components of threeDData: {0:.1f}%'
       .format(varianceThreeD2 * 100))

Percentage of variance explained by the first component of randomData: 58.8%
Percentage of variance explained by both components of randomData: 100.0%

Percentage of variance explained by the first component of correlatedData: 93.4%
Percentage of variance explained by both components of correlatedData: 100.0%

Percentage of variance explained by the first two components of threeDData: 99.4%


Part 3: Parse, inspect, and preprocess neuroscience data then perform PCA


3a, загрузка данных

we will load and do some basic inspection of the data. The raw data are currently stored as a text file. Every line in the file contains the time series of image intensity for a single pixel in a time-varying image (i.e. a movie). The first two numbers in each line are the spatial coordinates of the pixel, and the remaining numbers are the time series. We'll use first() to inspect a single row, and print just the first 100 characters.¶

import os
baseDir = os.path.join('data')
inputPath = os.path.join('cs190', 'neuro.txt')
inputFile = os.path.join(baseDir, inputPath)

lines = sc.textFile(inputFile)
print lines.first()[0:100]

# Check that everything loaded properly
assert len(lines.first()) == 1397
assert lines.count() == 46460

0 0 103 103.7 103.2 102.7 103.8 102.8 103 103.3 103.8 103.2 102.1 103.5 103.2 102.7 103.1 102.2 102.

3b, парсинг данных
Каждая запись станет туплем, см. ниже

Parse the data into a key-value representation. We want each key to be a tuple of two-dimensional spatial coordinates and each value to be a NumPy array storing the associated time series.
Write a function that converts a line of text into a (tuple, np.ndarray) pair.
Then apply this function to each record in the RDD, and inspect the first entry of the new parsed data set.
Now would be a good time to cache the data, and force a computation by calling count, to ensure the data are cached.

def parse(line):
    """Parse the raw data into a (`tuple`, `np.ndarray`) pair.

    Note:
        You should store the pixel coordinates as a tuple of two ints and the elements of the pixel intensity
        time series as an np.ndarray of floats.

    Args:
        line (str): A string representing an observation.  Elements are separated by spaces.  The
            first two elements represent the coordinates of the pixel, and the rest of the elements
            represent the pixel intensity over time.

    Returns:
        tuple of tuple, np.ndarray: A (coordinate, pixel intensity array) `tuple` where coordinate is
            a `tuple` containing two values and the pixel intensity is stored in an NumPy array
            which contains 240 values.
    """
    lst = line.сплит(пробел)
    lst = map(lambda x: float(x.strip()), lst)
    coords = (int(lst[0]), int(lst[1]))
    pixValues = lst[остаток-после-двух:]
    res = (coords, np.asarray(pixValues))
    return res

rawData = lines.map(parse)
rawData.cache()
entry = rawData.first()

print 'Length of movie is {0} seconds'.format(len(entry[1]))
print 'Number of pixels in movie is {0:,}'.format(rawData.count())
print ('\nFirst entry of rawData (with only the first five values of the NumPy array):\n({0}, {1})'
       .format(entry[0], entry[1][:5]))

Length of movie is 240 seconds
Number of pixels in movie is 46,460
First entry of rawData (with only the first five values of the NumPy array):
((0, 0), [ 103.   103.7  103.2  102.7  103.8])

3c, найдем границы датасета, мин, макс

Next we'll do some basic preprocessing on the data. The raw time-series data are in units of image flouresence, and baseline flouresence varies somewhat arbitrarily from pixel to pixel. First, compute the minimum and maximum values across all pixels

#allMean = (rawData.map(lambda (xy, vals): sum(vals)/len(vals)).sum()) / rawData.count()
mn = (rawData
      .map(lambda (xy, vals): минимум(vals))
      .минимум
)
mx = (rawData
      .map(lambda (xy, vals): максимум(vals))
      .максимум
)
print mn, mx

100.6 940.8

Как меняется яркость пикселей со временем, график


example = rawData.filter(lambda (k, v): np.std(v) > 100).values().first()
plt.plot(range(len(example)), example, c='#8cbfd0', linewidth='3.0')

3d, масштабирование данных

To convert from these raw flouresence units to more intuitive units of fractional signal change, write a function that takes a time series for a particular pixel and subtracts and divides by the mean. Then apply this function to all the pixels. Confirm that this changes the maximum and minimum values

def rescale(ts):
    """Take a np.ndarray and return the standardized array by subtracting and dividing by the mean.

    Note:
        You should first subtract the mean and then divide by the mean.

    Args:
        ts (np.ndarray): Time series data (`np.float`) representing pixel intensity.

    Returns:
        np.ndarray: The times series adjusted by subtracting the mean and dividing by the mean.
    """
    mn = np.среднее(ts)
    #res = map(lambda x: (x минус mn) поделить mn, ts)
    res = (ts минус mn) поделить mn
    res = np.asarray(res)
    return res

scaledData = rawData.mapValues(lambda v: rescale(v))
mnScaled = scaledData.map(lambda (k, v): v).map(lambda v: min(v)).min()
mxScaled = scaledData.map(lambda (k, v): v).map(lambda v: max(v)).max()

print mnScaled, mxScaled

-0.271512880125 0.905448764348

нормализованные данные на графике



example = scaledData.filter(lambda (k, v): np.std(v) > 0.1).values().first()
plt.plot(range(len(example)), example, c='#8cbfd0', linewidth='3.0')

3e, сожмем датасет с 240 секунд до 3

We now have a preprocessed dataset with n=46460 pixels and d=240 seconds of time series data for each pixel.
We can interpret the pixels as our observations and each pixel value in the time series as a feature.
We would like to find patterns in brain activity during this time series, and we expect to find correlations over time.
We can thus use PCA to find a more compact representation of our data and allow us to visualize it.
Use the pca function from Part (2a) to perform PCA on the preprocessed neuroscience data with k=3, resulting in a new low-dimensional 46460 by 3 dataset.
The pca function takes an RDD of arrays, but data is an RDD of key-value pairs, so you'll need to extract the values.

# Run pca using scaledData
componentsScaled, scaledScores, eigenvaluesScaled = pca(
    scaledData.map(lambda (k,v): v),
    k=3)

Как сжатые данные выглядят на графике

Now, we'll view the scores for the top two component as images.
Note that we reshape the vectors by the dimensions of the original image, 230 x 202.
These graphs map the values for the single component to a grayscale image. This provides us with a visual representation which we can use to see the overall structure of the zebrafish brain and to identify where high and low values occur.
However, using this representation, there is a substantial amount of useful information that is difficult to interpret.
In the next visualization, we'll see how we can improve interpretability by combining the two principal components into a single image using a color mapping


scoresScaled = np.vstack(scaledScores.collect())

imageOneScaled = scoresScaled[:,0].reshape(230, 202).T
image = plt.imshow(imageOneScaled,interpolation='nearest', aspect='auto', cmap=cm.gray)



imageTwoScaled = scoresScaled[:,1].reshape(230, 202).T
image = plt.imshow(imageTwoScaled,interpolation='nearest', aspect='auto', cmap=cm.gray)

А отобразив компоненты на цвета, можно получить картинку еще интереснее


brainmap = polarTransform(2.0, [imageOneScaled, imageTwoScaled])
image = plt.imshow(brainmap,interpolation='nearest', aspect='auto')

Part 4: Perform feature-based aggregation followed by PCA


4a, фокус в том, что 240 секунд замеров в нашем датасете это измерения реакции на 12 шаблонов, по 20 секунд каждое измерение.
Поэтому нам надо научиться аггрегировать данные. В данном случае, через (кто бы мог подумать) перемножение матриц.

For this exercise, you'll create several arrays that perform different types of aggregation

vector = np.array([0., 1., 2., 3., 4., 5.])

# Create a multi-dimensional array that when multiplied (using .dot) against vector, results in
# a two element array where the first element is the sum of the 0, 2, and 4 indexed elements of
# vector and the second element is the sum of the 1, 3, and 5 indexed elements of vector.
# This should be a 2 row by 6 column array
sumEveryOther = np.array([[1, 0, 1, 0, 1, 0], 
                          [0, 1, 0, 1, 0, 1]])

# Create a multi-dimensional array that when multiplied (using .dot) against vector, results in a
# three element array where the first element is the sum of the 0 and 3 indexed elements of vector,
# the second element is the sum of the 1 and 4 indexed elements of vector, and the third element is
# the sum of the 2 and 5 indexed elements of vector.
# This should be a 3 row by 6 column array
sumEveryThird = np.array([[1, 0, 0, 1, 0, 0],
                          [0, 1, 0, 0, 1, 0],
                          [0, 0, 1, 0, 0, 1]])

# Create a multi-dimensional array that can be used to sum the first three elements of vector and
# the last three elements of vector, which returns a two element array with those values when dotted
# with vector.
# This should be a 2 row by 6 column array
sumByThree = np.array([[1, 1, 1, 0, 0, 0], 
                       [0, 0, 0, 1, 1, 1]])

# Create a multi-dimensional array that sums the first two elements, second two elements, and
# last two elements of vector, which returns a three element array with those values when dotted
# with vector.
# This should be a 3 row by 6 column array
sumByTwo = np.array([ [1, 1, 0, 0, 0, 0],
                      [0, 0, 1, 1, 0, 0],
                      [0, 0, 0, 0, 1, 1]])

print 'sumEveryOther.dot(vector):\t{0}'.format(sumEveryOther.dot(vector))
print 'sumEveryThird.dot(vector):\t{0}'.format(sumEveryThird.dot(vector))
print '\nsumByThree.dot(vector):\t{0}'.format(sumByThree.dot(vector))
print 'sumByTwo.dot(vector): \t{0}'.format(sumByTwo.dot(vector))

sumEveryOther.dot(vector):      [ 6.  9.]
sumEveryThird.dot(vector):      [ 3.  5.  7.]
sumByThree.dot(vector): [  3.  12.]
sumByTwo.dot(vector):   [ 1.  5.  9.]

4b, использование np.tile, np.eye для заполнения матриц


In this exercise, recreate sumEveryOther and sumEveryThird using np.tile and np.eye

# Use np.tile and np.eye to recreate the arrays
sumEveryOtherTile = np.tile(np.eye(2),3)
sumEveryThirdTile = np.tile(np.eye(3),2)

print sumEveryOtherTile
print 'sumEveryOtherTile.dot(vector): {0}'.format(sumEveryOtherTile.dot(vector))
print '\n', sumEveryThirdTile
print 'sumEveryThirdTile.dot(vector): {0}'.format(sumEveryThirdTile.dot(vector))

[[ 1.  0.  1.  0.  1.  0.]
 [ 0.  1.  0.  1.  0.  1.]]

sumEveryOtherTile.dot(vector): [ 6.  9.]

[[ 1.  0.  0.  1.  0.  0.]
 [ 0.  1.  0.  0.  1.  0.]
 [ 0.  0.  1.  0.  0.  1.]]

sumEveryThirdTile.dot(vector): [ 3.  5.  7.]

4c, использование np.kron для заполнения матриц


For this exercise, you'll recreate the sumByThree and sumByTwo arrays using np.kron, np.eye, and np.ones. Note that np.ones creates an array of all ones

# Use np.kron, np.eye, and np.ones to recreate the arrays
sumByThreeKron = np.kron(
#    np.array([[1, 0],
#              [0, 1]]),
    np.eye(2),
#    np.array([1, 1, 1])
    np.ones((1,3))
)
sumByTwoKron = np.kron(
#    np.array([[1,0,0],
#              [0,1,0],
#              [0,0,1]]),
    np.eye(3),
#    np.array([1, 1])
    np.ones((1,2))
)

print sumByThreeKron
print 'sumByThreeKron.dot(vector): {0}'.format(sumByThreeKron.dot(vector))
print '\n', sumByTwoKron
print 'sumByTwoKron.dot(vector): {0}'.format(sumByTwoKron.dot(vector))

[[ 1.  1.  1.  0.  0.  0.]
 [ 0.  0.  0.  1.  1.  1.]]

sumByThreeKron.dot(vector): [  3.  12.]

[[ 1.  1.  0.  0.  0.  0.]
 [ 0.  0.  1.  1.  0.  0.]
 [ 0.  0.  0.  0.  1.  1.]]

sumByTwoKron.dot(vector): [ 1.  5.  9.]

4d, теперь можно аггрегировать наш датасет

we'll first study the temporal aspects of neural response, by aggregating our features by time. In other words, we want to see how different pixels (and the underlying neurons captured in these pixels) react in each of the 20 seconds after a new visual pattern is displayed, regardless of what the pattern is.
Hence, instead of working with the 240 features individually, we'll aggregate the original features into 20 new features, where the first new feature captures the pixel response one second after a visual pattern appears, the second new feature is the response after two seconds, and so on.
We can perform this aggregation using a map operation.
First, build a multi-dimensional array T that, when dotted with a 240-dimensional vector, sums every 20-th component of this vector and returns a 20-dimensional vector.
Note that this exercise is similar to (4b).
Once you have created your multi-dimensional array T, use a map operation with that array and each time series to generate a transformed dataset. We'll cache and count the output, as we'll be using it again

# Create a multi-dimensional array to perform the aggregation
T = np.tile(
    np.kron(
        np.глаз(двадцать),
        np.единицы((раз, раз))
    ), 12
)

# Transform scaledData using T.  Make sure to retain the keys.
timeData = scaledData.map(lambda (k,v): (k, T.умножить(v)))

timeData.cache()
print timeData.count()
print timeData.first()

46460
((0, 0), array([ 0.00802155,  0.00607693, -0.0075354 ,  0.00121539,  0.02163388,
        0.00121539, -0.03087082,  0.00510462,  0.01191079,  0.02455081,
       -0.0182308 ,  0.00802155, -0.00948002, -0.00948002,  0.02163388,
       -0.02212004,  0.00704924,  0.00121539, -0.01142464, -0.00850771]))

4e, сожмем агрегированный датасет через PCA

We now have a time-aggregated dataset with n=46460 pixels and d=20 aggregated time features, and we want to use PCA to find a more compact representation. Use the pca function from Part (2a) to perform PCA on the this data with k=3, resulting in a new low-dimensional 46,460 by 3 dataset. As before, you'll need to extract the values from timeData since it is an RDD of key-value pairs

componentsTime, timeScores, eigenvaluesTime = pca(
    timeData.map(lambda (k,v): v),
    k=3
)

print 'componentsTime: (first five) \n{0}'.format(componentsTime[:5,:])
print ('\ntimeScores (first three): \n{0}'
       .format('\n'.join(map(str, timeScores.take(3)))))
print '\neigenvaluesTime: (first five) \n{0}'.format(eigenvaluesTime[:5])

componentsTime: (first five) 
[[ 0.27392702 -0.16152431  0.01388556]
 [ 0.09941893 -0.31968127 -0.34738824]
 [-0.03376505 -0.32933108 -0.35606954]
 [-0.12092744 -0.2845482  -0.27232364]
 [-0.18219248 -0.22998061 -0.12248985]]

timeScores (first three): 
[-0.00720617 -0.00292979 -0.00223645]
[ 0.02353076 -0.00197457  0.00362094]
[ 0.01310623  0.00123069 -0.00582974]

eigenvaluesTime: (first five) 
[ 0.77528991  0.05038881  0.01173423  0.0059711   0.00138073]

Посмотрим, что вышло

Let's view the scores from the first two PCs as a composite image. When we preprocess by aggregating by time and then perform PCA, we are only looking at variability related to temporal dynamics. As a result, if neurons appear similar -- have similar colors -- in the resulting image, it means that their responses vary similarly over time, regardless of how they might be encoding direction. In the image below, we can define the midline as the horizontal line across the middle of the brain. We see clear patterns of neural activity in different parts of the brain, and crucially note that the regions on either side of the midline are similar, which suggests that temporal dynamics do not differ across the two sides of the brain



scoresTime = np.vstack(timeScores.collect())
imageOneTime = scoresTime[:,0].reshape(230, 202).T
imageTwoTime = scoresTime[:,1].reshape(230, 202).T
brainmap = polarTransform(3, [imageOneTime, imageTwoTime])
image = plt.imshow(brainmap,interpolation='nearest', aspect='auto')

4f, аггрегируем иначе, в 12 (вместо 20) фич

Next, let's perform a second type of feature aggregation so that we can study the direction-specific aspects of neural response, by aggregating our features by direction.
In other words, we want to see how different pixels (and the underlying neurons captured in these pixels) react when the zebrafish is presented with 12 direction-specific patterns, ignoring the temporal aspect of the reaction.
Hence, instead of working with the 240 features individually, we'll aggregate the original features into 12 new features, where the first new feature captures the average pixel response to the first direction-specific visual pattern, the second new feature is the response to the second direction-specific visual pattern, and so on.

As in Part (4c), we'll design a multi-dimensional array D that, when multiplied by a 240-dimensional vector, sums the first 20 components, then the second 20 components, and so on.
Note that this is similar to exercise (4c).
First create D, then use a map operation with that array and each time series to generate a transformed dataset.
We'll cache and count the output, as we'll be using it again.

# Create a multi-dimensional array to perform the aggregation
D = np.kron(
    np.глаз(двенадцать),
    np.единицы((раз, двадцать))
)

# Transform scaledData using D.  Make sure to retain the keys.
directionData = scaledData.map(lambda (k,v): (k, D.умножить(v)))
directionData.cache()

print directionData.count()
print directionData.first()

46460
((0, 0), array([ 0.03346365,  0.03638058, -0.02195799, -0.02487492,  0.00721129,
        0.00332206, -0.02098568,  0.00915591, -0.00542873, -0.01029027,
        0.0081836 , -0.01417951]))

4g, сожмем получившееся через PCA

We now have a direction-aggregated dataset with n=46460 pixels and d=12 aggregated direction features, and we want to use PCA to find a more compact representation. Use the pca function from Part (2a) to perform PCA on the this data with k=3, resulting in a new low-dimensional 46460 by 3 dataset. As before, you'll need to extract the values from directionData since it is an RDD of key-value pairs

componentsDirection, directionScores, eigenvaluesDirection = pca(
    directionData.map(lambda (k,v): v),
    k=3
)

print 'componentsDirection: (first five) \n{0}'.format(componentsDirection[:5,:])
print ('\ndirectionScores (first three): \n{0}'
       .format('\n'.join(map(str, directionScores.take(3)))))
print '\neigenvaluesDirection: (first five) \n{0}'.format(eigenvaluesDirection[:5])

componentsDirection: (first five) 
[[-0.25952179  0.16201941  0.24947433]
 [-0.31369506 -0.09185175  0.29464223]
 [-0.21716693 -0.35944645  0.35296454]
 [-0.11517273 -0.37356905  0.07169062]
 [ 0.02996577 -0.36272623 -0.14783897]]

directionScores (first three): 
[-0.01622513  0.01322998  0.01322204]
[ 0.00999482  0.0652367  -0.04524758]
[ 0.004646    0.05751097  0.00756383]

eigenvaluesDirection: (first five) 
[ 0.96411048  0.77613553  0.12762987  0.09775924  0.04333691]

Графическое представление полученного

Again, let's view the scores from the first two PCs as a composite image. When we preprocess by averaging across time (group by direction), and then perform PCA, we are only looking at variability related to stimulus direction. As a result, if neurons appear similar -- have similar colors -- in the image, it means that their responses vary similarly across directions, regardless of how they evolve over time. In the image below, we see a different pattern of similarity across regions of the brain. Moreover, regions on either side of the midline are colored differently, which suggests that we are looking at a property, direction selectivity, that has a different representation across the two sides of the brain


scoresDirection = np.vstack(directionScores.collect())
imageOneDirection = scoresDirection[:,0].reshape(230, 202).T
imageTwoDirection = scoresDirection[:,1].reshape(230, 202).T
brainmap = polarTransform(2, [imageOneDirection, imageTwoDirection])
image = plt.imshow(brainmap, interpolation='nearest', aspect='auto')

Вместо заключения:

In the analyses above we have successfully identified regions of the brain that encode particular properties, e.g., a particular temporal pattern or selectivity to a stimulus. However, this is only the first step! These exploratory analyses are typically followed with more targeted investigation, both through analysis and experiment. For example, we might find all neurons that prefer one stimulus direction, and then do an experiment in which we stimulate or inactivate only those neurons and look at the effect on the animal's behavior. Alternatively, we might subdivide neurons into groups based on simple forms of stimulus selectivity like the ones analyzed here, and then estimate coupling across different neuronal populations, i.e. can we predict one population's response as a function of another. This can be framed as a massive pair-wise regression problem, related to techniques you learned earlier in the course, and demanding large-scale implementations.


Вот.
На этом и закончился курс
BerkeleyX: CS190.1x Scalable Machine Learning


В общем, ничего серъезного. Так, дали понюхать инструментарий и методы его использования.





original post http://vasnake.blogspot.com/2015/12/week-5-lab-5.html

2015-12-28

Week 5, part 2

Курс Scalable Machine Learning. Hadoop, Apache Spark, Python, ML -- вот это всё.

Продолжаю конспектировать пройденный курс. Неделя 5, продолжение.
В прошлый раз начали разбирать теорию PCA.
Далее: пятая неделя, лекции, алгоритмы вычисления PCA в распределенной среде, бигдата.

WEEK 5: Principal Component Analysis and Neuroimaging.

PCA ALGORITHM

Освежим в памяти, что такое ортогональные и ортонормальные векторы:

Orthogonal vectors are simply perpendicular vectors,
and one nice property of orthogonal vectors
is that their dot product always equals 0.

Note that a unit vector is simply
a vector whose Euclidean norm equals one.

Orthonormal … These are vectors that are orthogonal and also
have unit norm.



Going back to our example, since a and b are
both unit norm and orthogonal, they
are also orthonormal vectors.
In contrast, b and d, even though they're orthogonal,
are not orthonormal since the Euclidean norm
of d is greater than one.

Now that we're equipped with a better understanding
of these concepts, we could discuss
the interpretation of PCA as an iterative algorithm.

Понятно, ортонормальные векторы, ага.
Теперь как формулируется итеративный алгоритм PCA:

imagine that we want a k dimensional
representation of our data.
... we can imagine taking the following iterative
approach.

At the i-th iteration we aim to find
the direction of maximal variance in the data,
subject to the constraint that this direction is of unit norm
and that it is orthogonal to all directions we
chose in previous iterations.

We can then project our data onto this direction,
and the locations along this direction
become the i-th feature in our new representation.

As a result of this algorithm, we
have found k unit vectors that are pairwise orthogonal,
and thus, these k directions are orthonormal.



DISTRIBUTED PCA

Так как же нам найти Принципиальные Компоненты и получить редуцированный датасет?
4 шага
отцентрировать данные, 
вычислить матрицу ковариантности, 
найти эйгенвекторы, 
умножить данные на эйгенвекторы:

we'll assume that our raw data is not
centered. And so the first step in PCA involves centering our data.
Or in other words, computing the mean of each feature
and subtracting these feature means from each original data
point.

We'll assume that our centered data is stored in an n
by d matrix which we'll call x.

Once we've centered our data, we next
need to compute the sample covariance
matrix, or the scatter matrix.

Note that the scatter matrix is simply the sample covariance
matrix without dividing by n.
PCA returns the same solution in either case,
as discussed in the previous segment.
And we'll work with the scatter matrix in this segment
to simplify our notation.

As we previously discussed, the principal components
of our data equal the eigenvectors of the sample
covariance matrix.

So in the third step, we compute these eigenvectors
by performing an eigendecomposition.

Finally, in order to obtain our k dimensional representation,
we need to multiply our original data by the top k eigenvectors
to compute the PCA scores.



Distributed PCA, Big n, small d
Как посчитать PCA на практике, с учетом того, что количество фич невелико (small d).

we must center our data.
And to do this, we must compute the mean of each feature.
There are d features, and thus d feature means.
And we define the vector m to be the d dimensional vector whose
i-th component is the mean of the i-th feature.

We can compute the mean vector via a simple
reduce operation whereby we sum all of the data points
together.

After we have performed this sum,
we can compute the mean vector, m,
by simply dividing by the number of data points, n.

After computing m on the driver, we
must send it back to the workers so that each data
point can be centered.

Together, the reduce operation to compute m
and the subsequent communication are inexpensive,
as they are linear in d in terms of local storage, computation,
and communication.

Once each worker has access to m,
we can then perform a map operation
to create the centered data points, which
simply involves subtracting m from each original data point.



we next need to compute the scatter matrix.

As in the case of closed-form linear regression,
we can efficiently perform this computation in a distributed
fashion by using outer products.

Матрица ковариантности = X transpose * X, а это это сумма произведений строк датасета.
Очень легко ложится на модель распределенных вычислений.

We'll represent x visually by its rows or data points,
and then we can express this matrix multiplication
as a sum of outer products where each outer product involves
only a single row of x, or a single data point.

Also recall that, in the previous step,
we computed our center data and we stored it
in a data parallel fashion.
So with this context, we can now compute the scatter matrix
as a simple MapReduce operation.

In the map step, we take each point
and compute its outer product with itself.
This is a computational bottleneck
in our PCA computation, but it's distributed
across multiple workers.


In the reduce step, we simply sum over
all of these outer products.


This requires quadratic storage and computation in d,
both of which are feasible since we're assuming that d is small.

Once we've computed the scatter matrix,
we need to perform its eigendecomposition.

Since we want to compute a k dimensional representation
for our data, we're interested in the top k
principal components of our data, which we know
are the top k eigenvectors of a scatter matrix.

We represent these principal components by the d by k matrix P.
As a result of eigendecomposition,
we have access to the top k eigenvectors on the driver.
But now we need to communicate them to the workers
so that they can compute these k dimensional representations
for the data points.

This requires O of dk communication,
which is the communication bottleneck in this algorithm.


Additionally, the eigendecomposition
generally requires cubic time and quadratic space.
But if we only want the top k eigenvalues and eigenvectors,
we can reduce this time complexity to O of d squared k.

Finally, now that each worker has the principal component
stored locally, we can compute the k dimensional
representation for each point via a simple matrix vector
multiply.


This process requires O of dk local computation,
and can be performed via a simple map operation.

Простой алгоритм, но можно упереться в количество фич, превышающее возможности Spark кластера.
Не беда, есть способ проделать PCA в Big N and Big D setting.

Distributed PCA: big d, big n


Ключевая идея заключается в хитром, итеративном способе вычисления ейгенвекторов.

where both N and D are large,
we can only afford storage, computation, and communication that are linear in N & D.

So we can't locally store or operate on the scatter matrix.
Instead, we'll introduce an iterative approach.

Our iterative approach relies on performing
a sequence of matrix vector products
to compute the top K eigenvectors of the scatter matrix.


The most common methods for doing this are
Krylov subspace and random projection based methods.

And Spark's MLlib in particular relies on Krylov subspace methods.

In this approach, the driver provides the workers
with a D dimensional vector on each iteration,
and requires that the workers left multiply this vector
by the scatter matrix.
Overall, the algorithm requires O of K iterations,
or passes over the data, and O of DK local storage.

And notably, this algorithm computes the PCA solution
without ever explicitly computing the covariance
of a scatter matrix.

The first step involves the algorithm
communicating the D dimensional vector, vi, to all workers.

Next (step 2, see below), we need to multiply the scatter matrix by the vector vi
in a distributed fashion.
And we'll denote the result of this matrix multiplication
as the D dimensional vector qi.

The driver then uses qi to update its estimate of p,
which are the top K eigenvectors of the scatter matrix.



We repeat this process, o of k for o of k iterations,
until we converge on our answer for p

Also, note that we're using the letter
i here to denote the iteration number.
And so the vector vi communicated
to the workers in step one changes on each iteration.

step two is interesting.
The challenge is that we want to perform this matrix
multiplication without ever having to explicitly compute
the scatter matrix, or even having to store copies
of both X and X transpose.

And so we need to be a bit clever in how
we perform this computation.
And it turns out that by carefully breaking
this multiplication into two steps,
we're able to work to achieve our desired goal.

We first compute bi, which is an n dimensional vector
equal to x times vi.
We then multiply X transpose by this intermediate result
to obtained qi.

we can efficiently compute steps one
and step two by only storing X in a data parallel fashion.

Remember that bi is an n dimensional vector,
and each component of this vector is simply equal to the
dot product between a row of X and the vector vi.

Since each row of X is an observation,
and since each worker is storing vi locally,
we can simply compute the dot product between each data point
and vi, and then concatenate the results to obtain bi.

We can first perform a map operation, in which we compute
the dot product of each data point and the vector vi.
This requires O of d storage to store vi, and O of nd
distributed computation to compute the dot products.

Next, in the reduce step, we can simply concatenate the results.
Finally, each worker will need to store bi locally
in order to compute the overall result, qi, in the next step.
So we're going to need to communicate bi to each worker.



So overall, this reduce step, combined with the communication of qi,
requires linear time, space, and communication in terms of n.

The following spark code snippet succinctly
summarizes what we've done.
Starting with an RDD of training data,
we first compute a dot product in the map step,
then collect the results, which creates a list.
And finally, we convert this list into a NumPy array.



Now let's consider the second step of this two step process.
In this step, our goal is to compute the product of X transpose and bi.

By inspecting this product, we can interpret this multiplication
as the sum of rescaled data points.

In particular, for the jth data point, we can multiply it by the jth component of the vector bi,
which gives us a new d dimensional vector.

We can similarly rescale each of the other data points,
which gives us a total of n rescaled d dimensional vectors.
Taking the sum of these rescaled vectors gives us qi,
which is our desired result.

In the map step, we simply rescale each vector
by its corresponding component of bi.

Storing bi requires O of n storage,
and computing the rescaled vectors
requires a pass over all of our data,
and thus takes O of nd distributed computation.

In the reduce step, we simply take a sum
of these rescaled vectors.
Each worker can aggregate all of the rescaled vectors
that it is storing locally.
So each worker only needs to send a single D dimensional
vector to the driver, which is the driver then must sum.

Hence, this process is linear in d, in terms of storage,
computation, and communication.

The following spark code snippet summarizes the step,
showing how we can rescale in the map step,
and then sum the vectors in the reduce step.




С теоретической частью всё. Осталась еще одна лабораторка, пятая (№5).

И конспект будет завершен.






original post http://vasnake.blogspot.com/2015/12/week-5-part-2.html

Архив блога

Ярлыки

linux (241) python (191) citation (185) web-develop (170) gov.ru (155) video (123) бытовуха (111) sysadm (100) GIS (97) Zope(Plone) (88) Book (81) programming (81) бурчалки (79) грабли (77) development (73) Fun (72) windsurfing (72) Microsoft (64) hiload (62) opensource (58) internet provider (57) security (57) опыт (55) movie (52) Wisdom (51) ML (47) language (45) hardware (44) JS (41) curse (40) money (40) driving (39) DBMS (38) bigdata (38) ArcGIS (34) history (31) PDA (30) howto (30) holyday (29) Google (27) Oracle (27) virtbox (27) health (26) vacation (24) AI (23) Autodesk (23) SQL (23) Java (22) humor (22) knowledge (22) translate (20) CSS (19) cheatsheet (19) hack (19) tourism (18) Apache (16) Manager (15) web-browser (15) Никонов (15) music (14) todo (14) PHP (13) happiness (13) weapon (13) HTTP. Apache (12) SSH (12) course (12) frameworks (12) functional programming (12) hero (12) im (12) settings (12) HTML (11) SciTE (11) crypto (11) game (11) map (11) scala (10) HTTPD (9) ODF (9) купи/продай (9) benchmark (8) documentation (8) 3D (7) CS (7) DNS (7) NoSQL (7) Photo (7) cloud (7) django (7) gun (7) matroska (7) telephony (7) Microsoft Office (6) VCS (6) bluetooth (6) pidgin (6) proxy (6) Donald Knuth (5) ETL (5) NVIDIA (5) REST (5) bash (5) flash (5) keyboard (5) price (5) samba (5) CGI (4) LISP (4) RoR (4) cache (4) display (4) holywar (4) nginx (4) pistol (4) xml (4) Лебедев (4) IDE (3) IE8 (3) J2EE (3) NTFS (3) RDP (3) USA (3) mount (3) spark (3) Гоблин (3) кухня (3) урюк (3) AMQP (2) ERP (2) IE7 (2) NAS (2) Naudoc (2) PDF (2) address (2) air (2) british (2) coffee (2) font (2) ftp (2) messaging (2) notify (2) sharepoint (2) ssl/tls (2) stardict (2) tests (2) tunnel (2) udev (2) APT (1) CRUD (1) Canyonlands (1) Cyprus (1) DVDShrink (1) Jabber (1) K9Copy (1) Matlab (1) Palanga (1) Portugal (1) VBA (1) WD My Book (1) autoit (1) bike (1) cannabis (1) chat (1) concurrent (1) dbf (1) ext4 (1) holiday (1) idioten (1) krusader (1) license (1) mindmap (1) pneumatic weapon (1) quiz (1) regexp (1) robot (1) science (1) serialization (1) tie (1) vim (1) Науру (1) крысы (1) налоги (1) пианино (1)

Google+ Followers