VSnake notes: Week 4, Categorical data and OHE

Курс Scalable Machine Learning. Hadoop, Apache Spark, Python, ML -- вот это всё.

Продолжаю конспектировать пройденный курс. Неделя 4.

В прошлый раз было про применение логистической регрессии для получения вероятностных оценок.

Теперь займемся нечисловыми фичами (наблюдениями) и кодированием таких фич.

CATEGORICAL DATA AND ONE-HOT-ENCODING

In this segment, we'll talk about how

we can deal with categorical data when training a model.

In particular, we'll introduce one-hot encoding,

which is a standard technique for converting

categorical features into numerical ones.

В чем проблема: делая модели для предсказания, тренируя эти модели, мы имеем дело с числами. Перемножение матриц, да.

Но часто данные предметной области выражены не числами а, скажем, словами.

Выходит, перед тем как строить модели, надо исходный набор данных трансформировать в набор чисел. Вот об этом и поговорим.

Как вариант, можно пользоваться методами, работающими с нечисловыми данными (Desicion Trees, Naive Bayes). Но это ограничивает наши возможности.

Посмотрим.

Есть два основных нечисловых вида фичей:

– категорные (пол, страна проживания, язык, …)

– ординарные (плохо, хорошо, отлично; низко, средне, высоко; …), отличаются наличием внутреннего порядка
Про ординарные не будем, это отдельная тема. Поговорим про категорные.

First, we have categorical features.

These are features that contain two or more categories,

and these categories have no intrinsic ordering.

The second common class of non-numeric features

are ordinal features.

Ordinal features, again, have two or more categories.

But here, these categories do contain

some ranking information but only

relative ordering information.

Specifically, we could represent a non-numeric feature

with a single numeric feature by representing each category

with a distinct number.

According to this strategy, we would

assign each of the four categories

to integer values ranging from 1 to 4.

At first glance, this seems like a reasonable strategy,

and, in fact, using a single numerical feature

does preserve the ordering for ordinal features.

However, doing this introduces a notion

of closeness between the ordinal categories that

didn't previously exist.

Назначая числовые значения для категорных/ординарных фич, мы дезинформируем модель, добавляя несуществовавший ранее параметр «близости» между значениями.

Так не годится.

Входит One-hot-encoding.

Идея такая:
для каждой категории добавляется переменная, которая принимает значение 0 или 1 в зависимости от значения исходной фичи. Понять это проще на примере.

Фича «Страны», может принимать значения «Аргентина», «Франция», «США».

Имеем три разных значения для фичи.
Значит, создаем три переменные (фактически вектор) и задаем им значения в зависимости от значений исходной фичи.

Допустим, в первой записи наших данных указана страна Аргентина. Тогда наш вектор будет выглядеть так: [1, 0, 0].

Для Франции это будет [0, 1, 0].

Понятно, почему One-hot-encoding?

COMPUTING AND STORING OHE FEATURES

In this segment, we'll go through an example of how

to compute one-hot encoded, or OHE features,

and we'll then discuss how to efficiently store

these OHE features.

Рассмотрим пример, датасет из трех фич (животное, цвет, диета) и трех записей (мышка, черная, -), (сошка, полосатая, мышка), (медведь, черный, рыбка).

Creating OHE features is a two-step process

where the first step involves creating an OHE dictionary.

Первым делом надо посчитать количество кодируемых категорий. Три вида животных + два цвета + две диеты = семь категорий.

Это означает, что итоговая запись будет содержать 7 фич вместо 3 в исходном наборе.

Составим словарик — соответствие, пронумеруем значения: (животное, медведь) = 0; (животное, кошка) = 1; … (цвет, черный) = 3; ...

Итоговая запись кодируется просто: берем вектор из 7 ноликов и меняем 0 на 1 там, где присутствует одна из семи категорий.

Как можно заметить, данные получаются разреженными, сплошные нолики.

Note that for a given categorical feature, at most

a single OHE dummy feature is non-zero.

In other words, most of our OHE features are equal to 0,

or our OHE features are sparse.

Это дает возможность оптимизировать хранение/обработку (Sparse Vectors).

an alternative way to store our features

is to store the indices and values for all non-zero entries

and assume that all other entries equal 0.

So back to our A1 example, this would

mean that we would store two tuples or a total of four

values.

In each of these tuples, the first component

is the feature index, and the second component

is the value of this feature.

При использовании Sparse Vectors экономия на хранении может быть весьма значительной, до двух порядков. Считать (умножать матрицы) тоже можно быстрее

Всё просто замечательно. Но только до некоторого предела.

В какой-то момент производных фич становится слишком много. Ибо в исходных данных слишком много категорий.
Это, кстати, характерно для задачи Click-through Rating.

However, a consequence of OHE features

is that they drastically increase the dimensionality

of our data, as the number of features in an OHE

representation equals the total number of categories

in our data.

This is a major issue in the context of clickthrough rate

prediction, as our initial user, ad,

and publisher features include many names and a lot of text.

OHE features can be problematic whenever

we're working with text data.

For instance, as we discussed earlier in the course,

we often use a bag of words representation

to represent documents with a vocabulary of words.

There are over one million words in the English language alone,

and thus the size of our vocabulary

can be quite large, resulting in high dimensional

representation.

Additionally, we sometimes would like

to create a richer representation of our text data

by considering all pairs of adjacent words,

using a so-called bigram representation.

This bigram representation is similar in spirit

to the idea of quadratic features discussed earlier

in the course.

Bigram representations of text documents

result in even larger feature dimensionality.

Используя OHE, для работы с текстом понадобятся векторы размером в миллионы! И это не считая биграмм, не говоря уж о триграммах.

Высокая размерность фич приводит к статистическим проблемам: нам надо больше записей для обучения, зависимости в данных теряют выраженность, размываются по размерностям.

Вычислительные проблемы тоже надо учитывать: нужно больше хранилища, больше процессорного времени, толще каналы связи.

Входит Feature Hashing

Feature hashing provides an alternative method

for reducing the dimension of OHE features.

By using hashing principles, we can

create a lower dimensional feature representation

and completely skip the process of creating an OHE dictionary.

Feature hashing preserves the sparsity of OHE features,

and is motivated by theoretical arguments.

It's also worth noting that the feature hashing can

be interpreted as an unsupervised learning

method, which is performed as a pre-processing step

for a downstream supervised learning task.

Все, что нам надо знать про хеш в данном контексте, это то, что мы имеем фиксированное количество корзин (buckets) и хеш-функция кладет объект в одну из этих корзин.

In the context of feature hashing,

each feature category is an object,

and we have fewer buckets than feature categories.

So in contrast to the OHE dictionary,

which is a one to one mapping between feature categories

and dummy features, feature hashing uses a many to one

mapping between feature categories and buckets.

Пример, про животных.

Suppose the number of buckets equals m equals 4,

and that we've already selected a hash function to map future

categories to buckets.

Принцип похож на OHE, только вместо словаря, отражающего категории 1:1 в индексы результирующего вектора, у нас хеш-функция, возвращающаяя этот индекс для всех категорий. Могут быть коллизии, не без этого.

Семь категорий кодируются в четыре бита.

It turns out that in spite of the slightly strange behavior,

feature hashing has nice theoretical properties.

In particular, many learning methods

can be interpreted as relying on training data

only in terms of pairwise inner products between the data

points.

And it turns out that under certain conditions,

hashed features lead to good approximations

of these inner products, theoretically.

Moreover, hashed features have been used in practice

and have resulted in good practical performance

on various text classification tasks.

And so in summary, hash features are a reasonable alternative

to OHE features.

Что хорошо, хеширование (препроцессинг данных) хорошо ложится на распределенную среду. Локальные вычисления не требуют коммуникаций с другими нодами, результат можно сохранить как SparseVector-ы.

Вот и вся теория четвертой недели. Далее будет лабораторка.

original post http://vasnake.blogspot.com/2015/11/week-4-categorical-data-and-ohe.html

Tools

VSnake notes

2015-11-02

Week 4, Categorical data and OHE

Комментариев нет:

Отправить комментарий

Архив блога

Ярлыки

Обо мне

Links