machine learning - DictVectorizer for list as one feature in Python Pandas and Scikit-learn


I have been trying to solve this for days, and although I have found a similar problem here (How can I vectorize a list using sklearn DictVectorizer), the solution there is overly simplified.

I would like to fit some features into a logistic regression model to predict 'chinese' or 'non-chinese'. I have a raw_name from which I extract two features: 1) the last name, and 2) a list of substrings of the last name; for example, 'chan' gives ['ch', 'ha', 'an']. But it seems DictVectorizer doesn't take a list type as part of the dictionary. Following the link above, I tried to create a function list_to_dict, and it successfully returns some dict elements,

    {'substring=co': True, 'substring=or': True, 'substring=rn': True, 'substring=ns': True}

but I have no idea how to incorporate that into my_dict = ... before applying DictVectorizer.

    # coding=utf-8
    import pandas as pd
    from pandas import DataFrame, Series
    import numpy as np
    import nltk
    import re
    import random
    from random import randint
    import sys
    reload(sys)
    sys.setdefaultencoding('utf-8')

    from sklearn.linear_model import LogisticRegression
    from sklearn.feature_extraction import DictVectorizer

    lr = LogisticRegression()
    dv = DictVectorizer()

    # Get the csv file into a data frame
    data = pd.read_csv("v2-1_2000records_processed_sep2015.csv", header=0, encoding="utf-8")
    df = DataFrame(data)

    # Pandas data frame shuffling
    df_shuffled = df.iloc[np.random.permutation(len(df))]
    df_shuffled.reset_index(drop=True)

    # Assign x and y variables
    x = df.raw_name.values
    y = df.chinesescan.values

    # Feature extraction functions
    def feature_full_last_name(namestring):
        try:
            last_name = namestring.rsplit(None, 1)[-1]
            if len(last_name) > 1: # do not accept a name with only 1 character
                return last_name
            else: return None
        except: return None

    def feature_twoletters(namestring):
        placeholder = []
        try:
            for i in range(0, len(namestring)):
                x = namestring[i:i+2]
                if len(x) == 2:
                    placeholder.append(x)
            return placeholder
        except: return []

    def list_to_dict(substring_list):
        try:
            substring_dict = {}
            for i in substring_list:
                substring_dict['substring='+str(i)] = True
            return substring_dict
        except: return None

    list_example = ['co', 'or', 'rn', 'ns']
    print list_to_dict(list_example)

    # Transform the format of the x variables, and spit out a numpy array for all features
    my_dict = [{'two-letter-substrings': feature_twoletters(feature_full_last_name(i)),
        'last-name': feature_full_last_name(i), 'dummy': 1} for i in x]

    print my_dict[3]

Output:

    {'substring=co': True, 'substring=or': True, 'substring=rn': True, 'substring=ns': True}
    {'dummy': 1, 'two-letter-substrings': [u'co', u'or', u'rn', u'ns'], 'last-name': u'corns'}

Sample data:

    raw_name         chinesescan
    jack anderson    non-chinese
    po lee           chinese

If I have understood correctly, you want a way to encode list values so that you end up with a feature dictionary that DictVectorizer can use. (One year too late, but) something like this can be used, depending on the case:

    my_dict_list = []

    for i in x:
        # create a new feature dictionary
        feat_dict = {}
        # add the features that are straightforward
        feat_dict['last-name'] = feature_full_last_name(i)
        feat_dict['dummy'] = 1

        # for the features that have a list of values, iterate over the values and
        # create a custom feature for each value
        for two_letters in feature_twoletters(feature_full_last_name(i)):
            # make sure the naming is unique enough that no other, unrelated
            # feature will have the same name/key
            feat_dict['two-letter-substrings-' + two_letters] = True

        # save the feature dictionary to the list that will be used in DictVectorizer
        my_dict_list.append(feat_dict)

    print my_dict_list

    from sklearn.feature_extraction import DictVectorizer
    dict_vect = DictVectorizer(sparse=False)
    transformed_x = dict_vect.fit_transform(my_dict_list)
    print transformed_x

Output:

    [{'dummy': 1, u'two-letter-substrings-er': True, 'last-name': u'anderson', u'two-letter-substrings-on': True, u'two-letter-substrings-de': True, u'two-letter-substrings-an': True, u'two-letter-substrings-rs': True, u'two-letter-substrings-nd': True, u'two-letter-substrings-so': True}, {'dummy': 1, u'two-letter-substrings-ee': True, u'two-letter-substrings-le': True, 'last-name': u'lee'}]
    [[ 1.  1.  0.  1.  0.  1.  0.  1.  1.  1.  1.  1.]
     [ 1.  0.  1.  0.  1.  0.  1.  0.  0.  0.  0.  0.]]
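For anyone on Python 3, the same idea (one boolean feature per list element instead of one list-valued feature) can be written more compactly. This is only a minimal sketch, not the code from the question: the helper names `last_name_of`, `bigrams` and `name_features` are hypothetical, and leaving out the `last-name` key when no usable last name is found is an assumption on my part.

```python
def last_name_of(raw_name):
    """Return the last whitespace-separated token, or None if it is too short."""
    parts = raw_name.split()
    if not parts or len(parts[-1]) < 2:
        return None
    return parts[-1]


def bigrams(text):
    """Return all overlapping two-character substrings of text."""
    if not text:
        return []
    return [text[i:i + 2] for i in range(len(text) - 1)]


def name_features(raw_name):
    """Build one flat feature dict per name: no list values, one key per bigram."""
    last = last_name_of(raw_name)
    features = {'dummy': 1}
    if last is not None:
        features['last-name'] = last
    # Expand the list into one boolean feature per element.
    features.update({'two-letter-substrings-' + bg: True for bg in bigrams(last)})
    return features


feature_dicts = [name_features(name) for name in ['jack anderson', 'po lee']]
print(feature_dicts[1])
```

The resulting list of dictionaries can be passed to `DictVectorizer.fit_transform` in the same way as `my_dict_list`.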

Another thing you could do (but I don't recommend it), if you don't want to create as many features as there are values in your lists, is something like this:

    # sorting the values would be a good idea
    feat_dict[frozenset(feature_twoletters(feature_full_last_name(i)))] = True
    # or
    feat_dict[" ".join(feature_twoletters(feature_full_last_name(i)))] = True

But the first one means that you can't have any duplicate values, and both probably don't make good features, especially if you need fine-tuned and detailed ones. Also, they reduce the possibility of two rows having the same combination of two-letter substrings, so the classification probably won't do well.

Output:

    [{'dummy': 1, 'last-name': u'anderson', frozenset([u'on', u'rs', u'de', u'nd', u'an', u'so', u'er']): True}, {'dummy': 1, 'last-name': u'lee', frozenset([u'ee', u'le']): True}]
    [{'dummy': 1, 'last-name': u'anderson', u'an nd de er rs on': True}, {'dummy': 1, u'le ee': True, 'last-name': u'lee'}]
    [[ 1.  0.  1.  1.  0.]
     [ 0.  1.  1.  0.  1.]]
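To make the point about shared combinations concrete, here is a small sketch in plain Python 3. The helper names are hypothetical, and the second name, 'chang', is an assumption chosen for illustration ('chan' is the example from the question). It counts how many feature keys two similar last names have in common under each encoding.

```python
def bigrams(text):
    """Return all overlapping two-character substrings of text."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def per_bigram_keys(last_name):
    """Encoding 1: one feature key per two-letter substring."""
    return {'two-letter-substrings-' + bg for bg in bigrams(last_name)}


def joined_key(last_name):
    """Encoding 2: a single feature key holding all substrings at once."""
    return {' '.join(bigrams(last_name))}


a, b = 'chan', 'chang'

# Feature keys that both names would switch on in the vectorized matrix.
shared_per_bigram = per_bigram_keys(a) & per_bigram_keys(b)
shared_joined = joined_key(a) & joined_key(b)

print(sorted(shared_per_bigram))
print(sorted(shared_joined))
```

With one key per substring the two names share three features, so a model has something to generalise from. With the joined key they share none, because 'ch ha an' and 'ch ha an ng' are simply two different keys.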
