I have been trying to solve this for days, and although I have found a similar problem here, How can I vectorize a list using sklearn DictVectorizer, the solution there is overly simplified.
I would like to fit some features into a logistic regression model to predict 'chinese' or 'non-chinese'. I have a raw_name from which I extract two features: 1) the last name, and 2) a list of substrings of the last name; for example, 'Chan' gives ['Ch', 'ha', 'an']. But it seems DictVectorizer doesn't take a list type as part of the dictionary. Following the link above, I tried to create a function list_to_dict, and it successfully returns some dict elements,
    {'substring=co': True, 'substring=or': True, 'substring=rn': True, 'substring=ns': True}
but I have no idea how to incorporate that into my_dict = ... before applying DictVectorizer.
    # coding=utf-8
    import pandas as pd
    from pandas import DataFrame, Series
    import numpy as np
    import nltk
    import re
    import random
    from random import randint
    import sys
    reload(sys)
    sys.setdefaultencoding('utf-8')

    from sklearn.linear_model import LogisticRegression
    from sklearn.feature_extraction import DictVectorizer

    lr = LogisticRegression()
    dv = DictVectorizer()

    # Get csv file into data frame
    data = pd.read_csv("v2-1_2000records_processed_sep2015.csv", header=0, encoding="utf-8")
    df = DataFrame(data)

    # Pandas data frame shuffling
    # (note: reset_index returns a new frame, and x and y below are still read from df)
    df_shuffled = df.iloc[np.random.permutation(len(df))]
    df_shuffled.reset_index(drop=True)

    # Assign x and y variables
    x = df.raw_name.values
    y = df.chinesescan.values

    # Feature extraction functions
    def feature_full_last_name(namestring):
        try:
            last_name = namestring.rsplit(None, 1)[-1]
            if len(last_name) > 1:  # do not accept a name with only 1 character
                return last_name
            else:
                return None
        except:
            return None

    def feature_twoletters(namestring):
        placeholder = []
        try:
            for i in range(0, len(namestring)):
                pair = namestring[i:i+2]
                if len(pair) == 2:
                    placeholder.append(pair)
            return placeholder
        except:
            return []

    def list_to_dict(substring_list):
        try:
            substring_dict = {}
            for i in substring_list:
                substring_dict['substring=' + str(i)] = True
            return substring_dict
        except:
            return None

    list_example = ['co', 'or', 'rn', 'ns']
    print list_to_dict(list_example)

    # Transform format of x variables, and spit out a numpy array for all features
    my_dict = [{'two-letter-substrings': feature_twoletters(feature_full_last_name(i)),
                'last-name': feature_full_last_name(i), 'dummy': 1} for i in x]

    print my_dict[3]
Output:

    {'substring=co': True, 'substring=or': True, 'substring=rn': True, 'substring=ns': True}
    {'dummy': 1, 'two-letter-substrings': [u'Co', u'or', u'rn', u'ns'], 'last-name': u'Corns'}
Sample data:

    raw_name        chinesescan
    Jack Anderson   non-chinese
    Po Lee          chinese
If I have understood correctly, you want a way to encode list values so that you end up with a feature dictionary that DictVectorizer can use. (One year too late, but) something like this can be used, depending on the case:
    my_dict_list = []

    for i in x:
        # create a new feature dictionary
        feat_dict = {}
        # add the features that are straightforward
        feat_dict['last-name'] = feature_full_last_name(i)
        feat_dict['dummy'] = 1

        # for the features that have a list of values, iterate over the values and
        # create a custom feature for each value
        for two_letters in feature_twoletters(feature_full_last_name(i)):
            # make sure the naming is unique enough so that no other,
            # unrelated feature will have the same name/key
            feat_dict['two-letter-substrings-' + two_letters] = True

        # save it to the feature dictionary list that will be used in the DictVectorizer
        my_dict_list.append(feat_dict)

    print my_dict_list

    from sklearn.feature_extraction import DictVectorizer
    dict_vect = DictVectorizer(sparse=False)
    transformed_x = dict_vect.fit_transform(my_dict_list)
    print transformed_x
Output:

    [{'dummy': 1, u'two-letter-substrings-er': True, 'last-name': u'Anderson',
      u'two-letter-substrings-on': True, u'two-letter-substrings-de': True,
      u'two-letter-substrings-An': True, u'two-letter-substrings-rs': True,
      u'two-letter-substrings-nd': True, u'two-letter-substrings-so': True},
     {'dummy': 1, u'two-letter-substrings-ee': True, u'two-letter-substrings-Le': True,
      'last-name': u'Lee'}]
    [[ 1.  1.  0.  1.  0.  1.  0.  1.  1.  1.  1.  1.]
     [ 1.  0.  1.  0.  1.  0.  1.  0.  0.  0.  0.  0.]]
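The same flattening can also reuse the `list_to_dict` helper from the question by merging its result into each row's dictionary with `dict.update`. What follows is only a minimal Python 3 sketch under that assumption: `build_features`, `last_name` and `two_letters` are illustrative names with simplified logic, not the original functions, and rows without a usable last name simply get no `last-name` key.

```python
def last_name(name):
    """Return the last whitespace-separated token, or None if it is shorter than 2 chars."""
    parts = name.split()
    if not parts or len(parts[-1]) < 2:
        return None
    return parts[-1]


def two_letters(text):
    """Return all overlapping two-character substrings of text."""
    if not text:
        return []
    return [text[i:i + 2] for i in range(len(text) - 1)]


def list_to_dict(substrings):
    """Turn ['Le', 'ee'] into {'substring=Le': True, 'substring=ee': True}."""
    return {"substring=" + s: True for s in substrings}


def build_features(name):
    """Hypothetical helper: build one flat feature dict per raw name."""
    surname = last_name(name)
    features = {"dummy": 1}
    if surname is not None:
        features["last-name"] = surname
    # Merge one boolean feature per substring into the same flat dict.
    features.update(list_to_dict(two_letters(surname)))
    return features


my_dict = [build_features(n) for n in ["Jack Anderson", "Po Lee"]]
print(my_dict[1])
```

Because every dictionary in `my_dict` is flat, the list can be passed to `DictVectorizer.fit_transform` in the same way as `my_dict_list`.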
Another thing you could do (but I don't recommend it), if you don't want to create as many features as there are values in your lists, is something like this:
    # sorting the values would be a good idea
    feat_dict[frozenset(feature_twoletters(feature_full_last_name(i)))] = True
    # or
    feat_dict[" ".join(feature_twoletters(feature_full_last_name(i)))] = True
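But the first one means that you can't have any duplicate values, and both probably don't make good features, especially if you need fine-tuned and detailed ones. Also, they reduce the chance of two rows sharing a feature, because two rows match only when they have exactly the same combination of two-letter substrings, so the classification probably won't do well.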
Output:

    [{'dummy': 1, 'last-name': u'Anderson',
      frozenset([u'on', u'rs', u'de', u'nd', u'An', u'so', u'er']): True},
     {'dummy': 1, 'last-name': u'Lee', frozenset([u'ee', u'Le']): True}]
    [{'dummy': 1, 'last-name': u'Anderson', u'An nd de er rs so on': True},
     {'dummy': 1, u'Le ee': True, 'last-name': u'Lee'}]
    [[ 1.  0.  1.  1.  0.]
     [ 0.  1.  1.  0.  1.]]
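To see why the joined key generalizes poorly, it can help to compare how many feature keys two similar last names share under each encoding. This is a small Python 3 sketch; the names `Anderson` and `Sanderson` and the helper functions are chosen purely for illustration and are not part of the original code.

```python
def two_letters(text):
    """Return all overlapping two-character substrings of text."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def per_substring_keys(surname):
    """One feature key per substring (the recommended encoding)."""
    return {"two-letter-substrings-" + s for s in two_letters(surname)}


def joined_key(surname):
    """A single feature key for the whole list (the discouraged encoding)."""
    return {" ".join(two_letters(surname))}


a, b = "Anderson", "Sanderson"

shared_separate = per_substring_keys(a) & per_substring_keys(b)
shared_joined = joined_key(a) & joined_key(b)

print(len(shared_separate))  # number of feature keys the two names have in common
print(len(shared_joined))    # the joined keys differ, so nothing is shared
```

With one key per substring the two names still overlap on most of their features, whereas the joined key only matches when the whole substring list is identical.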