dictionary comprehension when dealing with duplicates

I need to convert categorical data to numerical for a machine learning project. However, many of the categorical values are duplicates, and in a dictionary comprehension a repeated key keeps only the last value assigned to it, so the last occurrence defines the final dictionary.
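To illustrate the behaviour I mean, here is a minimal sketch with made-up pairs (the names `pairs` and `lookup` are just for illustration, not from my real data):

```python
# Made-up (key, value) pairs in which the key "a" appears twice
pairs = [("a", 1), ("b", 2), ("a", 3)]

# A dictionary can hold each key only once, so the later ("a", 3)
# overwrites the earlier ("a", 1)
lookup = {key: value for key, value in pairs}

print(lookup)
```

Only two keys survive, and the value stored for "a" is the one that was seen last.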

This will cause issues when trying to analyse the data, so I am wondering whether anyone has dealt with this issue and how I can resolve it.


Sorry, this is for Python.

Any help or suggestions much appreciated.

You will likely get better and more comprehensive responses in a forum that focuses on Python programming, such as the Python section of Stack Overflow.

Having said that, Python data structures are super flexible so there are likely many different ways you could approach your problem.  It would help if you gave some more specifics about the data you are starting with and what exactly you want to do with it.

For example, if you had a list of (category, item) entries and you wanted to count the number of items in each category, then I would recommend using a default dictionary to accumulate a count:

from collections import defaultdict

# Assuming entries like this
entries = [('Fruit', 'Apple'), ('Fruit', 'Banana'), ('Vegetable', 'Carrot')]

# You could count them with a default dictionary
counts = defaultdict(int)
for category, item in entries:
    counts[category] += 1

# counts at this point would hold {'Fruit': 2, 'Vegetable': 1}
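If the goal is instead to turn each category into a number, one possible approach (a sketch only, and a guess at what you need) is to give every distinct category an integer code the first time it appears. Repeated categories then simply reuse the same code, so duplicates do no harm. The names `codes` and `encoded` are made up for this example:

```python
# Same example entries as above; 'Fruit' appears twice
entries = [('Fruit', 'Apple'), ('Fruit', 'Banana'), ('Vegetable', 'Carrot')]

# Assign the next free integer to a category the first time it is seen;
# setdefault leaves the existing code untouched for repeats
codes = {}
for category, _item in entries:
    codes.setdefault(category, len(codes))

# Translate every entry, duplicates included, into its category code
encoded = [codes[category] for category, _item in entries]

print(codes)
print(encoded)
```

Note that integer codes like these imply an ordering between categories, which may or may not suit the model you feed them to.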

What are you actually trying to do?

Thanks for coming back to me. I think I was on completely the wrong track.

I am trying to learn machine learning in Python for a sports prediction project on horse racing. There is categorical data that needs to be encoded for the ML process. But I think one-hot encoding will do, using pandas dummy variables:

``import pandas as pd

cat_dummy = pd.get_dummies(data[["race_track", "race_type", "jockey", "trainer", "sire", "dam"]])  # get dummy variables for categorical data

data_train = data.drop(["race_track", "race_type", "jockey", "trainer", "sire", "dam"], axis=1)  # drop the original categorical columns, keeping the result under a new name

data_train = data_train.join(cat_dummy)  # join the dummy columns to the remaining data``
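For anyone wanting to try this, here is a minimal self-contained sketch of the same steps on a toy frame. The rows, the names in them, and the `odds` column are invented for illustration, and only two of the categorical columns are used to keep it short:

```python
import pandas as pd

# Toy data with invented values; "Ascot" and "J. Smith" each appear twice
data = pd.DataFrame({
    "race_track": ["Ascot", "Epsom", "Ascot"],
    "jockey": ["J. Smith", "A. Jones", "J. Smith"],
    "odds": [2.5, 4.0, 3.0],  # hypothetical numeric column
})

categorical_cols = ["race_track", "jockey"]

# One indicator column per distinct value; repeated values share a column
cat_dummy = pd.get_dummies(data[categorical_cols])

# Drop the original categorical columns, then attach the indicator columns
data_train = data.drop(categorical_cols, axis=1).join(cat_dummy)

print(data_train.columns.tolist())
```

Because each distinct value gets its own column, duplicates in the categorical data just set the same column on several rows, so nothing is overwritten.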