Copilot Lvl 2
Message 1 of 4

dictionary comprehension when dealing with duplicates

I need to convert categorical data to numerical values for a machine learning project. However, many of the categorical values are duplicates, and in a dictionary comprehension the last occurrence of a duplicate key determines the value in the final dictionary.


This will cause issues when analysing the data, so I am wondering whether anyone has dealt with this issue and how I can resolve it.
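
To illustrate the behaviour described above, here is a minimal sketch with made-up values (the `tracks` list and all variable names are hypothetical, not taken from the actual data). When a dictionary comprehension meets the same key more than once, the later value overwrites the earlier one. One possible way around this, assuming the goal is a single integer code per category, is to build the mapping from the unique categories, so duplicates in the data no longer matter.

```python
# Hypothetical categorical column with repeated values
tracks = ["Ascot", "Aintree", "Ascot", "Epsom", "Aintree"]

# Keyed on the category: a repeated key is overwritten, so the last index wins
last_index = {track: i for i, track in enumerate(tracks)}
# {'Ascot': 2, 'Aintree': 4, 'Epsom': 3}

# Build the mapping from the sorted unique categories instead
codes = {track: code for code, track in enumerate(sorted(set(tracks)))}
# {'Aintree': 0, 'Ascot': 1, 'Epsom': 2}

# Every row, duplicate or not, maps to its category's code
encoded = [codes[track] for track in tracks]
# [1, 0, 1, 2, 0]
```

Note that integer codes like these imply an ordering between categories, which may or may not suit a given model.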



3 Replies
Copilot Lvl 2
Message 2 of 4

Re: dictionary comprehension when dealing with duplicates

Sorry, this is for Python.


Any help or suggestions much appreciated.

GitHub Staff
Message 3 of 4

Re: dictionary comprehension when dealing with duplicates

You will likely get better and more comprehensive responses in a forum that focuses on Python programming, such as the Python section of Stack Overflow.


Having said that, Python data structures are super flexible so there are likely many different ways you could approach your problem.  It would help if you gave some more specifics about the data you are starting with and what exactly you want to do with it.


For example, if you had a list of (category, item) entries and you wanted to count the number of items in each category, then I would recommend using a default dictionary to accumulate a count:


from collections import defaultdict

# Assuming entries like this
entries = [('Fruit', 'Apple'), ('Fruit', 'Banana'), ('Vegetable', 'Carrot')]

# You could count them with a default dictionary
counts = defaultdict(int)
for category, item in entries:
    counts[category] += 1

# dict(counts) at this point would be {'Fruit': 2, 'Vegetable': 1}
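
As a possible variation on the example above, the standard library's `collections.Counter` can do the same tally in one expression, and a `defaultdict(list)` can keep the items themselves rather than just a count. This is only a sketch using the same sample entries; the name `grouped` is introduced here for illustration.

```python
from collections import Counter, defaultdict

# Same sample entries as in the example above
entries = [("Fruit", "Apple"), ("Fruit", "Banana"), ("Vegetable", "Carrot")]

# Counter tallies how often each category appears
counts = Counter(category for category, _ in entries)

# defaultdict(list) keeps every item, so repeated categories accumulate
# instead of overwriting each other
grouped = defaultdict(list)
for category, item in entries:
    grouped[category].append(item)
```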


What are you actually trying to do?

Copilot Lvl 2
Message 4 of 4

Re: dictionary comprehension when dealing with duplicates

Thanks for coming back to me. I think I was on completely the wrong track.


I am trying to learn machine learning in Python for a sports prediction project on horse racing. There is categorical data that needs to be encoded for the ML process, but I think one-hot encoding will do, using pandas dummy variables:


``import pandas as pd  # data is assumed to be an existing pandas DataFrame

cat_dummy = pd.get_dummies(data[["race_track", "race_type", "jockey", "trainer", "sire", "dam"]]) # get dummy variables for categorical data


data_train = data.drop(["race_track", "race_type", "jockey", "trainer", "sire", "dam"], axis=1) # drop previous categorical columns and rename


data_train = data_train.join(cat_dummy) # join categorical dummy with original data``
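
To show what this produces, here is a small self-contained sketch of the same get_dummies, drop and join steps. The DataFrame contents, the `distance` column and the `cat_cols` name are made up for illustration; the real data set will have different columns and values.

```python
import pandas as pd

# Hypothetical stand-in for the race data
data = pd.DataFrame({
    "race_track": ["Ascot", "Aintree", "Ascot"],
    "jockey": ["J1", "J2", "J1"],
    "distance": [1200, 1600, 1200],
})
cat_cols = ["race_track", "jockey"]

# One indicator column per distinct value; repeated values share a column
cat_dummy = pd.get_dummies(data[cat_cols])

# Swap the original categorical columns for the indicator columns
data_train = data.drop(columns=cat_cols).join(cat_dummy)

print(data_train.columns.tolist())
```

Each distinct category gets its own column, so duplicate values are not a problem here. With many distinct jockeys, trainers, sires and dams, though, the number of columns can grow large.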