Python __hash__ performance for bulky data

Tech stack:

  • Python 3.8

When this function (which restructures data into the required format) is executed for:

  • a few hundred timestamps (~500), it runs efficiently (0.022 s)
  • bulk input on the order of 100,000+ timestamps, it takes a long time (~40 seconds)

where the length of grouped_values is 250+.

from typing import Dict, List

def re_struct_data(all_timestamps: List, grouped_values: Dict[str, Dict[int, int]]):
    tm_count = len(all_timestamps)

    start_tm = 1607494871
    get_tms = lambda: [None] * tm_count
    data_matrix = {'runTime': get_tms()}

    for i_idx, tm in enumerate(all_timestamps):

        data_matrix['runTime'][i_idx] = float(tm) - start_tm
        for cnl_nm in grouped_values:
            if cnl_nm not in data_matrix:
                data_matrix[cnl_nm] = get_tms()

            value_dict = grouped_values[cnl_nm]
            if tm in value_dict:
                data_matrix[cnl_nm][i_idx] = value_dict[tm]
    return data_matrix

When I profiled this code, I found that a significant amount of time goes into hashing, i.e. checking for the presence or absence of cnl_nm in data_matrix.

Any suggestions to improve this?

It is difficult to guess without examples of what the data instances look like, e.g. how expensive it is to hash your data.

You are querying data_matrix to determine whether you need to create a new entry. You might try collections.defaultdict, which does that for you and lets you skip that query.
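A minimal sketch of that defaultdict approach, reusing the names and base timestamp from the question's code (one behavioral difference to note: a channel that never matches a timestamp will not appear in the result at all, whereas the original creates an all-None list for it):

```python
from collections import defaultdict

def re_struct_data(all_timestamps, grouped_values):
    tm_count = len(all_timestamps)
    start_tm = 1607494871  # base timestamp taken from the question's code

    # Every missing key gets a fresh [None] * tm_count list on first access,
    # so the explicit "cnl_nm not in data_matrix" check disappears.
    data_matrix = defaultdict(lambda: [None] * tm_count)

    for i_idx, tm in enumerate(all_timestamps):
        data_matrix['runTime'][i_idx] = float(tm) - start_tm
        for cnl_nm, value_dict in grouped_values.items():
            if tm in value_dict:
                data_matrix[cnl_nm][i_idx] = value_dict[tm]

    return dict(data_matrix)
```

This removes one membership test per (timestamp, channel) pair, though the inner `tm in value_dict` lookup still runs for every pair, so the loop remains O(timestamps × channels).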

At some point, though, hashing your way through 500,000 dictionary entries is going to take seconds no matter what.