I noticed that converting integers to string and checking whether that string exists in a HashSet can be slow for large sets. If I use integers instead, i.e. int key = number1 * 1000000 + number2 * 10000 + number3 * 100 + number4; it becomes about 1/3 faster, but this key is not necessarily unique for large numbers.
I think I am missing some basic knowledge about creating unique keys for a list of numbers and checking for duplicates. Is there a more efficient way than using strings for this problem?
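Before looking at alternatives: the integer key can be made genuinely unique if each of the four numbers is known to fit in 16 bits (that bound is my assumption, not something stated above), by bit-shifting them into a single long instead of multiplying by powers of ten. A minimal sketch:

    // Packs four values into one 64-bit key.
    // Assumes 0 <= value <= 65535 for each input; under that assumption the key is unique.
    public static long Pack(int a, int b, int c, int d)
    {
        return ((long)a << 48) | ((long)b << 32) | ((long)c << 16) | (long)d;
    }

    // Usage: a HashSet<long> then detects duplicates without any string allocation.
    // var seen = new HashSet<long>();
    // bool firstVisit = seen.Add(Pack(number1, number2, number3, number4));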
Say your input is a list of lists (although such an input format is not ideal).
Easier way (using ValueTuple<>):
public static void ProcessDataWithValueTuple(IEnumerable<List<int>> inputs)
{
    var hashset = new HashSet<ValueTuple<int, int, int, int>>();
    foreach (var it in inputs)
    {
        var entry = (it[0], it[1], it[2], it[3]);
        if (hashset.Add(entry))
        {
            // first visit
            // do something
        }
        else
        {
            // already exists in the hashset
            // do something
        }
    }
}
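A quick usage sketch (the sample values here are made up purely to show the duplicate branch being taken):

    var inputs = new List<List<int>>
    {
        new List<int> { 1, 2, 3, 4 },
        new List<int> { 5, 6, 7, 8 },
        new List<int> { 1, 2, 3, 4 },  // duplicate of the first entry
    };
    ProcessDataWithValueTuple(inputs);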
More complicated way (using a custom struct; probably faster, though I'm not sure), but the approach is much better for keeping your data structured than raw numbers in a list:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace PancakeAlgo
{
    public struct DataEntry : IEquatable<DataEntry>
    {
        public int A;
        public int B;
        public int C;
        public int D;

        public DataEntry(int a, int b, int c, int d)
        {
            A = a;
            B = b;
            C = c;
            D = d;
        }

        public override bool Equals(object obj)
        {
            return (obj is DataEntry other) && Equals(other);
        }

        public bool Equals(DataEntry other)
        {
            return other.A == A && other.B == B && other.C == C && other.D == D;
        }

        public override int GetHashCode()
        {
            // this hash code algorithm is recommended by some books;
            // the constants are arbitrarily chosen
            unchecked
            {
                var hashCode = -1117161837;
                hashCode = hashCode * -1521734295 + A;
                hashCode = hashCode * -1521734295 + B;
                hashCode = hashCode * -1521734295 + C;
                hashCode = hashCode * -1521734295 + D;
                return hashCode;
            }
        }
    }

    public static class HashSetTest
    {
        public static void ProcessDataWithCustomizedType(IEnumerable<List<int>> inputs)
        {
            var hashset = new HashSet<DataEntry>();
            foreach (var it in inputs)
            {
                var entry = new DataEntry(it[0], it[1], it[2], it[3]);
                if (hashset.Add(entry))
                {
                    // first visit
                    // do something
                }
                else
                {
                    // already exists in the hashset
                    // do something
                }
            }
        }
    }
}
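If you're on a newer framework (.NET Core 2.1+ or .NET 5+, which may or may not apply to your project), GetHashCode can instead delegate to the built-in System.HashCode combiner rather than hand-rolled constants:

    public override int GetHashCode()
    {
        // Lets the framework mix the four fields; same contract as the manual version above.
        return HashCode.Combine(A, B, C, D);
    }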
The hash code is what you computed before. It's a kind of key: not guaranteed to be unique, but it should produce as few duplicates as possible.
The algorithm here (multiplication and addition) is recommended by various books, such as Effective Java, for fast computation and few collisions (the technical term for such duplicates). It can be any algorithm, but the overall performance of HashSet depends on how good the hash code algorithm is.
Ideally, HashSet has an average time complexity of O(1) if the hash codes are evenly distributed. Its performance degenerates to O(N), a sequential search, if every object has the same hash code.
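As a contrived illustration (mine, not part of the original discussion), a key type whose GetHashCode returns a constant puts every entry in one bucket, so building the set degrades to repeated linear scans:

    using System;
    using System.Collections.Generic;
    using System.Diagnostics;

    // Contrived type: every instance hashes to the same bucket.
    struct BadKey : IEquatable<BadKey>
    {
        public int Value;
        public bool Equals(BadKey other) => other.Value == Value;
        public override bool Equals(object obj) => obj is BadKey b && Equals(b);
        public override int GetHashCode() => 42; // constant => every insert collides
    }

    static class CollisionDemo
    {
        public static void Run()
        {
            var good = new HashSet<int>();
            var bad = new HashSet<BadKey>();

            var sw = Stopwatch.StartNew();
            for (int i = 0; i < 50_000; i++) good.Add(i);
            Console.WriteLine($"int keys:    {sw.ElapsedMilliseconds} ms");

            sw.Restart();
            for (int i = 0; i < 50_000; i++) bad.Add(new BadKey { Value = i });
            Console.WriteLine($"BadKey keys: {sw.ElapsedMilliseconds} ms (quadratic overall)");
        }
    }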
If there isn't that much data or that many operations, I'd say 900 ms still sounds slow. For example, the daylight solver I've been writing can process ~100,000,000,000 loops in ~3 s.
If you can, please post all of the code.
This is part of a larger project; I believe the hash-set search itself is relatively fast.
I will try to split the code into blocks to identify what eats up the time. I am performing some geometry methods that I suspect contribute more than the hashing itself. Nevertheless, I was surprised that switching from strings to integers made such a large difference.
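A simple way to do that split is to time each block with System.Diagnostics.Stopwatch (the block names below are placeholders for your actual geometry and hash-set steps):

    var sw = System.Diagnostics.Stopwatch.StartNew();
    // ... geometry methods here ...
    Console.WriteLine($"geometry: {sw.ElapsedMilliseconds} ms");

    sw.Restart();
    // ... hash-set duplicate check here ...
    Console.WriteLine($"hash set: {sw.ElapsedMilliseconds} ms");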
100 billion in 3 sec is about 33 billion loops per sec. Without CUDA, my processor alone does about 4 billion ops per sec, or 72 billion with its 18 cores operating in perfect parallelism. But that leaves only 2 or so ops per loop, which is not practical.
So are you using CUDA on an Nvidia GPU to implement your code? How many ops are you doing in the loop?
I apologize, I forgot to mention that it's 106,413,782,475 loops before optimization (as a naive implementation). After optimization it's 1,060,320,217 loops, and it's CPU only.
Thanks for the details. This makes more sense now. I code a lot in C++ and have most of my critical code running this fast as well. It often gets a 10X+ speedup with 16-way parallelism, which chews through 40 billion+ ops per sec.
But doing 100 billion loops with, say, 40 ops per loop, or 4 trillion ops per sec, which is 100X faster, is not something I can imagine doing on my 18-core CPU. But algorithmic optimizations that obtain a 100X speedup compared to the first version of the code are quite possible, and that is where I invest a lot of my time.
Mainly I use OpenMP, AVX-512 and a customized scheduler, and my method is much larger than 40 ops per loop. My initial algorithm required about 200 seconds for the same task. It's always funny to do micro-benchmarks and micro-optimization.
So the execution time went from 200 sec to 3 sec, a 67X improvement. This is competitive with what I find with some of my code. I typically start out with a serial implementation of a simple algorithm and then chart its bottlenecks. After some contemplation, I then develop a new algorithm that addresses the shortcomings of the initial code. Finally I move the code to a parallel implementation to achieve a 50X to 100X performance improvement. It is always a surprise that the performance can be improved by this amount. I think you describe this outcome as funny because it is so fun to see the code running so fast that it exceeds our expectations. It certainly does in my case, as I started my first programming back in 1966 on an IBM 1620 computer running at 50 kHz with punch card input. Now my processors run at 4-5 GHz with 2.56 million times as much memory (50 KB to 128 GB)! So happy to be programming these days!