I noticed that converting integers to string and checking whether that string exists in a HashSet can be slow for large sets. If I use integers instead, i.e. int key = number1 * 1000000 + number2 * 10000 + number3 * 100 + number4; it becomes about 1/3 faster, but this key is not necessarily unique for large numbers.
I think I am missing some basic knowledge about creating unique keys for a list of numbers and checking for duplicates. Is there a more efficient way than using strings for this problem?
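Before looking at alternatives: the integer key can be made genuinely unique if each of the four numbers is known to fit in 16 bits (that bound is my assumption, not something stated above), by bit-shifting them into a single long instead of multiplying by powers of ten. A minimal sketch:

    // Packs four values into one 64-bit key.
    // Assumes 0 <= value <= 65535 for each input; under that assumption the key is unique.
    public static long Pack(int a, int b, int c, int d)
    {
        return ((long)a << 48) | ((long)b << 32) | ((long)c << 16) | (long)d;
    }

    // Usage: a HashSet<long> then detects duplicates without any string allocation.
    // var seen = new HashSet<long>();
    // bool firstVisit = seen.Add(Pack(number1, number2, number3, number4));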
Say your input is a list of lists (although such an input format is not ideal).
Easier way (using ValueTuple<>):
public static void ProcessDataWithValueTuple(IEnumerable<List<int>> inputs)
{
    var hashset = new HashSet<ValueTuple<int, int, int, int>>();
    foreach (var it in inputs)
    {
        var entry = (it[0], it[1], it[2], it[3]);
        if (hashset.Add(entry))
        {
            // first visit
            // do something
        }
        else
        {
            // already exists in the hashset
            // do something
        }
    }
}
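A quick usage sketch (the sample values here are made up purely to show the duplicate branch being taken):

    var inputs = new List<List<int>>
    {
        new List<int> { 1, 2, 3, 4 },
        new List<int> { 5, 6, 7, 8 },
        new List<int> { 1, 2, 3, 4 },  // duplicate of the first entry
    };
    ProcessDataWithValueTuple(inputs);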
More complicated way (using a custom struct; probably faster, though I'm not sure), but the approach is much better for keeping your data structured than raw numbers in a list:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace PancakeAlgo
{
    public struct DataEntry : IEquatable<DataEntry>
    {
        public int A;
        public int B;
        public int C;
        public int D;

        public DataEntry(int a, int b, int c, int d)
        {
            A = a;
            B = b;
            C = c;
            D = d;
        }

        public override bool Equals(object obj)
        {
            return (obj is DataEntry other) && Equals(other);
        }

        public bool Equals(DataEntry other)
        {
            return other.A == A && other.B == B && other.C == C && other.D == D;
        }

        public override int GetHashCode()
        {
            // this hash code algorithm is recommended by some books;
            // the constants are arbitrarily chosen
            unchecked
            {
                var hashCode = -1117161837;
                hashCode = hashCode * -1521734295 + A;
                hashCode = hashCode * -1521734295 + B;
                hashCode = hashCode * -1521734295 + C;
                hashCode = hashCode * -1521734295 + D;
                return hashCode;
            }
        }
    }

    public static class HashSetTest
    {
        public static void ProcessDataWithCustomizedType(IEnumerable<List<int>> inputs)
        {
            var hashset = new HashSet<DataEntry>();
            foreach (var it in inputs)
            {
                var entry = new DataEntry(it[0], it[1], it[2], it[3]);
                if (hashset.Add(entry))
                {
                    // first visit
                    // do something
                }
                else
                {
                    // already exists in the hashset
                    // do something
                }
            }
        }
    }
}
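If you're on a newer framework (.NET Core 2.1+ or .NET 5+, which may or may not apply to your project), GetHashCode can instead delegate to the built-in System.HashCode combiner rather than hand-rolled constants:

    public override int GetHashCode()
    {
        // Lets the framework mix the four fields; same contract as the manual version above.
        return HashCode.Combine(A, B, C, D);
    }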
The hash code is what you computed before. It's a kind of key: not guaranteed to be unique, but it should produce as few duplicates as possible.
The algorithm here (multiplication and addition) is recommended by various books, such as Effective Java, for fast computation and few collisions (the technical term for such duplicates). It can be any algorithm, but the overall performance of HashSet depends on how good the hash code algorithm is.
Ideally, HashSet has an average time complexity of O(1) if the hash codes are evenly distributed. Its performance degenerates to O(N), a sequential search, if every object has the same hash code.
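As a contrived illustration (mine, not part of the original discussion), a key type whose GetHashCode returns a constant puts every entry in one bucket, so building the set degrades to repeated linear scans:

    using System;
    using System.Collections.Generic;
    using System.Diagnostics;

    // Contrived type: every instance hashes to the same bucket.
    struct BadKey : IEquatable<BadKey>
    {
        public int Value;
        public bool Equals(BadKey other) => other.Value == Value;
        public override bool Equals(object obj) => obj is BadKey b && Equals(b);
        public override int GetHashCode() => 42; // constant => every insert collides
    }

    static class CollisionDemo
    {
        public static void Run()
        {
            var good = new HashSet<int>();
            var bad = new HashSet<BadKey>();

            var sw = Stopwatch.StartNew();
            for (int i = 0; i < 50_000; i++) good.Add(i);
            Console.WriteLine($"int keys:    {sw.ElapsedMilliseconds} ms");

            sw.Restart();
            for (int i = 0; i < 50_000; i++) bad.Add(new BadKey { Value = i });
            Console.WriteLine($"BadKey keys: {sw.ElapsedMilliseconds} ms (quadratic overall)");
        }
    }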
If there isn't that much data or that many operations, I'd say 900 ms still sounds slow. For example, the daylight solver I've been writing can process ~100,000,000,000 loops in ~3 s.
If you can, please post all of the code.
This is part of a larger project; I believe the hash-set search itself is relatively fast.
I will try to split the code into blocks to identify what eats up the time. I am performing some geometry methods that I suspect contribute more than the hashing itself. Nevertheless, I was surprised that switching from strings to integers made such a large difference.
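A simple way to do that split is to time each block with System.Diagnostics.Stopwatch (the block names below are placeholders for your actual geometry and hash-set steps):

    var sw = System.Diagnostics.Stopwatch.StartNew();
    // ... geometry methods here ...
    Console.WriteLine($"geometry: {sw.ElapsedMilliseconds} ms");

    sw.Restart();
    // ... hash-set duplicate check here ...
    Console.WriteLine($"hash set: {sw.ElapsedMilliseconds} ms");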
100 billion in 3 sec is about 33 billion loops per sec. Without CUDA, my processor alone does about 4 billion ops per sec, or 72 billion with its 18 cores operating in perfect parallelism. But that leaves only 2 or so ops per loop, which is not practical.
So are you using CUDA on an Nvidia GPU to implement your code? How many ops are you doing in the loop?
I apologize, I forgot to mention that it's 106,413,782,475 loops before optimization (as a naive implementation). After optimization it's 1,060,320,217 loops, and it's CPU only.
Thanks for the details. This makes more sense now. I code a lot in C++ and have most of my critical code running this fast as well. It often gets a 10X+ speedup with 16-way parallelism, which chews through 40 billion+ ops per sec.
But doing 100 billion loops with, say, 40 ops per loop, or 4 trillion ops per sec, which is 100X faster, is not something I can imagine doing on my 18-core CPU. But algorithmic optimizations that obtain a 100X speedup compared to the first version of the code are quite possible, and that is where I invest a lot of my time.
Mainly I use OpenMP, AVX-512 and a customized scheduler, and my method is much larger than 40 ops per loop. My initial algorithm required about 200 seconds for the same task. It's always funny to do micro-benchmarks and micro-optimization.
So the execution time went from 200 sec to 3 sec, a 67X improvement. This is competitive with what I find with some of my code. I typically start out with a serial implementation of a simple algorithm and then chart its bottlenecks. After some contemplation, I then develop a new algorithm that addresses the shortcomings of the initial code. Finally I move the code to a parallel implementation to achieve a 50X to 100X performance improvement. It is always a surprise that the performance can be improved by this amount. I think you describe this outcome as funny because it is so fun to see the code running so fast that it exceeds our expectations. It certainly does in my case, as I started my first programming back in 1966 on an IBM 1620 computer running at 50 kHz with punch card input. Now my processors run at 4-5 GHz with 2.56 million times as much memory (50 KB to 128 GB)! So happy to be programming these days!