I need to calculate the distance between all possible contact points present in a dataframe using Python 3. My code works, but it is very slow. How can I reduce the run time?
Condition: distances are calculated between all genes that share the same chr string. chr1 is present at row indices 0, 3 and 5, so three combinations will be made:
1) MK1 MI5 45 (62~17)
2) MK1 MR4 11 (62~51)
3) MI5 MR4 34 (17~51)
Similarly, chr2 is present at indices 1 and 2, so the only combination will be LC1 LI6 18 (16~34).
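The pairing rule above can be sketched as follows. This is only a minimal illustration, assuming the sample values from the INPUT table and forming each unordered pair once per chromosome with `itertools.combinations`; the helper name `pairwise_distances` is hypothetical and not part of my actual code.

```python
from itertools import combinations

import pandas as pd


def pairwise_distances(df):
    """Return one row per unordered gene pair that shares a chromosome."""
    rows = []
    # sort=False keeps chromosomes in order of first appearance
    for _, group in df.groupby("chr", sort=False):
        records = zip(group["gene"], group["st"])
        # every unordered pair of rows within this chromosome
        for (g1, s1), (g2, s2) in combinations(records, 2):
            rows.append((g1, g2, abs(s1 - s2)))
    return pd.DataFrame(rows, columns=["gene1", "gene2", "dist"])


# Sample data copied from the INPUT table (the duplicate st1 column is left out)
sample = pd.DataFrame({
    "chr": ["chr1", "chr2", "chr2", "chr1", "chr3", "chr1"],
    "st": [62, 16, 34, 17, 15, 51],
    "gene": ["MK1", "LC1", "LI6", "MI5", "LI6", "MR4"],
})
pairs = pairwise_distances(sample)
print(pairs)
```

chr3 occurs only once, so it contributes no pair.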
INPUT
    chr   st  st1  gene
0   chr1  62  62   MK1
1   chr2  16  16   LC1
2   chr2  34  34   LI6
3   chr1  17  17   MI5
4   chr3  15  15   LI6
5   chr1  51  51   MR4
OUTPUT
gene1  gene2  dist
MK1    MI5    45
MK1    MR4    11
MI5    MK1    45
import pandas as pd

data1 = {'chr': ['chr1', 'chr2', 'chr2', 'chr1', 'chr3', 'chr1'],
         'st': [62, 16, 34, 17, 15, 51],
         'st1': [62, 16, 34, 17, 15, 51],
         'gene': ['MK1', 'LC1', 'LI6', 'MI5', 'LI6', 'MR4']}
data = pd.DataFrame(data1)
chroms = pd.Series(data['chr'].unique())  # unique chromosome labels
cols = ['gene1', 'gene2', 'dist']
rows = []
for chr_num in chroms:
    for i in range(0, len(data)):
        for j in range(0, len(data)):
            # pair two different rows that both lie on the current chromosome
            if i != j and chr_num == data.iloc[i, 0] and chr_num == data.iloc[j, 0]:
                dist = abs(data.iloc[i, 1] - data.iloc[j, 1])
                rows.append([data.iloc[i, 3], data.iloc[j, 3], dist])
# build the result once from the collected rows
all_dist_comb = pd.DataFrame(rows, columns=cols)
all_dist_comb['dist'] = all_dist_comb['dist'].astype(int)
print(all_dist_comb)
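One direction that might avoid the nested loops is a self-merge on the chr column, so that pandas forms the pairs itself. The following is only a sketch under assumptions: it uses the sample data above, keeps both orderings of each pair as the loop does, and has not been benchmarked on my real data. The names `left`, `merged`, and `merged_dist` are illustrative, and the row order of the result may differ from the loop's output.

```python
import pandas as pd

data = pd.DataFrame({
    "chr": ["chr1", "chr2", "chr2", "chr1", "chr3", "chr1"],
    "st": [62, 16, 34, 17, 15, 51],
    "gene": ["MK1", "LC1", "LI6", "MI5", "LI6", "MR4"],
})

# Keep each row's original position so a row is never paired with itself
left = data.reset_index().rename(columns={"index": "row"})

# Self-merge on chr: every row meets every row on the same chromosome
merged = left.merge(left, on="chr", suffixes=("_a", "_b"))
merged = merged[merged["row_a"] != merged["row_b"]]

merged_dist = pd.DataFrame({
    "gene1": merged["gene_a"],
    "gene2": merged["gene_b"],
    "dist": (merged["st_a"] - merged["st_b"]).abs(),
}).reset_index(drop=True)
print(merged_dist)
```

The merge materializes every same-chromosome pair at once, so memory use grows with the square of the largest chromosome group; that trade-off would need checking on large inputs.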