Merge pull request #1 from miranska/ep-pqm
Add EP-PQM-related code
miranska authored Jan 20, 2022
2 parents 6d7961d + 4e87f19 commit 0826b59
Showing 10 changed files with 3,089 additions and 43 deletions.
37 changes: 32 additions & 5 deletions README.md
@@ -1,8 +1,10 @@
# String Comparison on a Quantum Computer Using Hamming Distance
# String Comparison on a Quantum Computer

This repository contains the files needed to compare and classify strings on a quantum computer, based on the Hamming distance between a string and a set of strings.
For computing the distance between a target string and the closest string in the group of strings, see [preprint](https://arxiv.org/abs/2106.16173). The code for this preprint is packaged in [v0.1.0](https://github.com/miranska/qc-str/releases/tag/v0.1.0). The core code resides in `string_comparison.py`.

Furthermore, this repository extends the above codebase by creating an efficient version of the Parametric Probabilistic Quantum Memory ([P-PQM](https://doi.org/10.1016/j.neucom.2020.01.116)) approach for computing the probability of a string belonging to a particular group of strings (i.e., a machine learning classification problem). We call our algorithm EP-PQM, see [preprint](https://arxiv.org/abs/2201.07265) for details. The code for this preprint is packaged in [v0.2.0](https://github.com/miranska/qc-str/releases/tag/v0.2.0).

This repository contains the files needed to compute the Hamming distance between a string and a set of strings
using a quantum computer; see [preprint](https://arxiv.org/abs/2106.16173) for details.
The code resides in `string_comparison.py`.

## Setup
To set up the environment, run
@@ -11,6 +13,7 @@ pip install -r requirements.txt
```

## Usage examples
### Computing the distance between a target string and the closest string in a group
`test_string_comparison.py` contains unit test cases for `string_comparison.py`. This file can also be interpreted as a
set of examples of how `StringComparator` in `string_comparison.py` should be invoked. To run, do
```bash
@@ -25,8 +28,18 @@ To execute code listings shown in the [preprint](https://arxiv.org/abs/2106.1617
python hd_paper_listings.py
```
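The circuits built by `StringComparator` estimate the Hamming distance between a target string and the strings in a database. As a classical point of reference, the quantity being estimated can be sketched in a few lines (an illustrative sketch only; `hamming` and `closest` are hypothetical helpers, not part of `string_comparison.py`):

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))

def closest(target: str, database: list) -> tuple:
    """Return the database string nearest to the target, with its distance."""
    return min(((s, hamming(target, s)) for s in database), key=lambda p: p[1])

# '0110' differs from '0101' in the last two positions.
print(hamming("0110", "0101"))                    # 2
print(closest("0110", ["1111", "0100", "1001"]))  # ('0100', 1)
```

The quantum approach encodes the whole database in superposition and estimates these distances collectively rather than string by string; see the preprint for the actual construction.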

### EP-PQM

The file `compute_empirical_complexity.py` simulates the generation of quantum circuits for string classification as described in the [preprint](https://arxiv.org/abs/2201.07265). Datasets found in `./datasets` (namely, Balance Scale, Breast Cancer, SPECT Heart, Tic-Tac-Toe Endgame, and Zoo) are taken from the UCI Machine Learning [Repository](https://archive.ics.uci.edu/ml/index.php).
To execute, run
```bash
python compute_empirical_complexity.py
```
The output is saved in `stats.csv` and `stats.json`.
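`compute_empirical_complexity.py` builds circuits under two data encodings, one-hot and label. A rough classical sketch of the difference (made-up values; these helpers are illustrative and mirror the encoding logic in the script, they are not part of the repository):

```python
def one_hot_encode(values, attr_count):
    """Concatenate one-hot blocks: each value v becomes a block of
    attr_count bits with a single 1 at position v."""
    out = ""
    for v in values:
        block = ["0"] * attr_count
        block[v] = "1"
        out += "".join(block)
    return out

def label_encode(values):
    """Keep one symbol per feature; the circuit encodes each symbol directly."""
    return [str(v) for v in values]

# A record with three features, each taking one of three values (0, 1, 2).
record = [2, 0, 1]
print(one_hot_encode(record, 3))  # '001100010' (9 bits)
print(label_encode(record))       # ['2', '0', '1'] (3 symbols)
```

Label encoding uses fewer positions per feature than one-hot, which is the source of the qubit and gate savings that EP-PQM reports.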


## Citation
If you use the algorithm or code, please cite them as follows:
If you use the algorithm or code, please cite them as follows. For computing Hamming distance:
```bibtex
@article{khan2021string,
author = {Mushahid Khan and Andriy Miranskyy},
@@ -40,6 +53,20 @@ If you use the algorithm or code, please cite them as follows:
}
```

For EP-PQM:
```bibtex
@article{khan2022string,
author = {Mushahid Khan and Jean Paul Latyr Faye and Udson C. Mendes and Andriy Miranskyy},
title = {{EP-PQM: Efficient Parametric Probabilistic Quantum Memory with Fewer Qubits and Gates}},
journal = {CoRR},
volume = {abs/2201.07265},
year = {2022},
archivePrefix = {arXiv},
url = {https://arxiv.org/abs/2201.07265},
eprint = {2201.07265}
}
```

## Contact us
If you found a bug or came up with a new feature --
please open an [issue](https://github.com/miranska/qc-str/issues)
224 changes: 224 additions & 0 deletions compute_empirical_complexity.py
@@ -0,0 +1,224 @@
import json

import numpy as np
import pandas as pd
from qiskit.test.mock import FakeQasmSimulator
from qiskit.transpiler.exceptions import TranspilerError

from string_comparison import StringComparator


class NumpyEnc(json.JSONEncoder):
"""
Convert numpy int64 values to a type the JSON encoder can serialize
"""

def default(self, obj):
if isinstance(obj, np.int64):
return int(obj)
return json.JSONEncoder.default(self, obj)


def standardize_column_elements(column):
"""
Map each distinct element of the column to a consecutive integer id (0, 1, 2, ...)
Note that the changes happen in place
:param column: pandas series
:return: updated column, number of unique attributes
"""
dic = {}
numeric_id = 0
for ind in range(0, len(column)):
element_value = column.iloc[ind]
if element_value not in dic:
dic[element_value] = numeric_id
numeric_id += 1
column.iloc[ind] = dic[element_value]
return column, len(dic)


def get_data(file_name, label_location, encoding, columns_to_remove=None,
fraction_of_rows=0.9, random_seed=42):
"""
Take the dataset, reshuffle it, retain a fraction of it (90% by default), and return the strings grouped by label/class
:param file_name: name of the file to read the data from
:param label_location: location of the label column (first or last), applied after undesired columns are removed
:param encoding: type of encoding: either one-hot or label
:param columns_to_remove: list of unwanted columns to remove
:param fraction_of_rows: fraction of rows to retain for analysis
:param random_seed: the value of the random seed, needed for reproducibility
:return: a dict mapping each label to its list of strings, max number of attributes per feature, feature count
"""

df = pd.read_csv(file_name, header=None)

# remove unwanted columns
if columns_to_remove is not None:
df = df.drop(df.columns[columns_to_remove], axis=1)
# update column names
df.columns = list(range(0, len(df.columns)))

# get indexes of data columns
col_count = len(df.columns)
if label_location == "first":
data_columns = range(1, col_count)
label_column = 0
elif label_location == "last":
data_columns = range(0, col_count - 1)
label_column = col_count - 1
else:
raise Exception(f"Unknown label_location {label_location}")

features_cnt = len(data_columns)
# standardize column elements and get max number of attributes in a column/feature
max_attr_cnt = -1
for data_column in data_columns:
updated_column, attr_cnt = standardize_column_elements(df[data_column].copy())
df[data_column] = updated_column
if attr_cnt > max_attr_cnt:
max_attr_cnt = attr_cnt

# retain fraction_of_rows of the strings (drawn at random)
df = df.sample(n=round(len(df.index) * fraction_of_rows), random_state=random_seed)

# get labels
labels = df[label_column].unique()

# generate strings
strings = {}
for label in labels:
single_class = df[df[label_column] == label]
class_strings = []
for ind in range(0, len(single_class.index)):
observation = single_class.iloc[ind]
if encoding == "label":
my_string = []
for feature_ind in data_columns:
my_string.append(str(observation.iloc[feature_ind]))
class_strings.append(my_string)
elif encoding == "one-hot":
my_string = ""
if max_attr_cnt > 2:
for feature_ind in data_columns:
value = observation.iloc[feature_ind]
one_hot = [0] * max_attr_cnt
one_hot[value] = 1
my_string += ''.join(map(str, one_hot))
else: # use binary string for the 2-attribute case
for feature_ind in data_columns:
value = observation.iloc[feature_ind]
one_hot = [value]
my_string += ''.join(map(str, one_hot))

class_strings.append(my_string)
else:
raise Exception(f"Unknown encoding {encoding}")
strings[label] = class_strings

return strings, max_attr_cnt, features_cnt


if __name__ == "__main__":
stats = []

files = [
{"file_name": "./datasets/balance_scale.csv", "label_location": "first", 'labels': ['R'],
'is_laborious': False},
{"file_name": "./datasets/tictactoe.csv", "label_location": "last", 'labels': ['positive'],
'is_laborious': True},
{"file_name": "./datasets/breast_cancer.csv", "label_location": "last", "remove_columns": [0], 'labels': [2],
'is_laborious': True},
{"file_name": "./datasets/zoo.csv", "label_location": "last", "remove_columns": [0], 'labels': [1],
'is_laborious': True},
{"file_name": "./datasets/SPECTrain.csv", "label_location": "first", 'labels': [1], 'is_laborious': False}
]
encodings = ["one-hot", "label"]

is_fake_circuit_off = input("Creation of the circuit for fake simulator is laborious. "
"Do you want to skip it? (Y/n): ") or "Y"
if is_fake_circuit_off.upper() == 'Y':
print("Skip fake simulator")
backend_types = ["abstract"]
elif is_fake_circuit_off.upper() == 'N':
print("Keep fake simulator")
backend_types = ["abstract", "fake_simulator"]
else:
raise ValueError("Please enter y or n.")

for file in files:
if "remove_columns" in file:
remove_columns = file["remove_columns"]
else:
remove_columns = None
for encoding in encodings:

classes, max_attr_count, features_count = get_data(file["file_name"],
label_location=file["label_location"],
encoding=encoding,
columns_to_remove=remove_columns)
# parameters for String Comparisons
if encoding == "one-hot":
is_binary = True
symbol_length = max_attr_count
p_pqm = True
symbol_count = None
elif encoding == "label":
is_binary = False
symbol_length = None
p_pqm = False
symbol_count = max_attr_count
else:
raise Exception(f"Unknown encoding {encoding}")

for label in classes:
if 'labels' in file: # process only a subset of labels present in file['labels']
if label not in file['labels']:
continue
database = classes[label]
target = database[0] # dummy target string
for backend_type in backend_types:
print(f"Analyzing {file['file_name']} for label {label} on {backend_type}")
if backend_type == "abstract":
x = StringComparator(target, database, symbol_length=symbol_length, is_binary=is_binary,
symbol_count=symbol_count,
p_pqm=p_pqm)
elif backend_type == "fake_simulator":
if file['is_laborious']:
print(f" Skipping {file['file_name']} as it requires too much computing power")
continue
print(" Keep only two rows to speed up processing")
database = database[1:3] # keep only two rows to make it simpler to compute the circuit
try:
x = StringComparator(target, database, symbol_length=symbol_length, is_binary=is_binary,
symbol_count=symbol_count,
p_pqm=p_pqm, optimize_for=FakeQasmSimulator(),
optimization_levels=[0], attempts_per_optimization_level=1)
except TranspilerError as e:
print(f"Unexpected {e=}, {type(e)=}")
break
else:
raise Exception(f"Unknown backend type {backend_type}")

circ_decomposed = x.circuit.decompose().decompose(['c3sx', 'rcccx', 'rcccx_dg']).decompose('cu1')
run_stats = {'file_name': file['file_name'], 'encoding': encoding, 'label': label,
'observations_count': len(database), 'features_count': features_count,
'max_attr_count': max_attr_count, 'backend_type': backend_type,
'qubits_count': x.circuit.num_qubits, 'circuit_depth': x.circuit.depth(),
'circuit_count_ops': x.circuit.count_ops(),
'qubits_count_decomposed': circ_decomposed.num_qubits,
'circuit_depth_decomposed': circ_decomposed.depth(),
'circuit_count_ops_decomposed': circ_decomposed.count_ops()
}
stats.append(run_stats)

print("Final stats as a list of dictionaries")
print(stats)

# save stats in JSON format (encode once, via the custom encoder)
with open('stats.json', 'w') as f:
json.dump(stats, f, cls=NumpyEnc)

# let's also save it in a table
pd.json_normalize(stats).to_csv('stats.csv')
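The final `pd.json_normalize` call flattens the nested `count_ops` dictionaries into dot-separated columns, which is why the CSV can hold one gate-count column per operation. A minimal, self-contained illustration (the records below are made up):

```python
import pandas as pd

stats = [
    {"file_name": "demo.csv", "qubits_count": 5,
     "circuit_count_ops": {"h": 4, "cx": 7}},
    {"file_name": "demo2.csv", "qubits_count": 6,
     "circuit_count_ops": {"h": 2, "measure": 6}},
]
df = pd.json_normalize(stats)
# Nested keys become dot-separated columns; ops absent from a record are NaN.
print(sorted(df.columns))
# ['circuit_count_ops.cx', 'circuit_count_ops.h', 'circuit_count_ops.measure',
#  'file_name', 'qubits_count']
```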
80 changes: 80 additions & 0 deletions datasets/SPECTrain.csv
@@ -0,0 +1,80 @@
1,0,0,0,1,0,0,0,1,1,0,0,0,1,1,0,0,0,0,0,0,0,0
1,0,0,1,1,0,0,0,1,1,0,0,0,1,1,0,0,0,0,0,0,0,1
1,1,0,1,0,1,0,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,1,1
1,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,1,0,0,0,0,0,0
1,0,0,0,1,0,0,0,0,1,0,0,0,1,1,0,1,0,0,0,1,0,1
1,1,0,1,1,0,0,0,1,0,1,0,1,1,0,0,0,0,0,0,0,1,1
1,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1
1,0,0,1,0,0,0,1,1,0,0,0,0,1,0,1,0,0,0,0,0,1,1
1,0,1,0,0,0,0,1,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0
1,1,1,0,0,1,0,1,0,0,1,1,1,1,0,0,1,1,1,1,1,0,1
1,1,1,0,0,1,1,1,0,1,1,1,1,0,1,0,0,1,0,1,1,0,0
1,1,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,1
1,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,1,0,1,0,0,1,1
1,1,0,1,1,0,0,1,1,1,0,1,1,1,1,1,1,0,1,1,0,1,1
1,0,1,1,0,0,1,1,1,0,0,0,1,1,0,0,1,1,1,0,1,1,1
1,0,0,1,1,0,0,0,1,1,0,0,0,1,1,0,1,0,0,0,0,1,0
1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
1,1,0,1,0,1,0,1,1,0,1,0,1,1,0,0,0,1,0,0,1,1,0
1,1,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0
1,0,0,0,0,0,0,1,0,0,0,1,1,1,0,0,0,0,0,0,0,1,1
1,1,0,0,0,1,1,0,1,0,0,1,0,0,0,0,0,0,0,1,1,0,0
1,1,1,0,0,1,1,1,0,0,1,1,1,0,0,0,0,0,0,1,0,0,0
1,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0,0
1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0
1,0,0,0,1,0,0,1,0,1,0,0,1,0,1,0,0,0,0,0,0,1,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
1,1,0,1,1,1,0,0,0,0,1,0,0,1,1,0,1,0,0,0,1,1,1
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1,1,0,0,1,1,1,0,0,1,1,1,0,0,0,0,1,0,1,0,0,1
1,0,1,1,1,0,0,1,1,1,0,1,1,1,0,0,1,1,1,0,0,1,1
1,1,0,1,1,1,0,0,1,1,1,0,0,1,1,1,0,0,0,0,0,1,0
1,1,1,1,1,1,1,0,0,0,0,1,1,1,1,0,1,1,1,1,1,0,1
1,1,1,1,0,1,0,1,1,1,1,0,1,1,1,0,1,0,0,0,1,1,1
1,1,0,0,1,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0
1,1,0,1,1,0,0,0,1,1,1,0,0,1,1,1,1,0,0,1,1,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0
0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0
0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1
0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0
0,0,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,1
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
0,1,0,0,0,1,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,1
0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0
0,1,1,1,0,1,0,1,1,1,1,1,0,0,1,0,1,0,0,1,0,1,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
0,1,0,0,0,1,0,1,0,0,1,0,1,0,0,0,0,0,0,1,0,0,0
0,1,0,1,0,0,0,0,1,0,1,0,0,1,0,0,0,0,0,0,0,1,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
0,1,0,0,0,1,1,0,0,1,1,0,0,0,1,0,0,0,0,1,1,0,0
0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0
0,0,0,1,1,0,0,1,0,0,0,0,1,1,1,0,0,0,0,0,0,1,1
0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
