Harmony Python library

🌐 harmonydata.ac.uk

You can also join our Discord server! If you found Harmony helpful, you can leave us a review!

What does Harmony do?

Psychologists and social scientists often have to match items in different questionnaires, such as "I often feel anxious" and "Feeling nervous, anxious or afraid".
This is called harmonisation.
Harmonisation is a time-consuming and subjective process.
Going through long PDFs of questionnaires and putting the questions into Excel is no fun.
Enter Harmony, a tool that uses natural language processing and generative AI models to help researchers harmonise questionnaire items, even in different languages.

Quick start with the code

Read our guide to contributing to Harmony in CONTRIBUTING.md.

You can run the walkthrough Python notebook (Harmony_example_walkthrough.ipynb in this repository) in Google Colab with a single click.

You can also download an R markdown notebook to run in RStudio, or run the walkthrough R notebook in Google Colab with a single click.

PDF documentation of the R package is available on CRAN.

Looking for examples?

Check out our examples repository at https://github.com/harmonydata/harmony_examples

The Harmony Project

Harmony is a tool using AI which allows you to compare items from questionnaires and identify similar content. You can try Harmony at https://harmonydata.ac.uk/app and you can read our blog at https://harmonydata.ac.uk/blog/.

Who to contact?

You can contact the Harmony team at https://harmonydata.ac.uk/, or Thomas Wood at https://fastdatascience.com/.

🖥 Installation instructions (video)

🖱 Looking to try Harmony in the browser?

Visit: https://harmonydata.ac.uk/app/

You can also visit our blog at https://harmonydata.ac.uk/blog/

✅ You need Tika if you want to extract instruments from PDFs

Download and install Java if you don't have it already. Then download Apache Tika from https://tika.apache.org/download.html and run it on your computer:

java -jar tika-server-standard-2.3.0.jar
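Before loading PDFs, it can help to confirm that the Tika server is reachable. The sketch below is not part of Harmony. It assumes Tika is listening on its default port 9998 on localhost, and the helper name is_tika_running is hypothetical.

```python
import urllib.error
import urllib.request


def is_tika_running(url="http://localhost:9998/tika", timeout=2.0):
    """Return True if an HTTP GET to the Tika server succeeds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error all count as "not running".
        return False


if __name__ == "__main__":
    print("Tika reachable:", is_tika_running())
```

If this prints False, start Tika with the java command above (or adjust the URL if your server runs elsewhere) before extracting instruments from PDFs.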

Requirements

You need a Windows, Linux or Mac system with

Python 3.8 or above
the requirements in requirements.txt
Java (if you want to extract items from PDFs)
Apache Tika (if you want to extract items from PDFs)

🖥 Installing Harmony Python package

You can install from PyPI.

pip install harmonydata

Loading all models

Harmony uses spaCy to help with text extraction from PDFs. spaCy models can be downloaded with the following command in Python:

import harmony
harmony.download_models()

Matching example instruments

instruments = harmony.example_instruments["CES_D English"], harmony.example_instruments["GAD-7 Portuguese"]
match_response = harmony.match_instruments(instruments)

questions = match_response.questions
similarity = match_response.similarity_with_polarity
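To see which items Harmony considers closest, you can rank the cells of the similarity matrix. The following is a sketch rather than part of the Harmony API: strongest_pairs is a hypothetical helper, and the labels and matrix are made-up stand-ins for the question texts and for match_response.similarity_with_polarity. It assumes the matrix is square with one row and column per question, and that a negative value marks a pair with opposite polarity, so pairs are ranked by absolute value.

```python
def strongest_pairs(labels, similarity, top_n=3):
    """Return the top_n (score, label_a, label_b) triples by absolute similarity."""
    pairs = []
    for i in range(len(labels)):
        # Upper triangle only: skips self-matches and mirrored duplicates.
        for j in range(i + 1, len(labels)):
            pairs.append((float(similarity[i][j]), labels[i], labels[j]))
    pairs.sort(key=lambda pair: abs(pair[0]), reverse=True)
    return pairs[:top_n]


# Illustrative values only; a real matrix comes from match_response.
labels = ["I feel anxious", "I feel nervous", "I feel calm"]
similarity = [
    [1.0, 0.8, -0.6],
    [0.8, 1.0, -0.5],
    [-0.6, -0.5, 1.0],
]

for score, first, second in strongest_pairs(labels, similarity, top_n=2):
    print(f"{score:+.2f}  {first}  <->  {second}")
```

The helper only uses `similarity[i][j]` indexing, so it should work with a NumPy array as well as with nested lists.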

How to load a PDF, Excel or Word into an instrument

harmony.load_instruments_from_local_file("gad-7.pdf")

Optional environment variables

As an alternative to downloading models, you can set environment variables so that Harmony calls spaCy on a remote server. This is only necessary if you are making a server deployment of Harmony.

HARMONY_DATA_PATH - determines where data files are stored. Defaults to HOME DIRECTORY/harmony
HARMONY_NO_PARSING - set to 1 to import a lightweight variant of Harmony which doesn't support PDF parsing.
HARMONY_NO_MATCHING - set to 1 to import a lightweight variant of Harmony which doesn't support matching.
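These variables can also be set from Python instead of the shell. The sketch below assumes they are read when Harmony is imported, which the "import a lightweight variant" wording suggests, so they are set first. The directory name is a placeholder.

```python
import os
from pathlib import Path

# Placeholder location for Harmony's data files; choose any writable directory.
data_path = Path.home() / "harmony_data"

os.environ["HARMONY_DATA_PATH"] = str(data_path)
os.environ["HARMONY_NO_PARSING"] = "1"  # lightweight variant without PDF parsing

# Import Harmony only after the variables are set:
# import harmony
```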

Creating instruments from a list of strings

You can also create instruments quickly from a list of strings:

from harmony import create_instrument_from_list, match_instruments
instrument1 = create_instrument_from_list(["I feel anxious", "I feel nervous"])
instrument2 = create_instrument_from_list(["I feel afraid", "I feel worried"])

match_response = match_instruments([instrument1, instrument2])

Loading instruments from PDFs

If you have a local file, you can load it into a list of Instrument instances:

from harmony import load_instruments_from_local_file
instruments = load_instruments_from_local_file("gad-7.pdf")

Matching instruments

Once you have some instruments, you can match them with each other with a call to match_instruments.

from harmony import match_instruments
match_response = match_instruments(instruments)

match_response.questions is a list of the questions passed to Harmony, in order.
match_response.similarity_with_polarity is the similarity matrix returned by Harmony.
match_response.query_similarity is the degree of similarity of each item to an optional query passed as argument to match_instruments.
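If you pass a query, the scores in query_similarity can be used to rank items by relevance. The sketch below is an illustration, not Harmony code: rank_by_query is a hypothetical helper, the texts and scores are invented, and it assumes query_similarity holds one score per question in the same order as match_response.questions.

```python
def rank_by_query(question_texts, query_similarity):
    """Pair each question with its query score, sorted from most to least similar."""
    if len(question_texts) != len(query_similarity):
        raise ValueError("expected one similarity score per question")
    scored = zip(question_texts, (float(score) for score in query_similarity))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Invented values standing in for real Harmony output.
question_texts = ["I feel anxious", "I sleep badly", "I feel nervous"]
query_similarity = [0.71, 0.18, 0.64]

for text, score in rank_by_query(question_texts, query_similarity):
    print(f"{score:.2f}  {text}")
```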

⇗⇗ Using a different vectorisation function

Harmony defaults to the sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 model from HuggingFace. However, you can use other sentence transformers from HuggingFace by setting the environment variable HARMONY_SENTENCE_TRANSFORMER_PATH before importing Harmony:

export HARMONY_SENTENCE_TRANSFORMER_PATH=sentence-transformers/distiluse-base-multilingual-cased-v2

Using OpenAI or other LLMs for vectorisation

Any word vector representation can be used by Harmony. The example below works for OpenAI's text-embedding-ada-002 model as of April 2025, provided you have created a paid OpenAI account. Since LLMs are progressing rapidly, we have chosen not to integrate Harmony directly with the OpenAI client libraries, but instead allow you to pass Harmony any vectorisation function of your choice.

import numpy as np
from harmony import match_instruments_with_function, example_instruments
from openai import OpenAI

# The client reads the API key from the OPENAI_API_KEY environment variable.
client = OpenAI()
model_name = "text-embedding-ada-002"


def convert_texts_to_vector(texts):
    # Request one embedding per input text and stack them into a 2-D array.
    response = client.embeddings.create(input=texts, model=model_name)
    return np.asarray([item.embedding for item in response.data])


instruments = example_instruments["CES_D English"], example_instruments["GAD-7 Portuguese"]
match_response = match_instruments_with_function(instruments, None, convert_texts_to_vector)
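The vectorisation function takes a list of texts and returns a 2-D array with one row per text. To make that contract concrete without calling a paid API, here is a toy offline vectoriser. It is a sketch for illustration only: hashed_bag_of_words is a hypothetical name, and a bag-of-words hash is a crude stand-in that will match far less well than a real embedding model.

```python
import hashlib

import numpy as np


def hashed_bag_of_words(texts, dim=64):
    """Toy vectoriser: one row per text, each word counted in a hashed bucket."""
    vectors = np.zeros((len(texts), dim))
    for row, text in enumerate(texts):
        for word in text.lower().split():
            # hashlib gives a stable hash across runs, unlike the built-in hash().
            digest = hashlib.md5(word.encode("utf-8")).hexdigest()
            vectors[row, int(digest, 16) % dim] += 1.0
    # Scale non-empty rows to unit length so dot products are cosine similarities.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    return vectors / norms


vectors = hashed_bag_of_words(["I feel anxious", "I feel nervous", ""])
print(vectors.shape)
```

A function like this can be passed as the third argument to match_instruments_with_function in place of convert_texts_to_vector.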

💻 Do you want to run Harmony in your browser locally?

Download and install Docker, then open a Terminal and run:

docker run -p 8000:80 harmonydata/harmonyapi

Then go to http://localhost:8000 in your browser to see the API.

You can now install and run the front end locally: https://www.youtube.com/watch?v=1xp3Uh6dptg

Looking for the Harmony API?

Visit: https://github.com/harmonydata/harmonyapi

📰 The code for training the PDF extraction is here: https://github.com/harmonydata/pdf-questionnaire-extraction

Docker images

If you are a Docker user, you can run Harmony from a pre-built Docker image.

https://hub.docker.com/repository/docker/harmonydata/harmonyapi - just the Harmony API
https://hub.docker.com/repository/docker/harmonydata/harmonylocal - Harmony API and React front end

Contributing to Harmony

If you'd like to contribute to this project, you can contact us at https://harmonydata.ac.uk/ or make a pull request on our GitHub repository. You can also raise an issue.

Developing Harmony

🧪 Automated tests

Test code is in the tests/ folder and uses unittest.

The testing tool tox is used in the automation with GitHub Actions CI/CD. Since the PDF extraction also needs Java and Tika installed, you cannot run the unit tests without first installing Java and Tika. See above for instructions.

🧪 Use tox locally

Install tox and run it:

pip install tox
tox

In our configuration, tox runs a check of the source distribution using check-manifest, setuptools's check, and the unit tests using pytest. check-manifest requires your repository to be git-initialized (git init) and to have its files added (git add .) at least. You don't need to install check-manifest or pytest yourself; tox installs them in a separate environment.

The automated tests are run against several Python versions, but on your machine you might have only one version of Python. If that is Python 3.9, run:

tox -e py39
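If you are unsure which environment name matches your interpreter, the short sketch below builds it from the running Python version. It assumes tox.ini defines an environment with that name.

```python
import sys

# tox names its Python environments "py" + major + minor, e.g. py39 for Python 3.9.
env_name = f"py{sys.version_info.major}{sys.version_info.minor}"
print(f"tox -e {env_name}")
```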

Thanks to GitHub Actions' automated process, you don't need to generate distribution files locally.

⚙️Continuous integration/deployment to PyPI

This package is based on the template https://pypi.org/project/example-pypi-package/

This package

uses GitHub Actions for both testing and publishing
is tested when pushing to the master or main branch, and is published when a release is created
includes test files in the source distribution
uses setup.cfg for version single-sourcing (setuptools 46.4.0+)

⚙️Re-releasing the package manually

The code to re-release Harmony on PyPI is as follows:

source activate py311
pip install twine
rm -rf dist
python setup.py sdist
twine upload dist/*

‎😃💁 Who worked on Harmony?

Harmony is a collaboration project between Ulster University, University College London, the Universidade Federal de Santa Maria, and Fast Data Science. Harmony has been funded by Wellcome as part of the Wellcome Data Prize in Mental Health and by the Economic and Social Research Council (ESRC).

The core team at Harmony is made up of:

Dr Bettina Moltrecht, PhD (UCL)
Dr Eoin McElroy (University of Ulster)
Dr George Ploubidis (UCL)
Dr Mauricio Scopel Hoffmann (Universidade Federal de Santa Maria, Brazil)
Thomas Wood (Fast Data Science)

📜 License

Licenses of third party software

Third party dependency	License	Use
Python	BSD-style custom license	Programming language - all of Harmony runs based on Python and so this can't be replaced
Java	Different options available such as Oracle and IBM	Programming language used to run Tika, used for PDF parsing. If we replace Tika we may no longer need Java.
Sentence Transformers	Apache	Library for running transformer models
Transformers	Apache	Library for running transformer models
Pandas	BSD 3-Clause	Handling tables inside Harmony - mainly for reading/writing Excels
Tika	Apache	Parsing PDFs into plain text including OCR. Runs in Java
LXML	BSD	Reading the output of Tika's PDF parsing
Langdetect	Apache	Detecting language of text
XlsxWriter	BSD 2-Clause	Writing Excels
Openpyxl	MIT	Writing Excels
Numpy	custom license which appears to be BSD 3-Clause	Dependency of the transformers libraries
Scikit-Learn	BSD 3-Clause	Machine learning models for extracting the questions from PDFs
Scikit-Learn CRFSuite	MIT	Machine learning models for extracting the questions from PDFs
Scipy	custom license which appears to be BSD 3-Clause	Machine learning models for extracting the questions from PDFs
Huggingface Hub	Apache	Connects to HuggingFace Hub, online catalogue of transformer models

Third party software only used for the API

Third party dependency	License	Use
FastAPI	MIT	Runs the API
Pydantic	MIT	Ensures that data going in and out of the API is consistently formatted
Pydantic Settings	MIT	Ensures that data going in and out of the API is consistently formatted
Uvicorn	BSD 3-Clause	Runs the API
APScheduler	MIT	Periodically downloads Mental Health Catalogue data and similar - could potentially be removed

Third party software only used for using LLMs from cloud providers

Third party dependency	License	Use
VertexAI	Apache	Calls Google Vertex API LLMs
OpenAI	Apache	Calls OpenAI LLMs

📜 How do I cite Harmony?

You can cite our validation paper:

McElroy, Wood, Bond, Mulvenna, Shevlin, Ploubidis, Scopel Hoffmann, Moltrecht, Using natural language processing to facilitate the harmonisation of mental health questionnaires: a validation study using real-world data. BMC Psychiatry 24, 530 (2024), https://doi.org/10.1186/s12888-024-05954-2

A BibTeX entry for LaTeX users is

@article{mcelroy2024using,
  title={Using natural language processing to facilitate the harmonisation of mental health questionnaires: a validation study using real-world data},
  author={McElroy, Eoin and Wood, Thomas and Bond, Raymond and Mulvenna, Maurice and Shevlin, Mark and Ploubidis, George B and Hoffmann, Mauricio Scopel and Moltrecht, Bettina},
  journal={BMC psychiatry},
  volume={24},
  number={1},
  pages={530},
  year={2024},
  publisher={Springer}
}
