[chore] Update to ruff 0.3.0; update ruff.toml #2517

Merged
1 commit, merged on Feb 29, 2024
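All of the hunks below make the same mechanical change: one blank line is inserted between each example script's module docstring and its first import, which is the layout the ruff 0.3.0 formatter produces. As a minimal sketch of that layout (a hypothetical module, not one of the files touched by this PR):

"""
Example module docstring; the formatter expects a blank line after it.
"""

import math  # the blank line above this first import is what each hunk in this PR adds


def circle_area(radius: float) -> float:
    """Return the area of a circle with the given radius."""
    return math.pi * radius**2


if __name__ == "__main__":
    print(circle_area(1.0))
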
1 change: 1 addition & 0 deletions examples/applications/clustering/agglomerative.py
@@ -3,6 +3,7 @@

Sentences are mapped to sentence embeddings and then agglomerative clustering with a threshold is applied.
"""

from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering
import numpy as np
1 change: 1 addition & 0 deletions examples/applications/clustering/fast_clustering.py
@@ -11,6 +11,7 @@

In this example, we download a large set of questions from Quora and then find similar questions in this set.
"""

from sentence_transformers import SentenceTransformer, util
import os
import csv
1 change: 1 addition & 0 deletions examples/applications/clustering/kmeans.py
@@ -3,6 +3,7 @@

Sentences are mapped to sentence embeddings and then k-mean clustering is applied.
"""

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

(file path not shown)
@@ -4,7 +4,7 @@
when encoding large text collections.
It also demonstrates how to stream data which is helpful in case you don't
want to wait for an extremely large dataset to download, or if you want to
limit the amount of memory used. More info about dataset streaming:
limit the amount of memory used. More info about dataset streaming:
https://huggingface.co/docs/datasets/stream
"""

(file path not shown)
@@ -6,6 +6,7 @@

Then, we re-rank the hits from the Bi-Encoder using a Cross-Encoder.
"""

from sentence_transformers import SentenceTransformer, util
from sentence_transformers import CrossEncoder
import os
1 change: 1 addition & 0 deletions examples/applications/cross-encoder/cross-encoder_usage.py
@@ -3,6 +3,7 @@
sentences in a corpus using a Cross-Encoder for semantic textual similarity (STS).
It output then the most similar sentences for the given query.
"""

from sentence_transformers.cross_encoder import CrossEncoder
import numpy as np

(file path not shown)
@@ -12,6 +12,7 @@
This script requires that you have FAISS installed:
https://github.com/facebookresearch/faiss
"""

from sentence_transformers import SentenceTransformer, models
import numpy as np
from bitext_mining_utils import score_candidates, kNN, file_open
1 change: 1 addition & 0 deletions examples/applications/parallel-sentence-mining/bucc2018.py
@@ -9,6 +9,7 @@
This script requires that you have FAISS installed:
https://github.com/facebookresearch/faiss
"""

from sentence_transformers import SentenceTransformer, models
from collections import defaultdict
import os
1 change: 1 addition & 0 deletions examples/applications/semantic-search/semantic_search.py
@@ -6,6 +6,7 @@

This script outputs for various queries the top 5 most similar sentences in the corpus.
"""

from sentence_transformers import SentenceTransformer, util
import torch

(file path not shown)
@@ -1,13 +1,14 @@
"""
This example demonstrates how we can perform semantic search for scientific publications.

As model, we use SPECTER (https://github.com/allenai/specter), which encodes paper titles and abstracts
As model, we use SPECTER (https://github.com/allenai/specter), which encodes paper titles and abstracts
into a vector space.

When can then use util.semantic_search() to find the most similar papers.

Colab example: https://colab.research.google.com/drive/12hfBveGHRsxhPIUMmJYrll2lFU4fOX06
"""

import json
import os
from sentence_transformers import SentenceTransformer, util
(file path not shown)
@@ -25,6 +25,7 @@
that it aligned for 100 languages. I.e., you can type in a question in various languages and it will
return the closest questions in the corpus (questions in the corpus are mainly in English).
"""

from sentence_transformers import SentenceTransformer, util
import os
import csv
(file path not shown)
@@ -22,6 +22,7 @@
that it aligned for 100 languages. I.e., you can type in a question in various languages and it will
return the closest questions in the corpus (questions in the corpus are mainly in English).
"""

from sentence_transformers import SentenceTransformer, util
import os
import csv
(file path not shown)
@@ -20,6 +20,7 @@
that it aligned for 100 languages. I.e., you can type in a question in various languages and it will
return the closest questions in the corpus (questions in the corpus are mainly in English).
"""

from sentence_transformers import SentenceTransformer, util
import os
import csv
(file path not shown)
@@ -12,6 +12,7 @@

Google Colab example: https://colab.research.google.com/drive/12cn5Oo0v3HfQQ8Tv6-ukgxXSmT3zl35A?usp=sharing
"""

from sentence_transformers import SentenceTransformer, util
import os
import csv
(file path not shown)
@@ -12,6 +12,7 @@

Google Colab Example: https://colab.research.google.com/drive/11GunvCqJuebfeTlgbJWkIMT0xJH6PWF1?usp=sharing
"""

import json
from sentence_transformers import SentenceTransformer, util
import time
(file path not shown)
@@ -17,6 +17,7 @@

Note: Requires NLTK: `pip install nltk`
"""

import nltk
from sentence_transformers import SentenceTransformer, util
import numpy as np
1 change: 1 addition & 0 deletions examples/evaluation/evaluation_inference_speed.py
@@ -6,6 +6,7 @@
OR
python evaluation_inference_speed.py model_name
"""

from sentence_transformers import SentenceTransformer, util
import sys
import os
1 change: 1 addition & 0 deletions examples/evaluation/evaluation_stsbenchmark.py
@@ -6,6 +6,7 @@
OR
python evaluation_stsbenchmark.py model_name
"""

from sentence_transformers import SentenceTransformer, util, LoggingHandler, InputExample
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
import logging
1 change: 1 addition & 0 deletions examples/training/adaptive_layer/adaptive_layer_nli.py
@@ -10,6 +10,7 @@
OR
python adaptive_layer_nli.py pretrained_transformer_model_name
"""

import math
from datasets import load_dataset
from sentence_transformers import models, losses, datasets
1 change: 1 addition & 0 deletions examples/training/adaptive_layer/adaptive_layer_sts.py
@@ -9,6 +9,7 @@
OR
python adaptive_layer_sts.py pretrained_transformer_model_name
"""

from torch.utils.data import DataLoader
import math
from sentence_transformers import SentenceTransformer, LoggingHandler, losses, models, util
(file path not shown)
@@ -6,6 +6,7 @@
See https://public.ukp.informatik.tu-darmstadt.de/reimers/embeddings/
for available word embeddings files
"""

from torch.utils.data import DataLoader
import math
from sentence_transformers import models, losses, util
(file path not shown)
@@ -4,6 +4,7 @@

Note, you can also pass BERT embeddings to the BiLSTM.
"""

from torch.utils.data import DataLoader
import math
from sentence_transformers import models, losses, util
(file path not shown)
@@ -4,6 +4,7 @@

To make the model trainable, we add multiple dense layers to create a Deep Averaging Network (DAN).
"""

from torch.utils.data import DataLoader
import math
from sentence_transformers import models, losses, util
(file path not shown)
@@ -4,6 +4,7 @@


"""

from torch.utils.data import DataLoader
import math
from sentence_transformers import models, losses, util
(file path not shown)
@@ -8,6 +8,7 @@
You can get term-document frequencies from here:
https://public.ukp.informatik.tu-darmstadt.de/reimers/embeddings/wikipedia_doc_frequencies.txt
"""

from torch.utils.data import DataLoader
import math
from sentence_transformers import models, losses, util
1 change: 1 addition & 0 deletions examples/training/cross-encoder/training_nli.py
@@ -7,6 +7,7 @@
Usage:
python training_nli.py
"""

from torch.utils.data import DataLoader
import math
from sentence_transformers import LoggingHandler, util
(file path not shown)
@@ -8,6 +8,7 @@
python training_quora_duplicate_questions.py

"""

from torch.utils.data import DataLoader
import math
from sentence_transformers import LoggingHandler, util
1 change: 1 addition & 0 deletions examples/training/cross-encoder/training_stsbenchmark.py
@@ -7,6 +7,7 @@
Usage:
python training_stsbenchmark.py
"""

from torch.utils.data import DataLoader
import math
from sentence_transformers import LoggingHandler, util
(file path not shown)
@@ -9,9 +9,9 @@
Or to run it with Docker: https://www.elastic.co/guide/en/elasticsearch/reference/current/docker.html

Methodology:
Three steps are followed for AugSBERT data-augmentation with BM25 Sampling -
Three steps are followed for AugSBERT data-augmentation with BM25 Sampling -
1. Fine-tune cross-encoder (BERT) on gold STSb dataset
2. Fine-tuned Cross-encoder is used to label on BM25 sampled unlabeled pairs (silver STSb dataset)
2. Fine-tuned Cross-encoder is used to label on BM25 sampled unlabeled pairs (silver STSb dataset)
3. Bi-encoder (SBERT) is finally fine-tuned on both gold + silver STSb dataset

Citation: https://arxiv.org/abs/2010.08240
@@ -25,6 +25,7 @@
python train_sts_indomain_bm25.py bert-base-uncased 3

"""

from torch.utils.data import DataLoader
from sentence_transformers import models, losses, util
from sentence_transformers.cross_encoder import CrossEncoder
(file path not shown)
@@ -3,12 +3,12 @@
We utilise nlpaug (https://github.com/makcedward/nlpaug) for data augmentation strategies over a single sentence.

We chose synonym replacement for our example with (can be extended to other techniques) -
1. Word-embeddings (word2vec)
1. Word-embeddings (word2vec)
2. WordNet
3. Contextual word-embeddings (BERT)

Methodology:
Take a gold STSb pair, like (A, B, 0.6) Then replace synonyms in A and B, which gives you (A', B', 0.6)
Take a gold STSb pair, like (A, B, 0.6) Then replace synonyms in A and B, which gives you (A', B', 0.6)
These are the silver data and SBERT is finally trained on (gold + silver) STSb data.

Additional requirements:
@@ -28,6 +28,7 @@
Usage:
python train_sts_indomain_nlpaug.py
"""

from torch.utils.data import DataLoader
import torch
import math
(file path not shown)
@@ -3,9 +3,9 @@


Methodology:
Three steps are followed for AugSBERT data-augmentation strategy with Semantic Search -
Three steps are followed for AugSBERT data-augmentation strategy with Semantic Search -
1. Fine-tune cross-encoder (BERT) on gold STSb dataset
2. Fine-tuned Cross-encoder is used to label on Sem. Search sampled unlabeled pairs (silver STSb dataset)
2. Fine-tuned Cross-encoder is used to label on Sem. Search sampled unlabeled pairs (silver STSb dataset)
3. Bi-encoder (SBERT) is finally fine-tuned on both gold + silver STSb dataset

Citation: https://arxiv.org/abs/2010.08240
@@ -18,6 +18,7 @@

python train_sts_indomain_semantic.py bert-base-uncased 3
"""

from torch.utils.data import DataLoader
from sentence_transformers import models, losses, util
from sentence_transformers import LoggingHandler, SentenceTransformer
(file path not shown)
@@ -3,7 +3,7 @@
For our example below we consider STSb (source) and QQP (target) datasets respectively.

Methodology:
Three steps are followed for AugSBERT data-augmentation strategy with Domain Transfer / Cross-Domain -
Three steps are followed for AugSBERT data-augmentation strategy with Domain Transfer / Cross-Domain -
1. Cross-Encoder aka BERT is trained over STSb (source) dataset.
2. Cross-Encoder is used to label QQP training (target) dataset (Assume no labels/no annotations are provided).
3. Bi-encoder aka SBERT is trained over the labeled QQP (target) dataset.
@@ -16,6 +16,7 @@
OR
python train_sts_qqp_crossdomain.py pretrained_transformer_model_name
"""

from torch.utils.data import DataLoader
from sentence_transformers import models, losses, util, LoggingHandler, SentenceTransformer
from sentence_transformers.cross_encoder import CrossEncoder
(file path not shown)
@@ -4,13 +4,13 @@

For more details refer to -
Fine-Tuning Pretrained Language Models:
Weight Initializations, Data Orders, and Early Stopping by Dodge et al. 2020
Weight Initializations, Data Orders, and Early Stopping by Dodge et al. 2020
https://arxiv.org/pdf/2002.06305.pdf

Why Seed Optimization?
Dodge et al. (2020) show a high dependence on the random seed for transformer based models like BERT,
as it converges to different minima that generalize differently to unseen data. This is especially the
case for small training datasets.
Dodge et al. (2020) show a high dependence on the random seed for transformer based models like BERT,
as it converges to different minima that generalize differently to unseen data. This is especially the
case for small training datasets.

Citation: https://arxiv.org/abs/2010.08240

@@ -22,6 +22,7 @@

python train_sts_seed_optimization.py bert-base-uncased 10 0.3
"""

from torch.utils.data import DataLoader
import math
import torch
1 change: 1 addition & 0 deletions examples/training/distillation/dimensionality_reduction.py
@@ -14,6 +14,7 @@
the new SentenceTransformer model will produce directly embeddings with 128 dimensions
without further changes needed.
"""

from sklearn.decomposition import PCA
from sentence_transformers import SentenceTransformer, LoggingHandler, util, evaluation, models, InputExample
import logging
1 change: 1 addition & 0 deletions examples/training/distillation/model_distillation.py
@@ -19,6 +19,7 @@
There is a performance - speed trade-off. However, we found that a student with 4 instead of 12 layers keeps about 99.4%
of the teacher performance, while being 2.3 times faster.
"""

from torch.utils.data import DataLoader
from sentence_transformers import models, losses, evaluation
from sentence_transformers import LoggingHandler, SentenceTransformer, util, InputExample
1 change: 1 addition & 0 deletions examples/training/distillation/model_quantization.py
@@ -8,6 +8,7 @@
For more details:
https://pytorch.org/docs/stable/quantization.html
"""

import logging
import os
import torch
1 change: 1 addition & 0 deletions examples/training/matryoshka/2d_matryoshka_nli.py
@@ -10,6 +10,7 @@
OR
python 2d_matryoshka_nli.py pretrained_transformer_model_name
"""

import math
from datasets import load_dataset
from sentence_transformers import models, losses, datasets
1 change: 1 addition & 0 deletions examples/training/matryoshka/2d_matryoshka_sts.py
@@ -9,6 +9,7 @@
OR
python 2d_matryoshka_sts.py pretrained_transformer_model_name
"""

from torch.utils.data import DataLoader
import math
from sentence_transformers import SentenceTransformer, LoggingHandler, losses, models, util
1 change: 1 addition & 0 deletions examples/training/matryoshka/matryoshka_nli.py
@@ -10,6 +10,7 @@
OR
python matryoshka_nli.py pretrained_transformer_model_name
"""

import math
from datasets import load_dataset
from sentence_transformers import models, losses, datasets
1 change: 1 addition & 0 deletions examples/training/matryoshka/matryoshka_nli_reduced_dim.py
@@ -14,6 +14,7 @@
OR
python matryoshka_nli_reduced_dim.py pretrained_transformer_model_name
"""

import math
from datasets import load_dataset
from sentence_transformers import models, losses, datasets