Analyze large fingerprint collection #446

Merged
61 commits
a855523
Add logging config
stepan-anokhin Nov 5, 2021
fea57a7
Support legacy storages without hashes
stepan-anokhin Nov 5, 2021
ab2dbcd
Improve misc utils
stepan-anokhin Nov 17, 2021
ded1ffa
Pass kwargs to underlying tqdm in ProgressMonitor
stepan-anokhin Nov 17, 2021
3b37c8a
Draft reusable script for match-graph analysis
stepan-anokhin Nov 17, 2021
fe0d9f5
Add task to draw top communities only
stepan-anokhin Nov 18, 2021
30dc119
Add default logging config
stepan-anokhin Nov 18, 2021
218c726
Draw labelled CCWEB collection
stepan-anokhin Nov 19, 2021
cc92a43
Implement tasks for cross-collection embeddings
stepan-anokhin Dec 8, 2021
98c1a52
Implement partial cross-collection embeddings
stepan-anokhin Dec 14, 2021
6611fc6
Draw cross-collection embeddings images
stepan-anokhin Dec 14, 2021
b04c377
Extract condensed fingerprints target
stepan-anokhin Jan 14, 2022
d17e911
Improve progress monitor
stepan-anokhin Jan 20, 2022
0bf7f44
Refactor luigi tasks
stepan-anokhin Jan 20, 2022
41fc081
Remove unused code
stepan-anokhin Jan 26, 2022
84c1440
Fix bug in progress-monitoring
stepan-anokhin Jan 28, 2022
0a0ad6e
Refactor match-graph builder
stepan-anokhin Jan 28, 2022
1a126c1
Add community comparison task
stepan-anokhin Jan 28, 2022
b64eb0e
Implement multi-task progress tracking
stepan-anokhin Mar 4, 2022
b4f20cb
Migrate feature extraction to luigi
stepan-anokhin Mar 9, 2022
28e87d2
Refactor frame feature extractor
stepan-anokhin Mar 11, 2022
858499e
Define file collection
stepan-anokhin Mar 11, 2022
39a015f
Use file collection in feature extraction
stepan-anokhin Mar 11, 2022
6955814
Refactor file hashing
stepan-anokhin Mar 15, 2022
c13ffb2
Finish luigi-celery integration
stepan-anokhin Mar 16, 2022
3481ee1
Refactor feature-extraction tasks
stepan-anokhin Mar 16, 2022
e5fd9af
Remove legacy pipeline logic
stepan-anokhin Mar 16, 2022
a422fed
Support incremental targets
stepan-anokhin Mar 16, 2022
fce7fd8
Migrate example celery-task to Luigi
stepan-anokhin Mar 16, 2022
7b66ac9
Temporary bring back deleted pipeline tasks
stepan-anokhin Mar 17, 2022
cf3745b
Migrate match detection
stepan-anokhin Mar 17, 2022
55c04e0
Prepare basic processing for preview
stepan-anokhin Mar 17, 2022
5563df9
Merge branch 'development' into 444-analyze-large-fingerprint-collection
stepan-anokhin Mar 17, 2022
a761d60
Fix a regression after merge
stepan-anokhin Mar 18, 2022
685908a
Refactor file target
stepan-anokhin Mar 18, 2022
fb66732
Hook up several luigi tasks
stepan-anokhin Mar 18, 2022
f708dcf
Add temporary CLI to run luigi tasks
stepan-anokhin Mar 18, 2022
7d94fd0
Fix import regression
stepan-anokhin Mar 18, 2022
b77d409
Refactor luigi config passing
stepan-anokhin Mar 20, 2022
e7b2650
Add progress-bar convenience method
stepan-anokhin Mar 20, 2022
f86cf93
Refactor match-detection tasks
stepan-anokhin Mar 20, 2022
15e88f1
Fix condense-fingerprints for custom prefix
stepan-anokhin Mar 20, 2022
6126e15
Refactor template matching
stepan-anokhin Mar 21, 2022
f066283
Hook up luigi template matching in UI
stepan-anokhin Mar 22, 2022
8d701c2
Improve documentation
stepan-anokhin Mar 22, 2022
5e29fe6
Hook up luigi remote matching in UI
stepan-anokhin Mar 23, 2022
c9497ae
Fix db-matches task target
stepan-anokhin Mar 23, 2022
4dcb162
Track matching progress
stepan-anokhin Mar 23, 2022
0c12fbe
Filter dark videos in luigi tasks
stepan-anokhin Mar 23, 2022
23044e8
Hook up luigi process-online task
stepan-anokhin Mar 23, 2022
d3cdb1e
Fix linting issues
stepan-anokhin Mar 23, 2022
5f7e393
Fix timestamp comparison
stepan-anokhin Mar 23, 2022
77339d6
Refactor celery tasks
stepan-anokhin Mar 24, 2022
291b67a
Handle luigi errors in celery
stepan-anokhin Mar 24, 2022
faa41ee
Temporary disable broken tests (#476)
stepan-anokhin Mar 24, 2022
8d4709f
Add dummy test to temporary fix integrationtests (#476)
stepan-anokhin Mar 24, 2022
33ccd57
Pin server dependencies
stepan-anokhin Mar 24, 2022
a06a5be
Fix server dependencies
stepan-anokhin Mar 25, 2022
b8aec7f
Fix server dependencies
stepan-anokhin Mar 25, 2022
99525cb
Temporary disable integration tests (#476)
stepan-anokhin Mar 25, 2022
6f38260
Migrate CLI to luigi
stepan-anokhin Mar 25, 2022
5 changes: 3 additions & 2 deletions cli/cli/handlers/finder.py
@@ -16,8 +16,8 @@ def local_matches(self):
         from winnow.pipeline.generate_local_matches import generate_local_matches
         from winnow.utils.files import scan_videos

-        configure_logging_cli()
         config = self._pipeline.config
+        configure_logging_cli(config.logging)

         videos = scan_videos(config.sources.root, "**", extensions=config.sources.extensions)
         generate_local_matches(files=videos, pipeline=self._pipeline)
@@ -26,7 +26,8 @@ def remote_matches(self, repo: Optional[str] = None, contributor: Optional[str]
         """Find matches between local files and remote fingerprints."""
         from winnow.pipeline.generate_remote_matches import generate_remote_matches

-        configure_logging_cli()
+        config = self._pipeline.config
+        configure_logging_cli(config.logging)

         if repo is not None:
             repo = str(repo)
4 changes: 2 additions & 2 deletions cli/cli/handlers/pipeline.py
@@ -16,7 +16,7 @@ def all(self):
         from winnow.pipeline.extract_exif import extract_exif
         from winnow.pipeline.pipeline_context import PipelineContext

-        configure_logging_cli()
+        configure_logging_cli(self._config.logging)

         # Resolve list of video files from the directory
         absolute_root = os.path.abspath(self._config.sources.root)
@@ -25,4 +25,4 @@ def all(self):
         pipeline_context = PipelineContext(self._config)
         generate_local_matches(files=videos, pipeline=pipeline_context)
         detect_scenes(files=videos, pipeline=pipeline_context)
-        extract_exif(None, pipeline=pipeline_context)
+        extract_exif(videos, pipeline=pipeline_context)
7 changes: 0 additions & 7 deletions db/access/files.py
@@ -517,13 +517,6 @@ def query_local_files(session: Session, path_hash_pairs) -> Query:
         query = query.filter(tuple_(Files.file_path, Files.sha256).in_(tuple(path_hash_pairs)))
         return query

-    @staticmethod
-    def query_local_file_ids(session: Session, path_hash_pairs) -> Query:
-        """Query local files by (path, hash) pairs."""
-        query = session.query(Files.id).filter(Files.contributor == None)  # noqa: E711
-        query = query.filter(tuple_(Files.file_path, Files.sha256).in_(tuple(path_hash_pairs)))
-        return query
-
     @staticmethod
     def query_remote_files(session: Session, repository_name: str = None, contributor_name: str = None) -> Query:
         """Query remote signatures from database."""
20 changes: 20 additions & 0 deletions db/schema.py
@@ -318,3 +318,23 @@ class FileFilterPreset(Base):
     name = Column(String(100), nullable=False, unique=True)
     # Any filter data as JSON blob
     filters = Column(JSON, nullable=False)
+
+
+class TaskLogRecord(Base):
+    """Task execution log.
+
+    Motivation
+    ----------
+    Sometimes there is no way to determine whether a task has already completed just by looking
+    at the results alone. For example, if template matching is performed and no matches were found,
+    there will be zero ``TemplateMatches`` in the database, so the results before and after the
+    run will be identical. Thus, some indication that the task was successfully executed is
+    needed. ``TaskLogRecord`` fills this gap.
+    """
+
+    __tablename__ = "task_logs"
+
+    id = Column(Integer, primary_key=True)
+    task_name = Column(String(100), nullable=False, unique=False)
+    timestamp = Column(DateTime, nullable=False)
+    details = Column(JSON)
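The motivation in the TaskLogRecord docstring can be illustrated with a small sketch. This is not the PR's implementation: it swaps SQLAlchemy for the standard-library sqlite3 module so that it runs without dependencies, and the helper names log_task and last_run, the task name "match-templates" and the details payload are all hypothetical. It only shows why a log row is useful when a successful run produces no result rows.

```python
import json
import sqlite3
from datetime import datetime, timezone

# Hypothetical stand-in for the ``task_logs`` table; the PR defines it with SQLAlchemy.
SCHEMA = """
CREATE TABLE task_logs (
    id INTEGER PRIMARY KEY,
    task_name VARCHAR(100) NOT NULL,
    timestamp TEXT NOT NULL,
    details TEXT
)
"""


def log_task(conn, task_name, details=None):
    """Record one successful run of ``task_name`` (hypothetical helper)."""
    conn.execute(
        "INSERT INTO task_logs (task_name, timestamp, details) VALUES (?, ?, ?)",
        (task_name, datetime.now(timezone.utc).isoformat(), json.dumps(details)),
    )
    conn.commit()


def last_run(conn, task_name):
    """Return the ISO timestamp of the latest logged run, or None if the task never ran."""
    row = conn.execute(
        "SELECT MAX(timestamp) FROM task_logs WHERE task_name = ?", (task_name,)
    ).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)

# Before any run there is no evidence that the task was executed.
never_ran = last_run(conn, "match-templates") is None

# A run that finds zero matches still leaves a log record behind.
log_task(conn, "match-templates", {"matches_found": 0})
ran_once = last_run(conn, "match-templates") is not None
```

A task's completeness check could then compare the latest log timestamp with the modification time of its inputs, which is one plausible reading of the later "Fix timestamp comparison" commit, though the diff shown here does not confirm it.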
5 changes: 3 additions & 2 deletions default.config.yaml
@@ -1,5 +1,7 @@
 sources:
   root: data/
+  hash_mode: file
+  hash_cache: data/representations/hashes

   extensions:
     - mp4
@@ -12,7 +14,6 @@ sources:
 repr:
   directory: data/representations
   storage_type: detect
-  hash_mode: file

 processing:
   frame_sampling: 1
@@ -32,7 +33,7 @@ database:
   uri: postgresql://postgres:admin@postgres:5432/videodeduplicationdb

 templates:
-  source_path: data/templates/test-group/CCSI Object Recognition External/
+  source_path: data/templates/
   distance: 0.07
   distance_min: 0.05

1 change: 1 addition & 0 deletions docker-compose.yml
@@ -84,6 +84,7 @@ services:
       CELERY_BROKER: "redis://redis:6379/0"
       CELERY_RESULT_BACKEND: "redis://redis:6379/0"
       TASK_LOG_DIRECTORY: "/project/pipeline-logs"
+      LUIGI_CONFIG_PATH: "/project/config/luigi.cfg"
     volumes:
       # Set the BENETECH_DATA_LOCATION environment variable to the path
       # on your host machine where you placed the source data
9 changes: 8 additions & 1 deletion environment-gpu.yaml
@@ -19,6 +19,12 @@ dependencies:
   - pysoundfile
   - h5py==2.9.0
   - ffmpeg
+  - numba
+  - pynndescent
+  - luigi
+  - tabulate
+  - umap-learn
+  - networkit
   - pip
   - pip:
     - lmdb
@@ -42,8 +48,9 @@ dependencies:
- yt-dlp
- dacite
- deprecation
- trimap
- pacmap
- torch
- grpcio==1.43.0
- grpcio-tools==1.43.0

prefix: C:\ProgramData\Anaconda3\envs\winnow
8 changes: 8 additions & 0 deletions environment.yaml
@@ -20,6 +20,12 @@ dependencies:
   - pysoundfile
   - h5py==2.9.0
   - ffmpeg
+  - numba
+  - pynndescent
+  - luigi
+  - tabulate
+  - umap-learn
+  - networkit
   - pip
   - pip:
     - lmdb
@@ -43,6 +49,8 @@ dependencies:
     - yt-dlp
     - dacite
     - deprecation
+    - trimap
+    - pacmap
     - torch
     - grpcio==1.43.0
     - grpcio-tools==1.43.0
13 changes: 6 additions & 7 deletions extract_exif.py
@@ -1,19 +1,18 @@
+import logging.config
+
 import click
+import luigi

-from winnow.pipeline.extract_exif import extract_exif
-from winnow.pipeline.pipeline_context import PipelineContext
+from winnow.pipeline.luigi.exif import ExifTask
 from winnow.utils.config import resolve_config
-from winnow.utils.logging import configure_logging_cli


 @click.command()
 @click.option("--config", "-cp", help="path to the project config file", default=None)
 def main(config):
-    logger = configure_logging_cli()
-    logger.info("Loading config file")
     config = resolve_config(config_path=config)
-
-    extract_exif(videos=None, pipeline=PipelineContext(config))
+    logging.config.fileConfig("./logging.conf")
+    luigi.build([ExifTask(config=config)], local_scheduler=True, workers=1)


 if __name__ == "__main__":
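The scripts in this PR replace direct pipeline calls with luigi.build([...], local_scheduler=True, workers=1). By default, Luigi treats a task as complete when all of its output targets exist, so re-running a script skips work that is already done. The sketch below imitates that rule in plain Python without importing luigi; FileTask, build and the "exif.csv" file name are hypothetical stand-ins, not code from the PR or from Luigi itself.

```python
import tempfile
from pathlib import Path


class FileTask:
    """Hypothetical stand-in for a Luigi task with a single file output."""

    def __init__(self, output_path):
        self.output_path = Path(output_path)
        self.run_count = 0

    def complete(self):
        # Mirrors Luigi's default rule: a task is complete when its output exists.
        return self.output_path.exists()

    def run(self):
        self.run_count += 1
        self.output_path.write_text("exif data placeholder\n")


def build(tasks):
    """Run only the incomplete tasks; return True if all are complete afterwards."""
    for task in tasks:
        if not task.complete():
            task.run()
    return all(task.complete() for task in tasks)


with tempfile.TemporaryDirectory() as workdir:
    task = FileTask(Path(workdir) / "exif.csv")
    first = build([task])   # output missing, so run() is called
    second = build([task])  # output present, so run() is skipped
    runs = task.run_count
```

The real scheduler also resolves dependencies declared in requires(), which this sketch leaves out.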
42 changes: 28 additions & 14 deletions extract_features.py
@@ -1,13 +1,17 @@
+import logging.config
 import os

 import click
+import luigi

-from winnow.pipeline.extract_video_signatures import extract_video_signatures
-from winnow.pipeline.pipeline_context import PipelineContext
-from winnow.pipeline.store_database_signatures import store_database_signatures
+from winnow.pipeline.luigi.exif import ExifTask, ExifFileListFileTask
+from winnow.pipeline.luigi.signatures import (
+    SignaturesTask,
+    DBSignaturesTask,
+    SignaturesByPathListFileTask,
+    DBSignaturesByPathListFileTask,
+)
 from winnow.utils.config import resolve_config
-from winnow.utils.files import scan_videos, scan_videos_from_txt
-from winnow.utils.logging import configure_logging_cli


 @click.command()
@@ -36,19 +40,29 @@
     is_flag=True,
 )
 def main(config, list_of_files, frame_sampling, save_frames):
-    logger = configure_logging_cli()
-    logger.info("Loading config file")
     config = resolve_config(config_path=config, frame_sampling=frame_sampling, save_frames=save_frames)
+    logging.config.fileConfig("./logging.conf")

-    logger.info("Searching for Dataset Video Files")
     if list_of_files is None:
-        videos = scan_videos(config.sources.root, "**", extensions=config.sources.extensions)
+        luigi.build(
+            [
+                ExifTask(config=config),
+                SignaturesTask(config=config),
+                DBSignaturesTask(config=config),
+            ],
+            local_scheduler=True,
+            workers=1,
+        )
     else:
-        videos = scan_videos_from_txt(list_of_files, extensions=config.sources.extensions)
-
-    pipeline = PipelineContext(config)
-    extract_video_signatures(files=videos, pipeline=pipeline)
-    store_database_signatures(files=videos, pipeline=pipeline)
+        luigi.build(
+            [
+                ExifFileListFileTask(config=config, path_list_file=list_of_files),
+                SignaturesByPathListFileTask(config=config, path_list_file=list_of_files),
+                DBSignaturesByPathListFileTask(config=config, path_list_file=list_of_files),
+            ],
+            local_scheduler=True,
+            workers=1,
+        )


 if __name__ == "__main__":
22 changes: 8 additions & 14 deletions generate_matches.py
@@ -1,13 +1,11 @@
+import logging.config
 import os

 import click
+import luigi

-from winnow.pipeline.detect_scenes import detect_scenes
-from winnow.pipeline.generate_local_matches import generate_local_matches
-from winnow.pipeline.pipeline_context import PipelineContext
+from winnow.pipeline.luigi.matches import MatchesReportTask, MatchesByFileListTask
 from winnow.utils.config import resolve_config
-from winnow.utils.files import scan_videos, scan_videos_from_txt
-from winnow.utils.logging import configure_logging_cli


 @click.command()
@@ -36,19 +34,15 @@
     is_flag=True,
 )
 def main(config, list_of_files, frame_sampling, save_frames):
-    logger = configure_logging_cli()
-    logger.info("Loading config file")
     config = resolve_config(config_path=config, frame_sampling=frame_sampling, save_frames=save_frames)
+    logging.config.fileConfig("./logging.conf")

-    logger.info("Searching for Dataset Video Files")
     if list_of_files is None:
-        videos = scan_videos(config.sources.root, "**", extensions=config.sources.extensions)
+        luigi.build([MatchesReportTask(config=config)], local_scheduler=True, workers=1)
     else:
-        videos = scan_videos_from_txt(list_of_files, extensions=config.sources.extensions)
-
-    pipeline = PipelineContext(config)
-    generate_local_matches(files=videos, pipeline=pipeline)
-    detect_scenes(files=videos, pipeline=pipeline)
+        luigi.build(
+            [MatchesByFileListTask(config=config, path_list_file=list_of_files)], local_scheduler=True, workers=1
+        )


 if __name__ == "__main__":
18 changes: 6 additions & 12 deletions generate_remote_matches.py
@@ -1,11 +1,11 @@
+import logging.config
 import os

 import click
+import luigi

-from winnow.pipeline.generate_remote_matches import generate_remote_matches
-from winnow.pipeline.pipeline_context import PipelineContext
+from winnow.pipeline.luigi.matches import RemoteMatchesTask
 from winnow.utils.config import resolve_config
-from winnow.utils.logging import configure_logging_cli


 @click.command()
@@ -16,11 +16,6 @@
     help="remote repository name",
     default=None,
 )
-@click.option(
-    "--contributor",
-    help="remote contributor name",
-    default=None,
-)
 @click.option(
     "--frame-sampling",
     "-fs",
@@ -38,11 +33,10 @@
     default=None,
     is_flag=True,
 )
-def main(repo, contributor, config, frame_sampling, save_frames):
-    logger = configure_logging_cli()
-    logger.info("Loading config file")
+def main(repo, config, frame_sampling, save_frames):
     config = resolve_config(config_path=config, frame_sampling=frame_sampling, save_frames=save_frames)
-    generate_remote_matches(repository_name=repo, contributor_name=contributor, pipeline=PipelineContext(config))
+    logging.config.fileConfig("./logging.conf")
+    luigi.build([RemoteMatchesTask(config=config, repository_name=repo)], local_scheduler=True, workers=1)


 if __name__ == "__main__":
57 changes: 0 additions & 57 deletions ingest_jobs.py

This file was deleted.
