    Repositories list

    • Resources for running StormCrawler with Docker services
      Dockerfile
      Apache License 2.0
      31000 Updated Nov 10, 2024
    • Resources for the DigitalPebble website
      SCSS
      0000 Updated Jul 17, 2024
    • Crawl config used to test URL Frontier on a large scale and produce WARCs for CommonCrawl.
      FLUX
      0100 Updated May 16, 2024
    • storm

      Public
      Mirror of Apache Storm
      Java
      Apache License 2.0
      4.1k000 Updated Apr 10, 2024
    • Wraps the charset detection logic from StormCrawler as a Tika module
      Java
      Apache License 2.0
      1000 Updated Feb 2, 2024
    • tika

      Public
      The Apache Tika toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF).
      Java
      Apache License 2.0
      794000 Updated Jan 25, 2024
    • benchmark

      Public
      StormCrawler topology to evaluate the performance of different backends and configurations
      Shell
      0000 Updated Jan 22, 2024
    • docs

      Public
      Documentation for Docker Official Images in docker-library
      Shell
      MIT License
      2.2k000 Updated Jan 16, 2024
    • Ansible playbook for deploying a Storm cluster
      1700 Updated Dec 7, 2023
    • nutch

      Public
      Apache Nutch is an extensible and scalable web crawler
      Java
      Apache License 2.0
      1.3k100 Updated Nov 8, 2023
    • URLFrontier client written in Rust (mostly as a way of learning Rust)
      Rust
      Apache License 2.0
      0100 Updated Dec 5, 2022
    • Java
      1000 Updated Apr 6, 2022
    • A Text Classification API in Java originally developed by DigitalPebble Ltd. The API is independent from the ML implementations used and can be used as a front end to various ML algorithms. libSVM and liblinear are currently embedded.
      Java
      Apache License 2.0
      214810 Updated Sep 24, 2021
    • Crawl configurations for benchmarking / testing StormCrawler
      Shell
      Apache License 2.0
      51000 Updated Sep 19, 2019
    • behemoth

      Public archive
      Behemoth is an open source platform for large scale document analysis based on Apache Hadoop.
      Java
      Other
      60281121 Updated Apr 25, 2018
    • A set of reusable Java components that implement functionality common to any web crawler
      Java
      Apache License 2.0
      77400 Updated Apr 4, 2017
    • sc-warc

      Public
      WARC resources for StormCrawler
      1230 Updated Oct 20, 2016
    • tescobank

      Public archive
      Setup for crawling tescobank with SC
      Java
      Apache License 2.0
      2400 Updated Sep 23, 2015
    • Use cases for DigitalPebble's TextClassification API
      Java
      Apache License 2.0
      31000 Updated Sep 1, 2015
    • behemoth-commoncrawl

      Public archive
      Support for old (pre 2013) CommonCrawl dataset in Behemoth
      Java
      0400 Updated Apr 20, 2015
    • tika-cc

      Public
      Resources for generating a corpus of docs from CC for Tika
      Shell
      0000 Updated Nov 28, 2014
    • Resources for comparison between 1.8 and 2.x of Apache Nutch
      Java
      Apache License 2.0
      0400 Updated Jun 4, 2014
    • ElasticSearch module for Behemoth
      Java
      0100 Updated Feb 12, 2014
    • Module for classifying Behemoth documents with a model from our Text Classification API
      Java
      0100 Updated Nov 22, 2012
    • GATE Processing Resource wrapping DigitalPebble's TextClassification API
      Java
      3511 Updated Jul 12, 2012
    • ngrams-api

      Public archive
      Java API for querying an N-grams corpus. Uses Lucene for searching and indexing from the Google Web-1T format
      Java
      Other
      2400 Updated Apr 27, 2012