Caterpillar Mambo

@PolMine PolMine released this 18 Dec 09:03
· 648 commits to master since this release

New Features

  • The corpus class has been put in a shape to become the default point of
    departure of most workflows. All core methods are now available for the
    corpus class and have been newly implemented where necessary, e.g. the
    show() and size() methods. The constructor method for a corpus object, the
    corpus() method, will now check whether the character vector with the corpus
    ID refers to an available corpus and whether all letters are upper case, and
    will issue informative warnings and error messages.
  • The s_attributes()-method for corpus objects has been reworked: It will decode
    binary files directly, without reliance on the corpus library functions, which is
    significantly faster.
  • The Corpus reference class is now obsolete after the introduction of the
    S4 corpus class. To maintain the functionality not covered otherwise,
    new generics get_info and show_info have been introduced and defined
    for the corpus class.
  • Methods available for the subcorpus class have been expanded so that this
    class can supersede the partition class: Methods newly available are
    cpos(), count(), p_attributes(), s_attributes(), get_token_stream(),
    and size(). Technically, there is a virtual slice class, from which
    subcorpus inherits (methods called via callNextMethod()).
  • A new subset()-method for the corpus and subcorpus classes to generate subcorpora
    (i.e. subcorpus objects) has been introduced. It outperforms the
    partition() method. The subset()-method for corpus and subcorpus objects
    will be the default way to work with non standard evaluation in a manner that
    feels "R-ish" (#40).
  • The zoom()-method that has been introduced experimentally has
    been dropped again in favor of the subset()-method to get subcorpus objects
    from corpus and subcorpus objects. A set of experimental methods for an
    initial check of the feasibility of a non-standard evaluation approach to
    the generation of subcorpora has been dropped (methods $, ==, !=,
    zoom for corpus-class).
  • To facilitate the transition from the partition class (inheriting from
    the textstat class) to the subcorpus class (inheriting from the textstat
    class), there is a new coerce()-method to turn a partition object into
    a subcorpus object.
  • A new remote_corpus-class is the basis for accessing remote
    corpora. A remote_subcorpus can be derived from a remote_corpus. Methods
    available for remote corpora and subcorpora remain limited at this stage.
  • Consolidation of the class system: All S4 classes in the package have been
    checked for multiple contains, and multiple contains have been removed.
  • The subcorpus_bundle class now inherits from partition_bundle. This is not
    intended to be a long-term solution, but facilitates the implementation of new
    workflows based on the subcorpus class rather than the partition class.
  • Calling the polmineR shiny app via polmineR did not have safeguards if
    the suggested packages shiny and shinythemes were not installed. Now
    there will be a conditional installation of the packages required for running
    the shiny app.
  • The somewhat odd class CorpusOrSubcorpus has been removed. The ngrams-method
    now applies for corpus and subcorpus objects.
  • The pipe operator of the magrittr package is imported now, and magrittr has moved
    from a suggested package to a required package.
  • The label()-method, present for a while, is now superseded by an edit()-method.
    It will call a shiny gadget using either DataTables or Handsontable. The former
    Labels reference class has been turned into an S4 class, because the
    desired reference logic can also be achieved with a data.table in a slot of
    the labels class.
  • The table-slot of the kwic class has been renamed as stat slot (a data.table),
    so that the kwic class can now inherit from the textstat class. The
    enrich()-method for objects of class kwic now includes a new argument
    extra that will add extra tokens to the left of the windows for concordances so
    that qualitative inspections for query hits can work with more context.
  • The as.TermDocumentMatrix() and the as.DocumentTermMatrix()-methods are now
    also defined for kwic objects. They work exactly the same as for the context
    class. To avoid having to write new methods, a new neighborhood virtual class has
    been introduced. The aforementioned methods are defined for the virtual class and
    are available for context and kwic class objects.
  • Added CQP functionality to count tab in shiny app, and to the dispersion tab.
  • There is now a basic implementation of get_token_stream() for a partition_bundle
    object.
  • The Cooccurrences()-method is now available for subcorpus-objects (#88).
  • There is a new coerce method to turn a kwic-object into a context-object.
    The neighborhood virtual class could be discarded again, and a bug could be removed
    that left an enrich()-operation for kwic objects (argument p_attribute)
    ineffectual (#103).
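
The corpus-centred workflow described in this list might look roughly like the following sketch. It assumes the demo corpora shipped with the package (here GERMAPARLMINI) have been activated with use("polmineR"), that the speaker name and date used for illustration occur in that corpus, and that the new coerce method is reached through as(); treat it as an illustration, not as the canonical usage.

```r
library(polmineR)
use("polmineR")  # assumption: activates the demo corpora shipped with the package

# corpus() checks that the ID refers to an available corpus
gparl <- corpus("GERMAPARLMINI")
size(gparl)          # number of tokens in the corpus
s_attributes(gparl)  # structural attributes that are available

# subset() uses non-standard evaluation to derive a subcorpus object
merkel <- subset(gparl, speaker == "Angela Dorothea Merkel")
size(merkel)

# subset() also works on a subcorpus, so calls can be chained
merkel_day <- subset(merkel, date == "2009-10-28")

# transition path: turn an existing partition into a subcorpus
# (assumption: the coerce method is called via as())
p <- partition("GERMAPARLMINI", date = "2009-10-28")
sc <- as(p, "subcorpus")
```

Methods such as count(), cpos() and get_token_stream() can then be called on the subcorpus objects in the same way as on a partition.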

Minor changes

  • Added a new argument regex to the cpos()-method (for corpus objects), which
    will interpret argument query as a regular expression. This may be faster than
    taking query as an outright CQP query.
  • The configure-script in the package that would adjust paths in the registry files
    for the corpora included in the package for documentation and testing purposes has
    been removed. Having switched to a temporary registry directory, it has lost
    its function.
  • The version of the data.table package now required is 1.12.2, because previous
    versions did not allow adding columns to a new data.table.
  • Implemented the possibility to use multiple queries in dispersion-method (#92).
  • To keep up with the renaming of functions and arguments in the package, "sAttributes"
    and "pAttributes" in the polmineR shiny app have been renamed ("s_attributes",
    and "p_attributes", respectively).
  • The shiny app module for kwic output will not show p_attribute and positivelist
    by default.
  • The format()-method is used to create proper output in the cooccurrences of the
    shiny app.
  • User names that include non-ASCII characters were a persistent problem on Windows
    machines (#66). The solution now is to check for non-ASCII characters in the path
    to the data directory, and to use the "old" short DOS path if necessary. The worker is
    a modified registry()-function.
  • The ordering of the table for ll-method had been somewhat mixed up, which is repaired
    now. Tokens with NA values for the ll-test will show up at the end of the table.
  • The registry_move()-function, used only internally at this stage, is exported now
    so that it can be used by other packages.
  • The return value of the get_token_stream()-method for regions objects was a
    data.table. The behavior is now in line with the other get_token_stream() methods.
  • The tempcorpus()-method and the tempcorpus class have been removed from the package,
    having become utterly deprecated.
  • The summary()-method for partition-class objects has been turned into a method
    for the count-class, to eliminate an inconsistency. The example of a workflow has been
    moved to the documentation object for the count-class.
  • The browse()-method has not proven to be useful and has been removed from the package.
    A new browse()-function is introduced to throw a warning, if browse should be
    called nevertheless.
  • A refactoring of the split()-method for partition-objects improved the readability
    of the code, but the performance gain is minimal.
  • A new kwic_bundle-class has been introduced. A list of kwic objects can be turned
    into this new class using as.bundle.
  • The context()-method again accepts character vectors as input for the arguments
    left and right to expand to the left and right boundaries of the designated
    region (#87).
  • Rework of the way messages are printed to make it easy to implement notifications in
    the shiny environment.
  • Default highlighting when a positivelist is supplied has been removed from the
    kwic()-method. This ensures that subsequent highlighting operations can assign
    new colors (#38).
  • Implemented feature request for dispersion() that results are reported for all
    values of structural attributes, including those with zero matches. (#104)
  • Performance improved for the cpos-method for matrix which unfolds a matrix with regions
    of corpus positions, useful for operations that require many calls.
  • The count-method for partition_bundle has been reworked and is much faster and more
    memory efficient.
  • as.TermDocumentMatrix() for partition_bundle optimized to work efficiently
    with large corpora.
  • Introduction of a context,matrix-method to have a unified auxiliary function
    to create contexts.
  • The as.corpusEnc()-function uses the localeToCharset()-function from the utils
    package to determine the charset of input strings. On RStudio Server, we have seen
    cases when the return value is NA. Then it will be assumed that the locale is UTF-8.
  • Functionality to highlight terms in kwic display has been restored for the shiny app.
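
Two of the changes above, the regex argument of cpos() and multiple queries in dispersion(), could be tried out along these lines. This is a sketch under assumptions: the demo corpora REUTERS and GERMAPARLMINI are activated with use("polmineR"), and the query terms are chosen only for illustration.

```r
library(polmineR)
use("polmineR")  # assumption: activates the demo corpora shipped with the package

# regex = TRUE interprets the query as a regular expression
# rather than as a CQP query
hits <- cpos(corpus("REUTERS"), query = "^oil.*", regex = TRUE)
head(hits)  # corpus positions of the matches

# dispersion() with more than one query (#92); results are reported
# for all values of the s-attribute, including those without matches (#104)
dsp <- dispersion(
  "GERMAPARLMINI",
  query = c("Integration", "Zuwanderung"),
  s_attribute = "date"
)
dsp
```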

Bug fixes

  • Removed a bug in the context()/kwic() method that led to superfluous words in the
    right context.
  • Removed a bug that occurred with the as.data.frame()-method for kwic-objects
    when no metadata were added.
  • The count()-method for partition_bundle-objects did not perform iconv() if
    necessary - this has been corrected.
  • Indexing the concordances of a kwic object did not reduce the cpos table
    accordingly. This has been corrected.
  • The as.speeches()-method failed when a speaker occurring in the corpus
    contributed only a single region to the entire corpus (#86).
    This has been fixed.
  • Counting over a partition_bundle started to throw a warning that an argument arrives
    at the cpos()-method that is not used. The cause of the warning has been removed, and
    an additional unit test has been introduced to recognize issues with the
    count-method (#90).
  • The kwic()-method threw an error when trimming the matches by using a positivelist
    or a stoplist resulted in no remaining matches. The method will now return a NULL
    object and keep issuing a warning if no matches remain after filtering (#91).
  • Chaining subsetting calls on a corpus/subcorpus omitted filling the s_attribute slot
    of the subcorpus object, resulting in false results when counting over
    subcorpora. Fixed.
  • Started to remove bugs in the shiny app: kwic starts to work again (bug: slot table
    has been replaced by stat).
  • The part of the shiny app for dispersions did not work at all - has been repaired,
    exposing more functionality of dispersion() (#62).
  • In the as.speeches()-method, the argument verbose was not used (#64). This was
    addressed when solving issue #86.
  • Informative messages when sending out emails - on success and error - have been added (#61).
  • A shortcoming in the coerce method to turn a subcorpus into a String was removed:
    A semicolon was not recognized as a punctuation mark. This makes decoding subcorpora
    as Annotation more robust. The respective unit test has been updated.
  • Calling read() on a kwic object works again (#84).
  • Checks for the as.VCorpus() method that failed are now ok (#77). The reason was
    that get_token_stream() assumed implicitly that a p-attribute "pos" is present,
    which is not the case for the REUTERS test corpus.
  • A minor bug in the s_attributes-method was removed that made it impossible to
    retrieve the metadata for the first struc (index 0) of an s-attribute.
  • Fixed an issue for as.DocumentTermMatrix that started to occur with the introduction
    of the subcorpus_bundle class (#100).
  • Removed a bug in the kwic-method for character that prevented using different values for
    right and left context (#101).
  • Removed a bug that occurred when using as.DocumentTermMatrix() on a corpus specified
    by corpus ID / length-one character vector (#105).
  • Removed a bug from the kwic,character-method, and the context,corpus-method that
    would result in odd behavior when either the left or right context is 0.
  • An endemic encoding issue for full text output on Windows machines (latin1 encoding)
    has been solved by replacing internally markdown::markdownToHTML by a direct
    call to markdown::renderMarkdown. On this occasion, some overhead preparing
    fulltext output has been removed.
  • A bug that prevented getting extra left and right context for kwic objects has
    been removed (#102).
  • The as.TermDocumentMatrix()-method for neighborhood-objects unintentionally
    returned a DocumentTermMatrix. This bug has been removed.
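
Several of the kwic-related fixes above can be checked with a short session like the one below. It is a sketch assuming the REUTERS demo corpus is activated with use("polmineR"); the query term and the positivelist entry "zzzzzz" are made up for illustration.

```r
library(polmineR)
use("polmineR")  # assumption: activates the demo corpora shipped with the package

# different window sizes for the left and right context (#101)
k <- kwic("REUTERS", query = "oil", left = 3, right = 8)
k

# if filtering leaves no matches, the notes above say the method
# issues a warning and returns NULL (#91); "zzzzzz" is a made-up
# term that is not expected to occur in any context window
empty <- suppressWarnings(
  kwic("REUTERS", query = "oil", positivelist = "zzzzzz")
)
is.null(empty)
```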

Documentation

  • Extended documentation for pmi()-method and t_test()-method.
  • New s_attributes()-method for corpus-class.
  • The documentation for the corpus-class has been rewritten entirely, and the
    documentation for the remote_corpus-class has been integrated, whereas methods
    applicable to the remote_corpus-class were integrated into the documentation
    objects for the respective methods.
  • The documentation for the get_token_stream()-method has been reworked and expanded
    thoroughly (#65). On this occasion, test coverage for the method has been improved
    significantly. (Everything is tested now apart from parallelization.)