The corpus class has been reworked so that it can serve as the default point of
departure for most workflows. All core methods are now available for the corpus class; where necessary, they have been newly implemented, e.g. the show()
and size() methods. The constructor method for a corpus object, the corpus() method, now checks whether the character vector with the corpus
ID refers to an available corpus and whether all letters are upper case, and
issues informative warnings and error messages.
The s_attributes()-method for corpus objects has been reworked: It will decode
binary files directly, without reliance on the corpus library functions, which is
significantly faster.
The Corpus reference class is now obsolete after the introduction of the
S4 corpus class. To maintain the functionality not covered otherwise,
new generics get_info and show_info have been introduced and defined
for the corpus class.
Methods available for the subcorpus class have been expanded so that this
class can supersede the partition class: Methods newly available are cpos(), count(), p_attributes(), s_attributes(), get_token_stream(),
and size(). Technically, there is a virtual slice class, from which subcorpus inherits (methods are called via callNextMethod()).
A new subset()-method for the corpus and subcorpus classes has been introduced to generate subcorpora
(i.e. subcorpus objects). It outperforms the partition() method. The subset()-method for corpus and subcorpus objects
will be the default way to work with non-standard evaluation in a manner that
feels "R-ish" (#40).
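As a rough illustration of this non-standard evaluation, the following sketch derives a subcorpus from the REUTERS sample corpus shipped with the package. It assumes that the sample corpora have been activated with use("polmineR") and that the s-attribute places contains values matching "kuwait"; both are assumptions about the test data rather than statements from this changelog.

```r
library(polmineR)
use("polmineR")  # assumed setup: activate the sample corpora shipped with the package

reuters <- corpus("REUTERS")

# The expression is evaluated against the s-attributes of the corpus,
# so `places` can be used like a column name.
kuwait <- subset(reuters, grepl("kuwait", places))

size(reuters)  # number of tokens in the whole corpus
size(kuwait)   # number of tokens in the subcorpus
```

Because subset() is also defined for subcorpus objects, calls of this kind can be chained to narrow a subcorpus further.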
The zoom()-method that has been introduced experimentally has
been dropped again in favor of the subset()-method to get subcorpus objects
from corpus and subcorpus objects. A set of experimental methods for an
initial check of the feasibility of a non-standard evaluation approach to
the generation of subcorpora has been dropped (methods $, ==, !=, zoom for corpus-class).
To facilitate the transition from the partition class (inheriting from
the textstat class) to the subcorpus class (inheriting from the textstat
class), there is a new coerce()-method to turn a partition object into
a subcorpus object.
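A minimal sketch of this transition, assuming that the coerce method is reached through the standard S4 as() call and that the REUTERS sample corpus is available after use("polmineR"); the s-attribute value used here is an assumption about the test data.

```r
library(polmineR)
use("polmineR")  # assumed setup: activate the sample corpora shipped with the package

# Create a partition the traditional way ...
p <- partition("REUTERS", places = "kuwait", regex = TRUE)

# ... and turn it into a subcorpus object (assumption: coercion via as()).
sc <- as(p, "subcorpus")

is(sc, "subcorpus")
size(sc)
```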
A new remote_corpus-class is the basis for accessing remote
corpora. A remote_subcorpus can be derived from a remote_corpus. Methods
available for remote corpora and subcorpora remain limited at this stage.
Consolidation of the class system: For all the S4 classes in the package, the contains
declarations have been checked, and cases of multiple inheritance have been removed.
The subcorpus_bundle class now inherits from partition_bundle. This is not
intended to be a long-term solution, but facilitates the implementation of new
workflows based on the subcorpus class rather than the partition class.
Calling the polmineR shiny app via polmineR() did not have safeguards if
the suggested packages shiny and shinythemes were not installed. Now
there will be a conditional installation of the packages required for running
the shiny app.
The somewhat odd class CorpusOrSubcorpus has been removed. The ngrams-method
now applies for corpus and subcorpus objects.
The pipe operator of the magrittr package is imported now, and magrittr has moved
from a suggested package to a required package.
The label()-method, present for a while, is now superseded by an edit()-method.
It will call a shiny gadget using either DataTables or Handsontable. The former Labels reference class has been turned into an S4 class, because the
desired reference logic can also be achieved with a data.table in a slot of
the labels class.
The table-slot of the kwic class has been renamed as stat slot (a data.table),
so that the kwic class can now inherit from the textstat class. The enrich()-method for objects of class kwic now includes a new argument extra that will add extra tokens to the left of the windows for concordances so
that qualitative inspections for query hits can work with more context.
The as.TermDocumentMatrix() and the as.DocumentTermMatrix()-methods are now
also defined for kwic objects. They work exactly the same as for the context
class. To avoid having to write new methods, a new neighborhood virtual class has
been introduced. The aforementioned methods are defined for the virtual class and
are available for context and kwic class objects.
Added CQP functionality to the count tab and to the dispersion tab of the shiny app.
There is now a basic implementation of get_token_stream() for a partition_bundle
object.
The Cooccurrences()-method is now available for subcorpus-objects (#88).
There is a new coerce method to turn a kwic-object into a context-object.
The neighborhood virtual class could be discarded again, and a bug could be removed
that left an enrich()-operation for kwic objects (argument p_attribute)
ineffectual (#103).
Minor changes
Added a new argument regex to the cpos()-method (for corpus objects), which
will interpret argument query as a regular expression. This may be faster than
taking query as an outright CQP query.
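A short sketch of how the new argument might be used, assuming the REUTERS sample corpus is available after use("polmineR"). The query string is only an example.

```r
library(polmineR)
use("polmineR")  # assumed setup: activate the sample corpora shipped with the package

# Interpret the query as a regular expression rather than as a CQP query.
hits <- cpos(corpus("REUTERS"), query = "oil.*", regex = TRUE)

# Inspect the first matching regions of corpus positions.
head(hits)
```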
The configure-script in the package that would adjust paths in the registry files
for the corpora included in the package for documentation and testing purposes has
been removed. Having switched to a temporary registry directory, it has lost
its function.
The version of the data.table package now required is 1.12.2, because previous
versions did not allow adding columns to a new data.table.
Implemented the possibility to use multiple queries in dispersion-method (#92).
To keep up with the renaming of functions and arguments in the package, "sAttributes"
and "pAttributes" in the polmineR shiny app have been renamed ("s_attributes",
and "p_attributes", respectively).
The shiny app module for kwic output will not show p_attribute and positivelist
by default.
The format()-method is used to create proper output in the cooccurrences of the
shiny app.
User names that include non-ASCII characters were a persistent problem on Windows
machines (#66). The solution now is to check for non-ASCII characters in the path
to the data directory, and to use the "old" short DOS path if necessary. The worker is
a modified registry()-function.
The ordering of the table for ll-method had been somewhat mixed up, which is repaired
now. Tokens with NA values for the ll-test will show up at the end of the table.
The registry_move()-function, used only internally at this stage, is exported now
so that it can be used by other packages.
The return value of the get_token_stream()-method for regions objects was a data.table. The behavior is now in line with the other get_token_stream() methods.
The tempcorpus()-method and the tempcorpus class have been removed from the package,
having become utterly deprecated.
The summary()-method for partition-class objects has been turned into a method
for the count-class, to eliminate an inconsistency. The example of a workflow has been
moved to the documentation object for the count-class.
The browse()-method has not proven to be useful and has been removed from the package.
A new browse()-function has been introduced that throws a warning if browse() is
called nevertheless.
A refactoring of the split()-method for partition-objects improved the readability
of the code, but the performance gain is minimal.
A new kwic_bundle-class has been introduced. A list of kwic objects can be turned
into this new class using as.bundle.
The context()-method will again accept character vectors as input for the arguments left and right to expand to the left and right boundaries of the designated
region (#87).
Rework of the way messages are printed to make it easy to implement notifications in
the shiny environment.
Default highlighting when a positivelist is supplied has been removed from the kwic()-method. This ensures that subsequent highlighting operations can assign
new colors (#38).
Implemented feature request for dispersion() that results are reported for all
values of structural attributes, including those with zero matches. (#104)
Performance improved for the cpos-method for matrix which unfolds a matrix with regions
of corpus positions, useful for operations that require many calls.
The count-method for partition_bundle has been reworked and is much faster and more
memory efficient.
as.TermDocumentMatrix() for partition_bundle optimized to work efficiently
with large corpora.
Introduction of a context,matrix-method to have a unified auxiliary function
to create contexts.
The as.corpusEnc()-function uses the localeToCharset()-function from the utils
package to determine the charset of input strings. On RStudio Server, we have seen
cases when the return value is NA. Then it will be assumed that the locale is UTF-8.
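The fallback described here can be sketched in base R as follows. The helper name locale_charset() is hypothetical and only serves the illustration; it is not the package's internal code.

```r
# Hypothetical helper: determine the charset of the current locale and
# fall back to UTF-8 if localeToCharset() cannot map the locale (NA).
locale_charset <- function() {
  charset <- utils::localeToCharset()[1]
  if (is.na(charset)) charset <- "UTF-8"  # assumed fallback, as described above
  charset
}

locale_charset()
```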
Functionality to highlight terms in kwic display has been restored for the shiny app.
Bug fixes
Removed a bug in the context()/kwic() method that led to superfluous words in the
right context.
Removed a bug that occurred with the as.data.frame()-method for kwic-objects
when no metadata were added.
The count()-method for partition_bundle-objects did not perform iconv() if
necessary - this has been corrected.
Indexing the concordances of a kwic object did not reduce the cpos table
accordingly. This has been corrected.
The as.speeches()-method failed when a speaker
occurring in the corpus contributed only one single region to the entire corpus (#86).
This has been fixed.
Counting over a partition_bundle started to throw a warning that an unused argument arrives
at the cpos()-method. The cause of the warning message has been removed, and
an additional unit test has been introduced to detect issues with the count-method (#90).
The kwic()-method threw an error when trimming the matches by using a positivelist
or a stoplist resulted in no remaining matches. The method will now return a NULL
object and keep issuing a warning if no matches remain after filtering (#91).
Chaining subsetting calls on a corpus/subcorpus omitted filling the s_attribute slot
of the subcorpus object, resulting in false results when counting over
subcorpora. Fixed.
Started to remove bugs in the shiny app: kwic starts to work again (bug: slot table
has been replaced by stat).
The part of the shiny app for dispersions did not work at all - has been repaired,
exposing more functionality of dispersion() (#62).
In the as.speeches()-method, the argument verbose was not used (#64). This was
addressed when solving issue #86.
Informative messages when sending out emails, on success and on error, have been added (#61).
A shortcoming in the coerce method that turns a subcorpus into a String was removed:
A semicolon was not recognized as a punctuation mark. This makes decoding subcorpora
as Annotation more robust. The respective unit test has been updated.
Calling read() on a kwic object works again (#84).
Checks for the as.VCorpus() method that failed are now ok (#77). The reason was
that get_token_stream() assumed implicitly that a p-attribute "pos" is present,
which is not the case for the REUTERS test corpus.
A minor bug in the s_attributes-method was removed that made it impossible to retrieve the
metadata for the first struc (index 0) of an s-attribute.
Fixed an issue for as.DocumentTermMatrix that started to occur with the introduction
of the subcorpus_bundle class (#100).
Removed a bug in the kwic-method for character that prevented using different values for
right and left context (#101).
Removed a bug that occurred when using as.DocumentTermMatrix() on a corpus specified
by its corpus ID, i.e. a length-one character vector (#105).
Removed a bug from the kwic,character-method, and the context,corpus-method that
would result in odd behavior when either the left or right context is 0.
An endemic encoding issue for full text output on Windows machines (latin1 encoding)
has been solved by replacing internally markdown::markdownToHTML by a direct
call to markdown::renderMarkdown. On this occasion, some overhead preparing
fulltext output has been removed.
A bug that prevented getting extra left and right context for kwic objects has
been removed (#102).
The as.TermDocumentMatrix()-method for neighborhood-objects unintentionally returned a
DocumentTermMatrix. This bug has been removed.
Documentation
Extended documentation for pmi()-method and t_test()-method.
New s_attributes()-method for corpus-class.
The documentation for the corpus-class has been rewritten entirely, and the
documentation for the remote_corpus-class has been integrated, whereas methods
applicable to the remote_corpus-class were integrated into the documentation
objects for the respective methods.
The documentation for the get_token_stream()-method has been reworked and expanded
thoroughly (#65). On this occasion, test coverage for the method has been improved
significantly. (Everything is tested now apart from parallelization.)