Releases: PolMine/polmineR
Nested Boxes
New features
- Using the `corpus` class throughout is an opportunity to keep the corpus ID together with the registry directory of a corpus. And as we are able now to handle corpora defined in different registry files, the temporary registry directory is not necessary any more. It still exists, yet only for temporary corpora and corpora that are described by registry files that cannot be modified, i.e. corpora shipped in packages. The test corpus of the polmineR package is an important respective scenario.
- `get_token_stream()` now has an argument `min_length`.
- `registry_*()` functions are superseded by `RcppCWB::corpus_*` functions and throw a warning that they are deprecated.
- The REUTERS corpus is not included in the package any more: There was an identical copy of the REUTERS corpus included in the RcppCWB package. All examples and unit tests now use `use(pkg = "RcppCWB", corpus = "REUTERS")` to make the REUTERS corpus available.
- `size()` works for `partition`/`subcorpus` with an s-attribute that is a child of the s-attribute the object is based on #216.
- The `trim()`-method for `context` objects has a new argument `fn` for supplying a (trimming) function to be applied to all match contexts.
- A new s-attribute "protocol_date" has been added to sample corpus "GERMAPARLMINI", so that sample data for nested corpus data is available. To prevent confusion between s-attributes "protocol_date" (at protocol-level) and "date" (at speaker-level), argument `s_attribute_date` is stated explicitly in all examples.
- Method `size()` has been refactored to work with nested corpora.
- Method `encoding()` and replace method `encoding<-` are defined for `call` and `quosure` objects to get and adjust the encoding, replacing a previously unexported function `.recode_call()`.
- The `subset()` methods for `corpus` and `subcorpus` objects now handle expressions for subsetting as quosures, laying the ground to program against `subset()`, see respective update of the examples, #212.
- Functionality for indexing `bundle` objects with single square brackets is developed now. Indexing with double brackets, supplying multiple values for `i`, is deprecated. The aim is a consistent behavior that a `bundle` indexed by `[` will always return a `bundle`, and indexing with `[[` always gets a single object from the list of objects. #214
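
The indexing rule described for `bundle` objects mirrors how base R treats lists. The following is a minimal sketch that uses a plain named list as a stand-in for a bundle (an assumption made for illustration; a polmineR `bundle` is an S4 object and is not used here).

```r
# A plain named list stands in for a bundle of objects (illustration only).
objects <- list(a = 1:3, b = 4:6, c = 7:9)

# Single brackets keep the container type, however many elements are selected.
subset_one <- objects["a"]
subset_two <- objects[c("a", "c")]

# Double brackets extract exactly one element from the container.
single <- objects[["a"]]

print(class(subset_one))
print(class(single))
```

Selecting several elements is thus a job for `[`, which is the behavior the deprecation of multiple values for `i` in `[[` is meant to enforce.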
Minor improvements
- The `use()` function now has an additional argument `corpus` to specify which corpus from a package shall be loaded (#138).
- The `get_token_stream()`-method for `partition_bundle` objects is more memory efficient (no exhaustion for big corpora) and faster.
- Significantly improved performance of `split()`-method for `corpus` objects.
- The `split()`-method for `corpus` objects offers progress bar.
- `as.speeches()` for `corpus` objects has new argument `subset`, offering a significantly faster approach than the method for `subcorpus` objects in many cases.
- The `size()` method will return `NA` and issue a telling warning if the slot `corpus` and `registry_dir` of the `corpus` object are not filled #222.
- `get_token_stream()` will return list of `integer` values if `decode` is `TRUE` (#213).
- After applying `trim()` on a `context` object using arguments `positivelist` or `negativelist`, the `count` slot as reported by `length` was not updated. Fixed. (#220)
- The `enrich()` method for `context` objects has a new argument `stat` for creating / updating the `data.table` in the slot `stat`.
- Method `subset()` for `subcorpus` objects has been debugged to work with nested corpora.
- New option `polmineR.mdsub` configures substitutions that are applied on markdown documents to prevent presence of characters that would be misinterpreted as formatting instructions. Fixes #166.
- The messages issued by `check_cqp_query()` now include a hint that argument `check` can be used to omit checking the CQP syntax to prevent false positives. Addresses #171.
Bug fixes
- The ability of `cooccurrences()` (and `context()`) to process more than one p-attribute has been lost temporarily. Fixed. #208.
- Removed a bug for `hits()` method for `partition` objects #215.
- After applying `trim()` on a `context` object using arguments `positivelist` or `negativelist`, the count statistics reported in the `stat` slot were not updated. Fixed. (#220)
- Structural attributes do not disappear any more after adding tooltips to a `kwic` object #218.
- Method `subset()` would not work reliably with argument `regex` if more than one expression is passed #212. Fixed.
- `terms()` did not work for `subcorpus` objects. Fixed. #209
- When applying `as.speeches()` on a `subcorpus`, the date may have been missing from the object names. Fixed. #219
- Fixed an issue that `minNchar` in the `noise()` method would work exactly the way opposite to the way intended #211.
- The slot `registry_dir` of a `cooccurrences_bundle` derived from a `partition_bundle` was not filled, resulting in an error of the `show()`-method for the `cooccurrences_bundle`. Fixed #222.
Documentation
- The documentation of the `cooccurrences()` method now includes example code for creating a table using `DT::datatable()` with buttons for exporting tables (to Excel, for instance).
Yellow Submarine
New Features
- The `dispersion()` method now accepts an argument `fill`, a `logical` value to explicitly control whether zero matches for a value of a structural attribute should be reported (#160). The performance of adding columns (required only if two structural attributes are provided) is improved substantially by using the reference semantics of the data.table package. If many columns are added at once, a warning issued by the data.table package is supplemented by a further explanatory warning of the polmineR package. Filling up the `data.table` was limited previously to `freq = FALSE`, this limitation is lifted.
- The `html()` method is implemented for `remote_subcorpus` objects.
- The `hits()` method is implemented for `remote_corpus` and `remote_subcorpus` class (#160).
- A new S4 class `ranges` is introduced to manage ranges of corpus positions for query matches. This is a preparatory step to remove an inconsistency from the `hits` class that mixed two very different usages (getting ranges of corpus positions for matches and getting counts).
- A new S4 method `ranges` serves as the constructor to prepare a `ranges` class object. In combination with `as.data.table()`, it replaces former functionality of `hits()` without argument `s_attribute`.
- The output of the `hits()` method is altered, making it much more consistent than previously: The method will consistently return a `hits` object.
- The method `hits()` has a new argument `fill` that will report zeros for combinations of s-attributes with no matches for a query.
- The argument `subset` for the `subset` method for `remote_corpus` objects can now be a call (#162), this is a basis for passing vectors to OpenCPU server.
- `p_attributes()` implemented for `remote_corpus` and `remote_partition`.
- A new `regions()` method (for `corpus` class objects to start with) returns a `regions` class object with a regions matrix (slot `cpos`) with regions for an s-attribute (#176).
- The `get_token_stream()`-method for `regions` and `matrix` objects will now accept a logical argument `split`. If `TRUE`, a list of character vectors is returned. The envisaged use case is a fast decoding of sentences (#176).
- An `encoding()` method has been defined if argument `object` is missing. Calling `encoding()` will return the session character set. If it cannot be determined using `localeToCharset()`, a UTF-8 session charset will be assumed. Internally, `encoding()` replaces a direct call of `localeToCharset()` to avoid errors that have occurred on GitHub Actions with Ubuntu 20.04 (#188).
- If the session character set cannot be guessed by `localeToCharset()` (`NA` return value), a startup message will issue a warning that 'UTF-8' is assumed (#188).
- The `size()` method is now able to handle nested s-attributes.
- The `trim()` method for `context` objects will now accept a matrix with ranges as `positivelist` argument.
- The `highlight()` method now accepts `matrix` objects as elements of the list of items to be highlighted. It is treated as a set of regions, such as resulting from `cpos()`. Thus it is possible to highlight matches for CQP queries.
- The package now requires at least RcppCWB v0.5.2, which includes a much more efficient worker for token contexts for the `context()` method.
- The `count()`-method for `partition_bundle` objects failed with an opaque error message if there were no query matches at all. There is now a check for this scenario and the expected table is returned (zero values throughout).
- The `corpus` class is now a superclass for the `textstat` class, starting to create a more coherent class structure in general. This is an important preparatory step to be able to keep all registry files in the temporary registry directory. To avoid a confusion in the class system resulting from the coerce method from `partition` to `corpus` objects, this coerce method (defined by `setAs()`) has been removed. The `get_template()`-method for `partition` objects using this coerce method has been removed - as it inherits the method anyway, it is not needed any more. See #201.
- The kwic tab of the shiny app included in the package exposes the improved capabilities to determine the context of a query match based on an s-attribute (argument `region`) and to consider the changing value of an s-attribute as a boundary of a context (argument `boundary`). New menu "boundary" and radio buttons, conditional on presence of s-attributes "s" and/or "p".
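
The fallback logic described for `encoding()` when called without arguments (try `l10n_info()`, then `localeToCharset()`, and assume UTF-8 if the result is `NA`) can be sketched in base R as follows. This is a sketch under stated assumptions, not polmineR's implementation: the function name `session_charset()` and the returned labels are hypothetical.

```r
# Hypothetical helper, not part of polmineR: guess the session charset.
session_charset <- function() {
  # l10n_info() is cheap and flags UTF-8 and Latin-1 locales directly.
  info <- l10n_info()
  if (isTRUE(info[["UTF-8"]])) return("UTF-8")
  if (isTRUE(info[["Latin-1"]])) return("latin1")
  # Fall back to utils::localeToCharset(), which may return NA.
  charset <- utils::localeToCharset()[1]
  # Assume UTF-8 if the charset cannot be determined.
  if (is.na(charset)) "UTF-8" else charset
}

print(session_charset())
```

Checking `l10n_info()` first avoids parsing the locale string in the common cases, which is in line with the performance and robustness argument made for #196.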
Minor Improvements
- If arguments `sAttribute` or `pAttribute` (instead of `s_attribute` and `p_attribute`) are still used with the `dispersion()` method, a warning is issued declaring that the argument is deprecated.
- Examples in packages that depend on polmineR would have faced the issue that loading/re-loading the package in several examples would not be possible as the mechanism of cleaning up between examples would trigger a removal of polmineR's temporary directories but not the re-creation. Removing temporary files is now moved from polmineR's `.onDetach()` to `.onUnload()` (#164).
- Significant improvement of the performance of the `as.phrases()` method (#172).
- The `as.corpusEnc()` auxiliary function will now check whether non-convertible characters lead to an `NA` result and issue a warning explaining how this can be avoided (#151).
- Significant performance improvement of the `context()` method for `matrix` objects if arguments `left` and `right` are named `integer` vectors. All `context()` methods benefit from the improved performance of this worker for creating contexts for query matches.
- New coerce-method to derive matrix with ranges from a `context` object.
- The `enrich()` method for `context` objects will now perform an in-place operation when adding new s-attributes.
- The `as.cqp()` function includes arguments `check` and `warn` for running `check_cqp_query()` on queries.
- The `context()` method for `matrix` objects includes a new argument `boundary` and relies on a new function `RcppCWB::region_matrix_context()`.
- Default value of argument `verbose` of `context()`-methods is now `FALSE`.
- The `as.corpusEnc()` auxiliary function now includes a test whether input character vector includes unexpected encodings and issues a warning if this is the case.
- The `cpos()` method will now check for accidental leading and/or trailing whitespace and remove it for token lookup. Note that `hits()`, `count()` and `dispersion()` will report queries without removing whitespace.
- Internals of the `count()`-method for `partition_bundle` objects will be much more efficient when many columns with zero matches need to be added. The implementation avoids a data.table warning when the bulk action of adding new columns exceeds the number of columns reserved by data.table objects.
- The DESCRIPTION file does not state "LazyData: yes" any more, as the package does not have a data directory.
- Typo in messages of `trim()` is removed (#197).
- `encoding()` relies on `l10n_info()` before using `localeToCharset()` as a matter of performance and robustness (#196).
- Class `corpus` has a new slot `registry_dir`. This is a preparatory step that will facilitate managing corpora described by registry files in different registry directories.
- Constructor `corpus()` for `corpus`-class objects has an argument `registry_dir` that will be required to distinguish corpora described by registry files in different registry directories.
- The package now relies on the fs package to handle directories and paths. Slots in S4 classes are not `fs_path` classes.
- Internally, functions `registry_get_home()` and `registry_get_encoding()` have been replaced by RcppCWB functions `cl_charset_name()` and `corpus_data_dir()` with equivalent result, but faster due to immediate access to C representation of the corpus.
- The `corpus()` method will deduce the registry directory from the C representation of the corpus if possible.
- An inefficiency in the implementation of `as.markdown()` has been removed, making fulltext display (using `read()` or `html()`) much faster.
- Calling `corpus()` without any arguments now returns an expanded `data.frame` reporting all slots of the `corpus` class objects, skipping only the data directory of the corpus.
- The `cpos()` method for `matrix` objects that turns a matrix with corpus positions into a vector of `integer` values now relies on a C-level implementation newly included in the RcppCWB package, that is significantly faster than the best possible implementation in R.
- The table generated by `kwic()` shows row numbers, which is convenient when referring to specific rows (#184).
- The `as.cqp()` function now checks whether argument `query` meets the expectation that it is a query (#191).
- The method `make_region_matrix()`, which has been used internally only, has been removed. `RcppCWB::s_attr_regions()` replaces the functionality.
- The `as.speeches()` method had not yet been implemented for nested corpora. A limited rewrite makes this work now (#198).
- Inconsistencies and unnecessary limitations of the `get_token_stream()` method for `partition_bundle` objects have been addressed: Multiple p-attributes can be used without providing `phrases` at the same time (#142) and using the `subset` argument does not depend on using `phrases` either (#141).
- The `as.sparseMatrix()` method is now also defined for `DocumentTermMatrix` objects (was available previously only for `TermDocumentMatrix` objects).
- If a vector of queries is named, these names are now used consistently by the `hits()` method (#195).
- `get_type()` for `subcorpus_bundle` returns `NULL` if no type is defined as a matter of consistency (#169).
- If an expression for subsetting a `corpus`/`subcorpus` includes invalid s-attributes, the warning is telling and `NULL` is returned (#179).
- The cooccurrences option...
Putty Knive
New Features
- A new `decode()` method for `data.table` objects shall serve as a more user-friendly access to the efficiency of the `RcppCWB::cl_cpos2str()` function.
- The `data.frame` returned when calling `corpus()` will now include a column with the encoding of the corpus.
Bug fixes
- The `warn` argument of the `get_template()`-method remained unused, resulting in a warning message even if `warn` was `FALSE`, resulting in a set of warning messages when calling `corpus()`. The argument is used as intended now and defaults to `FALSE`.
- The `as.markdown()`-method for `subcorpus` objects now uses an (internal) default template accessible via `polmineR:::default_template`, if no template is defined for a corpus.
- The `registry_get_encoding()` function returned a length-one character vector if the regular expression to extract the charset corpus property did not yield a match. To prevent errors, it now returns "latin1" as the CWB standard encoding (#159).
Unicorn Dream
Minor Improvements
- The `knit_print()`-method for `textstat` objects does not accept the three dots argument any more. As an installation of pandoc is necessary to include resulting `htmlwidget` in an html document, the method will check now whether pandoc is available. If not, a formatted `data.table` is returned.
- The `knit_print()`-method for `kwic` objects does not have the `pagelength` argument any more as it has been unused. The pagelength is controlled by the option `polmineR.pagelength`. Internally, the method will call the method for the `textstat` superclass of the `kwic` class, which is newly robust against a missing installation of pandoc.
- Any Unicode characters that could be detected have been removed from the documentation to avoid warnings on the CRAN Solaris test machine (#156).
Bug Fixes
- The `chisquare()` method needs to increase the number of digits temporarily, but failed to revert to the original value as expected. One implication was, that rounding the values in `data.table` objects would fail, and rounding in general yielded very strange results (#155). Fixed.
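
The kind of bug described above (a global setting changed temporarily and not restored) is commonly avoided in R with `on.exit()`. The sketch below assumes the setting in question is the `digits` option; `with_digits()` is a hypothetical helper and not the fix applied in polmineR.

```r
# Hypothetical helper: evaluate an expression with a temporary 'digits'
# option and restore the previous value afterwards, even on error.
with_digits <- function(digits, expr) {
  old <- options(digits = digits)    # options() returns the previous values
  on.exit(options(old), add = TRUE)  # restore when the function exits
  force(expr)                        # the lazy argument is evaluated here
}

before <- getOption("digits")
result <- with_digits(15, format(pi))
after <- getOption("digits")

print(result)
print(identical(before, after))
```

Registering the restore step right after changing the option means that no later code path, including an error, can leave the session with altered rounding behavior.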
Caterpillar Mambo
New Features
- The `corpus` class has been put in a shape to become the default point of departure of most workflows. All core methods are now available for the `corpus` class, and have been implemented newly if necessary, e.g. `show()` and `size()`-method. The constructor method for a `corpus` object, the `corpus()` method, will now check whether the character vector with the corpus ID refers to an available corpus, whether all letters are upper case and issue informative warnings and error messages.
- The `s_attributes()`-method for `corpus` objects has been reworked: It will decode binary files directly, without reliance on the corpus library functions, which is significantly faster.
- The `Corpus` reference class is now obsolete after the introduction of the S4 `corpus` class. To maintain the functionality not covered otherwise, new generics `get_info` and `show_info` have been introduced and defined for the `corpus` class.
- Methods available for the `subcorpus` class have been expanded so that this class can supersede the `partition` class: Methods newly available are `cpos()`, `count()`, `p_attributes()`, `s_attributes()`, `get_token_stream()`, and `size()`. Technically, there is a virtual `slice`-class, from which `subcorpus` inherits (methods called via `callNextMethod()`).
- A new `subset()`-method for the `corpus` and `subcorpus` classes to generate subcorpora (i.e. `subcorpus` objects) has been introduced. It outperforms the `partition()` method. The `subset()`-method for `corpus` and `subcorpus` objects will be the default way to work with non standard evaluation in a manner that feels "R-ish" (#40).
- The `zoom()`-method that has been introduced experimentally has been dropped again in favor of the `subset()`-method to get `subcorpus` objects from `corpus` and `subcorpus` objects. A set of experimental methods for an initial check of the feasibility of a non-standard evaluation approach to the generation of subcorpora has been dropped (methods `$`, `==`, `!=`, `zoom` for `corpus`-class).
- To facilitate the transition from the `partition` class (inheriting from the `textstat` class) to the `subcorpus` class (inheriting from the `textstat` class), there is a new `coerce()`-method to turn a `partition` object into a `subcorpus` object.
- A new `remote_corpus`-class is the basis for accessing remote corpora. A `remote_subcorpus` can be derived from a `remote_corpus`. Methods available for remote corpora and subcorpora remain limited at this stage.
- Consolidation of the class system: For all the S4 classes in the package, multiple contains have been checked, and multiple contains have been removed.
- The `subcorpus_bundle` class now inherits from `partition_bundle`. This is not intended to be a long-term solution, but facilitates the implementation of new workflows based on the `subcorpus` class rather than the `partition` class.
- Calling the polmineR shiny app via `polmineR` did not have safeguards if the suggested packages shiny and shinythemes were not installed. Now there will be a conditional installation of the packages required for running the shiny app.
- The somewhat odd class `CorpusOrSubcorpus` has been removed. The `ngrams`-method now applies for `corpus` and `subcorpus` objects.
- The pipe operator of the magrittr package is imported now, and magrittr has moved from a suggested package to a required package.
- The `label()`-method, present for a while, is superseded by an `edit()`-method now. It will call a shiny gadget either using DataTables or Handsontable. The former `Labels` reference class has been turned into a S4 class, because the desired reference logic can also be achieved with a `data.table` in a slot of the labels class.
- The `table`-slot of the `kwic` class has been renamed as `stat` slot (a `data.table`), so that the `kwic` class can now inherit from the `textstat` class. The `enrich()`-method for objects of class `kwic` now includes a new argument `extra` that will add extra tokens to the left of the windows for concordances so that qualitative inspections for query hits can work with more context.
- The `as.TermDocumentMatrix()` and the `as.DocumentTermMatrix()`-methods are now also defined for `kwic` objects. They work exactly the same as for the `context` class. To avoid having to write new methods, a new `neighborhood` virtual class has been introduced. The aforementioned methods are defined for the virtual class and are available for context and kwic class objects.
- Added CQP functionality to count tab in shiny app, and to the dispersion tab.
- There is now a basic implementation of `get_token_stream()` for a `partition_bundle` object.
- The `Cooccurrences()`-method is now available for `subcorpus`-objects (#88).
- There is a new coerce method to turn a `kwic`-object into a `context`-object. The `neighborhood` virtual class could be discarded again, and a bug could be removed that left an `enrich()`-operation for `kwic` objects (argument `p_attribute`) ineffectual (#103).
Minor changes
- Added a new argument `regex` to the `cpos()`-method (for `corpus` objects), which will interpret argument `query` as a regular expression. This may be faster than taking `query` as an outright CQP query.
- The configure-script in the package that would adjust paths in the registry files for the corpora included in the package for documentation and testing purposes has been removed. Having switched to a temporary registry directory, it has lost its function.
- The version of the data.table package now required is 1.12.2, because previous versions did not allow adding columns to a new data.table.
- Implemented the possibility to use multiple queries in `dispersion`-method (#92).
- To keep up with the renaming of functions and arguments in the package, "sAttributes" and "pAttributes" in the polmineR shiny app have been renamed ("s_attributes", and "p_attributes", respectively).
- The shiny app module for kwic output will not show `p_attribute` and `positivelist` by default.
- The `format()`-method is used to create proper output in the cooccurrences of the shiny app.
- User names that include non-ASCII characters were a persistent problem on Windows machines (#66). The solution now is to check for non-ASCII characters in the path to the data directory, and to use the "old" short DOS path if necessary. The worker is a modified `registry()`-function.
- The ordering of the table for `ll`-method had been somewhat mixed up, which is repaired now. Tokens with NA values for the ll-test will show up at the end of the table.
- The `registry_move()`-function, used only internally at this stage, is exported now so that it can be used by other packages.
- The return value of the `get_token_stream()`-method for `regions` objects was a `data.table`. The behavior is now in line with the other `get_token_stream()` methods.
- The `tempcorpus()`-method and the `tempcorpus` class have been removed from the package, having become utterly deprecated.
- The `summary()`-method for `partition`-class objects has been turned into a method for the `count`-class, to eliminate an inconsistency. The example of a workflow has been moved to the documentation object for the `count`-class.
- The `browse()`-method has not proven to be useful and has been removed from the package. A new `browse()`-function is introduced to throw a warning, if browse should be called nevertheless.
- A refactoring of the `split()`-method for `partition`-objects improved the readability of the code, but the performance gain is minimal.
- A new `kwic_bundle`-class has been introduced, a list of `kwic` objects can be turned into this new class using `as.bundle`.
- The `context()`-method will now take again as input character vectors for the arguments `left` and `right` to expand to the left and right boundaries of the designated region (#87).
- Rework of the way messages are printed to make it easy to implement notifications in the shiny environment.
- Default highlighting when a positivelist is supplied has been removed from the `kwic()`-method. This ensures that subsequent highlighting operations can assign new colors (#38).
- Implemented feature request for `dispersion()` that results are reported for all values of structural attributes, including those with zero matches. (#104)
- Performance improved for the `cpos`-method for `matrix` which unfolds a matrix with regions of corpus positions, useful for operations that require many calls.
- The `count`-method for `partition_bundle` has been reworked and is much faster and more memory efficient.
- `as.TermDocumentMatrix()` for `partition_bundle` optimized to work efficiently with large corpora.
- Introduction of a context,matrix-method to have a unified auxiliary function to create contexts.
- The `as.corpusEnc()`-function uses the `localeToCharset()`-function from the utils package to determine the charset of input strings. On RStudio Server, we have seen cases when the return value is NA. Then it will be assumed that the locale is UTF-8.
- Functionality to highlight terms in kwic display has been restored for the shiny app.
Bug fixes
- Removed a bug in the `context()`/`kwic()` method that led to superfluous words in the right context.
- Removed a bug that occurred with the `as.data.frame()`-method for `kwic`-objects when no metadata were added.
- The `count()`-method for `partition_bundle`-objects did not perform `iconv()` if necessary - this has been corrected.
- Indexing the concord...
Bright Side
polmineR 0.7.11
NEW FEATURES
- A `Cooccurrences()`-method and a `Cooccurrences`-class have been migrated from the (experimental) polmineR.graph package to polmineR to generate and manage all cooccurrences in a corpus/`partition`. A `cooccurrences()`-method produces a subset of `Cooccurrences`-class object and is the basis for ensuring that results are identical.
- New functionality to make using corpora more robust when paths include special characters: There is now a temporary data directory which is a subdirectory of the per-session temporary directory. A new function `data_dir()` will return this temporary data directory. The `use()`-function will now check for non-ASCII characters in the path to binary corpus data and move the corpus data to the temporary data directory (a subdirectory of the directory returned by `data_dir()`), if necessary. An argument `tmp` added to `use()` will force using a temporary directory. The temporary files are removed when the package is detached.
- Experimental functionality for a non-standard evaluation approach to create subcorpora via a `zoom()`-method. See documentation for (new) `corpus`-class (`?"corpus-class"`) and extended documentation for `partition`-class (`?"partition-class"`). A new `corpus()`-method for character vector serves as a constructor. This is a beginning of somewhat re-arranging the class structure: The `regions`-class now inherits from the new `corpus`-class, and a new `subcorpus`-class inherits from the `regions`-class.
- A new function `check_cqp_query()` offers a preliminary check whether a CQP query may be faulty. It is used by the `cpos()`-method, if the new argument `check` is TRUE. All higher-level functions calling `cpos()` also include this new argument. Faulty queries may still cause a crash of the R session, but the most common source is prevented now, hopefully.
- A `format()`-method is defined for `textstat`, `cooccurrences`, and `features`, moving the formatting of tables out of the `view()`, and `print()`-methods. This will be useful when including tables in Rmarkdown documents.
MINOR IMPROVEMENTS
- Startup messages reporting the package version of polmineR and the registry path are omitted now.
- The functions `registry()` and `data_dir()` now accept an argument `pkg`. The functions will return the path to the registry directory / the data directory within a package, if the argument is used.
- The `data.table`-package used to be imported entirely, now the package is imported selectively. To avoid namespace conflicts, the former S4 method `as.data.table()` is now a S3 method. Warnings appearing if the `data.table` package is loaded after polmineR are now omitted.
- The `coerce()`-methods to turn `textstat`, `cooccurrences`, `features` and `kwic` objects into htmlwidgets now set a `pageLength`.
- New methods for `partition_bundle` objects: `[[<-`, `$`, `$<-`
- Rework of indexing `textstat` objects.
- A slot `p_attribute` has been added to the `kwic`-class; `kwic()`-methods and methods to process `kwic`-objects are now able to use the attribute thus indicated, and not just the p-attribute "word".
- A new `size()`-method for `context`-objects will return the size of the corpus of interest (coi) and the reference corpus (ref).
- New `encoding()`-method for character vector.
- New `name()`-method for character vector.
- A new `count()`-method for `context`-objects will return the `data.table` in the `stat`-slot with the counts for the tokens in the window.
- The `decode()`-function replaces a `decode()`-method and can be applied to partitions. The return value is a `data.table` which can be coerced to a `tibble`, serving as an interface to tidytext (#37).
- The `ngrams()`-method will work for corpora, and a new `show()`-method for `textstat`-object generates a proper output (#27).
BUG FIXES
- Any usage of `tempdir()` is wrapped into `normalizePath(..., winslash = "/")`, to avoid mixture of file separators in a path, which may cause problems on Windows systems.
- In the calculation of cooccurrences, the node has previously been included in the window size. This has been corrected.
- The `kwic()`-method for corpora returned one surplus token to the left and to the right of the query. The excess tokens are now removed.
- The object returned by the `kwic()`-method for `character`-objects did not include the correct position of matches in the `cpos` slot. Corrected.
- Bug removed that occurs when context window reaches beyond beginning or end of a corpus (#48).
- When generating a `partition_bundle` using the `as.speeches()`-method, an error could occur when an empty partition has been generated accidentally. Has been removed. (#50)
- The `as.VCorpus()`-method is not available if the `tm`-package has been loaded previously. A coerce method (`as(OBJECT, "VCorpus")`) solves the issue. The `as.VCorpus()`-method is still around, but serves as a wrapper for the formal coerce-method (#55).
- The argument `verbose` as used by the `use()`-method did not have any effect. Now, messages are not reported as would be expected, if `verbose` is `FALSE`. On this occasion, we took care that corpora that are activated are now reported in capital letters, which is consistent with the uppercase logic you need to follow when using corpora. (#47)
- A new check prevents an error that has occurred when a token queried by the `context()`-method would occur at the very beginning or very end of a corpus and the window would transgress the beginning / end of the corpus without being checked (#44).
- The `as.speeches()`-function caused an error when the type of the partition was not defined. Solved (#57).
- To deal with issues resulting from an unset locale, there is a check during startup whether the locale is unset (i.e. 'C') (#39).
- There was a difficulty to generate a `TermDocumentMatrix` from a `partition_bundle` if the partitions in the `partition_bundle` were not named. The fix is to assign integer numbers as names to the partitions (#58).
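
The `tempdir()` wrapping mentioned in the first item of this list can be illustrated with base R alone. The subdirectory name `polmineR_session` below is hypothetical and only serves to show that paths built on the normalized directory use a single kind of separator.

```r
# Normalize the per-session temporary directory so that forward slashes are
# used throughout, also on Windows.
tmp <- normalizePath(tempdir(), winslash = "/")

# Hypothetical subdirectory name; file.path() joins with "/" on all platforms.
session_dir <- file.path(tmp, "polmineR_session")

print(session_dir)
```

Without `winslash = "/"`, `normalizePath()` returns backslashes on Windows, and a later `file.path()` call would then produce a path mixing both separators.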
DOCUMENTATION FIXES
- Substantial rework of the documentation of the `ll()`, and `chisquare()`-methods to make the statistical procedure used transparent.
- Expanded documentation for `cooccurrences()`-method to explain subsetting results vs applying positivelist/negativelist (#28).
- Wrote some documentation for the `round()`-method for `textstat`-objects that will show up in documentation of `textstat` class.
- Improved documentation of the `mail()`-method (#31).
- In the examples for the `decode()`-function, using the REUTERS corpus replaces the usage of the GERMAPARLMINI corpus, to reduce time consumed when checking the package.
Bachelor's Delight
polmineR 0.7.10
NEW FEATURES
- The package now offers a simplified and seamless workflow for dictionary-based sentiment analysis: The `weigh()`-method has been implemented for the classes `count` and `count_bundle`. Via inheritance, it is also available for the `partition`- and `partition_bundle`-classes. In addition, a new `summary()`-method for `partition`-class objects is introduced. If the object has been weighed, the list that is returned will include a report on weights. There is an example that explains the workflow.
- The `partition_bundle`-method for `context`-objects has been reworked entirely (and is working again); a new `partition`-method for `context`-objects has been introduced. Both steps are intended for workflows for dictionary-based sentiment analysis.
- The `highlight()`-method is now implemented for class `kwic`. You can highlight words in the neighborhood of a node that are part of a dictionary.
- A new `knit_print()`-method for `textstat`- and `kwic`-objects offers a seamless inclusion of analyses in Rmarkdown documents.
- A `coerce()`-method to turn a `kwic`-object into an htmlwidget has been singled out from the `show()`-method for `kwic`-objects. Now it is possible to generate an htmlwidget from a kwic object, and to include the widget in an Rmarkdown document.
- A new `coerce()`-method turns `textstat`-objects into an htmlwidget (DataTable), very useful for Rmarkdown documents such as slides.
- A new argument `height` for the `html()`-method allows defining a scroll box. Useful for embedding a fulltext output in an Rmarkdown document.
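To make the idea of dictionary-based weighing more concrete, here is a minimal base R sketch. It does not call polmineR: `weigh_counts()`, the counts and the dictionary are hypothetical and made up for illustration, and the package's `weigh()`-method may work differently in detail.

```r
# Made-up token counts, standing in for the result of a count operation.
counts <- data.frame(
  word = c("good", "bad", "oil", "excellent"),
  count = c(3L, 2L, 10L, 1L),
  stringsAsFactors = FALSE
)

# Hypothetical sentiment dictionary: one weight per term.
dictionary <- data.frame(
  word = c("good", "excellent", "bad"),
  weight = c(1, 2, -1),
  stringsAsFactors = FALSE
)

# Hypothetical helper: attach weights to counts; terms that are not in the
# dictionary get a weight of 0 and do not contribute to the score.
weigh_counts <- function(counts, dictionary) {
  weighed <- merge(counts, dictionary, by = "word", all.x = TRUE)
  weighed$weight[is.na(weighed$weight)] <- 0
  weighed$weighted_count <- weighed$count * weighed$weight
  weighed
}

weighed <- weigh_counts(counts, dictionary)
score <- sum(weighed$weighted_count)
```

The sum of the weighted counts is one simple summary statistic; a report on weights could equally list positive and negative terms separately.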
MINOR IMPROVEMENTS
- The `partition_bundle`-class, rather than inheriting from the `bundle`-class directly, will now inherit from the `count_bundle`-class.
- The `use()`-function is now limited to activating corpora in data packages. Having introduced the session registry, switching registry directories is not needed any more.
- The `as.regions()`-function has been turned into an `as.regions()`-method to have a more generic tool.
- Some refactoring of the `context`-method, so that full use of `data.table` speeds things up.
- The `highlight()`-method allows definitions of terms to be highlighted to be passed in via three dots (`...`); no explicit list is necessary.
- A new `as.character()`-method for `kwic`-class objects is introduced.
BUG FIXES
- The `size_coi`-slot (coi for corpus of interest) of the `context`-object included the node; the node (i.e. matches for queries) is now excluded from the count of `size_coi`.
- When calling `use()`, the registry directory is reset for CQP, so that the corpora in the package that have been activated can be used with CQP syntax.
- The script configure.win has been removed so that installation works on Windows without an installation of Rtools.
- Bug removed from the `s_attributes()`-method for `partition`-objects: "fast track" was activated without preconditions.
- Bug removed that would swallow metadata/s-attributes to be displayed in `kwic`-output after highlighting.
- As a matter of consistency, the argument `meta` has been renamed to `s_attributes` for the `kwic()`-method for `context`-objects, and for the `enrich()`-method for `kwic`-objects.
- To avoid confusion (with the argument `s_attributes`), the argument `s_attribute` to check for integrity within a struc has been renamed to `boundary`.
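The `size_coi` correction in the first item of this list can be illustrated with a small base R sketch. It is a simplification under assumptions: `context_size()` is a hypothetical helper, corpus positions are taken to be zero-based, windows are assumed not to overlap, and the actual computation in polmineR is not reproduced here.

```r
# Hypothetical helper (not polmineR's implementation): number of tokens in
# the context windows around matches, excluding the matched nodes themselves.
context_size <- function(match_left, match_right, left = 5L, right = 5L, corpus_size) {
  # Keep windows within the corpus (zero-based corpus positions assumed).
  window_start <- pmax(match_left - left, 0L)
  window_end <- pmin(match_right + right, corpus_size - 1L)
  window_tokens <- window_end - window_start + 1L
  # The node is the match itself; it may span more than one token.
  node_tokens <- match_right - match_left + 1L
  sum(window_tokens - node_tokens)
}

# Two matches: a single-token node and a two-token node.
size_coi <- context_size(
  match_left = c(10L, 50L), match_right = c(10L, 51L),
  left = 2L, right = 2L, corpus_size = 100L
)
```

Counting the node as part of the corpus of interest would inflate `size_coi` by the total length of all matches, which affects any statistic computed from it.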
DOCUMENTATION FIXES
- Documentation for `kwic`-objects has been reworked thoroughly.
Jeanne d'Arc
The most visible change of polmineR v0.7.9 may be that the package moves to a snake_case coding style. This is increasingly the state of the art, and feels much more intuitive when working with the arguments 's_attributes' and 'p_attributes' (rather than pAttributes and sAttributes). Functions/methods are fully backwards compatible, so old code should not break.
The package now uses a session registry directory, which is a subdirectory of the temporary session directory. This has become mandatory, because CRAN policies do not allow resetting paths within a package once it has been installed. But it is very useful, because switching registry directories can now be avoided. The `use()`-function will now add the corpora in an R data package to the session registry. So this is a good start to work with multiple corpora wrapped in various packages. This involves a set of new functions:
- A (new) `registry_move()`-function is used to copy files to the tmp registry;
- The (new) `registry()`-function will get the temporary registry directory;
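The session registry mechanism described above can be sketched with base R file operations. The helper names `session_registry()` and `copy_to_session_registry()`, the directory name and the placeholder file content are assumptions for illustration; they are not the signatures of `registry()` or `registry_move()`.

```r
# Hypothetical helper: a registry directory below the temporary session
# directory, created on first use.
session_registry <- function() {
  dir <- file.path(tempdir(), "session_registry")
  if (!dir.exists(dir)) dir.create(dir, recursive = TRUE)
  dir
}

# Hypothetical helper: copy a registry file into the session registry and
# return the path of the copy.
copy_to_session_registry <- function(registry_file) {
  target <- file.path(session_registry(), basename(registry_file))
  file.copy(from = registry_file, to = target, overwrite = TRUE)
  target
}

# Stand-in for a registry file shipped in a data package (placeholder content).
package_registry <- file.path(tempdir(), "package_registry")
dir.create(package_registry, showWarnings = FALSE)
source_file <- file.path(package_registry, "demo_corpus")
writeLines("## placeholder registry file", source_file)

copied <- copy_to_session_registry(source_file)
```

Because the copy lives in the session's temporary directory, nothing inside an installed package has to be modified, which is the constraint the CRAN policy imposes.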
A set of changes makes working with `bundle` objects more versatile and robust:
- There is a new `as.list()`-method for bundle objects, to access the list in the slot `objects`;
- `as.bundle()` is more generic now, so that any kind of object can be coerced to a bundle;
- The `as.speeches()`-method has been turned into a function that allows a partition or a corpus as input;
The new version upgrades the `count`-class. The `count()`-method will serve as a constructor for a count object if no query is provided. This is particularly useful when working with `count_bundle`-objects.
Minor new features
- There is a new `is.partition()`-function (a logical check);
- A new argument 'type' has been added to the `partition_bundle()`-method;
- A new method `get_type()` has been introduced to make getting the corpus type more robust;
- A new `partition_bundle()`-method for `partition_bundle`-objects has been introduced;
Bug fixes
- `s_attributes()` for partition-objects in line with RcppCWB requirements (no negative values of strucs);
- `count()` repaired for multiple p-attributes;
- bug removed causing a crash for `as.markdown()`-method when cutoff is larger than number of tokens;
- a bug removed that has prevented the `name<-` method from working properly for bundle objects;
- for `count()` for `partition_bundle`-objects, the column 'partition' will be a character vector now (not factor);
- bug removed that has caused a crash when cutoff is larger than number of tokens in a partition when calling `get_token_stream`.
Enjoy!
Panda Belly
- upon loading the package, new check that data directories are set correctly in registry files to make sure that sample data in pre-compiled packages can be used
- startup messages adjusted slightly
- first version that works with sample data without complications
v0.7.5
- class 'Regions' renamed to class 'regions' as a matter of consistency
- data type of slot cpos of class 'regions' is a matrix now
- rework and improved documentation for decode- and encode-methods
- new functions copy.corpus and rename.corpus
- as.DocumentTermMatrix-method checks for strucs with value -1
- improved as.speeches-method: reordering of speeches, default values
- blapply-method: verbose output will be suppressed if progress is TRUE