Merge pull request #69 from Bridgeconn/dev
Get ready for release
joelthe1 authored Jun 11, 2020
2 parents 4502166 + e9a8803 commit 1b2f4eb
Showing 13 changed files with 239 additions and 106 deletions.
124 changes: 80 additions & 44 deletions README.md
@@ -1,12 +1,20 @@
# USFM Grammar

An elegant [USFM](https://github.com/ubsicap/usfm) parser (or validator) that uses a [parsing expression grammar](https://en.wikipedia.org/wiki/Parsing_expression_grammar) to model USFM. The grammar is written using [ohm](https://ohmlang.github.io/). **Only USFM 3.x is supported**.
An elegant [USFM](https://github.com/ubsicap/usfm) parser (or validator) that uses a [parsing expression grammar](https://en.wikipedia.org/wiki/Parsing_expression_grammar) to model USFM. The grammar is written using [ohm](https://ohmlang.github.io/). **USFM 3.x is supported**.

The parsed USFM is an intuitive and easy to manipulate JSON structure that allows for painless extraction of scripture and other content from the markup. USFM Grammar is also capable of reconverting the generated JSON back to USFM.

## Online Demo!
## Features
- USFM validation
- USFM to JSON convertor with 2 different levels of strictness
- JSON to USFM convertor
- CSV/TSV converter for both USFM and JSON
- Command Line Interface (CLI)

Try out the usfm-grammar based convertor online: https://usfm.vachanengine.org

### Try it out

Try out the `usfm-grammar` based online convertor: https://usfm.vachanengine.org

## Example

@@ -84,7 +92,7 @@ Try out the usfm-grammar based convertor online: https://usfm.vachanengine.org
}
],
"_messages": {
"warnings": [ "Book code is in lowercase. " ]
"_warnings": [ "Book code is in lowercase. " ]
}
}
```
@@ -97,67 +105,95 @@ The parser is [available on NPM](https://www.npmjs.com/package/usfm-grammar) and

## Usage

### Command Line Interface (CLI)

To use usfm-grammar from the command line install it globally like:

`npm install -g usfm-grammar`

Then from the command line (terminal) to convert a valid USFM file into JSON (on `stdout`) run:

`> usfm-grammar <file-path>`

```
> usfm-grammar -h
--version Show version number [boolean]
-l, --level specify the level of strictness in parsing [choices: "relaxed"]
--filter filters out only the specific contents from input USFM
[choices: "scripture"]
--format specifies the output file format
[choices: "csv", "tsv", "usfm", "json"]
-o, --output specify the fully qualified file path for output.
-h, --help Show help
```
The options `-l` (`--level`) and `--filter` do not have any effect if used with JSON to USFM conversion.

### JavaScript APIs
#### USFM to JSON
1) `USFMParser.toJSON()`
2) `USFMParser.toJSON(grammar.FILTER.SCRIPTURE)`

```
const grammar = require('usfm-grammar');
let input = '/*****input USFM string*****/';
var input = '\\id PSA\n\\c 1\n\\p\n\\v 1 Blessed is the one who does not walk in step with the wicked or stand in the way that sinners take or sit in the company of mockers,';
const myUsfmParser = new grammar.USFMParser(input);
var jsonOutput = myUsfmParser.toJSON();
var cleanJsonOutput = myUsfmParser.toJSON(grammar.FILTER.SCRIPTURE);
```
The `USFMParser.toJSON()` method returns a JSON structure for the input USFM string, if it is a valid usfm file.
The `USFMParser.toJSON()` method can take an optional second argument, `grammar.FILTER.SCRIPTURE`. In which case, the output JSON will contain only the most relevant scripture content, excluding all other USFM content.
If you intent to create a usfm from the data after processing it, we recommend using this method without the `SCRIPTURE` flag as this would loose information of other markers.

```
const myUsfmParser = new grammar.USFMParser(input, grammar.LEVEL.RELAXED);
// Returns JSON representation of a valid input USFM string
var jsonOutput = myUsfmParser.toJSON();
// Returns a simplified (scripture-only) JSON representation while excluding other USFM content
var scriptureJsonOutput = myUsfmParser.toJSON(grammar.FILTER.SCRIPTURE);
```
This relaxed mode provides relaxation of sereval rules in the USFM spec and give you a JSON output for a file that can be considered a workable USFM file.
> *NOTE:* If you intend to re-convert a USFM from the generated JSON, we recommend using `.toJSON()` without the `grammar.FILTER.SCRIPTURE` option in order to retain all information of the original USFM file.
**Relaxed Mode**
There is a high chance that a USFM file you encounter in the wild is _not_ fully valid according to the specification. In order to accommodate such cases and provide a [parse-able](https://github.com/Bridgeconn/usfm-grammar/issues/53#issuecomment-614170275) output to work with, we created a **Relaxed** mode. This may be used as shown:
```
var usfmValidity = myUsfmParser.validate();
const myRelaxedUsfmParser = new grammar.USFMParser(input, grammar.LEVEL.RELAXED);
var jsonOutput = myRelaxedUsfmParser.toJSON();
```
The `USFMParser.validate()` method returns a Boolean depending on whether the input USFM text syntax satisfies the grammar or not.

#### USFM to CSV/TSV
This mode relaxes several rules of the USFM specification. It tries hard to accommodate non-standard USFM markup and attempts to generate a JSON output for it. Only the most important markers are checked for, like the `\id` at the start and the presence of `\c` and `\v` markers. Though all the markers in the input USFM file are preserved in the generated JSON output, their syntax and their positions in the file are not verified for correctness. Even misspelled markers would be accepted.

> _Caution:_
> Errors may go unnoticed that might lead to loss of information. For example, if the file has mistakenly not given a space between verse marker and verse number, and has `\v3` the parser in `relaxed` mode would accept it as a separate marker (`v3`) and fail to recognise it is a verse. The right (or the hard) thing to do is fix the markup according to the specification. We generally recommend using the grammar in the normal (strict) mode.
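
To guard against the case described in the caution above, a small pre-check can flag verse or chapter markers that run straight into their number before a file is parsed in relaxed mode. The sketch below is a hypothetical helper and not part of `usfm-grammar`; the name `findUnspacedMarkers` and the regular expression are assumptions made for illustration.

```javascript
// Hypothetical helper, NOT part of usfm-grammar: reports \v or \c markers
// that are immediately followed by a digit (for example "\v3").
function findUnspacedMarkers(usfm) {
  const pattern = /\\(v|c)(\d+)/g;
  const hits = [];
  let match;
  while ((match = pattern.exec(usfm)) !== null) {
    hits.push({ marker: match[1], number: match[2], index: match.index });
  }
  return hits;
}

// The second verse marker is missing its space.
const sample = '\\id PSA\n\\c 1\n\\p\n\\v 1 Blessed is the one\n\\v2 who does not walk';
const problems = findUnspacedMarkers(sample);
console.log(problems);
```

An empty result does not mean the file is valid USFM; it only rules out this one class of mistake, so strict-mode validation remains the more thorough check.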
#### Validate
3) `USFMParser.validate()`

```
var csvString = myUsfmParser.toCSV();
var tsvString = myUsfmParser.toTSV();
// Returns a Boolean indicating whether the input USFM text satisfies the grammar or not.
// This method is available in both normal (strict) and Relaxed modes.
var isUsfmValid = myUsfmParser.validate();
```
The `toCSV()` and `toTSV()` methods give a tabular representation of bible verses in the <BOOK, CHAPTER, VERSE-NUMBER, VERSE-TEXT> format. These methods are available for the `USFMParser` class as well as `JSONParser` class.

#### JSON back to USFM
#### JSON to USFM
> *Note:*
> - The input JSON should have been generated by `usfm-grammar` (or in the same format).
> - If a USFM file is converted to JSON and then back to USFM, the re-created USFM will have the same contents but _spacing and new-lines will be normalized_.
4) `JSONParser.toUSFM()`
```
const myJsonParser = new grammar.JSONParser(jsonOutput);
// Returns the original USFM that was previously converted to JSON
let reCreatedUsfm = myJsonParser.toUSFM();
```
The `JSONParser` class can be initiated with a JSON object in the same format as the output of `USFMParser.toJSON()` method with or without the `FILTER.SCRIPTURE` option. If a USFM file is converted to JSON and then back to USFM, the re-created USFM will have same contents but spacing and new-lines will be normalized.

#### CLI

To use usfm-grammar as a command-line-interface, it should be installed globally

`npm install -g usfm-grammar`

Then it can be invoked by typing
This method works with JSON output created with or without the `grammar.FILTER.SCRIPTURE` option.

`usfm-grammar <file-path>`
#### USFM/JSON to CSV/TSV
5) `USFMParser.toCSV()`
6) `JSONParser.toCSV()`

from the terminal(command-line). This command lets you convert the input USFM file into JSON object, if the file is a valid USFM.

The optional flags that can be used are
7) `USFMParser.toTSV()`
8) `JSONParser.toTSV()`
```
--version Show version number [boolean]
-l, --level specify the level of strictness in parsing [choices: "relaxed"]
--filter filters out only the specific contents from input USFM
[choices: "scripture"]
--format specifies the output file format
[choices: "csv", "tsv", "usfm", "json"]
-o, --output specify the fully qualified file path for output.
-h, --help Show help
var csvString = myUsfmParser.toCSV();
var tsvString = myUsfmParser.toTSV();
```
The options -l (--level) and --filter doesnot have any effect if used with JSON to USFM convertion.
The `toCSV()` and `toTSV()` methods give a tabular representation of bible verses in the
`<BOOK, CHAPTER, VERSE-NUMBER, VERSE-TEXT>` format.
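
To make the tabular layout concrete, here is a minimal sketch of how one row per verse could be built from a parsed JSON object. It is not the library's `toCSV()` implementation. The input shape (`book.bookCode`, `chapterNumber`, and a `contents` array holding `verseNumber`/`verseText` entries) is an assumption based on the key names listed in the project docs, and the helpers `quoteCsvField` and `versesToCsv` are hypothetical.

```javascript
// Illustrative only: NOT the library's toCSV() implementation.
// Assumed input shape:
//   { book: { bookCode }, chapters: [{ chapterNumber, contents: [...] }] }

// Quote a field if it contains a comma, a double quote, or a line break,
// doubling any embedded double quotes.
function quoteCsvField(value) {
  const text = String(value);
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Build one <BOOK, CHAPTER, VERSE-NUMBER, VERSE-TEXT> row per verse.
function versesToCsv(parsed) {
  const rows = [];
  for (const chapter of parsed.chapters) {
    for (const item of chapter.contents) {
      // Skip non-verse entries such as paragraph markers ({ p: null }).
      if (item && item.verseNumber !== undefined) {
        rows.push([
          parsed.book.bookCode,
          chapter.chapterNumber,
          item.verseNumber,
          item.verseText,
        ]);
      }
    }
  }
  return rows.map((row) => row.map(quoteCsvField).join(',')).join('\n');
}

const parsedSample = {
  book: { bookCode: 'PSA' },
  chapters: [{
    chapterNumber: '1',
    contents: [
      { p: null },
      { verseNumber: '1', verseText: 'Blessed is the one who does not walk in step with the wicked,' },
    ],
  }],
};
const csvSketch = versesToCsv(parsedSample);
console.log(csvSketch);
```

Verse text often contains commas, which is why the sketch quotes fields. A TSV variant would join on tab characters instead.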

> *NOTE:* For the [disclaimer](https://github.com/Bridgeconn/usfm-grammar/blob/master/docs/Disclaimer.md), [release notes](https://github.com/Bridgeconn/usfm-grammar/blob/master/docs/changelog.md), etc., refer to the [docs](https://github.com/Bridgeconn/usfm-grammar/blob/master/docs) section.
3 changes: 2 additions & 1 deletion cli.js
@@ -16,6 +16,7 @@ const { argv } = require('yargs')
.alias('o', 'output')
.describe('o', 'specify the fully qualified file path for output.')
.alias('h', 'help')
.alias('v', 'version')
.help('help');
const grammar = require('./js/main.js');

@@ -39,7 +40,7 @@ try {
} catch (e) {
isJson = false;
}
if (argv.format === 'usfm' || isJson) {
if (argv.format === 'usfm' || isJson) {
const myJsonParser = new grammar.JSONParser(jsonInput);
try {
output = myJsonParser.toUSFM(inputFile);
72 changes: 61 additions & 11 deletions docs/Disclaimer.md
@@ -1,4 +1,64 @@
# Disclaimer for usfm-grammar Beta-Release 0.1.0
# Disclaimer for usfm-grammar 2.0.0

- Only USFM 3.x is supported by the normal mode parsing. Most of the older versions may still work with the --LEVEL.RELAXED flag, but we haven't tested whether all possible syntaxes from the old spec are supported.

- No support for peripherals

- Paragraph markers

In scripture texts encoded using USFM (and similarly also in USX), the paragraph level
markup forms the main structure of the document, while chapter and verse markers are an
overlapping structure. But the USFM grammar views book-chapter-verse as the primary
structure and considers the paragraph markers as additional overlapping elements and does
not consider them as enclosing scripture contents. A null value is provided in the
JSON output for paragraph markers, and their text content, if any, is considered part
of the enclosing element (e.g. \v or \ip).

- Numbered markers

For all numbered markers we expect numbers up to 3 (up to 5 for th, thr, tc and tcr, which
indicate cells/columns of a table). If a larger number is given, some features
might be missing from the result (e.g. contents being combined in verseText).
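
The limits described here can be expressed as a small check. This is a sketch under stated assumptions: `exceedsExpectedNumber` and `TABLE_CELL_MARKERS` are hypothetical names, not part of `usfm-grammar`, and the check treats any marker made of letters followed by digits as a numbered marker.

```javascript
// Hypothetical helper, NOT part of usfm-grammar.
// Table cell markers are expected up to 5; other numbered markers up to 3.
const TABLE_CELL_MARKERS = new Set(['th', 'thr', 'tc', 'tcr']);

function exceedsExpectedNumber(marker) {
  const match = /^([a-z]+)(\d+)$/.exec(marker);
  if (match === null) {
    return false; // not a numbered marker, e.g. "p"
  }
  const [, name, digits] = match;
  const limit = TABLE_CELL_MARKERS.has(name) ? 5 : 3;
  return Number(digits) > limit;
}

console.log(exceedsExpectedNumber('q4'));  // beyond the general limit of 3
console.log(exceedsExpectedNumber('tc4')); // within the table limit of 5
```

Markers flagged this way are the ones for which the JSON result may be missing features.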

- Footnotes and cross-refs

All kinds of footnotes (\f, \fe and \ef) are labelled 'footnote' in JSON, but the
closing marker indicates which is which. The same applies to cross-references
(\x, \xt, \xe). This is true for the toJSON() output from normal mode parsing. If
the --LEVEL.RELAXED flag is set, the key footnote or cross-ref will not be present in the
output; instead, each is given with the corresponding marker itself as key.

- Marker closing in LEVEL.RELAXED parsing

With several rules relaxed in this grammar, we do not validate whether a marker is closed
properly with its own closing marker. Whichever closing marker is encountered after it
is treated as a valid closing for any marker. This is evident in the
formation of the JSON object for footnotes and cross-refs.

- Attributes in LEVEL.RELAXED parsing

Any text following a pipe (|) symbol is treated as attributes in this grammar. So the
key-value structure (as done in normal mode) will not be parsed, and the JSON will not be
structured, if the --LEVEL.RELAXED flag is set. This is done especially to accommodate
older USFM attribute syntaxes.

- Combining multiple markers in JSON output

Other than creating a nested structure for chapters, their contents, verses and their
contents (which is done in both normal and RELAXED modes), we combine some markers
together as an array or a named object, preserving their order, in the normal
mode (without the --LEVEL.RELAXED flag) JSON output.
Section headers (\s, \ms) and their associated markers (\r, \mr, \ip) form an array.
Also, markers \mt, \mt1, \mt2 etc. and \io, \io1, \io2 etc. are combined into an array
when they occur consecutively in the USFM file. Footnotes, cross-refs, lists, tables etc.,
when formed with multiple markers, are combined together in an object structure named
correspondingly.

- Array values

Some markers will have an array (with its text content) as their value in JSON,
instead of plain text (a string value). This is by design, in order to accommodate nesting
of other markers within one marker's contents.

## Document Structure

@@ -56,16 +116,6 @@ The USFM document structure is validated by the grammar. These are the basic doc
> * namespaces: z*
> * milestones: qt-s, qt-e, ts-s, ts-e
## Some Design Limitations

* We have not considered USFM files with peripherals (<https://ubsicap.github.io/usfm/peripherals/index.html>)
* We are not validating/parsing the internal contents of markers or values provided for attributes. For example, verse numbers need not be continuous, column numbers in a table row need not be in accordance with other rows, or the format of reference need not be correct in an _\\ior_ marker to pass our validation. But the markers are being identified, their syntax verified and contents extracted.
* The markers are treated as either mandatory or optional. The valid number of occurances is not considered
eg: _\\usfm_ should ideally occur only once, if present, and similarly _\\sts_ can come multiple times. As per the current implemetation, the optional markers can occur any number of times.
* We have assumed certain structural constraints in USFM, which were not explicitly mentioned in the USFM spec. For example, the markers _\\ca_, _\\cl_, _\\cp_ and _\\cd_ occurs immediately below the _\\c_ marker, before the verse blocks start.
* Documentation says, _\\imt1, \\imt2, \\imt3_(similarly _imte, ili, ie, iq, mt_) are all parts of a major title. So we are combining them ignoring the numerical weightage factor/difference, in the output JSON.
* As per USFM spec, there is no limit for possible numbers(not limited to 1,2,and 3) in numbered markers...though the USX _valid style types_ lists them as specifically numbered(1 & 2 or 1,2 & 3). We are following _no limit_ rules.(except for _\\toc & \\toca_)
* We are checking for only the BCV structue in a document. Hence all markers like _\\p_, _\\q_, _\\nb_ etc that specifies an indentation, is considered to serve only the purpose of showing indentation and are treated like empty markers. We are not parsing the text contents according to these markers. The text is assumed to belong only to the _\\v_ marker of _\\ip_ marker above it.

## Rules made liberal, to accomodate real world sample files

39 changes: 0 additions & 39 deletions docs/Questions.md

This file was deleted.

29 changes: 29 additions & 0 deletions docs/changelog.md
@@ -1,5 +1,34 @@
# Change log for usfm-grammar

## Version 1.1.0-beta.1 to 2.0.0

### New Features

- [Relax mode parser](https://github.com/Bridgeconn/usfm-grammar/issues/52) which can accommodate a USFM file that might not be fully valid according to the specification but is [parse-able](https://github.com/Bridgeconn/usfm-grammar/issues/53#issuecomment-614170275)
- Enable [TSV/CSV export](https://github.com/Bridgeconn/usfm-grammar/issues/29) of the USFM scripture content
- [Reverse conversion](https://github.com/Bridgeconn/usfm-grammar/issues/25) from JSON to USFM
- [CLI](https://github.com/Bridgeconn/usfm-grammar/issues/62): Enable the use of usfm-grammar library from command line also.

### Major changes

- Updates to the JSON output
1. Components that differ in JSON when using normal and relaxed parsing
(footnote, lists, table, cross-ref, \mt#, \io#, section headings,
milestone, attributes)
2. Property names/keys introduced to the JSON structure, other than USFM marker names
- book, book.bookcode, book.details, book.meta, chapters, contents,
verseNumber, verseText, footnote, cross-ref, table, list, milestone, attributes,
closing
- API changes
The usfm-grammar implementation is now class-based.
The new names for methods from the previous version are as follows:
- `parserUSFM()` becomes `USFMParser.toJSON()`
- `validate()` becomes `USFMParser.validate()`


Added a new class (`JSONParser`), plus new methods and parameters for the new features. Refer to the [README](https://github.com/Bridgeconn/usfm-grammar#usage) for usage.


## Version 1.0.0 to 1.1.0-beta.1

The main new feature is that, there is a new reverse conversion, that can take a JSON object in the usfm-grammar format and generate a USFM file out of it.
2 changes: 1 addition & 1 deletion docs/comparison.md
@@ -1,4 +1,4 @@
# Comparison of usfm-grammar and usfm-js Libraries
# Comparison of usfm-grammar(version 1.0.0) and usfm-js Libraries

## The Basic USFM Components
