Merge pull request #8 from Bridgeconn/dev
Before Beta Release
joelthe1 authored Jan 4, 2019
2 parents c086da3 + d8c809e commit 0660d64
Showing 11 changed files with 642 additions and 389 deletions.
69 changes: 69 additions & 0 deletions Disclaimer.md
@@ -0,0 +1,69 @@
# Disclaimer for usfm-grammar Beta-Release 0.1.0

## Document Structure

We have referred to the USFM 3.0 specifications along with the USX documentation to arrive at a structure definition for the language.
The USFM document structure is validated by the grammar. These are the basic document-level criteria we check for:

* The document starts with an id marker
* The id marker and the usfm marker which follows it, if present, constitute the *identification* section
* The next section is *Book headers*. The following tags may come within the section:
> * ide
> * sts
> * h
> * toc
> * toca
> * mt
> * mte
> * esb
* This is to be followed by an Introduction section which can contain
> * ib
> * ie
> * iex
> * ili
> * im
> * imi
> * imq
> * imt
> * imte
> * io
> * iot
> * ipi
> * ipq
> * ipr
> * ip
> * iq
> * is
> * rem
> * esb
* Following the above 3 metadata sections, there will be multiple chapters marked by c
* Within a chapter, at its start, we may have a set of metacontents
> * cl (may also come immediately above the first chapter (c))
> * ca
> * cp
> * cd
* After the chapter metacontents comes the actual scripture plus some additional meta-scripture contents (like sections, footnotes). The following list gives the possibilities in the chapter's content
> * v, va, vp
> * s, ms, mr, sr, r, d, sd
> * po, m, pr, cls, pmo, pm, pmc, pmr, pmi, nb, pc, b, pb, qr, qc, qd, lh, lf, p, pi, ph, q, qm, lim (treated as empty markers, and content treated along with v)
> * footnotes
> * cross references
> * fig
> * table, tr, th, thr, tc, tcr
> * li
> * lit
> * character markers: add, bk, dc, k, nd, ord, pn, png, addpn, qt, sig, sls, tl, wj, em, bd, it, bdit, no, sc, sup, ndx, wg, wh, wa, qs, qac, litl, lik, rq, ior, cat, rb, w, jmp, liv
> * namespaces: z*
> * milestones: qt-s, qt-e, ts-s, ts-e
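
The section ordering above can be sketched as a small check over the sequence of markers found in a file. This is only an illustration and not the project's ohm-js grammar: the function name `checkSectionOrder`, the helper `baseMarker`, and the two marker sets are assumptions drawn from the lists above, and numbered variants such as `toc1` are reduced to their base marker.

```javascript
// Hypothetical sketch: check that markers follow the section order described above.
// The sets are copied from the lists in this document; this is not the actual grammar.
const BOOK_HEADERS = new Set(['ide', 'sts', 'h', 'toc', 'toca', 'mt', 'mte', 'esb']);
const INTRODUCTION = new Set(['ib', 'ie', 'iex', 'ili', 'im', 'imi', 'imq', 'imt', 'imte',
  'io', 'iot', 'ipi', 'ipq', 'ipr', 'ip', 'iq', 'is', 'rem', 'esb']);

// Strip a trailing number so that e.g. "toc1" and "mt2" map to "toc" and "mt".
function baseMarker(marker) {
  return marker.replace(/\d+$/, '');
}

function checkSectionOrder(markers) {
  if (markers.length === 0 || markers[0] !== 'id') {
    return { valid: false, reason: 'document must start with \\id' };
  }
  let i = 1;
  // Identification section: \id, optionally followed by \usfm.
  if (markers[i] === 'usfm') i += 1;
  // Book headers section.
  while (i < markers.length && BOOK_HEADERS.has(baseMarker(markers[i]))) i += 1;
  // Introduction section.
  while (i < markers.length && INTRODUCTION.has(baseMarker(markers[i]))) i += 1;
  // \cl may also come immediately above the first chapter.
  if (markers[i] === 'cl') i += 1;
  if (markers[i] !== 'c') {
    const found = i < markers.length ? `\\${markers[i]}` : 'end of input';
    return { valid: false, reason: `expected \\c but found ${found}` };
  }
  return { valid: true, reason: null };
}
```

The check stops at the first `\c`, since the chapter content rules are considerably richer than a flat ordering test can express.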
## Some Design Limitations

* We have not considered USFM files with peripherals (<https://ubsicap.github.io/usfm/peripherals/index.html>)
* We are not validating/parsing the internal contents of footnotes, cross-references and milestones. The markers are identified and their contents extracted, but without checking their correctness
* The markers are treated as either mandatory or optional. The valid number of occurrences is not considered
eg: _\\usfm_ should ideally occur only once, if present, whereas _\\sts_ can come multiple times. As per the current implementation, the optional markers can occur any number of times.
* We have assumed certain structural constraints in USFM, which were not explicitly mentioned in the USFM spec. For example, the markers _\\ca_, _\\cl_, _\\cp_ and _\\cd_ occur immediately below the _\\c_ marker, before the verse blocks start.
* The documentation says _\\imt1, \\imt2, \\imt3_ (similarly _imte, ili, ie, iq, mt_) are all parts of a major title. So we are combining them, ignoring the numerical weightage factor/difference.
* As per the USFM spec, there is no limit on the possible numbers (not limited to 1, 2 and 3) in numbered markers, though the USX _valid style types_ lists them as specifically numbered (1 & 2 or 1, 2 & 3). We are following the _no limit_ rule (except for _\\toc & \\toca_).
* The valid attribute names for word-level markers are not checked. Any attribute name with valid syntax would be accepted
* The paragraph markers (showing indentation) that appear within verses should ideally be attached to the text that follows them. But we are attaching them to the verse marker immediately above.
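
The _no limit_ rule for numbered markers, with its exception for _\\toc & \\toca_, might look roughly like the sketch below. The helper `parseNumberedMarker` is hypothetical and not part of the library. Two assumptions are made: an unnumbered marker is treated as level 1, and the upper bound for _\\toc_ and _\\toca_ is 3.

```javascript
// Assumed upper bounds for the only numbered markers that are limited.
const LIMITED = new Map([['toc', 3], ['toca', 3]]);

// Hypothetical helper: split a marker such as "imt2" into base name and number.
// Returns null when the marker is not a plain (optionally numbered) marker
// or when its number is outside the allowed range.
function parseNumberedMarker(marker) {
  const match = /^([a-z]+)(\d+)?$/.exec(marker);
  if (!match) return null;
  const base = match[1];
  // Assumption: an unnumbered marker is equivalent to level 1.
  const number = match[2] === undefined ? 1 : Number(match[2]);
  const limit = LIMITED.get(base);
  if (number < 1 || (limit !== undefined && number > limit)) return null;
  return { base, number };
}
```

Combining the parts of a major title then amounts to grouping on `base` and discarding `number`.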
43 changes: 11 additions & 32 deletions Questions.md
@@ -1,37 +1,16 @@
## Questions/ Dev Notes
## --------------------

* Is an empty line within a USFM file valid? Ideally not an error, but worth raising a warning (possibility of warnings, to be checked). The doc says, _"All paragraph markers should be preceded by *a single* newline."_ That makes an empty line an error, though we are not treating it so.
* Inline markers like _\\x_ , _\\f_ etc can start without a space separating them from text content
* Inline markers (character markers) may also occur on new lines
* A problem not handled: The markers are treated as either mandatory or optional. The valid number of occurrences is not considered
eg: _\\usfm_ should ideally occur only once, if present, whereas _\\sts_ can come multiple times. Now the optional markers can occur any number of times
* Why are there common markers (that can occur in any of the three) for these sections, and why are they divided into 3 in USX, as _bookTitles_, _bookIntroductionTitles_, and _bookIntroductionEndTitles_?
* The same marker (eg: _\\imt_) being in _bookTitles_, _bookIntroductionTitles_, and _bookIntroductionEndTitles_ requires 3 different rules, as the allowed child elements (in-line markers) for these sections are different. We have only one rule defining it, with the larger child elements set (_bookIntroductionTilesTextContent_).
* added _\\mt_ along with bookHeaders. (it actually includes all markers under the identification section in USFM doc, except _\\id_ and _\\usfm_)
* added _toca#_ elements also to _book headers_, though they were not listed in the USX document structure's valid style types for the section
* The peripheral in USX seems separate from the scripture part. Hence avoiding it in Grammar, for now.
* There are two overlapping structures for Bible content in USFM: 1) the paragraph structures used to express the discourse / narrative of the text and 2) the division of the text into books, chapters and verses. We are following only the second structure in the parsed JSON output: chapter as parent, and verses as children. Hence ignoring the paragraph-wise structuring and treating para markers as only meant for indentation change.
* In the chapter element, it lists _\\imt1_ as a valid style type as the first element. All other _imt_ markers (_\\imt, \\imt2, \\imt3_) are missing. The list of valid style types says explicitly that it is an alphabetical list... So assuming that _\\imt1_ got there by mistake and hence leaving it out of the Grammar
* Assuming that the markers _\\ca_, _\\cl_, _\\cp_ and _\\cd_ occur immediately below the _\\c_ marker, before the verse blocks start
* The documentation says _\\imt1, \\imt2, \\imt3_ (similarly _imte, ili, ie, iq, mt_) are all parts of a major title. So we are combining them, ignoring the numerical weightage factor/difference. _\\ms#, \\is#_ have not been combined so.
* Do not understand the doc explaining the change for _\\h_ that comes with USFM 3.0
* As per the doc, there is no limit on the possible numbers (not limited to 1, 2 and 3) in numbered markers, though the USX _valid style types_ lists them as specifically numbered (1 & 2 or 1, 2 & 3). We are following the _no limit_ rule (except for _\\toc & \\toca_).
## Questions
## ---------

* In USX we see 3 sections in the introduction part, _bookTitles_, _bookIntroductionTitles_, and _bookIntroductionEndTitles_. What is their relevance when USFM is considered?


* As per USFM doc examples, _\\iex_ and _\\imte_ occur within/at the end of chapter content... But they are included in _bookIntroductionTitles_ in the Grammar (as per the list of valid style types in the USX doc).
* The _\\iot, \\io# & \\ior_ elements could be clubbed into an outline division and their relative ordering ensured... But this is not done (now all of those can come anywhere in the _bookIntroductionTitles_)
* _\\ms#_ defines a major section outside of the section (_\\s#_) division. But we have not captured its structural relevance. Instead, we treat it as an independent element and attach it to the section header of the section immediately following it

* Use of the markers _\\wg, \\wh, \\wa_ is not clear from the documentation. Assuming that they enclose words of the verse's content and do not add additional content to the verse text.
* Removed _verseElement_ from chapterContentTextContent (though the USX doc defines it so), in order to avoid unnecessary nesting of verse elements
* Internal structure of crossref markers and footnote markers is not validated/parsed, as of now. Everything from the open marker to the close marker is considered a single unit, and we verify that whatever marker occurred in there is a permitted one (its content or syntax is not checked/parsed)
* The valid attribute names for word-level markers are not checked. Any attribute name with valid syntax would be accepted
* The _optbreak_ in the USX doc seems not to have been implemented as such in USFM. So not including that in the Grammar (do _\\pb, \\b, ~ and \\\\ etc_ serve its purpose in USFM?)
* There seems to be no marker in USFM in place of the USX _<ref>_ element. So removing that also from the Grammar. The reference text would be treated as normal text content itself. (Looks like the character marker, _\\rq..\\rq*_, is a substitute.)

* ms, mentioned as a valid child element in the USX spec, refers to milestones (_\\qt and \\ts_) in USFM rather than the _\\ms#_ element
* In USX we see _opt_break_. How is it implemented, if at all, in USFM? (Do _\\pb, \\b, ~ and \\\\ etc_ serve its purpose in USFM?)


* Where does _\\mte_ occur in USFM files? The doc mentions _at the end of the introduction_ ... the USX doc indicates it's within chapter content... so going with that.
* Where does _\\mte_ occur in USFM files? The USFM spec mentions _at the end of the introduction_ ... the USX doc indicates it's within chapter content... (we have assumed it's valid within the chapter content).

* The USX doc says within a chapter we can have an _\\ip_ element. Hence added that to _metaScripture_
* Took away the rule that says there should be a paraElement at the start of a chapter
* Took away section headings from the main JSON structure. Including them only as metaScripture content. JSON follows Book-Chapter-verse structure now.
* The paragraph markers (showing indentation) that appear within verses should ideally be attached to the text that follows them. But we are attaching them to the verse marker immediately above.
* Is there a rule that there should be a _\\p_, or a similar marker that shows indentation, at the start of the chapter (or the start of the first chapter)?
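
To make the Book-Chapter-verse decision concrete, here is a minimal sketch of how paragraph markers could end up attached to the verse above them. The token shapes, the output shape, and the function `groupIntoVerses` are all hypothetical and do not reflect the library's real JSON. Keeping a paragraph marker that precedes the first verse as chapter-level metadata is also an assumption.

```javascript
// Hypothetical sketch of the chapter -> verses structure described above.
// Token and output shapes are illustrative only.
function groupIntoVerses(chapterNumber, tokens) {
  const chapter = { number: chapterNumber, metadata: [], verses: [] };
  let current = null; // the verse most recently opened by a \v marker
  for (const token of tokens) {
    if (token.type === 'verse') {
      current = { number: token.number, text: '', markers: [] };
      chapter.verses.push(current);
    } else if (token.type === 'para') {
      // A paragraph marker is attached to the verse immediately above it,
      // not to the text that follows. Before the first verse it is kept
      // as chapter-level metadata (an assumption of this sketch).
      (current ? current.markers : chapter.metadata).push(token.marker);
    } else if (token.type === 'text' && current) {
      current.text = current.text ? `${current.text} ${token.value}` : token.value;
    }
  }
  return chapter;
}
```

With this shape, the scripture text of a verse can be read from `text` alone, while the indentation hints stay available in `markers`.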
50 changes: 34 additions & 16 deletions README.md
@@ -1,23 +1,41 @@
# USFM Parser
# USFM Grammar

A library that validates USFM files.
Uses [ohm-js](https://github.com/harc/ohm) for grammar implementation and validation.
This is a simple USFM parser/validator that uses a grammar to model the USFM syntax. The grammar is written in ohm-js (<https://ohmlang.github.io/>). The USFM 3.0 syntax is supported. The parser outputs the USFM content in a JSON structure that prioritizes easy extraction of scripture content from the mark-up and additional USFM contents.
Implemented in Node.js

# Current implementation
1. Parse
2. Validate
(Only validates the internal structure of a set of markers and extracts their components as JSON.)
## To Setup

# Dependencies
Node server
The project is available as an npm library, which can be installed with the following command.

Node modules
`http, fs, formidable, ohm-js, path`
`npm install usfm-grammar`

# Install and Run
From the project directory, start the server, as
`node server.js`
## Usage

To use it from your node application:

```
var grammar = require('usfm-grammar')
// Full JSON output, then the scripture-only variant
var jsonOutput = grammar.parse(/**The USFM Text to be converted to JSON**/)
var jsonCleanOutput = grammar.parse(/**The USFM Text to be converted to JSON**/,grammar.SCRIPTURE)
// true/false depending on whether the USFM syntax is valid
var usfmValidity = grammar.validate(/**USFM Text to be checked**/)
```

The `grammar.parse()` method returns a JSON structure for the USFM text contents, if it is a valid USFM file.
The `grammar.parse()` method can take an optional second argument, `grammar.SCRIPTURE`. If this is used, the returned JSON will contain only the most relevant scripture content, excluding all additional USFM contents.
The `grammar.validate()` method returns true or false, depending on whether the input USFM text's syntax is valid or not.
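
One way to combine these calls is to validate before parsing. The sketch below relies only on the `parse`, `validate` and `SCRIPTURE` members described above. The helper `usfmToJson` is hypothetical and not part of the library, and the module is passed in as an argument so that the example makes no assumption about how the package is loaded.

```javascript
// Hypothetical helper built on the API described above.
// `grammar` is expected to be the object returned by require('usfm-grammar').
function usfmToJson(grammar, usfmText, scriptureOnly = false) {
  if (!grammar.validate(usfmText)) {
    return null; // invalid USFM: nothing to parse
  }
  return scriptureOnly
    ? grammar.parse(usfmText, grammar.SCRIPTURE)
    : grammar.parse(usfmText);
}
```

Under these assumptions, `usfmToJson(require('usfm-grammar'), text, true)` would give the scripture-only JSON, or `null` when the text is not valid USFM.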

## To Use as a Local Node server

The project can also be installed locally for testing. A server setup is provided for this.

### Install and Run: Local Node Server

Clone the git repo
`git clone https://github.com/Bridgeconn/usfm-grammar.git`

From the project directory, start the server, as
`node server.js` or `npm start`

and from browser, access
<http://localhost:8080/index.html>

from browser, access
http://localhost:8080/index.html