
gdocs2md debugging: ital. and bold links mangled, * off by 1

Louis Maddox edited this page Mar 10, 2015 · 40 revisions

Other people's improvements on the original code

(Long story short: new repo, gdocs2Rmd, forked off clearf's fork, the only one in the network graph that had fixed the off-by-one error in bold/italics processing.)

Round 1

This means anyone can install it and use it across all their docs, rather than opening the script editor each time. In the long term, getting a doc plugin accepted as 'global' (or both) rather than publishing it only as 'domain-restricted' takes longer, since the script is subject to Google's scrutiny, but it then becomes available to anyone through the script plugin search feature.

It sounds like jumping through a few hoops, but surely less so than for publishing a Chrome Extension.
* Defaults to null output folder 😑 Eventually found out... it's off by 1 👎
* [bquast's fork](https://github.com/bquast/gdocs2md/) notes the README says [script manager rather than editor](https://github.com/bquast/gdocs2md/commit/ec5bdd3978ce0a336ce767f1eb7fb74cbdc446df) 1 off 👎
* [clearf's fork](https://github.com/clearf/gdocs2md) has ['big changes to conversions'](https://github.com/clearf/gdocs2md/commit/a176ca9a534963a29e92257b1bc7d27f73314aa4), 'adding files to the google directory structure' - creating a `Markdown` folder if not existing in the source document's directory, images in a subfolder of the `markdown` folder (`document.getName()+"_images"`)
* [Bingo](https://github.com/clearf/gdocs2md/commit/a176ca9a534963a29e92257b1bc7d27f73314aa4#diff-22699f26512f5a42cffea5f777e37110R406): `reformatted_txt.setLinkUrl(off, lastOff-1, url)` appears to be fixing the off-by-1...
* ... but folder ID is hard-coded (for developer's Drive folder!) and after accidentally setting it up app can't access markdown folder (one was made though)

Round 2

New wiki page (fixing off-by-one with the clearf fork, then integrating functionality present in other forks)

My reading into the script

I'm trying to figure out the cause within the code here: italicised links are output as mangled markdown. For example:

gdocs2md markdown output:

###### Notes on Tamara Munzer’s 2000 dissertation, [Interactive visualization of large graphs and networks](https://graphics.stanford.edu/papers/munzner_thesis/index.html) (*[pd*f](https://graphics.stanford.edu/papers/munzner_thesis/all.onscreen.pdf))

"When I was considering graduate schools a few years later, a major part of my decision-making procedure was to read through the previous decade of **Siggraph proceeding*s. I ended up at Stanford because I found that the papers that most delighted me had Pat’s name on them. What I value most about this past five years was the opportunity to absorb not only his insights into the specifics of my work, but his fundamental approach to research that emphasizes rigor and first principles*"

which when rendered gives:

Notes on Tamara Munzer’s 2000 dissertation, Interactive visualization of large graphs and networks (*pd*f)

"When I was considering graduate schools a few years later, a major part of my decision-making procedure was to read through the previous decade of **Siggraph proceeding*s. I ended up at Stanford because I found that the papers that most delighted me had Pat’s name on them. What I value most about this past five years was the opportunity to absorb not only his insights into the specifics of my work, but his fundamental approach to research that emphasizes rigor and first principles*"

The italic step is this statement:

pOut=pOut.substring(0, off)+'*'+pOut.substring(off, lastOff)+'*'+pOut.substring(lastOff);

and by the time it runs on the pdf link, pOut already contains the link markup:

pOut = "Notes on Tamara Munzer’s 2000 dissertation, Interactive visualization of large graphs and networks ([pdf](https://graphics.stanford.edu/papers/munzner_thesis/all.onscreen.pdf))"

i.e. the mappings (the attrs array) are correct on the first iteration, but once the URL is processed the extra characters it inserts shift the text and lead to the off-by-one error. This is a classic text-algorithm problem: because italic/bold can occur within the same string, the output is only OK if you either

  • apply URL []() modification first, then update attrs array accordingly on relevant values
    • benefit: always just start character +1 (∵ [X), end character +1 (∵ [...X)
  • apply bold/italic modification first, then update attrs array accordingly
    • disadvantage: multiple parts of the if statement would need to be edited for different cases:
      • if italic, start character +1 (∵ *X), end character +1 (∵ *...X)
      • if bold, start character +2 (∵ **X), end character +2 (∵ **...X)
      • if bold italic, start character +3 (∵ ***X), end character +3 (∵ ***...X)

Common sense points to the former - make a single URL modification, and inform the styling sections of the change to the mapping of the original string.
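The first option can be sketched outside Apps Script with plain strings. This is only an illustration: `wrapLink` and `wrapEmphasis` are hypothetical helpers (not functions in gdocs2md) and the URL is a placeholder. It reuses the `substring` pattern from the script to show that stale offsets reproduce the mangled shape seen above, while offsets shifted by one (for the `[`) put the emphasis inside the link text.

```javascript
// Hypothetical helpers, not part of gdocs2md.

// Wrap text[off, lastOff) in a markdown link. Only the '[' lands in front of
// the span, so the span itself moves right by exactly one character.
function wrapLink(text, off, lastOff, url) {
  var out = text.substring(0, off) + '[' + text.substring(off, lastOff) +
            '](' + url + ')' + text.substring(lastOff);
  return { text: out, shift: 1 };
}

// Wrap text[off, lastOff) in an emphasis delimiter such as '*' or '**'.
function wrapEmphasis(text, off, lastOff, delim) {
  return text.substring(0, off) + delim + text.substring(off, lastOff) +
         delim + text.substring(lastOff);
}

var src = '(pdf)';                 // the linked, italic run is "pdf"
var off = 1, lastOff = 4;          // offsets into the original string
var url = 'https://example.com/all.onscreen.pdf'; // placeholder URL

var linked = wrapLink(src, off, lastOff, url);

// Stale offsets: the '*' pair straddles the '[' and cuts the link text.
var mangled = wrapEmphasis(linked.text, off, lastOff, '*');

// Offsets shifted by the link step: the '*' pair encloses just "pdf".
var fixed = wrapEmphasis(linked.text, off + linked.shift,
                         lastOff + linked.shift, '*');

console.log(mangled);
console.log(fixed);
```

Returning the shift from the link step keeps the bold/italic code unaware of how the link markup is built; it only adds whatever shift it is given.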

The equivalent in a previously deployed R script was the abbreviation substitution algorithm, which modified a string-length variable at each iteration.

Solution

Take the string pOut and check for a URL - when the URL is modified, increment the same attr entries that were used to place the [, ], ( and ). Already implemented in this fork.

Script workings

Currently runs as a core function processTextElement:

function processTextElement(inSrc, txt) {
  if (typeof(txt) === 'string') {
    return txt;
  }
  
  var pOut = txt.getText();
  if (! txt.getTextAttributeIndices) {
    return pOut;
  }
  
  var attrs=txt.getTextAttributeIndices();
  var lastOff=pOut.length;

  for (var i=attrs.length-1; i>=0; i--) {
    var off=attrs[i];
    var url=txt.getLinkUrl(off);
    var font=txt.getFontFamily(off);
    if (url) {  // start of link
      if (i>=1 && attrs[i-1]==off-1 && txt.getLinkUrl(attrs[i-1])===url) {
        // detect links that are in multiple pieces because of errors on formatting:
        i-=1;
        off=attrs[i];
        url=txt.getLinkUrl(off);
      }
      pOut=pOut.substring(0, off)+'['+pOut.substring(off, lastOff)+']('+url+')'+pOut.substring(lastOff);
    } else if (font) {
      if (!inSrc && font===font.COURIER_NEW) {
        while (i>=1 && txt.getFontFamily(attrs[i-1]) && txt.getFontFamily(attrs[i-1])===font.COURIER_NEW) {
          // detect fonts that are in multiple pieces because of errors on formatting:
          i-=1;
          off=attrs[i];
        }
        pOut=pOut.substring(0, off)+'`'+pOut.substring(off, lastOff)+'`'+pOut.substring(lastOff);
      }
    }
    if (txt.isBold(off)) {
      var d1 = d2 = "**";
      if (txt.isItalic(off)) {
        // edbacher: changed this to handle bold italic properly.
        d1 = "**_"; d2 = "_**";
      }
      pOut=pOut.substring(0, off)+d1+pOut.substring(off, lastOff)+d2+pOut.substring(lastOff);
    } else if (txt.isItalic(off)) {
      // pOut=pOut.substring(0, off)+'*'+pOut.substring(off, lastOff)+'*'+pOut.substring(lastOff);
      pOut=pOut.substring(0, off+1)+'*'+pOut.substring(off+1, lastOff+1)+'*'+pOut.substring(lastOff+1);
    }
    lastOff=off;
  }
  return pOut;
}
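To experiment with the loop outside the script editor, a stand-in for the Text element can help. The sketch below is an assumption-laden mock: `makeMockText` and `visitOrder` are hypothetical names, and the mock only imitates the methods the function above calls (`getText`, `getTextAttributeIndices`, `getLinkUrl`, `isBold`, `isItalic`). It shows the order in which the reverse loop visits the `[off, lastOff)` spans. Because the loop walks from the end, insertions made at `off` and `lastOff` never move the offsets still to be visited; only edits made within the same iteration can interfere with each other.

```javascript
// Hypothetical stand-in for a DocumentApp Text element, for local testing only.
// `runs` is a list of { text, url, bold, italic } segments.
function makeMockText(runs) {
  var full = '';
  var starts = [];
  runs.forEach(function (r) {
    starts.push(full.length);   // each run starts where the previous one ended
    full += r.text;
  });
  function runAt(offset) {
    // Find the last run whose start is <= offset.
    var idx = 0;
    for (var i = 0; i < starts.length; i++) {
      if (starts[i] <= offset) idx = i;
    }
    return runs[idx];
  }
  return {
    getText: function () { return full; },
    getTextAttributeIndices: function () { return starts.slice(); },
    getLinkUrl: function (off) { return runAt(off).url || null; },
    isBold: function (off) { return !!runAt(off).bold; },
    isItalic: function (off) { return !!runAt(off).italic; }
  };
}

// List the [off, lastOff) spans in the order the reverse loop visits them.
function visitOrder(txt) {
  var attrs = txt.getTextAttributeIndices();
  var lastOff = txt.getText().length;
  var spans = [];
  for (var i = attrs.length - 1; i >= 0; i--) {
    var off = attrs[i];
    spans.push(txt.getText().substring(off, lastOff));
    lastOff = off;
  }
  return spans;
}

var mock = makeMockText([
  { text: 'see (' },
  { text: 'pdf', url: 'https://example.com/x.pdf', italic: true },
  { text: ')' }
]);
var order = visitOrder(mock);
console.log(order);
```

The same mock can be passed as the `txt` argument of `processTextElement` when that function is pasted into a local file, which makes the mangled output reproducible without a Google Doc.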

This core function processTextElement runs within processParagraph, appending its pOut value to the running pOut (in the sub-function's parent scope):

// Process each child element (not just paragraphs).
function processParagraph(index, element, inSrc, imageCounter, listCounters) {
  var result = {};
  var pOut = "";
  var textElements = [];

  for (var i = 0; i < element.getNumChildren(); i++) {
    var t = element.getChild(i).getType();
    if (t === DocumentApp.ElementType.TEXT) {
      var txt = element.getChild(i);
      pOut += txt.getText();
      textElements.push(txt);
    }
    // (other element types omitted)
  }

  if (textElements.length==0) {
    // Isn't result empty now?
    return result;
  }

  var prefix = findPrefix(inSrc, element, listCounters);

  // Rebuild pOut from the processed text elements.
  pOut = "";
  for (var j=0; j<textElements.length; j++) {
    pOut += processTextElement(inSrc, textElements[j]);
  }

  // replace Unicode quotation marks
  pOut = pOut.replace('\u201d', '"').replace('\u201c', '"');

  result.text = prefix+pOut;

  return result;
}

(irrelevant sections removed)
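One side note on the quotation-mark step: in JavaScript, `String.prototype.replace` with a string pattern replaces only the first match, so a paragraph with more than one pair of curly double quotes keeps the later ones. A minimal sketch of a global replacement follows; `straightenQuotes` is a hypothetical helper name, not something in the script.

```javascript
// Hypothetical helper: replace every curly double quote, not just the first.
function straightenQuotes(s) {
  return s.replace(/[\u201c\u201d]/g, '"');
}

var sample = '\u201ca\u201d \u201cb\u201d';

// Pattern used in processParagraph: only the first of each kind is replaced.
var firstOnly = sample.replace('\u201d', '"').replace('\u201c', '"');

// Global regex: all occurrences are replaced.
var all = straightenQuotes(sample);

console.log(firstOnly);
console.log(all);
```

Curly single quotes (such as the apostrophes in the example paragraph above) are untouched by either version.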

This function, processParagraph, is called within ConvertToMarkdown, the top-level invoked function:

function ConvertToMarkdown() {
  var numChildren = DocumentApp.getActiveDocument().getActiveSection().getNumChildren();
  var text = "";
  var inSrc = false;
  var inClass = false;
  var globalImageCounter = 0;
  var globalListCounters = {};
  // edbacher: added a variable for indent in src <pre> block. Let style sheet do margin.
  var srcIndent = "";
  
  var attachments = [];
  
  // Walk through all the child elements of the doc.
  for (var i = 0; i < numChildren; i++) {
    var child = DocumentApp.getActiveDocument().getActiveSection().getChild(i);
    var result = processParagraph(i, child, inSrc, globalImageCounter, globalListCounters);
    globalImageCounter += (result && result.images) ? result.images.length : 0;
    if (result!==null) {
      if (result.sourcePretty==="start" && !inSrc) {
        inSrc=true;
        text+="<pre class=\"prettyprint\">\n";
      } else if (result.sourcePretty==="end" && inSrc) {
        inSrc=false;
        text+="</pre>\n\n";
      } else if (result.source==="start" && !inSrc) {
        inSrc=true;
        text+="<pre>\n";
      } else if (result.source==="end" && inSrc) {
        inSrc=false;
        text+="</pre>\n\n";
      } else if (result.inClass==="start" && !inClass) {
        inClass=true;
        text+="<div class=\""+result.className+"\">\n";
      } else if (result.inClass==="end" && inClass) {
        inClass=false;
        text+="</div>\n\n";
      } else if (inClass) {
        text+=result.text+"\n\n";
      } else if (inSrc) {
        text+=(srcIndent+escapeHTML(result.text)+"\n");
      } else if (result.text && result.text.length>0) {
        text+=result.text+"\n\n";
      }
      
      if (result.images && result.images.length>0) {
        for (var j=0; j<result.images.length; j++) {
          attachments.push( {
            "fileName": result.images[j].name,
            "mimeType": result.images[j].type,
            "content": result.images[j].bytes } );
        }
      }
    } else if (inSrc) { // support empty lines inside source code
      text+='\n';
    }
      
  }
}
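The branching above is a small state machine over the objects returned by processParagraph. As a rough sketch for local testing, it can be reduced to a pure function. `assemble` is a hypothetical name, only the `source` and `text` fields are handled, and the `escapeHTML` call, the `sourcePretty` and class branches, and the image attachments are deliberately left out.

```javascript
// Hypothetical reduction of the loop body to a pure function for local testing.
// `results` mimics what processParagraph returns; null stands for an empty element.
function assemble(results) {
  var text = '';
  var inSrc = false;
  results.forEach(function (result) {
    if (result === null) {
      if (inSrc) text += '\n';        // keep empty lines inside source blocks
      return;
    }
    if (result.source === 'start' && !inSrc) {
      inSrc = true;
      text += '<pre>\n';
    } else if (result.source === 'end' && inSrc) {
      inSrc = false;
      text += '</pre>\n\n';
    } else if (inSrc) {
      text += result.text + '\n';     // the real script also escapes HTML here
    } else if (result.text && result.text.length > 0) {
      text += result.text + '\n\n';
    }
  });
  return text;
}

var out = assemble([
  { text: 'Intro' },
  { source: 'start' },
  { text: 'x = 1' },
  null,
  { text: 'y = 2' },
  { source: 'end' },
  { text: 'Outro' }
]);
console.log(out);
```

Keeping the state in local variables and the input as plain objects means the paragraph-joining logic can be checked without touching DocumentApp at all.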