gdocs2md debugging: ital. and bold links mangled, * off by 1
(Long story short: new repo, gdocs2Rmd, forked off of clearf's fork, the only one in the network graph that had fixed the off-by-one error in bold/italics processing.)
- psychemedia's fork is a functioning version of the original script with a "Save to Drive" UI menu rather than automatically sending to email. (off by 1 👎)
  - The diff comments say it's 'hacky' and not recursive, possibly just an XML use(?)
- Jacksonicson's fork comments describe it being recursive and systematic. (failed 👎)
- This commit in trepidacious's fork describes "improved handling of bold and italic formatting" (creation of a new function, `processTextElement2`).
- josuahdemangeon's fork contains a commit which converts inline images to `<img>` tags with a `data:image/png;base64` URL (cf. a new file for each). (off by 1 👎)
- supreetpal's fork has a commit describing the option to configure the output folder, and a document describing how to publish a Google Docs plugin. (Defaults to null output folder 😑 Eventually found out... it's off by 1 👎)
  - Publishing means anyone can install it (accessible across all docs rather than using the script editor each time). Getting a doc plugin accepted as 'global'/both (in the long term), rather than solely publishing it as 'domain-restricted', takes longer (the script is subject to Google's scrutiny), but it is then available to anyone through the script plugin search feature.
  - It sounds like jumping through a few hoops, but surely fewer than for publishing a Chrome Extension.
New wiki page (fixing off-by-one with the clearf fork, then integrating functionality present in other forks)
I'm trying to figure out the cause within the code here, but italicised links are output as mangled markdown. For example, this is the gdocs2md markdown output:
###### Notes on Tamara Munzer’s 2000 dissertation, [Interactive visualization of large graphs and networks](https://graphics.stanford.edu/papers/munzner_thesis/index.html) (*[pd*f](https://graphics.stanford.edu/papers/munzner_thesis/all.onscreen.pdf))"When I was considering graduate schools a few years later, a major part of my decision-making procedure was to read through the previous decade of **Siggraph proceeding*s. I ended up at Stanford because I found that the papers that most delighted me had Pat’s name on them. What I value most about this past five years was the opportunity to absorb not only his insights into the specifics of my work, but his fundamental approach to research that emphasizes rigor and first principles*"
which when rendered gives:
Notes on Tamara Munzer’s 2000 dissertation, Interactive visualization of large graphs and networks (*pd*f)
"When I was considering graduate schools a few years later, a major part of my decision-making procedure was to read through the previous decade of **Siggraph proceeding*s. I ended up at Stanford because I found that the papers that most delighted me had Pat’s name on them. What I value most about this past five years was the opportunity to absorb not only his insights into the specifics of my work, but his fundamental approach to research that emphasizes rigor and first principles*"
pOut=pOut.substring(0, off)+'*'+pOut.substring(off, lastOff)+'*'+pOut.substring(lastOff);
pOut = "Notes on Tamara Munzer’s 2000 dissertation, Interactive visualization of large graphs and networks ([pdf](https://graphics.stanford.edu/papers/munzner_thesis/all.onscreen.pdf))"
i.e. the mappings (the `attrs` array) are correct on the first iteration; then, after the URL is processed, two additional characters lead to the off-by-one error. This is a classic text-processing problem: because italic/bold can occur within the string, the output is only correct if you either
- apply the URL `[]()` modification first, then update the `attrs` array accordingly on the relevant values
  - benefit: always just start character +1 (∵ `[X`), end character +1 (∵ `[...X`)
- apply the bold/italic modification first, then update the `attrs` array accordingly
  - disadvantage: multiple parts of the if statement would need to be edited for different cases:
    - if italic: start character +1 (∵ `*X`), end character +1 (∵ `*...X`)
    - if bold: start character +2 (∵ `**X`), end character +2 (∵ `**...X`)
    - if bold italic: start char +3 (∵ `***X`), end char +3 (∵ `***...X`)
Common sense points to the former - make a single URL modification, and inform the styling sections of the change to the mapping of the original string.
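The two orderings listed above can be compared on a single run. The following is a minimal sketch on plain strings, not the script itself: `wrap` and `styleFirst` are hypothetical helpers introduced only for illustration, and the offsets are hard-coded for the `pdf` run from the example output.

```javascript
// Hypothetical helper: wrap s[start, end) in an opening and a closing marker.
function wrap(s, start, end, open, close) {
  return s.substring(0, start) + open + s.substring(start, end) + close + s.substring(end);
}

var url = 'https://graphics.stanford.edu/papers/munzner_thesis/all.onscreen.pdf';
var text = '(pdf)';
var off = 1, lastOff = 4; // the run "pdf", which is both a link and italic

// Link first, then italics at the stale offsets: reproduces the mangled output.
var linked = wrap(text, off, lastOff, '[', '](' + url + ')');
var mangled = wrap(linked, off, lastOff, '*', '*');

// Ordering 1: link first, then shift both offsets by 1 for the inserted '['.
var linkFirst = wrap(linked, off + 1, lastOff + 1, '*', '*');

// Ordering 2: style first, then shift by the marker length (1, 2 or 3).
function styleFirst(marker) {
  var styled = wrap(text, off, lastOff, marker, marker);
  var n = marker.length;
  return wrap(styled, off + n, lastOff + n, '[', '](' + url + ')');
}

console.log(mangled);          // (*[pd*f](...))
console.log(linkFirst);        // ([*pdf*](...))
console.log(styleFirst('**')); // (**[pdf](...)**)
```

With the link applied first the shift is the same constant in every case, whereas the style-first ordering needs a different shift per marker, which is the disadvantage noted in the list.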
The equivalent in a previously deployed R script is an abbreviation substitution algorithm, which modifies a variable for the string length at each iteration.
Take the string `pOut` and check for a URL: when the URL is modified, increment the same `attrs` entries that were used to place the `[`, `]`, `(` and `)`. (Already implemented in this fork.)
This currently runs as a core function, `processTextElement`:
function processTextElement(inSrc, txt) {
  if (typeof(txt) === 'string') {
    return txt;
  }
  var pOut = txt.getText();
  if (! txt.getTextAttributeIndices) {
    return pOut;
  }
  var attrs=txt.getTextAttributeIndices();
  var lastOff=pOut.length;
  for (var i=attrs.length-1; i>=0; i--) {
    var off=attrs[i];
    var url=txt.getLinkUrl(off);
    var font=txt.getFontFamily(off);
    if (url) { // start of link
      if (i>=1 && attrs[i-1]==off-1 && txt.getLinkUrl(attrs[i-1])===url) {
        // detect links that are in multiple pieces because of errors on formatting:
        i-=1;
        off=attrs[i];
        url=txt.getLinkUrl(off);
      }
      pOut=pOut.substring(0, off)+'['+pOut.substring(off, lastOff)+']('+url+')'+pOut.substring(lastOff);
    } else if (font) {
      if (!inSrc && font===font.COURIER_NEW) {
        while (i>=1 && txt.getFontFamily(attrs[i-1]) && txt.getFontFamily(attrs[i-1])===font.COURIER_NEW) {
          // detect fonts that are in multiple pieces because of errors on formatting:
          i-=1;
          off=attrs[i];
        }
        pOut=pOut.substring(0, off)+'`'+pOut.substring(off, lastOff)+'`'+pOut.substring(lastOff);
      }
    }
    if (txt.isBold(off)) {
      var d1 = d2 = "**";
      if (txt.isItalic(off)) {
        // edbacher: changed this to handle bold italic properly.
        d1 = "**_"; d2 = "_**";
      }
      pOut=pOut.substring(0, off)+d1+pOut.substring(off, lastOff)+d2+pOut.substring(lastOff);
    } else if (txt.isItalic(off)) {
      // pOut=pOut.substring(0, off)+'*'+pOut.substring(off, lastOff)+'*'+pOut.substring(lastOff);
      pOut=pOut.substring(0, off+1)+'*'+pOut.substring(off+1, lastOff+1)+'*'+pOut.substring(lastOff+1);
    }
    lastOff=off;
  }
  return pOut;
}
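In the function above the `+1` shift is applied in the italic branch whether or not a link was wrapped in the same iteration, and it is not applied in the bold branch at all. One possible way to handle this, sketched below under stated assumptions, is to record the shift only when the link branch has run and to use it for every style. `processTextElementSketch` and `fakeText` are hypothetical names: `fakeText` is a stand-in for the Apps Script text element so that the sketch can run outside Google Docs, and the font handling and multi-piece link detection are left out for brevity.

```javascript
// Hypothetical stand-in for a Docs text element.
// runs: [{start, url, bold, italic}], sorted by start offset.
function fakeText(text, runs) {
  function runAt(offset) {
    var hit = runs[0];
    for (var k = 0; k < runs.length; k++) {
      if (runs[k].start <= offset) hit = runs[k];
    }
    return hit;
  }
  return {
    getText: function () { return text; },
    getTextAttributeIndices: function () { return runs.map(function (r) { return r.start; }); },
    getLinkUrl: function (o) { return runAt(o).url || null; },
    isBold: function (o) { return !!runAt(o).bold; },
    isItalic: function (o) { return !!runAt(o).italic; }
  };
}

// Sketch: link first, then style markers shifted by the inserted '['.
function processTextElementSketch(txt) {
  var pOut = txt.getText();
  var attrs = txt.getTextAttributeIndices();
  var lastOff = pOut.length;
  for (var i = attrs.length - 1; i >= 0; i--) {
    var off = attrs[i];
    var url = txt.getLinkUrl(off);
    var shift = 0;
    if (url) {
      pOut = pOut.substring(0, off) + '[' + pOut.substring(off, lastOff) +
             '](' + url + ')' + pOut.substring(lastOff);
      shift = 1; // one '[' now sits in front of the run
    }
    var open = '', close = '';
    if (txt.isBold(off)) {
      open = close = '**';
      if (txt.isItalic(off)) { open = '**_'; close = '_**'; }
    } else if (txt.isItalic(off)) {
      open = close = '*';
    }
    if (open) {
      var s = off + shift, e = lastOff + shift;
      pOut = pOut.substring(0, s) + open + pOut.substring(s, e) + close + pOut.substring(e);
    }
    lastOff = off;
  }
  return pOut;
}

var sampleUrl = 'https://graphics.stanford.edu/papers/munzner_thesis/all.onscreen.pdf';
var sample = fakeText('see (pdf) and notes', [
  { start: 0 },
  { start: 5, url: sampleUrl, italic: true }, // "pdf": italic link
  { start: 8 },
  { start: 14, italic: true }                 // "notes": italic, no link
]);
var sketchOut = processTextElementSketch(sample);
console.log(sketchOut);
```

Because the loop walks the runs from the end of the string backwards, insertions never disturb offsets that are still to be processed, so only the shift within the current iteration has to be tracked.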
This core function, `processTextElement`, runs within `processParagraph`, which appends each returned value to its own running `pOut` (in the calling function's scope):
// Process each child element (not just paragraphs).
function processParagraph(index, element, inSrc, imageCounter, listCounters) {
  // ... (setup of result, pOut and textElements removed)
  for (var i = 0; i < element.getNumChildren(); i++) {
    // ... (t presumably holds the type of child i; other element types removed)
    if (t === DocumentApp.ElementType.TEXT) {
      var txt=element.getChild(i);
      pOut += txt.getText();
      textElements.push(txt);
    }
  }
  if (textElements.length==0) {
    // Isn't result empty now?
    return result;
  }
  prefix = findPrefix(inSrc, element, listCounters);
  var pOut = "";
  for (var i=0; i<textElements.length; i++) {
    pOut += processTextElement(inSrc, textElements[i]);
  }
  // replace Unicode quotation marks
  pOut = pOut.replace('\u201d', '"').replace('\u201c', '"');
  result.text = prefix+pOut;
  return result;
}
(irrelevant sections removed)
This function, `processParagraph`, is called within `ConvertToMarkdown`, the top-level invoked function:
function ConvertToMarkdown() {
  var numChildren = DocumentApp.getActiveDocument().getActiveSection().getNumChildren();
  var text = "";
  var inSrc = false;
  var inClass = false;
  var globalImageCounter = 0;
  var globalListCounters = {};
  // edbacher: added a variable for indent in src <pre> block. Let style sheet do margin.
  var srcIndent = "";
  var attachments = [];
  // Walk through all the child elements of the doc.
  for (var i = 0; i < numChildren; i++) {
    var child = DocumentApp.getActiveDocument().getActiveSection().getChild(i);
    var result = processParagraph(i, child, inSrc, globalImageCounter, globalListCounters);
    globalImageCounter += (result && result.images) ? result.images.length : 0;
    if (result!==null) {
      if (result.sourcePretty==="start" && !inSrc) {
        inSrc=true;
        text+="<pre class=\"prettyprint\">\n";
      } else if (result.sourcePretty==="end" && inSrc) {
        inSrc=false;
        text+="</pre>\n\n";
      } else if (result.source==="start" && !inSrc) {
        inSrc=true;
        text+="<pre>\n";
      } else if (result.source==="end" && inSrc) {
        inSrc=false;
        text+="</pre>\n\n";
      } else if (result.inClass==="start" && !inClass) {
        inClass=true;
        text+="<div class=\""+result.className+"\">\n";
      } else if (result.inClass==="end" && inClass) {
        inClass=false;
        text+="</div>\n\n";
      } else if (inClass) {
        text+=result.text+"\n\n";
      } else if (inSrc) {
        text+=(srcIndent+escapeHTML(result.text)+"\n");
      } else if (result.text && result.text.length>0) {
        text+=result.text+"\n\n";
      }
      if (result.images && result.images.length>0) {
        for (var j=0; j<result.images.length; j++) {
          attachments.push( {
            "fileName": result.images[j].name,
            "mimeType": result.images[j].type,
            "content": result.images[j].bytes } );
        }
      }
    } else if (inSrc) { // support empty lines inside source code
      text+='\n';
    }
  }
}
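The if/else chain above is effectively a small state machine over the per-paragraph results. The sketch below isolates just the `source` start/end states on plain objects, so the flow can be followed without a Google Doc. `assemble` is a hypothetical name, and the `escapeHTML` shown here is an assumed stand-in, since the script's own helper is not reproduced on this page.

```javascript
// Assumed stand-in for the script's escapeHTML helper (not shown above).
function escapeHTML(s) {
  return s.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
}

// Hypothetical reduction of the loop in ConvertToMarkdown: only the
// plain <pre> source block state is modelled here.
function assemble(results) {
  var text = '';
  var inSrc = false;
  results.forEach(function (result) {
    if (result === null) {
      if (inSrc) text += '\n'; // keep empty lines inside source code
      return;
    }
    if (result.source === 'start' && !inSrc) {
      inSrc = true;
      text += '<pre>\n';
    } else if (result.source === 'end' && inSrc) {
      inSrc = false;
      text += '</pre>\n\n';
    } else if (inSrc) {
      text += escapeHTML(result.text) + '\n';
    } else if (result.text && result.text.length > 0) {
      text += result.text + '\n\n';
    }
  });
  return text;
}

var assembled = assemble([
  { text: 'Intro' },
  { source: 'start' },
  { text: 'if (a < b) {' },
  null,
  { text: '}' },
  { source: 'end' },
  { text: 'Outro' }
]);
console.log(assembled);
```

Paragraph text only reaches the markdown branch when no block state is active, which is why the string returned by `processTextElement` ends up in the output unescaped for ordinary paragraphs.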