-
Notifications
You must be signed in to change notification settings - Fork 43
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
My combined changes #268
base: main
Are you sure you want to change the base?
My combined changes #268
Conversation
return result.value; | ||
let output = result.value; | ||
if (options.tableOfContentsStyle === 'replace') { | ||
output = output.replace(/\\?\[TOC\\?\]/g, '[TOC]'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Honestly a bit of a hack because the conversion kept producing \[TOC]
instead of simply [TOC]
and I couldn't figure out why.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It’s doing that because [TOC]
is (potentially) a reference-style link in Markdown, not regular text. The \
escapes it to ensure it is treated as text by a Markdown renderer.
I would normally say this should probably be kept with the \
, but I’m guessing the Markdown renderer you are using doesn’t support that. Is that right? What renderer? (I know some use HTML comments, e.g. <!-- toc -->
, so this is not the only format that someone might want) The right behavior here probably depends on how common or broadly used it is. Since this is way-outside-the-standard behavior, I think the option to replace should not be the default. You should at least change that.
The best way to do this is probably to make it an HTML node in the AST, not text. That is, down in replaceTableOfContents()
, you should:
node.children.splice(tocStartIndex, tocLength, {
type: 'element',
tagName: 'p',
children: [{
- type: 'text',
+ type: 'html',
value: '[TOC]'
}]
});
…then you should be able to get rid of the rewriting here.
You can see another spot over in convert.js
where we do this for stuff that not really HTML but beyond the scope of the standardized Markdown AST:
google-docs-to-markdown/lib/convert.js
Lines 49 to 55 in ef2c386
idCode = ` {#${node.properties.id}}`; | |
} | |
newNode.children?.push({ | |
type: 'html', | |
value: idCode, | |
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried changing the type to html and it still escaped it. I could try again, but I'm pretty sure that's the first thing I tried.
I only assumed that escaping my toc macro would make it not render in most editors. I didn't test it. I thought about adding <!-- toc -->
as well which is why I made it a dropdown select so it could be added. I might be able to toss that in quickly at some point. I did have this in there once and <!-- toc -->
gets escaped (prefixed with backslash) as well.
I agree about the default being the original way. I was of course coding it for my needs at the time and never changed it back.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahhh, I see, I forgot we are still working with the HTML AST at this point, not the Markdown AST. html
isn’t actually even a valid node here; I guess it’s just being treated like text
if it’s an invalid type.
So the “right” way is actually in two parts:
-
Mark up the TOC with something identifiable in
fix-google-html.js
.For example, maybe
replaceTableOfContents()
wraps the TOC in a<nav is-table-of-contents="true">
element or something instead of actually replacing the TOC with a text node that says[TOC]
. -
Add a handler that can replace that markup with the right
html
node during conversion from HTML AST → Markdown AST inconvert.js
.anchorHandler()
is probably a good example of this second part.To go with the example in step 1, This would identify
<nav>
elements and either replace them with the{type: 'html', value: '[TOC]'}
node if theis-table-of-contents
attribute is present or defer to the default handler if not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This actually makes a lot of sense and would be way better. For the title detection I added a property on the node. I could do the same thing here if that's sounds appropriate to you. I understand the structure of this code better than when I first started and tbh, I was doing the bare minimum to achieve the results I wanted while keeping things in separate branches so I could provide back in case the changes ended up being decent. I don't mind spending a little more time on this to produce cleaner maintainable code to provide back, but it might end up taking me a while since I have my job, a kid, and personal project backlog things I want to get done.
I also don't mind if someone wants to edit my branch here directly to help speed this along or something like that. I think there was an option I selected that should allow others to submit code directly to this PR branch. Whatever is easiest for everyone.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the title detection I added a property on the node. I could do the same thing here if that's sounds appropriate to you.
Yep, that should work just fine for the TOC.
Just in case you haven’t tripped over this, if/when you start using the slice clip data to identify the title, you’ll need to handle it a bit differently — however you change or annotate it needs to be done in a way that can get serialized back to HTML (so it needs to be an attribute or a change to the DOM structure). That’s because the workflow is basically:
- Paste event.
- Grab the HTML and the slice clip data from the clipboard and run
updateHtmlWithSliceClip()
to get a nice HTML tree that is decorated with stuff info from the slice clip that is normally missing from the HTML (like the title). - Serialize the resulting AST to HTML and put it in the input area.
- Pick up the content of the input area and run it through
cleanGoogleHtml()
to get an HTML AST that is ready to turn into Markdown. - Convert the HTML AST to a Markdown AST.
- Serialize the resulting AST to Markdown and put it in the output area.
When stuff is coming from the slice clip (so it's happening in step 2), you need to make sure whatever you do makes it between steps 3 and 4, so it has to be something that can be serialized to HTML. For the title, that might be making it an <h1>
and downgrading everything else, or it might be some special attribute. (The reason for this 2 step conversion is so people can further edit the input area after pasting, or paste snippets from multiple documents into the input area together, without having to worry about how that changes the index of things in the slice clip data.)
I don't mind spending a little more time on this to produce cleaner maintainable code to provide back, but it might end up taking me a while
No worries! Just let me know when it’s ready for more review. 😄
@@ -37,9 +37,9 @@ | |||
"main": "./lib-ui/index.js", | |||
"type": "module", | |||
"scripts": { | |||
"build": "mkdir -p dist && webpack", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These fail for me on windows. My linux dev box is down for now. Are they really needed. In other projects, webpack created these for me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure! I originally wrote this way back in 2018, so I wouldn’t be surprised if Webpack has added this functionality internally since then. :)
Aha. I didn't add any new tests or run the existing ones. I will have to check that later.. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is pretty neat! 👍
I would definitely like to add the title/headers stuff, and the TOC thing seems reasonable to me, too.
I added a few notes here, but I probably won’t have much time for a deeper review of this or any additional changes you make until at least this weekend, maybe not until next week. Apologies for the delay.
In the mean time, I left some notes on stuff you called out. Some other quick thoughts:
I can submit in several PRs if preferred, but I would have to do some additional cleanup and a couple branches depend on each other…
It would be helpful if you could separate out the continuous code block and stripping-formatting-from-headers stuff into separate PRs.
Don't over-format headers: Header blocks do not need additional formatting like bold and italics. Let's strip the plain text for this conversion.
I am not sure I agree with this one. A header without special formatting in Google Docs should already be coming across a plain text here, and if somebody has specially added italics, bold, etc. in their doc, I think it makes sense to respect that. A lot of work has already been put into making this work well (see #113, #158, #212).
Downgrade headers… Detect Title
Having support for titles is great! This has been something I’ve wanted to get to for a while. One alternative approach to titles that I had been thinking of was to turn them into YAML frontmatter:
title: [title here]
---
[rest of markdown doc]
I don’t think you need to implement that here, but I’m thinking this suggests a different way to approach the options/settings. Instead of separate settings for "downgrade headers" and "detect title", just have one setting for how to treat titles if they are detected:
- Ignore (don’t detect)
- Make H1 (and downgrade other headers) (the default value)
- YAML Frontmatter (no need to actually have this option or implement now; just reserving for the future)
What do you think about making it work that way?
/** | ||
* Convert first large text block to H1 if it looks like a title | ||
*/ | ||
function convertTitleToHeading(tree, options) { | ||
let foundTitle = false; | ||
|
||
visit(tree, (node, index, parent) => { | ||
if (foundTitle || node.tagName !== 'p') return; | ||
|
||
const firstChild = node.children?.[0]; | ||
if (!firstChild || firstChild.tagName !== 'span') return; | ||
|
||
const style = resolveNodeStyle(firstChild); | ||
const isTitleCandidate = ( | ||
parseInt(style['font-size']) >= 24 || // 24pt or larger | ||
style['font-weight'] === 'bold' || | ||
style['font-weight'] >= '300' | ||
); | ||
|
||
if (isTitleCandidate) { | ||
// Replace paragraph with H1 and mark as detected title | ||
parent.children.splice(index, 1, { | ||
type: 'element', | ||
tagName: 'h1', | ||
properties: { | ||
__detectedTitle: true | ||
}, | ||
children: node.children | ||
}); | ||
|
||
foundTitle = true; | ||
return EXIT; | ||
} | ||
}); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might be better handled using the slice clip data — i.e. it could be called from updateHtmlWithSliceClip()
and use the slice clip data to determine the title (which should tell you unambiguously, without having to do some fuzzy guessing).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll look at it. I really don't like doing fuzzy guessing, but when I inspected the html, I didn't see any indicators that this was originally a title 😞. The "fuzziness" of the logic is one reason I made it a selectable option. It does at least only consider this for the first text block, so that helps. I also checked an nothing else included a font-weight numerical value besides headers and titles, so I thought about checking "if font-weight is a number but node isn't a header", but I still don't trust that universally, and who knows, maybe someone will like the option to assume their first header is the title 🤷
I'm open to changing this logic to whatever works if you have preferences/ I have time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when I inspected the html, I didn't see any indicators that this was originally a title
Yeah, there is nothing in the HTML to really indicate it reliably, which is why I had punted on this for a while. The slice clip data (which is just Google Docs’s internal representation for copy/pasting, and which I added support for later) does have this: a title paragraph will have a ps_hd
property of 100
.
You can see in the getHeadings()
function in slice-clip.js
where we skip over titles:
google-docs-to-markdown/lib/slice-clip.js
Lines 149 to 159 in ef2c386
// Titles have `ps_hd = 100`, so confine ouselves to normal heading levels. | |
if (style.ps_hdid && style.ps_hd < 7) { | |
const text = sliceClipText(sliceClip).slice(current.start, i); | |
headings.push({ | |
...current, | |
end: i, | |
text, | |
level: style.ps_hd, | |
id: style.ps_hdid, | |
}); | |
} |
…and in fixInternalLinks()
where we use that to make sure all the headings in the pasted HTML have appropriate id
attributes so they can be linked to. You could use the same technique to identify the paragraph that is the title and mark it up appropriately. Then no guessing is needed.
For this particular case, you might need to use visitRangesInTree()
or replaceRangesInTree()
to match up the data about headings with actual HTML nodes. fixCodeSnippedObjects()
or markSuggestions()
might be good examples of how to use those.
return result.value; | ||
let output = result.value; | ||
if (options.tableOfContentsStyle === 'replace') { | ||
output = output.replace(/\\?\[TOC\\?\]/g, '[TOC]'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It’s doing that because [TOC]
is (potentially) a reference-style link in Markdown, not regular text. The \
escapes it to ensure it is treated as text by a Markdown renderer.
I would normally say this should probably be kept with the \
, but I’m guessing the Markdown renderer you are using doesn’t support that. Is that right? What renderer? (I know some use HTML comments, e.g. <!-- toc -->
, so this is not the only format that someone might want) The right behavior here probably depends on how common or broadly used it is. Since this is way-outside-the-standard behavior, I think the option to replace should not be the default. You should at least change that.
The best way to do this is probably to make it an HTML node in the AST, not text. That is, down in replaceTableOfContents()
, you should:
node.children.splice(tocStartIndex, tocLength, {
type: 'element',
tagName: 'p',
children: [{
- type: 'text',
+ type: 'html',
value: '[TOC]'
}]
});
…then you should be able to get rid of the rewriting here.
You can see another spot over in convert.js
where we do this for stuff that not really HTML but beyond the scope of the standardized Markdown AST:
google-docs-to-markdown/lib/convert.js
Lines 49 to 55 in ef2c386
idCode = ` {#${node.properties.id}}`; | |
} | |
newNode.children?.push({ | |
type: 'html', | |
value: idCode, | |
}); |
Also, yes! Please add tests. |
Unfortunately that wasn't my experience. It seemed to be picking up formatting from the header styling that wasn't explicitly applied. Even if I did manually bold or italicize my headers, I wouldn't want it to carry over into a markdown document. That's what stylesheets in renderers are for. I could add an option for it or simply not include it as a PR if this isn't a direction you want to take for the core project.
I thought about that and almost made the selection a dropdown, but I don't use that and I was rushing at that point 😅
You read my mind! I almost included detect titles in the downgrade headers options, but didn't because there is no title tag in the copied gdocs html and wanted to make it a separate option in case someone experienced issues with the fuzzy logic. I have one more change I plan on making. I had an issue where my code block formatting ran into an image pasted into my doc which made it fail to include the image in any way. Strangely, this was before I even added the logic for "continuous code blocks". I fixed this by going into that doc and changing the paragraph style for that image and played around with adding and deleting new lines, but I think I'd like to go back and add a change to break code blocks on images as well so I don't have to scrutinize my conversions too closely. I can try to add that to one of the PRs I make. If you have a recommendation on something that can download these image links as local resources, please let me know. I think Obtainium can do it, but I haven't looked yet. If not, I was thinking of writing a small py script for it.
I will try to split these PRs soon. Maybe this weekend. I want to finish this project soon since I already have everything done besides monotonously copying and pasting each of the documents. PS: |
Can you supply some example docs? What browser? This is definitely a bug.
I guess I have seen a lot of examples where this is not the case. I can see this if the entire heading has consistent formatting (whatever that formatting is), but I would be surprised if you mean you want this when, say, a couple words in the heading are italicized. Stylesheets and renderers cannot solve that. I’m open to adding an option here, but I suspect you are focused on a case where the whole heading is styled one way, and I am focusing on cases where it is not. I think it might be fine if any styling that applies to the whole heading is stripped, and that probably doesn’t need a special option. I haven’t had time for a close look, but I think that’s different from what you implemented here. That’s probably another good reason to split out a separate PR for it; it seems like there’s more to hash out about how it should work. 🙂
Can you link to an example doc that is public?
Just in case you haven’t seen it: there is a system for downloading all the relevant parts and versions of a Google doc as a fixture in Here’s an example of how you can utilize those fixtures: google-docs-to-markdown/test/unit/convert.test.js Lines 63 to 73 in ef2c386
That’ll create a test that loads |
Oh wow I had not thought about emphasizing individual words in a header. I of course have not put as much time or thought into this beyond my own use case. Perhaps the clean headers is better as an option or not merged. When I put it in a PR, I'll make it a selectable option or perhaps change it to clean only whole header formatting unless you don't see value in adding it to this project. I made a sample document containing a combination of some of the issues I was running into. It's interesting. Until I added the config blocks with special characters, it actually produced a continuous code block without my additional code fixing this for my docs. The image issue is happening either way. I imagine it should be easy to write something to detect images in a paragraph and fix them to not be treated as a code block. |
Note: I reverted the continuous code block and stripping-formatting-from-headers commits from this PR for now to put into their own PRs later. |
That works! I think an option would be fine, and I also think no option but it only strips formatting that applies to the whole header would work pretty well, too (maybe better!).
👍
I think what’s happening here is that we wind up with a tree that represents something like: <p>
<!-- Originally was: <span style="font-family: monospace;">, but we convert to: -->
<code>
<img src="...">
</code>
</p> And then the default conversion of So yeah, the best way to solve this is to pull that image out of the code block. Probably when we convert monospace formatting to <p>
<span style="font-family: monospace;">
abc<img src="...">def
</span>
</p> To this: <p>
<!-- Break the code element around the img -->
<code>abc</code>
<img src="...">
<code>def</code>
</p> …then make sure we remove the Another approach might be to customize the handler for
No worries! Just let me know when it’s ready for more review. 😄 |
I have a number of new options and minor tweaks and fixes that I can submit in several PRs if preferred, but I would have to do some additional cleanup and a couple branches depend on each other and would have to go in order. I thought I would start by submitting this together to avoid procrastination on my part.
Fixes/ Enhancements:
Continuous code blocks: Combine fenced code blocks that aren't separated by a new line without formatting changes in the html
Don't over-format headers: Header blocks do not need additional formatting like bold and italics. Let's strip the plain text for this conversion.
New Options:
Downgrade headers: It's usually bad form to have more than one H1 heading. Add an option to downgrade headers by 1. If we reach H6, convert to a paragraph with bolded text.
Detect Title: Titles aren't a header in google docs. Let's add an option to try to detect titles. Pairs well with downgrade headers.
Table of Contents: Let's add an option to replace the table of contents with the
[TOC]
macro supported by many extended markdown renderers or to leave as a list of converted links.Example Options Fields with Sample Input and Output: