Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

My combined changes #268

Open
wants to merge 13 commits into
base: main
Choose a base branch
from

Conversation

MaderHatt3r
Copy link

I have a number of new options and minor tweaks and fixes that I can submit in several PRs if preferred, but I would have to do some additional cleanup and a couple branches depend on each other and would have to go in order. I thought I would start by submitting this together to avoid procrastination on my part.

Fixes/ Enhancements:

Continuous code blocks: Combine fenced code blocks that aren't separated by a new line without formatting changes in the html

Don't over-format headers: Header blocks do not need additional formatting like bold and italics. Let's strip the plain text for this conversion.

New Options:

Downgrade headers: It's usually bad form to have more than one H1 heading. Add an option to downgrade headers by 1. If we reach H6, convert to a paragraph with bolded text.

Detect Title: Titles aren't a header in google docs. Let's add an option to try to detect titles. Pairs well with downgrade headers.

Table of Contents: Let's add an option to replace the table of contents with the [TOC] macro supported by many extended markdown renderers or to leave as a list of converted links.

Example Options Fields with Sample Input and Output:

image

return result.value;
let output = result.value;
if (options.tableOfContentsStyle === 'replace') {
output = output.replace(/\\?\[TOC\\?\]/g, '[TOC]');
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly a bit of a hack because the conversion kept producing \[TOC] instead of simply [TOC] and I couldn't figure out why.

Copy link
Owner

@Mr0grog Mr0grog Feb 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It’s doing that because [TOC] is (potentially) a reference-style link in Markdown, not regular text. The \ escapes it to ensure it is treated as text by a Markdown renderer.

I would normally say this should probably be kept with the \, but I’m guessing the Markdown renderer you are using doesn’t support that. Is that right? What renderer? (I know some use HTML comments, e.g. <!-- toc -->, so this is not the only format that someone might want) The right behavior here probably depends on how common or broadly used it is. Since this is way-outside-the-standard behavior, I think the option to replace should not be the default. You should at least change that.

The best way to do this is probably to make it an HTML node in the AST, not text. That is, down in replaceTableOfContents(), you should:

   node.children.splice(tocStartIndex, tocLength, {
      type: 'element',
      tagName: 'p',
      children: [{
-       type: 'text',
+       type: 'html',
        value: '[TOC]'
      }]
    });

…then you should be able to get rid of the rewriting here.

You can see another spot over in convert.js where we do this for stuff that not really HTML but beyond the scope of the standardized Markdown AST:

idCode = ` {#${node.properties.id}}`;
}
newNode.children?.push({
type: 'html',
value: idCode,
});

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried changing the type to html and it still escaped it. I could try again, but I'm pretty sure that's the first thing I tried.

I only assumed that escaping my toc macro would make it not render in most editors. I didn't test it. I thought about adding <!-- toc --> as well which is why I made it a dropdown select so it could be added. I might be able to toss that in quickly at some point. I did have this in there once and <!-- toc --> gets escaped (prefixed with backslash) as well.

I agree about the default being the original way. I was of course coding it for my needs at the time and never changed it back.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahhh, I see, I forgot we are still working with the HTML AST at this point, not the Markdown AST. html isn’t actually even a valid node here; I guess it’s just being treated like text if it’s an invalid type.

So the “right” way is actually in two parts:

  1. Mark up the TOC with something identifiable in fix-google-html.js.

    For example, maybe replaceTableOfContents() wraps the TOC in a <nav is-table-of-contents="true"> element or something instead of actually replacing the TOC with a text node that says [TOC].

  2. Add a handler that can replace that markup with the right html node during conversion from HTML AST → Markdown AST in convert.js. anchorHandler() is probably a good example of this second part.

    To go with the example in step 1, This would identify <nav> elements and either replace them with the {type: 'html', value: '[TOC]'} node if the is-table-of-contents attribute is present or defer to the default handler if not.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This actually makes a lot of sense and would be way better. For the title detection I added a property on the node. I could do the same thing here if that's sounds appropriate to you. I understand the structure of this code better than when I first started and tbh, I was doing the bare minimum to achieve the results I wanted while keeping things in separate branches so I could provide back in case the changes ended up being decent. I don't mind spending a little more time on this to produce cleaner maintainable code to provide back, but it might end up taking me a while since I have my job, a kid, and personal project backlog things I want to get done.

I also don't mind if someone wants to edit my branch here directly to help speed this along or something like that. I think there was an option I selected that should allow others to submit code directly to this PR branch. Whatever is easiest for everyone.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the title detection I added a property on the node. I could do the same thing here if that's sounds appropriate to you.

Yep, that should work just fine for the TOC.

Just in case you haven’t tripped over this, if/when you start using the slice clip data to identify the title, you’ll need to handle it a bit differently — however you change or annotate it needs to be done in a way that can get serialized back to HTML (so it needs to be an attribute or a change to the DOM structure). That’s because the workflow is basically:

  1. Paste event.
  2. Grab the HTML and the slice clip data from the clipboard and run updateHtmlWithSliceClip() to get a nice HTML tree that is decorated with stuff info from the slice clip that is normally missing from the HTML (like the title).
  3. Serialize the resulting AST to HTML and put it in the input area.
  4. Pick up the content of the input area and run it through cleanGoogleHtml() to get an HTML AST that is ready to turn into Markdown.
  5. Convert the HTML AST to a Markdown AST.
  6. Serialize the resulting AST to Markdown and put it in the output area.

When stuff is coming from the slice clip (so it's happening in step 2), you need to make sure whatever you do makes it between steps 3 and 4, so it has to be something that can be serialized to HTML. For the title, that might be making it an <h1> and downgrading everything else, or it might be some special attribute. (The reason for this 2 step conversion is so people can further edit the input area after pasting, or paste snippets from multiple documents into the input area together, without having to worry about how that changes the index of things in the slice clip data.)

I don't mind spending a little more time on this to produce cleaner maintainable code to provide back, but it might end up taking me a while

No worries! Just let me know when it’s ready for more review. 😄

@@ -37,9 +37,9 @@
"main": "./lib-ui/index.js",
"type": "module",
"scripts": {
"build": "mkdir -p dist && webpack",
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These fail for me on windows. My linux dev box is down for now. Are they really needed. In other projects, webpack created these for me.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure! I originally wrote this way back in 2018, so I wouldn’t be surprised if Webpack has added this functionality internally since then. :)

@MaderHatt3r
Copy link
Author

Aha. I didn't add any new tests or run the existing ones. I will have to check that later..

Copy link
Owner

@Mr0grog Mr0grog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is pretty neat! 👍

I would definitely like to add the title/headers stuff, and the TOC thing seems reasonable to me, too.

I added a few notes here, but I probably won’t have much time for a deeper review of this or any additional changes you make until at least this weekend, maybe not until next week. Apologies for the delay.

In the mean time, I left some notes on stuff you called out. Some other quick thoughts:

I can submit in several PRs if preferred, but I would have to do some additional cleanup and a couple branches depend on each other…

It would be helpful if you could separate out the continuous code block and stripping-formatting-from-headers stuff into separate PRs.

Don't over-format headers: Header blocks do not need additional formatting like bold and italics. Let's strip the plain text for this conversion.

I am not sure I agree with this one. A header without special formatting in Google Docs should already be coming across a plain text here, and if somebody has specially added italics, bold, etc. in their doc, I think it makes sense to respect that. A lot of work has already been put into making this work well (see #113, #158, #212).

Downgrade headers… Detect Title

Having support for titles is great! This has been something I’ve wanted to get to for a while. One alternative approach to titles that I had been thinking of was to turn them into YAML frontmatter:

title: [title here]
---
[rest of markdown doc]

I don’t think you need to implement that here, but I’m thinking this suggests a different way to approach the options/settings. Instead of separate settings for "downgrade headers" and "detect title", just have one setting for how to treat titles if they are detected:

  • Ignore (don’t detect)
  • Make H1 (and downgrade other headers) (the default value)
  • YAML Frontmatter (no need to actually have this option or implement now; just reserving for the future)

What do you think about making it work that way?

Comment on lines +933 to +967
/**
* Convert first large text block to H1 if it looks like a title
*/
function convertTitleToHeading(tree, options) {
let foundTitle = false;

visit(tree, (node, index, parent) => {
if (foundTitle || node.tagName !== 'p') return;

const firstChild = node.children?.[0];
if (!firstChild || firstChild.tagName !== 'span') return;

const style = resolveNodeStyle(firstChild);
const isTitleCandidate = (
parseInt(style['font-size']) >= 24 || // 24pt or larger
style['font-weight'] === 'bold' ||
style['font-weight'] >= '300'
);

if (isTitleCandidate) {
// Replace paragraph with H1 and mark as detected title
parent.children.splice(index, 1, {
type: 'element',
tagName: 'h1',
properties: {
__detectedTitle: true
},
children: node.children
});

foundTitle = true;
return EXIT;
}
});
}
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be better handled using the slice clip data — i.e. it could be called from updateHtmlWithSliceClip() and use the slice clip data to determine the title (which should tell you unambiguously, without having to do some fuzzy guessing).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll look at it. I really don't like doing fuzzy guessing, but when I inspected the html, I didn't see any indicators that this was originally a title 😞. The "fuzziness" of the logic is one reason I made it a selectable option. It does at least only consider this for the first text block, so that helps. I also checked an nothing else included a font-weight numerical value besides headers and titles, so I thought about checking "if font-weight is a number but node isn't a header", but I still don't trust that universally, and who knows, maybe someone will like the option to assume their first header is the title 🤷

I'm open to changing this logic to whatever works if you have preferences/ I have time.

Copy link
Owner

@Mr0grog Mr0grog Feb 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when I inspected the html, I didn't see any indicators that this was originally a title

Yeah, there is nothing in the HTML to really indicate it reliably, which is why I had punted on this for a while. The slice clip data (which is just Google Docs’s internal representation for copy/pasting, and which I added support for later) does have this: a title paragraph will have a ps_hd property of 100.

You can see in the getHeadings() function in slice-clip.js where we skip over titles:

// Titles have `ps_hd = 100`, so confine ouselves to normal heading levels.
if (style.ps_hdid && style.ps_hd < 7) {
const text = sliceClipText(sliceClip).slice(current.start, i);
headings.push({
...current,
end: i,
text,
level: style.ps_hd,
id: style.ps_hdid,
});
}

…and in fixInternalLinks() where we use that to make sure all the headings in the pasted HTML have appropriate id attributes so they can be linked to. You could use the same technique to identify the paragraph that is the title and mark it up appropriately. Then no guessing is needed.

For this particular case, you might need to use visitRangesInTree() or replaceRangesInTree() to match up the data about headings with actual HTML nodes. fixCodeSnippedObjects() or markSuggestions() might be good examples of how to use those.

return result.value;
let output = result.value;
if (options.tableOfContentsStyle === 'replace') {
output = output.replace(/\\?\[TOC\\?\]/g, '[TOC]');
Copy link
Owner

@Mr0grog Mr0grog Feb 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It’s doing that because [TOC] is (potentially) a reference-style link in Markdown, not regular text. The \ escapes it to ensure it is treated as text by a Markdown renderer.

I would normally say this should probably be kept with the \, but I’m guessing the Markdown renderer you are using doesn’t support that. Is that right? What renderer? (I know some use HTML comments, e.g. <!-- toc -->, so this is not the only format that someone might want) The right behavior here probably depends on how common or broadly used it is. Since this is way-outside-the-standard behavior, I think the option to replace should not be the default. You should at least change that.

The best way to do this is probably to make it an HTML node in the AST, not text. That is, down in replaceTableOfContents(), you should:

   node.children.splice(tocStartIndex, tocLength, {
      type: 'element',
      tagName: 'p',
      children: [{
-       type: 'text',
+       type: 'html',
        value: '[TOC]'
      }]
    });

…then you should be able to get rid of the rewriting here.

You can see another spot over in convert.js where we do this for stuff that not really HTML but beyond the scope of the standardized Markdown AST:

idCode = ` {#${node.properties.id}}`;
}
newNode.children?.push({
type: 'html',
value: idCode,
});

@Mr0grog
Copy link
Owner

Mr0grog commented Feb 27, 2025

Aha. I didn't add any new tests or run the existing ones.

Also, yes! Please add tests.

@MaderHatt3r
Copy link
Author

MaderHatt3r commented Feb 27, 2025

Don't over-format headers: Header blocks do not need additional formatting like bold and italics. Let's strip the plain text for this conversion.

I am not sure I agree with this one. A header without special formatting in Google Docs should already be coming across a plain text here

Unfortunately that wasn't my experience. It seemed to be picking up formatting from the header styling that wasn't explicitly applied. Even if I did manually bold or italicize my headers, I wouldn't want it to carry over into a markdown document. That's what stylesheets in renderers are for. I could add an option for it or simply not include it as a PR if this isn't a direction you want to take for the core project.

Having support for titles is great! This has been something I’ve wanted to get to for a while. One alternative approach to titles that I had been thinking of was to turn them into YAML frontmatter:

I thought about that and almost made the selection a dropdown, but I don't use that and I was rushing at that point 😅

I don’t think you need to implement that here, but I’m thinking this suggests a different way to approach the options/settings. Instead of separate settings for "downgrade headers" and "detect title", just have one setting for how to treat titles if they are detected:

You read my mind! I almost included detect titles in the downgrade headers options, but didn't because there is no title tag in the copied gdocs html and wanted to make it a separate option in case someone experienced issues with the fuzzy logic.

I have one more change I plan on making. I had an issue where my code block formatting ran into an image pasted into my doc which made it fail to include the image in any way. Strangely, this was before I even added the logic for "continuous code blocks". I fixed this by going into that doc and changing the paragraph style for that image and played around with adding and deleting new lines, but I think I'd like to go back and add a change to break code blocks on images as well so I don't have to scrutinize my conversions too closely. I can try to add that to one of the PRs I make. If you have a recommendation on something that can download these image links as local resources, please let me know. I think Obtainium can do it, but I haven't looked yet. If not, I was thinking of writing a small py script for it.

It would be helpful if you could separate out the continuous code block and stripping-formatting-from-headers stuff into separate PRs.

I will try to split these PRs soon. Maybe this weekend. I want to finish this project soon since I already have everything done besides monotonously copying and pasting each of the documents.

PS:
Thanks for your quick response and for providing and maintaining this project. I was so happy to find it!

@Mr0grog
Copy link
Owner

Mr0grog commented Feb 28, 2025

Unfortunately that wasn't my experience. It seemed to be picking up formatting from the header styling that wasn't explicitly applied.

Can you supply some example docs? What browser? This is definitely a bug.

Even if I did manually bold or italicize my headers, I wouldn't want it to carry over into a markdown document. …That's what stylesheets in renderers are for.

I guess I have seen a lot of examples where this is not the case. I can see this if the entire heading has consistent formatting (whatever that formatting is), but I would be surprised if you mean you want this when, say, a couple words in the heading are italicized. Stylesheets and renderers cannot solve that.

I’m open to adding an option here, but I suspect you are focused on a case where the whole heading is styled one way, and I am focusing on cases where it is not. I think it might be fine if any styling that applies to the whole heading is stripped, and that probably doesn’t need a special option. I haven’t had time for a close look, but I think that’s different from what you implemented here. That’s probably another good reason to split out a separate PR for it; it seems like there’s more to hash out about how it should work. 🙂

I had an issue where my code block formatting ran into an image pasted into my doc which made it fail to include the image in any way.

Can you link to an example doc that is public?

I already have everything done besides monotonously copying and pasting each of the documents.

Just in case you haven’t seen it: there is a system for downloading all the relevant parts and versions of a Google doc as a fixture in download-fixtures.js. Just make your Google doc public and add its ID to the list of fixtures at the top, then run npm run download-fixtures. That way you don’t have to copy everything by hand, plus there is some automation here that will update those fixtures if Google Docs changes its copy/paste or export formats.

Here’s an example of how you can utilize those fixtures:

createFixtureTest('internal-links', { type: 'copy' });
createFixtureTest('internal-links', {
type: 'copy',
variation: '.html',
options: { headingIds: 'html' },
});
createFixtureTest('internal-links', {
type: 'copy',
variation: '.extended',
options: { headingIds: 'extended' },
});

That’ll create a test that loads test/fixtures/<test-name>.copy.html and test/fixtures/<test-name>.gdocssliceclip.json (both created by the download-fixtures script), run a conversion with the specified options, and diff the result against test/fixtures/<test-name>.expected.<variation>.md (this markdown file you’ll have to code by hand).

@MaderHatt3r
Copy link
Author

MaderHatt3r commented Mar 2, 2025

but I would be surprised if you mean you want this when, say, a couple words in the heading are italicized

Oh wow I had not thought about emphasizing individual words in a header. I of course have not put as much time or thought into this beyond my own use case. Perhaps the clean headers is better as an option or not merged. When I put it in a PR, I'll make it a selectable option or perhaps change it to clean only whole header formatting unless you don't see value in adding it to this project.

I made a sample document containing a combination of some of the issues I was running into. It's interesting. Until I added the config blocks with special characters, it actually produced a continuous code block without my additional code fixing this for my docs. The image issue is happening either way. I imagine it should be easy to write something to detect images in a paragraph and fix them to not be treated as a code block.

screenshot of doc conversion

@MaderHatt3r
Copy link
Author

Note: I reverted the continuous code block and stripping-formatting-from-headers commits from this PR for now to put into their own PRs later.

@Mr0grog
Copy link
Owner

Mr0grog commented Mar 3, 2025

Perhaps the clean headers is better as an option or not merged. When I put it in a PR, I'll make it a selectable option or perhaps change it to clean only whole header formatting unless you don't see value in adding it to this project.

That works! I think an option would be fine, and I also think no option but it only strips formatting that applies to the whole header would work pretty well, too (maybe better!).

I reverted the continuous code block and stripping-formatting-from-headers commits from this PR for now to put into their own PRs later.

👍

I imagine it should be easy to write something to detect images in a paragraph and fix them to not be treated as a code block.

I think what’s happening here is that we wind up with a tree that represents something like:

<p>
  <!-- Originally was: <span style="font-family: monospace;">, but we convert to: -->
  <code>
    <img src="...">
  </code>
</p>

And then the default conversion of <code> into Markdown just looks at the text content, and doesn’t consider whether there is replaced content (images, audio, video, etc.) in there.

So yeah, the best way to solve this is to pull that image out of the code block. Probably when we convert monospace formatting to <code> nodes, look for replaced content nodes (or at least <img> nodes) and pull them up a level and out of the <code>. e.g. convert this:

<p>
  <span style="font-family: monospace;">
    abc<img src="...">def
  </span>
</p>

To this:

<p>
  <!-- Break the code element around the img -->
  <code>abc</code>
  <img src="...">
  <code>def</code>
</p>

…then make sure we remove the <code> nodes on either side of the <img> if they are empty (which they would be in your example doc).

Another approach might be to customize the handler for <pre> and <code> nodes in lib/convert.js.

I don't mind spending a little more time on this to produce cleaner maintainable code to provide back, but it might end up taking me a while

No worries! Just let me know when it’s ready for more review. 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants