A TinyMCE plugin for cleaning up HTML copy-paste disasters

This post was written by eli on April 23, 2022
Posted Under: JavaScript,Rich text editors

Introduction

This post is about a plugin I wrote for TinyMCE v5.10.2, which cleans up some of the mess that often occurs in the HTML code when copy-pasting text from the document to itself. It doesn’t cover copy-pasting from other sources; however, the plugin is easily modified to wipe away all kinds of superfluous HTML. See this as boilerplate code.

[Canonicalize plugin icon] The plugin adds a button to the toolbar with an icon of a cleaning mop, as shown to the left. When clicked, the selected text is cleaned up, or as I prefer to call it, canonicalized.

It’s a bit of work in progress. For now, it addresses two issues:

  • Unnecessary “style” formatting attributes
  • Break tags inside <pre> blocks

I’ll explain both of these issues, and then I’ll show the plugin, with source and explanations.

Note that the selection granularity is blocks (as in <p> and <div> segments). This means that if a word within a block is selected, all text in the block is canonicalized. I’ll explain why further down, but I wanted to have this on the table right away, as it may appear to be a bug. Anyhow, the way I will probably use this plugin is by selecting all text, so I doubt it will matter much. At least to me.

Also note that this plugin considers a “style” formatting attribute superfluous if the formatting of the enclosed text remains exactly the same even when it’s removed. Hence if there are multiple views of the same page (e.g. for printing, or a narrower viewport) it may remove formatting that would have made a difference on another view. It may be correct to remove that formatting nevertheless. I discuss this later too.

My anecdotal tests with Chrome and Firefox resulted in bytewise exactly the same HTML output code. But that can change in the future.

The superfluous “style” attribute problem

Rich text editing inside web browsers has the well-known problem that the same keyboard input can produce different HTML, depending on which browser (and which version) is used. But that’s small potatoes compared with what happens when text is pasted into the editing window.

The problem is that unless “Paste Text Only” is used, formatting attributes go along with the pasted text. Browsers can be pretty creative in how they turn these attributes into the DOM representation of the text, and consequently the HTML it produces once saved (for example, this thread and this too, both regardless of TinyMCE). This is kind-of understandable when the formatted text comes from an external source (a word processor or a web page), but it gets really obnoxious when junk formatting is inserted as a result of copy-pasting within the same editing window.

So yes, a plain copy-paste sequence from one part of the editing window to another can end up with things like:

<p style="font-family: verdana, Arial, Helvetica, sans-serif; font-size: 16px; font-weight: 400;">Text</p>

when the copied text was just

<p>Text</p>

Break tags inside <pre>

Another annoying thing is that when pasting into a preformatted block (that is, within a <pre> tag), newlines are sometimes translated into <br> tags. That’s completely unnecessary, ending up with a very long single line instead of HTML that is easily read as plain code.

Why do browsers add the “style” attribute?

What apparently happens is that the browser (Google Chrome in this case) sees importance in preserving the formatting of the text as copied, and adds an inline “style” attribute with the explicit definition of the font formatting. What’s really annoying is when the text would have been rendered exactly the same without this “style” attribute. That is, when the calculated formatting of the text part is exactly the same with and without it.

One may wonder why Chrome (and other browsers) doesn’t optimize away this unnecessary formatting assignment. The truth is that I don’t know, but I speculate that the rationale is that it’s not 100% clear whether the formatting is necessary or not. The thing is that the calculated formatting (the visible result) of the environment can be altered by virtue of CSS manipulations. For example, a different viewport, or even resizing the window can change the CSS definition of the document. Hence the fact that the pasted text happens to be formatted exactly like its environment doesn’t necessarily mean that it will stay so: The environment can change dynamically.

And that leaves us with the question: If the environment’s formatting changes, should the pasted text change along with it, or should it retain the format it had in its origin? If the pasted text is formatted the same as its environment, does it necessarily mean it’s the same in essence?

That question is very easy to answer in my case, since I don’t make any CSS manipulations. Hence if it looks the same, it is the same. Therefore if the inline style formatting doesn’t change anything, it should go away.

Selection with block granularity

How do you manipulate all DOM elements within a selection? Loop on all DOM elements inside it, of course. But if all text in a paragraph is selected, does it include the <p> tag that encloses it? And if the last character is excluded from the selection, possibly by mistake, should the <p> tag be excluded or included?

This isn’t a philosophical question: Often the “style” attributes that need to be removed are placed in the <p> tag itself.

So I could have developed a clever algorithm for this (and fixed its bugs eternally), but I went for a simpler approach: Apply a block-level formatting for the entire selected region. The trick is to call formatter.apply with a format that is registered as block-level formatting, so TinyMCE wraps the entire selection with a <div> element defining a dedicated class. Now all that is left is to request all DOM elements that are descendants of this specific class. That’s a simple CSS filter.

However because this is block-level formatting, it applies to all text in the block, not just exactly the segment that has been selected. It’s like requesting a block quote: It won’t be done on part of the block.
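The wrap-then-select idea can be sketched roughly as follows. This is only an illustration: withTemporaryWrapper and its visit callback are made-up names, and the sketch assumes that a block-level format named canonicalizer_wrapper (producing a <div> with the class canonicalizer_tmp) has already been registered, as shown further down.

```javascript
// Hypothetical helper: wrap the selection, visit every element inside
// the wrapper, then unwrap again (even if the callback throws).
function withTemporaryWrapper(editor, visit) {
  // Assumes 'canonicalizer_wrapper' is registered as a block-level
  // wrapper format that produces <div class="canonicalizer_tmp">.
  editor.formatter.apply('canonicalizer_wrapper');
  try {
    editor.dom.select('div.canonicalizer_tmp *').forEach(function (el) {
      visit(el);
    });
  } finally {
    // keepChildren = true: drop the <div>, keep what it contains
    editor.dom.select('div.canonicalizer_tmp').forEach(function (el) {
      editor.dom.remove(el, true);
    });
  }
}
```

The try/finally is a design choice of this sketch, not something the plugin does: it makes sure the temporary <div> doesn’t stay in the document if the visiting function fails halfway.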

One could ask why not use inline formatting, like I did with the syntax highlighting plugin. That would, in theory, generate a <span> element on the exact part of the selection, nothing else. But that wouldn’t work at all: If the selection includes more than one block, a <span> tag is inserted inside each of these blocks. Hence this method is guaranteed to miss <p> and other block tags that have “style” attributes that need removal.

So even if I wrote a clever algorithm, there would have been quirky behavior no matter what choices I would make. Applying the plugin to the entire block is relatively intuitive.

The source code

Without further ado, this is the JavaScript code for the said plugin.

(function () {
    'use strict';

    // The editor instance is passed in explicitly by the button's
    // onAction handler below.
    let canonicalize = function(editor) {
      // First clean up possible leftovers
      editor.dom.select('div.canonicalizer_tmp').forEach(function(el) {
        editor.dom.remove(el, true);
      });

      editor.formatter.apply('canonicalizer_wrapper');

      // Replace <br> tags under <pre> with newlines
      editor.dom.select('div.canonicalizer_tmp pre br').forEach(function(el) {
        var newline = el.ownerDocument.createTextNode("\n");
        editor.dom.insertAfter(newline, el);
        editor.dom.remove(el, false);
      });

      editor.dom.select('div.canonicalizer_tmp *').forEach(function(el) {
        var style = editor.dom.getAttrib(el, 'style');

        if (!style)
          return;

        var thewindow = el.ownerDocument.defaultView;
        var from_inline = editor.dom.parseStyle(style);
        var prestyle = thewindow.getComputedStyle(el, null);
        var before = { };
        var p;

        // Copy the computed values of the inline-assigned properties
        for (p in from_inline) {
          before[p] = prestyle[p];
        }

        editor.dom.setAttrib(el, 'style', null); // Remove the attribute
        var after = thewindow.getComputedStyle(el, null);
        var survivors = { };

        for (p in from_inline) {
          // If some style names should be removed anyhow, this is the place
          // to look it up and issue a "continue" on a match.

          if (before[p] === after[p])
            continue;
          survivors[p] = from_inline[p];
        }

        editor.dom.setAttrib(el, 'style', editor.dom.serializeStyle(survivors));
      });

      // Cleanup the <div> wrapper
      editor.dom.select('div.canonicalizer_tmp').forEach(function(el) {
        editor.dom.remove(el, true);
      });

      editor.undoManager.add(); // Make operation undoable
    };

    tinymce.PluginManager.add('canonicalize', function (editor) {
      editor.ui.registry.addIcon('canonicalize-icon', '<svg width="24" height="24" fill-rule="evenodd"><path d="M 5.86,22.51 C 5.51,22.44 3.79,21.81 3.41,20.46 3.03,19.11 3.38,16.66 5.80,13.83 8.23,10.99 9.54,10.29 10.70,10.15 10.72,9.79 11.26,9.23 12.12,9.48 12.67,8.38 19.15,0.70 19.69,0.00 20.10,-0.16 20.29,-0.27 20.76,0.00 19.25,2.28 13.81,9.44 13.14,9.95 13.46,10.27 13.69,10.74 13.47,11.10 14.08,11.86 13.74,13.06 13.00,15.04 12.39,16.57 10.84,19.72 15.22,19.71 15.04,20.17 14.85,20.60 13.57,20.74 13.97,21.05 14.21,21.32 14.64,21.52 14.04,21.50 13.08,21.68 12.27,21.21 12.54,21.77 12.79,22.06 13.03,22.30 12.07,22.21 11.08,21.96 10.53,21.55 10.54,22.06 10.66,22.43 10.77,22.69 9.90,22.51 8.41,22.07 7.69,21.46 7.74,22.44 7.81,22.27 7.93,22.63 6.30,21.91 6.29,21.75 5.59,21.07 5.68,21.84 5.68,22.05 5.86,22.51 Z M 5.92,14.23 C 5.92,14.23 6.92,13.04 7.05,12.91 7.74,14.46 9.31,14.87 12.82,14.54 12.50,15.40 12.33,15.66 12.28,15.81 9.93,16.07 7.12,16.52 5.92,14.23 Z" /></svg>');

      editor.ui.registry.addButton('canonicalize', {
        icon: 'canonicalize-icon',
        tooltip: 'Canonicalize (Cleanup unnecessary styles and <br>)',
        onAction: function () {
          canonicalize(editor);
        }
      });
    });
}());

For this to work, the string “canonicalize” should be added to the plugins property, as well as somewhere in the toolbar property in the editor’s initialization (the tinymce.init call).

Also, the “init_instance_callback” property should be set to “mce_post_init”, so that the canonicalizer_wrapper format is registered during initialization with

function mce_post_init(editor) {
  // "editor" is the instance that TinyMCE passes to the callback
  editor.formatter.register('canonicalizer_wrapper',
    { block: 'div', classes: 'canonicalizer_tmp', wrapper: true } );
}

So we have something like

tinymce.init({
  selector: '#mytextarea',
  init_instance_callback : "mce_post_init",
  [ ... ]
  plugins: 'canonicalize [ ... ]',
  toolbar: '[ ... ] canonicalize [ ... ]',
  [ ... ]
});
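As a variation, the callback can also be handed over as a function rather than as the name of a global function. A minimal sketch, assuming the same selector and plugin name as above; registerWrapperFormat and config are names made up for this example, and the elided options are simply left out:

```javascript
// Hypothetical variation: give TinyMCE the callback itself, so nothing
// depends on a global function name.
function registerWrapperFormat(ed) {
  ed.formatter.register('canonicalizer_wrapper',
    { block: 'div', classes: 'canonicalizer_tmp', wrapper: true });
}

var config = {
  selector: '#mytextarea',
  init_instance_callback: registerWrapperFormat,
  plugins: 'canonicalize',
  toolbar: 'canonicalize'
};

// Only call init where TinyMCE is actually loaded (i.e. in the browser)
if (typeof tinymce !== 'undefined') {
  tinymce.init(config);
}
```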

Walking through the code

The canonicalize function starts with cleaning up possible <div> wrappers that may have been left from a previous mess-up. This should never be necessary, but better safe.

The call to editor.formatter.apply() applies the format defined as canonicalizer_wrapper to the current selection. Because this format is defined as “block” and has the “wrapper” property true, TinyMCE wraps the selected region with a <div> having the class “canonicalizer_tmp”. This is a block-level wrap, including the entire blocks that contain the selection.

The next thing is to replace <br> tags inside <pre>. The tags to manipulate are found with editor.dom.select('div.canonicalizer_tmp pre br'), i.e. by virtue of a plain CSS selection expression that says “find me all <br> tags that are descendants of <pre> tags, that are descendants of a <div> tag with class canonicalizer_tmp”. Or for short, inside the selection.

For each such <br> element found, a DOM text node containing a newline is created by the browser API’s createTextNode() (does TinyMCE have something parallel? I didn’t find any). This node is put after the <br> element, after which the latter is removed from the DOM with editor.dom.remove(). The second argument in this call, “false”, means that any sub-elements should be removed as well, but <br> shouldn’t have any, so this doesn’t matter much.
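The same replacement can also be written with the plain DOM API only. This is a sketch rather than what the plugin does: brToNewline is a made-up name, and it relies on the standard replaceChild() method of the parent node instead of TinyMCE’s insertAfter() and remove().

```javascript
// Hypothetical helper: swap a <br> element for a "\n" text node
// using nothing but standard DOM calls.
function brToNewline(br) {
  var newline = br.ownerDocument.createTextNode('\n');
  br.parentNode.replaceChild(newline, br); // new node in, <br> out
  return newline;
}
```

In the plugin’s context, it would be called for each element returned by editor.dom.select('div.canonicalizer_tmp pre br').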

That was the easy part. Now to getting rid of useless “style” assignments.

Similar to before, editor.dom.select('div.canonicalizer_tmp *') finds all DOM elements that are descendants of a <div> tag with class “canonicalizer_tmp”. An anonymous function is called for each.

If the “style” attribute isn’t present at all, nothing is done for the element (“return” inside a looped function is like “continue” inside a loop).

Then a few preparations: “thewindow” is assigned with what is usually referred to as “window” in JavaScript, and odds are that the latter could be used anyhow. I use this form because it appears somewhere in TinyMCE’s internal code, so it may be overkill, and maybe not.

Then “from_inline” is assigned with an object that is a key-value representation of the text in the “style” attribute. And then “prestyle” is assigned with the object that represents the browser’s computed style of the element. In other words, the key-value attributes of the formatting, as displayed, with keys like “font-size”, “font-family” and many many more. This is done by calling getComputedStyle() on the Window Object.

Next, there’s a loop on all attributes that were assigned in the DOM element’s “style” attribute — they are the keys of “from_inline”. In this loop, the “before” object is populated with keys taken from the “style” attribute, and values from the currently calculated formatting for that DOM element.

Note that stringwise comparing the value in “from_inline” with that of “prestyle” wouldn’t have helped much, because there’s more than one way to express formatting. For example, the value in “from_inline” could be “red” and in “prestyle” it would probably say “rgb(255, 0, 0)”. Different strings meaning the same thing.

Because there are a gazillion key-value pairs in the browser’s computed style, a copy of the relevant ones is made, as a preparation for the next step:

The “style” attribute is removed from the DOM element, as if to say: Now let’s see if that made a difference.

And here comes the most misleading part of this code: “after” is assigned with the value of getComputedStyle(). This is most likely unnecessary: The values in the “prestyle” object (previously returned by getComputedStyle() ) magically change as the DOM element changes, at least on Chrome. Hence I could have used the same object to check the formatting even after removing the “style” attribute.

However as far as I could tell, the documentation doesn’t say anything on this peculiar feature. So the bottom line is that “prestyle” can’t be relied upon to be updated after the change, but neither can it be relied upon to retain the values from before the change in the DOM object. The easy way to solve it: Call getComputedStyle() again, after the change. That surely works either way.

That’s why the relevant key-value pairs are copied into the “before” object prior to the DOM change. It’s a simple object, and nothing fishy will happen with it. I nevertheless call getComputedStyle() again to get a fresh “after” object, just to be sure.
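The copying step boils down to something like the following. This is a minimal sketch with a made-up snapshot function; the “live” object in the usage lines merely imitates a computed-style object whose values change behind one’s back, and its values are invented for the example.

```javascript
// Hypothetical helper: copy only the listed properties out of a
// (possibly live) computed-style object into a plain object.
function snapshot(computed, names) {
  var copy = {};
  names.forEach(function (name) {
    copy[name] = computed[name];
  });
  return copy;
}

// Stand-in for a live object: its values change after the copy is made
var live = { 'font-size': '16px', 'color': 'rgb(255, 0, 0)' };
var before = snapshot(live, ['font-size']);
live['font-size'] = '12px';

console.log(before['font-size']); // the copy still holds '16px'
console.log(live['font-size']);   // the live object now says '12px'
```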

So after the removal of the “style” attribute, there’s a loop again on the keys of “from_inline”, this time comparing the computed value before the change, as copied into “before”, and the currently computed value. If they are different, the “survivors” object is assigned with the key-value pair as it appeared in the original “style” attribute.
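Stripped of the DOM calls, this comparison is a small pure function. The following is a sketch under that simplification: survivingStyles is a made-up name, and the three objects in the usage lines are invented values standing for the parsed “style” attribute and the computed values with and without it.

```javascript
// Hypothetical helper: keep only the inline declarations whose removal
// changed the computed value.
function survivingStyles(fromInline, before, after) {
  var survivors = {};
  Object.keys(fromInline).forEach(function (name) {
    if (before[name] !== after[name]) {
      survivors[name] = fromInline[name];
    }
  });
  return survivors;
}

var result = survivingStyles(
  { 'font-size': '16px', 'color': 'red' },            // inline "style"
  { 'font-size': '16px', 'color': 'rgb(255, 0, 0)' }, // computed, with it
  { 'font-size': '16px', 'color': 'rgb(0, 0, 0)' }    // computed, without
);
console.log(result); // { color: 'red' }
```

Note that the surviving value is taken from the original attribute (“red”), not from the computed style, so the attribute is written back the way it was originally given.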

Note the comment in this loop: If you want to wipe out certain style attributes whether they make a difference or not, just make sure that a “continue” is called when “p” equals their name, and you’re done.
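For example, the loop could look something like this with an unconditional removal list added. It is only a sketch: ALWAYS_REMOVE and survivingStylesStrict are made-up names, the choice of font-family and font-size is arbitrary, and the function takes the three objects as arguments instead of reading them from the DOM.

```javascript
// Hypothetical list of properties to drop unconditionally
var ALWAYS_REMOVE = ['font-family', 'font-size'];

function survivingStylesStrict(fromInline, before, after) {
  var survivors = {};
  for (var p in fromInline) {
    if (ALWAYS_REMOVE.indexOf(p) !== -1)
      continue; // removed whether it makes a difference or not
    if (before[p] === after[p])
      continue; // removal changed nothing, so it is superfluous
    survivors[p] = fromInline[p];
  }
  return survivors;
}
```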

After this loop, the DOM element’s “style” attribute is assigned with the translation to text (using serializeStyle() ) of the key-value pairs that were stored in the “survivors” object. If the object is empty, no “style” attribute is generated.

This concludes the loop for each DOM element in the selection. Prior to exiting, all “canonicalizer_tmp” <div> elements are removed. Note that the second argument of the remove() method call is true, which means that the children of this element should be retained. Or else all content in the selection would be wiped away.

Plus undoManager.add() is called to make the entire operation undoable.

Some extra notes

  • The plugin code doesn’t remove <span> elements that have neither “style” nor “class” attributes. Such elements are pointless, and they may very well be the result of a <span> element that carried a “style” attribute that was removed throughout the process. As it turns out, TinyMCE removes these in the output given to getContent(), so they don’t appear in the saved document nor when viewing the source code inside the editor. In fact, they can be removed just by viewing the source code and then saving it. So I found it pointless to add an additional manipulation to fix a problem that practically doesn’t exist.
  • TinyMCE has a getStyle() method for obtaining the computed format. It’s based upon getComputedStyle() and does a whole lot of API friendliness stuff. I opted out mainly because I just needed a before-after check.
  • It may be possible to make the cleanup as the text is pasted, so there won’t be a need to do it explicitly. For example, this page suggests hooking on “paste_preprocess” to get a chance to clean up the content. By the same token, maybe the “paste_postprocess” callback can be used to call canonicalize() every time new text is pasted, so it gets cleaned up immediately. The problem with this approach is the on-the-fly intervention, which can be confusing if something goes wrong. I prefer that witchcraft is done when I request it, so I can keep an eye on whether something fishy happened. And by the way, this requires the Paste premium TinyMCE plugin.
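Regarding the first note: should bare <span> elements ever need to be removed explicitly, something along these lines would probably do. It is a sketch only: unwrapBareSpans is a made-up name, it relies on the same editor.dom.select() and editor.dom.remove() calls used in the plugin, and it is deliberately stricter than the note, touching only <span> elements with no attributes at all.

```javascript
// Hypothetical helper: unwrap <span> elements that carry no attributes
// at all, keeping their children in place.
function unwrapBareSpans(editor) {
  var removed = 0;
  editor.dom.select('span').forEach(function (el) {
    if (el.attributes.length === 0) {
      editor.dom.remove(el, true); // keepChildren = true
      removed++;
    }
  });
  return removed; // number of <span> elements that were unwrapped
}
```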
