First Look at Rich Text: Overview of Rich Text

A rich text editor usually refers to a product that can edit text, images, and other content with a WYSIWYG (what you see is what you get) experience. Elements such as <input> and <textarea> support content editing, but they cannot format text or insert images, so a rich text editor is needed to meet these requirements. Nowadays, rich text editors are no longer limited to text and images; they also cover more complex features such as videos, tables, code blocks, mind maps, attachments, formulas, and format painters.

Description

Rich text editing is a very deep field, and editors are very difficult to implement: how to handle the cursor and how to handle selections are just two examples. With the help of the browser's capabilities we can implement such features relatively easily, but this can lead to over-reliance on the browser and to compatibility issues. In addition, there is a very strong competitor called Word. When a product manager proposes a requirement modeled after Word, it becomes a big challenge, because Word's capabilities are extremely powerful. Its large installation package is not without reason: it handles a huge number of rich text cases internally.

Of course, what we are describing here is rich text implemented in the browser, and it is not feasible to reproduce Word's functionality in a bundle of a few hundred KB or a few MB. Although implementing rich text editing in the browser is not easy, we developers may prefer to write documents in Markdown anyway. Markdown does not fall within the scope of rich text editors, however, because Markdown files are plain text and the main concern is rendering. If you want to extend the syntax or even embed React components in Markdown, you can refer to the markdown-it and mdx projects.

Evolutionary Journey

Web-based rich text editors are constantly evolving and have run into many difficulties along the way. Their development can be divided into three stages: L0, L1, and L2. There is no good or bad here, only suitable or unsuitable. L1 editors usually meet the needs of the vast majority of rich text editing scenarios, and there are many out-of-the-box editors to choose from; the specific choice depends on the requirements.

| Type | Features | Products | Advantages | Disadvantages |
| --- | --- | --- | --- | --- |
| L0 | 1. Rich text editing based on contenteditable provided by the browser. 2. Executing commands using the browser's document.execCommand. | Early lightweight editors. | Development can be completed in a short period of time. | Very limited room for customization. |
| L1 | 1. Rich text editing based on contenteditable provided by the browser. 2. Data-driven, with custom data models and command execution. | Shimo Document, Feishu Document. | Meets the vast majority of usage scenarios. | Unable to overcome the browser's own typesetting behavior. |
| L2 | 1. Self-implemented typesetting engine. 2. Relies on only a small number of browser APIs. | Google Docs, Tencent Document. | Complete control over typesetting. | Technically very challenging. |

L0

As described above, the development of editors is classified into three stages: L0, L1, and L2. Some divide the L0 stage further into two stages, for a total of four. However, since both of those stages depend entirely on contenteditable and document.execCommand, they are grouped together here under L0.

Before rich text editors appeared, browsers could already display rich text content styled with HTML and CSS, but for user input the <textarea /> and <input /> elements only accept plain text, which is very limiting. Later, contenteditable gave nodes the ability to have their HTML structure edited, so rich text areas can be edited directly without <textarea /> or <input />. Below is the simplest possible text editor: copy the line below into the browser's address bar and you have one. All that is missing for a rich text editor is document.execCommand, so the main work is to build a toolbar that executes commands to change the format of the selected text.

data:text/html, <div contenteditable="true"></div>

Anyone who has implemented text copying should be familiar with document.execCommand("copy"), which is a fallback when navigator.clipboard is not available. MDN lists the commands supported by document.execCommand, including bold and heading (and now marks the API as deprecated). Combining contenteditable with these commands is enough to implement a simple rich text editor.

Of course, once command execution is involved, one line of code is no longer enough. Taking bold as an example, here is a complete DEMO.

<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <title>L0</title>
</head>
<body>
    <div contenteditable style="border: 1px solid #aaa;">Test text bold<br>Select the text and click the bold button below to make it bold.</div>
    <button onclick="document.execCommand('bold')" style="margin-top: 10px;">Bold</button>
</body>
</html>

L1

Modifying the HTML through document.execCommand is simple, but its controllability is clearly poor. For example, when implementing bold, we cannot control whether <b></b> or <strong></strong> is used to achieve the effect. There are also browser compatibility issues: IE uses <strong></strong> for bold, while Chrome uses <b></b>, and IE and Safari do not support the heading command through execCommand. In addition, document.execCommand can only achieve relatively simple formatting; it cannot handle more complex features such as images or code blocks.

For greater extensibility, and to resolve mismatches between data and view, L1 rich text editors introduce a custom data model. The model is a data structure abstracted from the DOM tree, and the same data is guaranteed to render the same HTML. Custom commands operate on the data model directly, so the rendered HTML document stays consistent. In simple terms, the editor builds a data model that describes the document's structure and content, and uses custom commands in place of execCommand to modify it. Similar to the MVVM pattern, executing a command modifies the current model, and the change is then reflected in the view.
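To make the data-driven idea concrete, here is a minimal sketch. The model shape (a flat list of text runs), the toggleBold command, and the render function are all hypothetical names invented for this illustration; real L1 editors use much richer models.

```javascript
// Hypothetical minimal model: a document is a list of text runs,
// each carrying its own formatting marks.
const model = [
  { text: "Test text " },
  { text: "bold", bold: false },
];

// Escape text so that model content can never inject markup.
const escapeText = (s) =>
  s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");

// The view is a pure function of the model: the editor, not the
// browser, decides that bold is always rendered as <strong>.
const render = (runs) =>
  runs
    .map((run) => {
      const text = escapeText(run.text);
      return run.bold ? `<strong>${text}</strong>` : text;
    })
    .join("");

// A custom command changes the model only and returns a new model;
// it never touches the DOM.
const toggleBold = (runs, index) =>
  runs.map((run, i) => (i === index ? { ...run, bold: !run.bold } : run));

const next = toggleBold(model, 1);
console.log(render(model)); // Test text bold
console.log(render(next)); // Test text <strong>bold</strong>
```

Because the same model always renders the same HTML, the <b> versus <strong> difference between browsers no longer arises in this sketch.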

L1 rich text editors address dirty data and the difficulty of implementing complex features in rich text. Being data-driven, they can better support customized features, cross-platform parsing, online collaboration, and so on. An L1 rich text DEMO based on slate has been implemented for this article: Github, Editor DEMO.

L2

In practice, L1 already meets the needs of most scenarios. Only when layout and cursor behavior need deep customization is it necessary to consider L2. Even then, in most cases my suggestion is to re-evaluate the requirements first.

Regarding the need for customized layout and cursor, let's take an example: many users who are familiar with Word might have noticed that if we write text that perfectly fills a line and then add a period, the preceding words will be slightly adjusted to accommodate the period, allowing it to remain on the same line. However, if we add another character, it will cause a line break. This kind of text layout is not achievable in browser rendering. Therefore, in order to overcome the limitations of browser rendering, custom layout capabilities need to be implemented.

Users of CodeMirror 5 may have noticed that, in its default configuration, CodeMirror implements almost everything itself except layout. For example, the cursor is simulated with a div, input is handled by a textarea that follows the cursor, the selection is drawn with overlaid divs, and the scrollbar is custom as well. This definitely has the flavor of L2. It is not fully L2, however, because it still relies on the browser for text layout and uses Range for cursor positioning. Even so, such a nearly self-implemented solution is already quite challenging.

Fully implementing L2 capabilities is no less difficult than developing a browser from scratch, because it requires supporting the various layout capabilities of browsers, which makes it highly complex. There is also a significant performance overhead. A layout engine built from scratch must constantly calculate the position of each character, which is a considerable computational task. If performance is poor, rendering lags during layout and the user experience suffers badly. The mainstream L2 rich text editors currently rely on Canvas to draw all content. Since Canvas is just a drawing board, layout, selection, cursor, and everything else must be calculated and implemented from scratch. Because of the large amount of calculation, much of the workload is often offloaded to Web Workers, which exchange data with the main thread through postMessage.
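As a rough illustration of what "calculating the position of each character" means, here is a minimal sketch of greedy line breaking. The layout and measureChar functions are hypothetical, and the fixed character width is an assumption made so the sketch runs anywhere; a real engine would measure glyphs (for example with the Canvas 2D context's measureText) and would also handle word boundaries, explicit newlines, and justification.

```javascript
// Hypothetical measure function: every character is 10 units wide.
const measureChar = () => 10;

// Greedy character-level line breaking: place each character at the
// current x, and wrap to the next line when it would overflow.
function layout(text, maxWidth, lineHeight, measure = measureChar) {
  const positions = [];
  let x = 0;
  let y = 0;
  for (const char of text) {
    const width = measure(char);
    // Never wrap before the first character of a line, so that a
    // character wider than the line still gets placed.
    if (x > 0 && x + width > maxWidth) {
      x = 0;
      y += lineHeight;
    }
    positions.push({ char, x, y });
    x += width;
  }
  return positions;
}

// Three characters fit on a 30-unit line; "d" and "e" wrap to y = 20.
const boxes = layout("abcde", 30, 20);
console.log(boxes);
```

Every edit potentially invalidates these positions for the rest of the paragraph, which is why the cost adds up so quickly.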

Similar to a game character breaking through a bottleneck period, the transition from L0 to L1 in a rich text editor involved the extraction of a custom data model, while the transition from L1 to L2 involved the development of a custom layout engine.

Core Concepts

The core concepts here mainly refer to some universal concepts in the L1 rich text editor. Because in L1, the editor usually maintains its own set of data structures and rendering schemes, there will generally be a self-built model system, such as Quill's Parchment, Blot, Delta, and so on, as well as Slate's Transforms, Normalizing, DOM DATA MODEL, and so on. However, as long as it relies on the implementation of the browser and contenteditable, it cannot do without some basic concepts.

Selection

The Selection object represents the range of text selected by the user or the current position of the insertion point. A selection may span multiple elements and is typically created by dragging the mouse across text. When the Selection is collapsed, it is what we commonly call the cursor, which means the cursor is really a special state of Selection. Note also that the selection has no necessary connection with the focus event or with values such as document.activeElement.

Because the editor still runs in the browser, a rich text editor must rely on changes to this selection. When the selection changes, the selectionchange event is fired on document, and the editor can do its work in the callback for this event.

window.getSelection();
// {
//   anchorNode: text,
//   anchorOffset: 0,
//   baseNode: text,
//   baseOffset: 0,
//   extentNode: text,
//   extentOffset: 3,
//   focusNode: text,
//   focusOffset: 3,
//   isCollapsed: false,
//   rangeCount: 1,
//   type: "Range",
// }

Editors usually do not operate on the browser's selection directly. In Quill, for example, the selection is represented by a start index and a length, which matches the way its Delta describes the document model. Quill therefore converts the browser Selection into a Delta-based selection that can be manipulated.

quill.getSelection()
// {index: 0, length: 3}

Slate borrows many concepts from the DOM, such as Void Element and Selection, and it also processes the selection into its own form. Because its data structure is JSON shaped like a DOM tree, a Point is represented by Path + Offset, and a selection consists of two Points. In addition, unlike Quill, Slate preserves the direction of the user's selection: selecting the same area from left to right and from right to left produces different results, with anchor and focus reversed.

slate.selection
// {
//   anchor: {
//     offset: 0,
//     path: [0, 0],
//   },
//   focus: {
//     offset: 3,    // text offset
//     path: [0, 0], // text node
//   },
// };
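The effect of keeping the selection direction can be sketched with a few standalone helpers. These functions are hypothetical stand-ins written for this illustration, not Slate's API, and they assume both points address text nodes.

```javascript
// Compare two paths (arrays of child indexes) in document order.
// Returns -1, 0, or 1.
function comparePaths(a, b) {
  const min = Math.min(a.length, b.length);
  for (let i = 0; i < min; i++) {
    if (a[i] < b[i]) return -1;
    if (a[i] > b[i]) return 1;
  }
  return 0;
}

// Points are ordered by path first, then by text offset.
function comparePoints(a, b) {
  const byPath = comparePaths(a.path, b.path);
  return byPath !== 0 ? byPath : Math.sign(a.offset - b.offset);
}

// A selection is backward when the anchor comes after the focus,
// that is, the user selected from right to left.
const isBackward = (sel) => comparePoints(sel.anchor, sel.focus) > 0;

// Direction-independent view of the same selection.
const edges = (sel) =>
  isBackward(sel)
    ? { start: sel.focus, end: sel.anchor }
    : { start: sel.anchor, end: sel.focus };

const forward = {
  anchor: { path: [0, 0], offset: 0 },
  focus: { path: [0, 0], offset: 3 },
};
const backward = { anchor: forward.focus, focus: forward.anchor };

console.log(isBackward(forward)); // false
console.log(isBackward(backward)); // true
console.log(edges(backward)); // same start and end as edges(forward)
```

Commands that only care about the covered content can work with the normalized start and end, while features such as extending the selection with Shift + arrow keys need to know which end is the focus.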

Range

Editors have the concept of Range whether or not they are based on contenteditable. A Range is a span of content, similar to an interval in mathematics. In fact, the browser's Selection is composed of Ranges, and selection.getRangeAt returns the Range object of the current selection.

Specifically, the Range provided by the browser describes a continuous range in the DOM tree. startContainer and startOffset describe the start of the Range, while endContainer and endOffset describe its end. When the start and end are at the same position, the Range is collapsed, which is the familiar cursor state. Selection is effectively a way to expose Ranges, and its properties are usually read-only, whereas a constructed Range object can be manipulated. By combining the Selection object and the Range object, we can perform operations on the selection, such as setting or clearing the current selection.

const selection = document.getSelection();
const range = document.createRange();
range.setStart(node, 0); // text node offset
range.setEnd(node, 1); // text node offset
selection.removeAllRanges();
selection.addRange(range);

Copy & Paste

Copying and pasting is a quite fundamental concept because in current rich text editors, we typically maintain a highly customized DOM structure. For example, when using a top-level heading, we may not use the H1 tag but instead replicate it using a div to avoid problems associated with nesting H1. However, this approach can lead to other issues.

First, for copying, we want the copied text/html nodes to follow standards: a top-level heading should be represented with H1. Since we maintain our own data structure and control how text/html is generated, and since L2 editors do not even have a DOM structure, we have to implement copying ourselves. Pasting matters even more, because the data model is maintained by us. When rich text is copied from elsewhere, we need to parse it into a data structure we can use, such as Quill's Delta model or Slate's JSON DOM model. For both copy and paste, therefore, we need to intervene and prevent the default behavior.

For copying, we can take the content of the current selection, serialize it, and assemble an HTML string. If navigator.clipboard and window.ClipboardItem are available, we can construct a Blob and write it directly. If they are not supported, we can fall back to the setData method of the copy event's clipboardData object. If that also fails, we write only text/plain and skip text/html; there are many possible fallback strategies.

// slate example // serializing
const editor = {
  children: [
    {
      type: 'paragraph',
      children: [
        { text: 'An opening paragraph with a ' },
        {
          type: 'link',
          url: 'https://example.com',
          children: [{ text: 'link' }],
        },
        { text: ' in it.' },
      ],
    },
    {
      type: 'quote',
      children: [{ text: 'A wise quote.' }],
    },
    {
      type: 'paragraph',
      children: [{ text: 'A closing paragraph!' }],
    },
  ],
}

// ===>

// <p>An opening paragraph with a <a href="https://example.com">link</a> in it.</p>
// <blockquote><p>A wise quote.</p></blockquote>
// <p>A closing paragraph!</p>
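A serializer for the structure above can be sketched as a small recursive function. The node types handled here (paragraph, quote, link) are just the ones in the example, and the escapeHtml helper is written inline for this sketch; treat it as an illustration rather than a complete implementation.

```javascript
// Escape text and attribute values before concatenating HTML.
const escapeHtml = (s) =>
  s
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");

function serialize(node) {
  // Leaf: a text node.
  if (typeof node.text === "string") return escapeHtml(node.text);

  // Element: serialize children first, then wrap them by type.
  const children = node.children.map(serialize).join("");
  switch (node.type) {
    case "paragraph":
      return `<p>${children}</p>`;
    case "quote":
      return `<blockquote><p>${children}</p></blockquote>`;
    case "link":
      return `<a href="${escapeHtml(node.url)}">${children}</a>`;
    default:
      return children;
  }
}

const doc = [
  {
    type: "paragraph",
    children: [
      { text: "An opening paragraph with a " },
      { type: "link", url: "https://example.com", children: [{ text: "link" }] },
      { text: " in it." },
    ],
  },
  { type: "quote", children: [{ text: "A wise quote." }] },
  { type: "paragraph", children: [{ text: "A closing paragraph!" }] },
];

const html = doc.map(serialize).join("\n");
console.log(html);
```

The resulting string is what would be written to the clipboard as text/html.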

For pasting, we need to listen to the paste event and use event.clipboardData.getData("text/html") to obtain the pasted text/html string. If it is not available, the text/plain content will do. If neither is available, there is effectively nothing to paste. When a text/html string is present, we can parse it with DOMParser and then construct the data structure we need.
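The fallback order described above can be sketched as follows. readPastedContent and fakeClipboard are hypothetical helpers for this illustration; the only thing the function assumes about its argument is a getData(type) method that returns an empty string when there is no data, as the clipboardData object on a paste event does.

```javascript
// Prefer text/html, fall back to text/plain, otherwise report nothing.
function readPastedContent(clipboardData) {
  const html = clipboardData.getData("text/html");
  if (html) return { type: "text/html", data: html };

  const text = clipboardData.getData("text/plain");
  if (text) return { type: "text/plain", data: text };

  return null;
}

// Stand-in for event.clipboardData so the sketch runs outside a browser.
const fakeClipboard = (entries) => ({
  getData: (type) => entries[type] || "",
});

console.log(
  readPastedContent(
    fakeClipboard({ "text/html": "<p>Hi</p>", "text/plain": "Hi" })
  )
); // { type: 'text/html', data: '<p>Hi</p>' }
console.log(readPastedContent(fakeClipboard({ "text/plain": "Hi" })));
// { type: 'text/plain', data: 'Hi' }
console.log(readPastedContent(fakeClipboard({}))); // null

// In a browser this would be wired up roughly like:
//   element.addEventListener("paste", (event) => {
//     event.preventDefault();
//     const content = readPastedContent(event.clipboardData);
//     if (content && content.type === "text/html") {
//       const dom = new DOMParser().parseFromString(content.data, "text/html");
//       // walk dom.body and build the editor's own data structure
//     }
//   });
```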

<!-- slate example --> <!-- deserializing -->
<p>An opening paragraph with a <a href="https://example.com">link</a> in it.</p>
<blockquote><p>A wise quote.</p></blockquote>
<p>A closing paragraph!</p>

<!-- ===> -->


```javascript
const fragment = [
  {
    type: 'paragraph',
    children: [
      { text: 'An opening paragraph with a ' },
      {
        type: 'link',
        url: 'https://example.com',
        children: [{ text: 'link' }],
      },
      { text: ' in it.' },
    ],
  },
  {
    type: 'quote',
    children: [
      {
        type: 'paragraph',
        children: [{ text: 'A wise quote.' }],
      },
    ],
  },
  {
    type: 'paragraph',
    children: [{ text: 'A closing paragraph!' }],
  },
]
```

History

History, also known as Undo/Redo, can be implemented in two ways: recording data or recording actions.

Recording data is similar to saving snapshots. Whenever the user performs an operation, whatever the change is, the entire content is saved onto a linear stack. On Undo/Redo, the content taken from the stack is restored in full. This approach trades space for time: we do not have to work out exactly which data the user changed, we simply record everything whenever there is a change. It is relatively simple to implement, but it consumes a lot of memory when the data volume is large.
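A minimal sketch of the snapshot approach, assuming the content is a plain string. SnapshotHistory is a hypothetical class for illustration only.

```javascript
class SnapshotHistory {
  constructor(initial) {
    this.current = initial;
    this.undoStack = [];
    this.redoStack = [];
  }

  // Save the whole previous content on every change.
  push(content) {
    this.undoStack.push(this.current);
    this.current = content;
    this.redoStack = []; // a new edit invalidates the redo stack
  }

  undo() {
    if (this.undoStack.length === 0) return this.current;
    this.redoStack.push(this.current);
    this.current = this.undoStack.pop();
    return this.current;
  }

  redo() {
    if (this.redoStack.length === 0) return this.current;
    this.undoStack.push(this.current);
    this.current = this.redoStack.pop();
    return this.current;
  }
}

const snapshots = new SnapshotHistory("");
snapshots.push("a");
snapshots.push("ab");
console.log(snapshots.undo()); // a
console.log(snapshots.redo()); // ab
```

Each entry on the stack is a full copy of the content, which is where the memory cost comes from.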

Recording actions saves the operations themselves, including the type of each operation and the data it modified, again on a linear stack. On Undo/Redo, the saved operations are inverted and applied. This approach trades time for space: only the type of the user's operation and the related data need to be recorded, not the entire content, which saves space but adds significant complexity. Since rich text operations are implemented through commands anyway, those commands can be stored as they are, so keeping an operation log fits the existing design better. A well-designed implementation of this part is also very helpful when implementing collaborative algorithms such as Operational Transform.
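A minimal sketch of the action-recording approach, again assuming plain string content. The operation shape ({ type, index, text }) and the OperationHistory class are hypothetical; the key point is that a delete operation stores the removed text, so every operation can be inverted.

```javascript
// Apply an insert or delete operation to a string.
function apply(content, op) {
  if (op.type === "insert") {
    return content.slice(0, op.index) + op.text + content.slice(op.index);
  }
  // delete: remove op.text.length characters starting at op.index
  return content.slice(0, op.index) + content.slice(op.index + op.text.length);
}

// Insert and delete are each other's inverse.
const invert = (op) => ({
  ...op,
  type: op.type === "insert" ? "delete" : "insert",
});

class OperationHistory {
  constructor(content) {
    this.content = content;
    this.undoStack = [];
    this.redoStack = [];
  }

  exec(op) {
    this.content = apply(this.content, op);
    this.undoStack.push(op);
    this.redoStack = []; // a new edit invalidates the redo stack
  }

  undo() {
    const op = this.undoStack.pop();
    if (!op) return this.content;
    this.content = apply(this.content, invert(op));
    this.redoStack.push(op);
    return this.content;
  }

  redo() {
    const op = this.redoStack.pop();
    if (!op) return this.content;
    this.content = apply(this.content, op);
    this.undoStack.push(op);
    return this.content;
  }
}

const log = new OperationHistory("Hello");
log.exec({ type: "insert", index: 5, text: " world" });
log.exec({ type: "delete", index: 0, text: "Hello " });
console.log(log.content); // world
console.log(log.undo()); // Hello world
console.log(log.undo()); // Hello
console.log(log.redo()); // Hello world
```

Only the small operation objects are stored, not the whole document, and the same operation log is a natural input for collaboration algorithms.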

Daily Question

https://github.com/WindrunnerMax/EveryDay
