Rich text editors usually refer to products that can edit text, images, etc., and have the ability to see what you get. For tags like Input
and Textarea
, they support content editing, but they do not support features like formatted text or inserting images, so a rich text editor is required to meet these needs. Nowadays, rich text editors are not limited to just text and images; they also include more complex functions such as videos, tables, code blocks, mind maps, attachments, formulas, and formatting brushes.
Rich text editors are actually a very deep field and are very difficult to implement. For example, how to handle the cursor, how to handle selections, and so on. Of course, with the help of the browser's capabilities, we can relatively easily implement similar functions, but this may lead to over-reliance on the browser and compatibility issues. In addition, there is a very strong competitor called Word
. When a product proposes a requirement that is modeled after Word
, it becomes a big challenge because the capabilities supported by Word
are extremely powerful. The large installation package is not without reason; it has handled a huge amount of rich text related cases internally.
Of course, what we are describing here is rich text implemented in the browser. It is not feasible to implement Word
-like functionality in the browser with just a few hundred KB or a few MB. Although implementing rich text editing capabilities in the browser is not easy, for us developers, we may prefer to use Markdown
to complete the writing of relevant documents. Of course, this does not fall within the scope of rich text editors, as Markdown
files are plain text files with the main focus on rendering. If you want to expand syntax or even embed React
components in Markdown
, you can refer to the markdown-it
and mdx
projects.
Web-based rich text editors are constantly evolving. Throughout the development process, they have encountered many difficulties, which has led to the developmental process being divided into three stages: L0
, L1
, and L2
. Of course, there is no good or bad here, only suitable or unsuitable. Usually, L1
editors already meet the needs of the vast majority of rich text editing scenarios, and there are many out-of-the-box rich text editors to choose from. The specific selection depends on the requirements.
Type | Features | Products | Advantages | Disadvantages |
---|---|---|---|---|
L0 |
1. Rich text editing based on contenteditable provided by the browser. 2. Executing commands using the browser's document.execCommand . |
Early lightweight editors. | Quickly complete development in a short period of time. | Very limited customization space. |
L1 |
1. Rich text editing based on contenteditable provided by the browser. 2. Data-driven, with custom data models and command execution. |
Shimo Document, Feishu Document. | Meets the vast majority of usage scenarios. | Unable to overcome the browser's own typesetting effects. |
L2 |
1. Self-implemented typesetting engine. 2. Only relies on a small amount of browser API . |
Google Docs, Tencent Document. | Completely control typesetting by themselves. | Technically very challenging. |
Earlier, the development of current editors was classified into three stages: L0
, L1
, and L2
. Some have even further divided the L0
stage into two stages, making a total of four stages. However, since these two stages are completely dependent on the implementation of contenteditable
and document.execCommand
, they are all grouped together under L0
.
Before the appearance of rich text editors, web browsers already had the capability to display rich text content through styling with HTML
and CSS
, but for user input, the <textarea />
and <input />
provided by the browser only allowed users to input plain text content, with very limited capabilities. Later, the appearance of contenteditable
gave nodes the ability to edit their HTML
structure, allowing us to directly edit rich text areas without needing to use <textarea />
or <input />
for input. Below is the simplest implementation of a text editor; just copy the code below into the browser's address bar to have a simple text editor. At this point, the only thing missing to have a rich text editor is the execution of document.execCommand
, where the main work is similar to completing a toolbar to execute commands and convert the format of selected text to another format.
Those who have dealt with text copy functionality should be familiar with the command document.execCommand("copy")
, which is a fallback solution when navigator.clipboard
is not available. MDN lists all the commands supported by document.execCommand
, showing that it supports parameters such as bold
and heading
. We can use a combination of contenteditable
and these parameters to implement a simple rich text editor.
Of course, if we need to combine command execution, we can't simply implement it with just one line of code. Here, using bold as an example, let's complete a DEMO
.
The approach of using document.execCommand
to modify the HTML
is simple, but its controllability is obviously poor. For example, when implementing the bold function, we cannot control whether to use <b></b>
or <strong></strong>
to achieve the bold effect. There are also browser compatibility issues; for example, in IE
, <strong></strong>
is used for bold, while in Chrome
, it is <b></b>
. Additionally, IE
and Safari
do not support implementing heading commands through execCommand
. document.execCommand
can only achieve relatively simple formatting, and it cannot handle more complex functions like images, code blocks, etc.
For greater extensibility and to resolve the issue of data and view mismatch, the rich text editor in L1
uses the concept of a custom data model. This model is a data structure derived from the DOM
tree, ensuring that the same data structure guarantees consistent rendering of HTML
. By controlling the data model directly with custom commands, the final rendering of the HTML
document is ensured to be consistent. In simple terms, it constructs a data model that describes the document structure and content, and uses custom execCommand
to modify the data model. Similar to the MVVM
model, when a command is executed, the current model is modified, reflecting the rendering in the view.
The rich text editor in L1
addresses the issues of dirty data and the difficulty in implementing complex functions in rich text. Through data-driven development, it can better meet the needs of customized features, cross-platform parsing, online collaboration, and so on. Here, a L1
rich text DEMO
based on slate
has been implemented, Github | Editor DEMO.
In reality, L1
already meets the needs of most scenarios. Only when there is a deep need for customization of layout and cursor would it be necessary to consider using L2
customization. However, in most cases, my suggestion is to re-evaluate the requirements.
Regarding the need for customized layout and cursor, let's take an example: many users who are familiar with Word
might have noticed that if we write text that perfectly fills a line and then add a period, the preceding words will be slightly adjusted to accommodate the period, allowing it to remain on the same line. However, if we add another character, it will cause a line break. This kind of text layout is not achievable in browser rendering. Therefore, in order to overcome the limitations of browser rendering, custom layout capabilities need to be implemented.
Users who have used CodeMirror5
might have noticed that in the default configuration of CodeMirror
, except for the layout capabilities, all other aspects have their own implementation. For example, the cursor position is simulated using a div
, input is managed through a Textarea
with a following cursor, text selection is covered by a div
, and the scrollbar is also custom implemented. This definitely has the vibe of L2
. However, it's not fully L2
because it still relies on the browser for text layout and cursor positioning using Range
. Yet, such a nearly fully self-implemented solution is already quite challenging.
Fully implementing L2
capabilities is no less difficult than developing a browser from scratch because it requires supporting various layout capabilities of browsers, making it highly complex. Moreover, there is a significant performance overhead in this area. Creating a layout engine from scratch involves constantly calculating the position of each character, which is a considerable computational task. If the performance is poor, it can lead to rendering lag during layout, resulting in a very poor user experience. The mainstream L2
rich text editors currently rely on Canvas
to draw all content. Since Canvas
is just a drawing board, all aspects such as layout, selection, cursor, etc., need to be calculated and implemented from scratch. Due to the large amount of calculations, much of the computational workload is often offloaded to Web Workers
to improve performance through data communication using postMessage
.
Similar to a game character breaking through a bottleneck period, the transition from L0
to L1
in a rich text editor involved the extraction of a custom data model, while the transition from L1
to L2
involved the development of a custom layout engine.
The core concepts here mainly refer to some universal concepts in the L1
rich text editor. Because in L1
, the editor usually maintains its own set of data structures and rendering schemes, there will generally be a self-built model system, such as Quill
's Parchment
, Blot
, Delta
, and so on, as well as Slate
's Transforms
, Normalizing
, DOM DATA MODEL
, and so on. However, as long as it relies on the implementation of the browser and contenteditable
, it cannot do without some basic concepts.
The Selection
object represents the text range selected by the user or the current position of the insertion point, representing the text selection on the page, which may span multiple elements and is generated by the user dragging the mouse over the text. When the Selection
is in a Collapsed
state, it is what is commonly referred to as a cursor, which means that the cursor is actually a special state of Selection
. In addition, it is important to note that there is no necessary connection with the focus
event or values like document.activeElement
.
Because it's still running in the browser, implementing a rich text editor still needs to rely on changes in this selection area. Usually, when the selected text content changes, the SelectionChange
event will be triggered, and some operations can be completed through the callback of this event.
In the editor, the selection area is usually not directly used to perform the desired operations. For example, in Quill
, the selection is represented by the starting position and length, which is mainly to match its Delta
to describe the document model, so in Quill
, the operation from Selection
selection area to Delta
selection area is completed in this way, in order to obtain a selection of Delta
that we can manipulate.
In Slate
, many concepts from the DOM
are used, such as Void Element
, Selection
, and so on. In Slate
, the selection area is also processed, and it is also because its data structure is a JSON
data type similar to the DOM MODEL
structure, so its Point
is represented by Path + Offset
, and its selection area is therefore composed of two Points
. In addition, compared to Quill
, Slate
retains the order of the user's selection from left to right or from right to left, which means that selecting the same area from left to right and from right to left is different, specifically, the anchor
and focus
are reversed.
Whether based on contenteditable
or beyond contenteditable
, editors will have the concept of Range
. Translated into English, Range
means range or extent, similar to the concept of intervals in mathematics, that is, Range
refers to a content range. In fact, the Selection
in the browser is composed of Range
, and we can use selection.getRangeAt
to obtain the Range
object of the current selection.
Specifically, the Range
provided by the browser is used to describe a continuous range in the DOM
tree. startContainer
and startOffset
are used to describe the start of the Range
, while endContainer
and endOffset
describe the end of the Range
. When the start and end of a Range
are at the same position, the Range
is in a Collapsed
state, which is our common cursor state. In fact, Selection
is a way to represent Range
, and Selection
is usually read-only, but the constructed Range
object can be manipulated. By combining the Selection
object and the Range
object, we can complete some operations on the selection, such as adding or canceling the current selection.
Copying and pasting is a quite fundamental concept because in current rich text editors, we typically maintain a highly customized DOM
structure. For example, when using a top-level heading, we may not use the H1
tag but instead replicate it using a div
to avoid problems associated with nesting H1
. However, this approach can lead to other issues.
Firstly, for copying, we want the copied text/html
nodes to adhere to standards; a top-level heading should be represented using H1
. Since we maintain our own data structure and have control over the text/html
generation, particularly in L2
editors where there isn't even a DOM
structure, we have to implement copying ourselves. As for pasting, we are more concerned because the current data model is usually maintained by us. Therefore, when we copy rich text from elsewhere, we need to parse it into a data structure that can be used by us, such as Quill
's Delta
model or Slate
's JSON DOM
model. Thus, for the copy and paste behavior, we also need to intervene and prevent the default behavior.
For the copy behavior, we can get the current selection's content during copying, serialize it, concatenate the HTML
string, and then, if possible, use the navigator.clipboard
and window.ClipboardItem
objects to directly construct a Blob
for writing. If not supported, we can fallback by using the setData
method of the clipboardData
object of the onCopy
event to set it to the clipboard. If it still cannot be completed, we simply write in text/plain
and not in text/html
; there are many fallback strategies.
As for the paste behavior, we need to listen to the onPaste
event and use event.clipboardData.getData("text/html")
to obtain the pasted text/html
string. Of course, if it's not available, then getting the text/plain
would suffice. If neither is available, it's as if we pasted loneliness. If there is a text/html
string, we can utilize DOMParser
to parse the string and then construct the data structure we need.
History
, also known as Undo/Redo
operation, has two implementation methods: recording data and recording actions.
The operation of recording data is similar to saving snapshots. When the user performs an operation, regardless of any changes, the entire content is saved and a linear stack is maintained. When performing the Undo/Redo
operation, the content to be restored from the stack is completely presented. This operation is similar to trading space for time. We don't have to consider exactly which data the user has changed, we just record all possible changed parts when there is a change. This approach is relatively simple to implement, but if the data volume is large, it consumes a lot of memory.
Recording actions saves the actual operations, including specific operation actions and the data modified by the actions, also maintaining a linear stack. When performing the Undo/Redo
operation, the saved actions are reversed. This method is similar to trading time for space. Each time, only the type of user operation and the related operation data need to be recorded, without storing the entire content, saving space. However, the complexity is increased significantly. Since rich text operations are actually implemented through commands, we can completely store these contents, and maintaining a method of saving operation records is more in line with the current design. Additionally, a well-designed implementation of this part will be very helpful for implementing collaborative algorithms such as Operational Transform
.