Previously, we implemented streaming output with SSE and a RAG service based on vector indexing, both of which fall under AI Infra. Now let's combine Markdown parsing with rich text editor rendering on top of SSE streaming output to build an incremental parsing algorithm for editors, which is also part of infrastructure work in the document domain.
In the SSE streaming output scenario, the LLM outputs Markdown text gradually. In a basic scenario we only need to render it to the DOM. In the context of a rich text editor, however, things are more complex: editors usually maintain their own data structure and cannot accept a DOM structure directly. Taking performance into account as well, we need to address the following points:
Even in the case of basic DOM rendering, there are many boundary cases to consider: performance problems such as fully re-rendering on every piece of Md output from SSE, and problems such as re-rendering image nodes causing them to reload, all need to be addressed. The incremental parsing algorithm discussed here is therefore valuable even for basic scenarios.
Moreover, in more complex scenarios during SSE streaming output, the user always sees temporarily incomplete Md text. Before the complete Md text has been output, a fragment may parse as non-standard Md or be mismatched against another Md syntax. Such situations require additional algorithmic processing to implement syntax-correction solutions.
Furthermore, tasks like rendering syntax extensions and custom components involve additional syntax handling. For custom parsing and rendering implementations, `react-markdown` provides a good reference: its pipeline is quite extensive and serves as a good example of custom AST processing. The `remark` ecosystem is also rich, offering various implementations related to Markdown parsing.
However, we are not going to discuss custom syntax parsing here. The focus is mainly on basic Md syntax parsing and incremental rendering, although handling mismatched syntax during the parsing process is unavoidable.
For related implementations, you can refer to BlockKit and StreamDelta. Additionally, unit tests are available to verify the implementation results and various important boundary cases.
Let's first implement incremental parsing for Markdown so that it can be parsed progressively. The key premise that makes incremental parsing possible is that, in Md output, later formatting generally does not affect earlier formatting, which allows stable content to be archived. Based on this, we can design the data processing around the following core principle to adhere to during parsing:
- The `Delta` is constructed based on the structure parsed by the `Lexer`.

Initially, we need either a lexical analyzer to parse Md into a stream of `Token`s, or a syntax parser to transform Md into an AST. For our flat data structure, secondary parsing of a standard flat `Token` stream is not overly complex; for a fully nested data structure, however, it may be more convenient to use a syntax parser to generate an AST when parsing Md.
There are various implementations of popular parsers at present. `marked` provides the `lexer` method and the `marked.Lexer` class as its parser. `remark`, part of the extensive `unified` ecosystem, offers `mdast-util-from-markdown` as an independent parser. Similarly, `markdown-it` provides the `parse` and `parseInline` methods as its parsers.
Here we chose `marked` as the parser, mainly because it is simple and easy to understand. The `remark` family's ecosystem is rather extensive, but as a standard AST parser it is a very good choice. `markdown-it`'s `Token` parser is slightly more complex: it does not nest directly but marks structure with tags such as `heading_open`.
As the name suggests, the `lexer` provided by `marked` leans towards lexical analysis, but it is not a pure lexical analyzer, since its output also contains nested elements. It is not a standard AST structure either, but rather a mixed structure closer to a `Token` stream implementation. The parsed structure of a piece of Markdown text looks roughly as follows:
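For illustration, a hedged sketch based on `marked`'s lexer output (abridged; the exact fields vary slightly across versions):

```ts
import { marked } from "marked";

// A rough sketch of what marked.lexer returns for a small fragment; abridged,
// and the exact fields vary slightly across marked versions.
const tokens = marked.lexer("# Title\n\nHello **world**");
// [
//   { type: "heading", raw: "# Title\n\n", depth: 1, text: "Title", tokens: [...] },
//   { type: "paragraph", raw: "Hello **world**", text: "Hello **world**",
//     tokens: [
//       { type: "text", raw: "Hello ", text: "Hello " },
//       { type: "strong", raw: "**world**", text: "world", tokens: [...] },
//     ] },
// ]
```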
During the parsing process, we need to maintain several variables that hold the current parsing state: `content` holds the segment currently being parsed, since accepting streaming data means continuously receiving new fragments to assemble; `indexIds` uses index values as stable keys mapped to `id` values; and `archiveLexerIndex` holds the index of the already archived top-level `Token` nodes.
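A minimal sketch of this state, keeping the names above; the surrounding types are illustrative:

```ts
// A minimal sketch of the lexer-side parser state; the field names content,
// indexIds and archiveLexerIndex follow the article, the rest is illustrative.
interface StreamParserState {
  /** Markdown text that has not been archived yet. */
  content: string;
  /** Stable keys (e.g. `${index}-${depth}`) mapped to generated block ids. */
  indexIds: Map<string, string>;
  /** Index of the last archived top-level token. */
  archiveLexerIndex: number;
}

const state: StreamParserState = {
  content: "",
  indexIds: new Map(),
  archiveLexerIndex: 0,
};
```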
When new content is appended, we simply concatenate `content` with the appended text, and with `marked`'s lexer we obtain the `Token Tree`. In the scheduling framework we only iterate over top-level nodes; parsing only top-level nodes simplifies the data we have to handle, and nested nodes can be processed in a further recursive pass.
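A sketch of the append step under those assumptions:

```ts
import { marked } from "marked";

// A sketch of the append step, assuming the StreamParserState shape above:
// new chunks are concatenated into the unarchived content and re-lexed.
function appendChunk(state: StreamParserState, chunk: string) {
  state.content = state.content + chunk;
  // Only top-level tokens are scheduled here; nested tokens are handled
  // recursively when a block is converted into the editor structure.
  return marked.lexer(state.content);
}
```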
Next, we need to handle the archive position. Archiving means we consider a `Token` final and will not process it again, so the held index advances. `block.raw` is the raw text of the `Token` being parsed, which is a simplification `marked` affords us; otherwise we would have to slice the text manually by index. To archive text, we simply remove that portion from `content` directly.
After the archive indices are handled, we can process the archivable portions inside the loop. The strategy is quite simple: if the node in the loop has a preceding node, the preceding node can be archived. This archiving is not ideal, however, especially when list, table and similar nodes are involved, which require special handling; we will come back to this later.
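A sketch of this pass (the list/table guards mentioned above are omitted here):

```ts
import { marked } from "marked";

// A sketch of the "archive the previous block" strategy: if a later top-level
// block exists, the preceding block is considered stable and is archived.
function archivePass(state: StreamParserState) {
  const blocks = marked.lexer(state.content);
  while (blocks.length > 1) {
    const prev = blocks.shift()!;
    state.archiveLexerIndex += 1;                           // advance the archive index
    state.content = state.content.slice(prev.raw.length);   // drop the archived raw text
  }
  return blocks; // only the still-volatile tail remains
}
```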
Then comes parsing the current `Token`, and this part has to be adapted to the editor's own data structure. In the current editor we treat a `Token` as a top-level node, yet it may not correspond exactly to an editor line. For instance, a `list` node may contain nested `list_item` nodes, while our structure is a purely flattened line structure, so independent adaptation is required.
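A sketch of flattening a `list` token into a quill-delta style line structure; the `list` attribute name is illustrative:

```ts
import Delta from "quill-delta";
import type { Tokens } from "marked";

// A sketch of flattening a list token into line-level content, assuming a
// flat structure where each line ends with a "\n" carrying the line attributes.
function listToDelta(token: Tokens.List): Delta {
  const delta = new Delta();
  for (const item of token.items) {
    // The item's inline content would be converted recursively; kept as plain
    // text here for brevity.
    delta.insert(item.text);
    delta.insert("\n", { list: token.ordered ? "ordered" : "bullet" });
  }
  return delta;
}
```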
When converting a `Token` into `Delta` fragments, let's take the `heading` line format and the `bold` inline format as examples to briefly explain the process. For line nodes, the order of execution is crucial: all inline nodes must be processed recursively first, and only then the line node itself. Whether for line nodes or inline nodes, the attributes are modified in place in an encapsulated manner.
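A sketch of the heading plus bold case under the same assumptions; the `header` and `bold` attribute names are illustrative:

```ts
import Delta from "quill-delta";
import type { Tokens } from "marked";

// A sketch of converting a heading token: inline nodes (e.g. strong/bold) are
// handled first, then the trailing "\n" carries the line-level heading attribute.
function headingToDelta(token: Tokens.Heading): Delta {
  const delta = new Delta();
  for (const child of token.tokens) {
    if (child.type === "strong") {
      delta.insert(child.text, { bold: true }); // inline format as op attributes
    } else {
      delta.insert(child.raw); // other inline tokens kept as plain text here
    }
  }
  delta.insert("\n", { header: token.depth }); // line format carried by the newline
  return delta;
}
```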
Furthermore, if HTML nodes need to be parsed as well, an independent HTML parser such as `parse5` or `htmlparser2` has to be introduced to parse the HTML fragments. `parse5` strictly handles the nested structure of the DOM, so `htmlparser2` is better suited to handling loose HTML fragments. Another common approach is to treat all the data as HTML and parse it into an HTML AST.
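A minimal illustration with `htmlparser2`, which tolerates loose fragments:

```ts
import { Parser } from "htmlparser2";

// A minimal sketch of walking a loose HTML fragment with htmlparser2; the
// collected structure here is illustrative.
function parseFragment(html: string) {
  const nodes: Array<{ tag?: string; text?: string }> = [];
  const parser = new Parser({
    onopentag: (name) => nodes.push({ tag: name }),
    ontext: (text) => nodes.push({ text }),
  });
  parser.write(html);
  parser.end();
  return nodes;
}

parseFragment("<u>under<b>line</b>"); // unclosed tags are tolerated
```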
With the implementation above we can already parse the Markdown part incrementally. Although this strategy is mostly effective, two issues can arise during streaming output. First, transient nodes during streaming input can lead to incorrect archiving, as with indented lists. Second, syntax can be matched incorrectly, as when an indented unordered list item is interpreted as a heading.
Therefore it is essential to handle these cases, focusing on correcting mismatched syntax rather than completing incomplete syntax. Let's first look at the list indentation issue: for a parent item `- 1` followed by a nested item `   - 1.1` indented by three spaces, applying the strategy above directly would mishandle the `1` node on the first line, leaving the `1.1` node without its indentation format, since there is no nested list token.
Let's analyze the problem. During streaming output, the `1.1` line goes through a transient state in which only its three leading spaces of indentation have been output. According to the strategy above, if a preceding token exists and a current token exists, the preceding token is archived; before the `-` character arrives, the tokens parse roughly as follows.
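A hedged sketch, abridged (exact token fields vary by `marked` version):

```ts
import { marked } from "marked";

// Roughly what the lexer sees for the transient text "- 1\n   " — the nested
// item's "-" has not arrived yet. Abridged; exact fields vary by version.
const transient = marked.lexer("- 1\n   ");
// [
//   { type: "list", raw: "- 1\n", items: [{ type: "list_item", text: "1", ... }] },
//   { type: "space", raw: "   " },
// ]
```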
After archiving, only the `1.1` node remains to be parsed. Since the preceding `1` node has already been archived, `1.1` is now treated as a new `list` and `list_item` node. Because there are no nested nodes, the content itself still parses, but the indentation format is lost.
To address this issue, the parsed tokens need extra processing. In the case above, we can regard the presence of the `space` node as the cause of the premature archiving of the preceding list node; the `space` node is therefore treated as a temporary state and must not trigger archiving of the preceding node.
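A sketch of the corresponding guard:

```ts
import type { Token } from "marked";

// A sketch of the guard: a trailing "space" token is treated as a transient
// state and must not cause the preceding block to be archived.
function canArchive(blocks: Token[], index: number): boolean {
  const next = blocks[index + 1];
  if (!next) return false;                 // nothing after it yet: keep it volatile
  if (next.type === "space") return false; // transient trailing whitespace
  return true;
}
```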
As for the second issue, erroneous syntax matching can also be quite noticeable, because a wrong match leads to inconsistent styling. These states are likewise transient and should be corrected as parsing continues, so that the structure ends up matched correctly and no abrupt style changes appear during output. In the example below, for instance, the `-` of an unordered list is transiently matched as a setext heading underline, so the preceding line is parsed as a heading.
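Roughly, the transient state in question looks like this (behavior hedged; it follows from the two spec rules quoted below):

```ts
import { marked } from "marked";

// The follow-up transient state: once the nested item's "-" arrives but nothing
// after it yet, the text inside the list item reads like a setext heading
// ("1" underlined by "-"), so it is matched as a heading rather than a nested list.
const transient = marked.lexer("- 1\n   -");
// the "1" line is transiently matched as a heading token inside the list item
```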
Actually, this format is not problematic in itself, because it is exactly what the standard specifies; it is just that using `---` to set a heading is not commonly seen in practice. Naturally, we can consult the format definition in the standard, in the `Setext headings` section, where the relevant definitions can be found:
The setext heading underline can be preceded by up to three spaces of indentation, and may have trailing spaces or tabs - Example 86
A list item can contain a heading - Example 300
Therefore, the solution to this issue is also quite simple: avoid the occurrence of this temporary state, which is the principle we should follow when dealing with both issues. We can use a regular expression to match the temporary state and, if it matches, remove that line and wait for the correct format to appear later.
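A sketch of this correction; the exact pattern is an assumption for illustration (a trailing, optionally indented, bare `-` line):

```ts
// A sketch of stripping the transient setext-underline state before lexing.
// The pattern is illustrative: an optionally indented single "-" at the very
// end of the buffered content.
const TRANSIENT_SETEXT = /\n {0,3}-[ \t]*$/;

function stripTransient(content: string): string {
  return TRANSIENT_SETEXT.test(content)
    ? content.replace(TRANSIENT_SETEXT, "\n")
    : content;
}

stripTransient("- 1\n   -"); // => "- 1\n"
```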
After the incremental Markdown parsing, we need to map the parse result onto the editor's own data structure, and this main flow has to cooperate with the Md parsing process. Similarly, we follow the Md parsing flow and implement archive indexing, reconstruction of the parsed data, and related methods.
Several variables are also needed for the streaming parse here. First, the `Delta` of the `Token` currently being processed, which must correspond to an independent top-level `Token` node, i.e. the main node matched during the Md parse. Second, the length of the archived `Delta` and the text-length index into the `Delta` array, which are indexed differently from the Md parse.
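A possible shape of this state; only `archiveLength` is named in the text, the rest is illustrative:

```ts
import Delta from "quill-delta";

// A sketch of the Delta-side streaming state, mirroring the lexer-side state.
interface StreamDeltaState {
  /** Delta built from the top-level token currently being (re)parsed. */
  current: Delta;
  /** Text length of the content that has already been archived. */
  archiveLength: number;
}
```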
When appending content, we need to apply the latest `Token` output to the editor. However, the editor has already had the earlier content applied, so we must compute the difference between the already applied content and the target content, so that only the change is applied to the editor.
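A minimal sketch of this diff-then-apply step with quill-delta's `diff`, assuming both contents are insert-only documents:

```ts
import Delta from "quill-delta";

// A sketch of applying only the change: diff the content already applied to the
// editor against the newly parsed target, and hand just that change to the editor.
function applyIncrement(applied: Delta, target: Delta, apply: (change: Delta) => void): Delta {
  const change = applied.diff(target);
  apply(change);                    // the editor only receives the change
  return applied.compose(change);   // new "applied" snapshot for the next chunk
}
```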
The implementation of archiving is relatively straightforward: archive the current `Delta` and update the archive index. Since in theory only `insert` operations should exist, the `length` method can be called directly to obtain the current `Delta`'s length, although it is still advisable to check the operation type here.
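A sketch of this archive step under the `StreamDeltaState` assumptions above:

```ts
import Delta from "quill-delta";

// A sketch of archiving the current Delta: in theory it contains only inserts,
// so length() gives the archived text length, but the op types are checked anyway.
function archiveCurrent(state: StreamDeltaState) {
  if (!state.current.ops.every((op) => op.insert !== undefined)) {
    console.warn("unexpected non-insert op while archiving");
  }
  state.archiveLength += state.current.length();
  state.current = new Delta();
}
```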
Next, we need to integrate the composition of the `Delta` into the Md parsing process. First, we obtain the current archive index, which needs to be converted into a `retain` so that the indices align. Then, while looping through the archiving process, we keep updating the index information, mainly because the subsequent step is merging the diff, whose length is not the same as that of `dc`.
The final `compose` step merges the `Delta` converted from the remaining unarchived `Token`s and ultimately returns the delta that represents the change introduced by the currently appended content, i.e. the result of the `diff` mentioned earlier. Finally, the `merge` step merges this change with the preceding `retain` pointer to keep the indices correct.
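A rough sketch of this composition under the quill-delta assumptions used above; `tokenToDelta` stands in for the per-token converters sketched earlier, and the real compose/merge methods may be shaped differently:

```ts
import Delta from "quill-delta";
import type { Token } from "marked";

// tokenToDelta stands in for the per-token converters sketched earlier (heading, list, ...).
declare function tokenToDelta(token: Token): Delta;

// The archived prefix is expressed as a retain so indices align, and the
// still-volatile tail is diffed against what was previously built, yielding
// the change to hand to the editor.
function composeIncrement(state: StreamDeltaState, tailBlocks: Token[]): Delta {
  const tail = tailBlocks.reduce(
    (acc, block) => acc.concat(tokenToDelta(block)),
    new Delta()
  );
  const change = new Delta()
    .retain(state.archiveLength)         // skip the stable archived prefix
    .concat(state.current.diff(tail));   // rewrite only the volatile tail
  state.current = tail;
  return change;
}
```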
In practice, this implementation requires thorough testing to ensure stability, especially when handling the various edge cases, and it is essential to maintain a test suite so that content that previously parsed correctly does not regress. In the unit tests we mainly focus on the following types of inputs and outputs:
- Md text input, tested mainly for the correctness of parsing all `Token`s.

Although our current editor does not implement block-level structure nesting, such structures are usually unavoidable, for example code blocks and tables. Whether it is simple block-structure nesting or a full `Blocks`-mode implementation, there is usually a unique block-level structure identified by an `id` value.
During streaming output, the problem of `id` values being regenerated arises easily, especially in complex structures such as tables. When archiving is implemented only for top-level nodes, every cell may be re-parsed and given a new `id` value. Maintaining a mapping table for `id` values during parsing is therefore crucial to avoid the performance cost of repeated re-rendering.
Maintaining the `id` mapping table requires a stable key. Without a stable key it is impossible to tell whether a previously parsed `id` and the current `id` should match. Here we use the combination of the `index` and `depth` values as the key mapped to the `id` values of complex structures, which is the role of the `indexIds` mentioned earlier.
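A sketch of such a lookup, assuming the `indexIds` map from the earlier state sketch:

```ts
// A sketch of the stable-key id map: the key combines the top-level token index
// and the nesting depth, so a re-parsed cell keeps the id it was given before.
function getStableId(state: StreamParserState, index: number, depth: number): string {
  const key = `${index}-${depth}`;
  let id = state.indexIds.get(key);
  if (!id) {
    id = Math.random().toString(36).slice(2, 10); // illustrative id generator
    state.indexIds.set(key, id);
  }
  return id;
}
```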
Typically we do not allow the user to edit during streaming output, so it is enough to set `readonly` to `true` and switch it back to `false` once the output is complete. If users do need to edit while the output is streaming, however, the problem becomes more complex.
For the editing mode, resolving the data conflicts caused by the two sources of input is usually done with the OT algorithm. If the editor's data structure supports a CRDT model, the problem should be simpler, since it theoretically works with relative edit positions; the index conflicts we have to resolve stem from absolute positions.
In an OT implementation, the most crucial part is the `transform` method. Reading the meaning of `transform` in collaboration terms, `b' = a.t(b)` means that if `a` and `b` both originate from the same draft state, then `b'` is what `b` should be transformed into on the basis of `a`.
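For example, with quill-delta's `transform`:

```ts
import Delta from "quill-delta";

// A tiny illustration: a and b are edits made from the same draft; b' is b
// rewritten so that it can be applied after a.
const a = new Delta().retain(3).insert("X"); // streaming change
const b = new Delta().retain(5).insert("Y"); // user change
const bPrime = a.transform(b, true);         // true: a is considered to happen first
// bPrime retains 6 before inserting "Y", accounting for a's insertion
```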
We can also view `transform` as resolving the impact of operation `a` on operation `b`, similar to the `undoable` handling in the `History` module: `transform` resolves the effect between `Opa` and `Opb` on the local client, and here it handles the interaction between the streaming output and the user's input.
This strategy also applies to incremental streaming on top of an existing document, that is, performing incremental Md parsing against an existing document, which is even more meaningful for our editor. Imagine a scenario similar to the editing mode of an IDE, where the user wants content added to or removed from an existing document; in that case the content can be applied directly rather than through a pop-up dialog.
Since we record `archiveLength`, changes that fall within `archiveLength` are considered stable structure, while changes beyond `archiveLength` are considered temporary state. Handling OT therefore becomes much simpler here, and we split the calculation into two parts:
- Within `archiveLength`, we simply add the index values of `insert` operations and subtract those of `delete` operations.
- Beyond `archiveLength`, we construct a `retain` operation based on the index and then merge the user's change into the current `Delta`.
Here we need to clarify a basic principle: `archive` refers to stable content, so only the lengths expressed by `retain` in a change should count towards the index, while the lengths of `insert` and `delete` operations should be treated as the lengths of the actual changed content. When calculating the archive index value, therefore, we only count the length of `retain` and treat the other operations as changed-content lengths.
One more point: although user input in theory mostly produces `retain + insert/delete` operations, to be safe we should handle every op type completely. We therefore use `retain` to locate the cut point: operations to its left count as changes to the archive length, while those to its right count as changes to the current `Delta`, as sketched below.
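Under the assumptions above, a simplified sketch of this split; it handles the common `retain + insert/delete` shape, and ops straddling the cut point would additionally need to be divided:

```ts
import Delta from "quill-delta";

// Edits inside the archived prefix only shift archiveLength, while edits beyond
// it are composed into the volatile Delta behind a retain.
function applyUserChange(state: StreamDeltaState, change: Delta) {
  let position = 0; // document position, advanced only by retain
  for (const op of change.ops) {
    if (typeof op.retain === "number") {
      position += op.retain;
    } else if (typeof op.insert === "string") {
      if (position < state.archiveLength) {
        state.archiveLength += op.insert.length; // grows the stable prefix
      } else {
        state.current = state.current.compose(
          new Delta().retain(position - state.archiveLength).insert(op.insert, op.attributes)
        );
      }
    } else if (typeof op.delete === "number") {
      if (position < state.archiveLength) {
        state.archiveLength -= op.delete;        // shrinks the stable prefix
      } else {
        state.current = state.current.compose(
          new Delta().retain(position - state.archiveLength).delete(op.delete)
        );
      }
    }
  }
}
```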
In this article we implemented lexical parsing for Md and, on top of it, incremental `Token` archiving and processing, combining the whole Md pipeline with incremental rendering into the editor's data structure. We also addressed specific syntax-matching issues and editor details such as `id` indexing and the editing mode. On this basis we arrived at an incremental rich text parsing algorithm for Markdown.
Furthermore, since editor data structures are usually schemes maintained individually by each editor, we leaned towards business-level code rather than a generic parsing pattern, so adaptation is needed for different business scenarios. Abstracting a lower-level implementation out of the Md parsing process is nevertheless a fairly common pattern and can indeed yield a universal algorithm.
In fact, one more thing is worth considering at this point. If the goal is to stream content from scratch rather than to update existing content incrementally, we could also render the streaming output as plain HTML and swap in the editor once the stream completes. This is a viable solution that avoids many of the complex implementations above, but the cost shifts to maintaining an extra set of pure-rendering styles that match the editor's styles, so that the user experience stays consistent.