When developing an online document system, document export deserves careful consideration, especially for complex B2B products that are deployed privately, where delivering a private copy of the documents becomes important. Mature online document systems also involve many complex scenarios that depend on export capabilities. In this article, we walk through a plugin-based design for exporting documents to MarkDown, Word, and PDF, using the Quill rich text editor engine.
Recently, a friend shared an interesting story with me. One of their major enterprise clients asked for remote printer support so that documents could be printed directly from the online document system. The reason was quite valid: the company's top management prefers reading physical copies to staring at computer screens. To retain this major client, the feature had to be prioritized. Indeed, this capability belongs in a complete online document SaaS system.
Although our online document product is primarily offered as SaaS, we can also provide services as a PaaS platform. The scenario is quite clear: the data structure stored in our document system is usually custom-made. When users want to initialize document content from locally generated MarkDown templates, we need to offer import capabilities. Likewise, when users want to convert documents back into MarkDown templates, we need export capabilities. Cross-platform data migration and collaboration usually require various data conversion capabilities exposed through OpenAPI, all fundamentally based on our data structure design.
For data transformation, we can pick one generic data structure as the baseline and perform the various format conversions from it. In a document system, the generic structure that is cheapest to adopt is usually HTML, which can serve as the basis for data transformation and has many open-source implementations to draw on. This approach is cheap to implement, although it may not be as efficient. The discussion here therefore converts and exports data from our own baseline data structure, a DSL (Domain Specific Language), specifically the flat rich text description DSL of quill-delta. When designing the conversion model, a plugin-based design is essential, because we cannot guarantee that the document system will never add new block types. Each conversion design discussed below has a related example at https://github.com/WindrunnerMax/QuillBlocks/tree/master/examples.
In our work, we may encounter scenarios where users wish to embed online documents into their product's website, serving as API documentation or help center documents. Due to cost considerations, most help centers are built on MarkDown, since maintaining a rich text product is relatively expensive. As a PaaS product, we need to provide data conversion capabilities. While offering an SDK to directly render our data structure is a product capability, in many cases it is hard for users to invest manpower in migrating their document rendering. Therefore, direct data conversion is the most cost-effective approach.
Product documentation is gradually shifting from MarkDown to rich text. As developers, we commonly write documents in MarkDown, so launching a product with an MD renderer is reasonable at first. However, as the product iterates and the user base grows, operations teams and professional technical writing teams come into play, especially for products maintained both domestically and internationally. We developers may only write the initial content, while operations teams maintain and update it. Operations teams typically do not write in MD, and a document repository managed with Git is even harder for them to accept. A WYSIWYG online document product becomes essential in such cases. Maintaining an online document product is costly, so most teams opt to integrate document middleware, and the capabilities mentioned above become vital.
As an online document PaaS, we must provide not only conversion of our data to MD but also import from MD. Common scenarios include users who write document templates in MD and import them into the document system, and products that are already online, have no operations team yet, and still write their documents in MD. These products use our document SDK renderer, so every content update has to go through our PaaS platform, which makes converting data into our DSL crucial. If we position ourselves as a PaaS product, we must keep up compatibility with various scenarios and systems, in line with the idea of middleware. This article, however, focuses on outbound data conversion rather than on import.
So now we officially start converting data to MD. First, we need to consider one issue: different MD parsers support different levels of syntax. Take the most basic case, line breaks: some parsers treat a single newline as a new paragraph, while others require two trailing spaces followed by a newline, or two newlines, before they recognize one. A plugin-based design is what lets us handle such compatibility issues. The next question is that MD is a lightweight format description while our DSL is a complex one. We have a wide variety of block structures, so we also need HTML to help express complex formats. Then why not convert directly to HTML instead of mixing it with MD? This is again for the sake of compatibility. Users' MD setups may involve different plugins, and mixing in HTML may cause style discrepancies. Combining complex styles can be cumbersome, especially when React components are mixed in, as in the MDX implementation. Therefore, we choose MD as the base and HTML as an aid for data conversion.
Earlier, we mentioned that our blocks are quite complex and involve many nested structures; in HTML terms, think of a code block nested inside a table. However, the quill-delta data structure is flat, so we need to convert it into a nested structure that is easier to process. Converting it completely into a tree would add considerable complexity, so we compromise: we wrap the data in an outer Map structure and construct the nesting dynamically by key when reading data.
Next, we need to choose the unit of processing. A document is essentially composed of paragraph formats and inline formats, so we can split it into two parts: line formats and inline formats. Mapped onto delta, this means a Line nests Ops and carries its own line format, such as heading or alignment. Together with the DeltaSet structure, this gives us three parts that describe the data we want to convert.
For DeltaSet, we need to define the entry Zone, which is marked as "ROOT" for the delta structure. In the following DEMO, we only define nesting for the CodeBlock block level, so the example below handles the nested expression of code blocks. Since the original data structure is flat, we need to handle the boundary conditions, that is, the start and end of the code block structure. When we encounter a code block, we point the current processing Zone to a new delta block and establish a reference in the original structure by putting a zoneId identifier in the op. When the code block ends, we restore the pointer to the previous target Zone. Multi-level nested blocks would normally require a stack, but we will not go into that here.
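The boundary handling described above can be sketched roughly as follows. This is a minimal illustration, not the code from delta-set.ts: the `Line` and `DeltaSet` shapes, the `code-N` zone naming, and the use of Quill's `code-block` line attribute as the boundary signal are all assumptions made for the example.

```typescript
// Hypothetical minimal types; the article's own DeltaSet lives in delta-set.ts.
type Attrs = Record<string, unknown>;
type Op = { insert: string; attributes?: Attrs };
type Line = { ops: Op[]; attrs: Attrs };
type DeltaSet = Map<string, Line[]>;

const ROOT = "ROOT";

// Split flat ops into lines. In quill-delta the "\n" op carries the line format.
function toLines(ops: Op[]): Line[] {
  const lines: Line[] = [];
  let current: Op[] = [];
  for (const op of ops) {
    const breakOnly = /^\n+$/.test(op.insert);
    const parts = op.insert.split("\n");
    for (let i = 0; i < parts.length; i++) {
      // Crossing a "\n" closes the current line.
      if (i > 0) {
        lines.push({ ops: current, attrs: breakOnly ? (op.attributes ?? {}) : {} });
        current = [];
      }
      const text = parts[i];
      if (text) {
        current.push(op.attributes ? { insert: text, attributes: op.attributes } : { insert: text });
      }
    }
  }
  if (current.length > 0) lines.push({ ops: current, attrs: {} });
  return lines;
}

// Build the Map: ROOT holds top-level lines, each code block gets its own Zone.
function toDeltaSet(ops: Op[]): DeltaSet {
  const root: Line[] = [];
  const set: DeltaSet = new Map<string, Line[]>([[ROOT, root]]);
  let target = root; // lines of the Zone currently being filled
  let count = 0;
  for (const line of toLines(ops)) {
    const isCode = Boolean(line.attrs["code-block"]);
    if (isCode && target === root) {
      // Start boundary: open a new Zone and leave a reference op in ROOT.
      const zoneId = `code-${count++}`;
      root.push({ ops: [{ insert: " ", attributes: { zoneId } }], attrs: {} });
      target = [];
      set.set(zoneId, target);
    } else if (!isCode && target !== root) {
      // End boundary: restore the pointer to the previous Zone.
      target = root;
    }
    target.push(line);
  }
  return set;
}
```

Because only one level of nesting is handled, a single `target` pointer is enough; deeper nesting would replace it with a stack of Zones.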
Now that the data is ready, it's time to design the conversion system. As mentioned earlier, the converter deals with two kinds of formats, line and inline, so the plugin system is also divided into two parts. For MD, everything comes down to string concatenation, so the main output of the plugins is strings. An important point is that the same Op may carry multiple formats. For example, a piece of text could be both bold and italic, and those two formats are handled by two different plugins. Plugins therefore should not output the final result directly but should contribute a prefix and a suffix to be concatenated. This is especially important for line formats, where HTML tags are needed for expression. In addition, when it is certain that no further nesting can occur, as with images, a last flag can mark the final node and skip unnecessary checks.
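A minimal sketch of the prefix/suffix idea, with the plugin shape and the way an image is modeled as an attribute both chosen only for illustration (they are assumptions, not the types used in delta-to-md.ts):

```typescript
// Hypothetical shapes for illustration only.
type Attrs = Record<string, unknown>;
type Wrap = { prefix?: string; suffix?: string; last?: boolean };
type InlineHandler = (attrs: Attrs) => Wrap | null;

const bold: InlineHandler = attrs => (attrs.bold ? { prefix: "**", suffix: "**" } : null);
const italic: InlineHandler = attrs => (attrs.italic ? { prefix: "_", suffix: "_" } : null);
// An image cannot contain other inline formats, so it marks itself as last.
const image: InlineHandler = attrs =>
  typeof attrs.image === "string" ? { prefix: `![](${attrs.image})`, last: true } : null;

function renderOp(text: string, attrs: Attrs, handlers: InlineHandler[]): string {
  let prefix = "";
  let suffix = "";
  for (const handler of handlers) {
    const wrap = handler(attrs);
    if (!wrap) continue;
    // Prefixes grow to the right, suffixes grow to the left, so markers nest.
    prefix = prefix + (wrap.prefix ?? "");
    suffix = (wrap.suffix ?? "") + suffix;
    if (wrap.last) break; // no further plugins need to be checked
  }
  return prefix + text + suffix;
}

const handlers = [image, bold, italic];
const boldItalic = renderOp("hi", { bold: true, italic: true }, handlers);
const picture = renderOp("", { image: "a.png", bold: true }, handlers);
```

Here `boldItalic` evaluates to `**_hi_**`, because each plugin only contributes its own markers, and `picture` stops at the image plugin without consulting the bold plugin.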
Since some nodes must be expressed in HTML and our iteration works like recursive string concatenation, we need a flag that tells plugins to emit HTML instead of MD markup. For example, if a line node is center-aligned, every node within that line must be emitted as HTML tags. The flag must be reset at the start of each line iteration so that earlier content does not affect later content.
For the plugin types, passing the adjacent descriptions into each call during iteration is useful, for example for list formats that need extra blank lines before and after the list, and for merging inline formats so that one description block does not produce multiple tags. Each plugin also needs a unique identifier, because, as mentioned earlier, we must stay compatible with multiple scenarios. Plugins are processed in order, so a plugin priority is needed as well to display styles correctly: for example, when quote and list line formats are stacked, the quote format must be processed before the list.
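The scheduling rules above (unique key, priority ordering, and a per-line HTML flag) can be sketched as below. The `LineScheduler` class, the assumption that larger priorities run first, and the attribute names are hypothetical, and inline handling is left out to keep the example short.

```typescript
// Hypothetical scheduler sketch; names and shapes are assumptions.
type Attrs = Record<string, unknown>;
type Line = { text: string; attrs: Attrs };
type Tag = { isHTML: boolean };
type Wrap = { prefix: string; suffix: string };
type LinePlugin = {
  key: string; // unique id: registering the same key again replaces the plugin
  priority: number; // assumption: larger numbers run first
  match: (line: Line) => boolean;
  processor: (line: Line, tag: Tag) => Wrap;
};

class LineScheduler {
  private plugins = new Map<string, LinePlugin>();

  register(...list: LinePlugin[]): this {
    for (const plugin of list) this.plugins.set(plugin.key, plugin);
    return this;
  }

  render(lines: Line[]): string {
    const ordered = Array.from(this.plugins.values()).sort((a, b) => b.priority - a.priority);
    return lines
      .map(line => {
        const tag: Tag = { isHTML: false }; // reset for every line
        let prefix = "";
        let suffix = "";
        for (const plugin of ordered) {
          if (!plugin.match(line)) continue;
          const wrap = plugin.processor(line, tag);
          prefix += wrap.prefix;
          suffix = wrap.suffix + suffix;
        }
        return prefix + line.text + suffix;
      })
      .join("\n");
  }
}

const align: LinePlugin = {
  key: "ALIGN",
  priority: 30,
  match: line => line.attrs.align === "center",
  processor: (_line, tag) => {
    tag.isHTML = true; // everything else on this line must now emit HTML
    return { prefix: '<div align="center">', suffix: "</div>" };
  },
};
const quote: LinePlugin = {
  key: "QUOTE",
  priority: 20,
  match: line => Boolean(line.attrs.blockquote),
  processor: (_line, tag) =>
    tag.isHTML ? { prefix: "<blockquote>", suffix: "</blockquote>" } : { prefix: "> ", suffix: "" },
};
const list: LinePlugin = {
  key: "LIST",
  priority: 10,
  match: line => line.attrs.list === "bullet",
  processor: () => ({ prefix: "- ", suffix: "" }),
};

// Registration order does not matter; priority decides that quote runs before list.
const markdown = new LineScheduler().register(list, quote, align).render([
  { text: "item", attrs: { blockquote: true, list: "bullet" } },
  { text: "Hi", attrs: { align: "center", blockquote: true } },
  { text: "plain", attrs: { blockquote: true } },
]);
```

The third line comes out as plain `> plain` even though the second line switched to HTML, which is the effect of resetting the flag per line.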
With the scheduler in place, we can focus on implementing the plugins. Let's take the heading plugin as an example of the transformation logic. This part is very simple: it only needs to inspect LineAttributes to determine the return value.
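A heading plugin along these lines might look like the sketch below. It assumes Quill's `header` line attribute with a numeric level from 1 to 6 and a simplified plugin shape; the real plugin in delta-to-md.ts may differ.

```typescript
// Hypothetical plugin shape; only the attribute check is the point here.
type LineAttributes = Record<string, unknown>;
type Tag = { isHTML: boolean };
type LineResult = { prefix: string; suffix: string };

const HeadingPlugin = {
  key: "HEADING",
  match: (attrs: LineAttributes): boolean => attrs.header !== undefined,
  processor: (attrs: LineAttributes, tag: Tag): LineResult | null => {
    const level = Number(attrs.header);
    // Ignore values that are not valid heading levels.
    if (!Number.isInteger(level) || level < 1 || level > 6) return null;
    // Inside an HTML context (for example a centered line) emit a tag instead.
    if (tag.isHTML) return { prefix: `<h${level}>`, suffix: `</h${level}>` };
    return { prefix: `${"#".repeat(level)} `, suffix: "" };
  },
};
```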
For inline plugins, the logic is similar. Here, we take the bold plugin as an example to implement the conversion logic. Similarly, it only needs to check OpAttributes to determine the return value.
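As a rough counterpart for inline formats, a bold plugin could be sketched as follows. The plugin shape and the acceptance of both `true` and `"true"` as attribute values are assumptions for illustration.

```typescript
// Hypothetical inline plugin shape, mirroring the line plugin idea.
type OpAttributes = Record<string, unknown>;
type Tag = { isHTML: boolean };
type InlineResult = { prefix: string; suffix: string };

const BoldPlugin = {
  key: "BOLD",
  match: (attrs: OpAttributes): boolean => attrs.bold === true || attrs.bold === "true",
  processor: (_attrs: OpAttributes, tag: Tag): InlineResult =>
    tag.isHTML ? { prefix: "<strong>", suffix: "</strong>" } : { prefix: "**", suffix: "**" },
};

// Small helper showing how a scheduler would apply the plugin to one op.
function applyBold(text: string, attrs: OpAttributes, tag: Tag): string {
  if (!BoldPlugin.match(attrs)) return text;
  const { prefix, suffix } = BoldPlugin.processor(attrs, tag);
  return prefix + text + suffix;
}
```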
In https://github.com/WindrunnerMax/QuillBlocks/blob/master/examples/ there is a complete DeltaSet data conversion, delta-set.ts, and a complete MarkDown data conversion, delta-to-md.ts, both of which can be tested using ts-node. You may have noticed that this scheduler can not only produce MD but can also perform a complete HTML conversion. With the HTML conversion logic in place, we have a very common intermediate product from which various files can be generated. Moreover, if the plugins are changed to a synchronous mode, the same solution can handle the copy behavior of online documents, so it is quite versatile in practice. In actual use, it is crucial to test the plugins thoroughly. Test cases should be accumulated during development to avoid unknown issues caused by later modifications, especially in complex business scenarios that combine multiple plugins and in scenarios of full synchronous updates, where boundary test cases matter most.
Earlier, we discussed data conversion compatibility for a PaaS platform. For a SaaS platform, the ability to generate delivery documents directly is indispensable, especially when the product is deployed privately and several online versions must be supplied. Word is one of the most common delivery formats, and it is particularly useful when the exported document needs further editing. In this section, we discuss how to generate delivery documents in Word format.
OOXML, which stands for Office Open XML, is the document format Microsoft introduced with Office 2007. Since Office 2007, Word, Excel, and PowerPoint default to the OOXML format, which has also been standardized by ECMA as ECMA-376. In practice, we can change the extension of a current Word document to zip and unzip it to inspect the packaged data. Inside, we find the components of a docx file.
- [Content_Types].xml: Defines the content type of each file, for example marking whether a file is an image (.jpg) or textual content (.xml).
- _rels: Typically contains .rels files that save the relationships between different Parts, describing the associations between files, such as the connection between text and images.
- docProps: Contains the property information of the entire Word document, such as author, creation time, and tags.
- word: Stores the main content of the document, including text, images, tables, and styles, among others.
  - document.xml: Stores all text and references to the text.
  - styles.xml: Stores all styles used in the document.
  - theme.xml: Saves the theme settings applied to the document.
  - media: Stores all media files used in the document, such as images.
With all these descriptions, one might be perplexed about how to actually assemble a Word file given the complex relationships. Since it is hard to grasp the composition of an entire docx file at once, we can rely on frameworks to generate docx files. After investigating some frameworks, I found roughly two methods of generation. One generates from a common HTML format, such as html-docx-js, html-to-docx, and pandoc; the other controls generation directly in code and skips the HTML conversion step, such as officegen and docx. Noting that many libraries have not been updated for years, and aiming for direct docx output without an intermediate HTML step, especially for online document deliveries that require strict formatting control, I opted for docx to generate Word files.
docx simplifies the generation of the entire Word file: by composing the hierarchy of its built-in objects, we can build the final file. It also runs in both Node and browser environments, so the demo for this section comes in both versions. Let's take the Node version as an example of how to generate a Word file. First, we need to define styles. Word has a styles pane, similar to CSS classes: a style is defined once and referenced when generating the document, rather than being defined on every node individually.
Next comes unit conversion. Word mostly works in PT, whereas the browser usually works in PX. Our demo mainly deals with image sizes, DXA (twentieths of a point), and proportions, so we list only the unit conversions that are used.
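The conversions follow from fixed ratios: at the CSS reference density of 96 px per inch, one inch is 72 pt, and one point is 20 DXA. A small sketch, where the function names and the `fitWidth` helper are hypothetical and not taken from the demo:

```typescript
const PX_PER_INCH = 96; // CSS reference pixel density
const PT_PER_INCH = 72;
const DXA_PER_PT = 20; // DXA is one twentieth of a point

const pxToPt = (px: number): number => (px * PT_PER_INCH) / PX_PER_INCH;
const ptToDxa = (pt: number): number => pt * DXA_PER_PT;
const pxToDxa = (px: number): number => ptToDxa(pxToPt(px));

// Scale an image down proportionally so that it fits the available width.
function fitWidth(size: { width: number; height: number }, maxWidth: number) {
  if (size.width <= maxWidth) return { ...size };
  const ratio = maxWidth / size.width;
  return { width: maxWidth, height: Math.round(size.height * ratio) };
}
```

For instance, `pxToPt(16)` gives 12 and `pxToDxa(96)` gives 1440, that is, one inch expressed in DXA.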
Similar to the MD conversion, we also need to define the logic for the conversion dispatch. One difference is that in MD the output is a string and offers great flexibility, whereas docx imposes strict object structure relationships. We therefore need to define the relationship between line and inline types strictly, and the Tag that is passed around needs to carry more content.
The plugin input design is similar to MD, but the output needs to be more precise. Inline element plugins must output inline object types, and line element plugins must output line object types. Note that line plugins receive the leaves parameter, which means that the composition of inline and line elements is handled by the line plugins rather than by the external Zone scheduling module.
Next is the entry Zone scheduling function. Unlike the earlier MD scheduling, here we must handle the leaf nodes, that is, the inline styles, first. The crucial point is that a Paragraph object cannot wrap a Table object. If we need to implement a block structure, the outer element must be a Table rather than a Paragraph. In other words, the content of the inline elements determines the format of the line element: A affects B, so we handle A, the inline elements, first. In addition, only one plugin will match a single block structure, so handling that is shared across plugins needs to be encapsulated into generic functions.
Next, we also need to define plugins. Here we take the text plugin as an example to implement the conversion logic. Since basic text styles are encapsulated in the TextRun object, we only need to handle the properties of the TextRun object. For other Run type objects such as ImageRun, we still need to define separate plugins.
For line type plugins, we take the paragraph plugin as an example to implement the conversion logic. The paragraph plugin is the fallback that applies when no other paragraph format matches. The problem mentioned earlier, that a Paragraph object cannot wrap a Table element, also has to be handled here, because our block-level expressions are implemented with the Table object. If the leaf nodes contain no block element, simply return a paragraph element. If a block element is matched and it is the only element, promote it and return it directly. If a block element is matched and there are other elements, wrap all elements in a block element before returning. In fact, this part of the logic should be encapsulated and called by all inline element plugins to ensure compatibility in parsing. Otherwise, if there are nesting issues, the generated Word document cannot be opened.
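The three cases can be sketched with stand-in types. `Run`, `Paragraph`, and `Table` below are plain objects that only imitate the roles of the docx library's TextRun, Paragraph, and Table classes; they are assumptions for illustration, not the library API.

```typescript
// Stand-in shapes, not the docx classes.
type Run = { type: "run"; text: string };
type Paragraph = { type: "paragraph"; children: Run[] };
type Table = { type: "table"; cells: Array<Paragraph | Table> };
type Leaf = Run | Table;

function composeLine(leaves: Leaf[]): Paragraph | Table {
  const runs = leaves.filter((leaf): leaf is Run => leaf.type === "run");
  // Case 1: no block element among the leaves, so a plain paragraph is enough.
  if (runs.length === leaves.length) return { type: "paragraph", children: runs };
  // Case 2: a single block element is promoted and returned directly.
  const [only] = leaves;
  if (leaves.length === 1 && only && only.type === "table") return only;
  // Case 3: mixed content is wrapped in a block; adjacent runs share a paragraph.
  const cells: Array<Paragraph | Table> = [];
  let pending: Run[] = [];
  for (const leaf of leaves) {
    if (leaf.type === "run") {
      pending.push(leaf);
      continue;
    }
    if (pending.length > 0) {
      cells.push({ type: "paragraph", children: pending });
      pending = [];
    }
    cells.push(leaf);
  }
  if (pending.length > 0) cells.push({ type: "paragraph", children: pending });
  return { type: "table", cells };
}
```

With this shape, a Table never ends up inside a Paragraph, which is the nesting the paragraph above warns about.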
Next, let's discuss headers and footers. In Word, a common header shows the title of the current page in the upper right corner, which is quite an interesting feature. Word achieves this through fields, and with the expression available in OOXML and the encapsulation in docx, we can implement field-like expressions as well. The field commonly used for referencing titles is STYLEREF, and simply assembling the field string is enough. A typical footer displays the page number in the lower right corner or in the center. No field referencing is needed for this part, so displaying the page number is simple and the main concern is positioning.
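As an illustration of what "assembling the string" means, the sketch below builds a STYLEREF field instruction and the raw OOXML runs of a complex field around it. The helper names are hypothetical, the style name "Heading 1" is an assumption (it must match a style defined in the document), and the demo itself goes through the docx library rather than raw XML.

```typescript
// Escape text so it is safe inside an XML text node.
const escapeXml = (value: string): string =>
  value.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");

// Field instruction: STYLEREF pulls text from a paragraph that uses the named style.
const styleRefInstruction = (styleName: string): string =>
  `STYLEREF "${styleName}" \\* MERGEFORMAT`;

// A complex field in OOXML is a run sequence: begin, instruction, separate, result, end.
function fieldRuns(instruction: string, placeholder = ""): string {
  return [
    '<w:r><w:fldChar w:fldCharType="begin"/></w:r>',
    `<w:r><w:instrText xml:space="preserve"> ${escapeXml(instruction)} </w:instrText></w:r>`,
    '<w:r><w:fldChar w:fldCharType="separate"/></w:r>',
    `<w:r><w:t>${escapeXml(placeholder)}</w:t></w:r>`,
    '<w:r><w:fldChar w:fldCharType="end"/></w:r>',
  ].join("");
}

const headerField = fieldRuns(styleRefInstruction("Heading 1"));
```

The placeholder result is left empty here because Word computes the field value itself when it lays out the document.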
Another crucial feature in Word is the table of contents. Consider this question: have you noticed that so far we have never mentioned loading fonts? To know on which page a word or paragraph is rendered in Word, we would need font metrics so that we could lay out the text and determine the page where each title appears. Since we have not even loaded fonts, it is evident that rendering and layout do not happen when we generate the document, but when the user opens it. Thus, after adding a table of contents, the user may be prompted on opening the file to update the fields in the document. This is because the table of contents is a field whose content is generated and updated only by Word, and we cannot fill it in ourselves when generating the file.
In https://github.com/WindrunnerMax/QuillBlocks/blob/master/examples/ you can find the complete Word data conversion files delta-to-word.ts and delta-to-word.html. You can test them with ts-node and by opening the HTML file in a browser. Converting data into a Word document is indeed quite complex, with many details to handle, especially for rich text content such as multi-level block nesting, rendering of flowcharts and images, table cell merging, and dynamic content. A comprehensive Word export capability requires continuous adaptation to edge cases and thorough unit testing to stay stable.
Apart from Word, PDF is essential for the delivery capabilities of our SaaS platform. For many documents that need to be printed, PDF is the better choice because it is a fixed-layout format that avoids layout differences across devices. We can think of a PDF as an advanced image: just as an image keeps its layout on every device, a PDF keeps its layout while still carrying rich content. In this section, we discuss how to generate delivery documents in PDF format.
There are two main ways to generate PDF. The first converts HTML to PDF. A common variant uses libraries like dom-to-image or html2canvas to convert HTML into an image and then the image into a PDF, but the text in the result cannot be selected and it loses clarity when zoomed in. Another common variant uses Puppeteer, which offers high-level APIs to control Chromium over the DevTools protocol and can export PDF files. Printing directly from the frontend with window.print, or with react-to-print, which uses an iframe for partial printing, is also feasible. The second way lays out the PDF manually. Working with PDF this way is similar to working with Canvas, where everything is drawn; for example, a table can be drawn from rectangles. Popular libraries for this approach include pdfkit, pdf-lib, and pdfmake.
Here we focus on generating PDF directly from our delta data. Although converting through an intermediate format such as MD, HTML, or Word is feasible, we opt for direct output. Since the PDF data format is too complex to understand fully in a short time, we again rely on libraries. We chose pdfmake, which creates PDF files from a JSON configuration, so the conversion is essentially from one JSON to another. For the Outline/Bookmark, after extensive research I chose pdf-lib to generate the outline.
OOXML, which is used for Word, is a descriptive markup: it contains no drawing instructions and reads more like static markup that must be laid out by the client. PDF, in contrast, contains the actual drawing instructions, using operators that derive from the PostScript page description language (PDL) to describe text, vector graphics, and images directly in the document. When a PDF file is opened, all drawing instructions are already present, so the viewer renders directly from them without performing layout on the client.
To keep the document identical across platforms, PDF files typically embed their fonts, so font files must be loaded when creating a PDF. Note that many fonts require commercial licenses, although there are open-source options such as Source Han Serif and Source Han Sans. The font configuration covers the normal, bold, italics, and bolditalics styles. Installing fonts-noto-cjk on the server and referencing it directly is another option. CJK fonts tend to be large, so subsetting the font before embedding is preferred.
In pdfmake, we can achieve something similar to the Word styles pane through preset styles. Of course, a PDF file cannot be edited directly, so the style presets here mainly make it easier to implement the different styles.
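A minimal sketch of preset styles in a pdfmake-style document definition. The `content`, `styles`, `defaultStyle`, `fontSize`, `bold`, and `margin` keys are pdfmake's; the local types, the style names `h1` and `h2`, and the `unknownStyles` helper are assumptions added so that the example stays self-contained.

```typescript
// Minimal local types (assumption); pdfmake ships fuller type definitions.
type Style = {
  fontSize?: number;
  bold?: boolean;
  italics?: boolean;
  margin?: [number, number, number, number]; // [left, top, right, bottom]
};
type TextNode = { text: string; style?: string };
type DocDefinition = {
  content: TextNode[];
  styles: Record<string, Style>;
  defaultStyle: Style;
};

const definition: DocDefinition = {
  content: [
    { text: "Export design", style: "h1" },
    { text: "Plugins", style: "h2" },
    { text: "Body text falls back to defaultStyle." },
  ],
  // Presets are declared once and referenced by name, like a style pane.
  styles: {
    h1: { fontSize: 22, bold: true, margin: [0, 10, 0, 6] },
    h2: { fontSize: 18, bold: true, margin: [0, 8, 0, 4] },
  },
  defaultStyle: { fontSize: 12 },
};

// Hypothetical guard: report style names that content uses but styles lacks.
function unknownStyles(def: DocDefinition): string[] {
  const used = def.content
    .map(node => node.style)
    .filter((style): style is string => typeof style === "string");
  return used.filter(style => !Object.prototype.hasOwnProperty.call(def.styles, style));
}
```

A check like `unknownStyles` is optional, but it catches typos in style names before the definition is handed to the PDF generator.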
For the transformation scheduling module, as with the Word scheduling module, we need to define the relationship between line and inline types, as well as the content passed in the Tag. Type control in pdfmake is quite loose, which makes the required nested formats easy to achieve, but illegal nesting is only detected at runtime. Ideally, this validation should move into the type definitions. For example, ContentText cannot actually have ContentImage as a child element, yet the type definition allows it. We can define such nesting relationships more strictly.
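One way to tighten the nesting at the type level is sketched below. These are local, hypothetical types that mirror the `text`, `image`, `stack`, and `columns` keys of pdfmake content; they are not pdfmake's own definitions.

```typescript
// Inline content may only nest further inline content.
type InlineText = { text: string | InlineText[]; bold?: boolean; italics?: boolean };
// Block-level content; images are blocks and can never sit inside text.
type ImageBlock = { image: string; width?: number };
type TextBlock = { text: string | InlineText[]; style?: string };
type StackBlock = { stack: Block[] };
type ColumnsBlock = { columns: Block[] };
type Block = TextBlock | ImageBlock | StackBlock | ColumnsBlock;

const valid: Block = {
  stack: [{ text: [{ text: "a", bold: true }, { text: "b" }] }, { image: "logo.png" }],
};

// The compiler now rejects an image nested inside text.
// @ts-expect-error an image is not assignable to InlineText
const invalid: Block = { text: [{ image: "logo.png" }] };

// Walk the structure; the narrowing relies on the distinct keys of each variant.
function plainText(node: Block | InlineText): string {
  if ("stack" in node) return node.stack.map(child => plainText(child)).join("\n");
  if ("columns" in node) return node.columns.map(child => plainText(child)).join(" ");
  if ("image" in node) return "";
  return typeof node.text === "string"
    ? node.text
    : node.text.map(child => plainText(child)).join("");
}
```

The `@ts-expect-error` line documents the rejected nesting: if the types were loosened again, the directive itself would become a compile error.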
For the plugin definitions, we continue with the previously designed types. The pattern is the same: adjacent block structures and the Tag are still the inputs, and line plugins additionally receive the leaf node data. The plugin definition keeps key for plugin overloading, priority for plugin ordering, match for the matching rule, and processor for the processing function. The outputs remain the two kinds of block structures, which shows that the earlier design is quite general.
The entry Zone scheduling function is quite similar to the one for Word. Since an individual block structure has no nesting among its formats, all format configuration of the same type can be handled by the same plugin, so here too exactly one plugin is matched. Leaf nodes are also processed first, because their content determines the nested block format of the line element.
Next, we define plugins. Here we take the text plugin as an example to implement the conversion logic. Basic text styles are encapsulated in the ContentText object, so we only need to handle the properties of the ContentText object. For other Content type objects such as ContentImage, we still need to define separate plugins.
For line type plugins, let's take the paragraph plugin as an example to implement the conversion logic. The paragraph plugin is the fallback that applies when no other paragraph format matches. The nesting relationship of Content objects mentioned earlier also has to be handled here. First, an empty line must be emitted as a \n, because an empty object or array produces no line break. A single Zone needs no wrapping; for example, a CodeBlock block-level structure can be merged directly into the main document. Several structures side by side, such as tables and images, need to be wrapped in a Table/Columns structure to display correctly. Unlike OOXML, incorrect nesting does not prevent the file from opening; it only affects how the related content area is displayed.
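These rules can be sketched with local stand-in types that mirror the `text`, `image`, `stack`, and `columns` keys of pdfmake content. The `composeContent` name and the choice of `columns` for the mixed case are assumptions for illustration.

```typescript
// Stand-in shapes, not pdfmake's own type definitions.
type Inline = { text: string; bold?: boolean };
type BlockContent = { image: string } | { stack: unknown[] };
type Leaf = Inline | BlockContent;
type LineContent = { text: string | Inline[] } | BlockContent | { columns: LineContent[] };

const isInline = (leaf: Leaf): leaf is Inline => "text" in leaf;

function composeContent(leaves: Leaf[]): LineContent {
  // An empty line must still produce a line break.
  if (leaves.length === 0) return { text: "\n" };
  // Only inline leaves: a plain text paragraph.
  const inlines = leaves.filter(isInline);
  if (inlines.length === leaves.length) return { text: inlines };
  // A single block (for example a code block Zone) is merged without wrapping.
  const [only] = leaves;
  if (leaves.length === 1 && only && !isInline(only)) return only;
  // Several structures side by side are wrapped so that they lay out correctly.
  return { columns: leaves.map(leaf => (isInline(leaf) ? { text: [leaf] } : leaf)) };
}
```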
Next, let's discuss how to generate the Outline/Bookmark. The Outline usually refers to the outline shown on the left side when a PDF is opened. pdfmake does not directly support generating an Outline, so we need another library for this. After researching for a long time, I found pdf-lib, which can process an existing PDF file and add an Outline to it. In this example, navigation in the generated Outline is achieved through an id system. Another approach is to use pdfjs-dist to parse and store the page and position of each heading in the PDF, and then write the Outline with pdf-lib. Generating an Outline is also very useful in combination with Puppeteer, primarily because Chromium does not support generating an Outline when exporting a PDF, so adding one with pdf-lib is a good complementary capability.
The complete PDF data conversion can be found at https://github.com/WindrunnerMax/QuillBlocks/blob/master/examples/ in delta-to-pdf.ts and delta-to-pdf.html, along with pdf-with-outline.ts for adding Outlines. You can test them with ts-node and by opening the HTML file in a browser. When testing with ts-node, pay attention to the font references. Converting data into a PDF is a complex task that various open-source projects make much easier. However, in a production environment, a comprehensive PDF export capability still requires continuous adaptation to edge cases, and thorough unit testing is essential to keep the feature stable.