I'm thrilled to share that the installation count for my browser extension has finally passed one thousand: 1.7k+ installs on Firefox Add-ons and 1k+ on the Chrome Web Store. In fact, the Firefox statistics reflect a weekly average of installations, and the actual installs on any given day tend to exceed that average significantly. The Chrome Web Store, for its part, only reports a rounded figure once installs pass 1k, so the actual number there is likely higher than 1k as well.
Before developing the extension, I had implemented a script with related functionality, which has garnered 2,688k+ installs on GreasyFork. There were two main reasons for creating the extension. First, I wanted to learn extension development; I had found real-world uses for it at work, especially when specific tasks required bypassing browser limitations. Second, I discovered that some code I had published on GreasyFork under the GPL license had been repackaged into a plugin that included ads, and it surprisingly amassed 400k+ installs.
Hence, I used my scripting skills to develop this browser extension, primarily for learning purposes. I built the entire development environment from scratch and addressed numerous compatibility issues. Next, let's delve into those challenges and their solutions. You can find the project repository on GitHub; if you enjoy it, please give it a star! 😁
As mentioned earlier, we built the development environment from the ground up, which meant choosing a bundler for the extension. I opted for rspack, although webpack or rollup would work just as well; I prefer rspack because I'm more familiar with it and it offers faster build times, and the configuration is quite similar across all of these tools. It's also worth noting that we use a build-level (watch mode) packaging approach here; a dev server isn't particularly suitable for v3 extensions at this time.
A key point to remember when developing a browser extension is that we need to define multiple entry files, each bundled into a single output file; we must avoid generating multiple chunks for a single entry. This applies to CSS files as well: one entry, one output. Moreover, the output filenames should not include hash suffixes, which would make the files impossible to locate. None of this is a major concern as long as you pay attention to your configuration.
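As a point of reference, a minimal sketch of such a configuration might look like this; the entry names, paths, and random inject filename are illustrative assumptions rather than the project's exact setup:

```ts
// rspack.config.ts: a minimal sketch; entry names and paths are illustrative.
import path from "path";
import type { Configuration } from "@rspack/core";

// A fresh random name for the inject script on every build (see below).
const INJECT_FILE = `inject.${Math.random().toString(36).slice(2, 10)}`;

const config: Configuration = {
  entry: {
    popup: "./src/popup/index.tsx",
    content: "./src/content/index.ts",
    worker: "./src/worker/index.ts",
    [INJECT_FILE]: "./src/inject/index.ts",
  },
  output: {
    filename: "[name].js", // no hash suffix: one predictable file per entry
    path: path.resolve(__dirname, "dist"),
  },
  optimization: {
    splitChunks: false, // never split an entry into multiple chunks
    runtimeChunk: false,
  },
};

export default config;
```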
You'll notice that the output filename for INJECT_FILE is dynamic. Since the inject script needs to be injected into the browser page, the injection approach can cause naming conflicts, so the filename generated by each build differs and changes with every release. The names used for simulated event communication are likewise uniquely generated each time.
Chrome has been strongly pushing v3 extensions, which means manifest_version must be specified as 3. However, submitting a version with manifest_version: 3 to Firefox triggers a warning against its usage. Personally, I also prefer to avoid v3, because its numerous restrictions make many functionalities difficult to implement properly; we'll discuss this further later. Since Chrome mandates v3 while Firefox recommends v2, we need a compatibility scheme covering both the Chromium and Gecko engines.
In fact, this resembles a multi-platform build scenario, where we package the same code for multiple platforms. For cross-platform compilation, my go-to method has been process.env together with __DEV__. However, conditional compilation of this kind, with extensive process.env.PLATFORM === xxx checks, easily leads to deep nesting that hurts readability. After all, Promise exists to solve the problem of asynchronous callback hell; reintroducing nesting for the sake of cross-platform compilation hardly seems like a wise solution.
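To make the problem concrete, here is a hedged sketch of what such env-based branching tends to look like once conditions accumulate; the platform names are illustrative:

```ts
// Env-based conditional compilation quickly nests: the bundler inlines
// process.env.PLATFORM, but the branches stay in the source.
if (process.env.PLATFORM === "chromium") {
  if (process.env.NODE_ENV === "development") {
    // chromium + development behavior
  } else {
    // chromium + production behavior
  }
} else {
  // gecko behavior, with the same inner branching repeated
}
```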
In C/C++ there's an interesting preprocessor known as the C Preprocessor. It isn't part of the compiler itself but runs as a separate step in the compilation process. Simply put, the C Preprocessor is a text-replacement tool that substitutes macros directly into the source text, guiding the compiler to complete the necessary preprocessing before actual compilation. Directives such as #include, #define, and #ifdef all belong to the preprocessor. Here we focus mainly on the conditional-compilation directives: #if/#endif, #ifdef/#endif, and #ifndef/#endif.
We can implement something similar with build tools. The C Preprocessor is a preprocessing tool that takes no part in the actual compilation, which makes it quite similar to a loader in webpack, and its raw-text replacement can be fully reproduced inside a loader. We can use comments to simulate directives like #ifdef and #endif, which avoids the deep-nesting problem entirely. The string-replacement logic is also easy to control: for instance, we can remove lines that don't match the platform condition while retaining those that do, achieving the same effect as #ifdef/#endif. Comments also help in complex scenarios. I've encountered intricate SDK packaging where internal and external behaviors differed so much that cross-platform setups would otherwise require multiple builds, leaving users to configure the build tools themselves; comment directives allowed complete packaging without users adjusting loader configurations. That said, this is closely tied to specific business scenarios and serves merely as a reference.
Initially I considered processing the source directly with regexes, but that proved cumbersome, particularly with nesting, where the logic becomes hard to manage. Eventually I realized that since code is structured line by line, per-line handling is the most natural approach. And because the directives are comments that will ultimately be deleted anyway, we can simply trim each line's whitespace and match the tags, even when they are indented. This streamlined the approach significantly: a line opening with the #IFDEF directive sets the processing state to true, and the terminating #ENDIF switches it back. Since the end goal is merely to delete certain code sections, non-qualifying lines can simply be blanked out. To handle nesting, however, we must maintain a stack: push the index of each opening #IFDEF and pop when we encounter its #ENDIF, while tracking the current processing state; if the state is true when popping, we must decide whether to mark it false to conclude processing for that block. Additionally, a debug option can emit the post-processed files for specific modules.
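A minimal sketch of such a loader, assuming #IFDEF/#ENDIF comment directives and a PLATFORM environment variable; option handling and the debug output are omitted:

```ts
// conditional-compile-loader.ts: a sketch; the directive spelling and the
// PLATFORM variable follow the article, the rest is illustrative.
const IFDEF = "#IFDEF";
const ENDIF = "#ENDIF";

export default function (source: string): string {
  const platform = (process.env.PLATFORM || "chromium").toLowerCase();
  const stack: number[] = []; // indices of the currently open #IFDEF lines
  let keep = true;            // whether lines in the current region are kept
  let droppedAt = -1;         // stack depth at which we started dropping

  const lines = source.split("\n").map((line, index) => {
    const trimmed = line.trim();
    // e.g. `// #IFDEF CHROMIUM` or `// #IFDEF CHROMIUM|GECKO`
    if (trimmed.startsWith(`// ${IFDEF}`)) {
      stack.push(index);
      const platforms = trimmed
        .slice(`// ${IFDEF}`.length)
        .trim()
        .toLowerCase()
        .split("|");
      if (keep && !platforms.includes(platform)) {
        keep = false;
        droppedAt = stack.length; // remember where dropping began
      }
      return ""; // directives themselves are always removed
    }
    if (trimmed.startsWith(`// ${ENDIF}`)) {
      if (stack.length === droppedAt) {
        keep = true; // the dropped block is fully closed
        droppedAt = -1;
      }
      stack.pop();
      return "";
    }
    return keep ? line : ""; // blank out lines that don't match the platform
  });

  return lines.join("\n");
}
```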
In practical usage, take the registration of the Badge as an example: the comment directives let each platform execute its own code. And of course, if the platforms need similar definitions, we can simply redefine the variables directly in each branch.
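A hedged illustration of what that can look like; `cross` as a unified alias for chrome/browser is an assumption here:

```ts
// After preprocessing, only the block for the target platform survives.
declare const cross: typeof chrome; // assumed unified chrome/browser alias

export const setBadge = (count: number) => {
  // #IFDEF CHROMIUM
  // v3 exposes the badge via the `action` API.
  cross.action.setBadgeText({ text: String(count) });
  // #ENDIF
  // #IFDEF GECKO
  // v2 still uses `browserAction`.
  cross.browserAction.setBadgeText({ text: String(count) });
  // #ENDIF
};
```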
One key capability of browser extensions is document_start, meaning the code injected by the extension executes before the site's own JS code. This gives our code ample room for hooks. Consider the potential if we could run whatever JS we wanted the instant the page starts loading: we could manipulate the current page at will. Although we can't hook the creation of literal objects, page code must eventually call the APIs the browser provides, and as long as an API is invoked, we can find ways to intercept the call and retrieve the data we want. For instance, we could intercept calls to Function.prototype.call, but for that interception to be effective, our hook must be in place before any other code on the page runs; if the function has already been called, intercepting it afterwards is pointless.
We might all wonder what this kind of interception is actually good for. Take a simple example: in a certain library, all text is rendered through a canvas. Since there is no DOM, if we want to obtain the document's content, we cannot directly copy it. A feasible solution is to hijack the document.createElement function: when the element being created is a canvas, we grab the canvas object in advance to obtain its ctx. And since actually rendering text still requires calling the context2DPrototype.fillText method, hijacking that method lets us extract the text as it is rendered. We can then build the DOM elsewhere, allowing copying whenever we want.
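A minimal sketch of the fillText side of that idea, assuming our code already runs at document-start:

```ts
// Hook CanvasRenderingContext2D.prototype.fillText and record the text the
// page renders; where the text is collected to is up to the caller.
const originalFillText = CanvasRenderingContext2D.prototype.fillText;

CanvasRenderingContext2D.prototype.fillText = function (
  text: string,
  x: number,
  y: number,
  maxWidth?: number
): void {
  console.log("rendered text:", text); // e.g. mirror it into a hidden DOM node
  if (maxWidth === undefined) {
    originalFillText.call(this, text, x, y);
  } else {
    originalFillText.call(this, text, x, y, maxWidth);
  }
};
```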
Now, returning to the problem at hand: if we can ensure the script runs first, we can accomplish nearly anything at the language level, such as modifying the window object, hooking function definitions, altering prototype chains, blocking events, and so forth. This capability ultimately stems from browser extensions, and the challenge for a script manager is how to expose it to Web pages. For our discussion, let's assume user scripts run on the browser page as Inject Scripts rather than Content Scripts. Based on this assumption, a first attempt would likely be a dynamic, asynchronous loader for JS scripts, similar to the following:
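A representative sketch of such a loader; the function name and URL handling are illustrative:

```ts
// Dynamically create a <script> tag and resolve when it loads: the naive
// asynchronous approach whose timing problems are discussed next.
const loadScriptAsync = (url: string): Promise<void> => {
  return new Promise((resolve, reject) => {
    const script = document.createElement("script");
    script.src = url;
    script.async = true;
    script.onload = () => resolve();
    script.onerror = () => reject(new Error(`Failed to load: ${url}`));
    document.body.appendChild(script);
  });
};
```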
Now there's a clear problem: if we load the script after the body tag has been constructed, roughly at the DOMContentLoaded moment, we will certainly not achieve document-start. Even handling it right after the head tag completes is ineffective, since many websites place JS resources inside the head, so that timing is also too late. In reality, the biggest issue is that the entire process is asynchronous: by the time the external script finishes loading, a lot of the page's JS code has already executed, defeating our goal of "executing first".
So let's explore the concrete implementations, starting with the v2 extension for Gecko-based browsers. For any page, the first element to exist is the html tag, so it's clear we just need to insert the script at the html-tag level. We combine chrome.tabs.executeScript, which lets the extension's background execute code dynamically, with a Content Script declared with "run_at": "document_start" that establishes message communication to confirm the tab to inject. This method may look very simple, yet this seemingly straightforward problem had me pondering for quite some time.
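A hedged sketch of that pairing, with a placeholder file name and a simplified message shape:

```ts
// Content Script (v2, run_at: document_start): report this tab to background.
chrome.runtime.sendMessage("inject-me");

// Background: execute code in the sender's tab; the injected code appends a
// <script> to <html>, which is the only element that exists this early.
chrome.runtime.onMessage.addListener((message, sender) => {
  if (message !== "inject-me" || sender.tab?.id === undefined) return;
  const code = `
    var script = document.createElement("script");
    script.src = chrome.runtime.getURL("inject.js"); // placeholder file name
    document.documentElement.appendChild(script);
  `;
  chrome.tabs.executeScript(sender.tab.id, { code, runAt: "document_start" });
});
```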
This approach actually looks quite good; it can basically achieve document-start. But only basically, which means some scenarios still go wrong. Looking closely at the implementation, there is a communication step, Content Script --> Background. Communication means asynchronous processing, and asynchronous processing takes time; in that time the user's page may already have executed a significant amount of code, so we occasionally fail to truly achieve document-start and the script does not run first.
So how can we address this? In v2 we know for certain that the Content Script itself is reliably executed at document-start, but a Content Script is not an Inject Script: it cannot access the page's window object and therefore cannot effectively hijack the page's functions. The problem sounds complex, but once understood, the solution is actually quite simple. We build on the original Content Script by introducing an additional Content Script whose code is entirely equivalent to the original Inject Script, except that it is attached to window. We can write a bundler plugin to accomplish this.
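A sketch of what the plugin's generated output might look like; the key name is illustrative, and in reality it is randomly generated at build time:

```ts
// Generated Content Script: the whole Inject Script is wrapped in a function
// and parked on the Content Script's (isolated) window under a random key.
(window as unknown as Record<string, () => void>)["__INJECT_xxxxxx__"] =
  function () {
    // ... the original Inject Script code is bundled into this body ...
  };
```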
This code attaches a randomly generated key to the Content Script's own window object (the potential-conflict concern mentioned earlier), and its value is the script we want to inject into the page. Now that we can access this function, how do we make it execute on the user's page? Here we use the same document.documentElement.appendChild method to create a script, and the implementation is exceptionally clever: by combining the two Content Scripts with toString, we obtain the code as a string and inject it directly into the page, thereby achieving a true document-start.
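A sketch of the second Content Script's side of the trick, using the same illustrative key name as above:

```ts
// Read the parked function, serialize it with toString, and inject it as
// inline code so it executes synchronously at document-start.
const w = window as unknown as Record<string, () => void>;
const fn = w["__INJECT_xxxxxx__"];
const script = document.createElement("script");
script.textContent = `;(${fn.toString()})();`;
document.documentElement.appendChild(script);
script.remove(); // clean up the tag; the code has already run
delete w["__INJECT_xxxxxx__"];
```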
As previously mentioned, Chrome no longer permits submissions of v2 extensions, meaning we can only submit v3 code. However, v3 comes with very strict Content Security Policy (CSP) restrictions that effectively disallow dynamic execution of code, so the approaches outlined above all become ineffective, leading us to write something akin to the following code.
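A sketch of that code, with a placeholder script name:

```ts
// v3 Content Script: CSP forbids inline code, so point the <script> at a
// web_accessible_resources file instead.
const script = document.createElement("script");
script.src = chrome.runtime.getURL("inject.js"); // placeholder file name
script.onload = () => script.remove();
document.documentElement.appendChild(script);
```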
It looks as though we create the script tag immediately in the Content Script and execute code right away, but does this achieve our document-start goal? Unfortunately, no. It works when the page is first opened, but afterwards this script effectively behaves like any other external script: Chrome queues it alongside the page's own scripts, and with strong caching it is uncertain which script will execute first. That instability is unacceptable, so we definitely cannot meet the document-start objective this way.
In fact, this alone indicates that v3 is not fully mature; many capabilities are not adequately supported. The official team has since devised several solutions to this problem, but given that we have no way to determine the user's browser version, many compatibility paths still need to be handled.
Since Chrome 109, chrome.scripting.registerContentScripts can register scripts dynamically, and Chrome 111 allows scripts with world: 'MAIN' to be declared directly in the Manifest. The compatibility burden still falls on the developer, though: on older browsers that do not support world: 'MAIN', the script is silently treated as an ordinary Content Script, which I find rather tricky to manage.
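A sketch of the dynamic registration path, with a placeholder file name; the fallback branch is where the earlier Content Script trick would go:

```ts
// Register the inject script into the page's MAIN world at document_start.
chrome.scripting
  .registerContentScripts([
    {
      id: "inject",
      js: ["inject.js"], // placeholder file name
      matches: ["<all_urls>"],
      runAt: "document_start",
      world: "MAIN",
    },
  ])
  .catch(() => {
    // Older browsers: `world` is unsupported, so fall back to the
    // Content Script + toString injection described earlier.
  });
```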
Consider that many of our resource references are plain strings, such as the icons reference in manifest.json; unlike in our Web applications, these do not reference resources by their actual import paths, so the bundler never incorporates them as content. The concrete symptom is that modifying such a resource does not trigger the bundler's HMR. Therefore, we need to manually register these files as bundle dependencies, and we also need to copy them to the target output folder. This isn't overly complex; a small plugin can do it. Besides images and other static resources, the locales language files need the same treatment.
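A minimal sketch of such a plugin, assuming rspack/webpack-style hooks; directory names are illustrative and error handling is omitted:

```ts
import fs from "fs";
import path from "path";
import type { Compiler } from "@rspack/core";

export class FilesPlugin {
  apply(compiler: Compiler) {
    const publicDir = path.resolve(__dirname, "public");
    // Register the directory so that edits retrigger builds.
    compiler.hooks.afterCompile.tap("FilesPlugin", (compilation) => {
      compilation.contextDependencies.add(publicDir);
    });
    // Copy the static resources into the output folder after each emit.
    compiler.hooks.afterEmit.tap("FilesPlugin", () => {
      const output = compiler.options.output.path as string;
      fs.cpSync(publicDir, output, { recursive: true });
    });
  }
}
```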
As with the static resources just discussed, there is a similar issue with generating the manifest.json file: we likewise need to register it as a contextDependency with the bundler. And recalling that we must stay compatible with both Chromium and Gecko, manifest.json itself needs to be made compatible too. To avoid maintaining two configuration files, we can use ts-node to generate manifest.json dynamically, writing the configuration file through whatever logic we need.
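A sketch of what the dynamic manifest source might look like; the name, version, and field values are illustrative:

```ts
// manifest.ts: evaluated with ts-node at build time; a plain runtime check
// (or the #IFDEF comment directives) picks the platform-specific fields.
const isGecko = process.env.PLATFORM === "gecko";

export const manifest = {
  name: "Force Copy", // illustrative
  version: "1.0.0",
  manifest_version: isGecko ? 2 : 3,
  ...(isGecko
    ? {
        background: { scripts: ["worker.js"] },
        browser_action: { default_popup: "popup.html" },
      }
    : {
        background: { service_worker: "worker.js" },
        action: { default_popup: "popup.html" },
      }),
};

// A build step would write JSON.stringify(manifest, null, 2) to dist/manifest.json.
```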
A browser extension contains many modules, commonly including background/worker, popup, content, inject, devtools, and so on. Each module serves a different purpose, and they collaborate to deliver the extension's functionality. Clearly, with so many modules responsible for distinct functions, we need to establish communication capabilities among them.
Since the entire project is built with TS, we aim for a fully typed communication scheme; static type checking helps us avoid numerous issues in complex features. Taking Popup to Content as the example, we'll create a unified data-communication solution, designing the relevant classes for every module in the extension that needs to communicate.
First, we need to define the communication key values, since the type field determines what kind of information a message carries. To prevent value conflicts, we increase the complexity of our key values using reduce.
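A sketch of that idea; the type names and the mangling scheme are illustrative:

```ts
// Derive collision-resistant message keys from plain names via reduce.
const CONTENT_TYPES = ["CopyAction", "StateQuery"] as const;
type ContentType = (typeof CONTENT_TYPES)[number];

export const CONTENT_MESSAGE_TYPE = CONTENT_TYPES.reduce(
  (acc, key) => ({ ...acc, [key]: `__CONTENT_${key}_${key.length}__` }),
  {} as { [K in ContentType]: string }
);
// => { CopyAction: "__CONTENT_CopyAction_10__", StateQuery: "__CONTENT_StateQuery_10__" }
```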
If you've used redux, you may have run into this challenge: how to tie the type to the payload it carries. For example, we want TS to automatically infer that when type is A, the payload type is { x: number }, and when type is B, the inferred type should be { y: string }. A relatively straightforward declarative example would be as follows:
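Sketched here with the A/B names from above:

```ts
// A discriminated union ties each type to its payload; TS narrows the
// payload automatically inside each branch.
type Message =
  | { type: "A"; payload: { x: number } }
  | { type: "B"; payload: { y: string } };

const onMessage = (message: Message) => {
  if (message.type === "A") {
    console.log(message.payload.x); // inferred as number
  } else {
    console.log(message.payload.y); // inferred as string
  }
};
```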
However, this approach is not very elegant; we'd prefer a more refined declaration of the types. Fortunately, generics let us accomplish this, although we need to tackle the problem step by step: first establish a type Map expressing the type -> payload mapping, then transform it into a structure of type -> { type: T, payload: Map[T] }, from which we can derive the Tuple.
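A sketch of the transformation, reusing the A/B example; the type names are illustrative:

```ts
// Step 1: a plain map from type to payload.
type PayloadMap = { A: { x: number }; B: { y: string } };

// Step 2: lift it to type -> { type: T; payload: Map[T] }.
type ToMessageMap<M> = { [K in keyof M]: { type: K; payload: M[K] } };

// Step 3: index by all keys to obtain the union of variants.
type MessageUnion = ToMessageMap<PayloadMap>[keyof PayloadMap];
// => { type: "A"; payload: { x: number } } | { type: "B"; payload: { y: string } }
```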
Now we can wrap all of this in a namespace, together with some basic type-conversion helpers, to make it easier to call. In fact, to simplify our function calls further, we can also process the parameters by casting them internally to the required parameter types.
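A sketch of that encapsulation; the names are illustrative:

```ts
namespace PopupToContent {
  export type Map = { A: { x: number }; B: { y: string } };
  export type Message = {
    [K in keyof Map]: { type: K; payload: Map[K] };
  }[keyof Map];

  // Build a well-typed message while casting loose parameters internally.
  export const make = <K extends keyof Map>(type: K, payload: Map[K]): Message =>
    ({ type, payload } as Message);
}

const msg = PopupToContent.make("A", { x: 1 }); // payload checked against "A"
```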
To keep our type expressions clear, we will temporarily avoid function-parameter syntax and instead denote the types in an object format of type -> payload. Having defined the request types this way, we now define the data types of the returned responses in the same type -> payload format, with each response type matching its request type.
Next, we define the entire event-communication Bridge. Since we are sending data from Popup to Content, we must specify which Tab we are sending to, which requires querying the currently active Tab. Sending uses cross.tabs.sendMessage, while receiving uses cross.runtime.onMessage.addListener. Given the potential variety of communication channels, we also need to check the message source, which can be done by checking the key it was sent with.
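A sketch of the bridge, reusing the PopupToContent namespace sketched above; `cross` (a unified chrome/browser alias) and the key constant are assumptions:

```ts
declare const cross: typeof chrome; // assumed unified chrome/browser alias
const BRIDGE_KEY = "__POPUP_TO_CONTENT__"; // illustrative source marker

export class PopupContentBridge {
  // Popup side: find the active tab and send a typed message to it.
  static async postToContent(data: PopupToContent.Message) {
    const [tab] = await cross.tabs.query({ active: true, currentWindow: true });
    if (tab?.id === undefined) return null;
    return cross.tabs.sendMessage(tab.id, { key: BRIDGE_KEY, data });
  }

  // Content side: verify the source key, then respond immediately.
  static onPopupMessage(handler: (data: PopupToContent.Message) => unknown) {
    cross.runtime.onMessage.addListener((msg, _sender, sendResponse) => {
      if (!msg || msg.key !== BRIDGE_KEY) return;
      sendResponse(handler(msg.data));
    });
  }
}
```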
It is important to note that although the API definition includes sendResponse for asynchronous data responses, testing reveals that this function cannot actually be called asynchronously here; it must be executed immediately within the callback. The term asynchronous refers to the overall event-communication process, so we define the bridge in terms of immediate data responses.
Furthermore, communication between content and inject requires a somewhat specialized encapsulation. Content Scripts share the DOM and the event flow with the Inject Script, which means we can effectively implement communication in two ways (see the sketch after this list):

1. window.addEventListener + window.postMessage. One obvious issue with this approach is that the messages can also be received by the Web page itself. Even if we generate random tokens to validate the source of the messages, this method can still be easily intercepted by the page, so it is not very secure.
2. document.addEventListener + document.dispatchEvent + CustomEvent, i.e. custom events. Here it is crucial that the event names are random: by generating a unique random event name in the background during the injection of the framework, and subsequently using that name for communication between the Content Script and the Inject Script, we prevent the messages generated by our method calls from being intercepted by users.

It is important to note that all transmitted data must be serializable. If not, Gecko-based browsers will consider the values cross-origin objects, since they genuinely cross different Contexts; anything else would be akin to directly sharing memory.
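A sketch of the CustomEvent channel; in practice the event name would come from the background at injection time rather than being generated in place:

```ts
// An illustrative random event name; the real one is generated in the
// background and shared with both sides.
const EVENT_NAME = "__MSG_" + Math.random().toString(36).slice(2);

// Content Script side: dispatch a custom event carrying serializable data.
document.dispatchEvent(
  new CustomEvent(EVENT_NAME, { detail: { type: "A", payload: { x: 1 } } })
);

// Inject Script side: listen on the same randomly named event.
document.addEventListener(EVENT_NAME, (event) => {
  console.log("received:", (event as CustomEvent).detail);
});
```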
Earlier we discussed the various limitations of Google's heavily promoted v3. One significant restriction is its CSP (Content Security Policy), which no longer allows dynamic execution of code. That breaks tools like our DevServer's HMR, yet hot updates are a feature we genuinely need, so we are left with a less refined solution.
We can create a plugin for our build tool that uses ws.Server to start a WebSocket server. Then, from worker.js, the Service Worker we intend to start, we connect to that server with new WebSocket and listen for messages. Upon receiving a reload message from the server, we execute chrome.runtime.reload() to reload the extension.
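A sketch of both sides; the port is illustrative, and reconnect logic is omitted:

```ts
// Build-plugin side (Node): start a ws server and broadcast after builds.
import WebSocket from "ws";

const wss = new WebSocket.Server({ port: 3333 }); // illustrative port
export const broadcastReload = () => {
  wss.clients.forEach((client) => client.send("reload"));
};
// e.g. call broadcastReload() from the bundler's afterDone hook.
```

```ts
// worker.js (Service Worker) side: reload the extension on demand.
const ws = new WebSocket("ws://localhost:3333");
ws.onmessage = (event) => {
  if (event.data === "reload") chrome.runtime.reload();
};
```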
On the server side, we send the reload message to the client after each build completes, for example within the afterDone hook, which gives us a simple extension-reload capability. However, this introduces another problem: in v3, Service Workers are not persistent, so the WebSocket connection is torn down whenever the Service Worker is destroyed. This has made it hard for many Chrome extensions to transition smoothly from v2 to v3, and this behavior will likely be improved in the future.
With this, we have implemented the extension's entire hot-update scheme. We can additionally leverage the extension's Install event to re-execute the Content/Inject Script injection at that moment, achieving a comprehensive hot update. Of course, the script injection must be idempotent. Note also that there is no Uninstall event in the extension, so removing previously injected side effects has to be managed by convention, for example by calling specific global cleanup methods.
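A sketch of the re-injection step, assuming idempotent content scripts and a placeholder file name:

```ts
// Re-inject the Content Script into all open tabs on install/update.
chrome.runtime.onInstalled.addListener(async () => {
  const tabs = await chrome.tabs.query({});
  for (const tab of tabs) {
    if (tab.id === undefined || !tab.url?.startsWith("http")) continue;
    chrome.scripting
      .executeScript({ target: { tabId: tab.id }, files: ["content.js"] })
      .catch(() => void 0); // privileged pages cannot be injected
  }
});
```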
Interestingly, the multilingual solution provided by the browser is not very practical. The files we store in locales are merely placeholders, intended to let the extension marketplace recognize which languages our extension supports; the actual multilingual implementation occurs within our Popup. For example, the data in packages/force-copy/public/locales/zh_CN looks like this:
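An illustrative messages.json in the standard extension-locale format (not the verbatim file contents):

```json
{
  "name": { "message": "Force Copy" },
  "description": { "message": "Force Copy" }
}
```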
In reality, there are many front-end multilingual solutions available. Since our extension does not contain much multilingual content (it is just a Popup layer, after all), there is no need to create a separate index.html page; if that were necessary, adopting a community multilingual solution would be worthwhile, but for now we keep it simple.

First, we ensure complete type coverage. Our extension uses English as the base language, so the configuration is also written in English. Since we want a better grouping scheme, the structure may be nested fairly deeply, and the type definitions must be comprehensive enough to support our multilingual requirements.
Next, we define the I18n class along with a global cache for languages. In the I18n class, we implement functions for calling, generating multilingual configurations on demand, and retrieving multilingual configurations. To use it, we instantiate with new I18n(cross.i18n.getUILanguage()); and retrieve translations by calling i18n.t("Information.GitHub").
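A minimal sketch of such a class, assuming an English base configuration and dot-path lookup; the cache and on-demand generation are simplified:

```ts
// The base (English) configuration doubles as the type source.
const en = {
  Information: { GitHub: "GitHub" },
  Operation: { Copy: "Copy" },
} as const;

type Config = typeof en;
const LANGS: Record<string, Config> = { en }; // global language cache

export class I18n {
  private config: Config;
  constructor(language: string) {
    // Fall back to English when the UI language has no configuration.
    this.config = LANGS[language.replace("-", "_")] ?? en;
  }
  // Look up a dot-separated path such as "Information.GitHub".
  t(path: string): string {
    const value = path
      .split(".")
      .reduce<unknown>(
        (acc, key) => (acc as Record<string, unknown> | undefined)?.[key],
        this.config
      );
    return typeof value === "string" ? value : path;
  }
}

const i18n = new I18n("en");
i18n.t("Information.GitHub"); // => "GitHub"
```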
Developing browser extensions is quite a complex endeavor, especially when the result must be compatible with both v2 and v3; many design decisions are needed to keep everything functional on v3. The shift to v3 has reduced some flexibility, but it has also enhanced security to some extent. Still, the inherent permissions of browser extensions remain very high: even in v3, we can use the CDP (Chrome DevTools Protocol) in Chrome to accomplish a wide array of tasks. The sheer breadth of what extensions can do makes it daunting to install one without a clear understanding of it, particularly when the source code is not open. High extension permissions can lead to severe issues, such as leaks of user data, and even a strict review process with mandatory code uploads like Firefox's cannot eliminate every potential risk.