Building a protobuf parser
Recently, I had to reverse engineer the schema of a protobuf message and then parse it in client-side JavaScript.
About protocol buffers (protobuf)
For context, protobuf is a serialization format created by Google, similar in purpose to JSON or XML but binary rather than text-based. That way you send smaller chunks of data over the wire, which makes a real difference when you're serving billions upon billions of requests per year.
How does protobuf work?
Typically you write schemas for your protobuf messages. You then have an SDK that transforms those schemas into generated code that serializes and deserializes the data for you.
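A minimal schema looks something like this (the message and field names here are just for illustration):

syntax = "proto3";

message ExampleMessage {
  string query = 1;
  uint64 timestamp = 2;
}

You feed a file like this to the protoc compiler (or a protobuf library in your language), and it generates the encode and decode functions for you.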
Since I didn't have the schemas for the messages, I couldn't generate the code to decode them.
To even get a schema, I had to look into reverse engineering the data.
Approach to reverse engineering it
I first outlined a general approach which I thought would look like this:
- Download some example messages
- Inspect the data to see what's possible to do with it
- If feasible, make a parser that can be used to parse the messages
Downloading the messages
The messages I wanted to look at were not sent over regular network requests. This particular data seemed to come from a WebRTC connection in the web app we wanted to integrate with.
I had used an injection pattern to inspect fetch and XMLHttpRequest requests before, so that felt like a simple approach here.
// Creating a copy of the RTCPeerConnection object with my own method to inspect messages.
class RTCPeerConnectionModified extends RTCPeerConnection {
  constructor(configuration?: RTCConfiguration) {
    super(configuration)
    // createDataChannel is the method we want to alter, therefore we're going to copy it.
    // This way, we can use it in our injected method below.
    const originalCreateDataChannel = this.createDataChannel
    this.createDataChannel = function (label, options) {
      const dataChannel = originalCreateDataChannel.call(this, label, options)
      dataChannel.addEventListener("message", (event) => {
        const blob = new Blob([event.data], {
          type: "application/octet-stream",
        })
        const url = URL.createObjectURL(blob)
        const a = document.createElement("a")
        a.href = url
        a.id = "download"
        a.download = "data.pb"
        document.body.appendChild(a) // Append anchor to body.
        a.click() // Trigger a click on the anchor.
        document.body.removeChild(a) // Remove the anchor again.
        URL.revokeObjectURL(url) // Release the object URL.
      })
      return dataChannel // Hand the channel back to the caller as usual.
    }
  }
}
// Overwriting the browser window's existing implementation with my modified one
window.RTCPeerConnection = RTCPeerConnectionModified
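One thing to note: this snippet has to run before the web app creates its connection, since overriding window.RTCPeerConnection only affects connections constructed after the override.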
This worked great, and I could download a set of data.pb files right to my machine.
Inspecting the data
Once I had the data, it was not what I expected. The protobuf data I received was quite minimal, and the format itself is not that complicated either. Even opening the .pb file as a text file, I could see the data I wanted. There were some symbols that obviously weren't decoded correctly, but I had high hopes!
I first, naively, tried to use regex to match the special characters before and after the string I wanted from the data.
However, I quickly found out the schema wasn't straightforward enough for that.
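Roughly, the idea was something like this (a minimal sketch, not the exact pattern I used):

const bytes = new Uint8Array([]) // stand-in for the contents of a data.pb file
const text = new TextDecoder("latin1").decode(bytes) // one character per byte
const candidates = text.match(/[\x20-\x7e]{4,}/g) // runs of printable ASCII
console.log(candidates) // e.g. ["test-string", "another test string"]

This surfaces the readable strings, but it tells you nothing about which field they belong to, which is why it fell apart quickly.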
Protobuf.dev has an amazing write-up of protobuf's internal encoding, but writing a fully dynamic parser was outside the scope of the project.
Luckily, someone has created an amazing inspector tool for protobuf.
The output from the inspector looks like this:
root:
1 <varint> = 14823222113
2 <chunk> = "test-string"
7 <chunk> = "another test string"
10 <chunk> = empty chunk
12 <chunk> = message:
  1 <varint> = 100
  1 <varint> = 40000
With this output, it's possible to write a .proto file from which a parsing function could be generated!
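For example, a schema matching the output above could look something like this (the field names are made up, and the types are educated guesses based on the values):

syntax = "proto3";

message Root {
  uint64 field1 = 1; // 14823222113 doesn't fit in 32 bits, so uint64
  string field2 = 2;
  string field7 = 7;
  bytes field10 = 10; // the empty chunk; could also be a string
  Nested field12 = 12;
}

message Nested {
  repeated uint32 field1 = 1; // tag 1 appears twice, so likely repeated
}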
Writing the parser
I'm sure there is a way to make protobuf do all the work.
I had the schema after all, so I could probably write a .proto schema file and generate the parsing code from it.
Unfortunately, the code had to run in a browser environment, and none of the protobuf libraries I tried worked in one.
Except pbf. It did allow me to read messages as a Buffer, but not to read .proto files (again, browser environment)...
Luckily, pbf allows you to write the read method for your schema manually, which is what I ended up doing.
And it wasn't that bad!
All that was needed was to cross-reference the write-up on protobuf's internal encoding with the pbf package's custom reading capabilities.
The resulting code looks something like this:
import Pbf from "pbf";
function readRoot(tag, data, pbf) {
  // Field 1 is a plain varint.
  if (tag === 1) {
    data.varintField = pbf.readVarint();
    return;
  }
  // Fields 2, 7 and 10 are length-delimited chunks containing strings.
  if (tag === 2 || tag === 7 || tag === 10) {
    if (!data.chunkFields) data.chunkFields = {};
    data.chunkFields[tag] = pbf.readString();
    return;
  }
  // Field 12 is an embedded message with its own read function.
  if (tag === 12) {
    data.nestedMessage = pbf.readMessage(readNestedMessage, {});
    return;
  }
}

function readNestedMessage(tag, message, pbf) {
  // Tag 1 appears multiple times, so collect every occurrence.
  if (tag === 1) {
    if (!message.varintFields) message.varintFields = [];
    message.varintFields.push(pbf.readVarint());
    return;
  }
}
// Example usage
const buffer = ...; // Your binary protobuf data
const parsedData = new Pbf(buffer).readFields(readRoot, {});
At times the data wasn't quite what the schema led me to expect, meaning I had to check the wire type as well, using protobuf's encoding table:
ID | Name | Used For |
---|---|---|
0 | VARINT | int32, int64, uint32, uint64, sint32, sint64, bool, enum |
1 | I64 | fixed64, sfixed64, double |
2 | LEN | string, bytes, embedded messages, packed repeated fields |
3 | SGROUP | group start (deprecated) |
4 | EGROUP | group end (deprecated) |
5 | I32 | fixed32, sfixed32, float |
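pbf exposes the wire type of the field it's currently reading as pbf.type, and (at least in the version I used) ships constants like Pbf.Varint and Pbf.Bytes matching the IDs above. Here's a sketch of such a check, with a hypothetical ambiguous field:

function readAmbiguousField(tag, data, pbf) {
  if (tag === 10) {
    // Hypothetical: this field sometimes arrives as a chunk, sometimes as a varint.
    if (pbf.type === Pbf.Bytes) {
      data.field10 = pbf.readString();
    } else if (pbf.type === Pbf.Varint) {
      data.field10 = pbf.readVarint();
    }
    return;
  }
}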
I almost didn't publish this, as I thought it would be too niche a problem for anyone to google, let alone for this post to actually rank for. If you found it helpful, consider sending me a message; it would be fun to hear how!