Reversible binary patch format #23379

Open
rodarima opened this issue Sep 24, 2024 · 8 comments

@rodarima

I would like to generate some binary patches that can be read in plain text, in the same way I do with diff(1) and patch(1). In particular, I want to be able to do these operations:

  • Generate a patch comparing two files (or two directories)
  • Read the addresses being modified in hexadecimal
  • Read the patch as plain text and be able to add some comments
  • Add/remove/modify some hunks in the patch manually
  • Email and VCS friendly
  • Apply the patch and generate a .rej file with rejected hunks
  • Revert an applied patch

The default radiff2(1) format is close to what I want.

% radiff2 v2.bin v3.bin
0x00189ca8 58731b => 004c1d 0x00189ca8

It has the benefit that it can be reversed by a simple awk(1) program:

% cat a.patch
0x00189ca8 58731b => 004c1d 0x00189ca8
% awk '{print $5,$4,$3,$2,$1}' < a.patch
0x00189ca8 004c1d => 58731b 0x00189ca8

However, AFAIK this format doesn't seem to be accepted by any tool.
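A minimal sketch of what a tool accepting this format might do, assuming every line has exactly the five fields shown above and that both addresses are equal (the second one is ignored here). The names `parse_line`, `reverse` and `apply_hunk` are hypothetical and not part of radare2; the example uses a small in-memory image instead of the real file.

```python
def parse_line(line):
    """Parse 'ADDR OLD => NEW ADDR' into (address, old_bytes, new_bytes)."""
    addr, old, arrow, new, _addr2 = line.split()
    if arrow != "=>":
        raise ValueError(f"unexpected separator: {arrow!r}")
    return int(addr, 16), bytes.fromhex(old), bytes.fromhex(new)


def reverse(hunk):
    """Swap old and new bytes so the hunk undoes itself."""
    addr, old, new = hunk
    return addr, new, old


def apply_hunk(data, hunk):
    """Apply one hunk to a bytearray; return False (a reject) on mismatch."""
    addr, old, new = hunk
    if bytes(data[addr:addr + len(old)]) != old:
        return False  # the image does not contain the expected bytes
    data[addr:addr + len(old)] = new
    return True


hunk = parse_line("0x00000002 58731b => 004c1d 0x00000002")
image = bytearray.fromhex("ffff58731b00")
assert apply_hunk(image, hunk)            # forward
assert image.hex() == "ffff004c1d00"
assert apply_hunk(image, reverse(hunk))   # backward
assert image.hex() == "ffff58731b00"
```

Because the old bytes are stored in the line, a second forward application is rejected instead of silently overwriting, which is what makes the format reversible.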

The r2 format outputs radare2(1) commands, but they ignore what was at that address before:

% radiff2 -r v2.bin v3.bin
wx 004c1d @ 0x00189ca8

This is not enough, as I want to know if a given patch collides with another one. It also prevents reverting an applied patch.

There is also the rapatch.md format, but it seems to be different from these two. It also seems to have the same problem: it cannot be reverted.

Maybe this problem can be solved by implementing a reversible operator like wx that swaps bytes instead of overwriting an address.

<swap> 004c1d 58731b @ 0x00189ca8

The problem with this approach is that when a swap command fails, it should output that hunk into a reject file, which is probably not what you want from an r2 session.

Maybe it would be a better idea to have another tool just for this workflow (which could also work with multiple files at once). You first perform all the changes you want with r2 w commands, then you save the file and generate a patch that can be further edited and applied/reversed:

% radiff2 a.bin b.bin > patch.txt
% $EDITOR patch.txt
% rapatch < patch.txt
% rapatch -R < patch.txt # Reverse the patch

Here is an example of what that patch may look like, which is very close to what patch(1) expects:

--- a.bin	2024-09-24 09:24:41.475235346 +0200
+++ b.bin	2024-09-24 09:24:41.475235346 +0200

The following hunk changes the value 1799000 to 1920000 (in decimal)
at the address 0x00189ca8. Notice how I use LE to specify little endian,
so I can see the raw values clearly.

@@ -0x00189ca8,4 +0x00189ca8,4 @@
- LE 0x001b7358 # 1799000 comment after the # symbol, could be assembly
+ LE 0x001d4c00 # 1920000
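As a sanity check of the hunk above, here is a small sketch that turns the two `LE` values into raw bytes. It assumes each value is a 32-bit unsigned integer, as implied by the hunk length of 4; the helper name `encode_le32` is made up for the example.

```python
def encode_le32(value):
    """Encode an unsigned 32-bit integer as 4 little-endian bytes."""
    return value.to_bytes(4, "little")


old = encode_le32(0x001b7358)   # 1799000 in decimal
new = encode_le32(0x001d4c00)   # 1920000 in decimal
print(old.hex(" "), "=>", new.hex(" "))
# prints: 58 73 1b 00 => 00 4c 1d 00
```

The first three bytes of each side match the `58731b => 004c1d` output of radiff2; the fourth byte is `00` on both sides, so radiff2 does not report it.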

The benefits of such a format are:

  • I can specify the data in other sizes and endianness than plain hex bytes
  • Changes are vertically aligned
  • Arbitrary comments which can include instructions
  • Hunks can be added/removed/modified by hand
  • Easily reverted
  • Modifications are grouped in hunks which can fail independently
  • Email and VCS friendly
  • Not specific to radare2, so it can also be used by other programs

This format can also be used to insert or remove bytes, leaving a different sized file. It also prevents the problem of using multiple write commands for the same memory location if the hunk addresses are sorted. It also resembles the patch format closely enough that it gets the syntax colors of normal patches on GitHub.
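The point about sorted hunk addresses can be checked in a single pass over the patch. A minimal sketch, assuming each hunk has been reduced to a hypothetical `(address, length)` pair:

```python
def check_sorted_hunks(hunks):
    """Raise ValueError unless hunks are sorted by address and do not overlap.

    hunks is a list of (address, length) pairs in patch order; this
    representation is an assumption made for the example.
    """
    previous_end = 0
    for address, length in hunks:
        if address < previous_end:
            raise ValueError(
                f"hunk at {address:#010x} overlaps or precedes the previous hunk"
            )
        previous_end = address + length


check_sorted_hunks([(0x00100000, 4), (0x00189ca8, 4)])   # accepted
try:
    check_sorted_hunks([(0x00189ca8, 4), (0x00189caa, 2)])
except ValueError as error:
    print("rejected:", error)
```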

Patches of patches are also readable:

% diff -u patch1.txt patch2.txt
--- patch1.txt	2024-09-24 09:29:00.823234555 +0200
+++ patch2.txt	2024-09-24 09:29:18.273200860 +0200
@@ -1,10 +1,10 @@
 --- a.bin	2024-09-24 09:24:41.475235346 +0200
 +++ b.bin	2024-09-24 09:24:41.475235346 +0200

-The following hunk changes the value 1799000 to 1920000 (in decimal)
+The following hunk changes the value 1799000 to 1920001 (in decimal)
 at the address 0x00189ca8. Notice how I use LE to specify little endian,
 so I can see the raw values clearly.

 @@ -0x00189ca8,4 +0x00189ca8,4 @@
 - LE 0x001b7358 # 1799000 comment after the # symbol, could be assembly
-+ LE 0x001d4c00 # 1920000
++ LE 0x001d4c01 # 1920001

I think I could adapt radiff2.c to output such format, and maybe modify patch(1) to accept them.

@trufae
Collaborator

trufae commented Sep 24, 2024

use the -r flag

@rodarima
Author

As of 2578ff0, using radiff2 -r produces a patch that is not reversible:

% radiff2 -r v2.bin v3.bin
wx 004c1d @ 0x00189ca8

% radiff2 -v
radiff2 5.9.5 32634 @ linux-x86-64
birth: git.5.9.4-231-g2578ff0ac5 2024-09-24__17:40:00
commit: 2578ff0ac57765e0c5908fb6559bbbfd86252c12
options: gpl -O1 cs:5 cl:2 meson

@trufae
Collaborator

trufae commented Sep 24, 2024

You can also use -1 to get output in the generic binary diff format (0xd1ffd1ff magic header), as well as -X to show a two-column hexII diff. But I agree that all of that should probably be unified into a single flag. What is your proposal? The output of -r is compatible with r2: it's an r2 script, and ideally this script should also work with r0 (aka ired). But that's just stuff from radare. Which other bindiffing tools do you care about?

The proposal to create a decent, standardized and extensible plain-text binary patch format looks quite interesting to me, and I would love to have support for it in r2. By the way, there's also support for rapatch, but it's just part of r2, not a standalone tool. For consistency with radiff2 it probably makes sense to have a rapatch2 tool instead of an uppercase r2 flag.

$ r2 -h| grep -i patch
Usage: r2 [-ACdfjLMnNqStuvwzX] [-P patch] [-p prj] [-a arch] [-b bits] [-c cmd]
 -P [file]    apply rapatch file and quit

You can read more about this in doc/rapatch.md

@trufae trufae added this to the 6.0.0 milestone Sep 26, 2024
@trufae
Collaborator

trufae commented Sep 26, 2024

I just created a new tool that can't be merged until r2-6.0:

#23391

For now it is just a dummy thing, but I agree that your proposal is important and should be treated as a first-class tool. Would you like to improve radiff2 to support this output?

I'm not 100% sure about the LE/BE values, because radiff just spots changes in bytes and doesn't really know whether the underlying data is a word or a qword. The patch format can specify that, or maybe we can make some happy assumptions here. I think we have time during 5.9.x, until we reach 6.0, to break the ABI and provide such a new tool with a proper manpage and a working patch format for unified binary patching.

@rodarima
Author

Thanks for taking a look.

I'll need to think about the patch format to come up with a spec that makes sense first.

You can always fall back to plain hex bytes if you don't know the data type, but the patch format should allow you to use a more human-friendly format.

The problem with the flow of saving the binary and then diffing with the original is that you miss information on how the user specified the changes. It may be better to generate a patch from r2 itself when you know how those changes were made. This way you can store in the patch comments such as the instruction being changed or ASCII.

I suspect that for any w command that you do with radare2, you can always find the opposite command that would revert that change, and specify it using the same value format.

For example, wvf 3.21 over a memory location that contains the 1.23 float, could generate this patch:

--- a.bin	2024-09-24 09:24:41.475235346 +0200
+++ b.bin	2024-09-24 09:24:41.475235346 +0200
@@ -0x00100000,4 +0x00100000,4 @@
- wvf 1.23
+ wvf 3.21

This is nice for radare2 users because they will already be familiar with the commands, but it doesn't make a lot of sense for users of other tools. It also has the problem that there is no information about the byte order. Something like this:

--- a.bin	2024-09-24 09:24:41.475235346 +0200
+++ b.bin	2024-09-24 09:24:41.475235346 +0200
@@ -0x00100000,4 +0x00100000,4 @@
- LE (float) 1.23
+ LE (float) 3.21

may be more understandable, especially if we use well-known types such as those from C.
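To illustrate why the byte order matters for a `(float)` hunk, here is a sketch that encodes the same value both ways with Python's `struct` module. It assumes `(float)` means a 4-byte IEEE 754 single-precision value; `encode_float` and `decode_float` are hypothetical helper names.

```python
import struct

BYTE_ORDER_PREFIX = {"LE": "<", "BE": ">"}  # struct module byte-order markers


def encode_float(value, byte_order):
    """Encode value as a 4-byte IEEE 754 float in the given byte order."""
    return struct.pack(BYTE_ORDER_PREFIX[byte_order] + "f", value)


def decode_float(raw, byte_order):
    """Inverse of encode_float, useful for rendering a hunk back as text."""
    return struct.unpack(BYTE_ORDER_PREFIX[byte_order] + "f", raw)[0]


little = encode_float(1.23, "LE")
big = encode_float(1.23, "BE")
assert little == big[::-1]   # same value, opposite byte order
print(little.hex(" "), "|", big.hex(" "))
```

Without the `LE`/`BE` marker, a tool reading `wvf 1.23` would have to guess which of the two byte sequences to expect in the file.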

Another issue is that in the common case you will use the same LE/BE for a whole patch (although we should support the cases where that is not true), so you don't need to pollute every line with LE/BE. It is convenient to define a byte order at the start:

--- a.bin	2024-09-24 09:24:41.475235346 +0200
+++ b.bin	2024-09-24 09:24:41.475235346 +0200
@@ LE @@
@@ -0x00100000,4 +0x00100000,4 @@
- (float) 1.23
+ (float) 3.21

Now, there is the case in which a user may specify integers in different bases (hex, decimal, octal). In my case above:

--- a.bin	2024-09-24 09:24:41.475235346 +0200
+++ b.bin	2024-09-24 09:24:41.475235346 +0200
@@ LE @@
@@ -0x00189ca8,4 +0x00189ca8,4 @@
- (uint32_t) 1799000 # Using a type that can map a constant into bytes
+ (uint32_t) 1920000 # Notice how I use decimal here

I think it may be good to use the C format for numbers too: 0123 = octal, 123 = dec, 0x123 = hex.
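A sketch of the C convention for unsigned integer literals. Note that Python's own `int(text, 0)` rejects a literal such as `0123`, so the octal case is handled by hand; `parse_c_int` is a hypothetical name.

```python
def parse_c_int(text):
    """Parse an unsigned integer using C rules: 0x.. hex, 0.. octal, else decimal."""
    text = text.strip()
    if text.lower().startswith("0x"):
        return int(text[2:], 16)
    if len(text) > 1 and text.startswith("0"):
        return int(text[1:], 8)
    return int(text, 10)


for literal in ("0123", "123", "0x123"):
    print(literal, "->", parse_c_int(literal))
# prints:
# 0123 -> 83
# 123 -> 123
# 0x123 -> 291
```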

This may also be valid:

--- a.bin	2024-09-24 09:24:41.475235346 +0200
+++ b.bin	2024-09-24 09:24:41.475235346 +0200
@@ LE @@
@@ -0x00189ca8,12 +0x00189ca8,12 @@
- (uint32_t []) { 1799000, 1799001, 1799002 }
+ (uint32_t []) { 1920000,   0x123,      07 }

But it starts to complicate the syntax. Also, if I want to add another number, I would need to modify the 12 in the hunk header. So this may be simpler:

--- a.bin	2024-09-24 09:24:41.475235346 +0200
+++ b.bin	2024-09-24 09:24:41.475235346 +0200
@@ LE @@
@@ -0x00189ca8,uint32_t +0x00189ca8,uint32_t @@
- 1799000 # Using a type that can map a constant into bytes
+ 1920000 # Notice how I use decimal here

--- a.bin	2024-09-24 09:24:41.475235346 +0200
+++ b.bin	2024-09-24 09:24:41.475235346 +0200
@@ LE @@
@@ -0x00189ca8,uint32_t[3] +0x00189ca8,uint32_t[3] @@
- 1799000, 1799001, 1799002
+ 1920000,   0x123,    0777 # Notice 0777 is octal
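A sketch of how a typed hunk such as `uint32_t[3]` could be mapped to raw bytes. The `TYPE_CODES` table and the `encode_values` helper are assumptions for illustration; the thread does not define which type names the format supports.

```python
import struct

# Hypothetical mapping from C type names to struct format characters.
TYPE_CODES = {"uint8_t": "B", "uint16_t": "H", "uint32_t": "I", "uint64_t": "Q"}


def encode_values(type_name, values, byte_order="LE"):
    """Pack a list of integers of the given C type into bytes."""
    prefix = "<" if byte_order == "LE" else ">"
    return struct.pack(f"{prefix}{len(values)}{TYPE_CODES[type_name]}", *values)


new = encode_values("uint32_t", [1920000, 0x123, 0o777])
print(len(new), new.hex(" "))
# prints: 12 00 4c 1d 00 23 01 00 00 ff 01 00 00
```

With the type in the hunk header, the byte length (12 here) follows from the element count, so adding a value only requires changing `[3]` to `[4]`.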

All those cases can be mapped to the basic format, where everything is a simple hex string. I don't like to just specify a hex string that could be confused with a number. Also, I think "\x12\x23\x34\x45" contains a lot of noise.

So maybe we can use something like this:

--- a.bin	2024-09-24 09:24:41.475235346 +0200
+++ b.bin	2024-09-24 09:24:41.475235346 +0200
@@ -0x00189ca8,char[4] +0x00189ca8,char[4] @@
- '58 73 1b 00'
+ '00 4c 1d 00'

Notice there is no byte order, as hex strings don't need one. We can also probably specify that the default is LE, and only write BE when needed. There is also mixed endianness, but I think we can ignore that for now and fall back to hex strings if needed.

This basic hex format could probably be implemented in radiff2 without much effort, as we don't even need to output aligned words, just which bytes differ.

I can try to modify radiff2, but I'm not familiar with the codebase so it may take a while.

In a more advanced implementation, one could determine what type of data is placed at which addresses of a binary file, and then produce the appropriate representation in a patch when changing those bytes. Otherwise, fall back to hex strings.

The hex format should allow you to split the lines as you want, so you can write instructions properly:

--- a.bin	2024-09-24 09:24:41.475235346 +0200
+++ b.bin	2024-09-24 09:24:41.475235346 +0200
@@ -0x00189ca8,char[4] +0x00189ca8,char[4] @@
- '01 46'       # mov r1, r0
- '68 46'       # mov r0, sp
+ '4f f2 ba fc' # bl 0x254526
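A sketch of how the body of such a hunk could be read: comments after `#` are dropped and the quoted hex strings of consecutive `-` or `+` lines are concatenated. The function name and the exact quoting rules are assumptions, since the format is still being discussed.

```python
def parse_hex_lines(lines):
    """Concatenate the quoted hex strings of '-' and '+' lines into (old, new)."""
    old, new = bytearray(), bytearray()
    for line in lines:
        body = line.split("#", 1)[0].strip()   # drop the trailing comment
        if not body:
            continue
        sign, quoted = body[0], body[1:].strip()
        if sign not in "-+" or len(quoted) < 2 or quoted[0] != "'" or quoted[-1] != "'":
            raise ValueError(f"malformed hunk line: {line!r}")
        (old if sign == "-" else new).extend(bytes.fromhex(quoted[1:-1]))
    return bytes(old), bytes(new)


old, new = parse_hex_lines([
    "- '01 46'       # mov r1, r0",
    "- '68 46'       # mov r0, sp",
    "+ '4f f2 ba fc' # bl 0x254526",
])
assert len(old) == len(new) == 4   # both sides match char[4] in the header
print(old.hex(" "), "=>", new.hex(" "))
# prints: 01 46 68 46 => 4f f2 ba fc
```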

It would also be nice if this format were a superset of the patch format, so you could also apply normal patches with rapatch2 (or even mix hunks). I think this can easily be done by using the "type" specifier of the hunk: -0x00189ca8,char[4] specifies a binary patch of 4 bytes, while -00189308,4 specifies 4 lines starting at decimal line 00189308.

This may be useful if you have a mix of source code and blobs and you want to specify a patch to change both.
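A sketch of how a parser might tell the two kinds of hunk apart from the header alone. It assumes a plain number after the comma means a text hunk and anything shaped like `type` or `type[N]` means a binary hunk; it does not handle the short `@@ -1 +1 @@` form that patch(1) also accepts. All names are hypothetical.

```python
import re

HEADER = re.compile(r"^@@ -(\w+),(\S+) \+(\w+),(\S+) @@$")
TYPE = re.compile(r"^(\w+)(?:\[(\d+)\])?$")


def parse_range(start, spec):
    """Return ('text', line, count) or ('binary', address, type_name, count)."""
    if spec.isdigit():
        return ("text", int(start, 10), int(spec))
    match = TYPE.match(spec)
    if match is None:
        raise ValueError(f"bad type specifier: {spec!r}")
    return ("binary", int(start, 16), match.group(1), int(match.group(2) or 1))


def parse_header(line):
    """Split a hunk header into its old and new ranges."""
    match = HEADER.match(line)
    if match is None:
        raise ValueError(f"not a hunk header: {line!r}")
    return (parse_range(match.group(1), match.group(2)),
            parse_range(match.group(3), match.group(4)))


print(parse_header("@@ -0x00189ca8,char[4] +0x00189ca8,char[4] @@"))
print(parse_header("@@ -00189308,4 +00189308,4 @@"))
```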

@rodarima
Author

This one-liner more or less implements the hex diff (addresses are decimal and start at 1):

% bindiff() { diff -u0p <(od -An -vtx1 -w1 "$1") <(od -An -vtx1 -w1 "$2") | sed '/^@@/s/,\([0-9]*\)/,char[\1]/g'; }
% bindiff v2.bin v3.bin
--- /proc/self/fd/11	2024-09-26 22:10:25.813980894 +0200
+++ /proc/self/fd/13	2024-09-26 22:10:25.813980894 +0200
@@ -1612969,char[3] +1612969,char[3] @@
- 58
- 73
- 1b
+ 00
+ 4c
+ 1d
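For comparison, a sketch of the same idea in Python that emits zero-based hexadecimal addresses and one quoted hex string per side, in the style proposed earlier in the thread. It only handles inputs of equal length (no inserted or removed bytes), and the function names are made up for the example.

```python
def bindiff(old, new):
    """Yield (address, old_bytes, new_bytes) for each run of differing bytes."""
    if len(old) != len(new):
        raise ValueError("this sketch only compares inputs of equal length")
    start = None
    for i in range(len(old) + 1):
        differs = i < len(old) and old[i] != new[i]
        if differs and start is None:
            start = i                       # a run of differences begins
        elif not differs and start is not None:
            yield start, old[start:i], new[start:i]
            start = None                    # the run ended just before i


def format_hunk(addr, old, new):
    """Render one hunk with a char[N] header and quoted hex strings."""
    header = f"@@ -{addr:#010x},char[{len(old)}] +{addr:#010x},char[{len(new)}] @@"
    return "\n".join([header, f"- '{old.hex(' ')}'", f"+ '{new.hex(' ')}'"])


a = bytes.fromhex("ffff58731b00")
b = bytes.fromhex("ffff004c1d00")
for hunk in bindiff(a, b):
    print(format_hunk(*hunk))
```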

@trufae
Collaborator

trufae commented Sep 27, 2024

Let's discuss it in here https://hackmd.io/@BCdr4EkGSKO51w6pf-JUow/r1h5idQRC/edit

@rodarima
Author

rodarima commented Oct 6, 2024

I have created a draft of the specification here: https://github.com/rodarima/xpatch/

For now I'm calling it "xpatch" as in extended patch. It is a superset of the patch(1) format, so you can mix plain text and binary patches in the same xpatch file.

There is a simple xdiff program as an example, but to implement a full xpatch I'll probably need to bring in bison to parse the grammar properly and detect errors.

I'm still thinking a bit about the format, but so far it seems suitable for all my use cases. I've also added support for parsing hex strings directly by specifying the printf-like parsing format %2x. So "aabbcc" parses to 3 numbers as hex pairs, but "aabbc" is an error.
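One way to read the `%2x` rule, sketched in Python. This is an interpretation of the comment above, not the xpatch implementation, and `parse_hex_pairs` is a hypothetical name.

```python
import string


def parse_hex_pairs(text):
    """Parse text as consecutive two-digit hex numbers, like a repeated %2x."""
    if any(char not in string.hexdigits for char in text):
        raise ValueError(f"non-hex character in {text!r}")
    if len(text) % 2 != 0:
        raise ValueError(f"odd number of hex digits in {text!r}")
    return [int(text[i:i + 2], 16) for i in range(0, len(text), 2)]


print(parse_hex_pairs("aabbcc"))   # prints: [170, 187, 204]
try:
    parse_hex_pairs("aabbc")
except ValueError as error:
    print("error:", error)
```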
