Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Encoding problems with bytea #827

Open
Broollie opened this issue May 6, 2024 · 22 comments
Open

Encoding problems with bytea #827

Broollie opened this issue May 6, 2024 · 22 comments

Comments

@Broollie
Copy link

Broollie commented May 6, 2024

Hello, I encountered an interesting issue. I have a table with column of type bytea. Data in this column is of fixed size so in order to ommit manual runtime size checks I implemented string_traits for std::array<std::byte, some size> to use this type directly in prepared statements. And it does work.. until it doesn't. Depending on contents, query may result in this error: "ERROR: invalid byte sequence for encoding "UTF8": 0xd6 0x5f CONTEXT: unnamed portal parameter $1" (0xd6, 0x5f just for an instance). And i m not sure what causes it, but apparently this error comes from libpq call, but I cannot be sure about this. Here is code that represents my case:

#include <pqxx/pqxx>
#include <array>
#include <iostream>
#include <format>

inline constexpr std::size_t DataSize = 8;

using Data = std::array<std::byte, DataSize>;

namespace pqxx
{
    template <>
    struct string_traits<Data>
    {
        // static constexpr bool converts_to_string{true};
        static constexpr bool converts_from_string{true};

        static constexpr auto DataSize = internal::size_esc_bin(std::tuple_size_v<Data>);

        static Data from_string(std::string_view text)
        {
            if (text.size() != DataSize)
            {
                throw pqxx::conversion_error(std::format("invalid data size: {}. Expected: {}", text.size(), DataSize));
            }

            Data buf;
            pqxx::internal::unesc_bin(text, reinterpret_cast<std::byte *>(buf.data()));
            return buf;
        }

        static zview to_buf(char *begin, char *end, Data const &value)
        {
            if (end - begin < DataSize)
            {
                throw pqxx::conversion_overrun(std::format("to_buf: unable to fit data into buffer with size {}", end - begin));
            }

            return generic_to_buf(begin, end, value);
        }

        static char *into_buf(char *begin, char *end, Data const &value)
        {
            if (end - begin < DataSize)
            {
                throw pqxx::conversion_overrun(std::format("into_buf: unable to fit data into buffer with size {}", end - begin));
            }

            for (auto i = 0; i < value.size(); i++)
            {
                begin[i] = static_cast<char>(value.at(i));
            }
            begin[value.size()] = '\0';

            return begin + DataSize;
        }

        static size_t size_buffer(Data const &value) noexcept
        {
            return DataSize;
        }
    };
}

namespace std
{
    template <>
    struct formatter<Data> : formatter<string>
    {
        auto format(const Data &data, format_context &ctx) const
        {
            int tmpData[DataSize];
            std::transform(data.begin(), data.end(), tmpData, [](auto elem)
                           { return static_cast<int>(elem); });
            return formatter<string>::format(
                std::format("\\x{:02x}{:02x}{:02x}{:02x}{:02x}{:02x}{:02x}{:02x}",
                            tmpData[0], tmpData[1], tmpData[2], tmpData[3],
                            tmpData[4], tmpData[5], tmpData[6], tmpData[7]),
                ctx);
        }
    };
}

inline constexpr std::byte operator""_bt(unsigned long long c)
{
    return static_cast<std::byte>(c);
}

auto main(int argc, const char *argv[]) -> int
{
    static constexpr std::string_view ConnectionQuerySchema = "host={} port={} dbname={} user={} password={}";
    static constexpr std::string_view InsertionSchema = "INSERT INTO test_table VALUES ('{}')";

    static constexpr Data okData{0x01_bt, 0x01_bt, 0x01_bt, 0x01_bt, 0x01_bt, 0x01_bt, 0x01_bt, 0x01_bt};
    static constexpr Data badData{0xd6_bt, 0x5f_bt, 0xd6_bt, 0x5f_bt, 0xd6_bt, 0x5f_bt, 0xd6_bt, 0x5f_bt};

    try
    {
        pqxx::connection con(std::format(ConnectionQuerySchema, argv[1], argv[2], argv[3], argv[4], argv[5]));
        con.prepare("insert", "INSERT INTO test_table VALUES ($1)");

        pqxx::nontransaction nt(con);
        nt.exec0("CREATE TABLE IF NOT EXISTS test_table(data bytea NOT NULL)");

        std::cout << "Ok: " << std::boolalpha << nt.exec_prepared0("insert", okData).empty() << std::endl;
        const auto query = std::format(InsertionSchema, badData);
        std::cout << "Query: " << query << " Also ok: " << nt.exec0(query).empty() << std::endl;
        std::cout << "Bad: " << nt.exec_prepared0("insert", badData).empty() << std::endl;
    }
    catch (const std::exception &e)
    {
        std::cerr << e.what() << '\n';
    }
} 

Compiled as follows: g++-13 -std=c++20 test.cpp -o test -lpqxx -lpq

@tt4g
Copy link
Contributor

tt4g commented May 6, 2024

When you send prepared statement parameters with libpq, all parameters are converted once into strings.
If you only write $1 in the SQL, PostgreSQL will treat it as a string and will try to interpret it as client encoding UTF-8, which will fail.

Try replacing $1 with $1::bytea.

@Broollie
Copy link
Author

Broollie commented May 7, 2024

When you send prepared statement parameters with libpq, all parameters are converted once into strings. If you only write $1 in the SQL, PostgreSQL will treat it as a string and will try to interpret it as client encoding UTF-8, which will fail.

Try replacing $1 with $1::bytea.

Thank you for your suggestion, but I tried adding ::bytea earlier and it didn't work. But thanks to your explanation I ve got a soluction that does work. By passing input into pqxx::internal::esc_bin it will produce escaped string and will work correctly afterwards.

I would guess, that it also should not be too expensive by calling it this way:
pqxx::internal::esc_bin({badData.begin(), badData.end()})
as it will create a view for an input argument. But I wonder, if there is a better solution.

Anyways, i will also include my solution there in case somebody will find it useful:

#include <pqxx/pqxx>
#include <iostream>
#include <format>

inline constexpr std::size_t DataSize = 8;

using Data = std::array<std::byte, DataSize>;

namespace pqxx
{
    template <>
    struct string_traits<Data>
    {
        // static constexpr bool converts_to_string{true};
        static constexpr bool converts_from_string{true};

        static constexpr auto DataSize = internal::size_esc_bin(std::tuple_size_v<Data>);

        static Data from_string(std::string_view text)
        {
            if (text.size() + 1 != DataSize) //size_esc_bin adds 1 byte for \0 but std::string_view::size doesn't include it
            {
                throw pqxx::conversion_error(std::format("invalid data size: {}. Expected: {}", text.size(), DataSize));
            }

            Data buf;
            pqxx::internal::unesc_bin(text, reinterpret_cast<std::byte *>(buf.data()));
            return buf;
        }

        static zview to_buf(char *begin, char *end, Data const &value)
        {
            if (end - begin < DataSize)
            {
                throw pqxx::conversion_overrun(std::format("to_buf: unable to fit data into buffer with size {}", end - begin));
            }

            return generic_to_buf(begin, end, value);
        }

        static char *into_buf(char *begin, char *end, Data const &value)
        {
            if (end - begin < DataSize)
            {
                throw pqxx::conversion_overrun(std::format("into_buf: unable to fit data into buffer with size {}", end - begin));
            }

            for (auto i = 0; i < value.size(); i++)
            {
                begin[i] = static_cast<char>(value.at(i));
            }
            begin[value.size()] = '\0';

            return begin + DataSize;
        }

        static size_t size_buffer(Data const &value) noexcept
        {
            return DataSize;
        }
    };
}

namespace std
{
    template <>
    struct formatter<Data> : formatter<string>
    {
        auto format(const Data &data, format_context &ctx) const
        {
            int tmpData[DataSize];
            std::transform(data.begin(), data.end(), tmpData, [](auto elem)
                           { return static_cast<int>(elem); });
            return formatter<string>::format(
                std::format("{:02x}{:02x}{:02x}{:02x}{:02x}{:02x}{:02x}{:02x}",
                            tmpData[0], tmpData[1], tmpData[2], tmpData[3],
                            tmpData[4], tmpData[5], tmpData[6], tmpData[7]),
                ctx);
        }
    };
}

inline constexpr std::byte operator""_bt(unsigned long long c)
{
    return static_cast<std::byte>(c);
}

auto main(int argc, const char *argv[]) -> int
{
    static constexpr std::string_view ConnectionQuerySchema = "host={} port={} dbname={} user={} password={}";
    static constexpr std::string_view InsertionSchema = "INSERT INTO test_table VALUES ('\\x{}')";

    static constexpr Data okData{0x01_bt, 0x01_bt, 0x01_bt, 0x01_bt, 0x01_bt, 0x01_bt, 0x01_bt, 0x01_bt};
    static constexpr Data badData{0xd6_bt, 0x5f_bt, 0xd6_bt, 0x5f_bt, 0xd6_bt, 0x5f_bt, 0xd6_bt, 0x5f_bt};

    try
    {
        pqxx::connection con(std::format(ConnectionQuerySchema, argv[1], argv[2], argv[3], argv[4], argv[5]));
        con.prepare("insert", "INSERT INTO test_table VALUES ($1)");
        con.prepare("view", "SELECT * FROM test_table");

        pqxx::nontransaction nt(con);
        nt.exec0("CREATE TABLE IF NOT EXISTS test_table(data bytea NOT NULL)");

        std::cout << "Ok: " << std::boolalpha << nt.exec_prepared0("insert", okData).empty() << std::endl;
        const auto query = std::format(InsertionSchema, badData);
        std::cout << "Query: " << query << " Also ok: " << nt.exec0(query).empty() << std::endl;
        std::cout << "Bad: " << nt.exec_prepared0("insert", pqxx::internal::esc_bin({badData.begin(), badData.end()})).empty() << std::endl;

        const auto result = nt.exec_prepared("view");

        for(auto row : result)
        {
            auto [data] = std::move(row.as<Data>());
            std::cout << std::format("{}", data) << std::endl; 
        }
    }
    catch (const std::exception &e)
    {
        std::cerr << e.what() << '\n';
    }
}

@tt4g
Copy link
Contributor

tt4g commented May 7, 2024

Glad It's solved.
I should have noticed earlier that unesc_bin() was being called and not esc_bin().

@jtv
Copy link
Owner

jtv commented May 7, 2024

Oops, and I'm only just trying to catch up! Thanks once again @tt4g for jumping in.

@Broollie I did notice this one little thing:

        static Data from_string(std::string_view text)
        {
            if (text.size() != DataSize)
            {
                throw pqxx::conversion_error(std::format("invalid data size: {}. Expected: {}", text.size(), DataSize));
            }
            // ...
        }

It looks to me here as if you're using DataSize for two very different quantities: on the one hand, it's the number of binary bytes in a Data value, but on the other, it's also the number of text bytes you expect in the SQL representation of a Data value. Those are two very different things!

@jtv
Copy link
Owner

jtv commented May 7, 2024

(I took the liberty to edit the code snippets you posted above: when you add "cxx" after the three backticks at the beginning, you get C++ syntax highlighting. :-)

@jtv
Copy link
Owner

jtv commented May 7, 2024

And I must admit that the mechanism by which libpq figures out the types of the parameters is mysterious to me as well.

By the way if you're working in C++20 @Broollie, do you actually need your own string_traits implementation? I think Data satisfies the binary concept defined in include/pqxx/strconv.hxx so you should have a conversion out of the box. ISTM you should be able to just use that, or if you want the size check in every conversion, then you could at least delegate to it.

Generally it's not a good idea to rely on items from the pqxx::internal:: namespace, because I reserve the right to change those on a whim. In particular, libpqxx 8 will require C++20 or better, and I'm hoping to replace a lot of these functions with ones that use std::span and what have you.

@Broollie
Copy link
Author

Broollie commented May 8, 2024

Oops, and I'm only just trying to catch up! Thanks once again @tt4g for jumping in.

It looks to me here as if you're using DataSize for two very different quantities: on the one hand, it's the number of binary bytes in a Data value, but on the other, it's also the number of text bytes you expect in the SQL representation of a Data value. Those are two very different things!

Hey, it s ok, i am myself always late :D Regarding DataSize, ye, that was my mistake on naming constants, didn't notice that I already have DataSize. It should be more like this

// ...
inline constexpr std::size_t DataSize = 8;
struct string_traits<Data>
{
    static constexpr auto EscDataSize = internal::size_esc_bin(std::tuple_size_v<Data>);
    // ...

By the way if you're working in C++20 @Broollie, do you actually need your own string_traits implementation? I think Data satisfies the binary concept defined in include/pqxx/strconv.hxx so you should have a conversion out of the box. ISTM you should be able to just use that, or if you want the size check in every conversion, then you could at least delegate to it.

That s actually what made me attempt to do it in the first place, because std::array satisfies std::ranges::contiguous_range
and so std::array<std::byte, N> should be usable for this. But it does require to define string_traits, specifically because it cannot resolve string_traits::from_string(text):

In file included from /home/linuxbrew/.linuxbrew/include/pqxx/internal/concat.hxx:7,
                 from /home/linuxbrew/.linuxbrew/include/pqxx/connection.hxx:36,
                 from /home/linuxbrew/.linuxbrew/include/pqxx/array.hxx:26,
                 from /home/linuxbrew/.linuxbrew/include/pqxx/pqxx:4,
                 from test.cpp:1:
/home/linuxbrew/.linuxbrew/include/pqxx/strconv.hxx: In instantiation of 'TYPE pqxx::from_string(std::string_view) [with TYPE = std::array<std::byte, 8>; std::string_view = std::basic_string_view<char>]':
/home/linuxbrew/.linuxbrew/include/pqxx/field.hxx:230:28:   required from 'T pqxx::field::as() const [with T = std::array<std::byte, 8>]'
/home/linuxbrew/.linuxbrew/include/pqxx/row.hxx:284:65:   required from 'auto pqxx::row::get_field() const [with TUPLE = std::tuple<std::array<std::byte, 8> >; long unsigned int index = 0]'
/home/linuxbrew/.linuxbrew/include/pqxx/row.hxx:278:53:   required from 'auto pqxx::row::get_tuple(std::index_sequence<__indices ...>) const [with TUPLE = std::tuple<std::array<std::byte, 8> >; long unsigned int ...indexes = {0}; std::index_sequence<__indices ...> = std::integer_sequence<long unsigned int, 0>]'
/home/linuxbrew/.linuxbrew/include/pqxx/row.hxx:192:42:   required from 'std::tuple<_UTypes ...> pqxx::row::as() const [with TYPE = {std::array<std::byte, 8>}]'
test.cpp:114:43:   required from here
/home/linuxbrew/.linuxbrew/include/pqxx/strconv.hxx:441:42: error: 'from_string' is not a member of 'pqxx::string_traits<std::array<std::byte, 8> >'
  441 |   return string_traits<TYPE>::from_string(text);

And it s not just pqxx::row::as(), but pqxx::row::reference::get<>() as well.

Generally it's not a good idea to rely on items from the pqxx::internal:: namespace, because I reserve the right to change those on a whim. In particular, libpqxx 8 will require C++20 or better, and I'm hoping to replace a lot of these functions with ones that use std::span and what have you.

Ye, looking through docs once again, i noticed convenient pqxx::binary_cast that not only works in my case, but also produces a view, so no copies (unlike pqxx::internal::esc_bin, which produces std::string)

@Broollie
Copy link
Author

Broollie commented May 8, 2024

Upon further investigation, i found out that PQXX_HAVE_CONCEPTS was not defined, so now I wonder if there are other defines, that may be missing and, if defined, should enable direct use of std::array<std::byte, N>

@jtv
Copy link
Owner

jtv commented May 8, 2024

The configure or cmake step to the libpqxx build should detect whether your compiler supports concepts.

I'm writing this on my phone where it's hard to look up past conversation... Did you select C++20 (or newer) when configuring the build? Because if so, I would definitely have expected it to detect concepts support.

By the way beware that esc_bin and binary_cast do entirely different things! The former converts binary data to SQL text format. The latter merely creates a view from a pointer, without converting.

@jtv
Copy link
Owner

jtv commented May 8, 2024

@Broollie when building libpqxx, did you use configure or cmake? If you used configure then the file config.log should show whether the configure script found __cpp_concepts defined or not. (And just in case: what matters is the C++ version you selected when configuring the build, i.e. when running either configure or cmake.)

@Broollie
Copy link
Author

Broollie commented May 8, 2024

@jtv Ok, i ve figured out what s going on. Mostly. So, basically in my main project i build libpqxx with cmake and all defines are present there. But for this test I use libpqxx installed with brew package manager. And defines are missing there because by default libpqxx v7.9 uses CMAKE_CXX_STANDARD 17. But it is not as important. So the reason why I can't use std::array<std::byte, N> directly is that string_traits already has implementation for std::array and it doesn't have from_string() defined. I checked if you can derive from internal::array_string_traits<> and simply implement from_string() and it does work in my case

@Broollie
Copy link
Author

Broollie commented May 8, 2024

But it also seems like I m missing something, because direct call to pqxx::string_traits<Data>::size_buffer(OkData) results in linkage error for pqxx::oops_forbidden_conversion<std::byte>(), which is not a great sign

@Broollie
Copy link
Author

Broollie commented May 8, 2024

By the way beware that esc_bin and binary_cast do entirely different things! The former converts binary data to SQL text format. The latter merely creates a view from a pointer, without converting.

Oh, and to clarify, I saw, that binary_cast simply creates a std::basic_string_view<std::byte> without escaping data and since its behavior is defined, it should work just fine. Seems to be correct

@jtv
Copy link
Owner

jtv commented May 8, 2024

Ah yes, the forbidden conversion - I guess that's two concepts clashing.

A span of contiguous std::byte should act as a bytea. And I was expecting a std::array of std::byte to act as one of those.

But more generically, a std::array of any other type should convert as a bunch of individual elements, wrapped in SQL array syntax. But that won't work for std::byte because by design there is no conversion for std::byte.

So we'll need a way to tell the compiler that we want the former in this case, not the latter.

@jtv
Copy link
Owner

jtv commented May 8, 2024

Or, given that the most specific specialisation should normally win out, perhaps it's a matter of removing a string_traits specialisation that has now become too specific.

@Broollie
Copy link
Author

Broollie commented May 8, 2024

That is indeed a clash of 2 while not completely different, but still contradicting concepts. There are enough ways of resolving this conflict while keeping backward compatibility (or you can ignore it in favor of implementing changes for 8.x version and some changes in traits API, if there are any). I would guess it may be solved either by creating more specialisations (like having different implementations for std::array<T, N> and std::array<std::byte, N>), that would make string_traits even more specific, or by going other way and creating more generic, yet more flexible traits API. But those are extremes of both sides, so a solution may lie between them

@jtv
Copy link
Owner

jtv commented May 21, 2024

@Broollie for now I'm just utterly stuck on getting readthedocs working again. I'm rebuilding that from scratch, because I see no other way at this point. Hopefully after that I can focus on actual progress again.

@Broollie
Copy link
Author

@jtv sure, we can just keep the issue open so that you can return to it once you r done

@jtv
Copy link
Owner

jtv commented Jun 6, 2024

The good news is that I got the documentation build working again. (With lots of scary warning messages, so I'll be chipping away at that pile still.) I've got one other small feature to build, and then hopefully I can swap this ticket back into my personal memory and see what I can do about it.

Repository owner deleted a comment from github-actions bot Aug 7, 2024
@jtv
Copy link
Owner

jtv commented Aug 7, 2024

Reading this back again, I think what I need to do about this ticket is write a test that does conversions of a std::array<std::byte, ...>... and then make that test work.

Of course that's a whole different ballgame in C++20 than it is in C++17, because I'll be counting on C++20 soon. It's not as easy as "make sure it works in C++20, we don't care about C++17."

So I'm planning to keep this ticket on the shelf until then. Right now I've got a really nice idea of where to take the exec functions, and I think that would be a good cutoff point to start 8.0 development.

@Broollie
Copy link
Author

Seems to be a reasonable test case and perhaps there should be a test for a std::span<std::byte, ...> as well. Godspeed on the v8.0

@jtv
Copy link
Owner

jtv commented Aug 15, 2024

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants