Adworse Codes Here

Programmer's travelogue
by Dima Ermilov

The Curious Case of String.slice/3

Slicing a string isn’t tricky, right? Or is it? Elixir works wonderfully with strings, but a few of its edge cases can be confusing. This one led me down a rabbit hole all the way to IO devices and improper lists.

Say we have some text data like a company name, a web page to show it on, and a database to store it in. Say we didn’t design either the web page or the DB, so they both have an 80-character limit, and there’s nothing we can do about it.

To keep things smooth, we would like to add some trimming to our changeset.

defmodule Naive do
  def trim_name(name) do
    name |> String.slice(0, 80)
  end
end

Looks nice and clean and works fine until it doesn’t:

iex> name = "समुद्र के भीतर अचानक जब बड़ी तेज़ हलचल होने लगती है तो उसमें तूफान उठता है जिससे ऐसी \
लंबी और बहुत ऊंची लहरों का रेला उठना शुरू हो जाता"
|> Naive.trim_name
"समुद्र के भीतर अचानक जब बड़ी तेज़ हलचल होने लगती है तो उसमें तूफान उठता है जिससे ऐसी लंबी और बहुत ऊंची लहरों का"

iex> name |> String.length
80

iex> Repo.insert(%Company{name: name})
** (Postgrex.Error) ERROR 22001 (string_data_right_truncation) value
    too long for type character varying(80)

Wait, what? Oh yeah. Elixir is very smart and Unicode-aware. It knows about graphemes and counts each of them as a single character, just like a human reader would. Isn’t it great? It lets us display complex foreign texts without breaking them in the middle of a letter. The only downside of this approach is that databases are usually not that smart: they count characters by codepoints, not by graphemes.

iex> name |> String.codepoints() |> Enum.count
111

So some of these very beautiful letters are two codepoints, not one. OK, we can deal with it.
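To see the grapheme/codepoint mismatch in isolation, here is a minimal example with a combining accent:

```elixir
# "e" followed by U+0301 COMBINING ACUTE ACCENT renders as "é":
# one grapheme for a human reader, two codepoints for a database.
e_acute = "e\u0301"

String.length(e_acute)
# => 1

e_acute |> String.codepoints() |> Enum.count()
# => 2
```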

defmodule Naive do
  def trim_name(name) do
    name
    |> String.codepoints()
    |> Enum.take(80)
    |> to_string()
  end
end

iex> Repo.insert(%Company{name: name |> Naive.trim_name})
{:ok, %Company{name: "समुद्र के भीतर अचानक जब बड़ी तेज़ हलचल होने लगती है तो उसमें तूफान उठता है जिससे"}}

The downside is that we keep graphemes intact only if we’re lucky: the 80th codepoint can land right in the middle of one. So we should count by codepoints but cut by graphemes. Enum.reduce_while/3 is at your service.

defmodule LessNaive do
  def trim_string(string) do
    string
    |> String.graphemes()
    |> Enum.reduce_while("", fn grapheme, acc ->
      if counter(acc <> grapheme) <= 80,
        do: {:cont, acc <> grapheme},
        else: {:halt, acc}
    end)
  end

  defp counter(string), do: string |> String.codepoints |> Enum.count
end

This keeps adding graphemes to the accumulator while the codepoint count of the resulting string is less than or equal to 80. The only downside is that counter/1 rescans the whole accumulator on every iteration, which makes it O(n²), eww. Let’s fix it:

defmodule LessNaive do
  def trim_string(string) do
    string
    |> String.graphemes()
    |> Enum.reduce_while({0, ""}, fn grapheme, {count, acc} ->
      count = count + counter(grapheme)

      if count <= 80,
        do: {:cont, {count, acc <> grapheme}},
        else: {:halt, {count, acc}}
    end)
    |> elem(1)
  end

  defp counter(string), do: string |> String.codepoints() |> Enum.count()
end

Are we done yet? Well, kind of yes. And kind of no.

Level Up: HTML maxlength and Other Animals

iex> trimmed = "Sunny! 𐍈𐍈 Yesterday my life was filled with rain, \
Oh Sunny 𐍈𐍈You smiled at me and really eased the pain"  \
|> LessNaive.trim_string
"Sunny! 𐍈𐍈 Yesterday my life was filled with rain, Oh Sunny 𐍈𐍈You smiled at me an"

iex> trimmed |> String.length
80

So far so good. We can check against Ruby or Python; they count it the same way. Our fancy new trimmer works perfectly with any regular database, even with MySQL (given the utf8mb4 charset on the table).

But sometimes, on very rare occasions, it fails spectacularly.

% node
Welcome to Node.js v14.5.0.
Type ".help" for more information.
> "Sunny! 𐍈𐍈 Yesterday my life was filled with rain, \
... Oh Sunny 𐍈𐍈You smiled at me re".length
84

These occasions are:

- JavaScript string length checks in general, and Node.js backends in particular,
- HTML inputs with the maxlength attribute,
- and, last but not least, Salesforce API text fields.

The same thing, only nastier, will happen if we try to insert this into the Salesforce Name field: STRING_TOO_LONG:Name: data value too large. How in the world is this possible? A bit of googling gives us the answer:

JavaScript’s internal string representation isn’t UTF-8, it’s UTF-16. The JavaScript String length property returns the number of UTF-16 code units in that internal representation. In most cases, there is one UTF-16 code unit per Unicode codepoint. But some symbols take two; these are called “surrogate pairs”. For example, most single-codepoint emoji are two UTF-16 code units each. The family emoji 👨‍👩‍👧‍👧 is one grapheme, 7 codepoints, and 11 UTF-16 code units.
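Those numbers are easy to verify from iex; here is a sketch that builds the family emoji from its codepoints (MAN, ZWJ, WOMAN, ZWJ, GIRL, ZWJ, GIRL):

```elixir
family = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F467}"

# graphemes
String.length(family)
# => 1

# codepoints
family |> String.codepoints() |> Enum.count()
# => 7

# UTF-16 code units: re-encode and count 2-byte pairs
family
|> :unicode.characters_to_binary(:utf8, :utf16)
|> byte_size()
|> div(2)
# => 11
```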

So what shall we do? Luckily, Elixir has access to Erlang’s :unicode module, which can convert characters between encodings. The only thing left is to count the 2-byte UTF-16 code units, so the counter would look like this:

def counter(string) do
  string
  |> :unicode.characters_to_binary(:utf8, :utf16)
  |> byte_size()
  |> div(2)
end

Let’s make our trimmer a bit more universal by passing both the counter function and the trim length to it.

defmodule StringTrimmer do
  def trim_utf8(string, len), do: trim(string, len, &utf8_counter/1)

  def trim_utf16(string, len), do: trim(string, len, &utf16_counter/1)

  def trim(string, len, counter) do
    string
    |> String.graphemes()
    |> Enum.reduce_while({0, ""}, fn grapheme, {count, acc} ->
      count = count + counter.(grapheme)

      if count <= len,
        do: {:cont, {count, acc <> grapheme}},
        else: {:halt, {count, acc}}
    end)
    |> elem(1)
  end

  defp utf8_counter(string), do: string |> String.codepoints() |> Enum.count()

  defp utf16_counter(string) do
    string
    |> :unicode.characters_to_binary(:utf8, :utf16)
    |> byte_size()
    |> div(2)
  end
end

Et voilà. Now we can trim like Postgres and trim like Salesforce while keeping our graphemes intact.
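A quick sanity check of why the two counters disagree, using inline copies of them and the made-up string "a𐍈b" (𐍈 is one codepoint but two UTF-16 code units):

```elixir
# Inline versions of the two counters from StringTrimmer above.
utf8_count = fn s -> s |> String.codepoints() |> Enum.count() end

utf16_count = fn s ->
  s |> :unicode.characters_to_binary(:utf8, :utf16) |> byte_size() |> div(2)
end

utf8_count.("a\u{10348}b")
# => 3

utf16_count.("a\u{10348}b")
# => 4
```

So trimming "a𐍈b" to a budget of 2 keeps "a𐍈" by the UTF-8 counter but only "a" by the UTF-16 one.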

Will It Bite Us?

Oh, it sure will. I can think of at least two things that make me uncomfortable.

Say we have a 10K-character string and trim it to 5 characters. Creating a 10K-element list with String.graphemes/1 seems like overkill. Say we trim a 10K-character string to 9999 characters. Binaries in Elixir are immutable, so acc <> grapheme copies the accumulator each time; we effectively create 9999 intermediate strings, doing 9999 memory allocations. Our function is both slow and feeds too much garbage to the GC.

Luckily, Elixir has a perfect answer to both problems. An IO stream works great for character sourcing, and IO lists are ideal for string building. If you’re unfamiliar with the latter, they are Erlang’s secret weapon and one of the things that make Phoenix so blazingly fast; there are great in-depth explanations of them with easy-to-grasp examples.
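If the improper-list accumulator looks odd, here is the trick in miniature (a toy example): each step wraps the old accumulator in a new cons cell instead of copying it, and we flatten once at the end.

```elixir
# Build an iolist by consing each chunk onto the accumulator;
# each step is O(1) and allocates one cons cell, not a new binary.
iolist =
  Enum.reduce(["one ", "two ", "three"], [], fn chunk, acc ->
    [acc | chunk]
  end)

# iolist is now [[[[] | "one "] | "two "] | "three"] - an improper
# list, but still valid iodata, which we flatten exactly once:
IO.iodata_to_binary(iolist)
# => "one two three"
```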

A new, faster, and cleaner version (note how the improper list makes the [acc | grapheme] part read so naturally):

defmodule StringTrimmer do
  def trim_utf8(string, len), do: trim(string, len, &utf8_counter/1)

  def trim_utf16(string, len), do: trim(string, len, &utf16_counter/1)

  def trim(string, len, counter) do
    with {:ok, stringio} <- string |> StringIO.open(),
         stream <- stringio |> IO.stream(1),
         {_count, trimmed_iolist} <-
           stream
           |> Enum.reduce_while({0, []}, fn grapheme, {count, acc} ->
             count = count + counter.(grapheme)

             if count <= len,
               do: {:cont, {count, [acc | grapheme]}},
               else: {:halt, {count, acc}}
           end) do
      trimmed_iolist |> IO.iodata_to_binary
    end
  end

  defp utf8_counter(string),
    do: string |> String.codepoints() |> Enum.count()

  defp utf16_counter(string) do
    string
    |> :unicode.characters_to_binary(:utf8, :utf16)
    |> byte_size()
    |> div(2)
  end
end

Is it final? Well, not quite.

What a Nice Device

It passes all the tests, except this really stupid-looking one.

iex(1)> 1..1000000 |> Enum.each(fn _ -> StringTrimmer.trim_utf16("sdsdsds", 2) end)

13:20:32.197 [error] Too many processes

** (SystemLimitError) a system limit has been reached
    :erlang.spawn_opt(:proc_lib, :init_p, [#PID<0.110.0>, [#PID<0.83.0>], :gen, :init_it, [:gen_server, #PID<0.110.0>, #PID<0.110.0>, StringIO, {"sdsdsds", []}, []]], [:link])
    (stdlib 4.0.1) proc_lib.erl:191: :proc_lib.spawn_opt/4
    (stdlib 4.0.1) proc_lib.erl:349: :proc_lib.start_link/5
    iex:7: StringTrimmer.trim/3
    (elixir 1.13.4) lib/enum.ex:942: anonymous fn/3 in Enum.each/2
    (elixir 1.13.4) lib/enum.ex:4136: Enum.reduce_range/5
    (elixir 1.13.4) lib/enum.ex:2400: Enum.each/2

13:20:32.211 [error] Too many processes

13:20:32.211 [error] Process #PID<0.83.0> raised an exception
** (SystemLimitError) a system limit has been reached
    :erlang.spawn(:erlang, :apply, [#Function<0.74613015/0 in IEx.Server.loop/3>, []])
    :erlang.spawn/1
    (iex 1.13.4) lib/iex/server.ex:135: IEx.Server.loop/3

Oh noes. Not something you’d like to run into in production (yes, I did). Since an IO device is a (stateful) process, we need to close it after opening it; otherwise, the process keeps running forever. I don’t like the idea of closing it inside the with macro, and luckily there is StringIO.open/2 for exactly this case. It accepts a function as the second argument and closes the device once the function returns a value or fails.
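In miniature, the function-passing variant works like this; by the time the result comes back, the device process is already gone:

```elixir
# StringIO.open/2 runs the function against a fresh device,
# closes the device, and wraps the function's result in {:ok, _}.
{:ok, result} =
  StringIO.open("hello", fn device ->
    IO.read(device, 2)
  end)

result
# => "he"
```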

defmodule StringTrimmer do
  def trim_utf8(string, len), do: trim(string, len, &utf8_counter/1)

  def trim_utf16(string, len), do: trim(string, len, &utf16_counter/1)

  def trim(string, len, counter) do
    with {:ok, {_count, trimmed_iolist}} <-
         string
         |> StringIO.open(fn stringio ->
           stringio
           |> IO.stream(1)
           |> Enum.reduce_while({0, []}, fn grapheme, {count, acc} ->
             count = count + counter.(grapheme)

             if count <= len,
               do: {:cont, {count, [acc | grapheme]}},
               else: {:halt, {count, acc}}
           end)
         end) do
      trimmed_iolist |> IO.iodata_to_binary()
    end
  end

  defp utf8_counter(string),
    do: string |> String.codepoints() |> Enum.count()

  defp utf16_counter(string) do
    string
    |> :unicode.characters_to_binary(:utf8, :utf16)
    |> byte_size()
    |> div(2)
  end
end

iex(2)> 1..1000000 |> Enum.each(fn _ -> StringTrimmer.trim_utf16("sdsdsds", 2) end)
:ok

Phew.

Hopefully, not to be continued.

PS. As more senior colleagues pointed out, String.splitter/3 fed into Enum.reduce_while/3 would do the trick without spawning an extra process for StringIO. Cleaner and more efficient, but tinkering with IO was such fun that I just couldn’t help myself.
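For the curious, here is one possible process-free shape of that idea. This sketch streams graphemes lazily with Stream.unfold/2 and String.next_grapheme/1 rather than String.splitter, and LazyTrimmer is a made-up name:

```elixir
defmodule LazyTrimmer do
  # Same reduce_while core as StringTrimmer, but graphemes are
  # produced lazily by String.next_grapheme/1 - no extra process.
  def trim(string, len, counter) do
    string
    |> Stream.unfold(&String.next_grapheme/1)
    |> Enum.reduce_while({0, []}, fn grapheme, {count, acc} ->
      count = count + counter.(grapheme)

      if count <= len,
        do: {:cont, {count, [acc | grapheme]}},
        else: {:halt, {count, acc}}
    end)
    |> elem(1)
    |> IO.iodata_to_binary()
  end
end

LazyTrimmer.trim("sdsdsds", 2, fn g -> g |> String.codepoints() |> length() end)
# => "sd"
```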