Adworse — Programmer's travelogue
by Dima Ermilov | Tagged: dive
The Curious Case of String.slice/3
Slicing a string isn’t tricky, right? Or is it? Elixir works wonderfully with strings, but some edge cases can be a little confusing. This one led me down a rabbit hole, all the way to IO devices and improper lists.
Say we have some text data like a company name, a web page to show it on, and a database to store it in. Say we’re not the ones who designed the web page or the DB, so both have an 80-character limit, and there’s nothing we can do about it. To keep things smooth, we’d like to add some trimming to our changeset.
defmodule Naive do
  def trim_name(name) do
    name |> String.slice(0, 80)
  end
end
Looks nice and clean and works fine until it doesn’t:
iex> name = "समुद्र के भीतर अचानक जब बड़ी तेज़ हलचल होने लगती है तो उसमें तूफान उठता है जिससे ऐसी \
लंबी और बहुत ऊंची लहरों का रेला उठना शुरू हो जाता"
|> Naive.trim_name
"समुद्र के भीतर अचानक जब बड़ी तेज़ हलचल होने लगती है तो उसमें तूफान उठता है जिससे ऐसी लंबी और बहुत ऊंची लहरों का"
iex> name |> String.length
80
iex> Repo.insert(%Company{name: name})
** (Postgrex.Error) ERROR 22001 (string_data_right_truncation) value
too long for type character varying(80)
Wait, what? Oh yeah. Elixir is very smart and Unicode-aware: it knows about graphemes and counts each one as a single character, just as a native reader would. Isn’t it great? It lets us display complex foreign text without literally breaking it in the middle of a letter. The only downside of this approach is that databases are usually not that smart: they count characters by codepoints, not by graphemes.
iex> name |> String.codepoints() |> Enum.count
111
So some of these very beautiful letters are two codepoints, not one. OK, we can deal with it.
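Here’s the same effect in miniature, using "é" written as "e" plus a combining acute accent (U+0301) — a decomposed form you’ll meet in real-world data:

```elixir
# "é" spelled as two codepoints: "e" plus a combining acute accent.
# Elixir still counts it as a single grapheme.
name = "e\u0301"

String.length(name)
# one grapheme — what a reader sees

name |> String.codepoints() |> Enum.count()
# two codepoints — what the database counts
```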
defmodule Naive do
  def trim_name(name) do
    name
    |> String.codepoints()
    |> Enum.take(80)
    |> to_string()
  end
end
iex> Repo.insert(%Company{name: name |> Naive.trim_name})
{:ok, %Company{name: "समुद्र के भीतर अचानक जब बड़ी तेज़ हलचल होने लगती है तो उसमें तूफान उठता है जिससे"}}
The downside is that now we keep graphemes intact only if we’re lucky. So we should count by codepoints but cut by graphemes. Enum.reduce_while/3 is at our service.
defmodule LessNaive do
  def trim_string(string) do
    string
    |> String.graphemes()
    |> Enum.reduce_while("", fn grapheme, acc ->
      if counter(acc <> grapheme) <= 80,
        do: {:cont, acc <> grapheme},
        else: {:halt, acc}
    end)
  end

  defp counter(string), do: string |> String.codepoints() |> Enum.count()
end
This keeps adding graphemes to the accumulator while the codepoint count of the resulting string is less than or equal to 80. The only downside is that recounting the whole accumulator on every step gives us O(n²), eww. Let’s fix it:
defmodule LessNaive do
  def trim_string(string) do
    string
    |> String.graphemes()
    |> Enum.reduce_while({0, ""}, fn grapheme, {count, acc} ->
      count = count + counter(grapheme)

      if count <= 80,
        do: {:cont, {count, acc <> grapheme}},
        else: {:halt, {count, acc}}
    end)
    |> elem(1)
  end

  defp counter(string), do: string |> String.codepoints() |> Enum.count()
end
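A quick sanity check with combining characters (the input is synthetic: "é" spelled as "e" plus U+0301, so every grapheme is exactly two codepoints, and the reduce is inlined here so the snippet runs on its own):

```elixir
# 50 copies of "e" + combining acute accent: 50 graphemes, 100 codepoints.
input = String.duplicate("e\u0301", 50)

trimmed =
  input
  |> String.graphemes()
  |> Enum.reduce_while({0, ""}, fn grapheme, {count, acc} ->
    count = count + (grapheme |> String.codepoints() |> Enum.count())

    if count <= 80,
      do: {:cont, {count, acc <> grapheme}},
      else: {:halt, {count, acc}}
  end)
  |> elem(1)

String.length(trimmed)
# 40 — only whole graphemes survive

trimmed |> String.codepoints() |> Enum.count()
# 80 — and we never exceed the codepoint budget
```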
Are we done yet? Well, kind of yes. And kind of no.
Level Up: HTML maxlength and Other Animals
iex> trimmed = "Sunny! 𐍈𐍈 Yesterday my life was filled with rain, \
Oh Sunny 𐍈𐍈You smiled at me and really eased the pain" \
|> LessNaive.trim_string
"Sunny! 𐍈𐍈 Yesterday my life was filled with rain, Oh Sunny 𐍈𐍈You smiled at me an"
iex> trimmed |> String.length
80
So far, so good. Ruby and Python count it the same way. Our new fancy trimmer works perfectly with any regular database, even with MySQL (given a utf8mb4 charset for the table).
But sometimes, on very rare occasions, it fails spectacularly.
% node
Welcome to Node.js v14.5.0.
Type ".help" for more information.
> "Sunny! 𐍈𐍈 Yesterday my life was filled with rain, \
... Oh Sunny 𐍈𐍈You smiled at me re".length
84
These occasions are: JavaScript string-length checkers in general and Node.js backends in particular, HTML inputs with a maxlength attribute, and, last but not least, Salesforce API text fields. The same thing, only nastier, happens if we try to insert this into the Salesforce Name field: STRING_TOO_LONG:Name: data value too large. How in the world is this possible? A bit of googling gives us the answer:
JavaScript’s internal string representation isn’t UTF-8 but UTF-16, and the JavaScript String length property returns the number of UTF-16 code units in that representation. In most cases there’s one UTF-16 code unit per Unicode codepoint, but codepoints outside the Basic Multilingual Plane take two, a so-called “surrogate pair”. Most single-codepoint emoji live outside the BMP, so each is two UTF-16 code units. The family emoji 👨‍👩‍👧‍👧 is one grapheme, 7 codepoints, and 11 UTF-16 code units.
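We can check those counts from Elixir itself: :unicode.characters_to_binary/3 re-encodes a string as UTF-16, two bytes per code unit (the ZWJ-joined family emoji is spelled out with \u200D escapes below to keep it copy-paste-safe):

```elixir
# "𐍈" (U+10348) lies outside the Basic Multilingual Plane,
# so UTF-16 needs a surrogate pair — two 16-bit units — for it.
hwair = "𐍈"

hwair |> String.codepoints() |> Enum.count()
# 1 codepoint

hwair
|> :unicode.characters_to_binary(:utf8, :utf16)
|> byte_size()
|> div(2)
# 2 UTF-16 code units

# The family emoji: four emoji joined by three zero-width joiners.
family = "👨\u200D👩\u200D👧\u200D👧"

family |> String.codepoints() |> Enum.count()
# 7 codepoints

family
|> :unicode.characters_to_binary(:utf8, :utf16)
|> byte_size()
|> div(2)
# 11 UTF-16 code units: 4 surrogate pairs plus 3 joiners
```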
So what shall we do? Luckily, Elixir has access to Erlang’s :unicode module, which has a character-to-binary conversion. The only thing left is to divide the byte size of the UTF-16 binary by two (one code unit is two bytes), so the counter looks like this:
def counter(string) do
  string
  |> :unicode.characters_to_binary(:utf8, :utf16)
  |> byte_size()
  |> div(2)
end
Let’s make our trimmer a bit more universal by passing both the counter function and the trim length to it.
defmodule StringTrimmer do
  def trim_utf8(string, len), do: trim(string, len, &utf8_counter/1)
  def trim_utf16(string, len), do: trim(string, len, &utf16_counter/1)

  def trim(string, len, counter) do
    string
    |> String.graphemes()
    |> Enum.reduce_while({0, ""}, fn grapheme, {count, acc} ->
      count = count + counter.(grapheme)

      if count <= len,
        do: {:cont, {count, acc <> grapheme}},
        else: {:halt, {count, acc}}
    end)
    |> elem(1)
  end

  defp utf8_counter(string), do: string |> String.codepoints() |> Enum.count()

  defp utf16_counter(string) do
    string
    |> :unicode.characters_to_binary(:utf8, :utf16)
    |> byte_size()
    |> div(2)
  end
end
Et voilà. Now we can trim like Postgres and trim like Salesforce while keeping our graphemes intact.
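A quick check on a short string (the module is restated here so the snippet runs standalone, and "a𐍈b" is a made-up input): the two flavors disagree exactly on surrogate-pair characters like 𐍈.

```elixir
defmodule StringTrimmer do
  def trim_utf8(string, len), do: trim(string, len, &utf8_counter/1)
  def trim_utf16(string, len), do: trim(string, len, &utf16_counter/1)

  def trim(string, len, counter) do
    string
    |> String.graphemes()
    |> Enum.reduce_while({0, ""}, fn grapheme, {count, acc} ->
      count = count + counter.(grapheme)

      if count <= len,
        do: {:cont, {count, acc <> grapheme}},
        else: {:halt, {count, acc}}
    end)
    |> elem(1)
  end

  defp utf8_counter(string), do: string |> String.codepoints() |> Enum.count()

  defp utf16_counter(string) do
    string
    |> :unicode.characters_to_binary(:utf8, :utf16)
    |> byte_size()
    |> div(2)
  end
end

# "𐍈" is 1 codepoint but 2 UTF-16 code units, so a limit of 2
# cuts the string at different places:
StringTrimmer.trim_utf8("a𐍈b", 2)
# "a𐍈" — two codepoints fit

StringTrimmer.trim_utf16("a𐍈b", 2)
# "a" — "𐍈" alone would blow the two-code-unit budget
```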
Will It Bite Us?
Oh, it sure will. I can think of at least two things that make me uncomfortable.
Say we have a 10K-character string and trim it to 5 characters. Creating a 10K-element list with String.codepoints/1 seems like overkill. Say we trim a 10K-character string to 9999 characters. String concatenation in Elixir is copy-on-write, so we effectively create 9999 intermediate strings, doing 9999 memory allocations. It looks like our function is both slow and makes too much work for the GC.

Luckily, Elixir has a perfect answer to both problems. IO.stream/2 works great for sourcing characters, and IO lists are ideal for string building. If you’re unfamiliar with the latter, they’re Erlang’s secret weapon and one of the things that make Phoenix so blazingly fast. Here is a great in-depth explanation with easy-to-grasp examples.
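To get a feel for iolists: instead of concatenating binaries, we cons each chunk onto a (possibly improper) list and flatten once at the end. A tiny sketch with made-up chunks:

```elixir
# Each step is a cheap cons cell; no binary copying happens
# until IO.iodata_to_binary/1 flattens the whole thing once.
iolist =
  Enum.reduce(["Sunny", "!", " ", "𐍈"], [], fn chunk, acc ->
    [acc | chunk]
  end)

IO.iodata_to_binary(iolist)
# "Sunny! 𐍈"
```

The `[acc | chunk]` form builds an improper list (the tail is a binary, not a list), which is perfectly legal iodata and saves one cons cell per step compared to `[acc, chunk]`.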
A new, faster, and cleaner version (note how using an improper list makes the [acc | grapheme] part more readable):
defmodule StringTrimmer do
  def trim_utf8(string, len), do: trim(string, len, &utf8_counter/1)
  def trim_utf16(string, len), do: trim(string, len, &utf16_counter/1)

  def trim(string, len, counter) do
    with {:ok, stringio} <- string |> StringIO.open(),
         stream <- stringio |> IO.stream(1),
         {_count, trimmed_iolist} <-
           stream
           |> Enum.reduce_while({0, []}, fn grapheme, {count, acc} ->
             count = count + counter.(grapheme)

             if count <= len,
               do: {:cont, {count, [acc | grapheme]}},
               else: {:halt, {count, acc}}
           end) do
      trimmed_iolist |> IO.iodata_to_binary()
    end
  end

  defp utf8_counter(string), do: string |> String.codepoints() |> Enum.count()

  defp utf16_counter(string) do
    string
    |> :unicode.characters_to_binary(:utf8, :utf16)
    |> byte_size()
    |> div(2)
  end
end
Is it final? Well, not quite.
What a Nice Device
It passes all the tests, except this really stupid-looking one.
iex(1)> 1..1000000 |> Enum.each(fn _ -> StringTrimmer.trim_utf16("sdsdsds", 2) end)
13:20:32.197 [error] Too many processes
** (SystemLimitError) a system limit has been reached
:erlang.spawn_opt(:proc_lib, :init_p, [#PID<0.110.0>, [#PID<0.83.0>], :gen, :init_it, [:gen_server, #PID<0.110.0>, #PID<0.110.0>, StringIO, {"sdsdsds", []}, []]], [:link])
(stdlib 4.0.1) proc_lib.erl:191: :proc_lib.spawn_opt/4
(stdlib 4.0.1) proc_lib.erl:349: :proc_lib.start_link/5
iex:7: StringTrimmer.trim/3
(elixir 1.13.4) lib/enum.ex:942: anonymous fn/3 in Enum.each/2
(elixir 1.13.4) lib/enum.ex:4136: Enum.reduce_range/5
(elixir 1.13.4) lib/enum.ex:2400: Enum.each/2
13:20:32.211 [error] Too many processes
13:20:32.211 [error] Process #PID<0.83.0> raised an exception
** (SystemLimitError) a system limit has been reached
:erlang.spawn(:erlang, :apply, [#Function<0.74613015/0 in IEx.Server.loop/3>, []])
:erlang.spawn/1
(iex 1.13.4) lib/iex/server.ex:135: IEx.Server.loop/3
Oh, noes. Not something you’d like to run into in production (yes, I did). Of course: an IO device is a (stateful) process, so we need to close it after opening it; otherwise the process keeps running. I don’t like the idea of closing it inside the with macro, and luckily there is StringIO.open/2 for exactly this. It accepts a function as its second argument and closes the device after the function returns a value or fails.
defmodule StringTrimmer do
  def trim_utf8(string, len), do: trim(string, len, &utf8_counter/1)
  def trim_utf16(string, len), do: trim(string, len, &utf16_counter/1)

  def trim(string, len, counter) do
    with {:ok, {_count, trimmed_iolist}} <-
           string
           |> StringIO.open(fn stringio ->
             stringio
             |> IO.stream(1)
             |> Enum.reduce_while({0, []}, fn grapheme, {count, acc} ->
               count = count + counter.(grapheme)

               if count <= len,
                 do: {:cont, {count, [acc | grapheme]}},
                 else: {:halt, {count, acc}}
             end)
           end) do
      trimmed_iolist |> IO.iodata_to_binary()
    end
  end

  defp utf8_counter(string), do: string |> String.codepoints() |> Enum.count()

  defp utf16_counter(string) do
    string
    |> :unicode.characters_to_binary(:utf8, :utf16)
    |> byte_size()
    |> div(2)
  end
end
iex(2)> 1..1000000 |> Enum.each(fn _ -> StringTrimmer.trim_utf16("sdsdsds", 2) end)
:ok
Phew.
Hopefully, not to be continued.
PS. As more senior colleagues pointed out, String.splitter/3 fed into Enum.reduce_while/3 would do the trick without spawning an additional process for StringIO. Cleaner and more efficient, but tinkering with IO was such fun that I just couldn’t help myself.
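For completeness, here’s a hedged sketch of that process-free idea. LazyTrimmer is a made-up name, and I’m using String.next_grapheme/1 with Stream.unfold/2 rather than String.splitter/3, to keep the lazy grapheme source explicit:

```elixir
defmodule LazyTrimmer do
  # Sketch: stream graphemes lazily straight off the binary and
  # build the result as an iolist — no StringIO process involved.
  def trim(string, len, counter) do
    string
    |> Stream.unfold(&String.next_grapheme/1)
    |> Enum.reduce_while({0, []}, fn grapheme, {count, acc} ->
      count = count + counter.(grapheme)

      if count <= len,
        do: {:cont, {count, [acc | grapheme]}},
        else: {:halt, {count, acc}}
    end)
    |> elem(1)
    |> IO.iodata_to_binary()
  end
end

LazyTrimmer.trim("hello", 3, fn s -> s |> String.codepoints() |> Enum.count() end)
# "hel"
```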