author: wm4 <wm4@nowhere> 2013-08-15 21:14:00 +0200
committer: wm4 <wm4@nowhere> 2013-08-15 23:40:03 +0200
commit: 00f735d5cba22713ba9a377876b7cfd333c0b2b9
tree: 02048f5ef40c05e8b4492aacd592f08bf4150369
parent: acb51c9243c7861774af6ad592acc07490fa7e7c
bstr: make UTF-8 check stricter
Don't accept overlong sequences. Don't accept codepoints past the maximum Unicode codepoint. Don't accept the UTF-16 surrogate codepoints.

I'm not sure if there are more codepoints that are defined to be invalid, but we just want to make libavcodec happy, so this is enough. (libavcodec's subtitle converter checks for valid UTF-8 and throws up and dies if it's not. We now want to use bstr_sanitize_utf8_latin1() to force valid UTF-8, so our UTF-8 parser has to be at least as strict as libavcodec's check.)

I'm not sure whether the min test is actually 100% correct.

Note that libavcodec also treats BOM codepoints as invalid. This is definitely a bug: the BOM is really just "zero-width non-breaking space" redefined by Microsoft, and it is perfectly valid for it to appear in the middle of a string. Official Unicode has merely deprecated the old usage of the BOM codepoint, and didn't make it illegal. Besides, the string could be from the start of a file, so this check doesn't make sense even by libavcodec's insane logic. We don't copy this bug.
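For illustration only, here is a minimal standalone sketch of a decoder applying the three rejections described above (overlong sequences, codepoints past 0x10FFFF, UTF-16 surrogates) while still accepting the BOM codepoint. The function name decode_utf8_strict and its signature are hypothetical and are not mpv's bstr API; the real change is the diff to bstr_decode_utf8() in this commit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper, NOT part of mpv: decode one UTF-8 sequence from
// buf[0..len). Returns the codepoint, or -1 if the sequence is invalid.
// On success, *consumed is set to the number of bytes used.
static int32_t decode_utf8_strict(const unsigned char *buf, size_t len,
                                  size_t *consumed)
{
    assert(buf != NULL && consumed != NULL);
    if (len == 0)
        return -1;

    unsigned char lead = buf[0];
    if (lead < 0x80) {              // plain ASCII, one byte
        *consumed = 1;
        return lead;
    }

    int bytes;
    uint32_t cp;
    if ((lead & 0xE0) == 0xC0) {
        bytes = 2; cp = lead & 0x1F;
    } else if ((lead & 0xF0) == 0xE0) {
        bytes = 3; cp = lead & 0x0F;
    } else if ((lead & 0xF8) == 0xF0) {
        bytes = 4; cp = lead & 0x07;
    } else {
        return -1;                  // stray continuation or invalid lead byte
    }
    if (len < (size_t)bytes)
        return -1;                  // truncated sequence

    for (int i = 1; i < bytes; i++) {
        if ((buf[i] & 0xC0) != 0x80)
            return -1;              // not a continuation byte
        cp = (cp << 6) | (buf[i] & 0x3F);
    }

    // Past the last Unicode codepoint, or a UTF-16 surrogate.
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return -1;

    // Overlong: the value would have fit into a shorter sequence.
    uint32_t min = bytes == 2 ? 0x80 : 1u << (5 * bytes - 4);
    if (cp < min)
        return -1;

    *consumed = (size_t)bytes;
    return (int32_t)cp;
}
```

With this sketch, the bytes C0 80 (overlong NUL), ED A0 80 (U+D800) and F4 90 80 80 (0x110000) are all rejected, while EF BB BF (the BOM, U+FEFF) decodes normally.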
-rw-r--r-- mpvcore/bstr.c | 8
1 file changed, 8 insertions, 0 deletions
diff --git a/mpvcore/bstr.c b/mpvcore/bstr.c
index bbc3885b42..996edb7dfe 100644
--- a/mpvcore/bstr.c
+++ b/mpvcore/bstr.c
@@ -279,6 +279,14 @@ int bstr_decode_utf8(struct bstr s, struct bstr *out_next)
codepoint = (codepoint << 6) | (tmp & ~0xC0);
s.start++; s.len--;
}
+ if (codepoint > 0x10FFFF || (codepoint >= 0xD800 && codepoint <= 0xDFFF))
+ return -1;
+ // Overlong sequences - check taken from libavcodec.
+ // (The only reason we even bother with this is to make libavcodec's
+ // retarded subtitle utf-8 check happy.)
+ unsigned int min = bytes == 2 ? 0x80 : 1 << (5 * bytes - 4);
+ if (codepoint < min)
+ return -1;
}
if (out_next)
*out_next = s;
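
The commit message is unsure whether the min test is correct. As a rough sanity check, the expression can be pulled out into a small helper (the name utf8_min_codepoint is made up for this sketch and does not exist in mpv) and compared against the standard UTF-8 length boundaries: 2-byte sequences start at U+0080, 3-byte sequences at U+0800, and 4-byte sequences at U+10000.

```c
#include <assert.h>

// Hypothetical helper, NOT part of mpv: smallest codepoint that needs
// `bytes` bytes in UTF-8, using the same expression as the patch.
static unsigned int utf8_min_codepoint(int bytes)
{
    assert(bytes >= 2 && bytes <= 4);
    // bytes == 3: 1 << 11 == 0x800; bytes == 4: 1 << 16 == 0x10000.
    // The 2-byte case does not fit the shift pattern (1 << 6 would be
    // 0x40), hence the special case.
    return bytes == 2 ? 0x80 : 1u << (5 * bytes - 4);
}
```

For sequence lengths 2 to 4, which are the only multi-byte lengths that can encode a codepoint up to 0x10FFFF, the expression matches those boundaries.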