path: root/misc/charset_conv.c
Commit message | Author | Age | Files | Lines
* osx: consistent normalisation when searching for external files | Akemi | 2017-02-02 | 1 | -1/+5
  Several Unicode characters can be encoded in two different ways, either in a precomposed (NFC) or a decomposed (NFD) representation. Everywhere except on macOS, specifically HFS+, precomposed strings are used. Furthermore, on macOS we can get either precomposed or decomposed strings, for example when volumes not formatted as HFS+ are used. That can be the case for network-mounted devices (SMB, NFS) or optical/removable devices (UDF).

  This can make strings that are actually equal compare as unequal, which can happen when comparing strings from different sources, like the command line and the filesystem. This makes it mainly a problem on macOS systems. One case that can potentially break is the sub-auto option.

  To prevent that, we convert both the search string and the string we search in to the same normalised representation; specifically, we use the precomposed form, which is used everywhere else. This could potentially be a problem on other platforms too, though the likelihood of it occurring is very small. For those platforms we don't convert anything and just fall back to the input.

  Fixes #4016
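The mismatch described in this commit can be reproduced with two spellings of the same file name. This is a minimal sketch, not code from the commit: the file name and the helper `same_bytes` are hypothetical, and the comparison stands in for any byte-wise matching such as `strcmp()`.

```c
#include <stdbool.h>
#include <string.h>

/* "é" as the single precomposed code point U+00E9 (NFC). */
static const char name_nfc[] = "caf\xC3\xA9.srt";
/* "é" as "e" followed by the combining acute accent U+0301 (NFD). */
static const char name_nfd[] = "cafe\xCC\x81.srt";

/* Byte-wise comparison: knows nothing about Unicode normalisation. */
static bool same_bytes(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}
```

Both names render identically, yet `same_bytes(name_nfc, name_nfd)` is false, so both sides have to be brought to one normalisation form before they are compared.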
* charset_conv: fallback to interpreting subs as latin1 if iconv fails | wm4 | 2017-01-22 | 1 | -1/+1
  For display purposes, it's better to show scrambled text - at least that's a more actionable failure mode than spamming the terminal with FFmpeg nonsense error messages. This avoids the obnoxious and pointless "Invalid UTF-8 in decoded subtitles text; maybe missing -sub_charenc option" FFmpeg error, which will be spammed on every single subtitle event. We don't even have a -sub-charenc option, fuck FFmpeg. Did I mention fuck FFmpeg yet? Because fuck FFmpeg.
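A latin1 fallback always succeeds because every byte value maps to exactly one code point, so the result is valid UTF-8 even when the guess is wrong. The following is a minimal sketch of such a conversion, not the code this commit touches; `latin1_to_utf8` is a hypothetical helper.

```c
#include <stddef.h>
#include <stdlib.h>

/* Convert len bytes of Latin-1 into a newly allocated, NUL-terminated
 * UTF-8 string. Returns NULL if allocation fails; the caller frees it. */
static char *latin1_to_utf8(const unsigned char *in, size_t len)
{
    /* Each input byte becomes at most two output bytes. */
    char *out = malloc(len * 2 + 1);
    if (!out)
        return NULL;
    size_t o = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned char c = in[i];
        if (c < 0x80) {
            out[o++] = (char)c;                   /* ASCII passes through */
        } else {
            out[o++] = (char)(0xC0 | (c >> 6));   /* lead byte */
            out[o++] = (char)(0x80 | (c & 0x3F)); /* continuation byte */
        }
    }
    out[o] = '\0';
    return out;
}
```

Text in another legacy encoding comes out scrambled, but downstream UTF-8 checks no longer reject it.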
* charset_conv: support minimum compatibility to utf8:... syntax | wm4 | 2017-01-22 | 1 | -1/+5
  Because it's the most commonly used one, and trivial to support.
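One plausible way to accept the old form, as in `--sub-codepage=utf8:gb18030`, is to drop the prefix and treat the remainder as the codepage. Whether the commit does exactly this is an assumption; `strip_utf8_prefix` is a hypothetical name used only for this sketch.

```c
#include <stddef.h>
#include <strings.h>

/* Return the codepage name without a leading "utf8:" (case-insensitive).
 * The result points into the input string; nothing is allocated. */
static const char *strip_utf8_prefix(const char *cp)
{
    static const char prefix[] = "utf8:";
    size_t n = sizeof(prefix) - 1;
    if (strncasecmp(cp, prefix, n) == 0)
        return cp + n;
    return cp;
}
```

With this, `utf8:gb18030` and `gb18030` both resolve to `gb18030`.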
* options: drop deprecated --sub-codepage syntax | wm4 | 2017-01-19 | 1 | -70/+5
* charset_conv: fix "auto" fallback with uchardet not compiled | wm4 | 2016-12-28 | 1 | -1/+3
  Previously, this tried to open iconv with "auto" as the source codepage, instead of using the latin1 fallback. This also neutralizes the libavcodec dumbass UTF-8 check, which discards subtitles not in UTF-8 and shows an error message for ffmpeg CLI instead. Fixes #3954.
* charset_conv: simplify and change --sub-codepage option | wm4 | 2016-12-09 | 1 | -44/+49
  As documented in interface-changes.rst. This makes it much easier to follow what the heck is going on. Whether this is adequate for real-world use is unknown.
* charset_conv: drop enca and libguess support | wm4 | 2016-12-09 | 1 | -67/+0
  Enca is dead. libguess is relatively useless due to not having a universal detection mode. On the other hand, libuchardet is actively developed. Manpage changes in the following commit.
* charset_conv: Use CP949 instead of EUC-KR | Jeong Woon Choi | 2016-09-02 | 1 | -0/+5
  iconv distinguishes between euc-kr and cp949, while libguess and libuchardet don't (they only return euc-kr). EILSEQ occurs when the input encoding of iconv is set to euc-kr and the subs contain letters not included in euc-kr. Since cp949 is an extension of euc-kr, choose cp949 instead. Signed-off-by: wm4 <wm4@nowhere>
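The substitution described here amounts to a case-insensitive name check before the codepage is handed to iconv. A minimal sketch, with `prefer_cp949` as a hypothetical helper name rather than anything from the commit:

```c
#include <strings.h>

/* Map a detected "EUC-KR" to its superset "CP949"; leave every other
 * codepage name untouched. */
static const char *prefer_cp949(const char *cp)
{
    if (strcasecmp(cp, "EUC-KR") == 0)
        return "CP949";
    return cp;
}
```

Because CP949 extends EUC-KR, pure EUC-KR input still converts the same way, while the extra characters no longer trigger EILSEQ.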
* Relicense some non-MPlayer source files to LGPL 2.1 or later | wm4 | 2016-01-19 | 1 | -7/+7
  This covers source files which were added in mplayer2 and mpv times only, and where all code is covered by LGPL relicensing agreements. There are probably more files to which this applies, but I'm being conservative here.

  A file named ao_sdl.c exists in MPlayer too, but the mpv one is a complete rewrite, and was added some time after the original ao_sdl.c was removed. The same applies to vo_sdl.c, for which the SDL2 API is radically different in addition (MPlayer supports SDL 1.2 only).

  common.c contains only code written by me. But common.h is a strange case: although it originally was named mp_common.h and exists in MPlayer too, by now it contains only definitions written by uau and me. The exceptions are the CONTROL_ defines - thus not changing the license of common.h yet.

  codec_tags.c once contained large tables generated from MPlayer's codecs.conf, but all of these tables were removed.

  From demux_playlist.c I'm removing a code fragment from someone who was not asked; this probably could be done later (see commit 15dccc37).

  misc.c is a bit complicated to reason about (it was split off mplayer.c and thus contains random functions out of this file), but actually all functions have been added post-MPlayer. Except get_relative_time(), which was written by uau, but looks similar to 3 different versions of something similar in each of the Unix/win32/OSX timer source files. I'm not sure what that means with regard to copyright, so I've just moved it into another still-GPL source file for now.

  screenshot.c once had some minor parts of MPlayer's vf_screenshot.c, but they're all gone.
* charset_conv: check for UTF-8 if uchardet returns unknown | wm4 | 2015-12-20 | 1 | -0/+2
  When libuchardet returns an empty string, it can be either ASCII, UTF-8, or an unknown encoding. Try to distinguish it from the unknown case by checking for UTF-8. This avoids an annoying message, and avoids unnecessary processing (we convert invalid UTF-8 sequences to latin1 to work around libavcodec's braindead UTF-8 check).
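A UTF-8 check of the kind mentioned here can be done without any library. The validator below is a sketch written for this log, not the function mpv uses; `is_valid_utf8` is a hypothetical name. It rejects truncated sequences, bad continuation bytes, overlong encodings, surrogates, and code points above U+10FFFF.

```c
#include <stdbool.h>
#include <stddef.h>

/* Return true if the len bytes at s form well-formed UTF-8. */
static bool is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char c = s[i];
        if (c < 0x80) {          /* plain ASCII */
            i++;
            continue;
        }
        size_t need;             /* number of continuation bytes */
        unsigned long cp, min;   /* decoded value, smallest legal value */
        if ((c & 0xE0) == 0xC0) {
            need = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {
            need = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {
            need = 3; cp = c & 0x07; min = 0x10000;
        } else {
            return false;        /* stray continuation or invalid lead */
        }
        if (len - i - 1 < need)
            return false;        /* sequence cut off at end of buffer */
        for (size_t k = 1; k <= need; k++) {
            unsigned char cc = s[i + k];
            if ((cc & 0xC0) != 0x80)
                return false;
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;        /* overlong, out of range, or surrogate */
        i += need + 1;
    }
    return true;
}
```

Pure ASCII input also passes, which matches the commit's point that an empty detector result can mean ASCII or UTF-8.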
* sub: detect charset in demuxer | wm4 | 2015-12-17 | 1 | -0/+7
  Slightly simpler, and removes the need to pre-read all subtitle packets. This still does the subtitle charset conversion on the packet level (instead of converting when parsing the file), so in theory this still could provide a way to change the charset at runtime. But maybe even this should be removed, as FFmpeg is somewhat likely to get its own charset detection and conversion mechanism in the future. (Would have to keep the subtitle file in memory to allow changing the charset on the fly, I guess.)
* demux_libass: remove this demuxer | wm4 | 2015-11-11 | 1 | -18/+0
  This loaded external .ass files via libass. libavformat's .ass reader is now good enough, so use that instead. Apparently libavformat still doesn't support fonts embedded into text .ass files, but support for this has been accidentally broken in mpv for a while anyway. (And only 1 person complained.)
* sub: fix --sub-codepage UTF-8 with fallback | wm4 | 2015-09-01 | 1 | -0/+4
  Fixes e.g. --sub-codepage=utf8:gb18030 if the subtitle is UTF-8. This was broken in commit e5d31808. Also log the detected charset in verbose mode.
* charset_conv: use our own UTF-8 check with ENCA only | wm4 | 2015-08-04 | 1 | -6/+5
  Some charsets can look like valid UTF-8, but aren't UTF-8. One example is ISO-2022-JP. While ENCA apparently likes to misdetect real UTF-8, this is not the case with uchardet. uchardet can detect ISO-2022-JP correctly, but didn't even get to try, because our own UTF-8 check succeeded. So run the UTF-8 check when using ENCA only. Fixes #2195.
* charset_conv: "auto" encoding detection now uses uchardet. | Jehan | 2015-08-04 | 1 | -1/+3
  If mpv is not built with uchardet, "enca" is still the fallback default encoding detection.
* charset_conv: fix switched parameters | wm4 | 2015-08-02 | 1 | -1/+1
  Fixes #2186.
* charset_conv: add uchardet support | wm4 | 2015-08-02 | 1 | -0/+39
  For now, it needs to be explicitly selected. ENCA is still the default. This assumes uchardet returns iconv names. This doesn't seem to be always the case, and the result is lots of iconv errors. So explicitly check for this situation, and print a warning if it occurs. It's entirely possible that uchardet support is actually useless, because names are not necessarily iconv-compatible (but uchardet doesn't seem to document whether it attempts to return iconv-compatible names if possible). Fixes #908.
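One way to check whether a detected name is usable is to try opening an iconv descriptor for it and close it again. This is a sketch under that assumption, not the check from the commit; `iconv_knows_charset` is a hypothetical helper, and on some platforms the program must be linked with `-liconv`.

```c
#include <iconv.h>
#include <stdbool.h>

/* Return true if iconv can convert from the named charset to UTF-8. */
static bool iconv_knows_charset(const char *name)
{
    iconv_t cd = iconv_open("UTF-8", name);
    if (cd == (iconv_t)-1)
        return false;   /* name is unknown to this iconv implementation */
    iconv_close(cd);
    return true;
}
```

A caller could print a warning and fall back to another codepage whenever this returns false for a detector result.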
* charset_conv: make it possible to return an allocated string as guess | wm4 | 2015-08-01 | 1 | -4/+8
  uchardet is written in C++, and thus doesn't appreciate the value of using static strings, and internally stores the guessed charset as allocated std::string. Add a minimal hack to deal with this. (I don't appreciate that the code is potentially harder to understand by returning either a static or allocated string, but I do appreciate not having to litter the existing code with strdups.)
* sub: add detection via BOM | wm4 | 2014-07-22 | 1 | -4/+30
  Useful for Windows stuff. Actually, ENCA support should catch this, but, well, whatever, everyone seems to hate ENCA. Detection with BOM is trivial, although it needs some hackery to integrate it with the existing autodetection support. For one, change the default value of --sub-codepage to make this easier. Probably fixes issue #937 (the second part).
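BOM detection comes down to comparing the first bytes of the buffer against the known byte-order marks. The sketch below covers UTF-8 and UTF-16 only; `charset_from_bom` is a hypothetical name, and the returned strings are assumed to be names the converter accepts.

```c
#include <stddef.h>
#include <string.h>

/* Return a charset name if buf starts with a known byte-order mark,
 * or NULL if no BOM is present. */
static const char *charset_from_bom(const unsigned char *buf, size_t len)
{
    if (len >= 3 && memcmp(buf, "\xEF\xBB\xBF", 3) == 0)
        return "utf-8";
    if (len >= 2 && memcmp(buf, "\xFF\xFE", 2) == 0)
        return "utf-16le";
    if (len >= 2 && memcmp(buf, "\xFE\xFF", 2) == 0)
        return "utf-16be";
    return NULL;
}
```

Note that a UTF-32LE file also begins with `FF FE`, so this sketch would label it UTF-16LE; a fuller version would test the four-byte marks first.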
* build: include <strings.h> for strcasecmp() | wm4 | 2014-07-10 | 1 | -0/+1
  It happens to work without strings.h on glibc or with _GNU_SOURCE, but the POSIX standard requires including <strings.h>. Hopefully fixes OSX build.
* charset_conv: mp_msg conversions | wm4 | 2013-12-21 | 1 | -20/+21
* Split mpvcore/ into common/, misc/, bstr/ | wm4 | 2013-12-17 | 1 | -0/+287