SSA subtitle encoding issue

setht
New Cone
Posts: 4
Joined: 10 Aug 2020 07:53


Postby setht » 10 Aug 2020 09:30

In an MKV file, the subtitle lines that include music notes (♪) are displayed as (â™ª).

I have tried setting "default encoding" under settings to "default (windows-1252)" and to every UTF variation.

Using mkvextract --fullraw I see "C3 A2 E2 84 A2 C2 AA", which is being interpreted as UTF-8 and turned into â™ª.
But when I use

Code:

mkvextract <file> tracks -c windows-1252 2:subs.srt
and load subs.srt it works correctly (but not if I use -c utf8)

EDIT: Why does VLC's reading of this line not match mkvextract / iconv (see 2nd post)?

---

Debug info

Code:

vlc --version
VLC media player 3.0.8 Vetinari (revision 3.0.8-0-gf350b6b5a7)

Code:

mkvextract <file> tracks -c windows-1252 2:subs.srt

[Script Info]
; Script generated by FFmpeg/Lavc58.55.100
ScriptType: v4.00+
PlayResX: 384
PlayResY: 288

[V4+ Styles]
Format: Name, Fontname, Fontsize, PrimaryColour, SecondaryColour, OutlineColour, BackColour, Bold, Italic, Underline, StrikeOut, ScaleX, ScaleY, Spacing, Angle, BorderStyle, Outline, Shadow, Alignment, MarginL, MarginR, MarginV, Encoding
Style: Default,Arial,16,&Hffffff,&Hffffff,&H0,&H0,0,0,0,0,100,100,0,0,1,1,0,2,10,10,10,0

[Events]
Format: Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text
Dialogue: 0,0:00:01.24,0:00:03.91,Default,,0,0,0,,[bright tone]
...
Dialogue: 0,0:03:48.91,0:03:50.50,Default,,0,0,0,,♪ ♪

The Encoding field of the Style line is 0, which according to http://moodub.free.fr/video/ass-specs.doc means:
"Encoding. This specifies the font character set or encoding and on multi-lingual Windows installations it provides access to characters used in multiple than one languages. It is usually 0 (zero) for English (Western, ANSI) Windows."


If I use

Code:

mkvextract <file> tracks -c windows-1252 2:subs.srt
gedit shows the correct character (and vlc displays them correctly)
"Dialogue: 0,0:03:48.91,0:03:50.50,Default,,0,0,0,,♪ ♪"


If I use

Code:

mkvextract <file> tracks --fullraw 2:test.txt
ghex shows
"113,0,Default,,0,0,0,,......."
and bytes (hand transcribed so ~5% error rate)
"31 31 33 2C 30 2C 44 65 66 61 75 6C 74 2C 2C 30 2C 30 2C 30 2C 2C C3 A2 E2 84 A2 C2 AA 20 C3 A2 E2 84 A2 C2 AA"
The text portion is "C3 A2 E2 84 A2 C2 AA", which interpreted as UTF-8 [1] turns into â™ª



[1] https://utf8-chartable.de/unicode-utf8- ... 28&names=-
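
The UTF-8 reading of those bytes can be checked without a lookup table. This is a minimal Python sketch, assuming the hand-transcribed bytes above are accurate; the variable names are only illustrative.

```python
# Bytes of the subtitle text, as transcribed from ghex above.
HEX_DUMP = "C3 A2 E2 84 A2 C2 AA 20 C3 A2 E2 84 A2 C2 AA"

raw = bytes.fromhex(HEX_DUMP)   # spaces between byte pairs are skipped
text = raw.decode("utf-8")      # what a UTF-8-only reader sees

# List every decoded character with its Unicode code point.
code_points = [f"U+{ord(ch):04X}" for ch in text]

print(text)
print(code_points)
```

Each group of seven bytes decodes to three characters (U+00E2, U+2122, U+00AA) rather than to the single music note U+266A.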

Code:

mediainfo <file>

Text
ID : 3
Format : ASS
Codec ID : S_TEXT/ASS
Codec ID/Info : Advanced Sub Station Alpha
Compression mode : Lossless
Writing library : Lavc58.55.100 ssa
Language : English
Default : Yes
Forced : No
Statistics Tags Issue : mkvmerge v24.0.0 ('Beyond The Pale') 64-bit 2019-08-17 01:03:24 / Lavf58.30.100
FromStats_BitRate : 103
FromStats_Duration : 00:27:52.958000000
FromStats_FrameCount : 796
FromStats_StreamSize : 21706

Last edited by setht on 10 Aug 2020 10:23, edited 3 times in total.


Re: SSA subtitle encoding issue

Postby setht » 10 Aug 2020 10:20

I followed this a little deeper by building mkvextract from source and adding some debug statements.

Code:

$ echo "113,0,Default,,0,0,0,,â™ª â™ª" > test.txt
$ hexdump -n 180 test.txt
# Verify this matches "31 31 33 2C 30 2C 44 65 66 61 75 6C 74 2C 2C 30 2C 30 2C 30 2C 2C C3 A2 E2 84 A2 C2 AA 20 C3 A2 E2 84 A2 C2 AA"
00000000 31 31 33 2c 30 2c 44 65 66 61 75 6c 74 2c 2c 30 |113,0,Default,,0|
00000010 2c 30 2c 30 2c 2c c3 a2 e2 84 a2 c2 aa 20 c3 a2 |,0,0,,....... ..|
00000020 e2 84 a2 c2 aa 0a |......|
$ cat test.txt | iconv -t utf-8
113,0,Default,,0,0,0,,â™ª â™ª
$ cat test.txt | iconv -t windows-1252
113,0,Default,,0,0,0,,♪ ♪

So it looks like this only works when iconv (what mkvextract uses under the covers) is pointed at windows-1252.
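
A minimal Python sketch of what the iconv run above appears to do at the byte level. It assumes iconv read test.txt as UTF-8 (its input encoding defaults to the locale, taken here to be UTF-8). `iconv_like` is a hypothetical helper written for this illustration, not part of iconv or mkvextract.

```python
def iconv_like(data: bytes, to_encoding: str, from_encoding: str = "utf-8") -> bytes:
    """Hypothetical stand-in for iconv: decode the input, then re-encode it."""
    return data.decode(from_encoding).encode(to_encoding)


# Contents of test.txt, copied from the hexdump output above.
test_txt = bytes.fromhex(
    "31 31 33 2c 30 2c 44 65 66 61 75 6c 74 2c 2c 30"
    "2c 30 2c 30 2c 2c c3 a2 e2 84 a2 c2 aa 20 c3 a2"
    "e2 84 a2 c2 aa 0a"
)

as_utf8 = iconv_like(test_txt, "utf-8")     # UTF-8 to UTF-8: bytes unchanged
as_cp1252 = iconv_like(test_txt, "cp1252")  # one output byte per character

print(as_utf8 == test_txt)
print(" ".join(f"{b:02x}" for b in as_cp1252[-8:]))
print(as_cp1252.decode("utf-8"), end="")    # how a UTF-8 terminal renders it
```

Under that assumption, the windows-1252 conversion writes the bytes e2 99 aa for each note, and a UTF-8 terminal or editor then reads those bytes as ♪.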


Re: SSA subtitle encoding issue

Postby setht » 10 Aug 2020 10:27

Searching some more I found this comment (from 2010) saying you can't change the encoding on SSA/ASS subtitles?

https://forum.videolan.org/viewtopic.ph ... 91#p249717

Rémi Denis-Courmont
Developer
Posts: 15312
Joined: 07 Jun 2004 16:01
VLC version: master
Operating System: Linux
Contact:

Re: SSA subtitle encoding issue

Postby Rémi Denis-Courmont » 10 Aug 2020 18:43

The ♪ character does not even exist in Windows-1252, so for sure that's not the correct encoding. That sequence C3 A2 ... AA is the UTF-8 encoding of â™ª, or the Windows-1252 encoding of Ã¢â„¢Âª. In any case, I think this is handled by libass, not VLC.
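
Both statements can be checked with Python's built-in codecs. This is a minimal sketch, assuming Python's "cp1252" codec is an acceptable stand-in for Windows-1252; `encodable` is a hypothetical helper name.

```python
NOTE = "\u266a"  # EIGHTH NOTE, the music note sign


def encodable(ch: str, encoding: str) -> bool:
    """Return True if the character has a representation in the given encoding."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


print(encodable(NOTE, "cp1252"))  # the note has no Windows-1252 byte
print(encodable(NOTE, "utf-8"))

# The seven bytes from the raw dump, read as Windows-1252 instead of UTF-8.
seq = bytes.fromhex("C3 A2 E2 84 A2 C2 AA")
print(seq.decode("cp1252"))
```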
Rémi Denis-Courmont
https://www.remlab.net/
Private messages soliciting support will be systematically discarded


Re: SSA subtitle encoding issue

Postby setht » 10 Aug 2020 20:06

Thanks. Could you confirm that VLC is not passing an encoding to libass?
I spent some time looking at modules/codec/libass.c (https://code.videolan.org/videolan/vlc/ ... c/libass.c) and I don't see any encoding being passed to libass there, but I obviously don't know all the other places where it could happen.


Re: SSA subtitle encoding issue

Postby Rémi Denis-Courmont » 10 Aug 2020 22:18

As far as I know, ASS text is always UTF-8 encoded. There's nothing for VLC to give or libass to take in this respect. And I don't see how that would fix the problem.

The author botched the encoding by opening the UTF-8 text as if it were Windows-1252 and then converting the result to UTF-8.

The correct byte sequence for the music note sign is E2 99 AA.
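
The mistake described here can be reproduced in a few lines. This is a minimal Python sketch, assuming the authoring tool used Windows-1252 (Python's "cp1252" codec) for the wrong read; the actual tool and its settings are not known from this thread.

```python
CORRECT = "\u266a"  # EIGHTH NOTE, the intended character

# Step 1: the correct UTF-8 bytes for the note.
correct_bytes = CORRECT.encode("utf-8")

# Step 2: the bytes are wrongly opened as Windows-1252, giving three characters.
misread = correct_bytes.decode("cp1252")

# Step 3: that misread text is saved as UTF-8, producing the bytes in the file.
botched_bytes = misread.encode("utf-8")

print(" ".join(f"{b:02X}" for b in correct_bytes))
print(misread)
print(" ".join(f"{b:02X}" for b in botched_bytes))
```

The last line prints the same seven bytes that mkvextract --fullraw showed, which is consistent with the file having been double-encoded before VLC ever reads it.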

