[BUG] Lister: Automatic detection of OEM charset

icfu · Post by *icfu » 2006-07-12, 15:59 UTC

Well, to make it short:
It simply doesn't work!

No matter which NFO/ASCII art files I display in Lister, Lister always starts in ANSI mode.

What exactly is that function trying to do on autodetection?

Icfu

XPEHOPE3KA · Post by *XPEHOPE3KA » 2006-07-12, 16:30 UTC

I confirm it, and I confirm that it has never worked for me.

Sombra · Post by *Sombra » 2006-07-12, 16:35 UTC

Yes, it is confirmed for me, and for many other people too (I think).

http://ghisler.ch/board/viewtopic.php?p=87726&highlight=#87726
http://ghisler.ch/board/viewtopic.php?t=4033#30659

icfu wrote: "No matter which NFO/ASCII.."

...and diz and bla, bla

frenky · Post by *frenky » 2006-07-12, 19:15 UTC

It does not answer your question, but you could try http://www.totalcmd.net/plugring/nfoview.html

Be aware, there is one BUG: if you select text and drag it to TC, it throws an exception.

Post by *ghisler(Author) » 2006-07-13, 13:48 UTC

I'm sorry but real ANSI/OEM detection in all languages is simply not possible/reliable. Lister tries to do what it can, but it is far from optimum.

icfu · Post by *icfu » 2006-07-13, 14:15 UTC

ghisler(Author) wrote: "Lister tries to do what it can, but it is far from optimum."

I repeat my question:
What exactly is that function trying to do on autodetection?

"Far from optimum" sounds rather euphemistic, given that in the past years I have never found a single file that was detected properly.

You could autodetect REPETITION of characters for example, #248 (dec) is the black square which is very specific for NFO files, or you could make it possible to define specific file types for which lister is started with OEM font, like NFO.
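
As a rough illustration of the two suggestions above (a run of one repeated high byte, or a fixed list of file types that always open in OEM), here is a minimal Python sketch. The names `has_repeated_high_byte`, `should_start_in_oem` and `OEM_EXTENSIONS`, the run length of 3 and the extension list are all assumptions made for this example, not anything Lister is known to do.

```python
import os

# Hypothetical list of file types that should always open with the OEM font.
OEM_EXTENSIONS = {".nfo", ".diz"}


def has_repeated_high_byte(data: bytes, min_run: int = 3) -> bool:
    """Return True if some byte >= 0x80 occurs min_run times in a row."""
    run_byte, run_len = None, 0
    for b in data:
        if b == run_byte:
            run_len += 1
        else:
            run_byte, run_len = b, 1
        # Only runs of high bytes count; ASCII runs like "-----" are ignored.
        if b >= 0x80 and run_len >= min_run:
            return True
    return False


def should_start_in_oem(filename: str, data: bytes) -> bool:
    """Combine the extension rule with the repetition rule."""
    ext = os.path.splitext(filename)[1].lower()
    return ext in OEM_EXTENSIONS or has_repeated_high_byte(data)
```

With a run length of 3, ordinary Danish text such as "rødgrød" would not trigger the check, while a row of identical block bytes would.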

Icfu

XPEHOPE3KA · Post by *XPEHOPE3KA » 2006-07-13, 18:45 UTC

2ghisler(Author)
Is your algorithm language-dependent?

Could it be that the algorithm turned into dead code in your source one happy day?

Clo · Post by *Clo » 2006-07-14, 06:36 UTC

2icfu

Hi Jeff !

• Is it necessary that I add my confirmation? Indeed, that never worked …

icfu wrote: "You could autodetect REPETITION of characters for example, #248 (dec)"

- I was thinking of something along those lines…
- However, maybe for several specific characters, not only the #248 (which is normally used a lot in Danish with ANSI : ø) and like XPEHOPE3KA is wondering about, according to the current language stated in the <wincmd.ini>…
…that is not absolutely perfect :
- For example, this "ø" doesn't exist in French and many other languages,
but it is sometimes used as an abbreviation for "diameter" (I do that, and with #216 too…).

icfu wrote: "…'Far from optimum' sounds rather euphemistic…"

Right! In French, this is amusingly called «a sweet euphemism».

VG
Claude
Clo

Post by *ghisler(Author) » 2006-07-14, 09:32 UTC

So do you know of any good detection routines?

Clo · Post by *Clo » 2006-07-14, 10:20 UTC

2ghisler(Author)

Good afternoon,

ghisler(Author) wrote: "So do you know of any good detection routines?"

Personally, I know "0" or almost nothing about programming!

• But just an idea: our friend Alextp has to work on the same problem, as he said in his To-Do list
for his new alternative Lister viewer tool…
Maybe you could exchange ideas?

VG
Claude
Clo

icfu · Post by *icfu » 2006-07-14, 12:04 UTC

@Clo and interested developers:

Clo wrote: "- However, maybe for several specific characters, not only the #248 (which is normally used a lot in Danish with ANSI : ø) and like XPEHOPE3KA is wondering about, according to the current language stated in the <wincmd.ini>…
…that is not absolutely perfect :"

It IS perfect when, like I have proposed, you are using REPETITION of characters. Are there any Danish words containing "øøø" for example?

I think that only with that specific character appearance you can catch about 99% of all NFO files correctly. This is about a 10000000000% increase over the present situation, roughly estimated of course.

Icfu

XPEHOPE3KA · Post by *XPEHOPE3KA » 2006-07-14, 13:14 UTC

1. Heh, it (the current algorithm) seldom works.
2. But only for files WITHOUT pseudographics characters.
3. And only for files which begin with a sequence of codes > 128. However, many bilingual readmes begin in English but end in Russian (possibly in OEM).

So, Christian, you need to:
1. Analyze both the beginning & the end of files, for accuracy of prediction.
2. Leave the current algorithm almost unchanged, but add some code that checks for repetition of pseudographics characters. This check should be done only for "horizontal" characters. Characters such as (this is ONE group) ─,┌,┐,└,┘,├,┤,┬,┴,┼,╓,╖,╙,╜,╥,╨,╫,╟,╢ should be next to each other in a line. These: ═,╒,╔,╕,╗,╘,╚,╛,╝,╞,╠,╡,╣,╤,╦,╧,╩,╬,╪ form the second group. The other pseudographics characters should be followed by themselves, like ▒▒▒.
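
The two steps above could be sketched roughly like this in Python. This is only an illustration under several assumptions: the byte values are taken from code page 437 via Python's `cp437` codec, the set of "other" block characters, the 4096-byte window and the threshold of 8 adjacent pairs are invented for the example, and all function and constant names are hypothetical.

```python
# Byte values of the two "horizontal" groups, assuming code page 437.
GROUP_SINGLE = frozenset("─┌┐└┘├┤┬┴┼╓╖╙╜╥╨╫╟╢".encode("cp437"))
GROUP_DOUBLE = frozenset("═╒╔╕╗╘╚╛╝╞╠╡╣╤╦╧╩╬╪".encode("cp437"))
# Assumed set of "other" pseudographics that should repeat themselves.
BLOCKS = frozenset("░▒▓█▄▌▐▀■".encode("cp437"))


def count_oem_pairs(data: bytes) -> int:
    """Count adjacent byte pairs that look like pseudographics."""
    pairs = 0
    for a, b in zip(data, data[1:]):
        same_group = (a in GROUP_SINGLE and b in GROUP_SINGLE) or (
            a in GROUP_DOUBLE and b in GROUP_DOUBLE
        )
        repeated_block = a == b and a in BLOCKS
        if same_group or repeated_block:
            pairs += 1
    return pairs


def looks_like_oem(data: bytes, window: int = 4096, min_pairs: int = 8) -> bool:
    """Check both the beginning and the end of the file contents."""
    head = data[:window]
    # Start the tail after the head so short files are not counted twice.
    tail = data[max(window, len(data) - window):]
    return count_oem_pairs(head) + count_oem_pairs(tail) >= min_pairs
```

A threshold above a handful of pairs seems necessary because uppercase Cyrillic letters in ANSI (Windows-1251) occupy $C0..$DF, the same byte range as many of these characters, so a word in capitals can produce a few accidental pairs.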

Clo · Post by *Clo » 2006-07-14, 16:43 UTC

2icfu

icfu wrote: "Are there any Danish words containing "øøø" for example?"

I don't know, but the writer could be a stammerer…

- Seriously : Certainly, there is a way around this to improve that feature.
Anyway, it was a nail which needed to be hammered again !

VG
Claude
Clo

AlleyKat · Post by *AlleyKat » 2006-07-15, 00:31 UTC

I admittedly often do write 'øøøh' and 'æææh' along with 'hmm' on my own forum...

nah just kidding, well almost anyway...

I don't see the big issue having to press s (or whatever the right shortcut is), but honestly feel that utf-8 problems in Compare are worse? I dunno...

Alextp · Post by *Alextp » 2006-07-22, 23:06 UTC

My suggestion for auto-detection.
It works well with all NFO files.
This may be used in the new version of ATViewer.

Code:

{
The idea of detection is the following:
NFO files contain pseudo-graphics, i.e. chars between $B0 and $DF.
In practice only 5-12 chars from this interval are used,
so some char with a frequency of at least 15-20% exists.
Normal ANSI texts will not be detected, because the maximal frequency of text
letters is about 10-12% (and most letters are outside of the interval $B0-$DF).
}

type
  TOemFreqTable = array[$80..$FF] of integer;

function IsTableOem(const Table: TOemFreqTable): boolean;
var
  i: integer;
begin
  Result:= false;
  for i:= $B0 to $DF do
    if Table[i]>=20 then
      begin Result:= true; Break end;
end;

function CalcTable(const fn: string; var Table: TOemFreqTable): boolean;
var
  h: THandle;
  Buffer: AnsiString; // one byte per char, also on Unicode-enabled compilers
  FSize, BytesRead: DWORD;
  i, n: integer;
  TableSize: integer;
begin
  Result:= false;
  TableSize:= 0;
  FillChar(Table, SizeOf(Table), 0);

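  // FFileOpen and FFileSize are helper routines that are not shown in this post.
  h:= FFileOpen(fn);
  if h=INVALID_HANDLE_VALUE then Exit;

  try
    FSize:= FFileSize(fn);
    // Empty file: nothing to count, and Buffer[1] would be out of range.
    if FSize=0 then begin Result:= true; Exit end;
    SetLength(Buffer, FSize);
    if not ReadFile(h, Buffer[1], FSize, BytesRead, nil) then Exit;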

    for i:= 1 to BytesRead do
      begin
      n:= Ord(Buffer[i]);
      if (n>=Low(Table)) and (n<=High(Table)) then
        begin
        Inc(TableSize);
        Inc(Table[n]);
        end;
      end;
  finally
    CloseHandle(h);
  end;

  // Convert counts to percentages of all bytes >= $80 (not of the file size).
  for i:= Low(Table) to High(Table) do
    begin
    if TableSize=0
      then Table[i]:= 0
      else Table[i]:= Table[i]*100 div TableSize;
    end;

  Result:= true;
end;

procedure WriteTable(const Table: TOemFreqTable);
var
  i: integer;
begin
  for i:= Low(Table) to High(Table) do
    if Table[i]<>0 then
      Writeln('"', Chr(i), '": ', Table[i]);
end;
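
The post does not show how the three routines are called together. For quick experiments, the same frequency idea can be sketched compactly in Python. This is a loose port, not Alextp's code: the names `high_byte_percentages` and `is_oem` are made up here, and it works on bytes already in memory instead of reading a file. As in the Delphi version, percentages are relative to the number of bytes >= $80, and the threshold of 20 is the one used in `IsTableOem`.

```python
def high_byte_percentages(data: bytes) -> dict:
    """Map each byte >= 0x80 that occurs to its share (in %) of all such bytes."""
    counts = [0] * 256
    for b in data:
        if b >= 0x80:
            counts[b] += 1
    total = sum(counts)
    if total == 0:
        return {}
    # Integer division mirrors the Delphi "div".
    return {b: counts[b] * 100 // total for b in range(0x80, 0x100) if counts[b]}


def is_oem(data: bytes, threshold: int = 20) -> bool:
    """True if any byte in $B0..$DF reaches the threshold percentage."""
    pct = high_byte_percentages(data)
    return any(pct.get(b, 0) >= threshold for b in range(0xB0, 0xE0))
```

Plain ASCII and lowercase-only ANSI text stay below the threshold in this sketch, because their high bytes (if any) fall outside $B0..$DF.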
