Read File - non-English language

Hello all,

I'm trying to use Read File on a non-English file, and the output turns out garbled.
Does it seem like I need to edit the parser?

Text File:
zh-TW.html (30.4 KB)

The HTML header says the content encoding is ISO-8859-1, but that gives garbled output. Using UTF-8 or UTF-16 also gives garbled output.

So, to figure out the correct encoding, I created a definition (15.5 KB)

The definition contains a small C# script component that has the following code:

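private void RunScript(string fileName, int encoding, ref object text, ref object encodingUsed, ref object availableEncodings)
{
    if(System.IO.File.Exists(fileName)) {
      // All encodings known to the runtime
      var encodings = System.Text.Encoding.GetEncodings();
      List<string> encodingNames = new List<string>();
      System.Text.Encoding enc = System.Text.Encoding.UTF8;
      // Collect the names so an index can be matched to an encoding
      // (the loop body was missing from the post; e.Name is an assumption)
      foreach(var e in encodings)
        encodingNames.Add(e.Name);
      // Use the encoding at the given index, otherwise stay with UTF-8
      if(encoding >= 0 && encoding < encodings.Length)
        enc = encodings[encoding].GetEncoding();
      encodingUsed = enc.EncodingName;
      availableEncodings = encodingNames;
      text = System.IO.File.ReadAllText(fileName, enc);
      // Report how many encodings are available
      Print(String.Format("{0}", encodings.Length));
    }
}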

Note that the script will give the textual content as one item instead of split into lines. It is easy to do, but I leave that to you as a bit of practice (:
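For anyone who wants the line split spelled out, here is a minimal sketch, assuming the file may use Windows, Unix, or old Mac line endings. The names `TextLines` and `SplitLines` are made up for illustration and are not part of the definition.

```csharp
using System;

public static class TextLines
{
    // Hypothetical helper: split text on "\r\n", "\n", or "\r".
    // "\r\n" is listed first so a Windows line ending is not split twice.
    public static string[] SplitLines(string text)
    {
        return text.Split(new[] { "\r\n", "\n", "\r" }, StringSplitOptions.None);
    }
}
```

In the script component, assigning the resulting array to the `text` output instead of the single string should give one item per line, the same way `availableEncodings` already outputs a list.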


If you don’t want to code


Thanks, I’ll look into it!

Thanks a lot, I can split that by myself.

Just one more thing… I noticed that the HTML file contains PNG images. Do you think it is possible to extract those through GH and save each image individually? I’ve done some research and found something like webpack …?

Regex is enough

Indeed you should be able to use Regex. Create a regex pattern something like <img.*?src="(.*?)".*?>, use that to find all matches, and extract the group. The capture group needs to be non-greedy, otherwise it runs on to the last quote on the line. No doubt the regex can be improved.
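A rough sketch of that regex approach in C#, using a slightly tightened variant of the pattern. The class and method names (`ImageRefs`, `FindSources`) are hypothetical, and the pattern only handles double-quoted `src` values, so treat it as a starting point rather than a full HTML parser.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ImageRefs
{
    // Matches an <img ...> tag and captures the value of a double-quoted src attribute.
    // [^>]* keeps the match inside one tag; [^"]* stops at the closing quote.
    private static readonly Regex ImgSrc = new Regex(
        "<img\\b[^>]*?\\bsrc\\s*=\\s*\"([^\"]*)\"",
        RegexOptions.IgnoreCase);

    // Returns the src value of every <img> tag found in the HTML text.
    public static List<string> FindSources(string html)
    {
        var sources = new List<string>();
        foreach (Match m in ImgSrc.Matches(html))
            sources.Add(m.Groups[1].Value);
        return sources;
    }
}
```

Feeding it the text read from the HTML file should give a list of the referenced image names, which can then be passed to an output of the script component.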

Sorry for the confusion, I meant the image file itself from the HTML. (Sorry, I know it's a totally different topic than just text.)

I also noticed that the HTML file I uploaded doesn’t embed the PNGs… they might be stored on my local end…

The images are indeed not embedded in the HTML file; they are merely referenced. You can get the names of the files from the HTML using that regex approach. Then you can construct the paths to those files yourself when you have them locally, or construct the URLs when you need to download them with code.
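Constructing those locations could look roughly like this. It is only a sketch: `ImageLocations`, `ToLocalPath`, and `ToUrl` are made-up names, and it assumes the `src` values are plain relative references (not percent-encoded and not absolute).

```csharp
using System;
using System.IO;

public static class ImageLocations
{
    // Local case: resolve the src value against the folder of the HTML file.
    public static string ToLocalPath(string htmlFilePath, string src)
    {
        string folder = Path.GetDirectoryName(Path.GetFullPath(htmlFilePath));
        return Path.GetFullPath(Path.Combine(folder, src));
    }

    // Remote case: resolve the src value against the URL of the page.
    public static Uri ToUrl(string pageUrl, string src)
    {
        return new Uri(new Uri(pageUrl), src);
    }
}
```

For example, `ImageLocations.ToUrl("https://example.com/docs/zh-TW.html", "img/one.png")` resolves to `https://example.com/docs/img/one.png` (example.com is a placeholder, not the real location of the page). For the download case, something like `System.Net.WebClient.DownloadFile` could then fetch each URL to a local path.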
