Read File - non-English language

ctu6 · November 3, 2021, 6:49am

Hello all,

I tried to use Read File on a non-English file and the output turned out garbled.
Seems like I need to edit the parser?

Text File:
zh-TW.html (30.4 KB)

nathanletwory · November 3, 2021, 9:16am

The HTML header says the content encoding is ISO-8859-1, but reading the file with that encoding gives garbled output. Using UTF-8 or UTF-16 gives garbled output as well.

So to figure out the correct encoding I created

read_file_with_specific_encoding.gh (15.5 KB)

The definition contains a small C# script component that has the following code:

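private void RunScript(string fileName, int encoding, ref object text, ref object encodingUsed, ref object availableEncodings)
{
  // Only do work when the file exists; otherwise the outputs stay empty.
  if (System.IO.File.Exists(fileName))
  {
    // All encodings known to the runtime; the 'encoding' input is an index into this array.
    var encodings = System.Text.Encoding.GetEncodings();
    List<string> encodingNames = new List<string>();
    foreach (var e in encodings)
    {
      encodingNames.Add(e.GetEncoding().EncodingName);
    }

    // Fall back to UTF-8 when the index is out of range.
    System.Text.Encoding enc = System.Text.Encoding.UTF8;
    if (encoding >= 0 && encoding < encodings.Length)
    {
      enc = encodings[encoding].GetEncoding();
    }

    encodingUsed = enc.EncodingName;
    availableEncodings = encodingNames;
    // Read the whole file as one string, decoded with the chosen encoding.
    text = System.IO.File.ReadAllText(fileName, enc);
    // Print how many encodings are available, i.e. the size of the valid index range.
    Print(String.Format("{0}", encodings.Length));
  }
}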

Note that the script outputs the text content as a single item instead of splitting it into lines. Splitting it is easy to do, but I leave that to you as a bit of practice (:
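For anyone who wants a starting point for the line splitting, here is a minimal sketch. The class name `LineSplitter` and its method names are made up for illustration and are not part of the attached definition; only `String.Split` and `File.ReadAllLines` are standard .NET calls.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper, not part of the original definition.
public static class LineSplitter
{
    // Splits already-decoded text on Windows (\r\n), Unix (\n) and old Mac (\r) line endings.
    // "\r\n" is listed first so a Windows line ending counts as one separator, not two.
    public static string[] SplitLines(string text)
    {
        return text.Split(new[] { "\r\n", "\n", "\r" }, StringSplitOptions.None);
    }

    // Alternative: let the framework decode the file and split it in one call.
    public static string[] ReadLines(string fileName, Encoding enc)
    {
        return File.ReadAllLines(fileName, enc);
    }
}
```

Inside the script component, the shortest route is probably to swap `ReadAllText` for `System.IO.File.ReadAllLines(fileName, enc)` and assign the result to `text`, which should then come out as a list with one item per line.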

gankeyu · November 3, 2021, 9:18am

If you don’t want to code

ctu6 · November 3, 2021, 10:33am

Thanks, I’ll look into it!

ctu6 · November 3, 2021, 10:41am

Thanks a lot, I can split that by myself.

Just another thing… I noticed that the HTML file contains PNG images. Do you think it is possible to extract those through GH and save each image individually? I’ve done some research and found something like webpack …?

gankeyu · November 3, 2021, 10:45am

Regex is enough

nathanletwory · November 3, 2021, 10:53am

Indeed you should be able to use Regex. Create a regex pattern something like <img.*?src="(.*)".*?>, use that to find all matches and extract the group. No doubt the regex can be improved.
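As one possible refinement, here is a sketch with a slightly tighter pattern. A greedy group such as `(.*)` can run past the closing quote when more attributes follow on the same line, so this version captures with `[^"]*` instead. The class name `ImgSrcExtractor` is hypothetical, and the pattern assumes double-quoted `src` attributes only.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class ImgSrcExtractor
{
    // [^>]*? keeps the match inside a single tag, \s before "src" avoids
    // attributes such as data-src, and [^"]* stops the capture at the closing quote.
    private static readonly Regex ImgSrc = new Regex(
        @"<img\b[^>]*?\ssrc\s*=\s*""([^""]*)""",
        RegexOptions.IgnoreCase);

    // Returns the src value of every <img> tag found in the HTML text.
    public static List<string> Extract(string html)
    {
        var sources = new List<string>();
        foreach (Match m in ImgSrc.Matches(html))
        {
            sources.Add(m.Groups[1].Value);
        }
        return sources;
    }
}
```

Feeding it the text output of the script component should give one string per image reference. Single-quoted or unquoted attributes are not handled by this sketch.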

ctu6 · November 3, 2021, 11:07am

Sorry for the confusion, I meant the image files themselves from the HTML. (Sorry, I know it’s a totally different topic than just text.)

I also noticed the HTML file I uploaded doesn’t embed the PNGs… they might be stored on my local end…

nathanletwory · November 3, 2021, 11:34am

The images are indeed not embedded in the HTML file; they are merely referenced. You can get the names of the files from the HTML using that regex approach. Then you can construct the paths to those files yourself when you have them locally, or construct the URLs when you need to download them with code.
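A rough sketch of that path and URL construction follows. It assumes the `src` values are relative references; the class name `ImageLocator` and both method names are hypothetical, and any URL passed in is a placeholder rather than the real location of the file.

```csharp
using System;
using System.IO;

// Hypothetical helper for illustration only.
public static class ImageLocator
{
    // Resolves a relative src value against the folder that holds the HTML file.
    // Percent-escapes such as %20 are decoded first, since file names on disk are not URL-encoded.
    public static string ToLocalPath(string htmlFilePath, string src)
    {
        string folder = Path.GetDirectoryName(Path.GetFullPath(htmlFilePath));
        return Path.GetFullPath(Path.Combine(folder, Uri.UnescapeDataString(src)));
    }

    // Resolves a relative src value against the URL of the page it came from.
    public static Uri ToUrl(string pageUrl, string src)
    {
        return new Uri(new Uri(pageUrl), src);
    }
}
```

With a local copy, `File.Exists` on the result of `ToLocalPath` tells you whether the image is really there. For the download case, something like `System.Net.WebClient.DownloadFile` can take the resulting `Uri` and a target path.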
