Wrong character encoding of JSON response


#1

Hi,

I query a website that responds with UTF-8 encoded JSON, but when I use JSON.parse($('body pre').text()) the text comes out garbled, apparently decoded as Windows-1252. The page screenshot shows the same Windows-1252 rendering (and not ISO-8859-1 as I first thought: “é” becomes “Ã©” in both, but “’” (U+2019) becomes “â€™” only in Windows-1252).

How can I fix that? Is there a way to force the response to be read or parsed as UTF-8?

Edit: I tried the unescape(encodeURIComponent(s)) / decodeURIComponent(escape(s)) fix, without success.


#2

I finally got it to work using decodeURIComponent(escape(mystring)). Applied to the whole JSON response, though, it fails on many fields with:

Error invoking user-provided ‘pageFunction’: Error: URIError: URI error

The solution was to run it not on the whole response but only on the field I’m interested in. There are still a few errors, but the rate is low, and I guess a few replace() calls could get rid of them.
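A minimal sketch of that per-field approach, assuming the field is a plain string. The helper name tryLatin1Fix and the sample record are made up for illustration; only escape() and decodeURIComponent() come from the thread. Catching the URIError keeps one bad field from aborting the whole pageFunction.

```javascript
// Reinterpret a string whose UTF-8 bytes were read as ISO-8859-1.
// Returns the input unchanged when decodeURIComponent() rejects it.
function tryLatin1Fix(s) {
  try {
    return decodeURIComponent(escape(s));
  } catch (e) {
    if (e instanceof URIError) return s; // not repairable this way
    throw e;
  }
}

// Hypothetical record; the field names are placeholders.
const record = {
  title: 'Caf\u00C3\u00A9',        // "Café" with its UTF-8 bytes read as Latin-1
  note: 'l\u00E2\u20AC\u2122eau',  // contains U+20AC and U+2122 (Windows-1252 only)
};

// Repair only the fields of interest instead of the whole response.
const fixedTitle = tryLatin1Fix(record.title);
const fixedNote = tryLatin1Fix(record.note); // left as-is: escape() emits %u20AC here
```

The second field stays garbled because escape() writes characters above U+00FF as %uXXXX, a form decodeURIComponent() does not accept.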


#3

Nah, it seems to work only for the ISO-8859-1 / UTF-8 conversion, and here for some reason we have the Windows-1252 charset (which overlaps ISO-8859-1 for most characters, hence the partial improvement).

Anyway, my browser is able to detect the right charset, so Apify should as well, shouldn’t it? Or maybe there’s a way to force the encoding of the response from the scraped URL? The Content-Type header is only set to application/json, with no charset defined.
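One way to force UTF-8 would be to take the raw response bytes and decode them explicitly, rather than reading the text the browser already rendered. The sketch below uses the standard TextDecoder and fetch APIs; the function names and the url parameter are placeholders, and whether a second request from inside the page is allowed in a given Apify setup is an assumption, not something confirmed here.

```javascript
// Decode raw response bytes as UTF-8, ignoring whatever charset was guessed
// from the Content-Type header, then parse the result as JSON.
function parseJsonAsUtf8(buffer) {
  const text = new TextDecoder('utf-8').decode(buffer);
  return JSON.parse(text);
}

// Sketch of use from page code; `url` is a placeholder for the scraped URL.
async function fetchJsonAsUtf8(url) {
  const response = await fetch(url);
  return parseJsonAsUtf8(await response.arrayBuffer());
}
```

Working from the bytes avoids the repair step entirely, since the string is never decoded with the wrong charset in the first place.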


#4

OK, I ended up with the following:

decodeURIComponent(escape(mystring).replace(/%u20AC/g,'%80').replace(/%u201A/g,'%82').replace(/%u0192/g,'%83').replace(/%u201E/g,'%84').replace(/%u2026/g,'%85').replace(/%u2020/g,'%86').replace(/%u2021/g,'%87').replace(/%u02C6/g,'%88').replace(/%u2030/g,'%89').replace(/%u0160/g,'%8A').replace(/%u2039/g,'%8B').replace(/%u0152/g,'%8C').replace(/%u017D/g,'%8E').replace(/%u2018/g,'%91').replace(/%u2019/g,'%92').replace(/%u201C/g,'%93').replace(/%u201D/g,'%94').replace(/%u2022/g,'%95').replace(/%u2013/g,'%96').replace(/%u2014/g,'%97').replace(/%u02DC/g,'%98').replace(/%u2122/g,'%99').replace(/%u0161/g,'%9A').replace(/%u203A/g,'%9B').replace(/%u0153/g,'%9C').replace(/%u017E/g,'%9E').replace(/%u0178/g,'%9F'))

Not the most elegant solution, but it works.

(Basically, the problem is the 8- and 9- rows of Windows-1252, i.e. bytes 0x80 to 0x9F. Those bytes were decoded to characters above U+00FF, so escape(mystring) emits them as %uXXXX instead of %XX, and decodeURIComponent() rejects that form with a URIError. The replace() calls turn each %uXXXX back into the original Windows-1252 byte in %XX form, and decodeURIComponent() then decodes the resulting byte sequence as proper UTF-8.)
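The same idea can be written with a lookup table instead of a chain of replace() calls. This is only a sketch: the names CP1252_TO_BYTE and fixCp1252Mojibake are invented here, and the table holds the same 27 mappings as the one-liner. It uses a single regular expression with the g flag, so every occurrence in the string is rewritten, and it returns the input unchanged if the result is still not valid UTF-8.

```javascript
// Code points (as printed by escape()) that Windows-1252 assigns to bytes
// 0x80-0x9F, mapped back to the original byte in hexadecimal.
const CP1252_TO_BYTE = {
  '20AC': '80', '201A': '82', '0192': '83', '201E': '84', '2026': '85',
  '2020': '86', '2021': '87', '02C6': '88', '2030': '89', '0160': '8A',
  '2039': '8B', '0152': '8C', '017D': '8E', '2018': '91', '2019': '92',
  '201C': '93', '201D': '94', '2022': '95', '2013': '96', '2014': '97',
  '02DC': '98', '2122': '99', '0161': '9A', '203A': '9B', '0153': '9C',
  '017E': '9E', '0178': '9F',
};

// Repair a string whose UTF-8 bytes were decoded as Windows-1252.
function fixCp1252Mojibake(s) {
  // escape() yields %XX for code points up to U+00FF and %uXXXX above that;
  // swap each known %uXXXX back to the Windows-1252 byte it came from.
  const percentBytes = escape(s).replace(/%u([0-9A-F]{4})/g, (match, hex) =>
    hex in CP1252_TO_BYTE ? '%' + CP1252_TO_BYTE[hex] : match);
  try {
    // Decode the reconstructed byte sequence as UTF-8.
    return decodeURIComponent(percentBytes);
  } catch (e) {
    if (e instanceof URIError) return s; // not this kind of mojibake
    throw e;
  }
}
```

For example, fixCp1252Mojibake('â€™') goes through '%E2%u20AC%u2122', then '%E2%80%99', and comes back as “’” (U+2019).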