ErrorCorner

Kenil Vasani
Asked: December 11, 2020 In: Python

How to fix "UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 29815: character maps to <undefined>"?


At the moment, I am trying to get a Python 3 program to do some manipulations with a text file filled with information, through the Spyder IDE/GUI. However, when trying to read the file I get the following error:

  File "<ipython-input-13-d81e1333b8cd>", line 77, in <module>
    parser(f)

  File "<ipython-input-13-d81e1333b8cd>", line 18, in parser
    data = infile.read()

  File "C:\ProgramData\Anaconda3\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 29815: character maps to <undefined>

The code of the program is as follows:

import os

os.getcwd()

import glob
import re
import sqlite3
import csv

def parser(file):

    # Open a TXT file. Store all articles in a list. Each article is an item
    # of the list. Split articles based on the location of such string as
    # 'Document PRN0000020080617e46h00461'

    articles = []
    with open(file, 'r') as infile:
        data = infile.read()
    start = re.search(r'\n HD\n', data).start()
    for m in re.finditer(r'Document [a-zA-Z0-9]{25}\n', data):
        end = m.end()
        a = data[start:end].strip()
        a = '\n   ' + a
        articles.append(a)
        start = end

    # In each article, find all used Intelligence Indexing field codes. Extract
    # content of each used field code, and write to a CSV file.

    # All field codes (order matters)
    fields = ['HD', 'CR', 'WC', 'PD', 'ET', 'SN', 'SC', 'ED', 'PG', 'LA', 'CY', 'LP',
              'TD', 'CT', 'RF', 'CO', 'IN', 'NS', 'RE', 'IPC', 'IPD', 'PUB', 'AN']

    for a in articles:
        used = [f for f in fields if re.search(r'\n   ' + f + r'\n', a)]
        unused = [[i, f] for i, f in enumerate(fields) if not re.search(r'\n   ' + f + r'\n', a)]
        fields_pos = []
        for f in used:
            f_m = re.search(r'\n   ' + f + r'\n', a)
            f_pos = [f, f_m.start(), f_m.end()]
            fields_pos.append(f_pos)
        obs = []
        n = len(used)
        for i in range(0, n):
            used_f = fields_pos[i][0]
            start = fields_pos[i][2]
            if i < n - 1:
                end = fields_pos[i + 1][1]
            else:
                end = len(a)
            content = a[start:end].strip()
            obs.append(content)
        for f in unused:
            obs.insert(f[0], '')
        obs.insert(0, file.split('/')[-1].split('.')[0])  # insert Company ID, e.g., GVKEY
        # print(obs)
        cur.execute('''INSERT INTO articles
                       (id, hd, cr, wc, pd, et, sn, sc, ed, pg, la, cy, lp, td, ct, rf,
                       co, ina, ns, re, ipc, ipd, pub, an)
                       VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?,
                       ?, ?, ?, ?, ?, ?, ?, ?)''', obs)

# Write to SQLITE
conn = sqlite3.connect('factiva.db')
with conn:
    cur = conn.cursor()
    cur.execute('DROP TABLE IF EXISTS articles')
    # Mirror all field codes, except 'IN' becomes 'ina' because IN is a reserved word in SQL
    cur.execute('''CREATE TABLE articles
                   (nid integer primary key, id text, hd text, cr text, wc text, pd text,
                   et text, sn text, sc text, ed text, pg text, la text, cy text, lp text,
                   td text, ct text, rf text, co text, ina text, ns text, re text, ipc text,
                   ipd text, pub text, an text)''')
    for f in glob.glob('*.txt'):
        print(f)
        parser(f)

# Write to CSV to feed Stata
with open('factiva.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    with conn:
        cur = conn.cursor()
        cur.execute('SELECT * FROM articles WHERE hd IS NOT NULL')
        colname = [desc[0] for desc in cur.description]
        writer.writerow(colname)
        for obs in cur.fetchall():
            writer.writerow(obs)
Tags: decode, file-io, python, sqlite, unicode
    1 Answer

      Best Answer
      Rohit Patel
      Added an answer on December 11, 2020 at 8:35 pm

      As the table at https://en.wikipedia.org/wiki/Windows-1252 shows, the byte 0x9D is not defined in CP1252, which is why the cp1252 codec in your traceback cannot decode it.
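A quick way to see this for yourself is the sketch below. It only demonstrates the codec behaviour; the idea that your file is really UTF-8 is an assumption (0x9D is, for example, the last byte of a right curly quote in UTF-8), not something confirmed by the question.

```python
# Byte 0x9D has no character assigned in Windows-1252, so decoding it fails.
raw = b"\x9d"
try:
    raw.decode("cp1252")
    cp1252_ok = True
except UnicodeDecodeError:
    cp1252_ok = False

# Assumption for illustration: in UTF-8 text, 0x9D can appear as the last
# byte of a right curly quote (U+201D), encoded as E2 80 9D.
curly_quote = b"\xe2\x80\x9d".decode("utf-8")

print(cp1252_ok)           # False
print(ascii(curly_quote))  # escaped form, safe on any console
```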

      The "error" is in your open call: you do not specify an encoding, so Python falls back to the system default, which on Windows is typically a legacy code page such as CP1252. In general, if you read a file that may not have been created on the same machine, it is much better to specify the encoding explicitly.
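A minimal sketch of what that looks like, assuming the input really is UTF-8. The file name `sample.txt` and its contents are made up for the example; in the question's code the change would be the extra `encoding=` argument in the `open(file, 'r')` call inside `parser`.

```python
import locale
import os
import tempfile

# The encoding open() usually falls back to when none is given
# (platform dependent; often cp1252 on Windows).
default_encoding = locale.getpreferredencoding(False)

# Hypothetical sample file standing in for one of the .txt inputs.
sample_text = "He said \u201chello\u201d\n"
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "w", encoding="utf-8") as outfile:
        outfile.write(sample_text)

    # Passing encoding= makes the read independent of the system default.
    with open(path, "r", encoding="utf-8") as infile:
        data = infile.read()

print(default_encoding)
print(data == sample_text)
```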

      I also recommend specifying an encoding on the open call that writes the CSV. It is better to be explicit.
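For the CSV side, a sketch along the same lines. The rows here are invented stand-ins for the query results, and UTF-8 is an assumed choice; pick whatever encoding the program that consumes the file expects.

```python
import csv
import os
import tempfile

# Hypothetical rows standing in for the results selected from the articles table.
rows = [["nid", "hd"], [1, "Caf\u00e9 \u201cquoted\u201d headline"]]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "factiva.csv")
    # newline='' is what the csv module expects; the encoding is stated explicitly.
    with open(path, "w", newline="", encoding="utf-8") as csvfile:
        csv.writer(csvfile).writerows(rows)

    # Read it back with the same encoding to check the round trip.
    with open(path, "r", newline="", encoding="utf-8") as csvfile:
        read_back = list(csv.reader(csvfile))

print(len(read_back))
```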

      I do not know the original file's encoding, but adding encoding='utf-8' to open is usually a good choice (and it is typically the default on Linux and macOS).
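If you cannot find out the encoding, one possible fallback is to try UTF-8 first and then CP1252. This is only a sketch: `read_text` is a hypothetical helper, not part of the standard library, and the order of encodings is a guess you would adjust for your data.

```python
import os
import tempfile

def read_text(path, encodings=("utf-8", "cp1252")):
    """Hypothetical helper: try each encoding in order, return (text, encoding)."""
    for enc in encodings:
        try:
            with open(path, "r", encoding=enc) as infile:
                return infile.read(), enc
        except UnicodeDecodeError:
            continue
    # Last resort: decode with the first encoding and mark bad bytes with U+FFFD.
    with open(path, "r", encoding=encodings[0], errors="replace") as infile:
        return infile.read(), encodings[0] + " (with replacements)"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "legacy.txt")
    # 0x93 and 0x94 are curly quotes in cp1252 but invalid as standalone UTF-8 bytes.
    with open(path, "wb") as f:
        f.write(b"\x93legacy\x94")
    text, used = read_text(path)

print(used)
```

Note that a fallback like this can silently pick the wrong encoding for files that happen to decode without errors, so an explicit, known encoding is still preferable.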
