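A large Python function raises a TypeError when it runs.
- The function processes a large text file: it splits the file into speakers and speeches, then breaks each speech into paragraphs.
- Calling the driver function produces this error: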
Traceback (most recent call last):
  File "<pyshell#159>", line 1, in <module>
    driver("C:/Users/mboogie/Documents/Congressional Hearings/NHTF Project/Test Set", 'CHRG-107hhrg70750.htm', 'CHRG-107hhrg70750.csv', 'Paragraphs.csv')
  File "<pyshell#158>", line 9, in driver
    speaker = re.findall("\\n Mr. [A-Z][a-z]+\.|\\n Ms. [A-Z][a-z]+\.|\\n Congressman [A-Z][a-z]+\.|\\n Congresswoman [A-Z][a-z]+\.|\\n Chairwoman [A-Z][a-z]+\.|\\n Chairman [A-Z][a-z]+\.", hearing)
  File "C:\Python27\lib\re.py", line 177, in findall
    return _compile(pattern, flags).findall(string)
TypeError: expected string or buffer
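- Solution
- The traceback shows that the error comes from the re.findall() call inside driver.
- re.findall() expects its second argument to be a string or buffer, but at that point hearing is a list, because str.split() returns a list.
- Passing a string to re.findall() instead of the list fixes the error.
- The code had other problems as well, so it was also refactored to make it clearer and easier to follow.
- Two possible solutions are given below; choose whichever suits your needs.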
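The diagnosis above can be reproduced in a few lines. This is a minimal sketch, not the original code: the sample text and the shortened PATTERN are made up for illustration. Python 2.7 reports "expected string or buffer"; Python 3 raises the same TypeError with similar wording.

```python
import re

# Shortened, hypothetical version of the speaker pattern.
PATTERN = r"\n +(?:Mr|Ms)\. [A-Z][a-z]+\."

# Made-up sample text standing in for the hearing transcript.
text = "intro\n    Mr. Smith. Hello.\nRESPONSE TO WRITTEN\n    Ms. Jones. Reply."
parts = text.split("RESPONSE TO WRITTEN")  # str.split() returns a list

# Passing the list itself reproduces the TypeError.
try:
    re.findall(PATTERN, parts)
    raised = False
except TypeError:
    raised = True

# Option A: search each piece separately (each element is a string).
per_part = [re.findall(PATTERN, part) for part in parts]

# Option B: join the pieces back into a single string first.
joined = re.findall(PATTERN, "\n".join(parts))

# Pitfall: str(parts) is the list's repr, where each real newline is
# written as a backslash followed by 'n', so the pattern finds nothing.
from_repr = re.findall(PATTERN, str(parts))
```

Options A and B both hand re.findall() a real string. Which one fits depends on whether the pieces produced by split() need to stay separate.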
Code examples:
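# Solution 1
import os
import re
from bs4 import BeautifulSoup  # assumes BeautifulSoup 4, which provides get_text()

def driver(folder, input_filename, output_filename1, output_filename2):
    os.chdir(folder)
    with open(input_filename, 'r') as f:
        Hearing = f.read()
    hearing = BeautifulSoup(Hearing)
    hearing = hearing.get_text()
    hearing = hearing.split("RESPONSE TO WRITTEN")
    # split() returns a list; join it back into one string for re.findall().
    # (str(hearing) would give the list's repr, where newlines appear as a
    # literal backslash + n that the pattern below cannot match.)
    hearing = "\n".join(hearing)
    speakers = re.findall("\\n Mr. [A-Z][a-z]+\.|\\n Ms. [A-Z][a-z]+\.|\\n Congressman [A-Z][a-z]+\.|\\n Congresswoman [A-Z][a-z]+\.|\\n Chairwoman [A-Z][a-z]+\.|\\n Chairman [A-Z][a-z]+\.", hearing)
    speakers = list(set(speakers))
    # ...
    # rest of the code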
# Solution 2
# The code is refactored to be clearer and easier to follow.
# Note: Speaker, NAME, SPEAKERS, HARD_WRAP and pairwise are not defined in
# this excerpt; they must be defined before the functions below are called.
import csv
import os
from bs4 import BeautifulSoup  # assumes BeautifulSoup 4

def load_hearing_response(fname, split_on=' Present:'):
    with open(fname, 'rU') as inf:
        html = inf.read()
    txt = BeautifulSoup(html).get_text()
    # keep only the text after the last occurrence of split_on
    return txt.rsplit(split_on, 1)[-1]

def un_hard_wrap(txt, reg=HARD_WRAP):
    return reg.sub('', txt)

def get_speeches(txt):
    speakers = [Speaker(NAME(sp), sp.start(), sp.end()) for sp in SPEAKERS.finditer(txt)]
    speakers.append(Speaker('', len(txt), None))  # tail sentinel for pairwise processing
    return [(this.name, txt[this.name_end:nxt.name_start]) for this, nxt in pairwise(speakers)]

def write_csv(fname, data, header=None):
    # 'wb' is the Python 2 convention; on Python 3 use open(fname, 'w', newline='')
    with open(fname, 'wb') as outf:
        out_csv = csv.writer(outf)
        if header is not None:
            out_csv.writerow(header)
        out_csv.writerows(data)

def main():
    # get text of Congressional hearing responses
    DIR = r'C:\Users\Documents\Congressional Hearings\NHTF Project\Test Set'
    txt = load_hearing_response(os.path.join(DIR, 'CHRG-107hhrg70750.htm'))
    txt = un_hard_wrap(txt)
    # break into speeches
    speeches = get_speeches(txt)
    # write (speaker, speech) pairs to a .csv file
    write_csv(os.path.join(DIR, 'CHRG-107hhrg70750.csv'), speeches, ['Speaker', 'Speech'])
    # write paragraphs of speeches to a .csv file
    paragraphs = ([para.strip()] for speaker, speech in speeches for para in speech.split('\n') if para.strip())
    write_csv(os.path.join(DIR, 'Paragraphs.csv'), paragraphs, ['Paragraphs'])

if __name__ == "__main__":
    main()
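Solution 2 uses several names that are never defined in the code shown: Speaker, NAME, SPEAKERS and pairwise. The sketch below shows one way they might be defined so that get_speeches() runs. All of these definitions are assumptions made for illustration, not the original author's: the SPEAKERS regex is modeled on the titles in Solution 1's pattern, and the sample text is invented. HARD_WRAP is left out because it depends on how the source file wraps its lines.

```python
import re
from collections import namedtuple
from itertools import tee

# Hypothetical: field names inferred from how get_speeches() uses them.
Speaker = namedtuple("Speaker", ["name", "name_start", "name_end"])

# Hypothetical: a title and surname at the start of an indented line,
# followed by a period. Group 1 holds the name without the final period.
SPEAKERS = re.compile(
    r"\n *((?:Mr\.|Ms\.|Congressman|Congresswoman|Chairwoman|Chairman)"
    r" [A-Z][a-z]+)\."
)

def NAME(match):
    """Hypothetical helper: pull the speaker name out of a SPEAKERS match."""
    return match.group(1)

def pairwise(iterable):
    """Yield overlapping pairs: (s0, s1), (s1, s2), ..."""
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)

def get_speeches(txt):
    # Same logic as in Solution 2: each speech runs from the end of one
    # speaker's name to the start of the next speaker's name.
    speakers = [Speaker(NAME(sp), sp.start(), sp.end())
                for sp in SPEAKERS.finditer(txt)]
    speakers.append(Speaker('', len(txt), None))  # tail sentinel
    return [(this.name, txt[this.name_end:nxt.name_start])
            for this, nxt in pairwise(speakers)]

# Invented sample text, for illustration only.
sample = (
    "\n    Chairman Smith. The committee will come to order."
    "\n    Mr. Jones. Thank you, Mr. Chairman."
)
speeches = get_speeches(sample)
```

Because the pattern requires a newline before the title, the "Mr. Chairman" inside the second speech is not treated as a new speaker. On Python 2, itertools.izip could replace zip in pairwise to keep it lazy.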