0%

CSV与EXCEL的文件处理

处理CSV

  • 示例文件的如下:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
$ head  TFWP_2020Q1_Positive_EN.csv
"Employers Who Were Issued a Positive Labour Market Impact Assessment (LMIA) by Program Stream, National Occupational Classification (NOC) 2011 and Business Location, January to March 2020",,,,,
Province/Territory,Program Stream,Employer ,Address,Occupation,Approved Positions
Newfoundland and Labrador, High Wage,Anglo Eastern Ship Managment Ltd,"Wanchai, A0A0A0","2273-Deck officers, water transport",4
Newfoundland and Labrador, High Wage,Anglo Eastern Ship Managment Ltd,"Wanchai, A0A0A0","2274-Engineer officers, water transport",4
Newfoundland and Labrador, High Wage,Anglo Eastern Ship Managment Ltd,"Wanchai, A0A0A0",7242-Industrial electricians,1
Newfoundland and Labrador, High Wage,Anglo Eastern Ship Managment Ltd,"Wanchai, A0A0A0",7532-Water transport deck and engine room crew,9
Newfoundland and Labrador, High Wage,Anglo Eastern Ship Managment Ltd,"Wanchai, A0A0A0",7612-Other trades helpers and labourers,1
Newfoundland and Labrador, High Wage,Bailey Veterinary Surgical Specialty Ltd.,"St. John's, A1N3J7",3114-Veterinarians,1
Newfoundland and Labrador, High Wage,Eastern Regional Health Authority,"Mount Pearl, A1N3J5",3111-Specialist physicians,1
Newfoundland and Labrador, High Wage,WesTower Communications Ltd.,"St. John's, A1A5G6",7245-Telecommunications line and cable workers,2

[....]

  • 如上面所示,第一行应该算是一个标题,第二行是CSV文件的列字段,以逗号间隔.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
In [34]: import pandas as pd
# 这里跳过了第一行的读取,也可以使用如 skiprows=1,nrows=8814,从第二开始读取8814行.
In [35]: a = pd.read_csv("/home/michael/Documents/TFWP_2020Q1_Positive_EN.csv",encoding="latin",skiprows=1)
In [37]: a.head()
Out[37]:
Province/Territory Program Stream Employer Address Occupation Approved Positions
0 Newfoundland and Labrador High Wage Anglo Eastern Ship Managment Ltd Wanchai, A0A0A0 2273-Deck officers, water transport 4.0
1 Newfoundland and Labrador High Wage Anglo Eastern Ship Managment Ltd Wanchai, A0A0A0 2274-Engineer officers, water transport 4.0
2 Newfoundland and Labrador High Wage Anglo Eastern Ship Managment Ltd Wanchai, A0A0A0 7242-Industrial electricians 1.0
3 Newfoundland and Labrador High Wage Anglo Eastern Ship Managment Ltd Wanchai, A0A0A0 7532-Water transport deck and engine room crew 9.0
4 Newfoundland and Labrador High Wage Anglo Eastern Ship Managment Ltd Wanchai, A0A0A0 7612-Other trades helpers and labourers 1.0

# 如:直接输出成HTML或EXCEL.
In [38]: a.to_html("name.html")

In [39]: a.to_excel("name.xlsx")

  • 过滤一些特定字段,如下面,查找出Occupation字段中,以2175-Web开头的所有行记录.
1
2
3
4
5
6
7
8
9
10
11
12
a[a.Occupation.str.startswith('2175-Web')]
Out[85]:
Province/Territory Program Stream Employer Address Occupation Approved Positions
53 Nova Scotia High Wage 10094277 Canada Inc Halifax, B3J2T9 2175-Web designers and developers 2
124 Nova Scotia Global Talent Stream 3rDi Laboratory Inc. Wolfville, B4P3R6 2175-Web designers and developers 2
207 Quebec High Wage 213A Studio Créatif Inc. Montréal, H2S3X3 2175-Web designers and developers 1
249 Quebec High Wage 9122-4790 Québec Inc. Laval, H7M5Y6 2175-Web designers and developers 1
511 Quebec High Wage Géoplus Inc. Laval, H7L5B7 2175-Web designers and developers 1
609 Quebec High Wage les produits de fenetres sol-r (2000) inc. montreal, H4N1H8 2175-Web designers and developers 1
668 Quebec High Wage Ossiaco Inc Montreal, H3C2G9 2175-Web designers and developers 1


错误处理

1
2
3
4
In [2]: pd.read_csv("/home/michael/Documents/TFWP_2020Q1_Positive_EN.csv")
[.....]
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xea in position 1: invalid continuation byte

  • 尝试使用encoding="latin"读取,如:pd.read_csv("/home/michael/Documents/TFWP_2020Q1_Positive_EN.csv",encoding="latin")

谢谢支持

  • 微信二维码: