# Data Analysis in Practice: Statistical Analysis of Beijing Rental Housing Data

2022-11-24 21:34:06

# 2.1 Data Analysis in Practice: Statistical Analysis of Beijing Rental Housing Data

## Learning Objectives

• Master reading and writing data with Pandas.
• Use preprocessing techniques to filter data.
• Be able to draw a variety of charts with the Matplotlib library.
• Be able to analyze data independently.

In recent years, rapid economic growth has drawn a large floating population to first-tier cities, attracted by their resources and job opportunities, and has made these cities some of the most densely populated in the country. According to the statistics cited here, Beijing's resident migrant population had reached 21.707 million in 2017, and the vast majority of these people meet their housing needs by renting.

This article uses Beijing rental listings taken from a rental website as its data set. Applying the data analysis techniques covered earlier, it walks through an analysis of real data and presents the following statistics as charts:

• (1) Count the total number of listings in each district and analyze how the listings are distributed by location.
• (2) Use a bar chart to analyze which layouts are most numerous and most popular.
• (3) Compute the average rent in each district, and combine a bar chart with a line chart to analyze listing counts and rents by district.
• (4) Compute the market share of each floor-area range and draw a pie chart showing the proportion of each range.

## 1 Overview of the Data

Many rental platforms are available online today, such as Ziroom (自如), Aiwujiwu (爱屋吉屋), Fang.com (房天下), and Lianjia (链家, HomeLink). Of these, Lianjia currently has the largest market share, and its platform provides listing information that is reliable, convenient to access, and comprehensive.

The crawled data is downloaded to the local machine and saved in the file “链家北京租房数据.csv”. Opening the file shows a large number of records (8224 were crawled for this case), as shown in the figure below.

## 2 Reading the Data

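import pandas as pd
import numpy as np

# Read the saved CSV file (assumed here to be in the working directory)
file_data = pd.read_csv('链家北京租房数据.csv')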



## 3 Data Preprocessing

Although most of the data crawled directly from the Lianjia website is tidy, it still has a few problems and cannot be used for analysis as is. Before use, the data therefore needs a series of checks and processing steps, including handling duplicate and missing values and unifying data types, so that it is more reliable for analysis.

### 3.1 Handling Duplicate and Missing Values

The first two preparation steps are checking for missing values and checking for duplicate values. To check whether the data contains duplicate rows, use the duplicated() method in Pandas. Below, duplicated() is applied to the Beijing rental data; every row that repeats an earlier row is marked True. The code is as follows.

# Detect duplicate rows
file_data.duplicated()


Because the data set is fairly large, Jupyter Notebook omits part of the output, but several rows marked True are still visible, which shows that the data contains duplicates. Here, duplicates are handled by deleting them, and the drop_duplicates() method removes them directly. The code is as follows.

# Delete duplicate rows
file_data = file_data.drop_duplicates()
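
As a minimal sketch of the two methods described above, the example below uses a small hypothetical DataFrame (not the real Lianjia data) to show how duplicated() flags repeated rows and how drop_duplicates() changes the row count. The column values are invented for illustration only.

```python
import pandas as pd

# Hypothetical toy listings; the real data comes from the CSV file
toy = pd.DataFrame({
    "区域": ["朝阳", "朝阳", "海淀", "朝阳"],
    "户型": ["2室1厅", "2室1厅", "1室1厅", "2室1厅"],
    "价格(元/月)": [5800, 5800, 4200, 5800],
})

dup_mask = toy.duplicated()         # True for every row that repeats an earlier row
n_duplicates = int(dup_mask.sum())  # summing booleans counts the repeated rows
deduped = toy.drop_duplicates()     # keeps the first occurrence of each row

print(n_duplicates, len(toy), len(deduped))
```

Summing the boolean mask gives a single number, which is easier to read than scanning truncated notebook output for True values.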


Compared with the previous output, the number of rows has clearly dropped: only 5773 records remain.

With duplicates handled, the next step is to check for missing values. The dropna() method can be used directly to remove any rows that contain missing data; comparing the row count before and after shows whether any were present. The code is as follows.

# Delete rows with missing data
file_data = file_data.dropna()


After this step, the total number of rows is unchanged, so we can conclude that the data contains no missing values.
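
The row-count comparison above can be complemented by counting missing cells directly. The sketch below is an illustrative variant on a hypothetical toy DataFrame, not the author's code: isnull().sum() reports missing values per column, and the difference in length before and after dropna() gives the number of rows removed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two incomplete rows
toy = pd.DataFrame({
    "区域": ["朝阳", "海淀", None],
    "价格(元/月)": [5800, np.nan, 4200],
})

missing_per_column = toy.isnull().sum()  # number of missing cells in each column
rows_before = len(toy)
cleaned = toy.dropna()                   # drop every row that has any missing cell
rows_removed = rows_before - len(cleaned)

print(missing_per_column)
print(rows_removed)
```

If rows_removed is 0, as the text reports for the rental data, the data set had no missing values to begin with.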

### 3.2 Converting Data Types

In this rental data set, the values in the “面积(㎡)” (area) column contain Chinese characters, so the column is stored as strings. To make later arithmetic on the area possible, the “面积(㎡)” column needs to be converted to the float type. The code is as follows.

# Create an array of zeros
data_new = np.zeros(file_data.shape[0])
# Take the "面积(㎡)" column and strip the two unit characters at the end of each value
data_area = file_data["面积(㎡)"].values

for i, value in enumerate(data_area):
    data_new[i] = float(value[:-2])
# Replace the column with the new data
file_data.loc[:, '面积(㎡)'] = data_new
# Inspect the DataFrame summary
file_data.info()
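
The same conversion can also be written without an explicit loop. The sketch below is an alternative under a stated assumption: that every value ends with the two-character unit "平米" (the original code only strips the last two characters, so the actual suffix is not confirmed by the source). The toy values are hypothetical.

```python
import pandas as pd

# Hypothetical area strings; the "平米" suffix is an assumption
toy = pd.DataFrame({"面积(㎡)": ["59.11平米", "88平米", "120.5平米"]})

# Remove the unit as a literal string, then convert the column to float
area = toy["面积(㎡)"].str.replace("平米", "", regex=False).astype(float)
toy["面积(㎡)"] = area

print(toy["面积(㎡)"].dtype)
```

Unlike slicing off the last two characters, astype(float) raises an error if any value has an unexpected format, which makes dirty rows easier to notice.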


# Get the "户型" (layout) column
housetype_data = file_data['户型']
temp_list = []
# Replace "房间" with "室" using the replace() method
for i in housetype_data:
    new_info = i.replace('房间', '室')
    temp_list.append(new_info)
file_data.loc[:, '户型'] = temp_list


Comparing the data before and after processing shows that the layout at index 8219 changed from “4房间2卫” to “4室2卫”, which confirms that the replacement succeeded.
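
For reference, the layout replacement can be sketched in one step with the pandas string accessor. This is an illustrative alternative on hypothetical values, not the code used in the article.

```python
import pandas as pd

# Hypothetical layout strings mixing the two naming styles
toy = pd.DataFrame({"户型": ["4房间2卫", "2室1厅", "1房间1卫"]})

# Replace the literal text "房间" with "室" in every row
toy["户型"] = toy["户型"].str.replace("房间", "室", regex=False)

print(toy["户型"].tolist())
```

Values that do not contain "房间" pass through unchanged, so the call is safe to apply to the whole column.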

## 4 Chart Analysis

Once the data has been preprocessed, it can be analyzed. To make patterns in the data easier to see, charts are used here to support the analysis.

### 4.1 Listing Counts and Location Distribution

# Create a DataFrame with two columns: district ("区域") and count ("数量")

areas = file_data['区域'].unique()
new_df = pd.DataFrame({'区域': areas, '数量': [0] * len(areas)})


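# Group file_data by the "区域" column and count the rows in each group

groupy_area = file_data.groupby(by='区域').size()
# Match counts to districts by name, because unique() and groupby() order the districts differently
new_df['数量'] = new_df['区域'].map(groupy_area)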


# Sort by the "数量" column in descending order

new_df.sort_values(by=['数量'], ascending=False)


The sorted output shows that the districts with the most listings are Chaoyang (朝阳区), Haidian (海淀区), and Fengtai (丰台区), in that order.
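
A compact way to get the same sorted table is value_counts(), which counts and sorts in one call. The sketch below uses a hypothetical list of districts; only the column names "区域" and "数量" come from the article.

```python
import pandas as pd

# Hypothetical district column
toy = pd.DataFrame({"区域": ["朝阳", "海淀", "朝阳", "丰台", "朝阳", "海淀"]})

# value_counts() returns counts sorted from largest to smallest
counts = (
    toy["区域"]
    .value_counts()
    .rename_axis("区域")
    .reset_index(name="数量")
)

print(counts)
```

Because the district names and their counts come from the same object, there is no risk of pairing a count with the wrong district.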

### 4.2 Layout Count Analysis

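# Define a function that counts the number of listings for each layout
def all_house(arr):
    arr = np.array(arr)
    key = np.unique(arr)
    result = {}
    for k in key:
        mask = (arr == k)    # True where the layout equals k
        arr_new = arr[mask]  # listings with this layout
        v = arr_new.size
        result[k] = v
    return result

# Get the layout data
house_array = file_data['户型']
house_info = all_house(house_array)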


The program outputs a dictionary in which each key is a layout type and each value is the number of listings with that layout.

A dictionary comprehension then selects the layouts with more than 50 listings, and the filtered result is converted into a DataFrame object. The code is as follows.
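
The counting that all_house() performs can also be sketched with NumPy's own return_counts option, which returns each unique value together with how often it occurs. The layouts below are hypothetical.

```python
import numpy as np

# Hypothetical layout values
layouts = np.array(["2室1厅", "1室1厅", "2室1厅", "3室1厅", "2室1厅"])

# np.unique returns the sorted unique values and, on request, their counts
keys, counts = np.unique(layouts, return_counts=True)
house_info = {str(k): int(v) for k, v in zip(keys, counts)}

print(house_info)
```

This produces the same kind of dictionary (layout as key, count as value) in a single pass over the data.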

# Use a dictionary comprehension
house_type = {key: value for key, value
              in house_info.items() if value > 50}
show_houses = pd.DataFrame({
    '户型': list(house_type.keys()),
    '数量': list(house_type.values())})


To show the differences between layout counts more clearly, a bar chart can be used, where the horizontal axis shows the layout type and the vertical axis shows the number of listings. The code is as follows.

# Plot the number of listings per layout
import matplotlib.pyplot as plt

plt.rcParams['font.family'] = 'SimHei'      # a font that can render Chinese labels
plt.rcParams['axes.unicode_minus'] = False  # render minus signs correctly with this font
house_type = show_houses["户型"]
house_type_num = show_houses["数量"]
positions = range(len(house_type))

plt.bar(positions, house_type_num)

plt.xticks(positions, house_type)
# ylim: set the y-axis range
plt.ylim(0, 2500)

plt.title("北京市各区域租房数量统计")
plt.xlabel("房屋类型")
plt.ylabel("数量")

# Add the exact count above each bar
# plt.text(x, y, string): x and y give the position, string is the label text
for x, y in enumerate(house_type_num):
    plt.text(x - 0.3, y + 50, "%s" % y)

plt.show()


### 4.3 Average Rent Analysis

# Create a new DataFrame with total rent and total area initialized to 0

areas = file_data['区域'].unique()
df_all = pd.DataFrame({
    '区域': areas,
    '房租总金额': [0] * len(areas),
    '总面积(㎡)': [0] * len(areas)})


# Compute the total rent and total area for each district

sum_price = file_data['价格(元/月)'].groupby(file_data['区域']).sum()
sum_area = file_data['面积(㎡)'].groupby(file_data['区域']).sum()
# Match totals to districts by name rather than by position
df_all['房租总金额'] = df_all['区域'].map(sum_price)
df_all['总面积(㎡)'] = df_all['区域'].map(sum_area)


Once the total rent and total area of each district have been calculated, the rent per square meter can be computed. A new column named “每平米租金(元)” (rent per square meter, in yuan) is added to the df_all object to hold the average price per square meter. The code is as follows.

# Compute the rent per square meter in each district, rounded to two decimal places

df_all['每平米租金(元)'] = round(df_all['房租总金额'] / df_all['总面积(㎡)'], 2)
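
To make the arithmetic concrete, here is a small worked sketch on hypothetical listings. It follows the definition used above (total rent divided by total area per district) and, for comparison, also computes the mean of each listing's own rent per square meter, which is a different quantity.

```python
import pandas as pd

# Hypothetical listings: two in 朝阳, one in 海淀
toy = pd.DataFrame({
    "区域": ["朝阳", "朝阳", "海淀"],
    "价格(元/月)": [6000, 9000, 7200],
    "面积(㎡)": [50.0, 100.0, 80.0],
})

# Total rent and total area per district, then their ratio
totals = toy.groupby("区域")[["价格(元/月)", "面积(㎡)"]].sum()
totals["每平米租金(元)"] = (totals["价格(元/月)"] / totals["面积(㎡)"]).round(2)

# For comparison: average of the per-listing ratios
per_listing = toy["价格(元/月)"] / toy["面积(㎡)"]
mean_of_ratios = per_listing.groupby(toy["区域"]).mean()

print(totals)
print(mean_of_ratios)
```

For 朝阳 the district-level figure is 15000 / 150 = 100 yuan per square meter, while the mean of the two listing ratios (120 and 90) is 105. Dividing the totals weights larger homes more heavily, so the two numbers should not be used interchangeably.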


To get a fuller picture of both the number of listings and the average rent in each district, the new_df object created earlier (listing counts per district) can be merged with df_all. Because both objects contain a “区域” column, they can be merged on that column as the key using the merge() function. The code is as follows.

# Merge new_df with df_all

df_merge = pd.merge(new_df, df_all)
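
The sketch below illustrates the key-based merge on two hypothetical tables whose rows are deliberately in different orders. Passing on="区域" makes the key explicit; when it is omitted, pd.merge() joins on all column names the two tables share.

```python
import pandas as pd

# Hypothetical per-district tables with rows in different orders
new_df = pd.DataFrame({"区域": ["朝阳", "海淀"], "数量": [3, 2]})
df_all = pd.DataFrame({"区域": ["海淀", "朝阳"], "每平米租金(元)": [90.0, 100.0]})

# Rows are matched by the value in "区域", not by position
df_merge = pd.merge(new_df, df_all, on="区域")

print(df_merge)
```

Each district ends up on one row with its own count and rent, regardless of the row order in either input.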


After merging, a chart can display the listing information for each district: the bars of a bar chart show the number of listings, and the points of a line chart show the rent per square meter. The code is as follows.

# Visualization

num = df_merge["数量"]
price = df_merge["每平米租金(元)"]
x_label = df_merge["区域"]
x = list(range(len(df_merge)))

fig = plt.figure(figsize=(10, 8), dpi=100)
ax1 = fig.add_subplot(111)  # single subplot that holds the line chart

# Draw the line chart
# 'or-': o is a circle marker, r is red, - is a solid line
ax1.plot(x, price, "or-", label="价格")
for _x, _y in zip(x, price):
    ax1.text(_x + 0.2, _y, str(_y))
ax1.set_ylim([0, 160])
ax1.set_ylabel("价格")
ax1.legend(loc="upper right")

# Draw the bar chart
# twinx(): create a second y-axis that shares the same x-axis
# alpha: transparency
ax2 = ax1.twinx()
ax2.bar(x, num, label="数量", alpha=0.2, color="green")
ax2.set_ylabel("数量")
ax2.legend(loc="upper left")
ax1.set_xticks(x)
ax1.set_xticklabels(x_label)

plt.show()


### 4.4 Area Range Analysis

Next, the floor-area data is divided into several ranges according to fixed boundaries, and the share of listings in each range is examined. This makes it easier to see which sizes of home rent most readily and which area range contains the most listings.

To divide data into ranges, use the cut() function in Pandas. First, use the max() and min() methods to find the largest and smallest floor areas. The code is as follows.

# Show the largest and smallest floor areas
print('房屋最大面积是%d平米'%(file_data['面积(㎡)'].max()))
print('房屋最小面积是%d平米'%(file_data['面积(㎡)'].min()))

# Show the highest and lowest monthly rents
print('房租最高价格为每月%d元'%(file_data['价格(元/月)'].max()))
print('房屋最低价格为每月%d元'%(file_data['价格(元/月)'].min()))


# Split the areas into ranges
area_divide = [1, 30, 50, 70, 90, 120, 140, 160, 1200]
area_cut = pd.cut(list(file_data['面积(㎡)']), area_divide)
area_cut_data = area_cut.describe()
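
As a small illustration of how cut() assigns values to ranges, the sketch below applies the same boundaries to a handful of hypothetical areas and computes the share of each range with value_counts(normalize=True). This is an alternative to describe() for obtaining the proportions, shown under the assumption that the default right-closed intervals are wanted.

```python
import pandas as pd

# Hypothetical floor areas in square meters
areas = pd.Series([25, 45, 45, 65, 130, 200])
bins = [1, 30, 50, 70, 90, 120, 140, 160, 1200]

# Each value is assigned to an interval (left, right]; the right edge is included
area_cut = pd.cut(areas, bins)

# Share of listings per interval, in percent, kept in interval order
share = area_cut.value_counts(sort=False, normalize=True) * 100

print(share)
```

Because the intervals are closed on the right by default, a home of exactly 30 square meters falls in (1, 30], not (30, 50]. Ranges with no listings still appear with a share of 0, so the result always has one entry per range.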


area_percentage = (area_cut_data['freqs'].values)*100

labels  = ['30平米以下', '30-50平米', '50-70平米', '70-90平米',
'90-120平米','120-140平米','140-160平米','160平米以上']

plt.figure(figsize=(20, 8), dpi=100)
plt.axes(aspect=1)  # equal aspect ratio keeps the pie circular; without it the pie is drawn as an ellipse
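
The source stops before the pie chart itself is drawn. The following is only a sketch of how such a chart could be produced, not the author's code: the percentages are hypothetical placeholders for area_percentage, the labels are written in English to avoid a font dependency, and the Agg backend is selected so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt

# Hypothetical shares (in percent), one per area range
area_percentage = [10.0, 20.0, 25.0, 20.0, 12.0, 5.0, 3.0, 5.0]
labels = ["<30", "30-50", "50-70", "70-90",
          "90-120", "120-140", "140-160", ">160"]

fig, ax = plt.subplots(figsize=(8, 8))
ax.set_aspect("equal")  # keep the pie circular

# autopct prints each wedge's percentage with two decimal places
wedges, texts, autotexts = ax.pie(area_percentage, labels=labels, autopct="%.2f%%")
ax.set_title("Share of listings by area range")
```

In an interactive session, calling plt.show() afterwards displays the figure. The number of labels must match the number of ranges produced by the cut boundaries.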