
Data Analysis in Practice: Statistical Analysis of Beijing Rental Housing Data

2022-11-24 21:34:06 Her han rain Chen

2.1 Data Analysis in Practice: Statistical Analysis of Beijing Rental Housing Data

Learning objectives

  • Master reading and writing data with Pandas.
  • Use preprocessing techniques to filter data.
  • Use the Matplotlib library to draw various kinds of charts.
  • Analyze data independently.

In recent years, rapid economic development has drawn a large floating population to first-tier cities in search of resources and job opportunities, making these cities some of the most densely populated in the country. According to statistics, Beijing's resident population reached 21.707 million in 2017, and a large share of its migrant residents meet their housing needs by renting.

This article uses listings from a Beijing rental website as its data source and applies the data analysis techniques covered earlier to walk through an analysis of real data, producing the following statistics in chart form:

  • (1) Count the total number of listings in each district and analyze how the listings are distributed by location.
  • (2) Use a bar chart to analyze which layouts have the most listings and are the most popular.
  • (3) Compute the average rent in each district, and combine a bar chart with a line chart to analyze listing counts and rent by district.
  • (4) Compute the market share of each floor-area range and draw a pie chart of the proportions.

1 Data overview

Many rental platforms are available online, such as 自如 (Ziroom), 爱屋吉屋 (Aiwujiwu), 房天下 (Fang.com), and 链家 (Lianjia, also known as Homelink). Of these, Lianjia currently holds the highest market share, and its platform offers convenient, comprehensive, and reliable listing information.

Using web-crawling techniques, the rental listings published on the Lianjia website were collected (as of September 10, 2018), including the district, community name, house, price, floor area, and layout. Note that the Lianjia website does not provide rental data for the outlying districts of Pinggu, Huairou, Miyun, and Yanqing, so this case study does not cover these four districts.

The crawled data is downloaded locally and saved in the file “链家北京租房数据.csv”. Opening the file shows a large number of records (8,224 in this crawl), as shown in the figure below.

[Figure: contents of the rental data CSV file]

2 Reading the data

Once the data is ready, we can use Pandas to read the data stored in the CSV file and convert it into a DataFrame object, which makes the data easier to process.

First, read the data:

import pandas as pd
import numpy as np

# Read the Lianjia Beijing rental listings (adjust the path to where the CSV file is saved)
file_data = pd.read_csv('./data/1.csv')
file_data.head()

The result looks like this:

[Figure: first five rows of file_data]
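
As a rough illustration of what to check right after loading, the sketch below reads a few made-up rows from an in-memory string instead of the real file. The column names 区域, 户型, 面积(㎡), and 价格(元/月) appear in the article's code; the 小区名称 header, the “平米” suffix on the area values, and every row value are assumptions made for the example.

```python
import io

import pandas as pd

# Made-up rows that mimic the assumed column layout of the rental CSV.
csv_text = """区域,小区名称,户型,面积(㎡),价格(元/月)
东城,小区A,2室1厅,59.11平米,10000
东城,小区B,1室1厅,41.00平米,6200
朝阳,小区C,3室2厅,120.50平米,15000
"""

sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)    # (number of rows, number of columns)
print(sample.dtypes)   # the area column is read as text because of the unit suffix
sample.info()          # column names, non-null counts, and dtypes in one report
```

Checking the dtypes at this point shows early on that the area column is not numeric, which is what section 3.2 goes on to fix.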

3 Data preprocessing

Although most of the data crawled directly from the Lianjia website is tidy, it still has some problems and cannot be used for analysis as it is. The data therefore needs a series of checks and processing steps before use, including handling duplicate and missing values and unifying data types, so that it is more usable.

3.1 Handling duplicate and missing values

The first two preprocessing steps are checking for missing values and duplicate values. To check whether the data contains duplicate rows, use the duplicated() method in Pandas. Next, we apply duplicated() to the Beijing rental data; every row that repeats an earlier row is marked True. The code is as follows.

# Detect duplicate rows
file_data.duplicated()

Because the data set is fairly large, Jupyter Notebook omits part of the output, but several rows in the result are still shown as True, which indicates that the data contains duplicates. Here, duplicates are handled by deleting them. Next, use the drop_duplicates() method to delete the duplicate rows directly. The code is as follows.

# Delete duplicate rows
file_data = file_data.drop_duplicates()

Compared with the previous row count, the amount of data has dropped considerably: only 5,773 records remain.
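
Because the notebook truncates long output, it can help to count the duplicates rather than read the True/False column by eye. The following is a minimal sketch on made-up rows (not the real listings) showing how duplicated() and drop_duplicates() relate.

```python
import pandas as pd

# Toy data: rows 1 and 3 repeat row 0 exactly.
toy = pd.DataFrame({
    "区域": ["东城", "东城", "朝阳", "东城"],
    "户型": ["2室1厅", "2室1厅", "1室1厅", "2室1厅"],
    "价格(元/月)": [10000, 10000, 6000, 10000],
})

dup_mask = toy.duplicated()          # True for every repeat of an earlier row
n_duplicates = int(dup_mask.sum())   # number of rows drop_duplicates() will remove

deduped = toy.drop_duplicates()      # keeps the first occurrence and the original index
print(n_duplicates, len(toy), len(deduped))
```

Note that drop_duplicates() keeps the original index labels, which is why a row can still be looked up by its original index after this step.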

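Once duplicate detection is complete, we can check whether the data has missing values. We can call the dropna() method directly to remove any rows that contain missing data; comparing the row count before and after shows whether any existed. The code is as follows.

# Delete rows with missing data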
file_data = file_data.dropna()

After this step, the total number of rows is unchanged, so we can conclude that the prepared data contains no missing values.
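
dropna() removes rows but does not report where the gaps were. One possible way to make the check explicit, sketched here on made-up data, is to count missing values per column with isna() before dropping anything.

```python
import numpy as np
import pandas as pd

# Toy data with one missing district and one missing price.
toy = pd.DataFrame({
    "区域": ["东城", "朝阳", None],
    "价格(元/月)": [10000, np.nan, 6000],
})

missing_per_column = toy.isna().sum()   # missing values in each column
rows_before = len(toy)
cleaned = toy.dropna()                  # drop rows with any missing value
rows_removed = rows_before - len(cleaned)

print(missing_per_column)
print(rows_removed)
```

If every entry of missing_per_column is 0, dropna() leaves the row count unchanged, which is the situation described above for the rental data.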

3.2 Converting data types

In this rental data set, the values in the “面积(㎡)” column contain Chinese characters, so the column is of string type. To make later numerical operations on floor area easier, the “面积(㎡)” column needs to be converted to the float type. The code is as follows.

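# Create an array of zeros, one entry per row
data_new = np.zeros(file_data.shape[0])
# Take the “面积(㎡)” column and strip the two trailing Chinese characters from each value
# (file_data.info() shows the column types before and after)

data_area = file_data["面积(㎡)"].values

for i,value in enumerate(data_area):
    data_new[i] = np.array(value[:-2],dtype=np.float64)
# Replace the column with the new data; assigning the column directly
# gives it the float64 dtype instead of keeping the old string dtype
file_data['面积(㎡)'] = data_new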

[Figure: the “面积(㎡)” column after type conversion]
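
The same conversion can also be written without an explicit loop. The sketch below is an alternative, not the article's method, and it assumes the trailing unit is the two-character suffix “平米”; the values are made up.

```python
import pandas as pd

# Toy column in the assumed raw format: number followed by "平米".
toy = pd.DataFrame({"面积(㎡)": ["59.11平米", "41.00平米", "120.50平米"]})

# Remove the unit as a literal string (regex=False) and convert to float in one step.
toy["面积(㎡)"] = (
    toy["面积(㎡)"]
    .str.replace("平米", "", regex=False)
    .astype(float)
)

print(toy["面积(㎡)"].dtype)
print(toy["面积(㎡)"].tolist())
```

Removing the suffix by name rather than by position fails loudly (astype raises an error) if a value carries a different unit, which can be preferable to silently cutting off two characters.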

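In addition, most values in the “户型” (layout) column have the form “*室*厅”, while a few have the form “*房间*卫” (for example, the row at index 8219). For convenience later on, “房间” needs to be replaced with “室” to keep the data consistent.

Next, use the string replace() method to perform the replacement on each value. The code is as follows.

# Get the “户型” column
housetype_data = file_data['户型']
temp_list = []
# Replace “房间” with “室” using replace()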
for i in housetype_data:
    new_info = i.replace('房间','室')
    temp_list.append(new_info)
file_data.loc[:,'户型'] = temp_list

Comparing the data before and after processing shows that the layout at index 8219 has changed from “4房间2卫” to “4室2卫”, so the replacement succeeded.
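
For reference, Pandas also offers a vectorized string replacement through the .str accessor, which does the same job as the loop above in a single call. A minimal sketch on made-up layout values:

```python
import pandas as pd

# Toy layout values: two of them use the "房间" wording.
toy = pd.DataFrame({"户型": ["2室1厅", "4房间2卫", "1房间1卫"]})

# regex=False treats "房间" as a literal substring rather than a pattern.
toy["户型"] = toy["户型"].str.replace("房间", "室", regex=False)

print(toy["户型"].tolist())
```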

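4 Chart analysis

With preprocessing complete, the data can be used for analysis. To make patterns in the data easier to see, we use charts to support the analysis.

4.1 Listing counts and location distribution

To count the listings in each district and see how they are distributed, we first need the listings for each district. To do this, group the whole data set by the “区域” (district) column.

To show the number of listings per district clearly, only the “区域” and “数量” (count) columns are needed. So we first create a DataFrame object with placeholder counts, and then fill in the total computed for each district. The code is as follows.

# Create a DataFrame with only two columns: district and count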

new_df = pd.DataFrame({
    '区域':file_data['区域'].unique(),'数量':[0]*13})

[Figure: districts with listing counts initialized to 0]

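Next, use the Pandas groupby() method to group file_data by the “区域” column, and use the count() method to count the rows in each group. The code is as follows.

# Group file_data by the “区域” column and count the rows in each group

groupy_area = file_data.groupby(by='区域')['区域'].count()
# Match counts to districts by name: groupby() sorts the districts,
# while unique() keeps their order of appearance
new_df['数量'] = new_df['区域'].map(groupy_area)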

[Figure: listing counts by district]

Sort new_df with the sort_values() method, from largest to smallest. The code is as follows.

# Sort by the “数量” column in descending order

new_df.sort_values(by=['数量'], ascending=False)

[Figure: districts sorted by listing count]

The sorted output shows that the districts with the most listings are, in order, Chaoyang, Haidian, and Fengtai.
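
The group, count, and sort steps above can also be collapsed into one call. The sketch below uses value_counts() on made-up district values; it is an alternative shown for comparison, not the approach the article takes.

```python
import pandas as pd

# Toy district column (made-up rows).
toy = pd.DataFrame({"区域": ["朝阳", "海淀", "朝阳", "丰台", "朝阳", "海淀"]})

counts = (
    toy["区域"]
    .value_counts()              # counts per district, largest first
    .rename_axis("区域")         # name the index so it becomes the 区域 column
    .reset_index(name="数量")    # turn the Series into a two-column DataFrame
)

print(counts)
```

Because the counts stay attached to their district labels throughout, there is no step where the two could fall out of alignment.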

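4.2 Layout counts

As living standards rise and households' needs diversify, developers have designed a wide variety of layouts. Next, we analyze layouts: we count which layouts have the most listings on the rental market and pick out those with more than 50 listings.

First, define a function that counts the listings for each layout. The code is as follows.

# Define a function that counts the listings for each layout
def all_house(arr):
    key = np.unique(arr)
    result = {}
    for k in key:
        mask = (arr == k)
        arr_new = arr[mask]
        v = arr_new.size
        result[k] = v
    return result

# Get the layout column
house_array = file_data['户型']
house_info = all_house(house_array)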

[Figure: listing counts by layout]

The program outputs a dictionary whose keys are the layouts and whose values are the number of listings for each layout.

Use a dictionary comprehension to select the layouts with more than 50 listings, and convert the filtered result into a DataFrame object. The code is as follows.

# Use a dictionary comprehension
house_type = {key: value for key, value in house_info.items() if value > 50}
show_houses = pd.DataFrame({
    '户型': list(house_type.keys()),
    '数量': list(house_type.values())})

[Figure: layouts with more than 50 listings]
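
The counting function and the filter can also be expressed with value_counts() and a boolean mask. The sketch below runs on made-up layouts, so the threshold (named min_count here, a hypothetical variable) is lowered from the article's 50 to 2.

```python
import pandas as pd

# Toy layout column: 4, 3, and 1 listings for three layouts.
layouts = pd.Series(["2室1厅"] * 4 + ["1室1厅"] * 3 + ["5室3厅"], name="户型")
min_count = 2   # the article uses 50; lowered here to suit the toy data

layout_counts = layouts.value_counts()                # listings per layout
popular = layout_counts[layout_counts > min_count]    # keep layouts above the threshold
show = popular.rename_axis("户型").reset_index(name="数量")

print(show)
```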

To see the differences between layout counts more intuitively, we can display them in a bar chart, in which the horizontal axis shows the layout and the vertical axis shows the number of listings. The code is as follows.

# Plot the number of listings for each layout
import matplotlib.pyplot as plt

plt.rcParams['font.family'] = 'SimHei'       # a font that can render Chinese labels
plt.rcParams['axes.unicode_minus'] = False   # render minus signs correctly with this font
house_type = show_houses["户型"]
house_type_num = show_houses["数量"]
positions = range(len(house_type))           # one bar per layout

plt.bar(positions, house_type_num)

plt.xticks(positions, house_type)
# ylim: set the y-axis range
plt.ylim(0, 2500)

plt.title("北京市各户型租房数量统计")
plt.xlabel("房屋类型")
plt.ylabel("数量")


# Add the exact number above each bar
# plt.text(x, y, string): place a text label (x: x position; y: y position; string: label text)
for x, y in enumerate(house_type_num):
    # print(x, y)
    plt.text(x-0.3,y+50, "%s" %y)

plt.show()

4.3 Average rent

To examine the listings further, we next analyze the average rent in each district. The average rent per district is computed in the same way as the listing counts per district: first create a DataFrame object. The code is as follows.

# Create a DataFrame with total rent and total area initialized to 0

df_all = pd.DataFrame({
    '区域':file_data['区域'].unique(),
                         '房租总金额':[0]*13,
                         '总面积(㎡)':[0]*13})

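[Figure: df_all with totals initialized to 0]

Next, group by the “区域” column and call the sum() method to total the rent and the floor area for each district. The code is as follows:

# Total rent and total area per district

sum_price = file_data['价格(元/月)'].groupby(file_data['区域']).sum()
sum_area = file_data['面积(㎡)'].groupby(file_data['区域']).sum()
# Match totals to districts by name: groupby() sorts the districts, unique() does not
df_all['房租总金额'] = df_all['区域'].map(sum_price)
df_all['总面积(㎡)'] = df_all['区域'].map(sum_area)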

[Figure: total rent and total area by district]

With the total rent and total area computed for each district, the rent per square metre can be calculated. Add a new column named “每平米租金(元)” to df_all to hold the average price per square metre. The code is as follows.

# Compute the rent per square metre in each district, rounded to two decimal places

df_all['每平米租金(元)'] = round(df_all['房租总金额'] / df_all['总面积(㎡)'], 2)

[Figure: rent per square metre by district]
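
As a compact alternative, the totals and the per-square-metre figure can be derived from a single groupby result, so the district labels and the numbers never get separated. The rows below are made up for illustration.

```python
import pandas as pd

# Toy listings (made-up prices and areas).
toy = pd.DataFrame({
    "区域": ["朝阳", "朝阳", "海淀"],
    "价格(元/月)": [6000, 9000, 8800],
    "面积(㎡)": [50.0, 100.0, 80.0],
})

# Sum rent and area per district in one pass.
totals = toy.groupby("区域")[["价格(元/月)", "面积(㎡)"]].sum()

# Total rent divided by total area, rounded to two decimals.
totals["每平米租金(元)"] = (totals["价格(元/月)"] / totals["面积(㎡)"]).round(2)
per_district = totals.reset_index()

print(per_district)
```

Dividing total rent by total area gives an area-weighted average, so large homes count for more than they would in a plain mean of each listing's own price per square metre.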

To get a fuller picture of listing counts and average rent across districts, we can merge the new_df object created earlier (listing counts per district) with df_all. Because both objects contain the “区域” column, they can be merged on that key with the merge() function. The code is as follows.

# Merge new_df and df_all

df_merge = pd.merge(new_df, df_all)

[Figure: merged listing counts and rent by district]
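
When no key is given, merge() joins on every column name the two frames share. A slightly more defensive variant, sketched below on made-up frames, names the key explicitly and asks Pandas to verify that each district appears only once on each side.

```python
import pandas as pd

# Two toy frames that list the same districts in different orders.
counts = pd.DataFrame({"区域": ["朝阳", "海淀"], "数量": [3, 2]})
rents = pd.DataFrame({"区域": ["海淀", "朝阳"], "每平米租金(元)": [110.0, 100.0]})

merged = pd.merge(
    counts,
    rents,
    on="区域",                # join on the district name, not on row position
    how="inner",
    validate="one_to_one",    # raise an error if a district is duplicated on either side
)

print(merged)
```

The rows are matched by district name, so the differing row order of the two inputs does not matter.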

After merging, charts can be used to display the listing information for each district: the number of listings is shown as bars, and the rent per square metre is shown as points on a line chart. The code is as follows.

# Visualization

num = df_merge["数量"]
price = df_merge["每平米租金(元)"]
x_label = df_merge["区域"]
x = list(range(len(x_label)))   # one x position per district

fig = plt.figure(figsize=(10, 8), dpi=100)

# Draw the line chart
ax1 = fig.add_subplot(111)
# 'or-': o is a circle marker, r is red, - is a solid line
ax1.plot(x, price, "or-", label="价格")
for i, (_x, _y) in enumerate(zip(x, price)):
    plt.text(_x+0.2, _y, _y)
ax1.set_ylim([0, 160])   
ax1.set_ylabel("价格")
plt.legend(loc="upper right")

# Draw the bar chart
# twinx(): create a second y-axis that shares the same x-axis
# alpha: transparency
ax2 = ax1.twinx()
plt.bar(x, num, label="数量", alpha=0.2, color="green")
ax2.set_ylabel("数量")
plt.legend(loc="upper left")
plt.xticks(x, x_label)


plt.show()

4.4 Floor-area ranges

Next, we divide the floor-area data into several ranges according to fixed rules and look at how listings are distributed across them, to see which kinds of homes rent most readily and which area ranges have the most listings.

To divide data into ranges, use the cut() function in Pandas. First, use the max() and min() methods to find the largest and smallest floor area. The code is as follows.

# Show the largest and smallest floor area
print('房屋最大面积是%d平米'%(file_data['面积(㎡)'].max()))
print('房屋最小面积是%d平米'%(file_data['面积(㎡)'].min()))

# Show the highest and lowest rent
print('房租最高价格为每月%d元'%(file_data['价格(元/月)'].max()))
print('房租最低价格为每月%d元'%(file_data['价格(元/月)'].min()))

Here, we define the area ranges with reference to those used on the Lianjia website, dividing floor area into 8 ranges. Then use the describe() method to show the number of listings in each range (counts) and their frequency (freqs). The code is as follows.

# Area bins
area_divide = [1, 30, 50, 70, 90, 120, 140, 160, 1200]
area_cut = pd.cut(list(file_data['面积(㎡)']), area_divide)
area_cut_data = area_cut.describe()

[Figure: counts and frequencies for each area range]
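
To make the binning step concrete, here is a small sketch that applies the same bin edges to a handful of made-up areas. It passes the range labels to cut() directly and uses value_counts(normalize=True) to get shares, which is a variation on the describe() approach above rather than the article's exact code.

```python
import pandas as pd

# Made-up floor areas in square metres.
areas = pd.Series([25.0, 45.0, 60.0, 65.0, 85.0, 130.0, 200.0])

bins = [1, 30, 50, 70, 90, 120, 140, 160, 1200]
labels = ['30平米以下', '30-50平米', '50-70平米', '70-90平米',
          '90-120平米', '120-140平米', '140-160平米', '160平米以上']

# Each interval is open on the left and closed on the right, e.g. (50, 70].
binned = pd.cut(areas, bins=bins, labels=labels)

# Share of listings per range in percent, kept in bin order (sort=False).
share = binned.value_counts(normalize=True, sort=False) * 100

print(share.round(2))
```

Ranges with no listings still appear with a share of 0, so the result always has one entry per label and lines up with the labels used for a pie chart.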

Next, use a pie chart to show the distribution across area ranges. The code is as follows.

area_percentage = (area_cut_data['freqs'].values)*100

labels  = ['30平米以下', '30-50平米', '50-70平米', '70-90平米',
'90-120平米','120-140平米','140-160平米','160平米以上']

plt.figure(figsize=(20, 8), dpi=100)
plt.axes(aspect=1)  # Draw the pie as a circle; without this it appears as an ellipse
plt.pie(x=area_percentage, labels=labels, autopct='%.2f %%', shadow=True)
plt.legend(loc='upper right')
plt.show()

The result is shown below:

[Figure: pie chart of listings by area range]

The chart shows that homes of 50-70 square meters hold the largest share of the rental market. Overall, tenants mainly rent homes of under 120 square meters, with 50-70 square meters being the preferred size.


Copyright notice
This article was written by [Her han rain Chen]. Please include a link to the original when reposting. Thank you.
https://chowdera.com/2022/328/202211242127009868.html
