Scraping practice with Python + Jupyter

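I had done some light scraping when collecting horse racing data, but I wanted to see whether I could do a bit more with it, so I decided to work through some exercises to build my skills.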

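I used the following as a reference. It is a little old, from just under five years ago, but it looked like a good test of skill, so I gave it a try.

Pythonクローリング&スクレイピング練習問題 (Python crawling & scraping practice problems)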

I set aside just one day for this and worked through four problems. This post is part 1, the beginner level.

1. [Beginner] Get the list of URLs from a Qiita Advent Calendar

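When scraping, the first thing I consider is whether XPath or BeautifulSoup will make the job easier.

I normally use XPath, but as far as I remember, an XPath text() step only returns the text sitting directly inside a tag, not the text of nested tags, so when there are many pieces to extract I sometimes use BeautifulSoup as well.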
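To make the difference concrete, here is a small sketch on a made-up HTML snippet (the markup is an assumption for illustration, not taken from the Qiita page). It compares XPath text() with XPath string(), lxml's text_content(), and BeautifulSoup's get_text(), the last three of which also pick up text inside nested tags.

```python
from bs4 import BeautifulSoup
from lxml import html

# Hypothetical snippet: the text is split between the <div> and a nested <a>.
SAMPLE = '<div class="comment">Read <a href="/x">this post</a> today</div>'

node = html.fromstring(SAMPLE)

# XPath text() returns only the text nodes that are direct children of the div.
direct_text = node.xpath('text()')

# XPath string() and lxml's text_content() include descendant text as well.
full_text_xpath = node.xpath('string()')
full_text_lxml = node.text_content()

# BeautifulSoup's get_text() also walks all descendants.
full_text_bs4 = BeautifulSoup(SAMPLE, 'html.parser').div.get_text()

print(direct_text)
print(full_text_xpath)
```

So XPath alone can return the nested text too; which tool is more comfortable is mostly a matter of taste.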

This time I decided to use BeautifulSoup.

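I ran this in JupyterLab. When scraping I want to avoid hitting the target site over and over, and in Jupyter I can fetch the page once and keep working with the response in later cells.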
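Outside a notebook, one way to get the same "fetch once" behavior is to keep a copy of the page on disk. The following is only a minimal sketch: load_html, cache_file, and fetch are hypothetical names introduced here and are not part of the original code.

```python
from pathlib import Path
from typing import Callable


def load_html(url: str, cache_file: Path, fetch: Callable[[str], str]) -> str:
    """Return the page HTML, calling fetch(url) only when no cached copy exists."""
    if cache_file.exists():
        # Reuse the saved copy instead of contacting the site again.
        return cache_file.read_text(encoding='utf-8')
    page = fetch(url)
    cache_file.write_text(page, encoding='utf-8')
    return page
```

With requests, the fetch argument could be something like `lambda u: requests.get(u, timeout=10).text`, so the site is contacted only on the first call.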

import requests
from bs4 import BeautifulSoup

url = "http://qiita.com/advent-calendar/2016/crawler"
# Fetch the page once; the response stays available to later notebook cells
r = requests.get(url)
r  # as the last line of a cell, this displays the response status

# Name the parser explicitly so BeautifulSoup does not have to guess one
html_table = BeautifulSoup(r.text, 'html.parser').find('table')
r.close()

# Author cells, and the comment cells that hold both the title and the link
author = html_table.find_all('div', 'adventCalendarCalendar_author')
title = html_table.find_all('div', 'adventCalendarCalendar_comment')
link = html_table.find_all(class_='adventCalendarCalendar_comment')

# Quick checks used while exploring the page:
# print(author[0].text, title[0].text, link[0].a.get('href'))
# print(len(author), len(title), len(link))

# Strip the non-breaking space (\xa0) that follows each author name
authors = [_a.text.replace('\xa0', '') for _a in author]
titles = [_t.text for _t in title]
links = [_L.a.get('href') for _L in link]

# Print one line per entry: URL, title, [author]
for a, t, L in zip(authors, titles, links):
    print(f'{L} {t} [{a}]')

The output looks like this, and with that the problem is done.

http://amacbee.hatenablog.com/entry/2016/12/01/210436 scrapy-splashを使ってJavaScript利用ページを簡単スクレイピング  [amacbee]
http://qiita.com/Azunyan/items/9b3d16428d2bcc7c9406 Python Webスクレイピング 実践入門 [Azunyan1111]
http://blog.takuros.net/entry/2016/12/05/082533 非エンジニアでも何とか出来るクローラー／Webスクレイピング術  [takuros]
・・・
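One caveat with the code above: it builds three separate lists and zips them, and a comment cell without an <a> tag would make `.a.get('href')` fail with an AttributeError. As a hedged alternative, the sketch below reads the author, title, and link from each cell together. The sample markup and the extract_entries name are assumptions for illustration; only the two class names come from the code above, and the real page structure may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that reuses the class names from the code above.
SAMPLE = """
<table>
  <tr>
    <td>
      <div class="adventCalendarCalendar_author">alice&nbsp;</div>
      <div class="adventCalendarCalendar_comment"><a href="http://example.com/1">First post</a></div>
    </td>
    <td>
      <div class="adventCalendarCalendar_author">bob</div>
      <div class="adventCalendarCalendar_comment">Not published yet</div>
    </td>
  </tr>
</table>
"""


def extract_entries(page: str) -> list[dict]:
    """Collect author, title, and link per calendar cell, tolerating missing links."""
    table = BeautifulSoup(page, 'html.parser').find('table')
    entries = []
    for comment in table.find_all('div', class_='adventCalendarCalendar_comment'):
        # Look up the author inside the same cell so the fields stay aligned.
        author = comment.parent.find('div', class_='adventCalendarCalendar_author')
        anchor = comment.find('a')
        entries.append({
            'author': author.get_text().replace('\xa0', '').strip() if author else '',
            'title': comment.get_text(strip=True),
            'link': anchor.get('href') if anchor else None,
        })
    return entries


entries = extract_entries(SAMPLE)
for e in entries:
    print(f"{e['link']} {e['title']} [{e['author']}]")
```

Entries without a link come back with `link` set to None instead of stopping the loop.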

To be continued in part 2.

That's all for this time. See you again.

CyberMameCAN

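I live in Izumi, Kagoshima Prefecture, and I hope to take on all sorts of projects in and around the internet. I'm a Mac user who builds websites, configures servers, and writes programs, and who enjoys fishing and woodworking. These days I'm interested in data science and working hard on AI horse racing predictions.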
