Web Scraping 1

Web scraping: pulling out only the parts of a page you want

Web crawling: sweeping up everything

Web: HTML, CSS, JavaScript (scraping mostly deals with the HTML)

XPath: a path to an element in the HTML (something like /html/body/div/span/a... -> using the element's id attribute value makes it much shorter)
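As a rough illustration of the point above, the sketch below finds the same element twice, once by spelling out the full path and once by its id. It uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath and needs well-formed markup, so real pages would usually call for a proper HTML parser. The snippet and the id value `login` are made up for the example.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed snippet; real HTML is often not valid XML.
doc = (
    "<html><body><div><span>"
    '<a id="login" href="/login">Log in</a>'
    "</span></div></body></html>"
)
root = ET.fromstring(doc)  # root is the <html> element

# Long form: walk every level from the root down to the link.
by_path = root.find("./body/div/span/a")

# Short form: search anywhere in the tree for the element with this id.
by_id = root.find(".//a[@id='login']")

print(by_path is by_id)      # both lookups return the same element
print(by_id.get("href"))
```

The id-based lookup keeps working if the surrounding div/span nesting changes, which is why it tends to be the more robust choice.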

*requests library*

import requests

res = requests.get("---site---")

res.raise_for_status() # raises an exception right away if the request failed -> to inspect the response code yourself, check res.status_code

print(res.text) # print the page content

with open("mysite.html", "w", encoding="utf8") as f:
    f.write(res.text) # save the fetched content to a file
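The steps above can be gathered into small helpers, roughly as sketched here. The function names, the `timeout` argument, and the `https://example.com/` address are assumptions added for illustration, not part of the notes. Nothing is fetched when the block is loaded; the network call is left in a comment.

```python
import requests


def describe_status(res):
    """Manual check: compare the status code instead of raising."""
    if res.status_code == requests.codes.ok:  # 200
        return "ok"
    return f"failed with {res.status_code}"


def fetch_html(url, timeout=10):
    """Fetch a page and return its text, raising on an HTTP error status."""
    res = requests.get(url, timeout=timeout)
    res.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
    return res.text


def save_html(html, path):
    """Write the fetched text to a file as UTF-8."""
    with open(path, "w", encoding="utf8") as f:
        f.write(html)


# Example use (performs a real network request, so it is left commented out):
# html = fetch_html("https://example.com/")
# save_html(html, "mysite.html")
```

`raise_for_status()` suits scripts that should stop at the first bad response, while checking `status_code` is handy when a failure should be logged and skipped.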

*Regular expressions*

import re

p = re.compile("ca.e") # . : any single character  ^ : start of the string  $ : end of the string

m = p.match("care") # 1. match: checks whether the pattern matches from the start of the string

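if m:
    print(m.group()) # the matched text
    print(m.string) # the string that was passed in
    print(m.start()) # start index of the match
    print(m.end()) # end index of the match
    print(m.span()) # (start, end) index pair
else:
    print("No match")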

m = p.search("good care") # 2. search: checks whether the pattern matches anywhere in the string

lst = p.findall("good care cafe") # 3. findall: returns every match as a list
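The three metacharacters noted next to `re.compile` (`.`, `^`, `$`) can be tried out with a few short patterns. This is a small sketch; the helper name `matches` and the sample words are chosen only for illustration.

```python
import re


def matches(pattern, text):
    """Return True if the pattern is found anywhere in the text."""
    return re.search(pattern, text) is not None


# . stands for exactly one character
print(matches("ca.e", "cave"))   # True
print(matches("ca.e", "caffe"))  # False: two characters sit between "ca" and "e"

# ^ anchors the pattern to the start of the string
print(matches("^de", "desk"))    # True
print(matches("^de", "fade"))    # False

# $ anchors the pattern to the end of the string
print(matches("se$", "case"))    # True
print(matches("se$", "face"))    # False
```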

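1. p = re.compile("pattern")

2. m = p.match("string to test"): checks whether the pattern matches at the start of the string

3. m = p.search("string to test"): checks whether the pattern matches anywhere in the string

4. lst = p.findall("string to test"): returns every match as a "list"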
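To make the difference between the three calls in the list above concrete, here is a short side-by-side run with the same `ca.e` pattern. The `describe` helper is a hypothetical convenience for printing, not part of the `re` module.

```python
import re

p = re.compile("ca.e")


def describe(m):
    """Summarise a match object, or report that nothing matched."""
    if m is None:
        return "no match"
    return f"{m.group()!r} at {m.span()}"


print(describe(p.match("care")))        # 'care' at (0, 4)
print(describe(p.match("careless")))    # 'care' at (0, 4): only the start has to match
print(describe(p.match("good care")))   # no match: the string does not start with the pattern
print(describe(p.search("good care")))  # 'care' at (5, 9)
print(p.findall("good care cafe"))      # ['care', 'cafe']
print(p.findall("good day"))            # []
```

`match` and `search` return `None` when nothing is found, which is why the result is checked with `if m:` before calling `group()`. `findall` returns an empty list instead.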

+w3schools > learn python > RegEx

+python re

*user agent*

Works around the 403 error that can come back from requests.get() when a site rejects the request

User agent string: search for "what is my user agent?" to look up your browser's value

url = "---site URL---"

headers = {"User-Agent": "---your user agent value---"}

res = requests.get(url, headers=headers)
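One likely reason a site answers 403 is that, without a custom header, requests identifies itself as `python-requests/<version>`. The sketch below prints that default and then builds (but does not send) a request with a custom header, so the outgoing value can be inspected offline. The URL and the user agent string are placeholders, not values from the notes.

```python
import requests

# Placeholder values: substitute the real site and your own browser's string.
url = "https://example.com/"
headers = {"User-Agent": "Mozilla/5.0 (placeholder user agent string)"}

# What requests sends as User-Agent when no header is supplied.
default_ua = requests.utils.default_user_agent()
print(default_ua)

# Build the request without sending it, then look at the header it would carry.
prepared = requests.Request("GET", url, headers=headers).prepare()
print(prepared.headers["User-Agent"])
```

Passing the same `headers` dict to `requests.get(url, headers=headers)` sends the request with that value in place of the default.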
