Is parsing or crawling possible in a case like this?
I want to build a feature I really need, and I am trying to solve it by crawling or parsing.
The URL I want to use is:
https://pcmap.place.naver.com/restaurant/list?level=top&entry=pll&query=부산맛집&rank=요즘뜨는
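With Naver's API I can get a list of businesses, but the results cannot be sorted; they come back in a fixed default order.
In the URL above, changing only the rank=*** value to "많이찾는" (most searched), "요즘뜨는" (trending), "저장많은" (most saved), or "리뷰많은" (most reviewed) returns the list for that ranking,
but when I try to parse or crawl it, I keep getting empty results.
What I have tried so far:
1. I tried Snoopy, but perhaps because I used it incorrectly, I could not get any values.
2. When I simply fetched the page, the text came out garbled. It does not look like an encoding problem, and I could not resolve it.
Below is the last source I tried.
I fixed the query values to "부산맛집" and "리뷰많은" for the test.
I would appreciate any advice from the experts here.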
// Fetch the list page with Snoopy; query values are fixed to "부산맛집" and "리뷰많은".
require_once(G5_THEME_PATH.'/Snoopy.class.php');
$snoopy = new Snoopy;
$sch_keyword = urlencode("부산맛집");
$sch_keyword2 = urlencode("리뷰많은");
$url = 'https://pcmap.place.naver.com/restaurant/list?level=top&entry=pll&query='.$sch_keyword.'&rank='.$sch_keyword2.'&order=off';
$snoopy->agent = $_SERVER['HTTP_USER_AGENT'];
$snoopy->referer = $url;
$snoopy->fetch($url);
// Print the whole HTML document only if the pattern actually matched.
if (preg_match('/<!doctype html>(.*?)<\/html>/is', $snoopy->results, $html)) {
    echo $html[0];
}
5 Answers
Hello.
The page you mention is a so-called "dynamic web page".
https://pcmap-api.place.naver.com/graphql
You can send a POST request with a JSON body to the URL above.
I wrote this in a hurry so it is not very tidy, but here it is in Python :)
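from requests_html import HTMLSession
import json

s = HTMLSession()
# Load the list page first; the returned HTML is not used below.
html = s.get('https://pcmap.place.naver.com/restaurant/list?level=top&entry=pll&query=%EB%B6%80%EC%82%B0%EB%A7%9B%EC%A7%91&rank=%EC%9A%94%EC%A6%98%EB%9C%A8%EB%8A%94').content
# Batch of GraphQL operations sent together as one JSON array.
data = [
{
"operationName": "getRestaurants",
"variables": {
"input": {
"query": "부산맛집",
"rank": "요즘뜨는",
"x": X_COORD,  # placeholder: replace with the desired x coordinate
"y": Y_COORD,  # placeholder: replace with the desired y coordinate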
"display": 50,
"start": 1,
"isNmap": False,
"deviceType": "pcmap"
},
"isNmap": False,
"isBounds": True
},
"query": "query getRestaurants($input: RestaurantsInput, $isNmap: Boolean!, $isBounds: Boolean!) {\n restaurants(input: $input) {\n total\n items {\n ...RestaurantItemFields\n easyOrder {\n easyOrderId\n easyOrderCid\n businessHours {\n weekday {\n start\n end\n __typename\n }\n weekend {\n start\n end\n __typename\n }\n __typename\n }\n __typename\n }\n baemin {\n businessHours {\n deliveryTime {\n start\n end\n __typename\n }\n closeDate {\n start\n end\n __typename\n }\n temporaryCloseDate {\n start\n end\n __typename\n }\n __typename\n }\n __typename\n }\n yogiyo {\n businessHours {\n actualDeliveryTime {\n start\n end\n __typename\n }\n bizHours {\n start\n end\n __typename\n }\n __typename\n }\n __typename\n }\n __typename\n }\n nlu {\n ...NluFields\n __typename\n }\n brand {\n name\n isBrand\n type\n menus {\n order\n id\n images {\n url\n desc\n __typename\n }\n name\n desc\n price\n isRepresentative\n detailUrl\n orderType\n catalogId\n source\n menuId\n nutrients\n allergies\n __typename\n }\n __typename\n }\n optionsForMap @include(if: $isBounds) {\n maxZoom\n minZoom\n includeMyLocation\n maxIncludePoiCount\n center\n spotId\n __typename\n }\n queryString\n siteSort\n __typename\n }\n}\n\nfragment RestaurantItemFields on RestaurantSummary {\n id\n dbType\n name\n businessCategory\n category\n description\n hasBooking\n hasNPay\n x\n y\n distance\n imageUrl\n imageUrls\n imageCount\n phone\n virtualPhone\n routeUrl\n streetPanorama {\n id\n pan\n tilt\n lat\n lon\n __typename\n }\n roadAddress\n address\n commonAddress\n blogCafeReviewCount\n bookingReviewCount\n totalReviewCount\n bookingReviewScore\n bookingUrl\n bookingHubUrl\n bookingHubButtonName\n bookingBusinessId\n talktalkUrl\n options\n promotionTitle\n agencyId\n businessHours\n microReview\n tags\n priceCategory\n broadcastInfo {\n program\n date\n menu\n __typename\n }\n michelinGuide {\n year\n star\n comment\n url\n hasGrade\n isBib\n alternateText\n __typename\n }\n broadcasts {\n 
program\n menu\n episode\n broadcast_date\n __typename\n }\n tvcastId\n naverBookingCategory\n saveCount\n uniqueBroadcasts\n isDelivery\n isCvsDelivery\n markerLabel @include(if: $isNmap) {\n text\n style\n __typename\n }\n imageMarker @include(if: $isNmap) {\n marker\n markerSelected\n __typename\n }\n isTableOrder\n isPreOrder\n isTakeOut\n bookingDisplayName\n bookingVisitId\n bookingPickupId\n popularMenuImages {\n name\n price\n bookingCount\n menuUrl\n menuListUrl\n imageUrl\n isPopular\n usePanoramaImage\n __typename\n }\n visitorReviewCount\n visitorReviewScore\n detailCid {\n c0\n c1\n c2\n c3\n __typename\n }\n streetPanorama {\n id\n pan\n tilt\n lat\n lon\n __typename\n }\n newOpening\n __typename\n}\n\nfragment NluFields on Nlu {\n queryType\n user {\n gender\n __typename\n }\n queryResult {\n ptn0\n ptn1\n region\n spot\n tradeName\n service\n selectedRegion {\n name\n index\n x\n y\n __typename\n }\n selectedRegionIndex\n otherRegions {\n name\n index\n __typename\n }\n property\n keyword\n queryType\n nluQuery\n businessType\n cid\n branch\n franchise\n titleKeyword\n location {\n x\n y\n default\n longitude\n latitude\n dong\n si\n __typename\n }\n noRegionQuery\n priority\n showLocationBarFlag\n themeId\n filterBooking\n repRegion\n repSpot\n dbQuery {\n isDefault\n name\n type\n getType\n useFilter\n hasComponents\n __typename\n }\n type\n category\n menu\n context\n __typename\n }\n __typename\n}\n"
},
{
"operationName": "getRestaurantSubFilters",
"variables": {
"input": {
"query": "부산맛집",
"x": X_COORD,  # placeholder: replace with the desired x coordinate
"y": Y_COORD,  # placeholder: replace with the desired y coordinate
"rank": "요즘뜨는",
"isPcmap": True,
"isNmap": False
}
},
"query": "query getRestaurantSubFilters($input: RestaurantFiltersInput) {\n restaurantFilters(input: $input) {\n hideFilterPopup\n isDeliveryFilter\n sub {\n index\n name\n value\n multiSelectable\n items {\n index\n name\n value\n selected\n representative\n clickCode\n laimCode\n __typename\n }\n __typename\n }\n __typename\n }\n}\n"
},
{
"operationName": "getAdBusinessList",
"variables": {
"input": {
"query": "부산맛집",
"start": 1,
"localQueryString": "pr=place_pcmap&version=1.1.3&section=site&section=query&in_enc=utf-8&site_start=1&site_display=50&force_use_center_coord=1&query_rank=0&query=%EB%B6%80%EC%82%B0%EB%A7%9B%EC%A7%91&site_cid=220036&site_sort=7&ip=221.148.27.30&boost_partner=1",
"deviceType": "pcmap",
"siteSort": "7",
"businessType": "restaurant",
"x": X_COORD,  # placeholder: replace with the desired x coordinate
"y": Y_COORD,  # placeholder: replace with the desired y coordinate
"isDefaultLocation": True
},
"isNmap": False
},
"query": "query getAdBusinessList($input: AdBusinessesInput, $isNmap: Boolean!) {\n adBusinesses(input: $input) {\n total\n isExpandedType\n ... on RestaurantAdsResult {\n items {\n ...RestaurantAdItemFields\n __typename\n }\n __typename\n }\n ... on HospitalAdsResult {\n items {\n ...HospitalAdItemFields\n __typename\n }\n __typename\n }\n ... on PlaceAdsResult {\n items {\n ...PlaceAdItemFields\n __typename\n }\n __typename\n }\n ... on AttractionAdsResult {\n items {\n ...AttractionAdItemFields\n __typename\n }\n __typename\n }\n __typename\n }\n}\n\nfragment RestaurantAdItemFields on RestaurantAdSummary {\n adId\n adClickLog {\n clickUrl\n smartOrderClickUrl\n trackingParameters {\n n_ad_group_type\n n_query\n __typename\n }\n __typename\n }\n impressionEventUrl\n adDescription\n id\n dbType\n name\n businessCategory\n category\n description\n hasBooking\n hasNPay\n x\n y\n distance\n imageUrl\n imageCount\n phone\n virtualPhone\n routeUrl\n streetPanorama {\n id\n pan\n tilt\n lat\n lon\n __typename\n }\n roadAddress\n address\n commonAddress\n blogCafeReviewCount\n bookingReviewCount\n totalReviewCount\n bookingUrl\n bookingBusinessId\n talktalkUrl\n detailCid {\n c0\n c1\n c2\n c3\n __typename\n }\n options\n promotionTitle\n agencyId\n businessHours\n markerLabel @include(if: $isNmap) {\n text\n style\n __typename\n }\n imageMarker @include(if: $isNmap) {\n marker\n markerSelected\n __typename\n }\n imageUrls\n bookingReviewScore\n bookingHubUrl\n bookingHubButtonName\n microReview\n tags\n priceCategory\n broadcastInfo {\n program\n date\n menu\n __typename\n }\n michelinGuide {\n year\n star\n comment\n url\n hasGrade\n isBib\n alternateText\n __typename\n }\n broadcasts {\n program\n menu\n episode\n broadcast_date\n __typename\n }\n tvcastId\n naverBookingCategory\n saveCount\n uniqueBroadcasts\n isDelivery\n isCvsDelivery\n isTableOrder\n isPreOrder\n isTakeOut\n bookingDisplayName\n bookingVisitId\n bookingPickupId\n popularMenuImages {\n name\n 
price\n bookingCount\n menuUrl\n menuListUrl\n imageUrl\n isPopular\n usePanoramaImage\n __typename\n }\n visitorReviewCount\n visitorReviewScore\n newOpening\n __typename\n}\n\nfragment HospitalAdItemFields on HospitalAdSummary {\n adId\n adClickLog {\n clickUrl\n smartOrderClickUrl\n trackingParameters {\n n_ad_group_type\n n_query\n __typename\n }\n __typename\n }\n impressionEventUrl\n adDescription\n id\n dbType\n name\n businessCategory\n category\n description\n hasBooking\n hasNPay\n x\n y\n distance\n imageUrl\n imageCount\n phone\n virtualPhone\n routeUrl\n streetPanorama {\n id\n pan\n tilt\n lat\n lon\n __typename\n }\n roadAddress\n address\n commonAddress\n blogCafeReviewCount\n bookingReviewCount\n totalReviewCount\n bookingUrl\n bookingBusinessId\n talktalkUrl\n detailCid {\n c0\n c1\n c2\n c3\n __typename\n }\n options\n promotionTitle\n agencyId\n businessHours\n markerLabel @include(if: $isNmap) {\n text\n style\n __typename\n }\n imageMarker @include(if: $isNmap) {\n marker\n markerSelected\n __typename\n }\n medicalNo\n visitorReviewCount\n visitorReviewScore\n talktalkUrl\n fullAddress\n __typename\n}\n\nfragment PlaceAdItemFields on PlaceAdSummary {\n adId\n adClickLog {\n clickUrl\n smartOrderClickUrl\n trackingParameters {\n n_ad_group_type\n n_query\n __typename\n }\n __typename\n }\n impressionEventUrl\n adDescription\n id\n dbType\n name\n businessCategory\n category\n description\n hasBooking\n hasNPay\n x\n y\n distance\n imageUrl\n imageCount\n phone\n virtualPhone\n routeUrl\n streetPanorama {\n id\n pan\n tilt\n lat\n lon\n __typename\n }\n roadAddress\n address\n commonAddress\n blogCafeReviewCount\n bookingReviewCount\n totalReviewCount\n bookingUrl\n bookingBusinessId\n talktalkUrl\n detailCid {\n c0\n c1\n c2\n c3\n __typename\n }\n options\n promotionTitle\n agencyId\n businessHours\n markerLabel @include(if: $isNmap) {\n text\n style\n __typename\n }\n imageMarker @include(if: $isNmap) {\n marker\n markerSelected\n 
__typename\n }\n medicalNo\n normalizedName\n categoryCodeList\n daysOff\n poiInfo {\n polyline {\n shapeKey {\n id\n name\n version\n __typename\n }\n boundary {\n minX\n minY\n maxX\n maxY\n __typename\n }\n details {\n totalDistance\n arrivalAddress\n departureAddress\n __typename\n }\n __typename\n }\n polygon {\n shapeKey {\n id\n name\n version\n __typename\n }\n boundary {\n minX\n minY\n maxX\n maxY\n __typename\n }\n __typename\n }\n __typename\n }\n subwayId\n oilPrice @include(if: $isNmap) {\n gasoline\n diesel\n lpg\n __typename\n }\n isPublicGas\n isDelivery\n isTableOrder\n isPreOrder\n isTakeOut\n isCvsDelivery\n naverBookingCategory\n bookingDisplayName\n bookingVisitId\n bookingPickupId\n easyOrder {\n easyOrderId\n easyOrderCid\n businessHours {\n weekday {\n start\n end\n __typename\n }\n weekend {\n start\n end\n __typename\n }\n __typename\n }\n __typename\n }\n baemin {\n businessHours {\n deliveryTime {\n start\n end\n __typename\n }\n closeDate {\n start\n end\n __typename\n }\n temporaryCloseDate {\n start\n end\n __typename\n }\n __typename\n }\n __typename\n }\n yogiyo {\n businessHours {\n actualDeliveryTime {\n start\n end\n __typename\n }\n bizHours {\n start\n end\n __typename\n }\n __typename\n }\n __typename\n }\n isPollingStation\n visitorReviewCount\n visitorReviewScore\n naverBookingHubId\n bookingHubUrl\n bookingHubButtonName\n newOpening\n fullAddress\n __typename\n}\n\nfragment AttractionAdItemFields on AttractionAdSummary {\n adId\n adClickLog {\n clickUrl\n smartOrderClickUrl\n trackingParameters {\n n_ad_group_type\n n_query\n __typename\n }\n __typename\n }\n impressionEventUrl\n adDescription\n id\n dbType\n name\n businessCategory\n category\n description\n hasBooking\n hasNPay\n x\n y\n distance\n imageUrl\n imageCount\n phone\n virtualPhone\n routeUrl\n streetPanorama {\n id\n pan\n tilt\n lat\n lon\n __typename\n }\n roadAddress\n address\n commonAddress\n blogCafeReviewCount\n bookingReviewCount\n totalReviewCount\n 
bookingUrl\n bookingBusinessId\n talktalkUrl\n detailCid {\n c0\n c1\n c2\n c3\n __typename\n }\n options\n promotionTitle\n agencyId\n businessHours\n markerLabel @include(if: $isNmap) {\n text\n style\n __typename\n }\n imageMarker @include(if: $isNmap) {\n marker\n markerSelected\n __typename\n }\n cid\n tags\n visitorReviewCount\n poiInfo {\n polyline {\n shapeKey {\n id\n __typename\n }\n __typename\n }\n polygon {\n shapeKey {\n id\n __typename\n }\n __typename\n }\n __typename\n }\n isDelivery\n isTakeOut\n isPreOrder\n isTableOrder\n newOpening\n __typename\n}\n"
}
]
headers = {
'accept': '*/*',
'accept-encoding': 'gzip, deflate, br',
'accept-language': 'ko',
'Content-Type': 'application/json',
'referer': 'https://pcmap.place.naver.com/restaurant/list?level=top&entry=pll&query=%EB%B6%80%EC%82%B0%EB%A7%9B%EC%A7%91&rank=%EC%9A%94%EC%A6%98%EB%9C%A8%EB%8A%94'
}
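# Send every operation in a single POST to the GraphQL endpoint.
result = s.post('https://pcmap-api.place.naver.com/graphql', headers=headers, data=json.dumps(data)).text
json_result = json.loads(result)
# json_result[0] holds the getRestaurants result; collect the restaurant names.
names = []
for i in json_result[0]['data']['restaurants']['items']:
    names.append(i['name'])
print(names)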
The result prints 50 entries as shown below, and you can confirm that it matches Naver's search results ^^
['어느멋진날', '해목', '톤쇼우', '웨이브온 커피', '스케줄해운대', '그라노데카페', '해운대 오복돼지국밥', '메르데쿠르', '해운대암소갈비집', '프루터리포레스트', '아나브린', '해운대 가야밀면', '다리집 본점', '오구카페', '수변최고돼지국밥', '바릇식당', '라푀유크로와상', '굿올데이즈카페', '소담한우', '천일녹즙', '상국이네', '페리데스 하이엔드', '듀스포레', '신발원', '비아조', '갈삼구이', '골목카레', '수월경화', '카페이정원', '부산꼴통라면', '어밤부', '미포집', '마루팥빙수단팥죽', '카페 인 부산', '티앤북스광안점', '쿠카이야', '에테르', '기장끝집', '신기숲', '광안리 만빙고 제면소', '이재모피자', '카린 영도 플레이스', '버거베이', '라운더즈도넛', '본전돼지국밥', '제이스 생텀커피', '버거샵', '탐복', '버거인뉴욕', '모모스커피']
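If only the names are needed, the request body can probably be much smaller than the one above. The following is a minimal sketch, not a confirmed API contract: it assumes the endpoint still accepts the `getRestaurants` operation with the `RestaurantsInput` type used in this answer, and it keeps only fields (`total`, `id`, `name`) that appear in the full query. `build_payload`, `extract_names`, and the sample response (including its ids) are hypothetical names and data made up for illustration. `x` and `y` are passed through unchanged, because the thread leaves them as placeholders.

```python
import json

# Trimmed query: only fields that also appear in the full query of this answer.
NAMES_QUERY = (
    "query getRestaurants($input: RestaurantsInput) {\n"
    "  restaurants(input: $input) {\n"
    "    total\n"
    "    items { id name }\n"
    "  }\n"
    "}\n"
)

# The four rank values mentioned in the question.
RANKS = ["많이찾는", "요즘뜨는", "저장많은", "리뷰많은"]


def build_payload(query, rank, x, y, start=1, display=50):
    """Build a one-operation batch in the same shape as the answer's `data` list."""
    return [{
        "operationName": "getRestaurants",
        "variables": {
            "input": {
                "query": query,
                "rank": rank,
                "x": x,  # caller-supplied coordinate, not validated here
                "y": y,  # caller-supplied coordinate, not validated here
                "display": display,
                "start": start,
                "isNmap": False,
                "deviceType": "pcmap",
            }
        },
        "query": NAMES_QUERY,
    }]


def extract_names(response):
    """Pull the names out of a decoded response, as the answer's loop does."""
    return [item["name"] for item in response[0]["data"]["restaurants"]["items"]]


# Hypothetical response in the shape the answer's code indexes into.
sample_response = [{"data": {"restaurants": {"total": 2, "items": [
    {"id": "1", "name": "해목"},
    {"id": "2", "name": "톤쇼우"},
]}}}]
print(extract_names(sample_response))
```

Calling `build_payload("부산맛집", rank, x, y)` once for each entry in `RANKS` would give one request body per ranking. The result of `json.dumps(...)` on that payload is what gets posted.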
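The Python script can also be implemented in PHP :)
It only mimics the requests and responses you can see in Chrome DevTools, so it is not difficult.
If it still feels difficult, you can write it much more simply with Selenium,
which is available for other languages as well, not only Python!
Have a nice weekend ^-^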
Part of this can be done with the Bart parser (바트 파싱기), but Naver often blocks requests or the collection fails, so I think it is better to do this with Python Selenium.
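For reference, a Selenium version could look roughly like the sketch below. It is only an outline under assumptions: it needs the `selenium` package and a local Chrome installation, and the CSS selector must be supplied by the caller, because the class names on this page (for example `_3Yilt` in the HTML dump posted elsewhere in this thread) look auto-generated and may change at any time. `build_list_url` and `fetch_names_with_selenium` are hypothetical helper names.

```python
from urllib.parse import urlencode

LIST_URL = "https://pcmap.place.naver.com/restaurant/list"


def build_list_url(query, rank):
    """Build the list URL from the question, percent-encoding the Korean values."""
    params = {"level": "top", "entry": "pll", "query": query, "rank": rank}
    return LIST_URL + "?" + urlencode(params)


def fetch_names_with_selenium(query, rank, css_selector, timeout=10):
    """Open the page in a real browser and read the text of the matching elements."""
    # Imported here so build_list_url works even without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(build_list_url(query, rank))
        # Wait until JavaScript has rendered at least one matching element.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_all_elements_located((By.CSS_SELECTOR, css_selector))
        )
        elements = driver.find_elements(By.CSS_SELECTOR, css_selector)
        return [element.text for element in elements]
    finally:
        driver.quit()
```

A call such as `fetch_names_with_selenium("부산맛집", "요즘뜨는", "span._3Yilt")` would use the selector seen in this thread. Check the current markup in the browser's developer tools before relying on it.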
I have heard that Python is better for crawling... how about trying it in Python?
It works fine with Bart.
~<li\sclass=".+?<span\sclass=".+?">(.+?)</span>~isx
Array
(
[0] => Array
(
[0] => <li class="_3t81n _1l5Ut"><div class="Ow5Yt"><a href="#" target="_self" role="button" class="_12E67"><div class="_1z7ih"><div class="_2p53Z"><span class="_3Yilt">어느멋진날</span>
[1] => <li class="_3t81n _1l5Ut"><div class="Ow5Yt"><a href="#" target="_self" role="button" class="_12E67"><div class="_1z7ih"><div class="_2p53Z"><span class="_3Yilt">해목</span>
[2] => <li class="_3t81n _1l5Ut"><div class="Ow5Yt"><a href="#" target="_self" role="button" class="_12E67"><div class="_1z7ih"><div class="_2p53Z"><span class="_3Yilt">톤쇼우</span>
[3] => <li class="_3t81n _1l5Ut"><div class="Ow5Yt"><a href="#" target="_self" role="button" class="_12E67"><div class="_1z7ih"><div class="_2p53Z"><span class="_3Yilt">웨이브온 커피</span>
[4] => <li class="_3t81n _1l5Ut"><div class="Ow5Yt"><a href="#" target="_self" role="button" class="_12E67"><div class="_1z7ih"><div class="_2p53Z"><span class="_3Yilt">스케줄해운대</span>
[5] => <li class="_3t81n _1l5Ut"><div class="Ow5Yt"><a href="#" target="_self" role="button" class="_12E67"><div class="_1z7ih"><div class="_2p53Z"><span class="_3Yilt">그라노데카페</span>
[6] => <li class="_3t81n _1l5Ut"><div class="Ow5Yt"><a href="#" target="_self" role="button" class="_12E67"><div class="_1z7ih"><div class="_2p53Z"><span class="_3Yilt">해운대 오복돼지국밥</span>
[7] => <li class="_3t81n _1l5Ut"><div class="Ow5Yt"><a href="#" target="_self" role="button" class="_12E67"><div class="_1z7ih"><div class="_2p53Z"><span class="_3Yilt">메르데쿠르</span>
[8] => <li class="_3t81n _1l5Ut"><div class="Ow5Yt"><a href="#" target="_self" role="button" class="_12E67"><div class="_1z7ih"><div class="_2p53Z"><span class="_3Yilt">해운대암소갈비집</span>
[9] => <li class="_3t81n _1l5Ut"><div class="Ow5Yt"><a href="#" target="_self" role="button" class="_12E67"><div class="_1z7ih"><div class="_2p53Z"><span class="_3Yilt">프루터리포레스트</span>
)
[1] => Array
(
[0] => 어느멋진날
[1] => 해목
[2] => 톤쇼우
[3] => 웨이브온 커피
[4] => 스케줄해운대
[5] => 그라노데카페
[6] => 해운대 오복돼지국밥
[7] => 메르데쿠르
[8] => 해운대암소갈비집
[9] => 프루터리포레스트
)
)
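The same pattern can be tried in Python. This is a small sketch, assuming the input is HTML that has already been rendered (for example by a browser or Selenium), since the list items are not in the initially fetched page. The PCRE `x` flag is dropped because the pattern contains no literal whitespace. `extract_names_from_html` is a hypothetical helper, and the sample string is built from the first two entries of the dump above.

```python
import re

# Same pattern as the PHP regex above, with the i and s flags mapped to Python.
NAME_PATTERN = re.compile(
    r'<li\sclass=".+?<span\sclass=".+?">(.+?)</span>',
    re.IGNORECASE | re.DOTALL,
)


def extract_names_from_html(rendered_html):
    """Return the text of the first <span> inside each <li class=...> item."""
    return NAME_PATTERN.findall(rendered_html)


# Two list items copied from the dump above.
sample = (
    '<li class="_3t81n _1l5Ut"><div class="Ow5Yt"><a href="#" target="_self" '
    'role="button" class="_12E67"><div class="_1z7ih"><div class="_2p53Z">'
    '<span class="_3Yilt">어느멋진날</span>'
    '<li class="_3t81n _1l5Ut"><div class="Ow5Yt"><a href="#" target="_self" '
    'role="button" class="_12E67"><div class="_1z7ih"><div class="_2p53Z">'
    '<span class="_3Yilt">해목</span>'
)
print(extract_names_from_html(sample))  # ['어느멋진날', '해목']
```

Because the pattern depends on the markup order rather than on a stable id, it is likely to need adjusting whenever the page layout changes.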
PHP's fetch-and-parse approach cannot execute JavaScript. The page is loaded first and then receives its data via Ajax from https://pcmap-api.place.naver.com/graphql, so a plain fetch will give you an empty page.
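In other words, a client that cannot run JavaScript has to make that second request itself. The following sketch shows the transport side with only the Python standard library. It assumes the endpoint accepts a JSON array of operations with these headers, as in the Python answer in this thread. `accept-encoding` is left out on purpose because `urllib` does not decompress responses automatically. `build_graphql_request` and `post_graphql` are hypothetical names, and whether Naver accepts requests without cookies or other headers is not verified here.

```python
import json
import urllib.request

GRAPHQL_URL = "https://pcmap-api.place.naver.com/graphql"


def build_graphql_request(operations, referer):
    """Build (but do not send) the POST request the browser would issue via Ajax."""
    body = json.dumps(operations).encode("utf-8")
    headers = {
        "Accept": "*/*",
        "Accept-Language": "ko",
        "Content-Type": "application/json",
        "Referer": referer,
    }
    return urllib.request.Request(GRAPHQL_URL, data=body, headers=headers, method="POST")


def post_graphql(operations, referer, timeout=10):
    """Send the operations and return the decoded JSON response."""
    request = build_graphql_request(operations, referer)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Here `operations` is the list of GraphQL operations to send and `referer` is the list page URL. The same two steps (encode the JSON body, POST it with a JSON content type) carry over directly to PHP's cURL functions.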