tolf7544/crawiling-youtube-video-data

youtube video data crawling by axios

crawling

  • Fetch the HTML data for a video URL (https://youtu.be/[id]) using axios
  • Parse the YouTube video information contained in the fetched data as JSON
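
The fetch step above can be sketched roughly as follows. This is a minimal sketch, not the repository's actual code: the function names `buildVideoUrl` and `fetchVideoHtml` are hypothetical, and the optional `client` parameter is an assumption added so the function can be exercised without network access. It relies only on the axios behaviour that `get(url)` resolves to a response whose body is in `.data`.

```javascript
// Build the short video URL described above: https://youtu.be/[id]
// (buildVideoUrl is a hypothetical helper name)
function buildVideoUrl(videoId) {
  return `https://youtu.be/${encodeURIComponent(videoId)}`;
}

// Fetch the page HTML for one video.
// `client` is any object with an axios-style get(url) that resolves to
// { data }. When it is omitted, axios is loaded lazily (assumes axios is
// installed and the code runs as CommonJS).
async function fetchVideoHtml(videoId, client) {
  const http = client ?? require("axios");
  const response = await http.get(buildVideoUrl(videoId));
  return response.data; // axios exposes the response body as .data
}
```

Passing a stub object as `client` makes it possible to test the parsing code against saved HTML instead of hitting YouTube on every run.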

์–ป์„์ˆ˜ ์žˆ๋Š” ๋ฐ์ดํ„ฐ๋“ค

  • Information on the video with the given id (e.g. artist information, video data, thumbnail, title, like count, playTime, recommendVideoData, description, ...)
  • Recommended video information (playTime, artist information, title, description, videoId, url, thumbnail, playTime in seconds, ...)
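
As an illustration of how such fields might be collected once the page's embedded JSON has been parsed into an object, here is a small sketch. The function name `summarizeVideo` is hypothetical, and the field names it reads (`videoDetails`, `author`, `shortDescription`, `lengthSeconds`, `thumbnail.thumbnails`) are assumptions about the shape of YouTube's page data, which is undocumented and may change at any time.

```javascript
// Collect a few fields from an already-parsed player response object.
// All property names read from `playerResponse` are assumptions about
// YouTube's page JSON; missing values become null instead of throwing.
function summarizeVideo(playerResponse) {
  const details = playerResponse?.videoDetails ?? {};
  const thumbnails = details.thumbnail?.thumbnails ?? [];
  return {
    videoId: details.videoId ?? null,
    url: details.videoId ? `https://youtu.be/${details.videoId}` : null,
    title: details.title ?? null,
    artist: details.author ?? null,
    description: details.shortDescription ?? null,
    // the length is assumed to arrive as a string of seconds
    playTimeSeconds: details.lengthSeconds ? Number(details.lengthSeconds) : null,
    // take the last thumbnail entry, if there is one
    thumbnail: thumbnails.length > 0 ? thumbnails[thumbnails.length - 1].url : null,
  };
}
```

Returning `null` for anything that is absent keeps the crawler from crashing when the page layout changes.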

How to get a specific string from HTML data

 // data = html.data (the HTML string fetched with axios)
 // extractCategory is an illustrative wrapper name so that `return` is valid
 function extractCategory(data) {
   const searchStart = '"category":"'; // start of the string to find
   const searchEnd = '"'; // end of the string to find

   const indexS = data.indexOf(searchStart); // position of searchStart (Number)
   if (indexS < 0) return `Error`; // searchStart does not occur in data

   // drop everything up to and including searchStart
   let content = data.slice(indexS + searchStart.length);

   const indexE = content.indexOf(searchEnd); // position of searchEnd (Number)
   if (indexE < 0) return `Error`; // searchEnd does not occur after searchStart
   content = content.slice(0, indexE); // keep only the text before searchEnd

   // A bare value such as Music is not valid JSON, so return it as-is.
   // Use JSON.parse(content) only when the extracted text is itself JSON.
   return content;
 }
  • In other words, this searches the html.data fetched with axios for the part that starts with "category":" and ends with ", and extracts it as a string. When the extracted string is itself valid JSON, it can then be parsed with JSON.parse.

  • The process is not limited to this example: it can pull a specific string out of any string, provided the start and end markers are chosen so that they match only the intended span. The videoData example below uses a regular expression to parse more precisely.
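
The start/end search described above generalizes to a small helper. The following is a sketch under stated assumptions: `extractBetween` is a hypothetical name, the sample string is made up for illustration, and `null` is returned in place of the `Error` string so that a missing marker cannot be confused with real content.

```javascript
// Return the text between the first occurrence of searchStart and the next
// occurrence of searchEnd, or null when either marker is missing.
function extractBetween(data, searchStart, searchEnd) {
  const indexS = data.indexOf(searchStart);
  if (indexS < 0) return null;

  const from = indexS + searchStart.length;
  const indexE = data.indexOf(searchEnd, from); // search only after the start marker
  if (indexE < 0) return null;

  return data.slice(from, indexE);
}

// Made-up sample data, not real YouTube output
const sample = '{"category":"Music","count":3}';

console.log(extractBetween(sample, '"category":"', '"')); // Music
// The extracted text "3" is valid JSON, so JSON.parse applies here
console.log(JSON.parse(extractBetween(sample, '"count":', '}'))); // 3
```

Searching for the end marker from `from` onwards avoids the intermediate `slice` and keeps both indices relative to the original string.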

The following snippet extracts videoData with a regular expression. It reuses the category example above and changes only the part shown here.

// data = html.data

// (.+?) lazily matches the nonce value, whatever form it takes
const regex = /<script nonce="(.+?)">var ytInitialPlayerResponse =/;

const match = data.match(regex); // null when the pattern does not occur in data
if (match === null) return `Error`;
const searchStart = match[0]; // the full matched text (start of the string to find)

const searchEnd = ';</script><div id="player" class="skeleton flexy">'; // end of the string to find
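
Put together, the regex-based extraction might look like the sketch below. The function name `extractPlayerResponse` and the sample HTML are hypothetical, and returning `null` on failure is a choice made here, not something the original code does. The two markers are the ones shown above, so the sketch stops working whenever YouTube changes that markup.

```javascript
// Extract and parse the ytInitialPlayerResponse object from page HTML.
// Returns null when a marker is missing or the extracted text is not JSON.
function extractPlayerResponse(data) {
  const startRegex = /<script nonce="(.+?)">var ytInitialPlayerResponse =/;
  const searchEnd = ';</script><div id="player" class="skeleton flexy">';

  const match = startRegex.exec(data);
  if (match === null) return null;

  const from = match.index + match[0].length; // first character after the start marker
  const to = data.indexOf(searchEnd, from);
  if (to < 0) return null;

  try {
    return JSON.parse(data.slice(from, to));
  } catch {
    return null; // the extracted span was not valid JSON
  }
}

// Made-up sample page, not real YouTube output
const sampleHtml =
  '<script nonce="abc123">var ytInitialPlayerResponse = ' +
  '{"videoDetails":{"title":"Demo"}}' +
  ';</script><div id="player" class="skeleton flexy">';

const player = extractPlayerResponse(sampleHtml);
console.log(player.videoDetails.title); // Demo
```

Using `match.index` means the matched start marker does not have to be searched for a second time with `indexOf`.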
  • What I learned from two weeks of analyzing YouTube data

Modifying the code yourself lets you work out the page structure and learn regular expressions along the way. Unless it is for official use, you can comfortably build this on your own and crawl faster than other libraries do. That said, if you find it hard to quickly grasp the characteristics of the HTML body data, simply using ytdl-core is better for your mental health.

About

Crawling YouTube video data by JSON parsing
