国产av日韩一区二区三区精品,成人性爱视频在线观看,国产,欧美,日韩,一区,www.成色av久久成人,2222eeee成人天堂

Table of Contents
Detailed explanation of the method of crawling and analyzing web pages with PHP, detailed explanation of php crawling and analysis
Home Backend Development PHP Tutorial Detailed explanation of how to crawl and analyze web pages with PHP, detailed explanation of php crawling and analysis_PHP tutorial

Detailed explanation of how to crawl and analyze web pages with PHP, detailed explanation of php crawling and analysis_PHP tutorial

Jul 12, 2016 am 08:53 AM
php

Detailed explanation of the method of crawling and analyzing web pages with PHP, detailed explanation of php crawling and analysis

This article describes the method of crawling and analyzing web pages with PHP. Share it with everyone for your reference, the details are as follows:

Crawling and analyzing a file is very simple. This tutorial will take you step by step through an example to implement it. Let's get started!

First, we must decide which URL addresses we will crawl. This can be set in a script or passed via $QUERY_STRING. For simplicity, let's set the variable directly in the script.

<&#63;php
$url = 'http://www.php.net';
&#63;>

In the second step, we grab the specified file and store it in an array through the file() function.

<&#63;php
$url = 'http://www.php.net';
$lines_array = file($url);
&#63;>

Okay, now we have the files in the array. However, the text we want to analyze may not all be in one line. To resolve this file, we can simply convert the array $lines_array into a string. We can use the implode(x,y) function to achieve this. If you want to use explode later (array of string variables), it may be better to set x to "|" or "!" or other similar delimiter. But for our purposes, it's better to set x to a space. y is another necessary parameter because it is the array you want to process with implode().

<&#63;php
$url = 'http://www.php.net';
$lines_array = file($url);
$lines_string = implode('', $lines_array);
&#63;>

Now that the crawling work is done, it’s time to analyze. For the purposes of this example, we want to get everything between and . In order to parse out the string, we also need something called a regular expression.

<&#63;php
$url = 'http://www.php.net';
$lines_array = file($url);
$lines_string = implode('', $lines_array);
eregi("<head>(.*)</head>", $lines_string, $head);
&#63;>

Let’s take a look at the code. As you can see, the eregi() function is executed in the following format:

eregi("<head>(.*)</head>", $lines_string, $head);

"(.*)" means everything, which can be interpreted as, "analyze everything between and ". $lines_string is the string we are analyzing, and $head is the array where the analyzed results are stored.

Finally, we can enter the data. Since there is only one instance between and , we can safely assume that there is only one element in the array, and it is the one we want. Let's print it out.

<&#63;php
$url = 'http://www.php.net';
$lines_array = file($url);
$lines_string = implode('', $lines_array); eregi("<head>(.*)</head>", $lines_string, $head);
echo $head[0];
&#63;>

That’s all the code.

<&#63;php
//獲取所有內(nèi)容url保存到文件
function get_index ( $save_file , $prefix = "index_" ){
   $count = 68 ;
   $i = 1 ;
  if ( file_exists ( $save_file )) @ unlink ( $save_file );
   $fp = fopen ( $save_file , "a+" ) or die( "Open " . $save_file . " failed" );
  while( $i < $count ){
     $url = $prefix . $i . ".htm" ;
    echo "Get " . $url . "..." ;
     $url_str = get_content_url ( get_url ( $url ));
    echo " OK/n" ;
     fwrite ( $fp , $url_str );
    ++ $i ;
  }
   fclose ( $fp );
}
//獲取目標(biāo)多媒體對(duì)象
function get_object ( $url_file , $save_file , $split = "|--:**:--|" ){
  if (! file_exists ( $url_file )) die( $url_file . " not exist" );
   $file_arr = file ( $url_file );
  if (! is_array ( $file_arr ) || empty( $file_arr )) die( $url_file . " not content" );
   $url_arr = array_unique ( $file_arr );
  if ( file_exists ( $save_file )) @ unlink ( $save_file );
   $fp = fopen ( $save_file , "a+" ) or die( "Open save file " . $save_file . " failed" );
  foreach( $url_arr as $url ){
    if (empty( $url )) continue;
    echo "Get " . $url . "..." ;
     $html_str = get_url ( $url );
    echo $html_str ;
    echo $url ;
    exit;
     $obj_str = get_content_object ( $html_str );
    echo " OK/n" ;
     fwrite ( $fp , $obj_str );
  }
   fclose ( $fp );
}
//遍歷目錄獲取文件內(nèi)容
function get_dir ( $save_file , $dir ){
   $dp = opendir ( $dir );
  if ( file_exists ( $save_file )) @ unlink ( $save_file );
   $fp = fopen ( $save_file , "a+" ) or die( "Open save file " . $save_file . " failed" );
  while(( $file = readdir ( $dp )) != false ){
    if ( $file != "." && $file != ".." ){
      echo "Read file " . $file . "..." ;
       $file_content = file_get_contents ( $dir . $file );
       $obj_str = get_content_object ( $file_content );
      echo " OK/n" ;
       fwrite ( $fp , $obj_str );
    }
  }
   fclose ( $fp );
}
//獲取指定url內(nèi)容
function get_url ( $url ){
   $reg = '/^http:////[^//].+$/' ;
  if (! preg_match ( $reg , $url )) die( $url . " invalid" );
   $fp = fopen ( $url , "r" ) or die( "Open url: " . $url . " failed." );
  while( $fc = fread ( $fp , 8192 )){
     $content .= $fc ;
  }
   fclose ( $fp );
  if (empty( $content )){
    die( "Get url: " . $url . " content failed." );
  }
  return $content ;
}
//使用socket獲取指定網(wǎng)頁(yè)
function get_content_by_socket ( $url , $host ){
   $fp = fsockopen ( $host , 80 ) or die( "Open " . $url . " failed" );
   $header = "GET /" . $url . " HTTP/1.1/r/n" ;
   $header .= "Accept: */*/r/n" ;
   $header .= "Accept-Language: zh-cn/r/n" ;
   $header .= "Accept-Encoding: gzip, deflate/r/n" ;
   $header .= "User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; Maxthon; InfoPath.1; .NET CLR 2.0.50727)/r/n" ;
   $header .= "Host: " . $host . "/r/n" ;
   $header .= "Connection: Keep-Alive/r/n" ;
   //$header .= "Cookie: cnzz02=2; rtime=1; ltime=1148456424859; cnzz_eid=56601755-/r/n/r/n";
   $header .= "Connection: Close/r/n/r/n" ;
   fwrite ( $fp , $header );
   while (! feof ( $fp )) {
     $contents .= fgets ( $fp , 8192 );
   }
   fclose ( $fp );
   return $contents ;
}
//獲取指定內(nèi)容里的url
function get_content_url ( $host_url , $file_contents ){
   //$reg = '/^(#|<a href="http://lib.csdn.net/base/18" class='replace_word' title="JavaScript知識(shí)庫(kù)" target='_blank' style='color:#df3434; font-weight:bold;'>JavaScript</a>.*&#63;|ftp:////.+|http:////.+|.*&#63;href.*&#63;|play.*&#63;|index.*&#63;|.*&#63;asp)+$/i';
   //$reg = '/^(down.*&#63;/.html|/d+_/d+/.htm.*&#63;)$/i';
   $rex = "/([hH][rR][eE][Ff])/s*=/s*['/"]*([^>'/"/s]+)[/"'>]*/s*/i" ;
   $reg = '/^(down.*&#63;/.html)$/i' ;
   preg_match_all ( $rex , $file_contents , $r );
   $result = "" ; //array();
   foreach( $r as $c ){
    if ( is_array ( $c )){
      foreach( $c as $d ){
        if ( preg_match ( $reg , $d )){ $result .= $host_url . $d . "/n" ; }
      }
    }
  }
  return $result ;
}
//獲取指定內(nèi)容中的多媒體文件
function get_content_object ( $str , $split = "|--:**:--|" ){
   $regx = "/href/s*=/s*['/"]*([^>'/"/s]+)[/"'>]*/s*(.*&#63;<//b>)/i" ;
   preg_match_all ( $regx , $str , $result );
  if ( count ( $result ) == 3 ){
     $result [ 2 ] = str_replace ( "多媒體: " , "" , $result [ 2 ]);
     $result [ 2 ] = str_replace ( " " , "" , $result [ 2 ]);
     $result = $result [ 1 ][ 0 ] . $split . $result [ 2 ][ 0 ] . "/n" ;
  }
  return $result ;
}
&#63;>

Readers who are interested in more PHP-related content can check out the special topics of this site: "Summary of PHP Regular Expression Usage", "Summary of PHP Ajax Skills and Applications", "Summary of PHP Operations and Operator Usage", "PHP Network Summary of Programming Skills", "Introduction Tutorial on PHP Basic Syntax", "Summary of PHP Office Document Operation Skills (Including Word, Excel, Access, PPT)", "Summary of PHP Date and Time Usage", "Introduction Tutorial on PHP Object-Oriented Programming" , "php string (string) usage summary", "php mysql database operation introductory tutorial" and "php common database operation skills summary"

I hope this article will be helpful to everyone in PHP programming.

www.bkjia.comtruehttp: //www.bkjia.com/PHPjc/1123830.htmlTechArticleDetailed explanation of PHP crawling and analysis of web pages, detailed explanation of PHP crawling analysis. This article describes PHP crawling and analysis with examples. web method. Share it with everyone for your reference, the details are as follows: Crawl and analyze...
Statement of this Website
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn

Hot AI Tools

Undress AI Tool

Undress AI Tool

Undress images for free

Undresser.AI Undress

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

AI Clothes Remover

Online AI tool for removing clothes from photos.

Clothoff.io

Clothoff.io

AI clothes remover

Video Face Swap

Video Face Swap

Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Tools

Notepad++7.3.1

Notepad++7.3.1

Easy-to-use and free code editor

SublimeText3 Chinese version

SublimeText3 Chinese version

Chinese version, very easy to use

Zend Studio 13.0.1

Zend Studio 13.0.1

Powerful PHP integrated development environment

Dreamweaver CS6

Dreamweaver CS6

Visual web development tools

SublimeText3 Mac version

SublimeText3 Mac version

God-level code editing software (SublimeText3)

How to access a character in a string by index in PHP How to access a character in a string by index in PHP Jul 12, 2025 am 03:15 AM

In PHP, you can use square brackets or curly braces to obtain string specific index characters, but square brackets are recommended; the index starts from 0, and the access outside the range returns a null value and cannot be assigned a value; mb_substr is required to handle multi-byte characters. For example: $str="hello";echo$str[0]; output h; and Chinese characters such as mb_substr($str,1,1) need to obtain the correct result; in actual applications, the length of the string should be checked before looping, dynamic strings need to be verified for validity, and multilingual projects recommend using multi-byte security functions uniformly.

How Do Generators Work in PHP? How Do Generators Work in PHP? Jul 11, 2025 am 03:12 AM

AgeneratorinPHPisamemory-efficientwaytoiterateoverlargedatasetsbyyieldingvaluesoneatatimeinsteadofreturningthemallatonce.1.Generatorsusetheyieldkeywordtoproducevaluesondemand,reducingmemoryusage.2.Theyareusefulforhandlingbigloops,readinglargefiles,or

How to prevent session hijacking in PHP? How to prevent session hijacking in PHP? Jul 11, 2025 am 03:15 AM

To prevent session hijacking in PHP, the following measures need to be taken: 1. Use HTTPS to encrypt the transmission and set session.cookie_secure=1 in php.ini; 2. Set the security cookie attributes, including httponly, secure and samesite; 3. Call session_regenerate_id(true) when the user logs in or permissions change to change to change the SessionID; 4. Limit the Session life cycle, reasonably configure gc_maxlifetime and record the user's activity time; 5. Prohibit exposing the SessionID to the URL, and set session.use_only

PHP get the first N characters of a string PHP get the first N characters of a string Jul 11, 2025 am 03:17 AM

You can use substr() or mb_substr() to get the first N characters in PHP. The specific steps are as follows: 1. Use substr($string,0,N) to intercept the first N characters, which is suitable for ASCII characters and is simple and efficient; 2. When processing multi-byte characters (such as Chinese), mb_substr($string,0,N,'UTF-8'), and ensure that mbstring extension is enabled; 3. If the string contains HTML or whitespace characters, you should first use strip_tags() to remove the tags and trim() to clean the spaces, and then intercept them to ensure the results are clean.

PHP get the last N characters of a string PHP get the last N characters of a string Jul 11, 2025 am 03:17 AM

There are two main ways to get the last N characters of a string in PHP: 1. Use the substr() function to intercept through the negative starting position, which is suitable for single-byte characters; 2. Use the mb_substr() function to support multilingual and UTF-8 encoding to avoid truncating non-English characters; 3. Optionally determine whether the string length is sufficient to handle boundary situations; 4. It is not recommended to use strrev() substr() combination method because it is not safe and inefficient for multi-byte characters.

How to URL encode a string in PHP with urlencode How to URL encode a string in PHP with urlencode Jul 11, 2025 am 03:22 AM

The urlencode() function is used to encode strings into URL-safe formats, where non-alphanumeric characters (except -, _, and .) are replaced with a percent sign followed by a two-digit hexadecimal number. For example, spaces are converted to signs, exclamation marks are converted to!, and Chinese characters are converted to their UTF-8 encoding form. When using, only the parameter values ??should be encoded, not the entire URL, to avoid damaging the URL structure. For other parts of the URL, such as path segments, the rawurlencode() function should be used, which converts the space to . When processing array parameters, you can use http_build_query() to automatically encode, or manually call urlencode() on each value to ensure safe transfer of data. just

How to set and get session variables in PHP? How to set and get session variables in PHP? Jul 12, 2025 am 03:10 AM

To set and get session variables in PHP, you must first always call session_start() at the top of the script to start the session. 1. When setting session variables, use $_SESSION hyperglobal array to assign values ??to specific keys, such as $_SESSION['username']='john_doe'; it can store strings, numbers, arrays and even objects, but avoid storing too much data to avoid affecting performance. 2. When obtaining session variables, you need to call session_start() first, and then access the $_SESSION array through the key, such as echo$_SESSION['username']; it is recommended to use isset() to check whether the variable exists to avoid errors

How to prevent SQL injection in PHP How to prevent SQL injection in PHP Jul 12, 2025 am 03:02 AM

Key methods to prevent SQL injection in PHP include: 1. Use preprocessing statements (such as PDO or MySQLi) to separate SQL code and data; 2. Turn off simulated preprocessing mode to ensure true preprocessing; 3. Filter and verify user input, such as using is_numeric() and filter_var(); 4. Avoid directly splicing SQL strings and use parameter binding instead; 5. Turn off error display in the production environment and record error logs. These measures comprehensively prevent the risk of SQL injection from mechanisms and details.

See all articles