How to Use Wget With Proxy

Flipnode on Jun 16 2023


Wget, a widely used command-line tool, offers the ability to download files from the web. Developed as part of the GNU Project, it is often included in various Linux distributions.

In this comprehensive guide, we will provide you with a step-by-step walkthrough of the installation process for Wget. Additionally, we will explore how to download files using Wget, both with and without proxies, while addressing various scenarios. Throughout the article, you can expect practical examples that demonstrate the functionality of Wget in action.

What is Wget

Wget, also known as GNU Wget, is a free software package for retrieving files from the web over the HTTP(S) and FTP(S) protocols. It is a versatile and powerful tool, and its capitalization is flexible: both Wget and wget are in common use.

How to install Wget

To obtain Wget, you have multiple installation options depending on your operating system. We recommend utilizing package managers for a more streamlined installation process, although manual downloads are also available.

For Ubuntu/Debian users, open the terminal and execute the following command to install Wget:

sudo apt-get install wget

If you're using CentOS/RHEL, access the terminal and enter the following command:

sudo yum install wget

macOS users are encouraged to leverage the Homebrew package manager. Simply open the terminal and execute this command:

brew install wget

Windows users can opt for the Chocolatey package manager. Run the following command from either the command line or PowerShell:

choco install wget

Finally, to confirm the successful installation of Wget, execute the following command:

wget --version

This command will display the installed version of Wget, along with additional relevant information.

Running Wget

The Wget command can be executed in any command-line interface, and for this tutorial, we'll be using the terminal. To run the Wget command, simply open the terminal and enter the following:

wget -h

Executing this command will display a comprehensive list of options that can be used with Wget. The options are categorized into sections such as Startup, Logging, Download, and more.

Downloading a single file

To download a single file using Wget, simply run the Wget command followed by the complete URL of the file. For instance, let's say you want to download the Wget2 source archive from https://ftp.gnu.org/gnu/wget/wget2-2.0.0.tar.lz. In the terminal, enter the following command:

wget https://ftp.gnu.org/gnu/wget/wget2-2.0.0.tar.lz

When executing this command, Wget will provide detailed information about the file being downloaded. You'll see a progress bar indicating the download completion, information about each step of the download process, the total file size, and its MIME type, among other details.
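
By default, Wget saves the file under the name taken from the URL. If you would rather use a different local filename, the --output-document (-O) switch handles that; for example:

$ wget -O wget2.tar.lz https://ftp.gnu.org/gnu/wget/wget2-2.0.0.tar.lz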

Changing the User-Agent

When connecting to a web service, every program, including web browsers, sends specific headers. One of the essential headers is the User-Agent, which contains a string that identifies the program.

To see how the User-Agent varies between applications, open the same URL, such as https://httpbin.org/user-agent, in a few different browsers and compare the results.

To determine the User-Agent used by Wget, you can request the following URL:

wget https://httpbin.org/user-agent

Executing this command will download a file named "user-agent" without an extension. To view the file's contents, you can use the "cat" command on macOS and Linux, or the "type" command on Windows.

$ cat user-agent
{
"user-agent": "wget/1.21.2"
}

The default User-Agent can be overridden using the --header option. The syntax is as follows:

wget --header "user-agent: DESIRED USER AGENT" URL-OF-FILE

The following example illustrates this further:

$ wget --header "user-agent: Mozilla/5.0 (Macintosh)" https://httpbin.org/user-agent
$ cat user-agent
{
"user-agent": "Mozilla/5.0 (Macintosh)"
}

As demonstrated, the User-Agent has been changed. If you want to send additional headers, you can include more "--header" options, following the format "HeaderName: HeaderValue".
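
For example, the following command sends a custom User-Agent together with an Accept-Language header; the https://httpbin.org/headers endpoint echoes back all request headers, so you can verify that both arrived:

$ wget --header "user-agent: Mozilla/5.0 (Macintosh)" --header "accept-language: en-US" https://httpbin.org/headers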

Downloading multiple files

There are two approaches to download multiple files using Wget. The first method involves providing all the URLs to Wget, separated by spaces. For instance, the following command will download files from three URLs:

$ wget http://example.com/file1.zip http://example.com/file2.zip http://example.com/file3.zip

To try a practical example, you can use the following command:

$ wget https://ftp.gnu.org/gnu/wget/wget2-2.0.0.tar.lz https://ftp.gnu.org/gnu/wget/wget2-1.99.2.tar.lz

With this method, the files will be downloaded one at a time.

While this approach works fine for a small number of files, it can become challenging to manage as the number of files increases. In such cases, the second method becomes more convenient.

The second method involves creating a file that contains all the URLs and utilizing the -i or --input-file option. For example, to read the URLs from the urls.txt file, you can run either of the following commands:

$ wget --input-file=urls.txt
$ wget -i urls.txt
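
For example, a urls.txt file simply lists one URL per line. Reusing the two release archives from the earlier example, it might contain:

https://ftp.gnu.org/gnu/wget/wget2-2.0.0.tar.lz
https://ftp.gnu.org/gnu/wget/wget2-1.99.2.tar.lz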

The advantage of this option is that if any of the URLs do not work, Wget will continue to download the remaining functional URLs.

Extracting links from a webpage

The --input-file option in the Wget command can be extended to extract links from a webpage.

In its basic form, you can provide a URL that contains the links to the files. For instance, if a webpage has links to downloadable content, you can download all the files from that URL by running the following command:

$ wget --input-file=https://ftp.gnu.org/gnu/wget

However, this command alone may not be sufficient without further customization. There are several reasons for this.

By default, Wget does not overwrite existing files. If a download would result in overwriting a file, Wget appends a numerical suffix to the new file's name. For example, if there is already a file named compressed.gif, Wget would create new files with names like compressed.gif, compressed.gif.1, compressed.gif.2, and so on.
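
You can see this behavior by downloading the same file twice; after the second run, the directory will contain something like:

$ wget -q https://ftp.gnu.org/gnu/wget/wget2-2.0.0.tar.lz
$ wget -q https://ftp.gnu.org/gnu/wget/wget2-2.0.0.tar.lz
$ ls wget2*
wget2-2.0.0.tar.lz  wget2-2.0.0.tar.lz.1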

To modify this behavior and skip duplicate files, you can use the --no-clobber switch.

Additionally, you may want to download files recursively by using the --recursive switch.

You can also skip downloading certain files by specifying their extensions as a comma-separated list with the --reject switch.

Similarly, if you only want to download specific files while ignoring everything else, you can use the --accept switch with a list of extensions separated by commas.

Two other useful switches are --no-directories and --no-parent, which prevent the creation of directories and restrict Wget from traversing to a parent directory, respectively.

For example, to download all files with the .sig extension, you can use the following command:

$ wget --recursive --no-parent --no-directories --no-clobber --accept=sig --input-file=https://ftp.gnu.org/gnu/wget
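
The --reject switch works the other way around. As an illustrative variant of the command above, the following would download everything in the listing except the .sig files:

$ wget --recursive --no-parent --no-directories --no-clobber --reject=sig --input-file=https://ftp.gnu.org/gnu/wget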

Using proxies with Wget

There are two methods for integrating proxies with Wget. The first method involves using command-line switches to specify the proxy server and authentication details.

To verify your current IP address before specifying a proxy server, you can run the following commands:

$ wget https://ip.flipnode.io
# Output of wget here
$ cat index.html
11.22.33.44 # Prints actual IP address

The first command fetches the index.html file containing the IP address, and the cat command (or type command for Windows) prints the file contents.

The same result can be achieved by running Wget in quiet mode and writing the document to standard output instead of saving it to a file:

$ wget --quiet --output-document=- https://ip.flipnode.io

A shorter version of the same command is:

$ wget -q -O - https://ip.flipnode.io

To use a proxy that doesn't require authentication, you can use two -e or --execute switches. The first enables the proxy, and the second specifies the proxy server's URL.

The following command enables the proxy and specifies the proxy server's IP (12.13.14.15) and port (1234):

$ wget -q -O- -e use_proxy=yes -e http_proxy=12.13.14.15:1234 https://ip.flipnode.io
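
Note that the http_proxy setting only covers http:// URLs. For an https:// URL such as the one above, Wget reads the separate https_proxy setting, so in practice you would point it at the same (illustrative) proxy address:

$ wget -q -O- -e use_proxy=yes -e https_proxy=12.13.14.15:1234 https://ip.flipnode.io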

In the above example, the proxy doesn't require authentication. If the proxy server requires user authentication, you can set the proxy username using the --proxy-user switch and set the proxy password using the --proxy-password switch:

$ wget -q -O- -e use_proxy=yes -e http_proxy=12.13.14.15:1234 -e https_proxy=12.13.14.15:1234 --proxy-user=your_username --proxy-password=your_password https://ip.flipnode.io

As evident here, the command can become quite long. However, it's useful when you don't want to use a proxy all the time.

The second method is to use the .wgetrc configuration file. This file stores proxy configurations that Wget reads.

The configuration file is located in the user's home directory and is named .wgetrc. Alternatively, you can use any file as the configuration file by using the --config switch.

In the ~/.wgetrc file, you can enter the following lines:

use_proxy = on
http_proxy = http://12.13.14.15:1234
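
As before, https:// URLs are governed by the separate https_proxy setting, so if you download over HTTPS, add a matching line pointing at the same (illustrative) server:

https_proxy = http://12.13.14.15:1234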

If you also need to set user authentication for the proxy, modify the file as follows:

use_proxy = on
http_proxy = http://your_username:[email protected]:1234

From now on, every time Wget runs, it will use the specified proxy.

$ wget -q -O- http://httpbin.org/ip
# Prints IP of the proxy server

Proxies can also be set using environment variables such as http_proxy and https_proxy. However, these variables affect every program in the session that honors them, not just Wget, so they are less suitable when you only want to proxy Wget's traffic.
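
For reference, this is what the environment-variable approach looks like in a POSIX shell, reusing the illustrative proxy address from above:

$ export http_proxy=http://12.13.14.15:1234
$ export https_proxy=http://12.13.14.15:1234
$ wget -q -O- https://ip.flipnode.io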

cURL vs Wget

cURL (often written curl) is another free, open-source command-line tool for downloading files.

While cURL and Wget share many similarities, they also have important distinctions that make them more suitable for specific purposes.

Let's start by highlighting their similarities:

  • Both are open-source command-line tools for downloading content over HTTP(S) and FTP(S).
  • Both can send HTTP GET and POST requests.
  • Both support cookies.
  • Both are designed to run in the background.

Now, let's discuss the features that are unique to cURL:

  • It is available as a library (libcurl), allowing it to be incorporated into other programs.
  • Supports a wider range of protocols beyond HTTP and FTP.
  • Provides better SSL support.
  • Offers more HTTP authentication methods.
  • Includes support for SOCKS proxies.
  • Provides enhanced support for HTTP POST requests.

On the other hand, Wget has its own advantages, such as:

  • Support for recursive downloads: switches such as --recursive and --mirror let you create a local copy of an entire website.
  • The ability to resume interrupted downloads with the --continue (-c) switch (see the example after this list).
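
For instance, resuming a partially downloaded archive takes a single switch. Reusing the release file from earlier:

$ wget --continue https://ftp.gnu.org/gnu/wget/wget2-2.0.0.tar.lz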

Considering the differences mentioned above, you can choose the tool that best suits your specific scenario. For example, if you need recursive downloads, Wget would be a better choice. On the other hand, if you require SOCKS proxy support, cURL would be more suitable.

Neither tool is definitively better than the other. Select the one that aligns with your specific requirements at any given time.

Conclusion

This article provided a comprehensive guide on configuring Wget, covering everything from installation and downloading files to utilizing proxies. Additionally, a comparison between cURL and Wget was presented, highlighting their differences in terms of functionality and suitability for specific use cases.
