Right, so what exactly is web scraping? As the name implies, it's a method of 'scraping' or extracting data from webpages. Anything you can see on the internet with your browser, including this tutorial, can be scraped onto your local hard drive.

For any data analysis, the first step is data acquisition. The internet is a vast repository of all of mankind's history and knowledge, and you have the means of extracting anything you want and doing with that information what you will.

In our tutorial, we'll be using Python and the BeautifulSoup 4 package to get information from a subreddit. We're interested in the datascience subreddit. We want to get the first 1000 posts on the subreddit and export them to a CSV file. For each post, we want to know who posted it, as well as how many likes and comments it has. Along the way, we'll cover:

- Analyzing web pages in the browser for information
- Extracting information from raw HTML with BeautifulSoup

Note: We'll be using the older version of Reddit's website because it is more lightweight to load, and hence less strenuous on your machine.

This tutorial assumes you know the following things:

- Running Python scripts on your computer

You can learn the skills above in DataCamp's Python beginner course. That being said, the concepts used here are very minimal, and you can get away with very little know-how of Python. You can find a finished working example of the script we will write here.

Now that that's done with, we can move onto the first part of making our web scraper. In fact, it's the first part of writing any Python script: imports. In our scraper, we will be using the following packages:

- beautifulsoup4

You can install these packages with pip, of course, like so:

pip install package_name

After you're done downloading the packages, go ahead and import them into your code.
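A minimal sketch of the imports follows; only beautifulsoup4 and the built-in csv module are named in this tutorial, so requests is an assumption for downloading the pages:

```python
# Minimal sketch of the imports. requests is an assumption here; only
# beautifulsoup4 and the built-in csv module are named in the tutorial.
import csv                     # built-in, used later to write the CSV output

import requests                # assumed: downloads the raw HTML over HTTP
from bs4 import BeautifulSoup  # installed as beautifulsoup4, imported as bs4
```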
We will be using Python's built-in csv module to write our results to a CSV file. You may have noticed something quirky in the snippet above. That is, we downloaded a package called beautifulsoup4, but we imported from a module called bs4. This is legal in Python, and though it is generally frowned upon, it's not exactly against the law.

So we have our environment set up and ready. Next, we need the url for the webpage that we want to scrape. For our tutorial, we're using Reddit's 'datascience' subreddit.

Before we start writing the script, there's some field work we need to do. Open a web browser, and go to the subreddit in question. Note that we'll be using the older version of the subreddit for our scraper. The newer version hides away some crucial information in the underbelly of the webpage. It's possible to extract this information from the new site as well, but for the sake of simplicity, we'll be using the older version, which lays out everything bare.

Upon opening the link, you're met with a flux of information overload. What exactly do we need in all of this? Well, upon probing all of the links, you'll find that Reddit posts are of two types: inbound and outbound. Inbound links point towards content on Reddit itself, and outbound links are the exact opposite. This is important because we only want the inbound links: posts that contain text written by users, not just links to other websites.

So, we know what we want; how do we go about extracting it? If you look at the title of the posts, you can see that it's followed by some text in brackets. The posts that we're interested in are followed by '(self.datascience)'. Logically, we can assume that 'self' refers to the Reddit root directory, and '.datascience' refers to the subreddit.

Great, so we have a way of identifying which posts are inbound and which are outbound. We now need to identify them in the DOM structure.
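As a preview of that step, here's a sketch that fetches the listing and keeps only the inbound posts. The old-Reddit URL, the custom User-Agent (Reddit throttles the default one), and the 'thing', 'domain' and 'title' class names are all assumptions about the page's markup, not code from the original:

```python
# Sketch of the identification step, under the assumptions above.
url = 'https://old.reddit.com/r/datascience/'

# Reddit tends to throttle the default requests User-Agent, so set our own.
headers = {'User-Agent': 'datascience-scraper-tutorial'}
response = requests.get(url, headers=headers)

soup = BeautifulSoup(response.text, 'html.parser')

# Assumed markup: each post sits in a div with class 'thing', and the
# bracketed text after the title sits in a span with class 'domain'.
for post in soup.find_all('div', class_='thing'):
    domain = post.find('span', class_='domain')
    if domain is not None and 'self.datascience' in domain.get_text():
        title = post.find('a', class_='title')
        print(title.get_text())
```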
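And, looking ahead to the CSV goal stated earlier, a hedged sketch of the export step. The data-author, data-score and data-comments-count attributes are assumed names on the post container, and posts.csv is just an illustrative filename:

```python
# Hypothetical export step: pull each inbound post's author, score and
# comment count from data attributes on the post container (assumed
# attribute names), then write everything out with the csv module.
rows = []
for post in soup.find_all('div', class_='thing'):
    domain = post.find('span', class_='domain')
    if domain is not None and 'self.datascience' in domain.get_text():
        rows.append([
            post.get('data-author'),
            post.get('data-score'),
            post.get('data-comments-count'),
        ])

with open('posts.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['author', 'likes', 'comments'])  # header row
    writer.writerows(rows)
```

Reaching the full 1000 posts would also mean following the listing's 'next' link from page to page, which this sketch leaves out.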