Scraping Jiji Ideas (1445 Views)
Scraping Jiji Ideas by Devaro: 6:32pm On Jan 05, 2023 |
Does anyone have ideas on how to scrape data like name, business name and phone number from Jiji? |
Re: Scraping Jiji Ideas by YoungCabal: 6:54pm On Jan 05, 2023 |
Devaro: Why do you want to scrape data from the site? If you are willing to pay for my time, I can cook up a solution for you. |
Re: Scraping Jiji Ideas by LittleBigDick(m): 5:09am On Jan 06, 2023 |
Beautiful Soup can do that for you. 2 Likes |
Re: Scraping Jiji Ideas by Devaro: 5:49am On Jan 06, 2023 |
LittleBigDick: Have any resources? |
Re: Scraping Jiji Ideas by chim14(m): 7:24am On Jan 06, 2023 |
YoungCabal: Cook up Beautiful Soup 1 Like |
Re: Scraping Jiji Ideas by YoungCabal: 8:14am On Jan 06, 2023 |
chim14: Just because there is a Python library that eases the job a little doesn't mean my time should always be free, don't you agree? OP is clearly scraping the site for his personal business or intends to resell the data, so he should be willing to foot the bill if he really wants a professional job. 3 Likes |
Re: Scraping Jiji Ideas by chim14(m): 10:28pm On Jan 06, 2023 |
YoungCabal: Of course you can't do it for free now, you have to bill him well. I was just playing on words. 1 Like |
Re: Scraping Jiji Ideas by Felixitie(m): 11:36pm On Jan 06, 2023 |
I have done a project on it before. Beautiful Soup alone will not handle the Jiji site because of the website's infinite-scrolling pattern. You have to use Selenium + bs4, with Selenium rendering the JavaScript before you scrape. |
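A minimal sketch of the "scroll until nothing new loads" step described above, not the script from that project. It assumes a Selenium-style driver, meaning any object with an `execute_script` method; the function name, the pause and the round limit are made up for illustration.

```python
import time


def scroll_until_stable(driver, pause=1.0, max_rounds=50):
    """Scroll to the bottom repeatedly until the page height stops growing.

    `driver` is anything exposing Selenium's `execute_script(script)` method.
    Returns the number of scrolls that actually loaded new content.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    loaded = 0
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page's JavaScript time to append more items
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # height unchanged, so assume the list is exhausted
        last_height = new_height
        loaded += 1
    return loaded


# Typical use with real Selenium and bs4 (both must be installed separately):
#   from selenium import webdriver
#   from bs4 import BeautifulSoup
#   driver = webdriver.Chrome()
#   driver.get(category_url)            # category_url is yours to supply
#   scroll_until_stable(driver)
#   soup = BeautifulSoup(driver.page_source, "html.parser")
```

Once the page has stopped growing, `driver.page_source` holds the rendered HTML, which is the point where bs4 becomes useful. Data hidden behind a click would still need separate handling.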
Re: Scraping Jiji Ideas by YoungCabal: 3:26am On Jan 07, 2023 |
Felixitie: It's not even the infinite scrolling alone; you have to click on some data to unhide it. Beautiful Soup is not the right tool. Even with Selenium it won't be an easy task, because you either go category by category or build a mini JS-enabled crawler to index the site. I laughed when I saw someone comment that he can show OP how to do it with Beautiful Soup. |
Re: Scraping Jiji Ideas by Felixitie(m): 3:40pm On Jan 07, 2023 |
YoungCabal: Impossible for bs4 alone, but Selenium will work for sure, including the clicking of buttons etc. Depending on what you want to scrape from the site, it's not that complex. |
Re: Scraping Jiji Ideas by Nobody: 12:05am On Jan 08, 2023 |
Try Puppeteer or Nightmare using Node.js |
Re: Scraping Jiji Ideas by YoungCabal: 8:19am On Jan 08, 2023 |
Felixitie: If it's not that complex, why don't you just paste the source code here for him, or the full instructions on how to do it? Admit it, it's something that demands quality attention, not just something you can rush through. |
Re: Scraping Jiji Ideas by bedfordng(m): 10:21am On Jan 08, 2023 |
YoungCabal: Jiji is not even as complex as most flight-listing or betting websites. Selenium can get the job done with ease. Playwright is also good for the job. At least they have mentioned lots of tooling he can use. It is left for him to learn to use it regardless. As for pasting source code or a script for the OP, he needs to pay for the job whether it is complex or not. |
Re: Scraping Jiji Ideas by YoungCabal: 12:18pm On Jan 08, 2023 |
bedfordng: You get my point! OP needs to pay for the job. Whether it is complex or not, the time the developer spent acquiring the skill demands befitting payment. If we keep emphasizing that it is simple, OP will want to underpay for the job or demand it for free. That's why you should never tag any job simple when you bid; it's like demarketing yourself. Just highlight your experience and let them decide if they want it or not. |
Re: Scraping Jiji Ideas by Felixitie(m): 1:37pm On Jan 08, 2023 |
YoungCabal: Nigga calm down, just tell me you need it. If it demands quality attention then it will not be free,otherwise he should do a personal search and learn how to do it if he can't pay for it. Besides, do you think the script is going to work for all the pages in jiji.. Abeg move. |
Re: Scraping Jiji Ideas by bedfordng(m): 2:05pm On Jan 08, 2023 |
YoungCabal: Yeah, I get the point. Nice reasoning. This is also why tools were mentioned for OP to try it for himself. |
Re: Scraping Jiji Ideas by YoungCabal: 5:46pm On Jan 08, 2023 |
Felixitie: Lol! We are cool, man. Sure, it can work on every page; it depends on how much time you are willing to invest in coding it. There are Selenium libraries for some languages which you can integrate with a crawler you build, and you can use regex pattern matching to determine which page is which. That's why I was against tagging it simple as you did, since neither of us knows OP's full intention. |
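The "regex pattern matching to determine which page is which" idea could look roughly like the sketch below. The URL shapes are invented for illustration and are not taken from Jiji; anyone trying this would have to inspect the real site and replace the patterns.

```python
import re
from urllib.parse import urlsplit

# Hypothetical URL shapes, made up for this example only.
PAGE_PATTERNS = [
    ("listing", re.compile(r"/[a-z-]+/[a-z0-9-]+-\d+\.html")),
    ("seller", re.compile(r"/seller/\d+")),
    ("category", re.compile(r"/[a-z-]+/?")),
]


def classify(url):
    """Return the page kind for a URL, or "unknown" if no pattern fits."""
    path = urlsplit(url).path  # drop scheme, host and query string
    for kind, pattern in PAGE_PATTERNS:
        if pattern.fullmatch(path):
            return kind
    return "unknown"


if __name__ == "__main__":
    for link in ["/vehicles", "/vehicles/toyota-corolla-12345.html", "/seller/42"]:
        print(link, "->", classify(link))
```

A crawler can call `classify` on every link it discovers and hand each page to the scraping routine written for that kind, skipping anything that comes back as "unknown".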
Re: Scraping Jiji Ideas by Felixitie(m): 7:40pm On Jan 08, 2023 |
YoungCabal: I feel you bro. The script I developed won't work for all the pages because it was for a personal project. I said simple because I have seen many websites that are tougher to scrape than the easier Jiji type. Thanks brother. |
Re: Scraping Jiji Ideas by nnuReader: 11:39am On Feb 28, 2023 |
Here is a Python script that scrapes all the data, including phone numbers, in less than an hour. The script works by fetching data directly from the Jiji API endpoint and paginating through it:

https:///api_web/v1/listing?slug=X&webp=true&page=Y

where X is the category (vehicles, real-estate...) you want to scrape and Y is the page number (23 products are returned per page). Change the slug when you are done scraping a category, and for every category keep increasing the page while you save the info and check for duplicates (a vendor can appear multiple times because of multiple product uploads).

This approach is miles faster than using tools like Puppeteer, Selenium or Beautiful Soup because you're not loading irrelevant files like CSS, JS, images, HTML...

You can run the script in CMD like this:

python3 scrape.py vehicles Vehicles

The above scrapes the vehicles category.

python3 scrape.py real-estate Properties

The above scrapes real estate. If you need more info, mail me at hello@feyitech.com

The Script:

import sys
import time

import requests

# common and coded_addesses are the author's local modules; they are not included in this post.
from common import get_profile_id_list_and_profiles, update_profiles, dict_to_profile_row
from coded_addesses import address_for_fresh, address_for_new, address_for_slider

S = requests.Session()

SCRAPE_TYPES = {"fresh": "fresh", "new": "new", "slider": "slider"}

ACCEPTED_TYPES = [
    'vehicles', 'real-estate', 'mobile-phones-tablets', 'electronics',
    'home-garden', 'health-and-beauty', 'fashion-and-beauty',
    'hobbies-art-sport', 'seeking-work-cvs', 'services', 'babies-and-kids',
    'animals-and-pets', 'agriculture-and-foodstuff',
    'office-and-commercial-equipment-tools', 'repair-and-construction'
]

if len(sys.argv) < 2 or sys.argv[1] not in ACCEPTED_TYPES:
    print('No category specified.\n\nExample: "python3 scrape.py vehicles"\n\n'
          'Accepted categories are: %s' % ", ".join(ACCEPTED_TYPES))
else:
    category = sys.argv[1]  # named `category` so the builtin `type` is not shadowed
    name = category
    if len(sys.argv) > 2:
        name = sys.argv[2]

    profile_id_list_and_profiles = get_profile_id_list_and_profiles()
    # print(profile_id_list_and_profiles[1])
    if profile_id_list_and_profiles is not None:

        def get_address(page):
            # The host part of the URL is missing in the post as published.
            return "https:///api_web/v1/listing?slug=%s&webp=true&page=%d" % (category, page)

        profile_id_list = profile_id_list_and_profiles[0]
        keep_running = True
        total_pages = 0
        page = 1
        total_new_profiles = 0

        while keep_running:
            res = S.get(get_address(page))
            # Decode the body once, and only when the request succeeded.
            body = res.json() if res.status_code == 200 else None
            if body is not None and body["status"] == "ok":
                new_profiles = []
                data = body["adverts_list"]
                adverts = data["adverts"]
                total = len(adverts)
                counts = data["count"]
                total_pages = data["total_pages"]
                print("Count: %d | Size: %d | Page: %d" % (counts, total, page))
                for p in adverts:
                    # Keep only vendors that have not been saved yet.
                    if p["user_id"] not in profile_id_list:
                        new_profiles.append(p)
                        profile_id_list.append(p["user_id"])
                        # print("phone:", p["id"])
                update_profiles(new_profiles)
                total_new_profiles = total_new_profiles + len(new_profiles)
                page = page + 1
            else:
                print("Error: %d\n" % res.status_code)
                break  # stop instead of requesting the same failing page forever

            # `page` now points at the next page to fetch, so stop only once
            # it has moved past the last page.
            if page > total_pages:
                keep_running = False
            else:
                time.sleep(1.5)

        print("TotalNewEntry: %d" % total_new_profiles)
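For anyone who wants to try the paginate-and-deduplicate logic without the author's `common` module, here is a small sketch of just that part. The JSON field names (`status`, `adverts_list`, `adverts`, `total_pages`, `user_id`) are the ones used in the script above; the function name, the `fetch_page` callback and the `max_pages` cap are assumptions added for illustration.

```python
def collect_unique_adverts(fetch_page, seen_user_ids=None, max_pages=1000):
    """Walk pages 1..total_pages and keep one advert per unseen user_id.

    `fetch_page(page)` is a caller-supplied function (hypothetical here) that
    returns the decoded JSON body for that page, or None on failure.
    """
    seen = set(seen_user_ids or ())
    fresh = []
    page, total_pages = 1, 1
    while page <= total_pages and page <= max_pages:
        body = fetch_page(page)
        if body is None or body.get("status") != "ok":
            break  # give up on a failing page instead of looping on it
        data = body["adverts_list"]
        total_pages = data["total_pages"]  # learned from the response itself
        for advert in data["adverts"]:
            if advert["user_id"] not in seen:
                seen.add(advert["user_id"])  # a vendor can post many adverts
                fresh.append(advert)
        page += 1
    return fresh


if __name__ == "__main__":
    # Two fake pages standing in for API responses.
    fake = {
        1: {"status": "ok", "adverts_list": {"total_pages": 2, "adverts": [
            {"user_id": 1}, {"user_id": 2}]}},
        2: {"status": "ok", "adverts_list": {"total_pages": 2, "adverts": [
            {"user_id": 2}, {"user_id": 3}]}},
    }
    print([a["user_id"] for a in collect_unique_adverts(fake.get)])
```

With `requests`, `fetch_page` would be a small wrapper around `session.get(url).json()` plus a short sleep between calls. The URL host is missing from the post, so that part is left to the reader.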
Re: Scraping Jiji Ideas by LikeAking: 1:15pm On Feb 28, 2023 |
Please don't suggest a process you haven't used. You guys are the ones killing tech in Nigeria. All your solutions will not work for Jiji... You all should calm down. Scraping data on Jiji is not a small task. Don't make it sound small; if it's easy, do it for OP. |