
Where to find datasets?
If you want to practice your skills as a data scientist, you need data. There are some excellent free, publicly available sources of data:
- Kaggle datasets, covering medical, financial, business, and many other fields
- Socrata.com, which also hosts data from multiple fields
- Quandl - financial, economic, and alternative data: https://www.quandl.com/
- The UCI Machine Learning Repository contains references to classical datasets used in the literature. This is particularly helpful if you would like to compare your results with published results obtained using more classical statistical methods: https://archive.ics.uci.edu/ml/datasets.php
- Google now features a (beta) Google Dataset Search: https://toolbox.google.com/datasetsearch
- If you are at an academic institution, check (possibly with your institution's library) whether your institution provides access to Wharton Research Data Services (WRDS): https://wrds-web.wharton.upenn.edu/wrds/
- World development data on World Bank Open Data: https://data.worldbank.org/
- Eurostat: https://ec.europa.eu/eurostat/data/database
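Some of the sources above can also be queried programmatically rather than downloaded by hand. As a rough sketch, the snippet below builds a request against the World Bank Open Data API and converts the JSON response into a year-to-value mapping. The helper names, the country code "BR", and the indicator code "NY.GDP.MKTP.CD" (GDP in current US$) are illustrative choices, not part of the list above, and the parser assumes annual data returned in the API's usual two-element layout (paging metadata followed by a list of records).

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address of the World Bank API (version 2).
API_ROOT = "https://api.worldbank.org/v2"


def build_indicator_url(country, indicator, start_year, end_year):
    """Build the request URL for one indicator and one country."""
    params = {
        "format": "json",                      # ask for JSON instead of XML
        "date": f"{start_year}:{end_year}",    # inclusive year range
        "per_page": 1000,                      # large page to avoid paging
    }
    query = urlencode(params, safe=":")        # keep the ':' in the date range
    return f"{API_ROOT}/country/{country}/indicator/{indicator}?{query}"


def parse_observations(payload):
    """Turn a decoded JSON response into a {year: value} dict.

    Assumes annual data in the layout [paging metadata, list of records].
    Records whose value is missing (null) are skipped.
    """
    if not isinstance(payload, list) or len(payload) < 2 or payload[1] is None:
        return {}
    return {
        int(record["date"]): record["value"]
        for record in payload[1]
        if record["value"] is not None
    }


def fetch_indicator(country, indicator, start_year, end_year):
    """Download one indicator series (requires network access)."""
    url = build_indicator_url(country, indicator, start_year, end_year)
    with urlopen(url, timeout=30) as response:
        return parse_observations(json.load(response))


if __name__ == "__main__":
    # Hypothetical example query: GDP of Brazil, 2015 to 2020.
    gdp = fetch_indicator("BR", "NY.GDP.MKTP.CD", 2015, 2020)
    for year in sorted(gdp):
        print(year, gdp[year])
```

Keeping the URL construction and the parsing separate from the download makes both easy to check without network access. Other providers on the list have their own APIs and terms of use, so consult their documentation before adapting this pattern.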
If you are into high-frequency data analysis, there are some commercially available sources:
- LOBSTER - high-frequency, easy-to-use and latest limit order book data for your research: https://lobsterdata.com/
- Cross-asset class high-frequency financial data (quotes and trades) from tickdata.com: https://www.tickdata.com/
- FINRA TRACE data on corporate bonds, real-time and historical: http://www.finra.org/industry/trace-data-licensing (note that FINRA TRACE data is available on WRDS)
- PortaraCQG: supplies Daily, Intraday, Tick and Level 1 Futures, Forex, Cash Commodities, Fixed Income, Calendar Spreads, ETFs