A programmer's notes, about everything and nothing. Though probably mostly about professional matters.

2013-08-22

A Bit of Gristle

Dug up from my stash: DataGristle, an interesting toolkit for analyzing data in CSV files, written in Python:

Usage Scenarios
  • Operational Diagnostics 1 - a marketing sentiment analysis company uses it for quickly discovering problems in spreadsheets sent to them by their customers. The spreadsheets were often found to be subtly malformed, or had invalid values that could be difficult to find. gristle_determinator was used to quickly sanity-check and find outliers.
  • Operational Diagnostics 2 - a large data warehousing team uses it whenever their bulk load process breaks on invalid data. Their database's bulkloader does not provide much info in this kind of case, so they use gristle_freaker to quickly size up the nature of the data in a few problematic columns, gristle_viewer to examine individual records, and gristle_determinator to sanity-check the file structure. This has sped up the problem determination and resolution steps enormously.
  • Feed Analysis - a large data warehousing team uses it whenever they have new potential data sources to analyze. The gristle_determinator quickly finds data quality issues, and identifies characteristics useful for data modeling. On some large complex feeds, it can sometimes perform 8-20 hours of initial analysis in just five minutes.
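
To make "sanity-check the file structure" a bit more concrete, here is a minimal sketch of one such check in plain Python: finding rows whose field count differs from the most common one. This is not DataGristle's code and makes no claim about how gristle_determinator works internally; the function name `check_structure` and the sample data are made up for illustration, and the delimiter is assumed to be known in advance.

```python
import csv
import io
from collections import Counter


def check_structure(text, delimiter=","):
    """Return (expected_field_count, bad_rows) for CSV text.

    bad_rows lists (row_number, field_count) for every row whose field
    count differs from the most common one. Row numbers are 1-based and
    include the header row.
    """
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    counts = Counter(len(row) for row in rows)
    expected = counts.most_common(1)[0][0]
    bad_rows = [
        (number, len(row))
        for number, row in enumerate(rows, start=1)
        if len(row) != expected
    ]
    return expected, bad_rows


# Hypothetical sample: the third row is missing a field.
sample = "id,name,score\n1,alice,10\n2,bob\n3,carol,30\n"
expected, bad_rows = check_structure(sample)
print(expected, bad_rows)
```

A check like this only catches ragged rows; invalid values inside well-formed rows need per-column profiling.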

What's Included
  • gristle_determinator - Analyses csv files and prints information about the file structure and each field within it.
  • gristle_slicer - Selects rows and columns out of a csv file.
  • gristle_freaker - Creates frequency distributions of one or more columns of a csv file.
  • gristle_viewer - Displays a single record from a csv file organized in two columns, with labels to the left and values to the right.
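
For readers who have not used such tools, the ideas behind the last two items (a frequency distribution of a column, and a single record shown as labels on the left with values on the right) can be sketched with the Python standard library alone. This is an illustration, not DataGristle's implementation or its command-line interface; the function names `column_frequencies` and `format_record` and the sample data are hypothetical.

```python
import csv
import io
from collections import Counter


def column_frequencies(text, column, delimiter=","):
    """Count how often each value occurs in one named column, most frequent first."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return Counter(row[column] for row in reader).most_common()


def format_record(text, record_number, delimiter=","):
    """Render one record (1-based, header excluded) as 'label  value' lines."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    for number, row in enumerate(reader, start=1):
        if number == record_number:
            width = max(len(name) for name in row)
            return "\n".join(
                f"{name.ljust(width)}  {value}" for name, value in row.items()
            )
    raise IndexError(f"record {record_number} not found")


# Hypothetical sample data.
sample = "id,city,status\n1,Riga,ok\n2,Vilnius,ok\n3,Riga,error\n"
print(column_frequencies(sample, "city"))
print(format_record(sample, 2))
```

Both helpers read the whole input through `csv.DictReader`, so they assume the first row is a header.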



The author of DataGristle works at IBM in the data warehouse field, if that tells you anything.

original post http://vasnake.blogspot.com/2013/08/blog-post_22.html
